Google accused of using novices to fact-check Gemini’s AI answers

There’s no arguing that AI still has its questionable moments, but one would hope that at least its evaluations would be accurate. Last week, however, Google reportedly instructed contract workers evaluating Gemini not to skip any prompts, regardless of their expertise, TechCrunch reported based on internal guidance it reviewed. Google shared a preview of Gemini 2.0 earlier this month.

Specifically, Google reportedly instructed GlobalLogic, an outsourcing firm whose contractors evaluate AI-generated output, not to let reviewers skip prompts that fall outside their expertise. Previously, contractors could choose to skip any prompt that was far from their area, such as asking a doctor to evaluate a question about law.

The previous guidelines said, “If you do not have significant expertise (e.g. coding, math) to rate this prompt, please skip this task.”

Now, contractors are reportedly instructed, “You should not omit prompts that require special domain knowledge” and that they should “rate the parts of the prompt that you understand” while adding a note that the topic is outside their area of knowledge. Apparently, prompts can now only be skipped if a large portion of the information is missing or if they contain harmful content that requires special consent forms to evaluate.

One contractor pushed back on the changes, saying, “I thought the purpose of omitting was to increase accuracy by letting someone better do it?”

Shortly after this article was first published, Google provided the following statement to Engadget: “Raters perform a variety of functions across many different Google products and platforms. They provide valuable feedback not just on the content of answers, but also on style, format, and other factors. The ratings they provide don’t directly affect our algorithms, but when taken as a whole, are a helpful data point to help us measure how well our systems work.”

The Google spokesperson also said that the new language shouldn’t change Gemini’s accuracy, since raters are specifically asked to rate the parts of the prompts they understand.

This can provide feedback on things like formatting issues, even if the rater doesn’t have specific expertise in the topic. The company also pointed to the FACTS Grounding benchmark released this week, which can check LLM responses to ensure “they are not only factually accurate with respect to the inputs provided, but also sufficiently detailed to satisfactorily answer user questions.”
