10 votes

Google's Gemini 1.5 Pro is a new, more efficient AI model

1 comment

  1. skybrian
    (edited)

    From the article:

    Gemini 1.5 Pro can also handle up to one million tokens, or the units of data AI models can process in a single request. Google says Gemini 1.5 Pro can process over 700,000 words, an hour of video, 11 hours of audio and codebases with over 30,000 lines of code. The company says it’s even “successfully tested” a version that supports up to 10 million tokens.

    ...

    Google says Gemini 1.5 Pro can reason about various details from the 402-page Apollo 11 moon mission transcripts. In addition, it can analyze plot points and events from an uploaded 44-minute silent film starring Buster Keaton. “As 1.5 Pro’s long context window is the first of its kind among large-scale models, we’re continuously developing new evaluations and benchmarks for testing its novel capabilities,” Hassabis wrote.
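
    The numbers in the first paragraph imply a handy rule of thumb: one million tokens ≈ 700,000 words works out to about 1.4 tokens per word. A back-of-the-envelope sketch in Python (the ratio is derived from the article's figures, not from Google's actual tokenizer, and real token counts vary by text):

    ```python
    # Back-of-the-envelope token math from the article's figures:
    # 1,000,000 tokens ~= 700,000 words => ~1.43 tokens per word on average.
    # This ratio comes from the quoted numbers, not from Google's tokenizer.
    TOKENS_PER_WORD = 1_000_000 / 700_000

    def estimate_tokens(word_count: int) -> int:
        """Rough token estimate for a plain-text word count."""
        return round(word_count * TOKENS_PER_WORD)

    print(estimate_tokens(700_000))  # 1000000 -- the quoted context ceiling
    ```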

    From the technical report:

    Finally, we qualitatively showcase the in-context learning abilities of Gemini 1.5 Pro enabled by very long context: for example, learning to translate a new language from a single set of linguistic documentation. With only instructional materials (500 pages of linguistic documentation, a dictionary, and ≈ 400 parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a language spoken by fewer than 200 speakers in western New Guinea in the east of Indonesian Papua, and therefore with almost no online presence. Moreover, we find that the quality of its translations is comparable to that of a person who has learned from the same materials.
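
    In API terms, "all provided in context" just means concatenating the materials into a single request. A minimal sketch of that setup, assuming the google-generativeai Python SDK; the model name and file paths here are placeholders, and the Kalamang materials described above aren't included:

    ```python
    # Minimal sketch of long-context in-context learning for translation,
    # assuming the google-generativeai SDK. The model name and file paths
    # are placeholders; the instructional materials are not bundled here.
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-pro")

    # All instructional materials go into one prompt (~500 pages of grammar,
    # a dictionary, and ~400 parallel sentences fit in the 1M-token window).
    grammar = open("kalamang_grammar.txt", encoding="utf-8").read()
    dictionary = open("kalamang_dictionary.txt", encoding="utf-8").read()
    parallel = open("kalamang_parallel_sentences.txt", encoding="utf-8").read()

    prompt = (
        f"{grammar}\n\n{dictionary}\n\n{parallel}\n\n"
        "Using only the materials above, translate this sentence from "
        "English to Kalamang: Where is the house?"
    )

    print(model.generate_content(prompt).text)
    ```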

    Footnote 12 talks about the kinds of errors made during translation:

    This is not to say that the task is solved; both the human and Gemini 1.5 Pro make avoidable errors, though typically of different kinds. The human errors tend to be retrieval failures, where they pick a suboptimal phrase because they could not find the ideal reference (because rereading the entire set of materials for each sentence is infeasible for a human). The model failures tend to be inconsistent application of rules, like that the word “se” is pronounced “he” after a vowel (this alternation is described in the phonology section of the grammar and reflected in the additional parallel sentence data, but the model may be confused by the fact that the underlying “se” form is used as the gloss throughout the examples within the grammar), or lack of reflection, like that the word “kabor”, although it is defined as “to be full” in the dictionary, is only used for stomachs/hunger in all examples of its use.
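
    The "se"/"he" failure is easier to see concretely: it's a surface alternation conditioned on the preceding sound, the kind of rule that's trivial to apply mechanically but that the model applies inconsistently. A toy illustration, reading "after a vowel" as "after a vowel-final word"; the example words other than "se"/"he" are invented:

    ```python
    # Toy illustration of the alternation described in the footnote:
    # the word "se" surfaces as "he" when the previous word ends in a vowel.
    # Words other than "se"/"he" are invented for illustration.
    VOWELS = set("aeiou")

    def apply_se_alternation(words: list[str]) -> list[str]:
        out = []
        for w in words:
            if w == "se" and out and out[-1][-1] in VOWELS:
                out.append("he")  # surface form after a vowel-final word
            else:
                out.append(w)     # underlying form everywhere else
        return out

    print(apply_se_alternation(["ma", "se"]))   # ['ma', 'he']
    print(apply_se_alternation(["bot", "se"]))  # ['bot', 'se']
    ```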

    They tested it by asking 100 questions about Les Misérables, a 1,462-page book, and compared it to Claude 2.1 (see Table 4). Since it's a famous book, the model actually does pretty well without the text (75 questions answered correctly, according to human evaluation), but this goes up to 80 with access to the book.
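
    The methodology is simple enough to sketch as a harness: ask the same questions with and without the book in the prompt, and have humans grade both sets of answers. Everything below is hypothetical; ask_model and grade_by_human are stand-ins for the model call and the human evaluation step:

    ```python
    # Hypothetical harness for the Les Misérables eval described above.
    # ask_model and grade_by_human are stand-ins, not real APIs.
    def evaluate(questions, book_text, ask_model, grade_by_human):
        scores = {"no_context": 0, "with_book": 0}
        for q in questions:
            bare = ask_model(q)                          # parametric knowledge only
            with_book = ask_model(q, context=book_text)  # full text in the prompt
            scores["no_context"] += grade_by_human(q, bare)      # 1 if correct
            scores["with_book"] += grade_by_human(q, with_book)
        return scores

    # Per the report's Table 4 (human evaluation): roughly 75/100 correct
    # without the text, 80/100 with it.
    ```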

    They also say that testing on public benchmarks inflates results because of data contamination, since answers have leaked onto the Internet. They recommend constructing a private benchmark:

    We invite researchers assessing coding abilities of these models head-to-head to always maintain a small set of truly held-out test functions that are written in-house, thereby minimizing the risk of leakage. The Natural2Code benchmark, which we announced and used in the evaluation of Gemini 1.0 series of models, was created to fill this gap. It follows the exact same format of HumanEval but with a different set of prompts and tests.
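
    For reference, HumanEval items (per the openai/human-eval repo) are records with a task_id, a prompt holding the function signature and docstring, an entry_point, a canonical_solution, and a hidden test. A held-out, in-house item in that format might look like the following; the function and tests are invented, since Natural2Code's prompts aren't public:

    ```python
    # One invented item in the HumanEval record format (task_id, prompt,
    # entry_point, canonical_solution, test). Not from Natural2Code.
    item = {
        "task_id": "HeldOut/0",
        "prompt": (
            "def running_max(xs: list) -> list:\n"
            '    """Return a list where element i is the max of xs[:i+1]."""\n'
        ),
        "entry_point": "running_max",
        "canonical_solution": (
            "    out, cur = [], float('-inf')\n"
            "    for x in xs:\n"
            "        cur = max(cur, x)\n"
            "        out.append(cur)\n"
            "    return out\n"
        ),
        "test": (
            "def check(candidate):\n"
            "    assert candidate([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]\n"
            "    assert candidate([]) == []\n"
        ),
    }
    ```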

    1 vote