28 votes

Google releases Gemma 4

18 comments

  1. [9]
    pete_the_paper_boat
    (edited)
    Link
    Gemma 4 delivers an unprecedented level of intelligence-per-parameter.

    ... right, well anyway, seems to perform quite well on benchmarks, especially for the size.

    Though personally, I'm more interested in the memory constrained versions and the native audio input is neat.

    10 votes
    1. [3]
      teaearlgraycold
      Link Parent
      I think that’s a perfectly reasonable way to word it.

      seems to perform quite well on benchmarks, especially for the size

      You’re just saying the same thing

      9 votes
      1. [2]
        PendingKetchup
        Link Parent
        I think it's a probably-deliberately misleading, and even dangerous, way to word it, because it presupposes that "intelligence" is something that can be quantified, that should be quantified, or that people know how to quantify, to create a number you can divide by something.

        We didn't do all that work to kill the idea that human intelligence can be boiled down to an IQ score that can then be used to assign value to people, just to have it sneak back in as an idea about computers.

        If Google were announcing they had hired a new and "more intelligent per neuron" crop of human employees this year, we would recognize it as obvious eugenics nonsense. If they pitched their new cloud instance type as "more intelligent per dollar" than the previous generation, that would be obviously unhelpful when concrete facts about core counts and memory capacity could be stated instead. Why put up with that rhetoric here?

        We always need to keep in mind that we're using a score on a particular test as an evaluation metric, designed as a proxy for the efficiency with which a particular class of tasks might be accomplished. We can never use the shorthand of calling this just "intelligence".

        7 votes
        1. skybrian
          (edited)
          Link Parent
          IQ isn't being used as a benchmark by the AI companies. They publish results for a variety of more specific benchmarks and the results are improving for most of them. This is summarized as "more capable" or "more intelligent" but that's just the summary.

          Benchmarks can be rigged and researchers keep inventing new ones. There's clearly not a consensus yet for measuring AI capabilities, just a general consensus that some models are stronger than others.

          6 votes
    2. [5]
      kacey
      Link Parent
      Yeah, it feels a little self-congratulatory, especially that it seems close enough to the Qwen3.5 models to not be something extraordinary.

      But I guess that restraint and press releases are mutually exclusive concepts 😅

      7 votes
      1. [4]
        creesch
        (edited)
        Link Parent
        I am not sure that comparing it to Qwen3.5 is entirely fair. At least in my own anecdotal experience, the whole "seems to perform quite well, especially for the size" bit seems to hold for real scenarios as well, at least the ones I have tried so far.

        This week I actually decided to try my hand at running local models again using llama.cpp. For a bit of context, I have an AMD Radeon 9070 XT with 16 GB of memory.

        There is no reasonable way for me to run Qwen3.5 models at reasonable quality without CPU offloading, which completely tanks output performance.

        On the other hand, I am able to run the 26B version of Gemma entirely on my GPU. More importantly, this is the first model I have been able to run on my GPU that also performs very decently so far.

        To the point that I have dusted off some ideas I had ages ago to experiment with LLMs but never pursued because of the costs involved in using online providers.

        7 votes
        1. kacey
          (edited)
          Link Parent
          Interesting! I found that CPU offloading of just the experts (--cpu-moe in llama.cpp) let me run 35b a3b at 4bit with pretty good speeds -- about 200 t/s prefill, and 30 t/s generation. I'm on a pretty new Zen 5 CPU though, so maybe that's making the difference?
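          For reference, the sort of invocation I mean looks roughly like this (the model filename and layer count are placeholders, and flag availability can vary between llama.cpp builds):

```shell
# Load all layers onto the GPU, then override just the MoE expert
# tensors back to CPU; attention and dense weights stay on the GPU.
llama-server -m qwen3.5-35b-a3b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --cpu-moe \
  --ctx-size 16384
```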

          Anyhow, I'm glad that you're able to give it a shot, now!

          4 votes
        2. [2]
          Wes
          Link Parent
          llama.cpp is great, but note it can take a couple of days or sometimes weeks to get all the initial issues worked out with a brand new model, e.g. this tokenizer issue, which was fixed yesterday, with more fixes pending. Especially when they introduce new modalities, like audio in Gemma's case. From there it then rolls into the frontends (ollama, LM Studio, etc.).

          Qwen 3.5 also has a number of third-party quants that exceed the official ones, such as those by Unsloth. You might have better luck running them on your hardware. You can enter your specs into Hugging Face and it'll give you an estimate of what'll perform for you (though I'm not sure how accurate it is for MoE models).

          4 votes
          1. creesch
            Link Parent
            Yup, I am aware, which is among the reasons I still think Gemma 4 is fairly impressive given the size.

            2 votes
  2. [2]
    rich_27
    Link
    As someone who hasn't kept up with Google's offerings, what is the difference between Gemma and Gemini?

    7 votes
    1. FlippantGod
      Link Parent
      Gemini is Google's big closed-source cloud service models. Gemma is small open-source models that can be run locally and offline.

      10 votes
  3. [7]
    pumpkin-eater
    Link
    Perhaps my use-case is too specific, but I've struggled to find a local model better at factual summarisation of transcripts than qwen2.5:14b. So many of the models I've tried try to hide their lack of understanding behind broad statements or implications about what's going on, rather than stating concrete facts as requested in the prompt. gemma4:26b (e4b) is the same so far (I'm trying to produce a blow-by-blow of TTRPG sessions).

    6 votes
    1. [6]
      Omnicrola
      Link Parent
      (I'm trying to provide a blow-by-blow of TTRPG sessions)

      Are you willing to share more? This sounds similar to a pet project I'm currently working on; extracting specific data (not summarized) from podcast transcripts. Currently I'm splitting the transcript into chunks and running it through a Qwen3 model, and getting pretty good results. On par with running the same transcript through Claude Opus.

      4 votes
      1. [5]
        pumpkin-eater
        Link Parent
        • Exemplary

        Sure... firstly, if anybody is interested do DM me for access (the main pitch is privacy by avoiding cloud models, though, so whether there's any benefit to you in having your transcripts/summaries living remotely on my server is another question... I am hoping to release it eventually as a free self-hostable project)

        It's a website and Discord bot that records a separate audio file for each participant in a call and clips up the recordings based on start and end time (it starts recording as soon as everybody joins the chat, but you can DM it to cut off the start of a call, so that general opening banter isn't transcribed). It then runs each recording file separately through whisperx (distil-large-v3.5, silero vad, then WAV2VEC2_ASR_LARGE_LV60K_960H to better align to timestamps), constructs a transcript, runs a fixed pipeline to clean up noise words, produces a transcript that uses character names, and finally runs it through chunked summarisation with qwen2.5:14b using structured output (basically asking it to produce headings for sections and points within the sections).
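        The chunking step itself is nothing exotic - roughly this shape, with names and sizes made up for illustration rather than lifted from my actual code:

```python
def chunk_transcript(lines, max_chars=6000, overlap=2):
    """Split transcript lines into chunks for per-chunk summarisation.

    Carries the last `overlap` lines of each chunk into the next one,
    so the model keeps a little conversational context across boundaries.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        if current and size + len(line) > max_chars:
            chunks.append("\n".join(current))
            current = current[-overlap:]  # carry context forward
            size = sum(len(l) for l in current)
        current.append(line)
        size += len(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

        Each chunk then gets its own summarisation call, and the per-chunk JSON results are merged afterwards.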

        The LLM is being used both to summarise but also to remove off-topic / meta-discussion. It does an OK job, but my tests feeding transcripts into Opus are wildly better.

        At the end, after somebody has signed off on the summary, I turn the combined JSON summary into markdown and send it to Gemini Flash for summarisation.

        I've tried Qwen3, Phi4, etc. but struggled to find something better than Qwen2.5 (I'm limited to a 12 GB 4070). The whisperx transcript has a bunch of errors (some of which might be cleared up if I ran a combined audio file through, but I don't like losing the completely accurate diarisation I get from having separate audio tracks for the different players). I've tried Parakeet too, but accuracy isn't great. Not sure if the problem is the separate audio tracks losing conversational context, accents (a mix of different nationalities), or perhaps simply poor diction.

        At its core it's a pretty straightforward transcribe+summarise process coordinated by TypeScript, complicated by the need to be local-only and by my network setup: a Linux server, but a Windows machine with an RTX 4070 that I use for CUDA (my groups don't want what they say sent to a cloud model, so they don't have to worry about an off-colour joke coming back to bite them in the future).

        I'm not sure if it's generally useful, but I'm certainly happy to share/collaborate if there's anything there that sounds like it might be useful.

        The prompt that I've worked out that gives me the best results is:

        You are summarizing a TTRPG session transcript.
        
        System: (system name)
        Background: (1 sentence context for game)
        
        Player Characters:
        - (character name) - (1 sentence summary)
        
        Instructions:
        - Extract key narrative sections from this transcript chunk
        - Focus on in-game events, interesting character actions, story developments. If in doubt, assume the problems and achievements are interesting, err on the side of completeness
        - The GM plays all NPCs and describes scenes and results
        - Players are identified by their character name in the transcript
        - Describe the interesting things that CHARACTERS DO, places and problems they encounter, WHAT NPCS DO THAT AFFECTS CHARACTERS, and how they solve them
        - USE CHARACTER/NPC NAMES
        - Only report CONCRETE events that actually happened, not implications
        - Filter out off-topic chatter, technical issues, meta-discussion.
        

        (the biggest problem I've had with prompting is trying to block the tendency of models to emit fluff like "raising questions about" or "make deep discoveries", which I don't want them to do, or, even worse, to summarise moods - "after a feverishly pitched battle, the exhausted group..." - I want a simple digest of who did what to whom)

        8 votes
        1. [2]
          Omnicrola
          Link Parent
          That's a really interesting project! What kind of errors is whisperx introducing that you're seeing? I haven't gotten to the cleanup phase of my project yet, but I'm sure I'm going to have to do some, since a lot of the words aren't "real" and basic phoneme detection can only go so far.

          re: your prompt - have you tried providing examples? I got very mediocre results from my model until I gave it a multishot example of "this transcript should result in this output" along with the base prompt.
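          In case it helps, the multishot setup I mean is just extra user/assistant turns ahead of the real input - a sketch with an invented example pair, not my actual prompt data:

```python
def build_messages(system_prompt, examples, transcript_chunk):
    """Assemble a few-shot chat prompt: each (transcript, summary)
    example pair becomes a user/assistant turn ahead of the real input."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": transcript_chunk})
    return messages
```

          The examples eat context, so I keep them short - a few transcript lines paired with the kind of concrete output I want back.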

          5 votes
          1. pumpkin-eater
            Link Parent
            That's really helpful advice, thank you! And encouraging to hear you think it's interesting... I keep doubting whether it's a worthwhile side-project to continue with (especially without buying a new card with more VRAM, which would be way over budget; this is really my only side-project that uses local inference, so it's hard to justify).

            I haven't tried giving examples, tbh. I've been using quite small context windows for chunking and thought any sort of example would blow the context out, but you make a good point... I'm so used to the frontier models' broad zero-shot capabilities. I'd originally imagined you'd need a representative example of the type of conversation in the chat, but maybe I'm presupposing what a solution looks like... I should try coming up with short examples.

            On whisper recognition errors: I'm not sure if my problems are broadly applicable. Speech recognition tends to have problems with Celtic accents, and on top of that we go off on tangents a lot; we aren't disciplined roleplayers. We're a majority Northern Irish group, and the American members are so used to us that we all speak full-speed everyday speech (I just checked, and the effective WPM after noise-word filtering is 1.5x for Irish vs American players). Another part of our broader group has German and Afrikaans accents; they've just started their game and I'm really interested to find out how it'll cope with that.

            The more broadly applicable problem I did have, originally, was reassembling transcripts given that I split up recognition per speaker - as you probably know, whisper timecodes drift in longer recordings; whisperx with wav2vec alignment was a big help.

            LLMs do seem good at seeing through the recognition errors to get a general sense of what's going on (I assume because of how redundant English is).

            One thing I've been wanting to try is an audio model, to try blending text and audio tokens, i.e. a stream of "[speaker name] [audio tokens]" and see if it can do a better job at transcription/direct summarisation with more conversational context available.

            One experiment I tried originally was augmenting transcripts with structure prior to summarisation, but it didn't really pan out (unclear whether that was my prompting, or a flawed approach): my thought was that TTRPG sessions are a collection of scenes, where the GM establishes a scene, PCs and NPCs interact, and then the scene is resolved.

            My approach is a limiting factor for experiments, though, because one of the privacy goals I set is that transcripts get thrown away within 7 days (keeping only the summaries), so I don't have a large backlog of sessions to run experiments against. I wanted that so players wouldn't feel the need to self-censor at the thought of their words living on in a database forever, and I think it's worthwhile; if the sessions being recorded changed how people behaved, I think I'd stop the project, because it's supposed to be fun escapism with friends.

            (apologies for the overly long response...)

            3 votes
        2. [2]
          plutonic
          Link Parent
          (the biggest problem I've had with prompting is trying to block the tendency of models to say fluff like "raising questions about", or "make deep discoveries", which I don't want them to do, or even worse to summarise moods "after a feverishly pitched battle, the exhausted group..." - I want a simple digest of who did what to whom)

          This tendency of the models to constantly add this kind of bullshit fluff is so bizarre; if it were a real person saying things like that, they would be completely insufferable.

          3 votes
          1. creesch
            Link Parent
            It starts to make sense once you take into account that these models have been trained on all data on the internet and how much of the internet is filled with corporate hype and fluff.

            I wouldn't even be surprised if responding in this overly fluffy corporate lingo is something these models have been (partially) trained for on purpose. Basically: talk the language of the people you need to impress.

            5 votes