I just ran one of these locally on a Mac like this:
  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu \
    --prompt="Generate an SVG of a pelican riding a bicycle"
The first time you run that it downloads 3.2GB to ~/.cache/huggingface/hub/models--litert-community--gemma-4-E2B-it-litert-lm
It can handle audio and image input too, which is pretty cool for a 3.2GB model. For images:
There's Unsloth's collection as well [0], with their results [1]. It looks like they can get very close to 100% accuracy compared to the unquantized BF16 model, and Unsloth's quants are better than Google's original QAT as posted in the article.
Personally, I'm using the 2B model for web search and structured JSON output via Unsloth Studio and its API. It works very well for that, even with the model embedded on phones.
you misunderstand what that chart shows - it shows BF16 QAT Q4_0, not BF16 regular.
meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.
Like storing small 8 bit numbers in full 32 bit integers.
So it's not close to 100% of unquantized BF16.
I'm curious if anybody can explain why Google's released 4-bit QAT Q4_0 is not exactly 100% of BF16 QAT Q4_0? It seems like it should be just bit twiddling, with no further quantization needed to convert between these two packings. Unsloth talks about "lattice alignment" being an issue.
That being said, I hate it that smol model makers like Google, Qwen, ... only show the BF16 benchmarks when they release new models, knowing that what people really run are 4-8 bit quantizations, so it's really hard to understand how much you lose when you run 4 bit vs 6 bit...
> meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.
You also misunderstand what is happening. Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit. The BF16 QAT is not an upscaled 4-bit model. When quantized to 4-bit, it should lose less accuracy than a typical 16-bit model loses when quantized to 4-bit, but the loss is not zero, because it is not based on a 4-bit model.
"Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0."
The BF16 is just trained to be more resistant to simulated quantization, which helps when it is actually quantized. Google is not doing post-training on the 4-bit model directly.
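To make the distinction above concrete, here is a minimal sketch of block-wise 4-bit "fake quantization" (quantize, then dequantize). It is an illustration only: the symmetric [-7, 7] scheme, the function names and the random weights are all assumptions, and it is neither llama.cpp's exact Q4_0 layout nor Google's training code. It shows that 16-bit weights do not sit on the 4-bit grid, so the first quantization loses information, while values that already sit on the grid round-trip almost exactly.

```python
import random

BLOCK = 32  # weights per block; assumed here, chosen to mirror Q4_0-style blocking


def fake_quant_block(block):
    """Quantize one block to integers in [-7, 7] with a shared scale, then dequantize."""
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0:
        return list(block)
    scale = max_abs / 7.0
    return [max(-7, min(7, round(w / scale))) * scale for w in block]


def fake_quant(weights):
    """Apply fake_quant_block to consecutive blocks of BLOCK weights."""
    out = []
    for i in range(0, len(weights), BLOCK):
        out.extend(fake_quant_block(weights[i:i + BLOCK]))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
# Stand-in for 16-bit checkpoint weights (hypothetical values, not real model weights).
weights = [rng.gauss(0.0, 0.02) for _ in range(4 * BLOCK)]

once = fake_quant(weights)   # 16-bit style weights -> 4-bit grid
twice = fake_quant(once)     # values already on the grid -> 4-bit grid again

err_first = mean_abs_error(weights, once)
err_second = mean_abs_error(once, twice)
print(f"error quantizing full-precision weights: {err_first:.6f}")
print(f"error re-quantizing on-grid weights:     {err_second:.2e}")
```

If the BF16 QAT checkpoint were really a 4-bit model stored in a wider format, converting it would behave like the second case. Because it is a 16-bit model trained to tolerate quantization, it behaves like the first case, just with less damage than an ordinary checkpoint.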
Very impressed with how much the Gemma ecosystem has advanced just this week.
Gemma 12B, multitoken prediction, and official quants released. Feels like Google is putting real effort into this string of releases, and I'm very excited to see that!
It's good that this post lists the expected VRAM usage for the models, with Q4_0 Gemma 4 12B being 6.7GB, which does indeed fit Google's claim of fitting within 16GB comfortably, although it confirms that only the quantized version will do so.
Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported due to not enough RAM even on a 16GB machine, but given the expected VRAM usage here the Q4_0 variant definitely should fit and Google should fix that.
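A rough back-of-the-envelope check of that 6.7GB figure, as a sketch under stated assumptions: exactly 12 billion parameters, every weight stored in the same format, and a Q4_0-style layout of 32 four-bit weights plus one 16-bit scale per block (4.5 bits per weight). Real files mix precisions and runtime use adds KV cache and activations on top, so treat the numbers as ballpark only.

```python
def estimate_weight_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed Q4_0-style block: 32 weights x 4 bits, plus one 16-bit scale.
Q4_0_BITS_PER_WEIGHT = (32 * 4 + 16) / 32
BF16_BITS_PER_WEIGHT = 16

N_PARAMS = 12e9  # assumption: "12B" taken literally

q4_gb = estimate_weight_gb(N_PARAMS, Q4_0_BITS_PER_WEIGHT)
bf16_gb = estimate_weight_gb(N_PARAMS, BF16_BITS_PER_WEIGHT)

print(f"Q4_0 weights: ~{q4_gb:.2f} GB")
print(f"BF16 weights: ~{bf16_gb:.2f} GB")
```

Under these assumptions the quantized weights land close to the quoted 6.7GB, while the BF16 weights alone would exceed a 16GB machine.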
I'm not sure why you think it's awkward to have multiple releases. It's better to release models and variations as they're ready, not withhold them all until everything is ready to release all at once.
The Q4_0 is a quantization aware training checkpoint. It's not a simple quantization of the original Gemma 4 12B.
It probably sounds silly and really whiny in the abstract. It just causes a ton of work / confusion downstream that feels unnecessary.
Extremely glad for the output, not glad to have to chase it.
ex. llama.cpp currently supports the originals but not the MTP predictors. There is a patch for the MTP predictors, but not for the small MoE models. I think it supports the 12B, but maybe not media for it yet. Now we have these too, and the blog says there are GGUFs (llama.cpp models), but there aren't any in the 12? repos I clicked through. And ~every consumer-facing local LLM app is built on llama.cpp or a fork of it.
Also if anyone at Google is taking feedback over to b/ or product, pleaseeee stop the "E"2B "E"4B thing, unless it's actually taking up less RAM on Android during CPU inference. I can't tell if I need to treat the 4B like an 8B (i.e. beyond most consumer hardware without a GPU) or a 4B (i.e. will run on most consumer hardware since 2021)
EDIT: And, yes, the QAT 12B x mmproj does not work with llama.cpp. I'm glad there's people who have the luxury of not having to, well, actually use these and treat me as whining :) I'll need to schedule another 4-8 hours of work for the 4th time, no fun!
These models aren't products? They are open source ish (open weight I guess), research outputs. While the naming scheme may be confusing, it is relevant and important. I believe it's on you to understand it.
And you're absolutely right to point out they aren't products - I hoped that was clear - when you're building a product with them, you end up having to do the same build loop 4 times, in this instance :)
You can stop after the first one. Choosing to repeat the process is on you, and probably because you see some benefit in using the variant(s) you build on top of.
Yes my framing was a little confusing. You were clear in that you are building products on them.
I was more saying that because these gemma models are not products, and instead research outputs, the naming scheme should be more scientific rather than consumer friendly.
I was just testing Gemma E2B and E4B yesterday, and they are just too dumb to be useful outside of niche use cases.
Besides, there's no good agent on Android. Having a model that can't run web searches and browse websites is limited in use, particularly small models that really need to be grounded on search results to be factual, because they can't memorize enough.
Edit: I'd like to know what kind of usage the people that seem to disagree and downvoted this are having.
I think that's probably true for the vast majority of Android phones. But if you have a SOTA expensive beast, I wonder if Gemma 4 12B at 4 bit could work? Maybe something like a Redmagic 11 pro or OnePlus 13 running NanoClaw?
But also maybe a few Qwen 3.6 or Qwen 3.5 variants can fit and can handle some simple tasks.
I think Gemma 4 12B is definitely possible to run on high-end phones; Google claims you need 16GB of memory. But it's probably not very usable, since you'll need to swap out most stuff other than the LLM.
When I tried E2B and E4B with Google Edge Gallery, and added a web search skill from the skill list, E2B would fail (get stuck in a loop), E4B would need a very specific instruction, "weather in [city name]" would not call the web search tool, I'd need "web search weather in [city name]". And the result was completely hallucinated and impossible. It claimed 14c and feels like 4c (which is impossible), and 10% humidity (which is almost impossible in this city)
Asking wikipedia level history questions (without any tool use), the results were awful as well.
I don't get this obsession with smaller models. I've been using Claude and GPT models for years and have had zero issues with them.
I see absolutely no benefit to me as an end user for a local model which is going to take up more of my CPU and memory and slow down my machine. I almost always have Internet and if I don't then not having access to an AI model is the least of my concerns.
The entire universe of automation projects that can be run effectively for free relative to SoTA models?
I don't think many realize that most LLM embedded automation, pipelines, products will soon be able to run extremely cheaply on models < 100B parameters.
Frontier models will be used for coding/creation use cases, yes. But for all the pseudo-deterministic, pipeline, analysis style things there will be no practical benefit to running frontier models, only additional cost.
Gemma 4 26B outperforms most 100-200B models that I've tested for reasoning and structured output.
Gemma 4 12B can consistently select where to click on browser images given a minimal prompt, and do so very quickly.
Practically if you're running a small personal automation project you're not going to want to waste a lot of time configuring and tuning a local model. You want to build the automation and move on.
If you're building an automation as a company you definitely won't want to take on the long term maintenance overhead of running your own models for some automation project.
These small models exist in the cloud and are/will be priced commensurately to their size.
Your claim is effectively that companies don't care about operational/cloud costs. Even pre-LLM, companies regularly assessed and tried to pare down cloud spend.
There is tinfoil.sh as well but honestly running this stuff on an airgapped server allows a better peace of mind about the data being used for something else.
What's wrong with the data being used for something else? Someone is providing digital intelligence to us, saving us many hours a week, so the least we can do is provide them a little data so they are able to improve their service.
It would be selfish and unethical not to in my view. And ultimately the data is just being used in order to improve the models and benefit us, not for anything nefarious.
By that logic, any software you run that isn't fully built by yourself is "third party" therefore you shouldn't run anything at all on your machine, thus obviating the need for it entirely.
I don't like the gaslighting of paying Anthropic or Open(Closed)AI and it being said it's unsustainable for them to take my payment while simultaneously they take my data (edit: which is incredibly valuable) and I cannot opt out of that.
The obsession is for leaving hostile and abusive entities, the corporations or the people who fund them that have a horrible track record in regards to ethicality, rights and respect & human dignity.
My view is, if you're going to use the service - you should give the data.
It's like using Gmail and expecting them not to train their AI models on your data - how can you expect that when they're giving you a secure, reliable, highly functional email client completely for free?
The digital economy only works if everyone pays their fair share. If you don't want to give your data then you are really harming everyone by slowing down AI development for everyone else.
If I pay you for a service, what implicit right should you have to then continue to profit in perpetuity by storing the data I paid you to process?
If LLMs were free your Gmail analogy might hold up. They aren’t, and so it doesn’t.
AI development can continue with the data folks opt into, or with the data AI companies incessantly scrape with reckless disregard for polite system loads. AI development does not require retaining all user inputs forever.
And for audio:
(The pelican is rubbish, but it's only a 3.2GB file so the fact it even outputs valid SVG is impressive to me: https://gist.github.com/simonw/94b318afde4b1ce5ff67d4b5d0362... )
[0] https://huggingface.co/collections/unsloth/gemma-4-qat
[1] https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis
The Gemma 3 QAT report was a bit clearer:
https://developers.googleblog.com/en/gemma-3-quantized-aware...
- Gemma 4 2B/4B/27BE3B/31B
- Gemma 4 2B/4B/27BE3B/31B x "assistant" / MTP drafter models (i.e. multitoken prediction)
- Gemma 4 12B (2 days ago? 1?)
- Gemma 4 QAT 2B/4B/12B/27BE3B/31B x "assistant" models (i.e. multitoken prediction)
The E4B model doesn’t fit on my phone TPU, so it swaps to RAM, the QAT version means more accuracy, good!
Would be interested in testing this on my Pixel.
https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-un...
(Pardon my ignorance; this stuff moves so fast)
All 3 years?
I'd rather not have intensive compute needed shifted onto my personal machine which I want to use for something else.