The Challenge
Google released Gemma 4 with impressive benchmarks and claims of "mobile readiness." Their TurboQuant research demonstrated advanced quantization techniques achieving excellent quality retention. But there was a problem: the smallest model was still 7.2 GB.
On a phone with 8GB of RAM, loading a 7.2 GB model leaves barely any room for the operating system, apps, or actual inference work. The result? Overheating, slowdowns, and thermal throttling. Google's "mobile-ready" promise didn't translate to real-world usability.
Could we compress Gemma 4 further without catastrophic quality loss, specifically targeting mobile devices with limited RAM?
The Hypothesis
Google's TurboQuant research focused on IQ4_XS (4-bit) quantization with sophisticated importance matrix techniques. We hypothesized that:
- Q3_K_S (3-bit) quantization could work if we accepted slightly lower precision
- Smaller models could be faster due to reduced memory bandwidth requirements
- Mobile thermal constraints matter more than benchmark scores
- Real-world testing would reveal issues lab benchmarks miss
Methodology
1. Quantization Strategy
We used Q3_K_S quantization from llama.cpp, which implements 3-bit k-quants for structured compression. The key design decision: remove the vision and audio capabilities while keeping the full 4.5B parameters and the 128K context window.
- Full parameter retention — 4.5B params preserved for text inference quality
- 128K context maintained — no reduction in context window size
- Multimodal features removed — vision and audio dropped to reduce size
- Simpler dequantization path — fewer operations per token vs IQ4_XS
- Better cache utilization — smaller model fits better in L2/L3 cache
- Lower memory bandwidth — 3 bits vs 4 bits = 25% less data movement
This approach targets low-power systems where text-only AI is needed without GPU acceleration. By removing multimodal features rather than reducing parameters, we maintain text quality while achieving mobile-friendly sizes.
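For reference, producing such a variant with llama.cpp's quantize tool looks roughly like the sketch below; the file names are illustrative and assume an FP16 GGUF export of the base model plus a local llama.cpp build that provides the `llama-quantize` binary.

```python
# Sketch of the Q3_K_S conversion step, assuming a local llama.cpp build that
# provides the `llama-quantize` binary. File names are illustrative.
import subprocess

SRC = "gemma4-e2b-f16.gguf"          # hypothetical FP16 GGUF export of the base model
DST = "gemma4-nano-e2b-q3_k_s.gguf"  # 3-bit k-quant output

# llama-quantize takes the input GGUF, the output GGUF, and the quantization type.
subprocess.run(["./llama-quantize", SRC, DST, "Q3_K_S"], check=True)
```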
2. Model Variants
We quantized the entire Gemma 4 family:
| Model | Original Size | Nano Size | Reduction |
|---|---|---|---|
| gemma4:e2b (2B params) | 7.2 GB | 3.1 GB | 57% |
| gemma4:e4b (4B params) | 9.6 GB | 4.7 GB | 51% |
| gemma4:26b (26B params) | 18 GB | 12 GB | 33% |
| gemma4:31b (31B params) | 20 GB | 14 GB | 30% |
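The "Reduction" column is straightforward arithmetic over the sizes above; a quick check:

```python
# Verify the "Reduction" column: 1 - nano/original, rounded to a whole percent.
sizes_gb = {
    "gemma4:e2b": (7.2, 3.1),
    "gemma4:e4b": (9.6, 4.7),
    "gemma4:26b": (18.0, 12.0),
    "gemma4:31b": (20.0, 14.0),
}
for name, (original, nano) in sizes_gb.items():
    print(f"{name}: {100 * (1 - nano / original):.0f}% smaller")
```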
3. Real-World Testing
Unlike lab benchmarks, we tested on actual consumer hardware:
- Server testing — Intel Xeon E-2236 (6 cores @ 3.4GHz, no GPU)
- Mobile testing — High-end phone with 8GB RAM, 256GB storage
- Laptop testing — Consumer laptop without discrete GPU
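A minimal way to reproduce the CPU throughput numbers is sketched below, assuming the llama-cpp-python bindings; this is a reproduction sketch, not the exact harness used for the reported measurements.

```python
# Minimal CPU tokens-per-second check, assuming the llama-cpp-python bindings.
# Model path, prompt, and settings are illustrative.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="gemma4-nano-e2b-q3_k_s.gguf",  # hypothetical local path
    n_ctx=2048,
    n_threads=6,  # matches the 6-core test Xeon
)

start = time.time()
out = llm("Explain thermal throttling on phones in one paragraph.", max_tokens=256)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated / elapsed:.2f} tok/s")
```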
Results
Performance Discovery
The most surprising finding: nano e2b was nearly 2x faster than nano e4b (20.63 tok/s vs 11.31 tok/s), a larger gap than the difference in parameter count alone would predict. Why?
- Cache effects — more of the per-layer working set stays resident in L2/L3 cache on the test server
- Memory bandwidth — Smaller model = less data movement
- Simpler dequant — Q3_K_S has faster dequantization than IQ4_XS
- Zero network latency — Local inference beats cloud even at lower tok/s
Beyond a certain point, memory footprint matters more than parameter count. A 2B model that stays within the memory system's limits outperforms a larger model that doesn't.
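A back-of-the-envelope check makes this concrete. If decoding were purely bandwidth-bound (every weight streamed once per generated token), throughput would scale inversely with model size; the measured gap is even larger, which is consistent with the cache and dequantization effects listed above.

```python
# Footprints and measured throughputs from the sections above.
e2b_gb, e4b_gb = 3.1, 4.7
e2b_tps, e4b_tps = 20.63, 11.31

footprint_ratio = e4b_gb / e2b_gb   # ~1.5x speedup expected from size alone
measured_ratio = e2b_tps / e4b_tps  # ~1.8x actually observed

print(f"predicted from size: {footprint_ratio:.2f}x, measured: {measured_ratio:.2f}x")
```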
Mobile Validation
The critical test: real phone deployment. Using the Offline AI app with Hugging Face models:
| Model | Size | Temperature | Performance | Usable? |
|---|---|---|---|---|
| Stock gemma4:e2b | 7.2 GB | 🔥 Overheating | Slow | ❌ No |
| gemma4-nano:e2b | 3.1 GB | ✅ Cool | Fast | ✅ Yes |
Why the stock model fails:
- 7.2 GB model + OS overhead = constant memory pressure
- Heavy memory bandwidth usage generates heat
- CPU throttles to prevent thermal damage
- Performance degrades into unusability
Why nano succeeds:
- 3.1 GB leaves 4+ GB for OS and apps
- Reduced memory bandwidth = less heat
- CPU sustains higher clocks without throttling
- Stays usable for extended sessions
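One practical consequence of the memory-pressure point above is a pre-load check: refuse to load a model that would not leave headroom for the OS and other apps. A minimal sketch, assuming psutil is available and a 2 GB headroom target (both are assumptions, not measured thresholds):

```python
# Pre-load memory check: only load models that leave headroom for the OS and apps.
import os
import psutil  # third-party: pip install psutil

HEADROOM_GB = 2.0  # illustrative threshold, not a measured requirement

def fits_in_ram(model_path: str, headroom_gb: float = HEADROOM_GB) -> bool:
    model_gb = os.path.getsize(model_path) / 1e9
    available_gb = psutil.virtual_memory().available / 1e9
    return available_gb - model_gb >= headroom_gb

print(fits_in_ram("gemma4-nano-e2b-q3_k_s.gguf"))  # hypothetical path
```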
Beyond Google's Research
Google's TurboQuant research is technically excellent, but it optimizes for the wrong constraints:
| Aspect | TurboQuant Focus | Nano Focus |
|---|---|---|
| Primary Goal | Benchmark scores | Real-world usability |
| Target Platform | Cloud/high-end devices | Budget phones/laptops |
| Constraint | Quality retention | Thermal limits |
| Optimization | Sophisticated IQ matrices | Aggressive size reduction |
Implications
For Mobile AI
Our testing indicates that truly mobile AI requires sub-4 GB models on today's 8 GB phones. Anything larger will:
- Overheat on budget and mid-range phones
- Drain battery due to thermal throttling
- Provide poor user experience
- Fail to scale to billions of devices
For Edge Deployment
The nano approach enables:
- Offline-first applications — No internet required
- Privacy-preserving AI — Data never leaves device
- Zero inference cost — No API bills
- Instant response — Zero network latency
For the Industry
Google's "mobile-ready" marketing creates a gap between promise and reality. Our work demonstrates that:
- Technical capability ≠ practical usability
- Lab benchmarks miss thermal constraints
- Indie researchers can validate industry claims
- Real-world testing is essential
We're not competing with Google's research. We're delivering on their promise. They said "mobile-ready." We made it actually work.
Open Questions
Our research raises several questions for future work:
- Q2_K quantization — Can we go even smaller without catastrophic loss?
- Mixed precision — Can critical layers use 4-bit while most use 3-bit?
- Dynamic quantization — Can models adapt bitwidth based on available RAM?
- Thermal-aware inference — Can we throttle intelligently before overheating? (A rough sketch of this idea follows below.)
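On that last point, here is roughly what thermal-aware inference could look like on a Linux-based device; the sysfs path, the 70 °C limit, and the `generate_chunk` callback are all assumptions for illustration.

```python
# Sketch: poll the SoC temperature between generation chunks and back off
# before the kernel throttles the CPU. Paths and thresholds are illustrative.
import time

THERMAL_ZONE = "/sys/class/thermal/thermal_zone0/temp"  # millidegrees C on Linux
LIMIT_C = 70.0

def cpu_temp_c() -> float:
    with open(THERMAL_ZONE) as f:
        return int(f.read().strip()) / 1000.0

def generate_with_backoff(generate_chunk) -> None:
    """Call generate_chunk() until it reports completion, pausing when hot.

    generate_chunk is a hypothetical callback that emits a few tokens and
    returns False once generation is finished.
    """
    while True:
        if cpu_temp_c() > LIMIT_C:
            time.sleep(2.0)  # cool down instead of letting the CPU throttle
            continue
        if not generate_chunk():
            break
```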
Availability
All gemma4-nano models are available:
- Ollama Hub: ollama.com/ssfdre38/gemma4-nano
- Hugging Face: huggingface.co/ssfdre38/gemma4-nano-gguf
- GitHub: github.com/ssfdre38/gemma4-turbo
This research was conducted independently as part of the Kaggle Gemma 4 Good Hackathon. All testing was performed on consumer hardware without sponsorship or endorsement.