Specifications
Size: 4 GB to 12 GB
Architecture: Multi-modal LLM
Latency: Low (end-to-end)
Language: Multilingual
Developer / Creator: Google DeepMind
Download Source
Verified Repository Source
Hugging Face Hub / Google Model Registry
Open Model Repository (google/gemma-3)
Model Overview
Gemma Audio is a native end-to-end audio-to-text model. It processes raw audio waveforms directly and produces transcription text without a separate speech recognition stage. It is served by a persistent LiteRT-LM server that listens on localhost only. The model stays resident in memory between requests, so later dictation sessions can reuse it without reloading.
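The "resident in memory" behavior can be pictured as load-once caching: the first request pays the load cost, and every later request reuses the same object. The sketch below is only an illustration of that pattern. It does not use the LiteRT-LM API, and `ResidentModel`, `loader`, and `load_count` are hypothetical names introduced here.

```python
import threading
from typing import Callable, Optional


class ResidentModel:
    """Loads a model once and keeps it in memory for later requests.

    Hypothetical helper for illustration; not part of LiteRT-LM.
    """

    def __init__(self, loader: Callable[[], object]) -> None:
        self._loader = loader              # assumed: any callable that returns a loaded model
        self._model: Optional[object] = None
        self._lock = threading.Lock()      # guards against two threads loading at once
        self.load_count = 0                # how many times the loader actually ran

    def get(self) -> object:
        """Return the resident model, loading it only on first use."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
                self.load_count += 1
            return self._model
```

A server process built this way would call `get()` at the start of each dictation request; only the first call triggers the slow load.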
Available Model Variants
| Model Name | File Size | RAM Usage | Format/Quant | Languages | Description |
|---|---|---|---|---|---|
| Gemma 4 E2B | 2.41 GB | 1.7 GB | INT8 (LiteRT) | Multilingual | Google Gemma 4 audio-capable LiteRT-LM model. The most efficient end-to-end variant listed here. |
| Gemma 4 E4B | 3.41 GB | 3.3 GB | INT8 (LiteRT) | Multilingual | Higher-capacity Google Gemma 4 audio-capable model with more advanced language parsing. |
| Gemma 4 12B | 6.10 GB | 12.0 GB | INT8 (LiteRT) | Multilingual | Large Google Gemma 4 audio-capable model aimed at maximum fidelity. Requires high RAM. |
| Gemma 3n | 3.40 GB | 4.5 GB | INT4 (LiteRT) | Multilingual | Google Gemma 3n audio-capable model. INT4-quantized version for balanced speed. |
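
One way to use the RAM figures in the table is to pick a variant that fits the memory available on the machine. The sketch below copies the names and numbers from the table. The 1 GB headroom default and the rule "prefer the variant with the largest RAM usage that still fits" are assumptions made for illustration, not a documented selection policy, and `largest_variant_that_fits` is a hypothetical helper.

```python
from typing import Optional

# (name, file size in GB, RAM usage in GB), copied from the table above
VARIANTS = [
    ("Gemma 4 E2B", 2.41, 1.7),
    ("Gemma 4 E4B", 3.41, 3.3),
    ("Gemma 4 12B", 6.10, 12.0),
    ("Gemma 3n", 3.40, 4.5),
]


def largest_variant_that_fits(free_ram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the variant with the highest RAM usage that fits the budget.

    headroom_gb is an assumed safety margin left for the rest of the system.
    Returns None when no variant fits.
    """
    budget = free_ram_gb - headroom_gb
    fitting = [(ram, name) for name, _size, ram in VARIANTS if ram <= budget]
    if not fitting:
        return None
    # max() compares on RAM usage first, so the most memory-hungry fit is chosen
    return max(fitting)[1]
```

For example, with 3 GB of free RAM and the default headroom, only Gemma 4 E2B fits; with 16 GB, the function selects Gemma 4 12B.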