Model Directory Profile

Google Gemma Audio

4 variants

Specifications

Size 2.41 GB to 6.10 GB on disk (1.7 GB to 12.0 GB RAM)
Architecture Multi-modal LLM
Latency Low (end-to-end)
Language Multilingual

Developer / Creator

Google DeepMind

Download Source

Verified Repository Source

Hugging Face Hub / Google Model Registry

Open Model Repository (google/gemma-3)

Model Overview

Gemma Audio is a native end-to-end audio-to-text model. It processes raw audio waveforms directly and produces transcription text without a separate intermediate speech-to-text step. It runs on a persistent, localhost-only LiteRT-LM server, so the model stays resident in memory and can be reused without reloading during dictation sessions.
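
The serving pattern described above (one long-lived process, bound to localhost only, with the model loaded once) can be sketched in Python as follows. This is only an illustration of the pattern and not the LiteRT-LM server or its API: `StubModel`, the `/transcribe` path, and the JSON response shape are all hypothetical stand-ins.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubModel:
    """Hypothetical stand-in for a loaded model; counts how often it is constructed."""

    loads = 0

    def __init__(self):
        StubModel.loads += 1

    def transcribe(self, audio: bytes) -> str:
        # A real model would decode the waveform; the stub only reports its size.
        return f"<{len(audio)} bytes of audio>"


# Loaded once at startup and kept resident for the lifetime of the server.
MODEL = StubModel()


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        body = json.dumps({"text": MODEL.transcribe(audio)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_server() -> HTTPServer:
    # Binding to 127.0.0.1 makes the server unreachable from other machines.
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def transcribe(server: HTTPServer, audio: bytes) -> str:
    host, port = server.server_address
    request = urllib.request.Request(
        f"http://{host}:{port}/transcribe", data=audio, method="POST"
    )
    # An empty ProxyHandler keeps the request off any configured HTTP proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.load(response)["text"]
```

Every call to `transcribe` reuses the same `MODEL` instance, which is the property that makes repeated dictation requests cheap: the loading cost is paid once when the server starts.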

Available Model Variants

| Model Name | File Size | RAM Usage | Format / Quantization | Languages | Description |
|---|---|---|---|---|---|
| Gemma 4 E2B | 2.41 GB | 1.7 GB | INT8 (LiteRT) | Multilingual | Google Gemma 4 audio-capable LiteRT-LM model. Highly efficient end-to-end model. |
| Gemma 4 E4B | 3.41 GB | 3.3 GB | INT8 (LiteRT) | Multilingual | Higher-capacity Google Gemma 4 audio-capable model. Advanced language parsing. |
| Gemma 4 12B | 6.10 GB | 12.0 GB | INT8 (LiteRT) | Multilingual | Large Google Gemma 4 audio-capable model for ultimate fidelity. Requires high RAM. |
| Gemma 3n | 3.40 GB | 4.5 GB | INT4 (LiteRT) | Multilingual | Google Gemma 3n audio-capable model. Int4 quantized version for balanced speed. |
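
One way to use the RAM figures listed above is to pick the largest variant that fits a machine's free memory. The sketch below is a minimal illustration and not part of tapWhisper: the `largest_fitting` helper and its 1 GB `headroom_gb` default are hypothetical, and it treats listed RAM usage as a rough proxy for model capacity, which is an assumption.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    name: str
    file_gb: float
    ram_gb: float
    quant: str


# Figures copied from the variant listing above.
VARIANTS = [
    Variant("Gemma 4 E2B", 2.41, 1.7, "INT8"),
    Variant("Gemma 4 E4B", 3.41, 3.3, "INT8"),
    Variant("Gemma 4 12B", 6.10, 12.0, "INT8"),
    Variant("Gemma 3n", 3.40, 4.5, "INT4"),
]


def largest_fitting(free_ram_gb: float, headroom_gb: float = 1.0) -> Optional[Variant]:
    """Return the variant with the highest listed RAM usage that still fits.

    headroom_gb is a hypothetical safety margin left for the rest of the system.
    Returns None when no variant fits the remaining budget.
    """
    budget = free_ram_gb - headroom_gb
    fitting = [v for v in VARIANTS if v.ram_gb <= budget]
    return max(fitting, key=lambda v: v.ram_gb, default=None)
```

With 8 GB free and the default headroom, the budget is 7 GB, so the helper selects Gemma 3n (4.5 GB); Gemma 4 12B is only chosen once at least 13 GB is free.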