tapWhisper — Google Gemma Audio

Specifications

Size: 4 GB to 12 GB

Architecture: Multi-modal LLM

Latency: Low (end-to-end)

Language: Multilingual

Developer / Creator

Google DeepMind

License

Gemma Terms of Use; publicly downloadable community LiteRT-LM conversion.

Download Source

Verified Repository Source

Hugging Face Hub / Google Model Registry

litert-community Gemma 4 & Gemma 3n Mirror

Exact runtime artifacts

Model Overview

Gemma Audio is a native end-to-end audio-to-text model. It processes raw audio waveforms directly and produces transcription text in a single pass, with no separate speech-recognition stage feeding a text model. It runs in a persistent LiteRT-LM server bound to localhost only, so the model stays resident in memory and can be reused without reloading during dictation sessions.
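As a rough illustration of the resident-model pattern described above, the following Python sketch loads a model once, keeps it in memory, and exposes it through an HTTP server bound to the loopback interface only. Everything named here is hypothetical and chosen for illustration: `load_model`, `ResidentModel`, `make_server`, the `model.litertlm` file name, and the stub transcript. It is not the LiteRT-LM server API, and the real server may work differently.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_model(path):
    """Hypothetical stand-in for an expensive model load (not a LiteRT-LM call)."""
    return lambda audio_bytes: f"<transcript of {len(audio_bytes)} bytes>"


class ResidentModel:
    """Loads the model on first use and keeps it in memory afterwards."""

    def __init__(self, path, loader=load_model):
        self._path = path
        self._loader = loader
        self._model = None
        self._lock = threading.Lock()
        self.load_count = 0  # how many times the loader actually ran

    def transcribe(self, audio_bytes):
        # The lock serializes requests; good enough for a single-user sketch.
        with self._lock:
            if self._model is None:
                self._model = self._loader(self._path)
                self.load_count += 1
            return self._model(audio_bytes)


def make_server(model, port=0):
    """Create an HTTP server bound to 127.0.0.1 only (port 0 = any free port)."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            text = model.transcribe(self.rfile.read(length))
            body = json.dumps({"text": text}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return HTTPServer(("127.0.0.1", port), Handler)
```

Calling `make_server(ResidentModel("model.litertlm")).serve_forever()` would keep the process, and therefore the loaded model, alive between dictation requests. Binding to `127.0.0.1` rather than `0.0.0.0` is what keeps the endpoint unreachable from other machines.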

Available Model Variants

Model Name	File Size	RAM Usage	Format/Quant	Languages	Description
Gemma 3n	3.41 GB	3.8 GB	INT4 (LiteRT)	Multilingual	Gemma Terms of Use; publicly downloadable community LiteRT-LM conversion.
Gemma 4 E2B	2.41 GB	1.7 GB	INT8 (LiteRT)	Multilingual	Google Gemma 4 audio-capable LiteRT-LM model. Highly efficient end-to-end model.
Gemma 4 E4B	3.41 GB	3.3 GB	INT8 (LiteRT)	Multilingual	Higher-capacity Google Gemma 4 audio-capable model. Advanced language parsing.
Gemma 4 12B	6.10 GB	12.0 GB	INT8 (LiteRT)	Multilingual	Large Google Gemma 4 audio-capable model aimed at the highest transcription fidelity. Requires about 12 GB of RAM.
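One way to use the RAM column is to filter the variants against the memory actually available on a machine. The sketch below does that with the figures from the table above. The `Variant` type, the `fitting_variants` helper and the 1 GB default headroom are assumptions made for illustration and are not part of tapWhisper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    file_gb: float
    ram_gb: float


# Figures copied from the variants table.
VARIANTS = (
    Variant("Gemma 3n", 3.41, 3.8),
    Variant("Gemma 4 E2B", 2.41, 1.7),
    Variant("Gemma 4 E4B", 3.41, 3.3),
    Variant("Gemma 4 12B", 6.10, 12.0),
)


def fitting_variants(free_ram_gb, headroom_gb=1.0, variants=VARIANTS):
    """Return names of variants whose listed RAM usage fits the budget.

    headroom_gb is an assumed safety margin left for the rest of the system.
    Results are ordered from smallest to largest RAM usage.
    """
    budget = free_ram_gb - headroom_gb
    fitting = [v for v in variants if v.ram_gb <= budget]
    return [v.name for v in sorted(fitting, key=lambda v: v.ram_gb)]


if __name__ == "__main__":
    for free in (2.0, 4.5, 16.0):
        print(free, fitting_variants(free))
```

The helper only reports what fits; it does not rank the variants by quality, since the table gives no accuracy figures to rank them by.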
