Gemma 4 on Hugging Face: Google's Easter surprise is now available for download

Just before Easter, Google dropped a major surprise on Hugging Face: the long-awaited Gemma 4 is now available for download. The launch covers four size classes: E2B, E4B, 26B A4B, and 31B. All models include an integrated "Thinking" mode that lets them work through complex problems step by step before delivering a final answer. The excitement around the release is evident: within hours of its debut, Gemma 4 could be run locally in tools like LM Studio and Unsloth.
According to Google, this new generation prioritizes efficiency over raw size. The standout improvement over Gemma 3 is that the smallest models in the new series already match the largest Gemma 3 model across various benchmarks. In practical terms, tasks that previously required high-end hardware can now run locally on a smartphone.
The architecture varies depending on the intended use case. While the 31B variant uses a relatively classic structure, the 26B A4B model employs a Mixture-of-Experts (MoE) approach. During inference (the actual calculation process), only about four billion of the model's 26 billion parameters are activated for each token. This keeps speed high and compute requirements moderate without sacrificing depth of knowledge, although the full 26 billion parameters still have to be kept available in memory. The smaller E2B and E4B models use Per-Layer Embeddings (PLE), which provide specialized information for each token at every layer of the model, optimizing performance specifically for mobile processors.
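The routing idea behind a Mixture-of-Experts layer can be illustrated with a toy sketch. Everything in it is an assumption for illustration: the four tiny "experts", the router weights, and the top_k value are hypothetical and not taken from Gemma 4, whose actual routing details are not described here. The point is only that each input runs through a small subset of experts while the rest stay idle.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, router_weights, top_k=2):
    """Run x through the top_k highest-scoring experts only.

    Returns the gated mix of their outputs and the chosen expert indices.
    """
    # Router score per expert: dot product of the input with that expert's router row.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Gate weights are normalized over the chosen experts only.
    gates = softmax([scores[i] for i in chosen])
    out = [0.0] * len(x)
    for gate, i in zip(gates, chosen):
        y = experts[i](x)  # only the chosen experts are ever evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen

# Hypothetical toy setup: four "experts" that simply scale their input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

out, chosen = moe_forward([1.0, 2.0], experts, router_weights, top_k=2)
print(chosen)  # indices of the experts that actually ran

# Share of parameters active per step, using the article's rounded figures.
active_fraction = 4 / 26
print(f"{active_fraction:.0%}")
```

With the rounded figures of 4 billion active out of 26 billion total parameters, roughly 15 percent of the weights take part in each step.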
There are also significant advancements in the context window—the amount of data the model can keep "in mind" simultaneously. The E2B and E4B models support 128,000 tokens, while the larger variants (26B A4B and 31B) can handle up to 256,000 tokens. This capacity allows users to analyze massive documents or complex code structures in a single pass.
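Whether a given document fits into one of these windows can be estimated with a short sketch. The context limits are the figures quoted above; the four-characters-per-token ratio and the output reserve are rough assumptions, and a precise count would require the model's actual tokenizer.

```python
import math

# Context limits in tokens, as quoted in the article.
CONTEXT_LIMITS = {
    "E2B": 128_000,
    "E4B": 128_000,
    "26B A4B": 256_000,
    "31B": 256_000,
}

def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; chars_per_token is an assumed heuristic."""
    return math.ceil(len(text) / chars_per_token)

def fits_in_context(text, model, reserve_for_output=2_000):
    """True if the estimated prompt plus reserved output tokens fit the window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_LIMITS[model]

# Stand-in for a long document of 600,000 characters (about 150,000 estimated tokens).
document = "x" * 600_000
print(fits_in_context(document, "E4B"))
print(fits_in_context(document, "31B"))
```

Under these assumptions, the stand-in document exceeds the 128,000-token window of the edge models but fits comfortably into the 256,000-token window of the larger variants.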
Multi-modality is deeply integrated into Gemma 4, allowing users to mix text and images seamlessly within a single prompt. The models are capable of object recognition, reading PDF documents, and Optical Character Recognition (OCR). Furthermore, the edge models (E2B and E4B) include native processing for video and audio formats, enabling features such as automatic speech recognition.
Another powerful feature is native support for "Function Calling." This allows the AI to act as a virtual assistant, independently executing software commands or using external tools to complete tasks. A clear example of this trend is the "OpenClaw" tool currently popular in China, which relies on this principle of AI agents. With Gemma 4, deploying such systems entirely on one's own device becomes significantly easier.
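The application side of function calling can be sketched in a few lines. This is a minimal illustration under stated assumptions: the JSON shape of the tool call, the tool names, and the dispatch helper are all hypothetical and do not reflect Gemma 4's actual output format. The principle is that the model proposes a call, and the surrounding program decides whether and how to execute it.

```python
import json

def get_weather(city):
    """Hypothetical tool; a real one would query a weather service."""
    return f"No live data available for {city} in this sketch."

def add(a, b):
    """Hypothetical tool that adds two numbers."""
    return a + b

# Allow-list of tools the model is permitted to call.
TOOLS = {"get_weather": get_weather, "add": add}

def dispatch(tool_call_json):
    """Parse a model-emitted tool call (assumed JSON shape) and execute it."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))

# Stand-in for text a model might emit when it decides to use a tool.
result = dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}')
print(result)
```

Keeping an explicit allow-list means an agent can only trigger functions the developer registered, which matters when the model runs with access to a local machine.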
The legal framework is also a welcome change: the models are released under the Apache 2.0 license, whereas all previous Gemma models were released under a custom license authored by Google. This means they are not only free to use but can also be flexibly integrated into proprietary projects and used commercially, which drastically lowers the barrier for developers.
Initial hands-on testing underscores the impressive linguistic capabilities and increased efficiency of these models. Using LM Studio on a Bosgame M5, we achieved a response speed of just over 10 tokens per second (tok/s) with the Gemma 4 31B model, which is faster than the average reader can take in the text. The smaller models are even more agile: the E4B and 26B A4B variants easily exceed 40 tok/s, with the smallest model (E2B) topping 60 tok/s. However, those wishing to use the full context size of the largest Gemma 4 model may find even 128 GB of RAM (as found in the Bosgame M5) to be tight: the model can claim over 80 GB for itself, leaving little memory available for other tasks.
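A back-of-the-envelope calculation helps put these numbers in perspective. The sketch below estimates the memory needed for the weights alone and the time a response takes at a given speed. The bit widths are assumptions, since the quantization used in the test is not stated, and the estimate deliberately ignores the cache that grows with context length, which is one reason real usage at full context is higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def generation_seconds(n_tokens, tokens_per_second):
    """Time to generate n_tokens at a steady output speed."""
    return n_tokens / tokens_per_second

# Weights of a 31-billion-parameter model at assumed precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(31e9, bits):.1f} GB")

# A 500-token answer at the roughly 10 tok/s measured for the 31B model.
print(f"{generation_seconds(500, 10):.0f} s")
```

At 16-bit precision the weights of a 31-billion-parameter model alone come to about 62 GB, so total usage above 80 GB at full context is plausible once the context cache and runtime overhead are added.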