Tencent has released a new suite of compact Hunyuan models at 0.5 billion, 1.8 billion, 4 billion, and 7 billion parameters, aimed at low-power and edge deployments. All four configurations are now available on GitHub and Hugging Face, and each can run inference on a single consumer-grade graphics card, making them suitable for laptops, smartphones, smart-cabin systems, and other resource-constrained hardware.
Despite their small size, the models achieve leading scores in language understanding, mathematics, and reasoning across several public benchmarks. Tencent attributes these results to a "fusion reasoning" architecture that lets users choose between a fast-thinking mode for concise answers and a slow-thinking mode for more elaborate multi-step reasoning.
A key technical feature is the native 256K-token context window, which Tencent says is sufficient to ingest roughly 500,000 English words in a single pass. The company highlights in-house applications such as Tencent Meeting and WeChat Reading, where the models can parse an entire meeting transcript or full-length book at once, maintaining character relationships and plot details for downstream queries.
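As a back-of-the-envelope illustration of what a 256K-token window holds, the sketch below estimates whether a document fits, assuming roughly 1.3 tokens per English word — a common rule of thumb, not a figure from Tencent; actual ratios depend on the tokenizer and language.

```python
# Rough capacity check for a 256K-token context window.
# TOKENS_PER_WORD is an assumed rule-of-thumb ratio, not a Hunyuan-specific figure.
CONTEXT_WINDOW = 256 * 1024   # 262,144 tokens
TOKENS_PER_WORD = 1.3

def fits_in_context(word_count: int, reserve_for_output: int = 4096) -> bool:
    """Estimate whether a document of `word_count` English words fits,
    leaving `reserve_for_output` tokens for the model's response."""
    estimated_tokens = int(word_count * TOKENS_PER_WORD)
    return estimated_tokens + reserve_for_output <= CONTEXT_WINDOW

print(fits_in_context(120_000))  # a full-length novel of ~120K words
```

Under this assumption, a 120,000-word novel comfortably fits, consistent with the WeChat Reading use case described above.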
The four compact LLMs integrate with mainstream inference frameworks, including SGLang, vLLM, and TensorRT-LLM, and support multiple quantization formats. Initial endorsements from Arm, Qualcomm, Intel, and MediaTek indicate forthcoming deployment packages optimized for their respective client processors.
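For vLLM, serving one of the checkpoints would look roughly like the command below. This is a hedged sketch: `vllm serve` and its `--max-model-len` and `--quantization` flags are standard vLLM CLI options, but the checkpoint ID `tencent/Hunyuan-1.8B-Instruct` and the choice of FP8 quantization are assumptions for illustration, not confirmed deployment details from Tencent.

```shell
# Launch an OpenAI-compatible vLLM server for a compact Hunyuan model.
# Model ID and quantization format are illustrative assumptions.
vllm serve tencent/Hunyuan-1.8B-Instruct \
    --max-model-len 262144 \
    --quantization fp8
```

Once running, the server exposes the usual OpenAI-compatible chat endpoint, so existing client code can target the model without changes.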
Early use cases underscore the practical focus of the release. Tencent Mobile Manager reports millisecond-level spam interception without off-device data transfer. At the same time, a dual-model scheme in Tencent's smart-cabin assistant balances on-board power consumption against conversational depth. These examples, Tencent argues, demonstrate that small models can deliver enterprise-grade agent capabilities when thoughtfully engineered.
Source(s)
Fast Technology (in Chinese)