Benchmarks, hardware compatibility, local setup & review for Qwen 3.6-35B-A3B — the efficient MoE model that runs on consumer GPUs.
Select your GPU and system specs to find out which quantization you can run and how fast it will be.
How Qwen 3.6-35B-A3B stacks up against Llama 4 Scout, Gemma 4, Phi-4, and Mistral on standard benchmarks.
| Model | Params (Active) | MMLU | GSM8K | HumanEval | BBH | ARC-C | License |
|---|---|---|---|---|---|---|---|
| Qwen 3.6-35B-A3B 👑 Alibaba · Apr 2026 | 35B (3B active) | 86.4% | 95.1% | 84.2% | 77.3% | 72.1% | Open |
| Llama 4 Scout Meta · Apr 2025 · MoE | 109B (17B active) | 83.1% | 90.3% | 78.6% | 73.8% | 69.4% | Open |
| Gemma 4 27B Google · Mar 2026 | 27B (dense) | 82.7% | 88.4% | 76.1% | 71.2% | 70.8% | Open |
| Phi-4 14B Microsoft · Dec 2024 | 14B (dense) | 78.9% | 85.2% | 73.4% | 68.6% | 65.3% | Open |
| Mistral Small 3.1 Mistral · Mar 2025 | 24B (dense) | 77.4% | 81.7% | 74.2% | 66.1% | 63.8% | Open |
* Benchmarks from official model cards and community evaluations. Results may vary by prompt format. Qwen 3.6 results from Alibaba official release, April 17 2026.
**95.1% on GSM8K (+4.8pp over Llama 4):** near-perfect grade-school math. Qwen 3.6 leads all open models on numerical reasoning, likely due to dedicated math training data and MoE routing that lets experts specialize.

**84.2% HumanEval pass@1 (+5.6pp over Llama 4).** Qwen 3.6 outpaces Llama 4 Scout on Python generation despite having far fewer active parameters, demonstrating the efficiency of expert routing.

**~5.7x fewer active params than Llama 4.** With only 3B active parameters, Qwen 3.6 delivers SOTA-class results at a fraction of the compute cost of dense models. Inference is 4–6x cheaper than running a full 35B dense model.

Qwen 3.6-35B-A3B is Alibaba's April 2026 open-source MoE model. Here's what makes it tick.
**35B total · 3B active.** Rather than activating all parameters for each token, Qwen 3.6 uses a learned router to select a small subset of "expert" sub-networks. Only ~3B of the 35B parameters activate per token, enabling near-dense-quality output at a fraction of the compute.

**128K context.** Full name: Qwen3.6-35B-A3B, released April 17, 2026 by Alibaba DAMO Academy. The "A3B" suffix stands for "Active 3 Billion." Context window: up to 128K tokens, depending on variant.

**RLHF + Math RL.** Trained on a multilingual dataset with a strong emphasis on STEM, code, and reasoning. The model continues Alibaba's Qwen lineage, with mathematical reasoning improved by reinforcement-learning post-training.

**Open license.** Released under the Qwen license (permissive for commercial use under 100M MAU). Available on Hugging Face as Qwen/Qwen3.6-35B-A3B; GGUF quantized versions are available for llama.cpp.
A lightweight router network learns to assign each input token to the best-suited experts. This enables specialization — different experts handle code vs. language vs. math — without increasing inference cost.
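The gating just described can be sketched in a few lines of NumPy. This is an illustrative toy (dense scoring plus top-2 selection over four tiny tanh "experts"), not Qwen's actual implementation; all shapes and weights here are made up:

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Toy mixture-of-experts layer: route each token to its top-k experts.

    x: (tokens, d_model) activations
    router_w: (d_model, n_experts) router weights
    experts: list of (w, b) pairs, one tiny tanh "expert" per entry
    """
    logits = x @ router_w                           # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]      # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, topk[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                        # softmax over the k chosen experts
        for gate, e in zip(gates, topk[t]):
            w, b = experts[e]
            out[t] += gate * np.tanh(x[t] @ w + b)  # only k experts run per token
    return out, topk

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 4, 3
x = rng.normal(size=(tokens, d))
router_w = rng.normal(size=(d, n_experts))
experts = [(rng.normal(size=(d, d)), rng.normal(size=d)) for _ in range(n_experts)]

y, chosen = moe_layer(x, router_w, experts, k=2)
print(y.shape, chosen.shape)  # (3, 8) (3, 2)
```

Real MoE layers batch this per expert instead of looping per token, and add load-balancing losses so experts are used evenly.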
**Learned top-k routing.** The router selects the top-k best-suited experts for each token, and this assignment is learned during training rather than hand-designed.

**Best open MoE in its class.** For users who want GPT-4-class reasoning locally: Qwen 3.6 fits in 24 GB VRAM at Q4, delivers near-perfect math and strong coding, and runs at useful speeds on a single consumer GPU.

Step-by-step commands for Ollama, llama.cpp, and vLLM. Pick your preferred runtime.
```bash
# 1. Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull and run Qwen 3.6 (default Q4_K_M quantization)
ollama run qwen3.6:35b

# 3. Run a specific quantization
ollama run qwen3.6:35b-a3b-q4_K_M   # ~20 GB VRAM — recommended
ollama run qwen3.6:35b-a3b-q5_K_M   # ~25 GB VRAM — better quality
ollama run qwen3.6:35b-a3b-q2_K     # ~12 GB VRAM — minimum
ollama run qwen3.6:35b-a3b-q8_0     # ~36 GB VRAM — near lossless

# 4. Use the REST API
curl http://localhost:11434/api/generate -d '{"model":"qwen3.6:35b","prompt":"Explain MoE architecture"}'

# 5. OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.6:35b","messages":[{"role":"user","content":"Hello"}]}'
```
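By default, Ollama's `/api/generate` endpoint streams newline-delimited JSON, one object per line carrying a `response` fragment and a final `done` flag. A small helper can reassemble the full reply; it is shown here on canned chunks so the example runs without a live server:

```python
import json

def collect_stream(lines):
    """Join the 'response' fragments from an Ollama NDJSON stream."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Canned example of what the server streams back, line by line
sample = [
    '{"model":"qwen3.6:35b","response":"MoE routes ","done":false}',
    '{"model":"qwen3.6:35b","response":"tokens to experts.","done":true}',
]
print(collect_stream(sample))  # MoE routes tokens to experts.
```

Against a live server, feed the helper `requests.post(url, json=payload, stream=True).iter_lines()` instead of the canned list.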
```bash
# 1. Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# For Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

# 2. Download GGUF model from Hugging Face
# Q4_K_M — best size/quality tradeoff (~20 GB)
wget https://huggingface.co/Qwen/Qwen3.6-35B-A3B-GGUF/resolve/main/qwen3.6-35b-a3b-q4_k_m.gguf

# 3. Run inference
./build/bin/llama-cli \
  -m qwen3.6-35b-a3b-q4_k_m.gguf \
  -n 512 \
  --n-gpu-layers 35 \
  -p "You are a helpful assistant.\n\nUser: Explain quantum computing\nAssistant:"

# 4. Run as local server (OpenAI-compatible)
./build/bin/llama-server \
  -m qwen3.6-35b-a3b-q4_k_m.gguf \
  --n-gpu-layers 35 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 8192

# 5. CPU-only (slower but works on any machine)
./build/bin/llama-cli \
  -m qwen3.6-35b-a3b-q2_k.gguf \
  --n-gpu-layers 0 \
  --threads 16
```
```bash
# 1. Install vLLM (requires CUDA 12.1+, Python 3.9+)
pip install vllm

# 2. Launch Qwen 3.6 with the vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --port 8000

# Multi-GPU (2x GPU for full-precision BF16)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 2 \
  --dtype bfloat16

# 3. Query the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Write a Python fibonacci function"}],
    "max_tokens": 512
  }'

# 4. Use with the OpenAI SDK
pip install openai
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
```python
# 1. Install dependencies: pip install transformers torch accelerate

# 2. Basic inference
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen3.6-35B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the MoE architecture"},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:]))

# 3. With 4-bit quantization (saves VRAM)
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```
| Quantization | VRAM Required | Quality Loss | Speed (RTX 4090) | Best For |
|---|---|---|---|---|
| FP16 / BF16 | ~70 GB | None | ~18 tok/s | Research, production servers |
| Q8_0 | ~36 GB | Negligible | ~28 tok/s | Best local quality |
| Q5_K_M ⭐ Recommended | ~25 GB | Minimal | ~35 tok/s | RTX 3090/4090 single card |
| Q4_K_M ⭐ Most Popular | ~20 GB | Minor | ~42 tok/s | RTX 4090 / 3090 sweet spot |
| Q3_K_M | ~16 GB | Moderate | ~52 tok/s | 16 GB VRAM GPUs |
| Q2_K | ~12 GB | Significant | ~65 tok/s | Minimum viable, 12 GB GPUs |
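The VRAM column above roughly tracks bits-per-weight times parameter count, plus overhead for the KV cache and activations. As a back-of-the-envelope check (the bits-per-weight averages below are approximations, not exact GGUF figures):

```python
# Rough weight-size estimate: total params x bits per weight.
# Bits-per-weight are approximate averages per GGUF scheme;
# actual files mix tensor types.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5, "BF16": 16.0}

def approx_size_gb(total_params: float, quant: str) -> float:
    """Approximate in-VRAM weight size in GB for a quantization scheme."""
    return total_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in ("Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0", "BF16"):
    print(f"{q:7s} ~{approx_size_gb(35e9, q):5.1f} GB")
```

Q4_K_M comes out near 19.7 GB for 35B parameters, consistent with the ~20 GB row in the table.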
Qwen 3.6's efficiency makes it ideal for local deployment across many use cases.
84.2% HumanEval makes Qwen 3.6 a top-tier local coding assistant. Run entirely offline, no API costs.
Near-perfect GSM8K and strong BBH scores make it exceptional for step-by-step math reasoning and tutoring.
128K context window + strong comprehension enables large document analysis, contracts, research papers.
Privacy-first chat assistant for enterprises. All data stays on-premise with no API keys required.
Strong multilingual training from Alibaba's data. Excels at Chinese↔English and other Asian language pairs.
Summarize papers, explain complex topics, generate hypotheses. Strong on scientific reasoning benchmarks.
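Several of the document-heavy use cases above depend on fitting text into the context window. A minimal sketch of overlapping-window chunking, using character counts as a stand-in for real tokenization (the `chunk_text` helper and its budgets are illustrative, not part of any Qwen tooling):

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200):
    """Split text into overlapping windows that fit a context budget.

    Character counts stand in for tokens in this sketch.
    """
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap  # step forward, keeping some overlap
    return chunks

doc = "x" * 5000
parts = chunk_text(doc, max_chars=2000, overlap=200)
print(len(parts))  # 3
```

Summaries of each chunk can then be concatenated and summarized again (map-reduce) to cover documents beyond even a 128K window.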
Common questions about Qwen 3.6-35B-A3B.