Can I Run Qwen 3.6 Locally?

Select your GPU and system specs to find out which quantization you can run and how fast it will be.

Qwen 3.6 Benchmark Comparison

How Qwen 3.6-35B-A3B stacks up against Llama 4 Scout, Gemma 4, Phi-4, and Mistral on standard benchmarks.

| Model | Release | Params (Active) | MMLU | GSM8K | HumanEval | BBH | ARC-C | License |
|---|---|---|---|---|---|---|---|---|
| Qwen 3.6-35B-A3B 👑 | Alibaba · Apr 2026 | 35B (3B active) | 86.4% | 95.1% | 84.2% | 77.3% | 72.1% | Open |
| Llama 4 Scout | Meta · Apr 2025 · MoE | 109B (17B active) | 83.1% | 90.3% | 78.6% | 73.8% | 69.4% | Open |
| Gemma 4 27B | Google · Mar 2026 | 27B (dense) | 82.7% | 88.4% | 76.1% | 71.2% | 70.8% | Open |
| Phi-4 14B | Microsoft · Dec 2024 | 14B (dense) | 78.9% | 85.2% | 73.4% | 68.6% | 65.3% | Open |
| Mistral Small 3.1 | Mistral · Mar 2025 | 24B (dense) | 77.4% | 81.7% | 74.2% | 66.1% | 63.8% | Open |

* Benchmarks from official model cards and community evaluations. Results may vary by prompt format. Qwen 3.6 results from Alibaba official release, April 17 2026.

📐

Math & Reasoning Champion

95.1% on GSM8K — near-perfect school math. Qwen 3.6 leads all open models on numerical reasoning, likely due to dedicated math training data and the MoE routing allowing specialization.

+4.8pp over Llama 4
💻

Strongest Coding Accuracy

84.2% HumanEval pass@1. Qwen 3.6 outpaces Llama 4 Scout on Python generation despite having far fewer active parameters, demonstrating expert routing efficiency.

+5.6pp over Llama 4

Efficiency per Parameter

With only 3B active parameters, Qwen 3.6 delivers SOTA-class results at a fraction of the compute cost of dense models. Inference is 4–6x cheaper than running a full 35B dense model.

~5.7x fewer active params than Llama 4 Scout (3B vs 17B)
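
The compute side of this claim can be sanity-checked with back-of-the-envelope math: a decoder performs roughly 2 FLOPs per active parameter per generated token, so against a hypothetical dense 35B model the pure-FLOPs ratio is about 11.7x. The realized 4–6x speedup is lower because decoding is typically memory-bandwidth-bound and all 35B weights still have to be resident. A sketch (the 2-FLOPs-per-parameter rule of thumb is an approximation, not an official figure):

```python
def flops_per_token(active_params):
    """Rule of thumb: ~2 FLOPs per active parameter per generated token."""
    return 2 * active_params

dense_35b = flops_per_token(35e9)  # hypothetical dense 35B model
qwen_moe = flops_per_token(3e9)    # Qwen 3.6: only ~3B params active per token
ratio = dense_35b / qwen_moe       # ~11.7x less compute per token
```

The gap between the ~11.7x FLOPs ratio and the observed 4–6x is the cost of streaming the full weight set through memory each step.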

What is Qwen 3.6? Architecture Overview

Qwen 3.6-35B-A3B is Alibaba's April 2026 open-source MoE model. Here's what makes it tick.

🧠

Mixture-of-Experts (MoE)

Rather than activating all parameters for each token, Qwen 3.6 uses a learned router to select a small subset of "expert" sub-networks. Only ~3B of the 35B parameters activate per token — enabling near-dense-quality output at a fraction of the compute.

35B total · 3B active
🔢

Model Specifications

Full name: Qwen3.6-35B-A3B. Released April 17, 2026 by Alibaba DAMO Academy. The "A3B" suffix stands for "Active 3 Billion." Context window: up to 128K tokens depending on variant.

128K context
🌍

Training & Data

Trained on a multilingual dataset with a strong emphasis on STEM, code, and reasoning. The model continues Alibaba's Qwen lineage with improved mathematical reasoning from reinforcement learning post-training.

RLHF + Math RL
📦

Open Source Availability

Released under the Qwen license (permissive for commercial use under 100M MAU). Available on Hugging Face as Qwen/Qwen3.6-35B-A3B. GGUF quantized versions available for llama.cpp.

Hugging Face · Qwen license
🔄

Expert Routing

A lightweight router network learns to assign each input token to the best-suited experts. This enables specialization — different experts handle code vs. language vs. math — without increasing inference cost.

Learned top-k routing
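
Top-k routing can be sketched in a few lines of plain Python. This is illustrative only: the expert count, the k value, and the toy scalar "experts" are made up, not Qwen 3.6's actual configuration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, router_logits, experts, k=2):
    """Route one token to its top-k experts and mix their outputs.

    router_logits: one score per expert for this token.
    experts: list of callables standing in for expert sub-networks.
    Only k experts actually run, so compute scales with k, not len(experts).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return sum(probs[i] / norm * experts[i](token) for i in top)

# Toy example: 8 "experts" that just scale their input differently
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.1, 0.4]  # router prefers experts 1 and 3
out = moe_forward(10.0, logits, experts, k=2)
```

The key property: six of the eight experts never execute for this token, which is exactly how a 35B-parameter model gets away with ~3B parameters of work per step.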
🏆

Why Qwen 3.6?

For users who want GPT-4 class reasoning locally: Qwen 3.6 fits in 24 GB VRAM at Q4, delivers near-perfect math, strong coding, and runs at useful speeds on a single consumer GPU.

Best open MoE in its class

Run Qwen 3.6 Locally

Step-by-step commands for Ollama, llama.cpp, and vLLM. Pick your preferred runtime.

bash — Ollama (easiest, recommended for beginners)
# 1. Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull and run Qwen 3.6 (default Q4_K_M quantization)
ollama run qwen3.6:35b

# 3. Run a specific quantization
ollama run qwen3.6:35b-a3b-q4_K_M   # ~20 GB VRAM — recommended
ollama run qwen3.6:35b-a3b-q5_K_M   # ~25 GB VRAM — better quality
ollama run qwen3.6:35b-a3b-q2_K     # ~12 GB VRAM — minimum
ollama run qwen3.6:35b-a3b-q8_0     # ~36 GB VRAM — near lossless

# 4. Use the REST API
curl http://localhost:11434/api/generate -d '{"model":"qwen3.6:35b","prompt":"Explain MoE architecture"}'

# 5. OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.6:35b","messages":[{"role":"user","content":"Hello"}]}'
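
The same REST API is easy to call from Python with only the standard library. A minimal sketch, assuming Ollama is running locally on its default port with the model pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_request(model, prompt, stream=False):
    """JSON body for Ollama's /api/generate (stream=False -> single JSON reply)."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(model, prompt):
    """POST to a locally running Ollama server and return the generated text."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the server running:
#   generate("qwen3.6:35b", "Explain MoE architecture in one sentence")
```

Setting `"stream": false` makes Ollama return one complete JSON object instead of a stream of partial chunks, which keeps the client trivial.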
bash — llama.cpp (best performance tuning)
# 1. Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# For Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

# 2. Download GGUF model from Hugging Face
# Q4_K_M — best size/quality tradeoff (~20 GB)
wget https://huggingface.co/Qwen/Qwen3.6-35B-A3B-GGUF/resolve/main/qwen3.6-35b-a3b-q4_k_m.gguf

# 3. Run inference
./build/bin/llama-cli \
  -m qwen3.6-35b-a3b-q4_k_m.gguf \
  -n 512 \
  --n-gpu-layers 35 \
  -p "You are a helpful assistant.\n\nUser: Explain quantum computing\nAssistant:"

# 4. Run as local server (OpenAI-compatible)
./build/bin/llama-server \
  -m qwen3.6-35b-a3b-q4_k_m.gguf \
  --n-gpu-layers 35 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 8192

# 5. CPU-only (slower but works on any machine)
./build/bin/llama-cli \
  -m qwen3.6-35b-a3b-q2_k.gguf \
  --n-gpu-layers 0 \
  --threads 16
bash — vLLM (best for production / high throughput)
# 1. Install vLLM (requires CUDA 12.1+, Python 3.9+)
pip install vllm

# 2. Launch Qwen 3.6 with vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --port 8000

# Multi-GPU (2x GPU for FP16 full precision)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 2 \
  --dtype bfloat16

# 3. Query the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Write a Python fibonacci function"}],
    "max_tokens": 512
  }'

# 4. Use with the OpenAI SDK (install from bash; the rest runs as Python)
pip install openai

# --- Python from here on ---
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Python — Hugging Face Transformers
# 1. Install dependencies
pip install transformers torch accelerate

# 2. Basic inference
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen3.6-35B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the MoE architecture"}
]

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

# 3. With 4-bit quantization (saves VRAM)
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

Quantization Guide — VRAM Requirements

| Quantization | VRAM Required | Quality Loss | Speed (RTX 4090) | Best For |
|---|---|---|---|---|
| FP16 / BF16 | ~70 GB | None | ~18 tok/s | Research, production servers |
| Q8_0 | ~36 GB | Negligible | ~28 tok/s | Best local quality |
| Q5_K_M ⭐ Recommended | ~25 GB | Minimal | ~35 tok/s | RTX 3090/4090 single card |
| Q4_K_M ⭐ Most Popular | ~20 GB | Minor | ~42 tok/s | RTX 4090 / 3090 sweet spot |
| Q3_K_M | ~16 GB | Moderate | ~52 tok/s | 16 GB VRAM GPUs |
| Q2_K | ~12 GB | Significant | ~65 tok/s | Minimum viable, 12 GB GPUs |
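
These VRAM figures track closely with bits-per-weight times total parameter count. A sketch of that arithmetic — the bits-per-weight values are rough community averages for GGUF k-quants, not official numbers, and real usage adds KV cache and activation overhead on top of the weights:

```python
# Approximate average bits per weight for common GGUF quantization types
# (rough community estimates, not exact)
BITS_PER_WEIGHT = {
    "FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7,
    "Q4_K_M": 4.85, "Q3_K_M": 3.9, "Q2_K": 2.6,
}

def weight_size_gb(total_params, quant):
    """Size of the quantized weights alone; KV cache and activations add more."""
    return total_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{weight_size_gb(35e9, quant):5.1f} GB")
```

For example, Q4_K_M at ~4.85 bits/weight works out to roughly 21 GB of weights for a 35B model, matching the ~20 GB row above before cache overhead.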

What Can You Build with Qwen 3.6?

Qwen 3.6's efficiency makes it ideal for local deployment across many use cases.

💻

AI Coding Assistant

84.2% HumanEval makes Qwen 3.6 a top-tier local coding assistant. Run entirely offline, no API costs.

PythonJavaScriptRust
📐

Math Tutor & Solver

Near-perfect GSM8K and strong BBH scores make it exceptional for step-by-step math reasoning and tutoring.

CalculusStatisticsAlgebra
📄

Document Q&A / RAG

128K context window + strong comprehension enables large document analysis, contracts, research papers.

PDFLegalResearch
🤖

Local Chatbot / Agent

Privacy-first chat assistant for enterprises. All data stays on-premise with no API keys required.

PrivacyOn-premGDPR
🌍

Multilingual Translation

Strong multilingual training from Alibaba's data. Excels at Chinese↔English and other Asian language pairs.

ChineseJapaneseKorean
🔬

STEM Research Assistant

Summarize papers, explain complex topics, generate hypotheses. Strong on scientific reasoning benchmarks.

BiologyPhysicsChemistry

Frequently Asked Questions

Common questions about Qwen 3.6-35B-A3B.

What is Qwen 3.6?
Qwen 3.6 (officially Qwen 3.6-35B-A3B) is a Mixture-of-Experts (MoE) large language model released by Alibaba on April 17, 2026. It has 35 billion total parameters but only activates approximately 3 billion parameters per token, making it extremely efficient for local deployment while delivering near-top-tier benchmark scores.

How many parameters does Qwen 3.6 have?
Qwen 3.6-35B-A3B has 35 billion total parameters. However, due to its Mixture-of-Experts (MoE) architecture, only approximately 3 billion parameters are active during any single forward pass. The "A3B" in the name stands for "Active 3 Billion." This is what makes it feasible to run locally on consumer hardware despite having 35B parameters.

Can I run Qwen 3.6 on a consumer GPU?
Yes! Because only 3B parameters are active at inference time, Qwen 3.6 runs on consumer GPUs. With Q4_K_M quantization you need approximately 20 GB VRAM (e.g., RTX 3090, RTX 4090). With Q2_K quantization it fits in about 12 GB (e.g., RTX 4070). You can use Ollama for the easiest setup, llama.cpp for performance tuning, or vLLM for production deployments.

How does Qwen 3.6 compare to Llama 4 Scout?
Qwen 3.6-35B-A3B outperforms Llama 4 Scout on math (GSM8K: 95.1% vs 90.3%) and coding (HumanEval: 84.2% vs 78.6%) benchmarks. Llama 4 Scout has higher total parameter capacity at 109B (with 17B active), giving it more "room" for knowledge but at higher compute cost. For a single-GPU local setup, Qwen 3.6's 3B active parameters make it significantly more practical.

Is Qwen 3.6 free for commercial use?
Yes. Qwen 3.6-35B-A3B is released as open-source by Alibaba under the Qwen license, which allows commercial use for organizations with fewer than 100 million monthly active users. Model weights are available on Hugging Face at Qwen/Qwen3.6-35B-A3B. GGUF quantized versions for llama.cpp are also available.

How do I run Qwen 3.6 with Ollama?
First install Ollama: curl -fsSL https://ollama.com/install.sh | sh. Then run: ollama run qwen3.6:35b — Ollama will download and start the model automatically. For a specific quantization, use ollama run qwen3.6:35b-a3b-q4_K_M for the best balance of quality and VRAM usage.

What is Qwen 3.6's context window?
Qwen 3.6-35B-A3B supports up to 128,000 tokens of context in its full configuration. In practice, when running locally with quantization, you may need to reduce the context size to fit within your available VRAM. A context of 8,192–32,768 tokens is typical for most local setups.

How fast does Qwen 3.6 run locally?
On an RTX 4090 with Q4_K_M quantization, Qwen 3.6 runs at approximately 40–45 tokens per second — fast enough for real-time conversation. On an RTX 3090, expect 30–38 tok/s. Apple M4 Max (128 GB) achieves roughly 35–50 tok/s due to unified memory bandwidth. CPU-only on a modern 16-core machine achieves 3–8 tok/s.