# QwQ-32B

A high-performance, resource-efficient reasoning model developed by Alibaba's Qwen team.

## Overview of QwQ-32B

QwQ-32B is an open-source reasoning model developed by Alibaba's Qwen team, with approximately 32.5 billion parameters. It is designed to strengthen reasoning, particularly on mathematical and coding tasks, and performs comparably to much larger models such as DeepSeek-R1. The model runs on consumer-grade GPUs and is released under the Apache 2.0 license.

## Key Features of QwQ-32B

QwQ-32B is a causal language model with the following key features:

- **Type**: Causal language model
- **Training Phases**: Pre-training and post-training (including supervised fine-tuning and reinforcement learning)
- **Architecture**: Transformer-based, incorporating RoPE, SwiGLU, RMSNorm, and attention QKV bias
- **Parameter Count**: 32.5 billion parameters in total, 31.0 billion of them non-embedding
- **Layers**: 64
- **Attention Heads (GQA)**: 40 for Q and 8 for KV
- **Context Length**: Up to 131,072 tokens; YaRN is required for prompts exceeding 8,192 tokens
- **Quantized Version**: Q4_K_M, with a file size of approximately 20 GB
- **License**: Apache 2.0

## Performance of QwQ-32B in Specific Tasks

QwQ-32B excels at mathematical and coding tasks, demonstrating performance comparable to larger models like DeepSeek-R1. It is particularly effective on complex problems that require multi-step reasoning, which makes it suitable for academic research, AI development, and practical applications such as chatbots, code generation, and mathematical problem-solving.

## Hardware Requirements for QwQ-32B

QwQ-32B runs on consumer-grade GPUs such as the RTX 3090 and RTX 4090. The quantized version, Q4_K_M, has a file size of approximately 20 GB, making it suitable for users with limited computational resources.

## Usage Guidelines for QwQ-32B

To use QwQ-32B, follow these guidelines:

- **Environment Requirements**: Use the latest version of the transformers library (versions below 4.37.0 may cause errors).
- **Quick Start**: Load the model with AutoModelForCausalLM and AutoTokenizer from "Qwen/QwQ-32B" (a runnable sketch appears at the end of this page).
- **Forced Thought Output**: Ensure the model's response begins with "<think>\n", and set add_generation_prompt=True when calling apply_chat_template.
- **Sampling Parameters**: Recommended settings are Temperature=0.6, TopP=0.95, MinP=0, TopK between 20 and 40, and presence_penalty between 0 and 2.
- **Multi-turn Dialogue**: Use apply_chat_template and keep thought content out of the conversation history (see the multi-turn sketch at the end of this page).
- **Output Format Standardization**: For mathematical problems, reason step by step and box the final answer with \boxed{}. For multiple-choice questions, use JSON format and output only the option letter (e.g., "answer": "C").
- **Long Input Handling**: For prompts exceeding 8,192 tokens, enable YaRN by adding the appropriate configuration to config.json and serving with vLLM (see the configuration sketch at the end of this page).

## Access Points for QwQ-32B

QwQ-32B can be accessed through the following links:

- **Hugging Face**: [QwQ-32B on Hugging Face](https://huggingface.co/Qwen/QwQ-32B)
- **Demo**: [QwQ-32B Demo](https://huggingface.co/spaces/Qwen/QwQ-32B-Demo)
- **Blog**: [QwQ-32B Blog](https://qwenlm.github.io/blog/qwq-32b/)
- **Documentation**: [QwQ Documentation](https://qwen.readthedocs.io/en/latest/)
- **GitHub**: [Qwen2.5 GitHub](https://github.com/QwenLM/Qwen2.5)

### Citation sources:

- [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) - Official URL

Updated: 2025-03-31
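
## Example: Quick Start (sketch)

The following is a minimal sketch of the quick-start, forced-thought, and sampling guidelines above, using the Hugging Face transformers API. The question prompt, top_k=30, and the max_new_tokens value are illustrative choices, not part of the official documentation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # select bf16/fp16 automatically where supported
    device_map="auto",    # spread weights across available GPUs
)

messages = [{"role": "user", "content": 'How many r\'s are in the word "strawberry"?'}]

# Build the prompt with add_generation_prompt=True, as the guidelines recommend.
# The chat template is expected to open the assistant turn with "<think>\n";
# if it does not, append that string to `text` manually.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Recommended sampling settings: Temperature=0.6, TopP=0.95, TopK in 20-40.
output_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=30,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(response)
```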
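
## Example: Multi-turn Dialogue (sketch)

Continuing the sketch above (it reuses `tokenizer` and `model`), this shows one way to keep thought content out of the history, as the guidelines recommend: strip the reasoning block from each reply before appending it. The `strip_thoughts` and `chat` helpers are illustrative names, not part of the library.

```python
import re

def strip_thoughts(text: str) -> str:
    """Remove the <think>...</think> reasoning block from a model reply."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

history = []

def chat(user_msg: str) -> str:
    history.append({"role": "user", "content": user_msg})
    text = tokenizer.apply_chat_template(
        history, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=4096, do_sample=True,
        temperature=0.6, top_p=0.95, top_k=30,
    )
    reply = tokenizer.decode(
        output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    # Store only the final answer in the history, not the reasoning trace.
    history.append({"role": "assistant", "content": strip_thoughts(reply)})
    return reply
```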
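
## Example: Enabling YaRN for Long Inputs (sketch)

For the long-input guideline, the Qwen documentation describes adding a rope_scaling block to the model's config.json. A sketch of that addition, with a factor of 4.0 scaling the 32,768-token native window toward the full 131,072-token context, looks like this (values follow the Qwen docs; verify them against the official documentation linked above):

```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```

Because this static scaling applies to all inputs, including short ones, the Qwen docs advise adding the configuration only when long-context processing is actually needed.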