Run inference with DeepSeek-R1-Distill models in Python
1. Prerequisites: create a virtual environment and install ONNX Runtime GenAI
# Installing onnxruntime-genai, olive, and dependencies for CPU
python -m venv .venv && source .venv/bin/activate
pip install requests numpy --pre onnxruntime-genai olive-ai
# Installing onnxruntime-genai, olive, and dependencies for CUDA GPU
python -m venv .venv && source .venv/bin/activate
pip install requests numpy --pre onnxruntime-genai-cuda "olive-ai[gpu]"
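A quick way to confirm the install worked is to import the package from inside the activated virtual environment and print its version. This is a minimal sketch; the __version__ attribute is assumed to be exposed by current onnxruntime-genai releases.

# Sanity check: run inside the activated virtual environment
import onnxruntime_genai as og

# __version__ is assumed present in current onnxruntime-genai builds
print("onnxruntime-genai version:", og.__version__)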
2. Get the model
Choose your model and convert it to ONNX. Note that many other LLMs will also work, so feel free to experiment with different models.
# Using Olive auto-opt to pull a huggingface model, optimize for CPU, and quantize to INT4 using RTN.
olive auto-opt --model_name_or_path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output_path ./deepseek-r1-distill-qwen-1.5B --device cpu --provider CPUExecutionProvider --precision int4 --use_model_builder --log_level 1
# Using Olive auto-opt to pull a huggingface model, optimize for CUDA GPUs, and quantize to INT4 using RTN.
olive auto-opt --model_name_or_path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output_path ./deepseek-r1-distill-qwen-1.5B --device gpu --provider CUDAExecutionProvider --precision int4 --use_model_builder --log_level 1
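When auto-opt finishes, the optimized model lands in a model subfolder under the output path, which is the directory the chat commands in step 3 point at. The sketch below checks that the export looks complete; the file layout and config keys reflect typical model-builder output and may vary by Olive version.

# Sanity-check the Olive output directory (layout is an assumption based on
# typical model-builder output; adjust if your Olive version differs)
import json
from pathlib import Path

model_dir = Path("deepseek-r1-distill-qwen-1.5B/model")
print("files:", sorted(p.name for p in model_dir.iterdir()))

# genai_config.json is the config ONNX Runtime GenAI loads at inference time
config = json.loads((model_dir / "genai_config.json").read_text())
print("model type:", config["model"]["type"])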
Or download it directly with the Hugging Face CLI
# Download the model directly using the huggingface cli
huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include 'deepseek-r1-distill-qwen-1.5B/*' --local-dir .
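If you prefer to stay in Python, the huggingface_hub library (which the CLI above is built on) can perform the same download via snapshot_download. A minimal sketch:

# Equivalent download using the huggingface_hub Python API
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="onnxruntime/DeepSeek-R1-Distill-ONNX",
    allow_patterns="deepseek-r1-distill-qwen-1.5B/*",
    local_dir=".",
)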
3. Try your model on device!
# CPU Chat inference. If you pulled the model from huggingface, adjust the model directory (-m) accordingly
curl -o model-chat.py https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m deepseek-r1-distill-qwen-1.5B/model -e cpu --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
# On-Device GPU Chat inference. Works on devices with Nvidia GPUs. If you pulled the model from huggingface, adjust the model directory (-m) accordingly
curl -o model-chat.py https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m deepseek-r1-distill-qwen-1.5B/model -e cuda --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
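model-chat.py is a thin wrapper over the ONNX Runtime GenAI Python API, so you can also embed the model in your own code. The core generation loop looks roughly like the sketch below, written against the current onnxruntime-genai API (0.5+, where Generator.append_tokens exists); the prompt hard-codes the same DeepSeek chat template passed via --chat_template above.

# Minimal token-by-token generation with the onnxruntime-genai Python API
import onnxruntime_genai as og

model = og.Model("deepseek-r1-distill-qwen-1.5B/model")  # same path as -m above
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# DeepSeek-R1-Distill chat template, as passed via --chat_template above
prompt = "<|begin▁of▁sentence|><|User|>Why is the sky blue?<|Assistant|>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

# Decode and print tokens as they are produced
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()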