Run SLMs on Snapdragon devices with NPUs

Learn how to run SLMs on Snapdragon devices with ONNX Runtime.

Models

The currently supported models are:

Phi-3.5 mini instruct
Llama 3.2 3B

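Devices with Snapdragon NPUs require models of a specific size and format.

Instructions for generating models in this format can be found in Build models for Snapdragon.

Once you have built or downloaded the model, place the model assets in a known location. These assets consist of: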

genai_config.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
quantizer.onnx
dequantizer.onnx
position-processor.onnx
A set of transformer model binaries
- Qualcomm context binaries (*.bin)
- Context binary metadata (*.json)
- ONNX wrapper models (*.onnx)
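Before running anything, it can help to confirm that the model folder is complete. The following is a minimal sketch of such a check using only the Python standard library. The function name `missing_assets` is hypothetical (it is not part of onnxruntime-genai), and the check for the transformer binaries only assumes that at least one `*.bin` file and at least one additional `*.onnx` wrapper are present.

```python
from pathlib import Path

# File names taken from the asset list above.
REQUIRED_FILES = [
    "genai_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "quantizer.onnx",
    "dequantizer.onnx",
    "position-processor.onnx",
]


def missing_assets(model_dir):
    """Return the required files (or file kinds) that are absent from model_dir."""
    folder = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]

    # Assumption: the transformer binaries show up as at least one *.bin file ...
    if not any(folder.glob("*.bin")):
        missing.append("*.bin")

    # ... plus at least one *.onnx wrapper besides the three named .onnx files.
    named = set(REQUIRED_FILES)
    if not any(p.name not in named for p in folder.glob("*.onnx")):
        missing.append("*.onnx wrapper")

    return missing
```

An empty list from `missing_assets("models/Phi-3.5-mini-instruct")` suggests the folder has the expected layout; it does not validate the contents of the files.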

Python application

If your device has Python installed, you can run a simple question-answering script to query the model.

Install the runtime

pip install onnxruntime-genai

Download the script

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-qa.py -o model-qa.py

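Run the script

The command below assumes the model assets are in a folder called models\Phi-3.5-mini-instruct. Point the -m argument at wherever that folder sits relative to your working directory.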

python .\model-qa.py -e cpu -g -v --system_prompt "You are a helpful assistant. Be brief and concise." --chat_template "<|user|>\n{input} <|end|>\n<|assistant|>" -m ..\..\models\Phi-3.5-mini-instruct
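The --chat_template value contains an {input} placeholder that stands for each question typed by the user. The sketch below illustrates that substitution in isolation. The helper name `apply_chat_template` is hypothetical and is not part of onnxruntime-genai, and the script itself may perform the substitution differently.

```python
# Template text copied from the command above; in this Python literal "\n" is a newline.
CHAT_TEMPLATE = "<|user|>\n{input} <|end|>\n<|assistant|>"


def apply_chat_template(template, user_input):
    """Fill the {input} placeholder of a chat template with the user's text."""
    if "{input}" not in template:
        raise ValueError("chat template must contain an {input} placeholder")
    # str.replace is used instead of str.format so that braces typed by the
    # user are passed through unchanged.
    return template.replace("{input}", user_input)


prompt = apply_chat_template(CHAT_TEMPLATE, "What is an NPU?")
print(prompt)
```

The resulting string is what gets tokenized and sent to the model, so the special tokens in the template need to match the ones the model was trained with.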

A look inside the Python script

The full Python script is published here: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/model-qa.py. The script uses the API in the following standard way:

Load the model
```
model = og.Model(config)
```
This loads the model into memory.

Create the pre-processor and tokenize the system prompt

tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Optional: tokenize the system prompt
system_tokens = tokenizer.encode(system_prompt)

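This creates a tokenizer and a tokenizer stream, which allows tokens to be returned to the user as they are generated.

Interactive input loop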

while True:
    # Read the prompt; stop on an empty line
    prompt = input("Prompt: ")
    if not prompt:
        break
    # Run the generation, streaming the output tokens
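The loop outlined above can be written so that the generation step is passed in as a function, which keeps the input handling easy to test on its own. This is only a sketch: `chat_loop`, `answer`, and `read_line` are hypothetical names, and the real script handles quitting and errors in its own way.

```python
def chat_loop(answer, read_line=input):
    """Read prompts until the user quits, passing each one to `answer`.

    `answer` stands in for the generation step. `read_line` defaults to the
    console but can be replaced, for example in a test.
    Returns the number of prompts that were handled.
    """
    handled = 0
    while True:
        try:
            text = read_line("Prompt (blank line to quit): ")
        except EOFError:
            # End of input (for example Ctrl+Z on Windows) also ends the session.
            break
        if not text.strip():
            break
        answer(text)
        handled += 1
    return handled
```

For a quick manual try-out, `chat_loop(lambda text: print(text.upper()))` echoes each prompt in upper case until a blank line is entered.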

Generation loop

# 1. Pre-process the prompt into tokens
input_tokens = tokenizer.encode(prompt)

# 2. Create parameters and generator (KV cache etc) and process the prompt
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
generator.append_tokens(system_tokens + input_tokens)

# 3. Loop until all output tokens are generated, printing
# out the decoded token
while not generator.is_done():
    generator.generate_next_token()

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)

print()

# Delete the generator to free the captured graph before creating another one
del generator
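The three steps above can be collected into one function that is called once per prompt. The sketch below uses only the calls shown in this section. The function name `stream_answer`, the `write` parameter, and the default `max_length` of 1024 (borrowed from the C++ sample later on this page) are assumptions made for illustration, and the system prompt is left out for brevity.

```python
def stream_answer(model, tokenizer, prompt, search_options=None, write=print):
    """Generate a reply to `prompt`, writing decoded text as each token arrives."""
    # Imported inside the function so that this module can be loaded on a
    # machine where onnxruntime-genai is not installed yet.
    import onnxruntime_genai as og

    # 1. Pre-process the prompt into tokens
    tokenizer_stream = tokenizer.create_stream()
    input_tokens = tokenizer.encode(prompt)

    # 2. Create parameters and generator, then process the prompt
    params = og.GeneratorParams(model)
    params.set_search_options(**(search_options or {"max_length": 1024}))
    generator = og.Generator(model, params)
    generator.append_tokens(input_tokens)

    # 3. Loop until all output tokens are generated, streaming each piece
    pieces = []
    try:
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            piece = tokenizer_stream.decode(new_token)
            pieces.append(piece)
            write(piece, end="", flush=True)
        write()
    finally:
        # Drop the generator before another one is created
        del generator
    return "".join(pieces)
```

Returning the full text in addition to streaming it makes the function usable both interactively and from code that needs the complete answer.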

C++ application

To run the models on a Snapdragon NPU from a C++ application, use the code here: https://github.com/microsoft/onnxruntime-genai/tree/main/examples/c.

Building and running this application requires a Windows PC with a Snapdragon NPU, as well as:

cmake
Visual Studio 2022

Clone the repo

   git clone https://github.com/microsoft/onnxruntime-genai
   cd onnxruntime-genai\examples\c

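Install onnxruntime

A nightly build of onnxruntime is currently required, because QNN support for language models is changing daily.

Download the nightly version of the ONNX Runtime QNN binaries from here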

   mkdir onnxruntime-win-arm64-qnn
   move Microsoft.ML.OnnxRuntime.QNN.1.22.0-dev-20250225-0548-e46c0d8.nupkg onnxruntime-win-arm64-qnn
   cd onnxruntime-win-arm64-qnn
   tar xvzf Microsoft.ML.OnnxRuntime.QNN.1.22.0-dev-20250225-0548-e46c0d8.nupkg
   copy runtimes\win-arm64\native\* ..\lib
   cd ..

Install onnxruntime-genai

   curl -L https://github.com/microsoft/onnxruntime-genai/releases/download/v0.6.0/onnxruntime-genai-0.6.0-win-arm64.zip -o onnxruntime-genai-win-arm64.zip
   tar xvf onnxruntime-genai-win-arm64.zip
   cd onnxruntime-genai-0.6.0-win-arm64
   copy include\* ..\include
   copy lib\* ..\lib
   cd ..

Build the sample

   cmake -A arm64 -S . -B build -DPHI3-QA=ON
   cd build
   cmake --build . --config Release

Run the sample

   cd Release
   .\phi3_qa.exe <path_to_model>

A look inside the C++ sample

The C++ application is published here: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/c/src/phi3_qa.cpp. The application uses the API in the following standard way:

Load the model
```
auto model = OgaModel::Create(*config);
```
This loads the model into memory.

Create the pre-processor

auto tokenizer = OgaTokenizer::Create(*model);
auto tokenizer_stream = OgaTokenizerStream::Create(*tokenizer);

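This creates a tokenizer and a tokenizer stream, which allows tokens to be returned to the user as they are generated.

Interactive input loop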

while (true) {
  // Read prompt
  // Run the generation, streaming the output tokens
}

Generation loop

// 1. Pre-process the prompt into tokens
auto sequences = OgaSequences::Create();
tokenizer->Encode(prompt.c_str(), *sequences);

// 2. Create parameters and generator (KV cache etc) and process the prompt
auto params = OgaGeneratorParams::Create(*model);
params->SetSearchOption("max_length", 1024);
auto generator = OgaGenerator::Create(*model, *params);
generator->AppendTokenSequences(*sequences);

// 3. Loop until all output tokens are generated, printing
// out the decoded token
// (is_first_token and timing are set up earlier in the full sample)
while (!generator->IsDone()) {
  generator->GenerateNextToken();

  // Record when the first token arrives
  if (is_first_token) {
    timing.RecordFirstTokenTimestamp();
    is_first_token = false;
  }

  // The newest token is the last element of sequence 0
  const auto num_tokens = generator->GetSequenceCount(0);
  const auto new_token = generator->GetSequenceData(0)[num_tokens - 1];
  std::cout << tokenizer_stream->Decode(new_token) << std::flush;
}
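The is_first_token check in the loop above records when the first token arrives, which separates prompt processing time from the per-token generation rate. A rough sketch of that bookkeeping is shown below in Python, to match the script earlier on this page. The `TokenTimer` class and its method names are hypothetical; this is not the timing helper used by the C++ sample.

```python
import time


class TokenTimer:
    """Track time to first token and generation speed for one response."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()  # taken when the prompt is submitted
        self._first = None     # timestamp of the first generated token
        self._last = None      # timestamp of the most recent token
        self._count = 0

    def record_token(self):
        """Call once after each generated token."""
        now = self._clock()
        if self._first is None:
            self._first = now
        self._last = now
        self._count += 1

    def time_to_first_token(self):
        """Seconds from start to the first token, or None if no token yet."""
        return None if self._first is None else self._first - self._start

    def tokens_per_second(self):
        """Rate over the tokens after the first one, or None if unavailable."""
        if self._count < 2 or self._last == self._first:
            return None
        return (self._count - 1) / (self._last - self._first)
```

Create the timer just before the prompt is appended to the generator, then call `record_token()` after each generated token. Excluding the first token from the rate keeps prompt processing from skewing the generation speed.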