使用 ONNX Runtime generate() API 运行 Phi-3 语言模型

简介

Phi-3 和 Phi 3.5 ONNX 模型托管在 HuggingFace 上，您可以使用 ONNX Runtime 的 generate() API 运行它们。

目前提供并支持 mini (3.3B) 和 medium (14B) 版本。 mini 和 medium 版本都具有短上下文 (4k) 版本和长上下文 (128k) 版本。长上下文版本可以接受更长的提示并生成更长的输出文本，但会消耗更多内存。

可用模型为

本教程演示了如何下载并运行 Phi 3 模型的短上下文 (4k) mini (3B) 变体。有关其他变体的下载命令，请参阅模型参考。

本教程下载并运行短上下文 (4k) mini (3B) 模型变体。有关其他变体的下载命令，请参阅模型参考。

设置
选择您的平台
使用 DirectML 运行
使用 NVIDIA CUDA 运行
在 CPU 上运行
Phi-3 ONNX 模型参考

设置

安装 git 大型文件系统扩展

HuggingFace 使用 git 进行版本控制。要下载 ONNX 模型，您需要安装 git lfs，如果尚未安装。
- Windows: winget install -e --id GitHub.GitLFS (如果您没有 winget，请从官方来源下载并运行 exe)
- Linux: apt-get install git-lfs
- MacOS: brew install git-lfs
然后运行 git lfs install
安装 HuggingFace CLI
```
pip install huggingface-hub[cli]
```

选择您的平台

您使用的是带 GPU 的 Windows 机器吗？

我不知道 → 查看本指南，了解您的 Windows 机器中是否有 GPU，并确认您的 GPU 支持 DirectML。
是 → 按照DirectML的说明进行操作。
否 → 您有 NVIDIA GPU 吗？
- 我不知道 → 查看本指南，了解您是否有支持 CUDA 的 GPU。
- 是 → 按照NVIDIA CUDA GPU的说明进行操作。
- 否 → 按照CPU的说明进行操作。

注意：根据您的硬件，只需一个包和一个模型。也就是说，只执行以下部分中的一个步骤。

使用 DirectML 运行

下载模型

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .

此命令将模型下载到名为 directml 的文件夹中。

安装 generate() API
```
pip install --pre onnxruntime-genai-directml
```
您现在应该在 pip list 中看到 onnxruntime-genai-directml。

运行模型

使用 phi3-qa.py 运行模型。

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m directml\directml-int4-awq-block-128 -e dml

脚本加载模型后，它会循环询问您的输入，并流式传输模型生成的输出。例如

Input: Tell me a joke about GPUs

Certainly! Here\'s a light-hearted joke about GPUs:

Why did the GPU go to school? Because it wanted to improve its "processing power"!

This joke plays on the double meaning of "processing power," referring both to the computational abilities of a GPU and the idea of a student wanting to improve their academic skills.

使用 NVIDIA CUDA 运行

下载模型

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .

此命令将模型下载到名为 cuda 的文件夹中。

安装 generate() API

pip install --pre onnxruntime-genai-cuda

运行模型

使用 phi3-qa.py 运行模型。

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32  -e cuda

脚本加载模型后，它会循环询问您的输入，并流式传输模型生成的输出。例如

Input: Tell me a joke about creative writing
 
Output:  Why don't writers ever get lost? Because they always follow the plot! 

在 CPU 上运行

下载模型

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .

此命令将模型下载到名为 cpu_and_mobile 的文件夹中

为 CPU 安装 generate() API
```
pip install --pre onnxruntime-genai
```

运行模型

使用 phi3-qa.py 运行模型。

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu

脚本加载模型后，它会循环询问您的输入，并流式传输模型生成的输出。例如

Input: Tell me a joke about generative AI

Output:  Why did the generative AI go to school?

To improve its "creativity" algorithm!

Phi-3 ONNX 模型参考

Phi-3 mini 4k 上下文 CPU

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 mini 4k 上下文 CUDA

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 mini 4k 上下文 DirectML

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .
python phi3-qa.py -m directml\directml-int4-awq-block-128 -e dml

Phi-3 mini 128k 上下文 CPU

huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 mini 128k 上下文 CUDA

huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 mini 128k 上下文 DirectML

huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include directml/* --local-dir .
python phi3-qa.py -m directml\directml-int4-awq-block-128 -e dml

Phi-3 medium 4k 上下文 CPU

git clone https://hugging-face.cn/microsoft/Phi-3-medium-4k-instruct-onnx-cpu
python phi3-qa.py -m Phi-3-medium-4k-instruct-onnx-cpu/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 medium 4k 上下文 CUDA

git clone https://hugging-face.cn/microsoft/Phi-3-medium-4k-instruct-onnx-cuda
python phi3-qa.py -m Phi-3-medium-4k-instruct-onnx-cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 medium 4k 上下文 DirectML

git clone https://hugging-face.cn/microsoft/Phi-3-medium-4k-instruct-onnx-directml
python phi3-qa.py -m Phi-3-medium-4k-instruct-onnx-directml/directml-int4-awq-block-128 -e dml

Phi-3 medium 128k 上下文 CPU

git clone https://hugging-face.cn/microsoft/Phi-3-medium-128k-instruct-onnx-cpu
python phi3-qa.py -m Phi-3-medium-128k-instruct-onnx-cpu/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 medium 128k 上下文 CUDA

git clone https://hugging-face.cn/microsoft/Phi-3-medium-128k-instruct-onnx-cuda
python phi3-qa.py -m Phi-3-medium-128k-instruct-onnx-cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 medium 128k 上下文 DirectML

git clone https://hugging-face.cn/microsoft/Phi-3-medium-128k-instruct-onnx-directml
python phi3-qa.py -m Phi-3-medium-128k-instruct-onnx-directml/directml-int4-awq-block-128 -e dml

Phi-3.5 mini 128k 上下文 CUDA

huggingface-cli download microsoft/Phi-3.5-mini-instruct-onnx --include cuda/cuda-int4-awq-block-128/* --local-dir .
python phi3-qa.py -m cuda/cuda-int4-awq-block-128 -e cuda

Phi-3.5 mini 128k 上下文 CPU

huggingface-cli download microsoft/Phi-3.5-mini-instruct-onnx --include cpu_and_mobile/cpu-int4-awq-block-128-acc-level-4/* --local-dir .
python phi3-qa.py -m cpu_and_mobile/cpu-int4-awq-block-128-acc-level-4 -e cpu