Train, convert and predict with ONNX Runtime

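This example demonstrates an end-to-end scenario. It starts by training a scikit-learn pipeline whose input is not a regular vector but a dictionary { int: float }, because its first step is a DictVectorizer. The pipeline is then converted to ONNX and used for prediction with ONNX Runtime.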

Train a pipeline

The first step is to create a dummy dataset.

import pandas
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(1000, n_targets=1)

X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_dict = pandas.DataFrame(X_train[:, 1:]).T.to_dict().values()
X_test_dict = pandas.DataFrame(X_test[:, 1:]).T.to_dict().values()
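
To make the shape of the input more concrete, here is a minimal pure-Python sketch of what the conversion above is expected to produce: one { int: float } dictionary per row, keyed by column position after the first column is dropped. The helper name `rows_to_dicts` and the sample data are hypothetical and only serve as an illustration.

```python
def rows_to_dicts(rows, skip_first=True):
    """Turn each numeric row into a {column_index: value} dictionary."""
    dicts = []
    for row in rows:
        # Mirror the X[:, 1:] slice above: drop the first column, then renumber from 0.
        kept = row[1:] if skip_first else row
        dicts.append({i: float(v) for i, v in enumerate(kept)})
    return dicts


# Hypothetical two-row dataset with three columns.
sample = [[9.0, 1.5, 2.5], [8.0, 3.5, 4.5]]
sample_dicts = rows_to_dicts(sample)
print(sample_dicts)  # [{0: 1.5, 1: 2.5}, {0: 3.5, 1: 4.5}]
```

Each dictionary is one observation, which is the unit the DictVectorizer step consumes.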

We create a pipeline.

from sklearn.ensemble import GradientBoostingRegressor  # noqa: E402
from sklearn.feature_extraction import DictVectorizer  # noqa: E402
from sklearn.pipeline import make_pipeline  # noqa: E402

pipe = make_pipeline(DictVectorizer(sparse=False), GradientBoostingRegressor())

pipe.fit(X_train_dict, y_train)

Pipeline(steps=[('dictvectorizer', DictVectorizer(sparse=False)),
                ('gradientboostingregressor', GradientBoostingRegressor())])
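
As a rough illustration of the first pipeline step, the sketch below mimics how a dictionary vectorizer with sorted features and dense output might map { int: float } dictionaries to fixed-width rows. It is a simplified stand-in under stated assumptions (numeric values only, missing keys filled with 0.0), not scikit-learn's implementation; the function names `fit_vocabulary` and `transform` are hypothetical.

```python
def fit_vocabulary(dicts):
    """Collect every key seen during fitting, in sorted order."""
    return sorted({key for d in dicts for key in d})


def transform(dicts, vocabulary):
    """Build one dense row per dictionary; absent keys become 0.0,
    and keys outside the vocabulary are ignored."""
    return [[float(d.get(key, 0.0)) for key in vocabulary] for d in dicts]


# Hypothetical training dictionaries with different key sets.
train = [{0: 1.5, 2: 3.0}, {1: 2.0}]
vocab = fit_vocabulary(train)
dense = transform(train, vocab)
print(vocab)  # [0, 1, 2]
print(dense)  # [[1.5, 0.0, 3.0], [0.0, 2.0, 0.0]]
```

The regressor that follows in the pipeline only ever sees these dense rows, never the dictionaries themselves.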

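We compute the predictions on the test set and report the R² score. This is a regression problem, so a confusion matrix does not apply.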

from sklearn.metrics import r2_score  # noqa: E402

pred = pipe.predict(X_test_dict)
print(r2_score(y_test, pred))

0.8734108801502843
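
For reference, the R² score reported above is the coefficient of determination, 1 - SS_res / SS_tot. The following is a minimal pure-Python sketch of that formula for a single target; the function name `r2` and the sample values are hypothetical, and it assumes the true values are not all identical.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviation of the targets from their mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions.
score = r2([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(round(score, 2))  # 0.98
```

A score of 1.0 means the predictions match the targets exactly, while 0.0 is what a constant prediction equal to the mean would get.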

Conversion to ONNX format

We use the sklearn-onnx module to convert the model into ONNX format.

from skl2onnx import convert_sklearn  # noqa: E402
from skl2onnx.common.data_types import DictionaryType, FloatTensorType, Int64TensorType  # noqa: E402

# The pipeline input is a single dictionary with int64 keys and float values.
initial_type = [("float_input", DictionaryType(Int64TensorType([1]), FloatTensorType([])))]
onx = convert_sklearn(pipe, initial_types=initial_type, target_opset=17)
with open("pipeline_vectorize.onnx", "wb") as f:
    f.write(onx.SerializeToString())

We load the model with ONNX Runtime and look at its input and output.

import onnxruntime as rt  # noqa: E402
from onnxruntime.capi.onnxruntime_pybind11_state import InvalidArgument  # noqa: E402

sess = rt.InferenceSession("pipeline_vectorize.onnx", providers=rt.get_available_providers())

inp, out = sess.get_inputs()[0], sess.get_outputs()[0]
print(f"input name='{inp.name}' and shape={inp.shape} and type={inp.type}")
print(f"output name='{out.name}' and shape={out.shape} and type={out.type}")

input name='float_input' and shape=[] and type=map(int64,tensor(float))
output name='variable' and shape=[None, 1] and type=tensor(float)

We compute the predictions. We could try to do this in a single call:

try:
    sess.run([out.name], {inp.name: X_test_dict})[0]
except (RuntimeError, InvalidArgument) as e:
    print(e)

[ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: ((seq(map(int64,tensor(float))))) , expected: ((map(int64,tensor(float))))

But it fails because, with a DictVectorizer, ONNX Runtime expects one observation at a time.

pred_onx = [sess.run([out.name], {inp.name: row})[0][0, 0] for row in X_test_dict]

We compare them with the predictions of the original model.

print(r2_score(pred, pred_onx))

0.9999999999999448

The two are very close. ONNX Runtime computes with single-precision floats rather than doubles, which explains the small discrepancy.
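
The size of that discrepancy can be illustrated without any model at all. The sketch below rounds a double-precision value to single precision using only the standard library and measures the relative error; the helper name `to_float32` is hypothetical, and the snippet only shows the rounding effect, not how ONNX Runtime evaluates the pipeline.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1
rounded = to_float32(value)
relative_error = abs(rounded - value) / abs(value)

print(rounded == value)           # False
print(relative_error < 2 ** -23)  # True
```

A relative error of this magnitude on each intermediate value is enough to move the R² between the two sets of predictions slightly away from 1.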
