Introduction
This test explores the inference speed of locally deployed ONNX models and simulates request handling under high-concurrency production workloads.
Embedding model used: Alibaba-NLP/gte-multilingual-base
Service endpoint tested: an embedding service deployed with TEI (Hugging Face Text Embeddings Inference).
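For context, here is a minimal sketch of how such a TEI endpoint is typically queried over HTTP. The host, port, and input text below are placeholders, not the exact setup used in these tests:

```python
import requests

# Hypothetical local TEI endpoint; adjust host/port to the actual deployment.
TEI_URL = "http://localhost:8080/embed"

resp = requests.post(
    TEI_URL,
    json={"inputs": "An example sentence to embed."},
    timeout=5,
)
resp.raise_for_status()

# TEI returns a list of embedding vectors, one per input.
embedding = resp.json()[0]
print(f"dimensions: {len(embedding)}")
```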
ONNX Benchmark (CPU)
When running ONNX inference on CPU, the per-request latency (ms) is shown below.
Notes:
- To reduce long-tail noise, total request volume for CPU tests is 10,000.
- Table values are P99/P75/P50 latencies under 4/8/16/32 CPU-core limits. For example, with 4 cores, P99 is 188.04ms.
| CPU cores | 4 | 8 | 16 | 32 |
|---|---|---|---|---|
| P99 (ms) | 188.04 | 93.17 | 38.9 | 31.8 |
| P75 (ms) | 106.71 | 84.01 | 29.85 | 25.76 |
| P50 (ms) | 104.33 | 75.29 | 26.38 | 21.3 |
Key conclusions:
- Inference gets faster as physical core count increases.
- The gap between 16 and 32 cores is small because the test machine has only 16 physical cores: even with 32 logical cores, work is still scheduled onto the same 16 physical cores. For compute-heavy inference, going from 16 to 32 logical cores adds no real compute resources, so hyper-threading gains are limited; in some mixed-workload situations the extra logical cores can still improve scheduling slightly.
- ONNX-optimized models show excellent tokenization speed: around 200–400 µs per request.
- ONNX Runtime also has efficient scheduling: around 400–600 µs of dispatch overhead per request.
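For reference, here is a rough sketch of how these per-request numbers could be measured with ONNX Runtime under a fixed CPU-thread budget. The model path, thread count, and request count are placeholders for illustration, not the exact harness used in this test:

```python
import time
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Assumed paths/settings for illustration; the real test used
# Alibaba-NLP/gte-multilingual-base exported to ONNX.
MODEL_PATH = "gte-multilingual-base/model.onnx"
NUM_THREADS = 4          # mirrors the 4/8/16/32-core limits in the table
N_REQUESTS = 1000

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-multilingual-base")

opts = ort.SessionOptions()
opts.intra_op_num_threads = NUM_THREADS   # cap compute threads per inference
session = ort.InferenceSession(MODEL_PATH, sess_options=opts,
                               providers=["CPUExecutionProvider"])
input_names = {i.name for i in session.get_inputs()}

tok_times, infer_times = [], []
for _ in range(N_REQUESTS):
    t0 = time.perf_counter()
    enc = tokenizer("An example sentence to embed.", return_tensors="np")
    t1 = time.perf_counter()
    feeds = {k: v for k, v in enc.items() if k in input_names}
    session.run(None, feeds)
    t2 = time.perf_counter()
    tok_times.append((t1 - t0) * 1e6)      # tokenization, µs
    infer_times.append((t2 - t1) * 1e3)    # inference, ms

print("tokenization µs  P50/P75/P99:",
      np.percentile(tok_times, [50, 75, 99]).round(1))
print("inference    ms  P50/P75/P99:",
      np.percentile(infer_times, [50, 75, 99]).round(2))
```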
Concurrency Test Data
| Concurrency | 2 | 3 | 6 | 10 |
|---|---|---|---|---|
| P99 (ms) | 60 | 86 | 158 | 213 |
| P50 (ms) | 49 | 73 | 113 | 162 |
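For reference, a minimal sketch of a fixed-concurrency load generator of this kind (the endpoint URL, payload, and request count are placeholders):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import requests

# Hypothetical endpoint and load parameters for illustration.
TEI_URL = "http://localhost:8080/embed"
CONCURRENCY = 6
N_REQUESTS = 10_000

def one_request(_):
    t0 = time.perf_counter()
    r = requests.post(TEI_URL, json={"inputs": "An example sentence."}, timeout=10)
    r.raise_for_status()
    return (time.perf_counter() - t0) * 1e3  # latency in ms

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(N_REQUESTS)))

print("P50 / P99 (ms):", np.percentile(latencies, [50, 99]).round(1))
```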
Under concurrent load, single-request latency increases sharply as concurrency rises. Based on logs, likely causes are:
- Scheduling delay grows significantly under concurrency, as seen in the service logs:
  `total_time="95.838661ms" tokenization_time="250.414µs" queue_time="63.300541ms" inference_time="32.212502ms"`
- Inference itself remains stable around 20–40 ms, indicating the main bottleneck is scheduling and queueing rather than pure compute.
- By queueing theory, as utilization approaches 100%, waiting time grows without bound (for a simple M/M/1 queue, mean wait scales with 1/(1 − ρ)), matching the observed explosive growth.
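As a rough illustration of that effect, assuming a simple M/M/1 queue with a ~30 ms service time (a simplification; the real service batches requests):

```python
# Mean queueing delay for an M/M/1 queue: Wq = rho / (mu * (1 - rho)),
# where mu is the service rate and rho the utilization.
SERVICE_TIME_MS = 30.0            # assumed ~30 ms per inference
mu = 1.0 / SERVICE_TIME_MS        # service rate in requests per ms

for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    wq = rho / (mu * (1.0 - rho))
    print(f"utilization {rho:4.2f} -> mean queue wait {wq:7.1f} ms")
```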
GPU Benchmark
Because CPU inference degrades under high concurrency, we tested GPU inference for these reasons:
- GPUs have a far larger number of lightweight cores and much higher parallel throughput.
- GPUs can process multiple inference tasks concurrently, improving total capacity.
- Modern GPUs include deep-learning-specific optimizations that accelerate inference.
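As an aside, if the ONNX model is run directly through ONNX Runtime rather than behind the service, moving inference to the GPU is mostly a matter of selecting the CUDA execution provider. A minimal sketch, assuming the onnxruntime-gpu build and a CUDA-capable device (the model path is a placeholder):

```python
import onnxruntime as ort

# Assumed model path; requires the onnxruntime-gpu package.
MODEL_PATH = "gte-multilingual-base/model.onnx"

session = ort.InferenceSession(
    MODEL_PATH,
    # Falls back to CPU if no CUDA device is available.
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("active providers:", session.get_providers())
```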
For the GPU concurrency tests, we increased the total request count from 10k to 1 million. The reason: in early tests, the GPU handled concurrency well enough that a small request volume did not reflect behavior under sustained high load.
Specifically, with only 10k requests, percentile spread was very large. At concurrency 512, P99 reached 483.93ms, while P90 was only 90.36ms. This suggests a small number of high-latency outliers disproportionately affected results.
Possible causes:
- GPU resources were underutilized at low total volume.
- Batching strategy increased wait time for some requests.
- Temporary bottlenecks in memory transfer or I/O.
- OS/GPU-driver-level scheduling jitter.
To obtain more representative results, we scaled to 1 million requests. This better simulates real high-load production, provides more stable statistics, and helps distinguish random outliers from systematic behavior.
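Part of the benefit is purely statistical: tail percentiles estimated from small samples fluctuate far more than those from large samples. A synthetic illustration (the distribution and numbers below are made up, not taken from these tests):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heavy-tailed latency distribution, purely illustrative (~tens of ms).
def sample_latencies(n):
    return rng.lognormal(mean=3.7, sigma=0.5, size=n)

for n in (10_000, 1_000_000):
    # Repeat the "experiment" several times and compare the P99 estimates:
    p99s = [np.percentile(sample_latencies(n), 99) for _ in range(5)]
    print(f"n={n:>9}: P99 estimates across runs -> {np.round(p99s, 1)}")
```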
1M-Request Test
| Concurrency | 128 | 256 | 384 | 512 |
|---|---|---|---|---|
| P99 (ms) | 44 | 61 | 79 | 96 |
| P95 (ms) | 38 | 52 | 70 | 85 |
| P90 (ms) | 36 | 49 | 65 | 79 |
| P75 (ms) | 33 | 44 | 59 | 71 |
| P50 (ms) | 30 | 40 | 52 | 64 |
After scaling to 1 million requests, metrics became much more stable. As concurrency increased from 128 to 512, percentile latencies rose roughly linearly. Importantly, P99 and P50 remained relatively close (e.g., 96ms vs 64ms at 512), indicating good consistency under load.
Compared with the 10k test, latency distribution became more even. At concurrency 512, the P99–P90 gap narrowed substantially (96ms vs 79ms), likely due to better GPU utilization and more effective batching at scale.
Although latency still grows with concurrency, growth is relatively controlled, suggesting the system may not yet be saturated at concurrency 512. That leaves room for further concurrency scaling.
Summary
Based on these tests:
- In low-concurrency or test scenarios, ONNX models are usually sufficient. But ONNX export may introduce some accuracy loss, so inference outputs can diverge from the original model.
- For production, GPU inference is recommended to achieve higher concurrency and better availability.