(Before writing this, I realized it has been exactly two months since my first Rust blog post, which is also my total Rust learning time so far. A lot of feelings there.)
Preface
What is TEI?
TEI (Text Embeddings Inference) can be understood as a high-speed text vectorization inference solution. See the official GitHub repo:
GitHub repository: huggingface/text-embeddings-inference
Why use TEI as the embedding service in production?
Two core reasons: speed and cost efficiency. Mature commercial embedding APIs already exist, both domestic and overseas, but we did not adopt them, mainly for the following reasons:
- API latency at T95 is often around 700 ms (overseas services can reach 1-2 s), which adds roughly 0.7 s to the cloud request chain. For real-time online scenarios where users expect an immediate response, this is often unacceptable.
- Current text embedding models do not differ dramatically. In our business scenarios, tested with synthetic data, commercial APIs (for example text-embedding-3) can even underperform open-source models. For model selection, refer to the MTEB leaderboard.
WARNING
MTEB is a benchmark for evaluating embedding models. Avoid over-relying on the Average score. MTEB covers many downstream tasks, so models often have uneven strengths; a model ranked #1 overall may still perform poorly in your specific scenario.
NOTE
Quick notes on MTEB metrics:
- CLS (Classification): performance on classification tasks where embeddings are used as features (for example sentiment/topic classification).
- Clustering: quality of embedding-based clustering (for example ARI, Silhouette).
- Pair_CLS (Pair Classification): classify relationships between text pairs (for example same topic/sentiment).
- Reranking: performance when reranking candidates by relevance (common in search/recommendation).
- Retrieval: performance in search/retrieval tasks (precision, recall, etc.).
- STS (Semantic Textual Similarity): ability to score semantic similarity between two texts (often via cosine similarity vs human annotations).
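As a quick illustration of the STS setup, similarity between two embeddings is typically the cosine of the angle between them. A minimal sketch:

```python
import math

def cosine_similarity(a, b):
    # dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# vectors pointing the same way score 1.0; orthogonal vectors score 0.0
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])  # 1.0
orth = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # 0.0
```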
After deciding to self-host text embedding models, TEI provides these advantages:
- Direct deployment of Hugging Face embedding models: makes model switching easy following MTEB or internal benchmark results.
- Built-in gRPC interface: embedding service is usually a standalone microservice, so TEI is near plug-and-play.
- GPU/CPU inference support: for ONNX base models, CPU inference is supported (CPU is not always slower than GPU).
- Very fast inference: embedding request latency T99 can be around 80ms (actual speed depends on your selected base model).
Why customize TEI?
The mainstream embedding pattern maps text into a semantic vector space via the model's output and matches by vector similarity. This is dense embedding.
For most scenarios, dense embedding is good enough. But dense vectors may dilute explicit keyword signals in the final representation, which creates demand for sparse embeddings/weights. Sparse approaches include TF-IDF, BM25, etc. (see the previous blog: Dive into Embedding). The core idea: identify the key terms that carry the main semantic signal.
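As a toy illustration of that idea, plain term frequency already assigns each term a weight (real sparse schemes like TF-IDF, BM25, or learned weights refine this considerably):

```python
from collections import Counter

def term_frequency_weights(tokens):
    # weight each term by its share of the document's tokens;
    # TF-IDF/BM25 would further discount terms common across documents
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

weights = term_frequency_weights(["rust", "embedding", "rust", "tei"])
# "rust" appears twice out of four tokens -> weight 0.5
```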
Clearly, neither dense nor sparse alone is a silver bullet. The natural direction is to combine both. Therefore, TEI’s dense-only output was not enough for our needs.
Implementation Path
Hybrid Score
Hybrid Score enhances dense embedding usage through two parts:
- similarity score from dense embeddings;
- sparse weight score from sparse signals.
Below are simplified code snippets (this part assumes some model-architecture familiarity; feel free to skip if not needed):
Inference part
```python
def _encode(
    self,
    texts: List[str] = None,
    ...
    return_dense: bool = True,
    return_sparse: bool = False,
):
    # tokenize texts and return PyTorch tensors
    text_input = self.tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt",
        max_length=max_length,
    )
    # tokenizer output lives on CPU by default; move it to the model device
    text_input = {k: v.to(self.model.device) for k, v in text_input.items()}
    # run the text embedding model
    model_out = self.model(**text_input, return_dict=True)
    output = {}
    if return_dense:
        # dense vector: the [CLS] position of the last hidden state,
        # truncated to the target dimension
        dense_vecs = model_out.last_hidden_state[:, 0, :dimension]
        ...
        output["dense_embeddings"] = dense_vecs
    if return_sparse:
        # relu zeroes out logit values below 0
        token_weights = torch.relu(model_out.logits).squeeze(-1)
        token_weights = list(
            map(
                self._process_token_weights,
                token_weights.detach().cpu().numpy().tolist(),
                text_input["input_ids"].cpu().numpy().tolist(),
            )
        )
        # per-token weights keyed by decoded token
        output["token_weights"] = token_weights
    return output
```
Sparse score calculation
The dense part is straightforward, so we focus on sparse scoring:
```python
def _compute_sparse_scores(self, embs1, embs2):
    # repeated keywords contribute to the sparse score;
    # higher overlap usually means stronger shared key semantics
    scores = 0
    for token, weight in embs1.items():
        if token in embs2:
            scores += weight * embs2[token]
    return scores
```
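For intuition, the same logic can be run standalone on two toy token-weight dicts; only tokens present in both texts contribute:

```python
def compute_sparse_scores(weights1, weights2):
    # sum the products of weights for tokens shared by both texts
    return sum(w * weights2[tok] for tok, w in weights1.items() if tok in weights2)

emb1 = {"rust": 0.8, "embedding": 0.5, "tei": 0.2}
emb2 = {"rust": 0.6, "onnx": 0.4}
score = compute_sparse_scores(emb1, emb2)
# only "rust" is shared: 0.8 * 0.6 = 0.48
```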
Processing token weights:
```python
def _process_token_weights(self, token_weights: np.ndarray, input_ids: list):
    result = defaultdict(int)
    unused_tokens = set(...)
    # token_weights are logits-like outputs indicating each token's
    # importance in the sentence; filter special tokens and weights <= 0,
    # keeping only tokens with a meaningful contribution
    for w, idx in zip(token_weights, input_ids):
        if idx not in unused_tokens and w > 0:
            token = self.tokenizer.decode([int(idx)])
            if w > result[token]:
                result[token] = w
    return result
```
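The same filtering can be sketched standalone, assuming token ids have already been decoded to strings (the real code decodes via the tokenizer, and the special-token set comes from its vocabulary):

```python
from collections import defaultdict

# illustrative special tokens; the real set comes from the tokenizer
UNUSED_TOKENS = {"[CLS]", "[SEP]", "[PAD]", "[UNK]"}

def process_token_weights(weights, tokens):
    # keep the maximum weight per surviving token;
    # drop special tokens and non-positive weights
    result = defaultdict(float)
    for w, tok in zip(weights, tokens):
        if tok not in UNUSED_TOKENS and w > 0 and w > result[tok]:
            result[tok] = w
    return dict(result)

out = process_token_weights(
    [0.9, 0.7, -0.2, 0.4, 0.1],
    ["[CLS]", "rust", "the", "rust", "[SEP]"],
)
# specials and the negative-weight token are dropped;
# "rust" appears twice (0.7 and 0.4) and keeps its max, 0.7
```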
Hybrid score calculation
```python
# weighted dense score + weighted sparse score
scores = (
    self._compute_dense_scores(
        embs1["dense_embeddings"], embs2["dense_embeddings"]
    )
    * dense_weight
    + self._compute_sparse_scores(embs1["token_weights"], embs2["token_weights"])
    * sparse_weight
)
```
NOTE
In one sentence: while computing semantic similarity, also score contribution from shared keywords (same tokens).
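Putting the two signals together with toy numbers (the weights here are illustrative, not tuned values):

```python
def hybrid_score(dense_score, sparse_score, dense_weight=0.7, sparse_weight=0.3):
    # weighted sum of the dense similarity and the sparse keyword score;
    # the weights are tuned per business scenario
    return dense_weight * dense_score + sparse_weight * sparse_score

score = hybrid_score(dense_score=0.82, sparse_score=0.48)
# 0.7 * 0.82 + 0.3 * 0.48 = 0.574 + 0.144 = 0.718
```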
Customization Work
So far, the sparse-score computation is clear. The next question: how do we customize TEI in practice? In addition, due to cost constraints, the base model deployed with TEI should be in ONNX format so that inference can run CPU-accelerated.
Before real implementation, we needed answers to:
- Does the base model's output structure expose logits?
- After ONNX conversion, can the output structure still preserve the original outputs?
- Can _process_token_weights be ported to Rust?
- What is the Rust ecosystem's support for ONNX models?
ONNX format
ONNX model support was one of the core issues. Since TEI CPU deployment requires ONNX models, all later computation parameters come from ONNX model outputs.
ONNX can be understood as a model standard: ONNX-format models can be accelerated by ONNX Runtime (typically 5x-6x speedup on CPU; around 1.3x-1.7x on GPU).
Conclusions:
- ONNX models are base models converted into the ONNX standard format.
- ONNX outputs can match the pre-conversion outputs: if the base model outputs logits, the exported ONNX model can expose logits as well.
- The Rust ecosystem provides ort for running ONNX models.
NOTE
ONNX format can be compared to OCI as a container standard: images that satisfy OCI can be managed by Docker or other container tools.
Important detail: default ONNX export may omit logits in outputs (because many use cases do not need logits). You must explicitly set output names during export:
```python
torch.onnx.export(
    ...
    input_names=["input_ids", "attention_mask"],
    output_names=["logits", "last_hidden_state"],
    ...
)
```
Getting logits from ONNX outputs:
Python (outputs are positional, in the order given by output_names):

```python
onnx_outputs[0]
```

Rust (outputs can be fetched by name):

```rust
onnx_outputs.get("logits")
```
Once logits are available, sparse-score processing can follow the Python logic above almost directly.
Customization implementation
The actual coding was relatively straightforward. With the Rust compiler's guarantees, code that passes compilation usually means the structural modifications are correctly integrated. The detailed implementation follows the Python path shown above, so it is omitted here.