Behind the current LLM wave, the required compute and infrastructure are often beyond what small and medium-sized companies can afford. For these teams, practical LLM adoption in B2B scenarios is still relatively narrow, mostly replacing traditional NLP solutions such as named entity recognition (NER) and sentiment analysis.
For these classic tasks, large models provide a major benefit: pretraining compresses a huge amount of knowledge. This partly alleviates long-standing AI challenges such as entity disambiguation and catastrophic forgetting, while also lowering the barrier for using, fine-tuning, and training AI systems.
However, local deployment is expensive and inference is often slow, so most companies opt for third-party APIs instead: they are cost-effective, avoid the maintenance burden, and enable fast product delivery.
That said, API calls have a clear downside: latency. In internet product development, there is a common saying that 80% of runtime is spent waiting for network calls. It is exaggerated, but directionally true. For example, one domestic cloud LLM API call test looked like this:
```
0.01s user 0.02s system 3% cpu 0.684 total
```

- Actual client-side request/response work: 0.03s
- Network overhead + third-party model inference: 0.654s
For online products, any high-latency feature can effectively be treated as unavailable or low-availability, because users' willingness to wait drops as latency increases. Even so, integrating LLMs still brings significant gains in development cost (time) and in task-completion quality. This is why the Embedding component inside the LLM stack has become increasingly important for developers.
As LLMs evolve, embedding quality keeps improving. Modern embedding models are no longer simple combinations of word vectors (word2vec); they are far more usable representations enhanced by attention mechanisms.
Statistical Language Models
Let’s revisit early next-token prediction. Suppose the current text is "Today's weather is", and the candidate words sunny, rainy, and humid receive scores of 0.8, 0.7, and 0.5 (unnormalized here; true probabilities would sum to 1). To maximize confidence, we greedily pick sunny, the highest-scoring candidate.
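Greedy selection of the highest-scoring candidate can be sketched in a few lines (the words and scores are illustrative, not taken from a real model):

```python
# Toy next-token scores for the prefix "Today's weather is".
# These values are invented for illustration.
scores = {"sunny": 0.8, "rainy": 0.7, "humid": 0.5}

# Greedy decoding: always take the highest-scoring candidate.
next_token = max(scores, key=scores.get)
print(next_token)  # sunny
```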
Where do these probabilities come from? Traditionally, they are estimated statistically from large corpora.
NOTE
- Term Frequency (TF): the simplest method, counting word occurrences in a document or corpus: P(word) = count(word) / total_words.
- TF-IDF (Term Frequency-Inverse Document Frequency): combines frequency with global distribution across documents. Mostly used for retrieval/relevance rather than direct probability modeling.
- Bag of Words: counts word frequency without regard to word order.
- Maximum Likelihood Estimation (MLE): estimates probability as frequency divided by total word count: P(word) = count(word) / N, where N is the total number of words in the corpus.
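The MLE estimate above can be computed directly from counts. A minimal sketch over a toy corpus (the corpus text is invented for illustration; real estimates use far larger text collections):

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = "the cat sat on the mat the cat slept".split()

counts = Counter(corpus)
N = len(corpus)

def p(word: str) -> float:
    """MLE unigram probability: count(word) / N."""
    return counts[word] / N

print(p("the"))  # 3 occurrences out of 9 words ≈ 0.333
```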
But this approach has obvious issues:
- The same input tends to produce fixed outputs.
- Next-token selection is weakly grounded in true semantic context.
So the goal becomes context-aware next-token prediction. A natural idea is conditional probability: token probability at position i conditioned on the previous i-1 tokens.
This introduces another problem: how do we estimate this conditional distribution efficiently? As sentence length grows, direct estimation becomes intractable (the classic n-gram bottleneck).
The practical workaround was the N-order Markov assumption: the next token depends only on the previous N words. This keeps probability estimation computable regardless of sentence length.
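Under this assumption, a first-order (bigram) model can be estimated directly from pair counts. A minimal sketch with a toy corpus (invented for illustration):

```python
from collections import Counter

# First-order Markov assumption: P(w_i | w_1..w_{i-1}) ≈ P(w_i | w_{i-1}).
tokens = "the cat sat on the mat and the cat slept".split()

contexts = Counter(tokens[:-1])             # counts of each previous word
bigrams = Counter(zip(tokens, tokens[1:]))  # counts of (prev, next) pairs

def p_next(prev: str, word: str) -> float:
    """MLE bigram estimate: count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / contexts[prev]

print(p_next("the", "cat"))  # "the" appears 3 times, followed by "cat" twice: 2/3
```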
NOTE
Still, N-order Markov models have limits: they only capture finite context and fail on long-range dependencies. Researchers then proposed more complex models such as RNNs and LSTMs, which can theoretically retain longer context.
However, RNN/LSTM training often suffers from vanishing/exploding gradients on long sequences. In 2017, the Transformer architecture became a major breakthrough. With self-attention, it can process tokens in parallel and model long-distance dependencies effectively. It became the foundation of modern large language models like BERT and GPT.

Modern embedding techniques (for example, contextual embeddings in BERT) do more than map words to vectors. They dynamically adjust representations by context, enabling disambiguation and richer semantic understanding. LLMs like GPT-3 further scaled this with massive pretraining data.

Finally, regarding "same input always gives fixed output": modern models introduce controllable randomness (such as temperature), which enables more diverse and natural generation.
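The temperature mechanism can be sketched as a softmax over model scores (the logits here are invented; this is not any specific model's API):

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
probs = softmax_with_temperature(logits, temperature=0.8)

# Sampling (instead of argmax) yields different outputs across calls,
# which is what breaks the "same input, fixed output" behavior.
token = random.choices(["sunny", "rainy", "humid"], weights=probs)[0]
```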
word2vec
Why did embedding appear in the first place? In short, raw text cannot be consumed directly by models. For example, "My dog is so cute" cannot be used as structured numeric input even after tokenization. So we need to convert text into numeric representations.
But embedding is not just “text to numbers.” The key is semantic structure: semantically similar words should map to nearby positions in vector space.
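That "nearby positions" intuition is usually measured with cosine similarity. A hand-built sketch (the 3-dimensional vectors are invented; real word2vec embeddings are learned and have hundreds of dimensions):

```python
import math

# Toy vectors chosen by hand so that "dog" and "cat" point in
# similar directions while "car" points elsewhere.
vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Semantically similar words should land closer in vector space.
print(cosine(vectors["dog"], vectors["cat"]) > cosine(vectors["dog"], vectors["car"]))  # True
```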
Embedding
A simplified internal flow of embedding:
- The user inputs a query. The tokenizer splits it into a token array (including rules for splitting out-of-vocabulary tokens).
- Tokens are mapped to vectors using the model vocabulary, then arranged in original order to form an embedding matrix.
- The embedding matrix is fed into subsequent model layers (for example, attention blocks), then transformed by fully connected or related layers into a fixed-dimensional vector representation.
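The first two steps of this flow can be sketched end to end. Everything here (the tiny vocabulary, the whitespace tokenizer, the random lookup table) is a stand-in for learned model components, not how any real model is implemented:

```python
import random

random.seed(0)

# Hypothetical vocabulary; real models learn vocabularies of tens of
# thousands of (sub)words. Index 0 is reserved for unknown tokens.
vocab = {"<unk>": 0, "my": 1, "dog": 2, "is": 3, "so": 4, "cute": 5}
dim = 4  # toy embedding dimension
table = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def tokenize(text):
    # Naive whitespace tokenizer; real tokenizers (e.g. BPE) also split
    # out-of-vocabulary words into smaller subword units.
    return text.lower().split()

def embed(text):
    """Map tokens to vocabulary ids, then look up one vector per token."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]
    return [table[i] for i in ids]  # the "embedding matrix", in token order

matrix = embed("My dog is so cute")
print(len(matrix), len(matrix[0]))  # 5 tokens, 4 dimensions each
```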