Link: Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown
Jina Reader API v1
- Use a headless Chrome browser to fetch page source.
- Use Mozilla Readability to strip headers/footers/nav and keep the main content.
- Use regex plus the Turndown library to convert the extracted HTML to Markdown.
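The shape of that pipeline can be sketched with a toy, stdlib-only version. This is illustrative only: the real stack (headless Chrome + Mozilla Readability + Turndown) is JavaScript, and these regexes are my placeholders, not Jina's actual rules.

```python
import re

# Crude stand-in for the Readability step: drop obviously noisy containers.
NOISE = re.compile(r"<(script|style|nav|header|footer)\b.*?</\1>", re.S | re.I)

# Crude stand-in for the Turndown step: a small HTML -> Markdown rule table.
RULES = [
    (re.compile(r"<h1[^>]*>(.*?)</h1>", re.S | re.I), r"# \1\n"),
    (re.compile(r"<h2[^>]*>(.*?)</h2>", re.S | re.I), r"## \1\n"),
    (re.compile(r"<strong[^>]*>(.*?)</strong>", re.S | re.I), r"**\1**"),
    (re.compile(r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.S | re.I), r"[\2](\1)"),
    (re.compile(r"<p[^>]*>(.*?)</p>", re.S | re.I), r"\1\n"),
]

def html_to_markdown(html: str) -> str:
    html = NOISE.sub("", html)
    for pattern, repl in RULES:
        html = pattern.sub(repl, html)
    html = re.sub(r"<[^>]+>", "", html)      # drop any leftover tags
    return re.sub(r"\n{3,}", "\n\n", html).strip()
```

Rule tables like this are exactly the maintenance problem v1 hit: every new HTML pattern needs another regex, and no finite list covers all real pages.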
Problems in v1:
- readability may mistakenly remove useful content;
- Turndown cannot convert every HTML pattern reliably;
- regex-heavy fixing cannot cover all cases and is hard to maintain across languages.
So they considered moving to an E2E LM approach, as shown below:
NOTE
E2E means End-to-End. In simple terms, raw content goes in, final target output comes out directly.

Jina Reader API v2
A key clarification: this is an LM, not necessarily an LLM, because large models bring much higher cost and slower inference.
So the team looked at SLMs (small language models): fewer than 1B parameters, able to run efficiently on edge-like devices (the original post does not define "edge" strictly, but a laptop can be treated as a non-typical edge device).
However, according to scaling laws, fewer parameters usually mean weaker reasoning and summarization. If an SLM is too small, it may fail to generate meaningful output at all.
Let’s inspect the HTML2Markdown task itself:
- This task is less creative and less complex than typical LLM generation. The model mainly solves two problems:
  - selecting the correct content from the HTML and skipping noisy markup;
  - converting HTML styling and structure into Markdown syntax.
  So intuitively this is simpler than open-ended text generation.
- Long-context support is crucial. Modern HTML carries a lot of noisy tags, so for an SLM to convert well, its context window must be large; 8K or 16K windows are often insufficient.
So the final requirement becomes a "shallow but wide" SLM: "shallow" because the task mainly needs reliable copy-and-transform behavior, so fewer transformer layers may be enough; "wide" because long-context support still demands attention capacity.
Past research suggests context length and reasoning quality are closely linked, and optimizing both while keeping the parameter count small is difficult.
Jina’s reader-lm-0.5b and reader-lm-1.5b performed significantly better than existing general large models on HTML2Markdown:

My recent tests of general LLMs on email-HTML cleaning are consistent with the comparison Jina reported. Output uncertainty from next-token generation is still the biggest issue.
From the published specs, both models keep small parameter counts while extending the context length to 256K, and they also reduce hidden size and layer count accordingly.
Benchmark
Evaluation metrics:
- ROUGE-L: commonly used in summarization/QA; measures overlap similarity between the predicted output and the reference text. Higher is better.
- Token Error Rate (TER): the percentage of generated Markdown tokens that do not appear in the original HTML content, used as a hallucination proxy. Lower is better.
- Word Error Rate (WER): common in OCR/ASR; counts insertions (ADD), substitutions (SUB), and deletions (DEL) against the reference. Lower is better.
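Hedged sketches of the two error-rate metrics, using my simplified definitions (difflib's alignment approximates, but does not guarantee, the minimal edit distance; real TER would also tokenize the HTML properly instead of whitespace-splitting):

```python
from difflib import SequenceMatcher

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (SUB + DEL + ADD) / reference length, approximately."""
    ref, hyp = reference.split(), hypothesis.split()
    errors = 0
    for op, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op == "replace":
            errors += max(i2 - i1, j2 - j1)   # substitutions (plus extra add/del)
        elif op == "delete":
            errors += i2 - i1                 # deletions
        elif op == "insert":
            errors += j2 - j1                 # insertions
    return errors / max(len(ref), 1)

def token_error_rate(html: str, markdown: str) -> float:
    """Share of generated tokens absent from the source HTML (hallucination proxy)."""
    source = set(html.split())
    generated = markdown.split()
    missing = sum(1 for tok in generated if tok not in source)
    return missing / max(len(generated), 1)
```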
Prompt used for LLM baselines:
Your task is to convert the content of the provided HTML file into the corresponding markdown file. You need to convert the structure, elements, and attributes of the HTML into equivalent representations in markdown format, ensuring that no important information is lost. The output should strictly be in markdown format, without any additional explanations.
NOTE
Benchmark results are not universally decisive, but they do verify several current patterns:
- For the same architecture, larger parameter count generally gives better overall performance.
- Hallucination remains significant in general-purpose models.
It is a bit surprising that Anthropic models were not included in the benchmark. My guess is maybe they were not clearly beaten by Reader-LM 😜.
Another point is overfitting: for this narrow conversion scenario, a certain degree of overfitting may actually lower WER/TER, which can be interpreted as “reduced imagination” 🤷♂️.
Reader-LM Usage
- Jina provides a Reader-LM notebook on Google Colab.
- In production, an RTX 3090/4090 with bfloat16 and flash-attention is recommended to reduce VRAM usage and mitigate performance degradation on long inputs.
Reader-LM Training
Data
Training input is paired HTML and Markdown QA-format data. A key point is that SLMs are especially sensitive to data quality, so Jina built a data pipeline to ensure high-quality Markdown entries in training sets.
NOTE
This “high sensitivity” can be viewed as a side effect of small parameter size. With limited learning capacity, SLMs overfit easily; once bad data appears in training, the model can learn wrong patterns quickly.
Jina also synthesized part of the HTML-Markdown pairs with GPT-4o. Compared with real-world HTML, the synthetic data has simpler structure and less noise, making it easier for an SLM to learn.
Final training format (total 2.5B tokens):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{{RAW_HTML}}<|im_end|>
<|im_start|>assistant
{{MARKDOWN}}<|im_end|>
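Rendering that ChatML layout is a trivial formatter; the names `CHAT_TEMPLATE` and `render_example` below are mine, not Jina's, and this only reproduces the template shown above:

```python
# ChatML-style training example, matching the format in the post.
CHAT_TEMPLATE = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "{html}<|im_end|>\n"
    "<|im_start|>assistant\n"
    "{markdown}<|im_end|>"
)

def render_example(raw_html: str, markdown: str) -> str:
    """Fill the RAW_HTML / MARKDOWN slots of the training template."""
    return CHAT_TEMPLATE.format(html=raw_html, markdown=markdown)
```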
NOTE
My guess: the synthetic-data proportion is somewhere between 20% and 30%.
Why not more?
Because real HTML contains far more noise. If all the data were synthetic, two issues would appear:
- Training becomes direct distillation of GPT-4o output and may carry commercial copyright risk.
- More importantly, model quality may drop further, because overly simple training data limits what the SLM learns and increases overfitting risk.
Then why include synthetic data at all?
Using some synthetic data works like soft transfer learning: start from idealized examples, then gradually adapt to harder real-world cases.
Two-stage Training
Jina trained Reader-LM in two phases:
- short and simple HTML: max sequence length (HTML+Markdown) = 32K tokens, about 1.5B tokens data;
- long and hard HTML: max sequence length expanded to 128K tokens, about 1.2B tokens data, with ring flash attention.
NOTE
So although Jina advertises 256K context support, the training data itself only goes up to 128K. However, because the base model Qwen2 uses RoPE positional encoding, it generally extrapolates to longer sequences better than classic sinusoidal absolute-position Transformer variants.
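A one-frequency toy of why rotary encodings behave well on length: the dot product between a query rotated to position m and a key rotated to position n depends only on the offset m - n, not on the absolute positions. This is a 2-D illustration of the RoPE property, not Qwen2's actual implementation:

```python
import math

def rope_2d(pair, pos, theta=0.1):
    """Rotate one 2-D feature pair by angle pos * theta (a single RoPE frequency)."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x, y = pair
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]
```

Because rotations are orthogonal, `dot(rope_2d(q, m), rope_2d(k, n))` equals `dot(rope_2d(q, m - n), k)` for any absolute positions m and n, which is the relative-position property that helps beyond the trained length.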
Problems encountered during training:
Repetitive / looping token output
The model sometimes repeated the same token(s) or entered short loops until the max-generation limit was reached.
Solutions were based on two references:
- Use contrastive search at decoding time and add a contrastive loss during training, which reduced repetition in practice.
- Add repeated-token detection to the transformers pipeline and stop decoding when a loop pattern appears, referencing this issue.
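A minimal sketch of the second mitigation. This is my own toy detector, not the actual transformers stopping criterion from the referenced issue: treat generation as looping when the tail of the token stream repeats a short block back-to-back.

```python
def looping(tokens: list[int], max_period: int = 8, min_repeats: int = 3) -> bool:
    """True if the tail of `tokens` repeats a block of length <= max_period
    at least `min_repeats` times consecutively."""
    for period in range(1, max_period + 1):
        tail_len = period * min_repeats
        if len(tokens) < tail_len:
            break                      # longer periods need even more tokens
        tail = tokens[-tail_len:]
        block = tail[:period]
        if all(tail[i:i + period] == block for i in range(0, tail_len, period)):
            return True
    return False
```

In practice a check like this would be wrapped in the transformers `StoppingCriteria` interface so `generate()` halts as soon as a loop is detected.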
Long-input training efficiency
Transformer models risk OOM on long inputs, so Jina used chunk-wise model forwarding to reduce memory usage. I have not found a specific paper for this part yet.
Another issue: in some trainer pipelines, long inputs are split, so the sub-texts lose their surrounding context; or inputs are padded to equal length for batching, introducing meaningless filler tokens. Both can push the model toward hallucination by forcing it to infer missing context.
So Jina modified the Trainer to concatenate multiple short texts into long sequences for padding-free training.
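The padding-free idea can be sketched as greedy sequence packing. This is a toy version (Jina's actual Trainer change is not published in detail), and a real implementation must also reset attention masks and position ids so packed examples do not attend across their boundaries:

```python
def pack(examples: list[list[int]], max_len: int) -> list[list[int]]:
    """Greedily concatenate tokenized examples into sequences of at most
    max_len tokens, so batches need no padding filler."""
    packed, current = [], []
    for ex in examples:
        if current and len(current) + len(ex) > max_len:
            packed.append(current)     # flush the full sequence
            current = []
        current.extend(ex[:max_len])   # truncate a single oversized example
    if current:
        packed.append(current)
    return packed
```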
Final observation: 0.5B is the smallest size that can do selective copying over long-context input, while 1.5B is the smallest size that significantly improves quality without hitting severe diminishing returns on parameter growth.
Alternative Architecture: Encoder-only Models
This part has limited practical relevance for my current work, so I skipped detailed analysis.
Summary
The most valuable takeaway is the attempt to train task-specific SLMs for proprietary conversion tasks. The second valuable part is experimentation around long-context attention and handling training-time failure modes.
Next I plan to read the ring flash attention repo and the repeated-token mitigation methods in more detail. Consider this a public TODO flag, haha.