
Commit 87137db: Merge branch 'jakep/new_data'
2 parents: 7ef6020 + 4b8146c

98 files changed, +15524 / -2637 lines


README.md (71 additions, 62 deletions)
@@ -35,6 +35,7 @@ Features:
 - (Based on a 7B parameter VLM, so it requires a GPU)

 ### News
+- October 21, 2025 - v0.4.0 - [New model release](https://huggingface.co/allenai/olmOCR-2-7B-1025-FP8), boosts the olmOCR-bench score by ~4 points using synthetic data and introduces RL training.
 - August 13, 2025 - v0.3.0 - [New model release](https://huggingface.co/allenai/olmOCR-7B-0825-FP8), fixes auto-rotation detection and hallucinations on blank documents.
 - July 24, 2025 - v0.2.1 - [New model release](https://huggingface.co/allenai/olmOCR-7B-0725-FP8), scores 3 points higher on [olmOCR-Bench](https://github.com/allenai/olmocr/tree/main/olmocr/bench), runs significantly faster because it defaults to FP8, and needs far fewer retries per document.
 - July 23, 2025 - v0.2.0 - New, cleaned-up [trainer code](https://github.com/allenai/olmocr/tree/main/olmocr/train) makes it much simpler to train olmOCR models yourself.
@@ -66,28 +67,28 @@ We also ship a comprehensive benchmark suite covering over 7,000 test cases across
 </thead>
 <tbody>
 <tr>
-<td align="left">Marker v1.7.5 (base, force_ocr)</td>
-<td align="center">76.0</td>
-<td align="center">57.9</td>
-<td align="center">57.6</td>
-<td align="center">27.8</td>
-<td align="center">84.9</td>
+<td align="left">Marker v1.10.1 (base, force_ocr)</td>
+<td align="center"><strong>83.8</strong></td>
+<td align="center">66.8</td>
 <td align="center">72.9</td>
-<td align="center"><strong>84.6</strong></td>
-<td align="center">99.1</td>
-<td align="center">70.1 ± 1.1</td>
+<td align="center">33.5</td>
+<td align="center">86.6</td>
+<td align="center">80.0</td>
+<td align="center"><strong>85.7</strong></td>
+<td align="center">99.3</td>
+<td align="center">76.1 ± 1.1</td>
 </tr>
 <tr>
-<td align="left">MinerU v1.3.10</td>
-<td align="center">75.4</td>
-<td align="center">47.4</td>
-<td align="center">60.9</td>
-<td align="center">17.3</td>
-<td align="center"><strong>96.6</strong></td>
-<td align="center">59.0</td>
-<td align="center">39.1</td>
-<td align="center">96.6</td>
-<td align="center">61.5 ± 1.1</td>
+<td align="left">MinerU v2.5.4</td>
+<td align="center">75.5</td>
+<td align="center">50.2</td>
+<td align="center">59.9</td>
+<td align="center">19.2</td>
+<td align="center"><strong>97.0</strong></td>
+<td align="center">58.7</td>
+<td align="center">44.6</td>
+<td align="center">97.8</td>
+<td align="center">62.9 ± 1.1</td>
 </tr>
 <tr>
 <td align="left">Mistral OCR API</td>
@@ -115,28 +116,40 @@ We also ship a comprehensive benchmark suite covering over 7,000 test cases across
 </tr>
 <tr>
 <td align="left">olmOCR v0.2.0</td>
-<td align="center"><strong>78.8</strong></td>
+<td align="center">78.8</td>
 <td align="center">77.5</td>
 <td align="center">71.9</td>
-<td align="center"><strong>45.4</strong></td>
+<td align="center">45.4</td>
 <td align="center">94.2</td>
-<td align="center"><strong>78.6</strong></td>
+<td align="center">78.6</td>
 <td align="center">81.4</td>
-<td align="center"><strong>99.8</strong></td>
-<td align="center"><strong>78.5 ± 1.1</strong></td>
+<td align="center">99.8</td>
+<td align="center">78.5 ± 1.1</td>
 </tr>
 <tr>
 <td align="left">olmOCR v0.3.0</td>
 <td align="center">78.6</td>
-<td align="center"><strong>79.9</strong></td>
+<td align="center">79.9</td>
 <td align="center">72.9</td>
 <td align="center">43.9</td>
 <td align="center">95.1</td>
 <td align="center">77.3</td>
 <td align="center">81.2</td>
 <td align="center">98.9</td>
 <td align="center">78.5 ± 1.1</td>
-</tr>
+</tr>
+<tr>
+<td align="left">olmOCR pipeline v0.4.0</td>
+<td align="center"><strong>83.0</strong></td>
+<td align="center"><strong>82.3</strong></td>
+<td align="center"><strong>84.9</strong></td>
+<td align="center"><strong>47.7</strong></td>
+<td align="center">96.1</td>
+<td align="center"><strong>83.7</strong></td>
+<td align="center">81.9</td>
+<td align="center">99.7</td>
+<td align="center"><strong>82.4 ± 1.1</strong></td>
+</tr>
 </tbody>
 </table>

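As a quick consistency check on the benchmark table above (an observation about the numbers, not documented scoring methodology): the overall column matches the arithmetic mean of the eight category scores. For the newly added olmOCR pipeline v0.4.0 row:

```python
# Category scores from the olmOCR pipeline v0.4.0 row in the table above.
scores = [83.0, 82.3, 84.9, 47.7, 96.1, 83.7, 81.9, 99.7]

overall = sum(scores) / len(scores)
print(round(overall, 1))  # prints 82.4, matching the reported 82.4 ± 1.1
```

The same holds for the v0.3.0 row (mean ≈ 78.5), so the overall score appears to weight each category equally.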
@@ -196,56 +209,51 @@ python -m olmocr.pipeline ./localworkspace --markdown --pdfs tests/gnarly_pdfs/*

 With the addition of the `--markdown` flag, results will be stored as markdown files inside of `./localworkspace/markdown/`.

-### Using External vLLM Server
+#### Viewing Results
+
+The `./localworkspace/` workspace folder will then have both [Dolma](https://github.com/allenai/dolma) and markdown files (if using `--markdown`).

-If you have a vLLM server already running elsewhere (or any inference platform implementing the relevant subset of the OpenAI API), you can point olmOCR to use it instead of spawning a local instance:

 ```bash
-# Use external vLLM server instead of local one
-python -m olmocr.pipeline ./localworkspace --server http://remote-server:8000 --markdown --pdfs tests/gnarly_pdfs/*.pdf
+cat localworkspace/markdown/olmocr-sample.md
 ```

-The served model name should be `olmocr`. An example vLLM launch command would be:
-```bash
-vllm serve allenai/olmOCR-7B-0825-FP8 --served-model-name olmocr --max-model-len 16384
 ```
+olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
+...
+```
+
+### Using an Inference Provider or External Server
+
+If you have a vLLM server already running elsewhere (or any inference platform implementing the OpenAI API), you can point olmOCR to use it instead of spawning a local instance:

-#### Run olmOCR with the DeepInfra server endpoint:
-Sign up at [DeepInfra](https://deepinfra.com/) and get your API key from the DeepInfra dashboard.
-Store the API key as an environment variable.
 ```bash
-export DEEPINFRA_API_KEY="your-api-key-here"
+# Use an external vLLM server instead of a local one
+python -m olmocr.pipeline ./localworkspace --server http://remote-server:8000/v1 --markdown --pdfs tests/gnarly_pdfs/*.pdf
 ```

+The served model name should be `olmocr`. An example vLLM launch command would be:
 ```bash
-python -m olmocr.pipeline ./localworkspace \
-  --server https://api.deepinfra.com/v1/openai \
-  --api_key $DEEPINFRA_API_KEY \
-  --pages_per_group 100 \
-  --model allenai/olmOCR-7B-0825 \
-  --markdown \
-  --pdfs path/to/your/*.pdf
+vllm serve allenai/olmOCR-2-7B-1025-FP8 --served-model-name olmocr --max-model-len 16384
 ```
-- `--server`: DeepInfra's OpenAI-compatible endpoint: `https://api.deepinfra.com/v1/openai`
-- `--api_key`: Your DeepInfra API key
-- `--pages_per_group`: You may want a smaller number of pages per group, as many external providers have lower concurrent request limits
-- `--model`: The model identifier on DeepInfra: `allenai/olmOCR-7B-0825`
-- Other arguments work the same as with local inference
-
-#### Viewing Results
+#### Verified External Providers

-The `./localworkspace/` workspace folder will then have both [Dolma](https://github.com/allenai/dolma) and markdown files (if using `--markdown`).
+We have tested `olmOCR-2-7B-1025-FP8` on these external model providers and confirmed that they work:

+| Provider | $/1M input tokens | $/1M output tokens | Example command |
+|----------|-------------------|--------------------|-----------------|
+| [DeepInfra](https://deepinfra.com/) | $0.14 | $0.80 | `python -m olmocr.pipeline ./localworkspace1 --server https://api.deepinfra.com/v1/openai --api_key DfXXXXXXX --model allenai/olmOCR-7B-1025 --pdfs tests/gnarly_pdfs/*.pdf` |
+| [Parasail](https://www.saas.parasail.io/serverless?name=olmocr-7b-1025-fp8) | $0.10 | $0.20 | `python -m olmocr.pipeline ./localworkspace1 --server https://api.parasail.io/v1 --api_key psk-XXXXX --model parasail-olmocr-7b-1025-fp8 --pdfs tests/gnarly_pdfs/*.pdf` |

-```bash
-cat localworkspace/markdown/olmocr-sample.md
-```
+Notes on arguments:
+- `--server`: The OpenAI-compatible endpoint, e.g. `https://api.deepinfra.com/v1/openai`
+- `--api_key`: Your API key, passed in via an Authorization Bearer HTTP header
+- `--pages_per_group`: You may want a smaller number of pages per group, as many external providers have lower concurrent request limits
+- `--model`: The model identifier, e.g. `allenai/olmOCR-7B-1025`; different providers use different names, and if you run locally you can use `olmocr`
+- Other arguments work the same as with local inference

-```
-olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
-...
-```

 ### Multi-node / Cluster Usage

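The external-server flags documented above can be captured in a small helper. The sketch below is purely illustrative (the `build_pipeline_cmd` helper is hypothetical, not part of olmocr); it assembles the documented `python -m olmocr.pipeline` invocation for an OpenAI-compatible provider:

```python
import shlex

def build_pipeline_cmd(workspace, server, api_key, model, pdfs,
                       pages_per_group=None, markdown=True):
    """Assemble an olmocr.pipeline command line for an external
    OpenAI-compatible server, using the flags documented above.
    Illustrative helper only; not part of the olmocr package."""
    cmd = ["python", "-m", "olmocr.pipeline", workspace,
           "--server", server, "--api_key", api_key, "--model", model]
    if pages_per_group is not None:
        # Many external providers have lower concurrent-request limits,
        # so a smaller group size can help.
        cmd += ["--pages_per_group", str(pages_per_group)]
    if markdown:
        cmd.append("--markdown")
    cmd += ["--pdfs", pdfs]
    return " ".join(shlex.quote(part) for part in cmd)

print(build_pipeline_cmd("./localworkspace",
                         "https://api.deepinfra.com/v1/openai",
                         "DfXXXXXXX", "allenai/olmOCR-7B-1025",
                         "tests/gnarly_pdfs/*.pdf", pages_per_group=100))
```

`shlex.quote` keeps the glob pattern safe to paste into a shell; swap in the server URL and model name from the provider table for other providers.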
@@ -371,10 +379,11 @@ beaker/cluster execution:

 There are some nice reusable pieces of the code that may be useful for your own projects:
 - A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
-- A side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
 - Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
-- Finetuning code for Qwen2-VL and Molmo-O - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
-- Processing millions of PDFs through a finetuned model using sglang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
+- SFT finetuning code for Qwen2.5-VL - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
+- GRPO RL trainer - [grpo_train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/grpo_train.py)
+- Synthetic data generation - [mine_html_templates.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/synth/mine_html_templates.py)
+- Processing millions of PDFs through a finetuned model using vLLM - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
 - Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)

olmocr/bench/README.md (49 additions, 14 deletions)
@@ -49,6 +49,18 @@ to run it against your own OCR tools. Your tool just needs to support Markdown o
 <td align="center">48.3 ± 1.1</td>
 </tr>
 <tr>
+<td align="left">Marker v1.10.1 (base, force_ocr)</td>
+<td align="center"><strong>83.8</strong></td>
+<td align="center">66.8</td>
+<td align="center">72.9</td>
+<td align="center">33.5</td>
+<td align="center">86.6</td>
+<td align="center">80.0</td>
+<td align="center"><strong>85.7</strong></td>
+<td align="center">99.3</td>
+<td align="center">76.1 ± 1.1</td>
+</tr>
+<!-- <tr>
 <td align="left">Marker v1.7.5 (base, force_ocr)</td>
 <td align="center">76.0</td>
 <td align="center">57.9</td>
@@ -59,8 +71,20 @@ to run it against your own OCR tools. Your tool just needs to support Markdown o
 <td align="center"><strong>84.6</strong></td>
 <td align="center">99.1</td>
 <td align="center">70.1 ± 1.1</td>
-</tr>
+</tr> -->
 <tr>
+<td align="left">MinerU v2.5.4</td>
+<td align="center">75.5</td>
+<td align="center">50.2</td>
+<td align="center">59.9</td>
+<td align="center">19.2</td>
+<td align="center"><strong>97.0</strong></td>
+<td align="center">58.7</td>
+<td align="center">44.6</td>
+<td align="center">97.8</td>
+<td align="center">62.9 ± 1.1</td>
+</tr>
+<!-- <tr>
 <td align="left">MinerU v1.3.10</td>
 <td align="center">75.4</td>
 <td align="center">47.4</td>
@@ -71,7 +95,7 @@ to run it against your own OCR tools. Your tool just needs to support Markdown o
 <td align="center">39.1</td>
 <td align="center">96.6</td>
 <td align="center">61.5 ± 1.1</td>
-</tr>
+</tr> -->
 <tr>
 <td align="left">Mistral OCR API</td>
 <td align="center">77.2</td>
@@ -88,7 +112,7 @@ to run it against your own OCR tools. Your tool just needs to support Markdown o
 <td align="left">Nanonets OCR</td>
 <td align="center">67.0</td>
 <td align="center">68.6</td>
-<td align="center"><strong>77.7</strong></td>
+<td align="center">77.7</td>
 <td align="center">39.5</td>
 <td align="center">40.7</td>
 <td align="center">69.9</td>
@@ -194,32 +218,43 @@ to run it against your own OCR tools. Your tool just needs to support Markdown o
 </tr>
 <tr>
 <td align="left">olmOCR v0.2.0</td>
-<td align="center"><strong>78.8</strong></td>
+<td align="center">78.8</td>
 <td align="center">77.5</td>
 <td align="center">71.9</td>
-<td align="center"><strong>45.4</strong></td>
+<td align="center">45.4</td>
 <td align="center">94.2</td>
-<td align="center"><strong>78.6</strong></td>
+<td align="center">78.6</td>
 <td align="center">81.4</td>
 <td align="center"><strong>99.8</strong></td>
-<td align="center"><strong>78.5 ± 1.1</strong></td>
+<td align="center">78.5 ± 1.1</td>
 </tr>
 <tr>
 <td align="left">olmOCR v0.3.0</td>
 <td align="center">78.6</td>
-<td align="center"><strong>79.9</strong></td>
+<td align="center">79.9</td>
 <td align="center">72.9</td>
 <td align="center">43.9</td>
 <td align="center">95.1</td>
 <td align="center">77.3</td>
 <td align="center">81.2</td>
 <td align="center">98.9</td>
 <td align="center">78.5 ± 1.1</td>
-</tr>
+</tr>
+<tr>
+<td align="left">olmOCR pipeline v0.4.0</td>
+<td align="center">83.0</td>
+<td align="center"><strong>82.3</strong></td>
+<td align="center"><strong>84.9</strong></td>
+<td align="center"><strong>47.7</strong></td>
+<td align="center">96.1</td>
+<td align="center"><strong>83.7</strong></td>
+<td align="center">81.9</td>
+<td align="center">99.7</td>
+<td align="center"><strong>82.4 ± 1.1</strong></td>
+</tr>
 </tbody>
 </table>

-
 <sup><sub>There was a small drop in scores from olmOCR v0.1.68 (77.4), which is due to two factors. One is that we have adjusted our benchmark code to not include
 any "fallback" mechanism when measuring benchmark scores (though it still exists when you run olmocr.pipeline). Second, there is a small drop in scores as we have updated
 from sglang 0.4.2 to vllm 0.9.1. In net, we think the upgrade to vllm is the right choice, given that sglang 0.4.6 had even lower scores by one point, and vllm comes with a
@@ -309,13 +344,13 @@ huggingface-cli download --repo-type dataset --resume-download allenai/olmOCR-be
 Convert your documents
 ```bash
 # You will need to install the [gpu] subset of olmocr dependencies to run gpu inference
+# Then convert using olmocr.bench.convert; see the olmocr/bench/runners directory for options
 pip install olmocr[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
-
-# convert using the same engine as olmOCR pipeline.py uses, see the olmocr/bench/runners directory for options
 python -m olmocr.bench.convert olmocr_pipeline --dir ./olmOCR-bench/bench_data

-# or use convert_all.sh to run OCR with many common frameworks all at once, API keys will be required
-./olmocr/bench/scripts/convert_all.sh
+# OR, you can use the pipeline to convert the benchmark PDFs and move them into the final format
+python -m olmocr.pipeline ./localworkspace --markdown --pdfs ./olmOCR-bench/bench_data/pdfs/**/*.pdf
+python olmocr/bench/scripts/workspace_to_bench.py localworkspace/ olmOCR-bench/bench_data/olmocr --bench-path ./olmOCR-bench/
 ```

 Now run the benchmark
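One caveat about the `--pdfs ./olmOCR-bench/bench_data/pdfs/**/*.pdf` glob in the conversion step above: in bash, `**` only recurses into subdirectories when the `globstar` option is enabled (`shopt -s globstar`); otherwise it behaves like a single `*`. A quick way to confirm what the pattern should match is a recursive listing in Python (the path is taken from the commands above; this check is a suggestion, not part of the benchmark tooling):

```python
from pathlib import Path

def collect_pdfs(root):
    """Recursively list PDFs under `root`, mirroring what the shell
    glob **/*.pdf is intended to match with globstar enabled."""
    return sorted(str(p) for p in Path(root).rglob("*.pdf"))

# Path from the benchmark setup commands above.
pdfs = collect_pdfs("./olmOCR-bench/bench_data/pdfs")
print(f"{len(pdfs)} PDFs found")
```

If the count printed here differs from what the pipeline picked up, the shell glob was likely not expanded recursively.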