Features:
- (Based on a 7B parameter VLM, so it requires a GPU)
### News
- October 21, 2025 - v0.4.0 - [New model release](https://huggingface.co/allenai/olmOCR-2-7B-1025-FP8), boosts olmOCR-bench score by ~4 points using synthetic data and introduces RL training.
- August 13, 2025 - v0.3.0 - [New model release](https://huggingface.co/allenai/olmOCR-7B-0825-FP8), fixes auto-rotation detection and hallucinations on blank documents.
- July 24, 2025 - v0.2.1 - [New model release](https://huggingface.co/allenai/olmOCR-7B-0725-FP8), scores 3 points higher on [olmOCR-Bench](https://github.com/allenai/olmocr/tree/main/olmocr/bench), runs significantly faster because it defaults to FP8, and needs far fewer retries per document.
- July 23, 2025 - v0.2.0 - New cleaned-up [trainer code](https://github.com/allenai/olmocr/tree/main/olmocr/train), making it much simpler to train olmOCR models yourself.
We also ship a comprehensive benchmark suite covering over 7,000 test cases across 1,400 documents.
With the addition of the `--markdown` flag, results will be stored as markdown files inside of `./localworkspace/markdown/`.
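For example, a minimal local run might look like the sketch below (the sample PDF name is illustrative; point `--pdfs` at any PDF you have):

```bash
python -m olmocr.pipeline ./localworkspace --markdown --pdfs olmocr-sample.pdf
```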
#### Viewing Results
The `./localworkspace/` workspace folder will then have both [Dolma](https://github.com/allenai/dolma) and markdown files (if using `--markdown`).
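For instance, assuming the sample file from the sketch above, you can print a converted markdown file directly (the exact file name is an assumption; it follows the input PDF's name):

```bash
cat ./localworkspace/markdown/olmocr-sample.md
```

The output should begin something like this: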
```
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
...
```
### Using an Inference Provider or External Server
If you have a vLLM server already running elsewhere (or any inference platform implementing the OpenAI API), you can point olmOCR to use it instead of spawning a local instance:
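A minimal sketch of pointing the pipeline at such a server; the host, port, and served model name here are hypothetical and should match your deployment:

```bash
# Hypothetical remote endpoint; the flags themselves are described in the DeepInfra notes below.
python -m olmocr.pipeline ./localworkspace \
    --markdown \
    --pdfs olmocr-sample.pdf \
    --server http://remote-host:8000/v1 \
    --model olmocr
```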
#### Run olmOCR with the DeepInfra server endpoint:
Sign up at [DeepInfra](https://deepinfra.com/) and get your API key from the DeepInfra dashboard. The relevant flags are described below, followed by a sketch of a full invocation.
- `--server`: Defines the OpenAI-compatible endpoint, e.g. `https://api.deepinfra.com/v1/openai`
- `--api_key`: Your API key, passed in via the Authorization Bearer HTTP header
- `--pages_per_group`: You may want a smaller number of pages per group, as many external providers have lower concurrent request limits
- `--model`: The model identifier, e.g. `allenai/olmOCR-7B-1025`; different providers use different names, and if you run locally you can use `olmocr`
- Other arguments work the same as with local inference
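Putting the flags together, a full DeepInfra invocation might look like this sketch (the API key placeholder and the pages-per-group value are illustrative):

```bash
python -m olmocr.pipeline ./localworkspace \
    --markdown \
    --pdfs olmocr-sample.pdf \
    --server https://api.deepinfra.com/v1/openai \
    --api_key YOUR_DEEPINFRA_API_KEY \
    --model allenai/olmOCR-7B-1025 \
    --pages_per_group 50
```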
### Multi-node / Cluster Usage
There are some nice reusable pieces of the code that may be useful for your own projects:
- A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
- A side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
- Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
- SFT Finetuning code for Qwen2.5-VL - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
- Synthetic data generation - [mine_html_templates.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/synth/mine_html_templates.py)
- Processing millions of PDFs through a finetuned model using vLLM - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
- Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
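As a quick sketch of that last item (the module invocation and results glob are assumptions based on the repository layout and the default local workspace):

```bash
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
```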
<sup><sub>There was a small drop in scores from olmOCR v0.1.68 (77.4), which is due to two factors. First, we have adjusted our benchmark code to not include any "fallback" mechanism when measuring benchmark scores (though it still exists when you run olmocr.pipeline). Second, there is a small drop in scores as we have updated from sglang 0.4.2 to vllm 0.9.1. On net, we think the upgrade to vllm is the right choice, given that sglang 0.4.6 had even lower scores by one point, and vllm comes with a ...</sub></sup>