# NEVO1

NEVO1 is a research-driven playground for building unified OCR that reads mixed Hebrew–English documents. It includes a modular training stack, synthetic data generation, API benchmarks, prompt experiments, and GOT-OCR2 fine-tuning utilities.
## Features

- Vision–language OCR training stack (autoregressive decoder + vision encoder with swappable connectors).
- Synthetic line renderer for Hebrew, English, and mixed-script text with varied typography and augmentations.
- Benchmarks against external OCR APIs (Google Vision, Azure Vision, Gemini, Mistral, Tesseract) and internal baselines.
- Gradio app and prompt experiments for Hebrew/English comparisons.
- GOT-OCR2 fine-tuning helpers and checkpoints.
## Repository layout

- `nevo_finetune/` — core training stack (encoder–decoder VLM with CER evaluation)
- `benchmarking/` — API OCR benchmarks + synthetic renderer
- `hebrew_vs_english/` — prompts, Gradio demo, chat clients, and scrapers (`hebrew_vs_english/scrapers/`)
- `assets/fonts/` — Hebrew fonts for rendering (`assets/fonts/hebrew`)
- `fine_tuning_GOT_OCR2/` — GOT-OCR2 fine-tuning scripts, data prep helpers, and checkpoints
## Quick start

- Create/activate a Python env and install deps: `pip install -r requirements.txt`.
- Copy env templates and fill in keys (keep real keys untracked):
  - `benchmarking/.env.example` → `benchmarking/.env`
  - `hebrew_vs_english/.env.example` → `hebrew_vs_english/.env`
  - `nevo_finetune/.env.example` → `nevo_finetune/.env`
- `.env` files are git-ignored; code prefers `os.environ` and falls back to the local `.env`.
## nevo_finetune

- Purpose: modular VLM OCR training with YAML configs; supports connector variants, scheduler phases, mixed precision, gradient checkpointing, freezing policies, and LoRA.
- Run:
  - Single GPU: `python -m nevo_finetune.cli.train --config train_ocr_config.yaml`
  - Multi-GPU: `torchrun --nproc_per_node=NUM_GPUS --master_port=29500 -m nevo_finetune.cli.train --config train_ocr_config.yaml`
- Configs: `train_ocr_config.yaml` (pipeline, scheduler, training, paths, wandb, evaluation). Utils in `nevo_finetune/utils/`.
- Data: the synthetic line renderer populates `ocr_training_output/` (train/val/test for Hebrew/English/Mixed) with paired image/text files; augmentations cover blur, noise, rotation, contrast, backgrounds, and mild distortion.
- Outputs: checkpoints, logs, and evals under `ocr_training_output/`; optional Weights & Biases logging if `WANDB_API_KEY` is set.
- Metrics: Character Error Rate (CER) per bucket (Heb/Eng/Mixed) and overall.
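CER is the standard edit-distance metric; a minimal sketch of how it can be computed (the repo's evaluation code may differ, e.g. in how per-bucket scores are aggregated):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)  # guard against empty reference
```

Note that CER can exceed 1.0 when the hypothesis is much longer than the reference.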
## benchmarking

- Purpose: benchmark third-party OCR APIs and generate synthetic samples for quick tests.
- Entrypoints: `bench_google_vision.py`, `bench_azure_vision.py`, `bench_gemini.py`, `bench_mistral_ocr.py`, `bench_tesseract.py`; renderer: `renderer.py` (uses `assets/fonts/hebrew` plus bundled EN fonts/backgrounds).
- Env: set in `benchmarking/.env`:
  - `GOOGLE_VISION_API_KEY`
  - `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
  - `AZURE_VISION_ENDPOINT`, `AZURE_VISION_API_KEY`
  - `MISTRAL_API_KEY`
  - `HF_TOKEN` (for HF-backed providers, if added)
- Data: default samples in `benchmarking/test/`; the renderer writes square images into `benchmarking/test/<lang>/`.
- Example runs:
  - `python benchmarking/bench_google_vision.py --data-dir benchmarking/test/hebrew`
  - `python benchmarking/bench_gemini.py --data-dir benchmarking/test/mixed`
  - `python benchmarking/bench_azure_vision.py --data-dir benchmarking/test/english`
  - `python benchmarking/bench_mistral_ocr.py --data-dir benchmarking/test/mixed`
  - `python benchmarking/bench_tesseract.py --postfix mixed` (requires the tesseract binary to be installed)
- Outputs: printed per-sample/aggregate metrics via `benchmark_ocr`; the renderer emits images for local testing.
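The per-provider scripts all share the same harness shape: walk a data dir of paired samples, call one provider, score against ground truth, aggregate. A hypothetical sketch — the `.txt`/`.png` pairing convention, the `ocr_fn`/`metric_fn` signatures, and the name `run_benchmark` are illustrative assumptions, not the repo's `benchmark_ocr` API:

```python
from pathlib import Path


def run_benchmark(data_dir: str, ocr_fn, metric_fn) -> dict:
    """Run an OCR callable over paired samples and aggregate a metric.

    Assumes each ground-truth .txt sits next to a same-named image;
    ocr_fn(image_path) -> str wraps one provider's API, and
    metric_fn(reference, hypothesis) -> float is e.g. CER.
    """
    scores = []
    for txt in sorted(Path(data_dir).glob("*.txt")):
        image = txt.with_suffix(".png")
        reference = txt.read_text(encoding="utf-8").strip()
        hypothesis = ocr_fn(image)  # provider-specific OCR call
        scores.append(metric_fn(reference, hypothesis))
    return {
        "samples": len(scores),
        "mean_score": sum(scores) / len(scores) if scores else None,
    }
```

Keeping the provider call behind a plain callable is what makes swapping Google Vision for Tesseract a one-line change in each script.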
## hebrew_vs_english

- Purpose: mixed Hebrew/English prompt experiments, Gradio comparison app, prompt databases, chat clients, and scrapers.
- Entrypoints:
  - Gradio comparison: `python hebrew_vs_english/app.py` (preloads paired prompts; needs a Gemini/Google key).
  - Scrapers: `hebrew_vs_english/scrapers/nli_scrape.py` (NLI IIIF search/manifests), `hebrew_vs_english/scrapers/wiki_scrape.py` (Wikipedia → HTML/PDF).
  - Chat clients: wrappers for OpenAI/Replicate/Dicta in `*ChatClient.py`.
- Data: prompt CSVs under `hebrew_vs_english/prompt_database_*`, backups under `DB_backup/`, test results in `test_results_*.json`.
- Env: set in `hebrew_vs_english/.env` — `GEMINI_API_KEY` (or `GOOGLE_API_KEY`); `NLI_API_KEY` for the NLI scraper; chat clients require their respective provider tokens.
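For the side-by-side comparisons to work, each provider wrapper needs the same surface. A minimal sketch of such an interface — the class names and `send` signature are assumptions, not the actual `*ChatClient.py` code:

```python
from abc import ABC, abstractmethod


class ChatClient(ABC):
    """Common surface shared by the provider-specific wrappers."""

    @abstractmethod
    def send(self, prompt: str) -> str:
        """Send a prompt and return the model's text reply."""


class EchoChatClient(ChatClient):
    """Stand-in client for offline testing (no API key required)."""

    def send(self, prompt: str) -> str:
        return f"echo: {prompt}"
```

A fake client like `EchoChatClient` lets the Gradio app and prompt-database code be exercised without spending API quota.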
## fine_tuning_GOT_OCR2

- Purpose: adapt GOT-OCR2 (ViT encoder + decoder) to custom OCR datasets.
- Entrypoints:
  - Training: `python fine_tuning_GOT_OCR2/fine_tune_vit.py --data-dir <dataset_root> --save-dir <out_dir>`
  - Data prep: `python fine_tuning_GOT_OCR2/get_nougat.py --num-samples 10`, `python fine_tuning_GOT_OCR2/get_sroie.py`, `python fine_tuning_GOT_OCR2/generate_data.py --num-samples 200 --langs he en`
- Data layout: `fine_tuning_GOT_OCR2/dataset/` (or `--data-dir`) with images + sidecar `.txt` files, or `labels.csv`/`labels.jsonl` (image, text).
- Outputs: checkpoints under `fine_tuning_GOT_OCR2/checkpoints/` by default; model metadata under `fine_tuning_GOT_OCR2/models/`.
- Notes: ViT-based GOT-OCR2 training is VRAM-heavy; tune batch size/precision. Base model paths/IDs are defined in the scripts.
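The dual layout (sidecar `.txt` vs `labels.csv`) can be resolved with a small loader like the one below. The `image`/`text` column names come from the layout description above; the `.png` extension for sidecar pairs and the function name are assumptions:

```python
import csv
from pathlib import Path


def load_labels(data_dir: str) -> list[tuple[str, str]]:
    """Collect (image_path, text) pairs: prefer labels.csv, else sidecar .txt."""
    root = Path(data_dir)
    labels_csv = root / "labels.csv"
    pairs: list[tuple[str, str]] = []
    if labels_csv.exists():
        with labels_csv.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                pairs.append((str(root / row["image"]), row["text"]))
    else:
        # Fall back to one .txt transcript per image, matched by stem.
        for txt in sorted(root.glob("*.txt")):
            pairs.append((str(txt.with_suffix(".png")),
                          txt.read_text(encoding="utf-8").strip()))
    return pairs
```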
## Security

- Do not commit real API keys. `.env` files are git-ignored; only commit the `*.env.example` templates.
- Rotate any keys that were previously stored in local `.env` files before making the repo public.
- Code paths that load secrets: `benchmarking/*`, `hebrew_vs_english/app.py`, `hebrew_vs_english/scrapers/nli_scrape.py`, `nevo_finetune/utils/config.py`.
## License

- License: MIT (see `LICENSE`).
- Third-party components:
  - `benchmarking/text_renderer`: MIT (license included in that folder).
  - Fonts under `assets/fonts/hebrew`: use according to their upstream licenses (e.g., SIL OFL/Apache as provided by the font authors). If you redistribute them, keep their license terms, or replace them with your own font set.