# NEVO1

NEVO1 is a research-driven playground for building unified OCR that reads mixed Hebrew–English documents. It includes a modular training stack, synthetic data generation, API benchmarks, prompt experiments, and GOT-OCR2 fine-tuning utilities.
## Features

- Vision–language OCR training stack (autoregressive decoder + vision encoder with swappable connectors).
- Synthetic line renderer for Hebrew, English, and mixed-script text with varied typography and augmentations.
- Benchmarks against external OCR APIs (Google Vision, Azure Vision, Gemini, Mistral, Tesseract) and internal baselines.
- Gradio app and prompt experiments for Hebrew/English comparisons.
- GOT-OCR2 fine-tuning helpers and checkpoints.
## Repository layout

- `nevo_finetune/` — core training stack (encoder–decoder VLM with CER evaluation)
- `benchmarking/` — API OCR benchmarks + synthetic renderer
- `hebrew_vs_english/` — prompts, Gradio demo, chat clients, and scrapers (`hebrew_vs_english/scrapers/`)
- `assets/fonts/` — Hebrew fonts for rendering (`assets/fonts/hebrew`)
- `fine_tuning_GOT_OCR2/` — GOT-OCR2 fine-tuning scripts, data prep helpers, and checkpoints
## Quick start

- Create/activate a Python env and install deps: `pip install -r requirements.txt`.
- Copy env templates and fill in keys (keep real keys untracked):
  - `benchmarking/.env.example` → `benchmarking/.env`
  - `hebrew_vs_english/.env.example` → `hebrew_vs_english/.env`
  - `nevo_finetune/.env.example` → `nevo_finetune/.env`
- `.env` files are git-ignored; code prefers `os.environ` and falls back to the local `.env`.
## nevo_finetune

- Purpose: modular VLM OCR training with YAML configs; supports connector variants, scheduler phases, mixed precision, gradient checkpointing, freezing policies, and LoRA.
- Run:
  - Single GPU: `python -m nevo_finetune.cli.train --config train_ocr_config.yaml`
  - Multi-GPU: `torchrun --nproc_per_node=NUM_GPUS --master_port=29500 -m nevo_finetune.cli.train --config train_ocr_config.yaml`
- Configs: `train_ocr_config.yaml` (pipeline, scheduler, training, paths, wandb, evaluation). Utils in `nevo_finetune/utils/`.
- Data: the synthetic line renderer populates `ocr_training_output/` (train/val/test for Hebrew/English/Mixed) with paired image/text files; augmentations cover blur, noise, rotation, contrast, backgrounds, and mild distortion.
- Outputs: checkpoints, logs, and evals under `ocr_training_output/`; optional Weights & Biases logging if `WANDB_API_KEY` is set.
- Metrics: Character Error Rate (CER) per bucket (Heb/Eng/Mixed) and overall.
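CER is the standard edit-distance metric; a minimal sketch of how it can be computed (the repo's evaluation code may differ, e.g. in how per-bucket scores are aggregated):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)  # guard against empty reference
```

Note that CER can exceed 1.0 when the hypothesis is much longer than the reference.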
## benchmarking

- Purpose: benchmark third-party OCR APIs and generate synthetic samples for quick tests.
- Entrypoints: `bench_google_vision.py`, `bench_azure_vision.py`, `bench_gemini.py`, `bench_mistral_ocr.py`, `bench_tesseract.py`; renderer: `renderer.py` (uses `assets/fonts/hebrew` plus bundled EN fonts/backgrounds).
- Env: set in `benchmarking/.env`:
  - `GOOGLE_VISION_API_KEY`
  - `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
  - `AZURE_VISION_ENDPOINT`, `AZURE_VISION_API_KEY`
  - `MISTRAL_API_KEY`
  - `HF_TOKEN` (for HF-backed providers, if added)
- Data: default samples in `benchmarking/test/`; the renderer writes square images into `benchmarking/test/<lang>/`.
- Example runs:
  - `python benchmarking/bench_google_vision.py --data-dir benchmarking/test/hebrew`
  - `python benchmarking/bench_gemini.py --data-dir benchmarking/test/mixed`
  - `python benchmarking/bench_azure_vision.py --data-dir benchmarking/test/english`
  - `python benchmarking/bench_mistral_ocr.py --data-dir benchmarking/test/mixed`
  - `python benchmarking/bench_tesseract.py --postfix mixed` (requires the tesseract binary to be installed)
- Outputs: printed per-sample/aggregate metrics via `benchmark_ocr`; the renderer emits images for local testing.
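The per-provider scripts all share the same harness shape: walk a data dir of paired samples, call one provider, score against ground truth, aggregate. A hypothetical sketch — the `.txt`/`.png` pairing convention, the `ocr_fn`/`metric_fn` signatures, and the name `run_benchmark` are illustrative assumptions, not the repo's `benchmark_ocr` API:

```python
from pathlib import Path


def run_benchmark(data_dir: str, ocr_fn, metric_fn) -> dict:
    """Run an OCR callable over paired samples and aggregate a metric.

    Assumes each ground-truth .txt sits next to a same-named image;
    ocr_fn(image_path) -> str wraps one provider's API, and
    metric_fn(reference, hypothesis) -> float is e.g. CER.
    """
    scores = []
    for txt in sorted(Path(data_dir).glob("*.txt")):
        image = txt.with_suffix(".png")
        reference = txt.read_text(encoding="utf-8").strip()
        hypothesis = ocr_fn(image)  # provider-specific OCR call
        scores.append(metric_fn(reference, hypothesis))
    return {
        "samples": len(scores),
        "mean_score": sum(scores) / len(scores) if scores else None,
    }
```

Keeping the provider call behind a plain callable is what makes swapping Google Vision for Tesseract a one-line change in each script.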
## hebrew_vs_english

- Purpose: mixed Hebrew/English prompt experiments, Gradio comparison app, prompt databases, chat clients, and scrapers.
- Entrypoints:
  - Gradio comparison: `python hebrew_vs_english/app.py` (preloads paired prompts; needs a Gemini/Google key).
  - Scrapers: `hebrew_vs_english/scrapers/nli_scrape.py` (NLI IIIF search/manifests), `hebrew_vs_english/scrapers/wiki_scrape.py` (Wikipedia → HTML/PDF).
  - Chat clients: wrappers for OpenAI/Replicate/Dicta in `*ChatClient.py`.
- Data: prompt CSVs under `hebrew_vs_english/prompt_database_*`, backups under `DB_backup/`, test results in `test_results_*.json`.
- Env: set in `hebrew_vs_english/.env` — `GEMINI_API_KEY` (or `GOOGLE_API_KEY`); `NLI_API_KEY` for the NLI scraper; chat clients require their respective provider tokens.
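For the side-by-side comparisons to work, each provider wrapper needs the same surface. A minimal sketch of such an interface — the class names and `send` signature are assumptions, not the actual `*ChatClient.py` code:

```python
from abc import ABC, abstractmethod


class ChatClient(ABC):
    """Common surface shared by the provider-specific wrappers."""

    @abstractmethod
    def send(self, prompt: str) -> str:
        """Send a prompt and return the model's text reply."""


class EchoChatClient(ChatClient):
    """Stand-in client for offline testing (no API key required)."""

    def send(self, prompt: str) -> str:
        return f"echo: {prompt}"
```

A fake client like `EchoChatClient` lets the Gradio app and prompt-database code be exercised without spending API quota.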
## fine_tuning_GOT_OCR2

- Purpose: adapt GOT-OCR2 (ViT encoder + decoder) to custom OCR datasets.
- Entrypoints:
  - Training: `python fine_tuning_GOT_OCR2/fine_tune_vit.py --data-dir <dataset_root> --save-dir <out_dir>`
  - Data prep: `python fine_tuning_GOT_OCR2/get_nougat.py --num-samples 10`, `python fine_tuning_GOT_OCR2/get_sroie.py`, `python fine_tuning_GOT_OCR2/generate_data.py --num-samples 200 --langs he en`
- Data layout: `fine_tuning_GOT_OCR2/dataset/` (or `--data-dir`) with images + sidecar `.txt` files, or `labels.csv`/`labels.jsonl` (image, text).
- Outputs: checkpoints under `fine_tuning_GOT_OCR2/checkpoints/` by default; model metadata under `fine_tuning_GOT_OCR2/models/`.
- Notes: ViT-based GOT-OCR2 training is VRAM-heavy; tune batch size/precision. Base model paths/IDs are defined in the scripts.
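The dual layout (sidecar `.txt` vs `labels.csv`) can be resolved with a small loader like the one below. The `image`/`text` column names come from the layout description above; the `.png` extension for sidecar pairs and the function name are assumptions:

```python
import csv
from pathlib import Path


def load_labels(data_dir: str) -> list[tuple[str, str]]:
    """Collect (image_path, text) pairs: prefer labels.csv, else sidecar .txt."""
    root = Path(data_dir)
    labels_csv = root / "labels.csv"
    pairs: list[tuple[str, str]] = []
    if labels_csv.exists():
        with labels_csv.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                pairs.append((str(root / row["image"]), row["text"]))
    else:
        # Fall back to one .txt transcript per image, matched by stem.
        for txt in sorted(root.glob("*.txt")):
            pairs.append((str(txt.with_suffix(".png")),
                          txt.read_text(encoding="utf-8").strip()))
    return pairs
```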
## Security

- Do not commit real API keys. `.env` files are git-ignored; only commit the `*.env.example` templates.
- Rotate any keys that were previously stored in local `.env` files before making the repo public.
- Code paths that load secrets: `benchmarking/*`, `hebrew_vs_english/app.py`, `hebrew_vs_english/scrapers/nli_scrape.py`, `nevo_finetune/utils/config.py`.
## License

- License: MIT (see `LICENSE`).
- Third-party components:
  - `benchmarking/text_renderer`: MIT (license included in that folder).
  - Fonts under `assets/fonts/hebrew`: use according to their upstream licenses (e.g., SIL OFL/Apache as provided by the font authors). If you redistribute them, keep their license terms, or replace them with your own font set.