ksang123/NEVO1

NEVO1: Bilingual Hebrew–English OCR Playground

NEVO1 is a research-driven playground for building a unified OCR system that reads mixed Hebrew–English documents. It includes a modular training stack, synthetic data generation, API benchmarks, prompt experiments, and GOT-OCR2 fine-tuning utilities.

What’s here

  • Vision–language OCR training stack (autoregressive decoder + vision encoder with swappable connectors).
  • Synthetic line renderer for Hebrew, English, and mixed-script text with varied typography and augmentations.
  • Benchmarks against external OCR APIs (Google Vision, Azure Vision, Gemini, Mistral, Tesseract) and internal baselines.
  • Gradio app and prompt experiments for Hebrew/English comparisons.
  • GOT-OCR2 fine-tuning helpers and checkpoints.

Repo layout

  • nevo_finetune/ — core training stack (encoder–decoder VLM with CER evaluation)
  • benchmarking/ — API OCR benchmarks + synthetic renderer
  • hebrew_vs_english/ — prompts, Gradio demo, chat clients, and scrapers (hebrew_vs_english/scrapers/)
  • assets/fonts/ — Hebrew fonts for rendering (assets/fonts/hebrew)
  • fine_tuning_GOT_OCR2/ — GOT-OCR2 fine-tuning scripts, data prep helpers, and checkpoints

Setup

  1. Create/activate a Python env and install deps: pip install -r requirements.txt.
  2. Copy env templates and fill in keys (keep real keys untracked):
    • benchmarking/.env.example → benchmarking/.env
    • hebrew_vs_english/.env.example → hebrew_vs_english/.env
    • nevo_finetune/.env.example → nevo_finetune/.env
  3. .env files are git-ignored; code prefers os.environ and falls back to local .env.
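
The fallback behavior in step 3 can be sketched as follows. This is a minimal illustration of the pattern (prefer `os.environ`, fall back to a local `.env`); the repo's actual loader may use a library such as python-dotenv and handle more `.env` syntax:

```python
import os
from pathlib import Path
from typing import Optional

def load_secret(key: str, env_file: str = ".env") -> Optional[str]:
    """Return a secret from os.environ, falling back to a local .env file."""
    # Prefer the real environment so CI/production settings override local files.
    if key in os.environ:
        return os.environ[key]
    path = Path(env_file)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"').strip("'")
    return None
```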

nevo_finetune/: training stack

  • Purpose: modular VLM OCR training with YAML configs; supports connector variants, scheduler phases, mixed precision, gradient checkpointing, freezing policies, and LoRA.
  • Run:
    • Single GPU: python -m nevo_finetune.cli.train --config train_ocr_config.yaml
    • Multi-GPU: torchrun --nproc_per_node=NUM_GPUS --master_port=29500 -m nevo_finetune.cli.train --config train_ocr_config.yaml
  • Configs: train_ocr_config.yaml (pipeline, scheduler, training, paths, wandb, evaluation). Utils in nevo_finetune/utils/.
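
An illustrative skeleton of `train_ocr_config.yaml`. The top-level sections (pipeline, scheduler, training, paths, wandb, evaluation) are the ones listed above; the individual keys are assumptions for illustration only, so check the repo's own template:

```yaml
# Illustrative skeleton — section names from this README; keys are assumptions.
pipeline:
  connector: mlp            # swappable connector variant
  freeze_encoder: true      # freezing policy
scheduler:
  phases: [warmup, cosine]
training:
  batch_size: 16
  mixed_precision: bf16
  gradient_checkpointing: true
  lora:
    enabled: false
paths:
  output_dir: ocr_training_output/
wandb:
  enabled: false
evaluation:
  metric: cer
```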
  • Data: synthetic line renderer populates ocr_training_output/ (train/val/test for Hebrew/English/Mixed) with paired image/text; augmentations cover blur, noise, rotation, contrast, backgrounds, mild distortion.
  • Outputs: checkpoints, logs, evals under ocr_training_output/; optional Weights & Biases logging if WANDB_API_KEY is set.
  • Metrics: Character Error Rate (CER) per bucket (Heb/Eng/Mixed) and overall.
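
CER is the character-level Levenshtein (edit) distance divided by the reference length. A minimal stdlib sketch of the metric named above; the repo's evaluator may apply extra normalization (e.g. Unicode NFC, whitespace collapsing) before scoring:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance / len(reference)."""
    if not reference:
        return float(len(hypothesis) > 0)
    # Classic dynamic-programming edit distance, computed one row at a time.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(reference)
```

Per-bucket CER (Heb/Eng/Mixed) is then just this score averaged over the samples in each bucket.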

benchmarking/: API OCR + renderer

  • Purpose: benchmark third-party OCR APIs and generate synthetic samples for quick tests.
  • Entrypoints: bench_google_vision.py, bench_azure_vision.py, bench_gemini.py, bench_mistral_ocr.py, bench_tesseract.py; renderer: renderer.py (uses assets/fonts/hebrew plus bundled EN fonts/backgrounds).
  • Env: set in benchmarking/.env:
    • GOOGLE_VISION_API_KEY, GEMINI_API_KEY (or GOOGLE_API_KEY)
    • AZURE_VISION_ENDPOINT, AZURE_VISION_API_KEY
    • MISTRAL_API_KEY
    • HF_TOKEN (for HF-backed providers if added)
  • Data: default samples in benchmarking/test/; renderer writes square images into benchmarking/test/<lang>/.
  • Example runs:
    • python benchmarking/bench_google_vision.py --data-dir benchmarking/test/hebrew
    • python benchmarking/bench_gemini.py --data-dir benchmarking/test/mixed
    • python benchmarking/bench_azure_vision.py --data-dir benchmarking/test/english
    • python benchmarking/bench_mistral_ocr.py --data-dir benchmarking/test/mixed
    • python benchmarking/bench_tesseract.py --postfix mixed (requires the tesseract binary to be installed)
  • Outputs: printed per-sample/aggregate metrics via benchmark_ocr; renderer emits images for local testing.
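
The image/ground-truth pairing implied above can be sketched as follows. This assumes each image in `benchmarking/test/<lang>/` sits next to a same-stem `.txt` transcription file; the actual pairing and scoring logic lives in the repo's `benchmark_ocr` helper:

```python
from pathlib import Path
from typing import Iterator, Tuple

def iter_samples(data_dir: str) -> Iterator[Tuple[Path, str]]:
    """Yield (image_path, reference_text) pairs from a benchmark directory."""
    root = Path(data_dir)
    for img in sorted(root.glob("*")):
        if img.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue  # skip non-image files such as the .txt sidecars
        gt = img.with_suffix(".txt")
        if gt.is_file():
            yield img, gt.read_text(encoding="utf-8").strip()
```

A benchmark script would feed each image to the provider's API and score the response against the yielded reference text.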

hebrew_vs_english/: prompts, demos, scrapers

  • Purpose: mixed Hebrew/English prompt experiments, Gradio comparison app, prompt databases, chat clients, and scrapers.
  • Entrypoints:
    • Gradio comparison: python hebrew_vs_english/app.py (preloads paired prompts; needs Gemini/Google key).
    • Scrapers: hebrew_vs_english/scrapers/nli_scrape.py (NLI IIIF search/manifests), hebrew_vs_english/scrapers/wiki_scrape.py (Wikipedia → HTML/PDF).
    • Chat clients: wrappers for OpenAI/Replicated/Dicta in *ChatClient.py.
  • Data: prompt CSVs under hebrew_vs_english/prompt_database_*, backups under DB_backup/, test results in test_results_*.json.
  • Env: set in hebrew_vs_english/.env: GEMINI_API_KEY (or GOOGLE_API_KEY); NLI_API_KEY for the NLI scraper; chat clients require their respective provider tokens.

fine_tuning_GOT_OCR2/: GOT-OCR2 fine-tuning

  • Purpose: adapt GOT-OCR2 (ViT encoder + decoder) on custom OCR datasets.
  • Entrypoints:
    • Training: python fine_tuning_GOT_OCR2/fine_tune_vit.py --data-dir <dataset_root> --save-dir <out_dir>
    • Data prep: python fine_tuning_GOT_OCR2/get_nougat.py --num-samples 10, python fine_tuning_GOT_OCR2/get_sroie.py, python fine_tuning_GOT_OCR2/generate_data.py --num-samples 200 --langs he en
  • Data layout: fine_tuning_GOT_OCR2/dataset/ (or --data-dir) with images + sidecar .txt, or labels.csv / labels.jsonl (image,text).
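
A loader for the three layouts listed above might look like this. The `(image, text)` field names come from this README; path resolution and image validation in the repo's actual scripts may differ:

```python
import csv
import json
from pathlib import Path
from typing import Dict

def load_labels(data_dir: str) -> Dict[str, str]:
    """Map image filename -> transcription for the supported dataset layouts."""
    root = Path(data_dir)
    labels: Dict[str, str] = {}
    if (root / "labels.csv").is_file():
        with open(root / "labels.csv", newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                labels[row["image"]] = row["text"]
    elif (root / "labels.jsonl").is_file():
        with open(root / "labels.jsonl", encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                labels[rec["image"]] = rec["text"]
    else:
        # Sidecar layout: foo.png is paired with foo.txt (keyed by stem).
        for txt in root.glob("*.txt"):
            labels[txt.stem] = txt.read_text(encoding="utf-8").strip()
    return labels
```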
  • Outputs: checkpoints under fine_tuning_GOT_OCR2/checkpoints/ by default; models metadata under fine_tuning_GOT_OCR2/models/.
  • Notes: ViT-based GOT-OCR2 training is VRAM-heavy; tune batch size/precision. Base model paths/IDs are defined in the scripts.

Security and keys

  • Do not commit real API keys. .env files are git-ignored; only commit the *.env.example templates.
  • Rotate any keys that were previously stored in local .env files before making the repo public.
  • Code paths that load secrets: benchmarking/*, hebrew_vs_english/app.py, hebrew_vs_english/scrapers/nli_scrape.py, nevo_finetune/utils/config.py.

License and third-party notices

  • License: MIT (see LICENSE).
  • Third-party components:
    • benchmarking/text_renderer: MIT (license included in that folder).
    • Fonts under assets/fonts/hebrew: use according to their upstream licenses (e.g., SIL OFL/Apache as provided by the font authors). If you redistribute them, keep their license terms; replace with your own font set if needed.
