thl-google-ocr

This repo is for scripts used to OCR (generally Tibetan) texts for THL and related projects.

The simplest script, quickocr.py, makes one API call to convert a single image and repeats that call for a known series of numbered images.

Future development should add REST calls to Google Cloud Storage so that all files in a bucket can be converted.
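As a rough illustration of what converting every file in a bucket could involve, the sketch below lists the JPEGs under a folder. It is not part of this repo. It uses the google-cloud-storage client library rather than raw REST calls, and it assumes that package is installed. The function names jpeg_uris and list_bucket_jpegs are hypothetical.

```python
def jpeg_uris(bucket_name, blob_names):
    """Turn object names into sorted gs:// URIs, keeping only JPEG files."""
    keep = [n for n in blob_names if n.lower().endswith((".jpg", ".jpeg"))]
    return sorted(f"gs://{bucket_name}/{n}" for n in keep)


def list_bucket_jpegs(bucket_name, prefix=""):
    """List every JPEG in a bucket, optionally limited to one folder prefix."""
    # Assumption: the google-cloud-storage package is installed and
    # GOOGLE_APPLICATION_CREDENTIALS is set as described under
    # "Set up Authentication".
    from google.cloud import storage

    client = storage.Client()
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return jpeg_uris(bucket_name, [blob.name for blob in blobs])
```

A call such as list_bucket_jpegs('ngb-peltsek-scans', prefix='ngb_vol58jpg/') would then return the URIs to hand to the OCR step, with no need to know the page numbers in advance.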

The scripts use the Google Vision API and other Google APIs. See:

Instructions

Convert Original Images to JPEGs

If the original scans are .tiff files, they need to be converted to .jpg files. This can be done on a Mac with the script in this repo called convimgs.sh. The Mac needs to have ImageMagick installed through Homebrew, e.g. brew install imagemagick.

The script assumes that the .tiff files are in a sibling folder called "tiffs" and that there is an output folder called "jpgs". Both directory names can be changed in the script itself. To run the script, type ./convimgs.sh and it will convert all images in the tiffs folder.
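For anyone who would rather run the conversion from Python, the same step can be sketched as follows. This is not the contents of convimgs.sh. It is a minimal sketch that assumes ImageMagick is on the PATH (as magick in version 7 or convert in version 6) and reuses the "tiffs" and "jpgs" folder names as defaults. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def plan_conversions(src_dir, dst_dir):
    """Pair every .tif/.tiff file in src_dir with a .jpg path in dst_dir."""
    pairs = []
    for tiff in sorted(Path(src_dir).iterdir()):
        if tiff.suffix.lower() in (".tif", ".tiff"):
            pairs.append((tiff, Path(dst_dir) / (tiff.stem + ".jpg")))
    return pairs


def convert_all(src_dir="tiffs", dst_dir="jpgs"):
    """Convert each TIFF to a JPEG by calling the ImageMagick command line."""
    # ImageMagick 7 installs `magick`; version 6 installs `convert`.
    tool = shutil.which("magick") or shutil.which("convert")
    if tool is None:
        raise RuntimeError("ImageMagick not found; try: brew install imagemagick")
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for tiff, jpg in plan_conversions(src_dir, dst_dir):
        # check=True stops at the first file ImageMagick cannot convert.
        subprocess.run([tool, str(tiff), str(jpg)], check=True)
```

Calling convert_all() from the folder that contains "tiffs" writes one .jpg per input file into "jpgs", creating that folder if it does not exist.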

Upload JPEGs to Google Cloud

In the CSC Google Cloud Console (https://console.cloud.google.com/storage/browser/), there is a "Google OCR Project (woven-invention-254116)". Choose this project and go to the storage section. There are buckets there to hold the images. Use an existing bucket or create your own, and upload all images into a folder in that bucket. The example used here is the ngb-peltsek-scans bucket.

Set up Authentication

Go to the Service Accounts section for the project, create a key, and download the JSON file for the key. Then, in a terminal, run:

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"

with the path pointing to your JSON file. This allows your script to access the storage bucket.
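A quick way to confirm the variable is visible to Python before running the OCR is sketched below. It only checks that the variable is set and that it points to a readable JSON file. It does not contact Google or validate the key. The function name check_credentials is hypothetical.

```python
import json
import os


def check_credentials(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return the key file path if the variable points at a readable JSON file."""
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(f"{env_var} is not set in this shell")
    if not os.path.isfile(path):
        raise RuntimeError(f"{env_var} points to a missing file: {path}")
    with open(path, encoding="utf-8") as fh:
        json.load(fh)  # raises ValueError if the file is not valid JSON
    return path
```

Note that export only affects the current terminal session, so the check needs to be run from the same terminal (or an IDE launched with the same environment).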

Run the OCR Script

Edit the script quickocr.py and modify the variables at the top of the script to match your situation:

bucket = 'ngb-peltsek-scans'
folder = 'ngb_vol58jpg'
startpg = 0
endpg = 100
outdir = 'docs'
outfilenm = 'pt-vol058-ocr-1-100.txt'

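The names are meant to be self-explanatory: bucket and folder locate the images in Google Cloud Storage, startpg and endpg give the range of page numbers, and outdir and outfilenm name the output directory and file. The output directory is relative to the location of the script itself.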
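To show how these variables could fit together, here is a minimal sketch of a per-page OCR loop. It is not the code in quickocr.py. It assumes the google-cloud-vision package is installed, that images are named by zero-padded page number (a hypothetical pattern such as 0007.jpg), and that endpg is inclusive. The helper names are also hypothetical.

```python
import os

# Values copied from the example settings above.
bucket = 'ngb-peltsek-scans'
folder = 'ngb_vol58jpg'
startpg = 0
endpg = 100
outdir = 'docs'
outfilenm = 'pt-vol058-ocr-1-100.txt'


def image_uri(page, pattern="{page:04d}.jpg"):
    """Build the gs:// URI for one page. The file name pattern is an assumption."""
    return f"gs://{bucket}/{folder}/{pattern.format(page=page)}"


def output_path(base_dir):
    """Join the output directory and file name under base_dir."""
    return os.path.join(base_dir, outdir, outfilenm)


def ocr_page(client, uri):
    """Run document text detection on one image stored in the bucket."""
    from google.cloud import vision  # assumption: google-cloud-vision is installed

    image = vision.Image()
    image.source.image_uri = uri
    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(f"{uri}: {response.error.message}")
    return response.full_text_annotation.text


def run(base_dir):
    """OCR every page in the range and append the text to one output file."""
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    path = output_path(base_dir)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as out:
        for page in range(startpg, endpg + 1):  # assumes endpg is inclusive
            out.write(ocr_page(client, image_uri(page)) + "\n")
```

Passing the script's own directory as base_dir, for example run(os.path.dirname(os.path.abspath(__file__))), keeps the output directory relative to the script.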

Then run the script with python quickocr.py in a terminal or through your IDE.

Than Grove Jan 13, 2021 (modified June 24, 2021)
