A command-line tool that performs OCR and extracts text from a given scanned document image.
Scanned document images (English text only). The command accepts the following file formats:
- PNG
- JPEG
The extracted text, written to a text file.
python your_code.py --input=./test.pdf --output=output.text --verbose
- --input : input file
- --output : output text file
- --verbose : verbose mode (output detailed logs)
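The three options above can be sketched as a small argument parser. This is a minimal illustration only: the requirements call for Click, and the standard library's argparse is swapped in here so the sketch runs without third-party packages. The function name `parse_args` and the choice to make both paths required are assumptions, not part of the original specification.

```python
import argparse


def parse_args(argv=None):
    """Parse the three documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Extract text from a scanned document image.")
    # Assumption: both paths are mandatory; the source does not say.
    parser.add_argument("--input", required=True, help="input file")
    parser.add_argument("--output", required=True, help="output text file")
    # A flag without a value: present means True, absent means False.
    parser.add_argument("--verbose", action="store_true",
                        help="verbose mode (output detailed logs)")
    return parser.parse_args(argv)
```

Calling `parse_args(["--input=./test.pdf", "--output=output.text", "--verbose"])` mirrors the command line shown above. In a Click version, the same options would typically be declared with `click.option` decorators on the command function.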
Before running OCR with Tesseract OCR, the tool pre-processes the image to improve OCR accuracy.
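The source does not say which pre-processing steps are applied. One common choice for scanned text is binarization with Otsu's threshold, sketched below in pure Python on a small grayscale image (a list of rows of 0-255 integers). The function names `otsu_threshold` and `binarize` are hypothetical. A real pipeline would more likely use OpenCV, for example `cv2.threshold` with the `cv2.THRESH_OTSU` flag.

```python
def otsu_threshold(pixels):
    """Return the threshold t that maximizes between-class variance.

    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map each pixel to 0 or 255 using the Otsu threshold of the image."""
    t = otsu_threshold([p for row in image for p in row])
    return [[255 if p > t else 0 for p in row] for row in image]
```

Binarization separates dark text from a lighter background, which tends to help OCR engines on unevenly scanned pages. Whether it helps on a given document is something to check empirically.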
- Uses Click as a command-line interface builder
- Uses Poetry to install required third-party packages
- Uses yapf and isort to format Python code
- Uses the logging package for output. Never use print for log output.
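The logging requirement could be met along these lines. This is a sketch under assumptions: the logger name `ocr_tool`, the helper `configure_logging`, and the choice of DEBUG for verbose mode and WARNING otherwise are illustrative, not taken from the source.

```python
import logging

# Hypothetical logger name; a real project would pick its own.
logger = logging.getLogger("ocr_tool")


def configure_logging(verbose):
    """Send log records to stderr; show detailed logs only in verbose mode."""
    # basicConfig installs a stderr handler on the root logger
    # (it does nothing if the root logger already has handlers).
    logging.basicConfig(
        format="%(asctime)s %(levelname)s %(name)s: %(message)s")
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
```

After `configure_logging(True)`, a call such as `logger.debug("loaded %s", path)` is emitted; with `configure_logging(False)` it is suppressed while warnings and errors still appear.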
This application performs Optical Character Recognition (OCR) on PDF and image files (PNG, JPG) using the easyocr and pdf2image libraries. The application is built with Streamlit for a user-friendly web interface.
- Extract text from PDF files.
- Extract text from image files (PNG, JPG).
- Supports verbose logging for detailed processing information.
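The two extraction paths listed above might be wired together roughly as follows. This is a sketch, not the application's actual code: `load_pages`, `extract_text`, and `IMAGE_SUFFIXES` are hypothetical names, and the `reader` argument is assumed to be an `easyocr.Reader` (or any object with a compatible `readtext` method). The third-party imports are deferred so that image-only use does not require pdf2image or Poppler.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def load_pages(path):
    """Return a list of inputs that the reader's readtext() can consume."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # Imported lazily: only PDF input needs these packages (and Poppler).
        import numpy as np
        from pdf2image import convert_from_path
        # convert_from_path returns one PIL image per PDF page.
        return [np.array(page) for page in convert_from_path(str(path))]
    if suffix in IMAGE_SUFFIXES:
        return [str(path)]  # easyocr accepts a file path directly
    raise ValueError(f"Unsupported file type: {suffix}")


def extract_text(path, reader):
    """Run OCR on every page and join the recognized lines."""
    lines = []
    for page in load_pages(path):
        # detail=0 asks easyocr for plain strings instead of boxes and scores.
        lines.extend(reader.readtext(page, detail=0))
    return "\n".join(lines)
```

With the real libraries installed, a reader created as `easyocr.Reader(["en"])` can be passed to `extract_text("scan.png", reader)`. Creating the reader once and reusing it avoids reloading the model for every file.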
- Python 3.7+
- Streamlit
- OpenCV
- EasyOCR
- PDF2Image
- Poppler (for PDF support)
- Clone the repository:
  git clone https://github.com/yourusername/ocr-streamlit-app.git
  cd ocr-streamlit-app
- Set up a virtual environment (optional but recommended):
  python -m venv venv
  source venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install the required Python packages:
  pip install -r requirements.txt
- Install Poppler:
  Windows:
  - Download the latest release from Poppler for Windows.
  - Extract the ZIP file to a directory, e.g., C:\poppler.
  - Add the bin directory of Poppler to your system PATH:
    - Open the Start Menu and search for "Environment Variables".
    - Select "Edit the system environment variables".
    - Click "Environment Variables...".
    - Under "System variables", find and select the Path variable, then click "Edit...".
    - Click "New" and add the path to the bin directory, e.g., C:\poppler\bin.
  macOS:
  - Install Homebrew if not already installed:
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  - Install Poppler using Homebrew:
    brew install poppler
  Linux:
  - For Debian-based systems (Ubuntu):
    sudo apt-get install poppler-utils
  - For Red Hat-based systems (Fedora):
    sudo dnf install poppler-utils
- Start the Streamlit app:
  streamlit run ocr_app.py
- Open your web browser and go to:
  http://localhost:8501
- Upload a PDF or image file and extract text.
Here's a quick example of how to use the application:
- Run the Streamlit app.
- Upload a PDF or image file.
- Click on the "Process" button.
- The extracted text will be displayed on the web page.
- AttributeError: module 'PIL.Image' has no attribute 'Resampling':
  - Upgrade Pillow to the latest version:
    pip install --upgrade pillow
- PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?:
  - Ensure that Poppler is installed and the path to its bin directory is added to your system PATH.
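To check the Poppler setup from Python before processing a PDF, a lookup like the one below may help. It is a minimal sketch: `missing_poppler_tools` and `POPPLER_TOOLS` are hypothetical names, and the check assumes that pdf2image relies on the `pdfinfo` and `pdftoppm` executables being discoverable on PATH.

```python
import shutil

# Poppler command-line tools that pdf2image is assumed to call.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def missing_poppler_tools(which=shutil.which):
    """Return the Poppler executables that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [tool for tool in POPPLER_TOOLS if which(tool) is None]
```

An empty list from `missing_poppler_tools()` suggests PATH is set up correctly. A non-empty list names the executables to look for in Poppler's bin directory. The `which` parameter exists only so the lookup can be replaced in tests.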
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please open an issue or submit a pull request for any changes.