nexus-ocr

Task:

A command-line tool that runs OCR on a scanned document image and extracts its text.

Input

Scanned document images (English text only). The command accepts the following file formats:

PNG
JPEG
PDF

Output

The extracted text, written to a text file.

python your_code.py --input=./test.pdf --output=output.text --verbose

Interface

--input : input file
--output : output text file
--verbose : verbose mode (output detailed logs)
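
The three options above could be declared with Click roughly as follows. This is a minimal sketch, not the project's actual entry point: the function name `main`, the parameter names `input_path` and `output_path`, and the logger name `nexus_ocr` are assumptions, and the OCR step is left as a placeholder.

```python
import logging

import click

logger = logging.getLogger("nexus_ocr")  # hypothetical logger name


@click.command()
@click.option("--input", "input_path", required=True,
              type=click.Path(exists=True, dir_okay=False),
              help="Input file (PNG, JPEG, or PDF).")
@click.option("--output", "output_path", required=True,
              type=click.Path(dir_okay=False),
              help="Output text file.")
@click.option("--verbose", is_flag=True, help="Output detailed logs.")
def main(input_path, output_path, verbose):
    """Run OCR on INPUT and write the extracted text to OUTPUT."""
    # Detailed logs only when --verbose is given.
    logging.basicConfig(level=logging.DEBUG if verbose else logging.INFO)
    logger.debug("Reading %s", input_path)

    text = ""  # placeholder: pre-processing and OCR would happen here

    with open(output_path, "w", encoding="utf-8") as fh:
        fh.write(text)
    logger.info("Wrote %s", output_path)
```

To use this as a script, call `main()` under the usual `if __name__ == "__main__":` guard. Mapping `--input` to `input_path` avoids shadowing Python's built-in `input`.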

Requirements

Before running OCR with Tesseract OCR, the tool pre-processes the image to improve OCR accuracy.
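
This README does not say which pre-processing steps are applied. One common choice for scanned text is binarization with Otsu's threshold, so the sketch below illustrates that technique in plain Python on a flat list of 8-bit grayscale values. The function names `otsu_threshold` and `binarize` are hypothetical. With OpenCV, `cv2.threshold` with the `cv2.THRESH_BINARY | cv2.THRESH_OTSU` flags performs the equivalent operation on an image array.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold (0-255) for an iterable of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray values
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor).
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map values above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

For example, `binarize(pixels, otsu_threshold(pixels))` turns a gray scan into a black-and-white one, which tends to help OCR engines on unevenly lit pages.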

Regulations

Uses Click as the command-line interface builder.
Uses Poetry to install required third-party packages.
Uses yapf and isort to format Python code.
Uses the logging package for output. Never use print for log output.
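
As an illustration of the logging rule, the helper below routes all output through the standard `logging` package and raises the level when verbose mode is on. The helper name `configure_logging`, the logger name `nexus_ocr`, and the message format are assumptions made for this sketch.

```python
import logging


def configure_logging(verbose: bool) -> logging.Logger:
    """Return a logger that emits DEBUG records only in verbose mode."""
    logger = logging.getLogger("nexus_ocr")  # hypothetical logger name
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)

    # Attach a handler once, so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    return logger
```

A call such as `configure_logging(verbose=True).debug("Loaded %d pages", 3)` then replaces what would otherwise be a `print` statement.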

OCR Streamlit Application

This application performs Optical Character Recognition (OCR) on PDF and image files (PNG, JPG) using easyocr and pdf2image libraries. The application is built using Streamlit for a user-friendly web interface.

Features

Extract text from PDF files.
Extract text from image files (PNG, JPG).
Supports verbose logging for detailed processing information.
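
Because PDFs and images take different paths (PDF pages have to be rendered to images before OCR), the app needs to tell the two apart. A minimal sketch of that dispatch, assuming the decision is made on the file extension; the function name `classify_upload` and the set `IMAGE_SUFFIXES` are hypothetical and not taken from `ocr_app.py`:

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed set of accepted image types


def classify_upload(filename: str) -> str:
    """Return "pdf" or "image" for a supported file name, else raise ValueError."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    raise ValueError(f"Unsupported file type: {suffix or filename!r}")
```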

Requirements

Python 3.7+
Streamlit
OpenCV
EasyOCR
PDF2Image
Poppler (for PDF support)

Installation

Clone the repository:

git clone https://github.com/yourusername/ocr-streamlit-app.git
cd ocr-streamlit-app

Set up a virtual environment (optional but recommended):

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install the required Python packages:
```
pip install -r requirements.txt
```
Install Poppler:

On Windows:
- Download the latest release from Poppler for Windows.
- Extract the ZIP file to a directory, e.g., C:\poppler.
- Add the bin directory of Poppler to your system PATH.
  - Open the Start Menu and search for "Environment Variables".
  - Select "Edit the system environment variables".
  - Click "Environment Variables...".
  - Under "System variables", find and select the Path variable, then click "Edit...".
  - Click "New" and add the path to the bin directory, e.g., C:\poppler\bin.
On macOS:
- Install Homebrew if not already installed:
```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- Install Poppler using Homebrew:
```
brew install poppler
```
On Linux:
- For Debian-based systems (Ubuntu):
```
sudo apt-get install poppler-utils
```
- For Red Hat-based systems (Fedora):
```
sudo dnf install poppler-utils
```

Running the Application

Start the Streamlit app:
```
streamlit run ocr_app.py
```
Open your web browser and go to:
```
http://localhost:8501
```
Upload a PDF or image file and extract text.

Example

Here's a quick example of how to use the application:

Run the Streamlit app.
Upload a PDF or image file.
Click on the "Process" button.
The extracted text will be displayed on the web page.

Troubleshooting

AttributeError: module 'PIL.Image' has no attribute 'Resampling':
- Upgrade Pillow to the latest version:
```
pip install --upgrade pillow
```
PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?:
- Ensure that Poppler is installed and the path to its bin directory is added to your system PATH.
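
To confirm whether Poppler is visible on PATH, a quick check from Python can help. This is a sketch under the assumption that the `pdfinfo` and `pdftoppm` executables are the ones pdf2image needs; the function name `missing_poppler_tools` is hypothetical.

```python
import shutil


def missing_poppler_tools(tools=("pdfinfo", "pdftoppm")):
    """Return the names of Poppler executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list means the tools were found. If changing PATH is not an option, pdf2image's `convert_from_path` also accepts a `poppler_path` argument pointing at the Poppler bin directory.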

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any changes.
