Commit d2b073c

update tutorials and add layout parser docs
1 parent fbc4858 commit d2b073c

4 files changed, +54 -7 lines changed

docs/index.rst

+5
@@ -11,6 +11,11 @@ with a simple and intuitive interface and a powerful Pipeline API.
 It unifies the multitude of interfaces provided by a wide range of cloud tools & other open-source libraries
 and provides a simple, easy-to-use interface for the user.

+.. image:: _static/ocrpy-overview-plain.png
+   :align: center
+   :alt: ocrpy overview
+   :height: 400px
+
 Getting Started
 ===============

docs/overview.rst

+28
@@ -2,3 +2,31 @@
 Overview
 ========

+
+.. image:: _static/ocrpy-overview-plain.png
+   :align: center
+   :alt: ocrpy overview
+   :height: 400px
+
+
+At its core, ocrpy is a Python library that can read Pdf and Image documents from Any cloud
+storage service or a local file system, and then perform a set of operations on these documents like identification of document type,
+parsing the layout of the document, and extracting text &/or tables from the document and then writing the results to cloud storage,
+local file system or a database.
+
+Ocrpy Internal System Architecture
+----------------------------------
+
+.. image:: _static/ocrpy-architecture.png
+   :align: center
+   :alt: ocrpy architecture
+   :height: 400px
+
+- Read - Parse - Write Pipeline (R-P-W) Detils
+- Document classification Details
+- Document layout parsing Details
+- Read - Parse - Index Pipeline (R-P-I) Detils
+
+
+Ocrpy Features Overview
+-----------------------

docs/tutorials.rst

+18-2
@@ -1,5 +1,21 @@
 Tutorials
 =========

-- Please refer to `Ocrpy Quick starter Notebook <https://github.com/maxent-ai/ocrpy/blob/main/notebooks/ocrpy_usage.ipynb>`_ for an in-depth walkthrough of how to use Ocrpy.
-- Also checkout the `Ocr, index and Search Notebook <https://github.com/maxent-ai/ocrpy/blob/main/notebooks/ocrpy_with_haystack.ipynb>`_ for an in-depth walkthrough of how to use Ocrpy to ocr, index and do semantic search.
+Getting Started
+---------------
+
+Please refer to `Ocrpy Quick starter Notebook <https://github.com/maxent-ai/ocrpy/blob/main/notebooks/ocrpy_usage.ipynb>`_ for an in-depth walkthrough of how to use Ocrpy.
+This notebook gives an overview of how to use Ocrpy to perform:
+
+- Document classification
+- Layout Parsing
+- Table extraction
+- Running a full Text Ocr pipeline
+- and writing the extracted output to a storage of choice.
+
+Ocr, Index and Search
+---------------------
+
+Also checkout the `Ocr, index and Search Notebook <https://github.com/maxent-ai/ocrpy/blob/main/notebooks/ocrpy_with_haystack.ipynb>`_ for an in-depth walkthrough of how to use Ocrpy to ocr, index and do semantic search.
+In this Notebook, you will find how to use ocrpy to extract text and tables from your pdf's and images and then index the extracted the data to either
+opensearch, elasticsearch or a mysql database & then query the indexed collection of docs via semantic search.

ocrpy/experimental/layout_parser.py

+3-5
@@ -2,7 +2,6 @@
 import re
 from PIL import Image
 from typing import List
-import layoutparser as lp
 from attrs import define, field
 from ..parsers import TextParser
 from ..io.reader import DocumentReader
@@ -35,10 +34,9 @@ class DocumentLayoutParser:

     Note
     ----
-    - The model is trained on the Publaynet dataset and can detect the following blocks from the document:
-      text, title, list, table, figure
+    - The model is trained on the Publaynet dataset and can detect the following blocks from the document: text, title, list, table, figure

-    - For more information on the datase please refer this paper: https://arxiv.org/abs/1908.07836
+    - For more information on the dataset please refer this paper: https://arxiv.org/abs/1908.07836

     """
     model_name: str = field(
@@ -109,7 +107,7 @@ def _update_blocks(self, blocks, tokens, meta_data=None):
             blocks_list.append(self._block_formatter(block, tokens, meta_data))
         return blocks_list

-    def parser(self, reader: DocumentReader, ocr: TextParser) -> List:
+    def parse(self, reader: DocumentReader, ocr: TextParser) -> List:
         """
         Predict the document type of the document in the reader.

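Note on the rename in ocrpy/experimental/layout_parser.py: the public method changes from `parser` to `parse`, so callers still using the old name would fail with `AttributeError` after this commit. The commit itself does not add a compatibility shim. The following is only a minimal sketch of one way such a shim could look, assuming backward compatibility is wanted. `LayoutParserSketch` and its placeholder body are hypothetical stand-ins, not the ocrpy implementation.

```python
import warnings
from typing import List


class LayoutParserSketch:
    """Hypothetical stand-in for DocumentLayoutParser (not ocrpy code)."""

    def parse(self, reader, ocr) -> List:
        # Placeholder body (assumption): the real method would run layout
        # detection and OCR; here the inputs are simply echoed back.
        return [{"reader": reader, "ocr": ocr}]

    def parser(self, reader, ocr) -> List:
        # Deprecated alias for the pre-rename method name.
        # It emits a DeprecationWarning and forwards to parse().
        warnings.warn(
            "parser() was renamed to parse()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.parse(reader, ocr)
```

With a shim like this, existing calls to `parser(reader, ocr)` keep working while the warning points users to `parse(reader, ocr)`.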