update tutorials and add layout parser docs

bharathgs · bharathgs · commit d2b073c0370c · 2022-08-08T17:27:43.000+05:30
diff --git a/docs/index.rst b/docs/index.rst
@@ -11,6 +11,11 @@ with a simple and intuitive interface and a powerful Pipeline API.
 It unifies the multitude of interfaces provided by a wide range of cloud tools & other open-source libraries 
 and provides a simple, easy-to-use interface for the user.
 
+.. image:: _static/ocrpy-overview-plain.png
+   :align: center
+   :alt: ocrpy overview
+   :height: 400px
+
 Getting Started
 ===============
 
diff --git a/docs/overview.rst b/docs/overview.rst
@@ -2,3 +2,31 @@
 Overview
 ========
 
+
+.. image:: _static/ocrpy-overview-plain.png
+   :align: center
+   :alt: ocrpy overview
+   :height: 400px
+
+
+At its core, ocrpy is a Python library that can read PDF and image documents from any cloud
+storage service or a local file system, perform a set of operations on them, such as identifying the document type,
+parsing the document layout, and extracting text and/or tables, and then write the results to cloud storage,
+a local file system, or a database.
+
+Ocrpy Internal System Architecture
+----------------------------------
+
+.. image:: _static/ocrpy-architecture.png
+   :align: center
+   :alt: ocrpy architecture
+   :height: 400px
+
+- Read - Parse - Write Pipeline (R-P-W) Details
+- Document classification Details
+- Document layout parsing Details
+- Read - Parse - Index Pipeline (R-P-I) Details
+
+
+Ocrpy Features Overview
+-----------------------
diff --git a/docs/tutorials.rst b/docs/tutorials.rst
@@ -1,5 +1,21 @@
 Tutorials
 =========
 
-- Please refer to `Ocrpy Quick starter Notebook <https://github.com/maxent-ai/ocrpy/blob/main/notebooks/ocrpy_usage.ipynb>`_ for an in-depth walkthrough of how to use Ocrpy.
-- Also checkout the `Ocr, index and Search Notebook <https://github.com/maxent-ai/ocrpy/blob/main/notebooks/ocrpy_with_haystack.ipynb>`_ for an in-depth walkthrough of how to use Ocrpy to ocr, index and do semantic search.
+Getting Started
+---------------
+
+Please refer to `Ocrpy Quick starter Notebook <https://github.com/maxent-ai/ocrpy/blob/main/notebooks/ocrpy_usage.ipynb>`_ for an in-depth walkthrough of how to use Ocrpy.
+This notebook gives an overview of how to use Ocrpy for:
+
+- Document classification
+- Layout parsing
+- Table extraction
+- Running a full text OCR pipeline
+- Writing the extracted output to a storage backend of your choice
+
+Ocr, Index and Search
+---------------------
+
+Also check out the `Ocr, index and Search Notebook <https://github.com/maxent-ai/ocrpy/blob/main/notebooks/ocrpy_with_haystack.ipynb>`_ for an in-depth walkthrough of how to use Ocrpy to OCR, index, and run semantic search.
+In this notebook, you will learn how to use ocrpy to extract text and tables from your PDFs and images, index the extracted data in either
+OpenSearch, Elasticsearch, or a MySQL database, and then query the indexed collection of documents via semantic search.
diff --git a/ocrpy/experimental/layout_parser.py b/ocrpy/experimental/layout_parser.py
@@ -2,7 +2,6 @@
 import re
 from PIL import Image
 from typing import List
-import layoutparser as lp
 from attrs import define, field
 from ..parsers import TextParser
 from ..io.reader import DocumentReader
@@ -35,10 +34,9 @@ class DocumentLayoutParser:
 
     Note
     ----
-    - The model is trained on the Publaynet dataset and can detect the following blocks from the document:
-    text, title, list, table, figure
+    - The model is trained on the Publaynet dataset and can detect the following blocks from the document: text, title, list, table, figure
 
-    - For more information on the datase please refer this paper: https://arxiv.org/abs/1908.07836
+    - For more information on the dataset, please refer to this paper: https://arxiv.org/abs/1908.07836
 
     """
     model_name: str = field(
@@ -109,7 +107,7 @@ def _update_blocks(self, blocks, tokens, meta_data=None):
             blocks_list.append(self._block_formatter(block, tokens, meta_data))
         return blocks_list
 
-    def parser(self, reader: DocumentReader, ocr: TextParser) -> List:
+    def parse(self, reader: DocumentReader, ocr: TextParser) -> List:
         """
         Predict the document type of the document in the reader.