
Commit fa2835b

committed
Initial Version
1 parent be5239d commit fa2835b

35 files changed (+25849 −7 lines)

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+.ipynb_checkpoints
+.DS_Store

01-idp-document-classification.ipynb

Lines changed: 864 additions & 0 deletions
Large diffs are not rendered by default.

02-idp-document-extraction.ipynb

Lines changed: 254 additions & 0 deletions
@@ -0,0 +1,254 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Document Extraction\n",
"\n",
"In this lab, we will look at how to extract table information from documents.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"- [Step 1: Setup notebook](#step1)\n",
"- [Step 2: Extract table from a sample doc using Amazon Textract](#step2)\n",
"- [Step 3: Look at the other ways to extract structured and semi-structured data using Textract](#step3)\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 1: Setup notebook <a id=\"step1\"></a>\n",
"\n",
"In this step, we will import some necessary libraries that will be used throughout this notebook. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"import botocore\n",
"import sagemaker\n",
"import pandas as pd\n",
"import os\n",
"import random\n",
"from IPython.display import display\n",
"from textractcaller.t_call import call_textract, Textract_Features\n",
"from textractprettyprinter.t_pretty_print import convert_table_to_list\n",
"from trp import Document\n",
"\n",
"# variables\n",
"data_bucket = sagemaker.Session().default_bucket()\n",
"region = boto3.session.Session().region_name\n",
"account_id = boto3.client('sts').get_caller_identity().get('Account')\n",
"\n",
"os.environ[\"BUCKET\"] = data_bucket\n",
"os.environ[\"REGION\"] = region\n",
"role = sagemaker.get_execution_role()\n",
"\n",
"print(f\"SageMaker role is: {role}\\nDefault SageMaker Bucket: s3://{data_bucket}\")\n",
"\n",
"s3 = boto3.client('s3')\n",
"textract = boto3.client('textract', region_name=region)\n",
"comprehend = boto3.client('comprehend', region_name=region)\n",
"\n",
"%store -r document_classifier_arn\n",
"print(f\"Amazon Comprehend Custom Classifier ARN: {document_classifier_arn}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Step 2: Extract table using Amazon Textract <a id=\"step2\"></a>\n",
"\n",
"In this step, we will take a brief look at how to extract table information from one of the bank statements in our dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prefix = 'idp/comprehend/classified-docs/bank-statements'\n",
"start_after = 'idp/comprehend/classified-docs/bank-statements/'\n",
"\n",
"paginator = s3.get_paginator('list_objects_v2')\n",
"operation_parameters = {'Bucket': data_bucket,\n",
"                        'Prefix': prefix,\n",
"                        'StartAfter': start_after}\n",
"list_items = []\n",
"page_iterator = paginator.paginate(**operation_parameters)\n",
"\n",
"for page in page_iterator:\n",
"    if \"Contents\" in page:\n",
"        for item in page['Contents']:\n",
"            print(item['Key'])\n",
"            list_items.append(f's3://{data_bucket}/{item[\"Key\"]}')\n",
"\n",
"# fall back to a local sample document if no classified documents were found\n",
"if not list_items:\n",
"    list_items.append('./samples/mixedbag/document_0.png')\n",
"list_items"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's select a random bank statement from the list."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"file = random.sample(list_items, k=1)[0]  # select a random bank statement document from the list\n",
"file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our bank statements contain two tables. We will see how to extract them using the `textractprettyprinter` tool."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"resp = call_textract(input_document=file, features=[Textract_Features.TABLES])\n",
"tdoc = Document(resp)\n",
"dfs = list()\n",
"\n",
"for page in tdoc.pages:\n",
"    for table in page.tables:\n",
"        dfs.append(pd.DataFrame(convert_table_to_list(trp_table=table)))\n",
"\n",
"df1 = dfs[0]\n",
"df2 = dfs[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Step 3: Extract structured and semi-structured data using Amazon Textract <a id=\"step3\"></a>\n",
"\n",
"Let's look at some of the other ways Amazon Textract can extract structured as well as semi-structured data from documents. We will pull in a single notebook from the Amazon Textract [code sample repository](https://github.com/aws-samples/amazon-textract-code-samples/tree/master/python) and look at a few specific functionalities.\n",
"\n",
"Run the code cell below to download the notebook. Once the notebook named `02-idp-document-extraction-01.ipynb` appears, open it and work through the following sections, which demonstrate how to extract form data and table data using Amazon Textract:\n",
"\n",
"- Section 8. Forms: Key/Values\n",
"- Section 10. Tables\n",
"- Section 12. Invoices and Receipts processing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget 'https://github.com/aws-samples/amazon-textract-code-samples/raw/master/python/Textract.ipynb' -O './02-idp-document-extraction-01.ipynb'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"You can further explore all Amazon Textract capabilities by cloning the entire code repository using the `git clone` command below.\n",
"\n",
"`git clone https://github.com/aws-samples/amazon-textract-code-samples`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Cleanup\n",
"\n",
"Cleanup is optional; skip it if you want to execute the subsequent notebooks.\n",
"\n",
"Refer to `05-idp-cleanup.ipynb` for cleanup and deletion of resources."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Conclusion\n",
"\n",
"In this notebook we extracted tables from a bank statement and looked at a few additional ways Amazon Textract can help extract specific structured and semi-structured data, such as form data, from our documents. In the next notebook we will extract entity information from our documents using Amazon Comprehend."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (Data Science)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
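The Step 2 cells in the notebook above depend on live AWS resources (an S3 bucket of classified documents). As a minimal offline sketch of the same selection logic — build `s3://` URIs from listed keys and fall back to a local sample when nothing was classified — with `keys` and `bucket` as hypothetical stand-ins for the real S3 listing:

```python
import random

# Sketch of the notebook's document-selection step, with the S3 listing
# stubbed out. `keys` and `bucket` are hypothetical stand-ins.
def pick_document(keys, bucket, fallback='./samples/mixedbag/document_0.png'):
    """Return one document URI, falling back to a local sample if none exist."""
    items = [f's3://{bucket}/{key}' for key in keys] or [fallback]
    return random.sample(items, k=1)[0]
```

With an empty key list this returns the local sample path; otherwise it returns one of the constructed S3 URIs at random.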

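Under the hood, `convert_table_to_list` walks Textract's block graph (TABLE → CELL → WORD) and rebuilds rows from each cell's `RowIndex`/`ColumnIndex`. A self-contained sketch of that flattening, run against a hand-made stand-in response rather than a real Textract result:

```python
# Sketch of flattening Textract TABLES output into rows, in the spirit of
# textractprettyprinter's convert_table_to_list. The `sample` response below
# is a hand-made stand-in, not real Textract output.

def table_to_rows(response):
    """Rebuild a 2-D list from CELL blocks using RowIndex/ColumnIndex."""
    blocks = {b['Id']: b for b in response['Blocks']}

    def child_ids(block):
        # CHILD relationships point from a CELL to its WORD blocks
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'CHILD':
                return rel['Ids']
        return []

    rows = {}
    for block in response['Blocks']:
        if block['BlockType'] == 'CELL':
            text = ' '.join(blocks[i]['Text'] for i in child_ids(block))
            rows.setdefault(block['RowIndex'], {})[block['ColumnIndex']] = text
    return [[cols[c] for c in sorted(cols)] for _, cols in sorted(rows.items())]

# Stand-in response with a 2x2 table: a header row plus one data row
sample = {'Blocks': [
    {'Id': 'w1', 'BlockType': 'WORD', 'Text': 'Date'},
    {'Id': 'w2', 'BlockType': 'WORD', 'Text': 'Amount'},
    {'Id': 'w3', 'BlockType': 'WORD', 'Text': '2022-01-01'},
    {'Id': 'w4', 'BlockType': 'WORD', 'Text': '100.00'},
    {'Id': 'c1', 'BlockType': 'CELL', 'RowIndex': 1, 'ColumnIndex': 1,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w1']}]},
    {'Id': 'c2', 'BlockType': 'CELL', 'RowIndex': 1, 'ColumnIndex': 2,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w2']}]},
    {'Id': 'c3', 'BlockType': 'CELL', 'RowIndex': 2, 'ColumnIndex': 1,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w3']}]},
    {'Id': 'c4', 'BlockType': 'CELL', 'RowIndex': 2, 'ColumnIndex': 2,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w4']}]},
]}

print(table_to_rows(sample))  # [['Date', 'Amount'], ['2022-01-01', '100.00']]
```

Wrapping the returned rows in `pd.DataFrame(...)`, as the notebook does, then yields one DataFrame per table.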