
Commit fa2835b

committed
Initial Version
1 parent be5239d commit fa2835b

35 files changed (+25849 −7 lines)

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+.ipynb_checkpoints
+.DS_Store

01-idp-document-classification.ipynb

Lines changed: 864 additions & 0 deletions
Large diffs are not rendered by default.

02-idp-document-extraction.ipynb

Lines changed: 254 additions & 0 deletions
@@ -0,0 +1,254 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Document Extraction\n",
"\n",
"In this lab, we will look at how to extract table information from documents.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"- [Step 1: Setup notebook](#step1)\n",
"- [Step 2: Extract table from a sample doc using Amazon Textract](#step2)\n",
"- [Step 3: Look at the other ways to extract structured and semi-structured data using Textract](#step3)\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 1: Setup notebook <a id=\"step1\"></a>\n",
"\n",
"In this step, we will import some necessary libraries that will be used throughout this notebook. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"import botocore\n",
"import sagemaker\n",
"import pandas as pd\n",
"import os\n",
"import random\n",
"from IPython.display import display\n",
"from textractcaller.t_call import call_textract, Textract_Features\n",
"from textractprettyprinter.t_pretty_print import convert_table_to_list\n",
"from trp import Document\n",
"\n",
"# variables\n",
"data_bucket = sagemaker.Session().default_bucket()\n",
"region = boto3.session.Session().region_name\n",
"account_id = boto3.client('sts').get_caller_identity().get('Account')\n",
"\n",
"os.environ[\"BUCKET\"] = data_bucket\n",
"os.environ[\"REGION\"] = region\n",
"role = sagemaker.get_execution_role()\n",
"\n",
"print(f\"SageMaker role is: {role}\\nDefault SageMaker Bucket: s3://{data_bucket}\")\n",
"\n",
"s3 = boto3.client('s3')\n",
"textract = boto3.client('textract', region_name=region)\n",
"comprehend = boto3.client('comprehend', region_name=region)\n",
"\n",
"%store -r document_classifier_arn\n",
"print(f\"Amazon Comprehend Custom Classifier ARN: {document_classifier_arn}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Step 2: Extract table using Amazon Textract <a id=\"step2\"></a>\n",
"\n",
"In this step, we will take a brief look at how to extract table information from one of the bank statements in our dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prefix = 'idp/comprehend/classified-docs/bank-statements'\n",
"start_after = 'idp/comprehend/classified-docs/bank-statements/'\n",
"\n",
"paginator = s3.get_paginator('list_objects_v2')\n",
"operation_parameters = {'Bucket': data_bucket,\n",
"                        'Prefix': prefix,\n",
"                        'StartAfter': start_after}\n",
"list_items = []\n",
"page_iterator = paginator.paginate(**operation_parameters)\n",
"\n",
"for page in page_iterator:\n",
"    if \"Contents\" in page:\n",
"        for item in page['Contents']:\n",
"            print(item['Key'])\n",
"            list_items.append(f's3://{data_bucket}/{item[\"Key\"]}')\n",
"\n",
"# fall back to a local sample document if no classified documents were found\n",
"if not list_items:\n",
"    list_items.append('./samples/mixedbag/document_0.png')\n",
"list_items"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's select a random bank statement from the list."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"file = random.sample(list_items, k=1)[0]  # select a random bank statement document from the list\n",
"file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our bank statements contain two tables. We will see how to extract them using the `textractprettyprinter` tool."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"resp = call_textract(input_document=file, features=[Textract_Features.TABLES])\n",
"tdoc = Document(resp)\n",
"dfs = list()\n",
"\n",
"for page in tdoc.pages:\n",
"    for table in page.tables:\n",
"        dfs.append(pd.DataFrame(convert_table_to_list(trp_table=table)))\n",
"\n",
"df1 = dfs[0]\n",
"df2 = dfs[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Step 3: Extract structured and semi-structured data using Amazon Textract <a id=\"step3\"></a>\n",
"\n",
"Let's look at some of the other ways Amazon Textract can extract structured as well as semi-structured data from documents. We will pull in a single notebook from the Amazon Textract [code sample repository](https://github.com/aws-samples/amazon-textract-code-samples/tree/master/python) and look at a few specific functionalities.\n",
"\n",
"Run the code cell below to download the notebook. Once the notebook named `02-idp-document-extraction-01.ipynb` appears, open it and work through the following sections, which demonstrate how to extract form data and table data using Amazon Textract:\n",
"\n",
"- Section 8. Forms: Key/Values\n",
"- Section 10. Tables\n",
"- Section 12. Invoices and Receipts processing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget 'https://github.com/aws-samples/amazon-textract-code-samples/raw/master/python/Textract.ipynb' -O './02-idp-document-extraction-01.ipynb'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"You can further explore all Amazon Textract capabilities by cloning the entire code repository using the `git clone` command below.\n",
"\n",
"`git clone https://github.com/aws-samples/amazon-textract-code-samples`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Cleanup\n",
"\n",
"Cleanup is optional; skip it if you want to execute the subsequent notebooks.\n",
"\n",
"Refer to `05-idp-cleanup.ipynb` for cleanup and deletion of resources."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Conclusion\n",
"\n",
"In this notebook we extracted tables from a bank statement and looked at a few additional ways Amazon Textract can help extract specific structured and semi-structured data, such as form data, from our documents. In the next notebook we will extract entity information from our documents using Amazon Comprehend."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (Data Science)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
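The Step 2 cells in the notebook above depend on live AWS resources (an S3 bucket of classified documents). As a minimal offline sketch of the same selection logic — build `s3://` URIs from listed keys and fall back to a local sample when nothing was classified — with `keys` and `bucket` as hypothetical stand-ins for the real S3 listing:

```python
import random

# Sketch of the notebook's document-selection step, with the S3 listing
# stubbed out. `keys` and `bucket` are hypothetical stand-ins.
def pick_document(keys, bucket, fallback='./samples/mixedbag/document_0.png'):
    """Return one document URI, falling back to a local sample if none exist."""
    items = [f's3://{bucket}/{key}' for key in keys] or [fallback]
    return random.sample(items, k=1)[0]
```

With an empty key list this returns the local sample path; otherwise it returns one of the constructed S3 URIs at random.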

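Under the hood, `convert_table_to_list` walks Textract's block graph (TABLE → CELL → WORD) and rebuilds rows from each cell's `RowIndex`/`ColumnIndex`. A self-contained sketch of that flattening, run against a hand-made stand-in response rather than a real Textract result:

```python
# Sketch of flattening Textract TABLES output into rows, in the spirit of
# textractprettyprinter's convert_table_to_list. The `sample` response below
# is a hand-made stand-in, not real Textract output.

def table_to_rows(response):
    """Rebuild a 2-D list from CELL blocks using RowIndex/ColumnIndex."""
    blocks = {b['Id']: b for b in response['Blocks']}

    def child_ids(block):
        # CHILD relationships point from a CELL to its WORD blocks
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'CHILD':
                return rel['Ids']
        return []

    rows = {}
    for block in response['Blocks']:
        if block['BlockType'] == 'CELL':
            text = ' '.join(blocks[i]['Text'] for i in child_ids(block))
            rows.setdefault(block['RowIndex'], {})[block['ColumnIndex']] = text
    return [[cols[c] for c in sorted(cols)] for _, cols in sorted(rows.items())]

# Stand-in response with a 2x2 table: a header row plus one data row
sample = {'Blocks': [
    {'Id': 'w1', 'BlockType': 'WORD', 'Text': 'Date'},
    {'Id': 'w2', 'BlockType': 'WORD', 'Text': 'Amount'},
    {'Id': 'w3', 'BlockType': 'WORD', 'Text': '2022-01-01'},
    {'Id': 'w4', 'BlockType': 'WORD', 'Text': '100.00'},
    {'Id': 'c1', 'BlockType': 'CELL', 'RowIndex': 1, 'ColumnIndex': 1,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w1']}]},
    {'Id': 'c2', 'BlockType': 'CELL', 'RowIndex': 1, 'ColumnIndex': 2,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w2']}]},
    {'Id': 'c3', 'BlockType': 'CELL', 'RowIndex': 2, 'ColumnIndex': 1,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w3']}]},
    {'Id': 'c4', 'BlockType': 'CELL', 'RowIndex': 2, 'ColumnIndex': 2,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w4']}]},
]}

print(table_to_rows(sample))  # [['Date', 'Amount'], ['2022-01-01', '100.00']]
```

Wrapping the returned rows in `pd.DataFrame(...)`, as the notebook does, then yields one DataFrame per table.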