AWS Generative AI Financial Services - Intelligent RAG Chatbot 💰

Welcome to the Intelligent Retrieval-Augmented Generation (RAG) Chatbot - ProxyBot.

1.0 Key Features and Services

  1. Amazon Bedrock Model (Titan Embeddings and Claude)
  2. Amazon OpenSearch Serverless (AOSS)
  3. Amazon Textract
  4. An orchestrating workflow to get you started with AOSS, including the necessary permission settings and the extraction and summarization of text files.
  5. Use of modern Generative AI tools such as Streamlit and LangChain.

2.0 Architecture

2.1 Overall Workflow Architecture

(Overall workflow architecture diagram)

3.0 Getting started

  1. Clone the repo.
  2. Navigate to the working directory:
cd aws-generative-ai-financial-services-examples/intelligent-retrieval-augmented-generation-for-proxy-documents-proxybot
  3. It is recommended to use a virtual environment (preferably a conda virtual environment, due to some package dependencies).

  4. Install the Anaconda Distribution for your operating system. As an alternative, you can run the provided installation.sh script; this will only work on Cloud9, EC2, or a local Linux machine and will download and install the Anaconda Distribution. Remember to do the following:

    chmod +x installation.sh
    bash installation.sh

Warning

If you would like to use AWS Cloud9, please see the section below for details and minimum requirements.

3.1 Managed Environment (conda preferred)

After the Anaconda Distribution installation is complete, do the following:

  1. Create a conda working environment as shown below:
conda create --name advancedragenv python=3.10 -y
  2. Then activate the environment:
conda activate advancedragenv

3.2 Package Installation

3.2.1 Using Poetry (recommended)

It is recommended to use poetry as this will help to manage cross-OS dependencies.

Caution

Make sure you have installed the Anaconda Distribution and have created and activated your working environment.

  1. Install the poetry package. See here for more details.
pip install poetry
  2. In the main working directory, install the required packages along with their Python and operating-system dependencies. This uses the poetry.lock and pyproject.toml files (please do not edit these files manually).
cd aws-generative-ai-financial-services-examples/intelligent-retrieval-augmented-generation-for-proxy-documents-proxybot
poetry install
  3. Poetry manages the environment internally; to leverage this, please activate the environment as follows:
poetry shell

When the environment is activated, Poetry will detect or create its own virtual environment.

If you need to deactivate the enabled poetry shell, simply type exit.

3.2.2 Using Conda environment with requirements.txt

Please ensure you have installed the Anaconda Distribution by following the steps for your operating system, and that you have created and activated your working environment.

  1. Install the needed packages using the requirements.txt file, as shown below:
pip install -r requirements.txt

Warning

You can use the native Python environment venv but be mindful of the package dependencies.

If the above steps were successful, you should be able to continue with the next steps.

4.0 Roles and Policies

Before using any Amazon Web Services (AWS), please check the following:

  1. Ensure you have followed the steps under Getting started with the AWS CLI to make sure you can communicate with AWS and its services.
  2. For things to run smoothly, it is recommended that you have your AWS credentials configured under a profile; otherwise the default profile will be used. You can also provide your AWS credentials directly in the terminal.
  3. Since we will be using Amazon Bedrock, please ensure you have model access for the required models - Amazon Titan Embeddings and Anthropic Claude 3 (a quick pre-flight check is sketched after this list).
  4. A config.ini file has been provided to hold certain dependent variables. For example, you will need a unique S3 bucket prefix. Please update this file accordingly for your use case before execution.
  5. Creating Policies and Permissions: Please ensure you have all the necessary permissions to create the AOSS-related policies listed below. If you already have the policies in place, set run_permission to False in config.ini, and ensure aoss_action is set to True to remove old AOSS resources with the same names.
  • iam:CreatePolicy
  • iam:CreateServiceLinkedRole
  • aoss:CreateCollection
  • aoss:CreateSecurityPolicy
  • aoss:APIAccessAll
  • aoss:DashboardsAccessAll
  • s3:*
  • textract:*
  • aoss:*
  • osis:Ingest
  • secretsmanager:*
  • bedrock:*
  6. If you would like to create a username (not recommended), make sure to set your username; otherwise leave it empty. The username is used for the role.
  7. Update the config.ini with the task_prefix and update other entities as needed. The task_prefix is used to define the Amazon OpenSearch Serverless policies.
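
As a quick, optional pre-flight check, the sketch below confirms that your credentials resolve and lists the Bedrock foundation models visible in your region. It does not verify that model access has actually been granted (that is done in the Bedrock console), and the profile, region and model ids are illustrative assumptions rather than values taken from this repository.

# Illustrative pre-flight check (not part of the repo's scripts).
import boto3

session = boto3.Session(profile_name=None, region_name="us-east-1")  # None -> default profile
print("Caller identity:", session.client("sts").get_caller_identity()["Arn"])

# List the Bedrock foundation models visible in this region. This confirms
# connectivity only; model access itself is granted in the Bedrock console.
bedrock = session.client("bedrock")
available = {m["modelId"] for m in bedrock.list_foundation_models()["modelSummaries"]}
for model_id in ("amazon.titan-embed-text-v1", "anthropic.claude-3-sonnet-20240229-v1:0"):
    print(model_id, "listed" if model_id in available else "NOT listed in this region")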

A sample extract of the config.ini is shown below. Update the placeholder xxxxxxxxxxxx values with your own.

Important

It is crucial that you maintain the same order for the ticker_list and the ticker_company_name, as this must match the sequence of the data folder. For example, AAPL is in the first position of ticker_list, so Apple must be in the first position of ticker_company_name also.

All the needed files have been provided in the data folder in YYYY.pdf format (for example, 2024.pdf under data/AAPL, et cetera).

For a quicker run, you can use just one ticker, which means only that ticker's folder must be present in the data folder. This has been done to maintain a proper structure for indexing and query search.

[ticker_list]
...
ticker_list = AAPL, AMZN, BR, GOOGL, MSFT
ticker_company_name = Apple, Amazon, Broadridge, Alphabet, Microsoft
start_year = 2021

[aws_params]
...
profile_name = 
region_name = us-east-1
runtime = True

[s3_buckets]
bucket_prefix = xxxxxxxxxxxx

[secret]
aoss_credentials_name = xxxxxxxxxxxx

[identity_access_role_policy]
username = 
role_name = GFSAOSSRoleBespoke
policy_name = GFSAOSSPolicyBespoke

[runtime_action]
run_permision = True
aoss_action = True

[opensearch]
task_prefix = xxxxxxxxxxxx

Note that the start_year must also cover the earliest date among the documents.
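
For illustration, the values above can be read with Python's standard configparser; the repository's own scripts may parse them differently, and the section and key names below simply mirror the sample extract.

# Illustrative only: reading the sample config.ini values with configparser.
from configparser import ConfigParser

config = ConfigParser()
config.read("config.ini")

tickers = [t.strip() for t in config["ticker_list"]["ticker_list"].split(",")]
companies = [c.strip() for c in config["ticker_list"]["ticker_company_name"].split(",")]
# The ordering rule from the Important note above: positions must line up.
assert len(tickers) == len(companies)
print(dict(zip(tickers, companies)))

print(config["aws_params"]["region_name"], config["s3_buckets"]["bucket_prefix"])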

5.0 Running the Application

There are four main stages for running the app. Instead of running the first three stages individually, you can use the provided run_setup_processes.sh to run steps 1, 2 and 3 in a single pass. Windows users should use run_setup_processes.bat.

Warning

If at any point there is an error, please check the logs and fix the issue. You will have to do a clean up before restarting the script: run 04_run_cleanup.py and then run run_setup_processes.sh or run_setup_processes.bat again. Be mindful that there are intentional time delays in the scripts to allow for cool-off periods.

You do not need to run steps 1, 2 and 3 again if you have had a successful run of the provided script.

Step 1. General Settings and Permissions: This involves creating all the necessary settings, access and permissions. It also creates the S3 buckets and the OpenSearch Serverless credential. This stage is handled by 01_run_aoss.py. You only need to run this stage once.

Important

When you provide your temporary credentials on the terminal, be mindful of the expiration time and ensure you also set AWS_DEFAULT_REGION=us-east-1.

Step 2. Running Textract: This stage is handled by 02_run_textract.py. Sample data has been provided, with one sub-folder per company ticker. For this use case, we have added Apple (AAPL), Amazon (AMZN), Broadridge (BR), Alphabet (GOOGL) and Microsoft (MSFT). Each ticker folder contains dated SEC forms. This is done for demo purposes, as this structure helps with interactions with other AWS services like S3 and OpenSearch Serverless (AOSS). If you need to add more tickers, ensure you update [ticker_list] accordingly and add the necessary files in the data folder. The SEC documents are first uploaded to the S3 bucket suffixed -pdf-data; your full S3 bucket name should start with the bucket_prefix specified in the config.ini file.

Below is a sample of the data folder structure.

├─ data
   ├── AAPL/              
   ├── AMZN/
   ├── BR/              
   ├── GOOGL/
   ├── MSFT/
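
As a rough illustration of the upload step for this structure (the actual bucket naming and key layout are handled by 02_run_textract.py and your config.ini; the bucket name below is a placeholder):

# Hypothetical sketch: upload each data/<TICKER>/<YYYY>.pdf to the -pdf-data bucket.
import boto3
from pathlib import Path

bucket = "xxxxxxxxxxxx-pdf-data"  # replace xxxxxxxxxxxx with your bucket_prefix
s3 = boto3.client("s3", region_name="us-east-1")

for pdf in Path("data").glob("*/[0-9][0-9][0-9][0-9].pdf"):
    key = f"{pdf.parent.name}/{pdf.name}"  # e.g. AAPL/2024.pdf
    s3.upload_file(str(pdf), bucket, key)
    print(f"uploaded s3://{bucket}/{key}")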

After the original documents are uploaded to the S3 bucket, the next step is the Textract extraction. This is currently set to run and save the extracted files locally (you can modify this to write directly to S3).
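
A minimal sketch of the asynchronous Textract pattern for a PDF in S3 is shown below; the bucket and key are placeholders, result pagination is omitted for brevity, and 02_run_textract.py may differ in detail.

# Illustrative Textract call for a single PDF already uploaded to S3.
import time
import boto3

textract = boto3.client("textract", region_name="us-east-1")
job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "xxxxxxxxxxxx-pdf-data", "Name": "AAPL/2024.pdf"}}
)

# Poll until the asynchronous job finishes, then collect the detected text lines.
while True:
    result = textract.get_document_text_detection(JobId=job["JobId"])
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

lines = [block["Text"] for block in result.get("Blocks", []) if block["BlockType"] == "LINE"]
print(f"{result['JobStatus']}: {len(lines)} lines extracted (first page of results only)")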

Step 3. Data Ingestion to OpenSearch Serverless (AOSS): This stage handles the ingestion of the processed, chaptered text files (stored locally for demo purposes), the creation of vector embeddings, and the indexing and storage of those embeddings in the previously created AOSS vector index. This stage is executed by running 03_run_ingestion.py. Allow some time for the ingestion to complete. You can check the created logs folder for the process logs.
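
Conceptually, the ingestion pairs a Titan embedding with each text chunk and writes it to the AOSS index. The sketch below is illustrative rather than the repo's exact code; the collection endpoint, index name, field names and model id are placeholders or assumptions, and it assumes the opensearch-py package is installed.

# Illustrative embed-and-index step for a single text chunk.
import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"
session = boto3.Session(region_name=region)

# 1) Create the embedding with Amazon Titan Embeddings via Bedrock.
bedrock_runtime = session.client("bedrock-runtime")
chunk = "Sample chunk of extracted proxy-statement text."
response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    body=json.dumps({"inputText": chunk}),
)
embedding = json.loads(response["body"].read())["embedding"]

# 2) Index the chunk and its vector into the AOSS collection (SigV4 auth, service "aoss").
auth = AWSV4SignerAuth(session.get_credentials(), region, "aoss")
client = OpenSearch(
    hosts=[{"host": "your-collection-id.us-east-1.aoss.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)
client.index(index="your-vector-index", body={"vector": embedding, "text": chunk, "ticker": "AAPL", "year": 2024})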

Step 4. Running the Streamlit App: See following section.

5.1 Running code options

5.1.1 Locally

To run the code locally, run the app.py as follows:

streamlit run app.py

The app should open in your browser; if not, navigate to the localhost link shown below:

http://localhost:8501/

5.1.2 Using AWS Cloud9

To use Cloud9, please ensure the following:

  1. Follow the entire setup process above.
  2. Ensure your Cloud9 instance has sufficient memory and disk storage; otherwise you won't be able to run the code. This link can help. Be patient with this, it might take a few minutes, and reboot for the changes to take effect.
  3. It is recommended to have a minimum of 40 GB disk space and a minimum of 16 GB RAM for Cloud9.
  4. Run the conda_installation.sh script and restart the terminal to make conda active:
chmod +x conda_installation.sh
bash conda_installation.sh
  5. With a new Cloud9 environment, temporary default credentials are usually set, but to be on the safe side, please provide your temporary AWS credentials and make sure you set AWS_DEFAULT_REGION=us-east-1.
  6. Then create and activate your conda environment. Check above for more details on conda environment creation and activation.
  7. Follow all the other setup steps above to continue.

Warning

If you get an InvalidTokenId error and you are using AWS managed temporary credentials, your access might be restricted. For more information, see the AWS managed temporary credentials documentation.

5.1.3 Container Execution (#TOBEFIXED)

It is recommended to run this app in a container.

  1. Build the container image.
docker-compose build --no-cache
  2. Start the container.
docker-compose up

The ProxyBot app server will start automatically, but note that you need to copy the link into your browser. If you are running Docker locally, please use http://localhost:8501/; if on EC2, access the public URL provided (http://xxx.xxx.xxx.xxx:8501) in a browser.

5.1.4 Running on an EC2 - (TOBEADDED, similar to Cloud9)

You can also run the entire code on an EC2 instance. The instructions for getting this demo running on EC2 are similar to Cloud9's.

You will be required to:

  1. Use the EC2 console to open up the security group's inbound rules to allow IPv4 traffic to the port Streamlit returns (8501); a sketch is shown after this list.
  2. Access the public URL it provides (http://xxx.xxx.xxx.xxx:8501) in a browser.
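
If you prefer to script the inbound rule rather than use the console, a hedged boto3 equivalent is below (the security group id is a placeholder; opening 8501 to 0.0.0.0/0 is for demo purposes only):

# Hypothetical: allow inbound TCP 8501 so the Streamlit URL is reachable.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-xxxxxxxxxxxx",  # placeholder: your instance's security group id
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8501,
        "ToPort": 8501,
        "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "Streamlit demo port"}],
    }],
)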

Project Structure

├─ intelligent-retrieval-augmented-generation-for-proxy-documents-proxybot
   ├── aoss/              
   ├── assets/
   ├── awsmanager/   
   ├── data/   
   ├── data_pipeline/   
   ├── logics/
   ├── parameters/
   ├── textract/
   ├── utils/
   ├── .dockerignore
   ├── .env
   ├── .gitignore             
   ├── .gitkeep
   ├── 01_run_aoss.py
   ├── 02_run_textract.py
   ├── 03_run_ingestion.py     
   ├── 04_run_cleanup.py   
   ├── app.py
   ├── config.ini   
   ├── docker-compose.yml
   ├── Dockerfile
   ├── installation.sh   
   ├── poetry.lock
   ├── pyproject.toml
   ├── requirements-dev.txt
   ├── requirements.txt
   ├── run_reviews.py
   ├── run_setup_processes.sh       
   ├── run_setup_processes.bat   
   └── README.md   

Potential issues

  • Cloud9: You might experience issues with memory and disk storage limitations on Cloud9, so please ensure you meet the minimum requirements; otherwise you will likely get the following:
Unpacking payload ...
concurrent.futures.process._RemoteTraceback:                                                                                                                                                                                                                                                                                       
'''
Traceback (most recent call last):
  File "concurrent/futures/process.py", line 392, in wait_result_broken_or_wakeup
  File "multiprocessing/connection.py", line 251, in recv
TypeError: InvalidArchiveError.__init__() missing 1 required positional argument: 'msg'
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "entry_point.py", line 294, in <module>
  File "entry_point.py", line 286, in main
  File "entry_point.py", line 194, in _constructor_subcommand
  File "entry_point.py", line 159, in _constructor_extract_conda_pkgs
  File "concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
  File "concurrent/futures/_base.py", line 621, in result_iterator
  File "concurrent/futures/_base.py", line 319, in _result_or_cancel
  File "concurrent/futures/_base.py", line 458, in result
  File "concurrent/futures/_base.py", line 403, in __get_result
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
[5588] Failed to execute script 'entry_point' due to unhandled exception!
conda_script.sh: line 95: conda: command not found

Check your directory space to confirm storage size.

df -h /home/

Cleanup

To remove all resources created, run the 04_run_cleanup.py file as shown below:

python 04_run_cleanup.py

Authors

Contributors


Initial Security Reviews

  1. https://github.com/PyCQA/bandit
  2. https://bandit.readthedocs.io/en/latest/start.html