Welcome to the Intelligent Retrieval-Augmented Generation (RAG) Chatbot - ProxyBot.
- Amazon Bedrock models (Titan Embeddings and Claude)
- Amazon OpenSearch Serverless (AOSS)
- Amazon Textract
- An orchestrating workflow to get started with AOSS, with the necessary permission settings and summarised extracted text files.
- Use of modern Generative AI tools like `streamlit` and `langchain`.
- Clone the repo
- Navigate to the working directory
cd aws-generative-ai-financial-services-examples/intelligent-retrieval-augmented-generation-for-proxy-documents-proxybot
- It is recommended to use a virtual environment (preferably a `conda` virtual environment, due to some package dependencies).
- If you are on any of the following operating systems, do the following:
  - Windows: Download and install the Anaconda Distribution for Windows.
  - Mac: Follow the instructions for macOS here.
  - Linux: Please see the installation link for Linux here. Be mindful of the flavour of Linux distribution you have.
As an alternative, you could run the provided `installation.sh` script. This will only work on `Cloud9`, `EC2`, or a local Linux machine. It will download and install the Anaconda Distribution framework. Remember to run the following:
chmod +x installation.sh
bash installation.sh
Warning
If you would like to use Amazon Cloud9, please see the section below for details and minimum requirements.
After the Anaconda Distribution installation is complete, do the following:
- Create a conda working environment as shown below:
conda create --name advancedragenv python=3.10 -y
- Then activate the environment
conda activate advancedragenv
It is recommended to use `poetry`, as this will help to manage cross-OS dependencies.
Caution
Make sure you have installed the Anaconda Distribution and have created and activated your working environment.
- Install the `poetry` package. See here for more details.
pip install poetry
- In the main working directory, install the required Python packages and their operating-system dependencies. This uses the `poetry.lock` and `pyproject.toml` files (please do not edit these files manually).
cd aws-generative-ai-financial-services-examples/fs_04_intelligent_rag_chatbot
poetry install
- Poetry manages the environment internally; to leverage this, please activate the environment as follows:
poetry shell
When the environment is activated, `poetry` will detect or activate its own virtual environment.
If you need to deactivate the enabled poetry shell, simply type `exit`.
Please ensure you have installed the Anaconda Distribution by following the steps for your operating system, and that you have created and activated your working environment.
- Installing needed packages: use the `requirements.txt` file to install the packages as shown below.
pip install -r requirements.txt
Warning
You can use the native Python virtual environment `venv`, but be mindful of the package dependencies.
If the above steps were successful, you should be able to continue to the next steps.
Before using any of Amazon Web Services (AWS), please check the following:
- Ensure you have followed the steps under the Getting started with AWS CLI to make sure you can communicate with AWS and its services.
- For things to run smoothly, it is recommended that you configure your AWS credentials under a profile; if not, the `default` profile will be used. You can also provide your AWS credentials on the terminal.
- Since we will be using Amazon Bedrock, please ensure you have model access for the required models - Amazon Titan Embeddings and Anthropic Claude 3.
- A `config.ini` file has been provided to help with certain dependent variables. For example, you will need a unique `s3` prefix. Please update this file accordingly for your use case.
- Creating Policies and Permissions: please ensure you have all the necessary permissions to create the following AOSS-related policies. If you already have the permissions, set `run_permission` to `False` in the `config.ini`, and ensure `aoss_action` is set to `True` to remove old AOSS resources with the same names.
iam:CreatePolicy
iam:CreateServiceLinkedRole
aoss:CreateCollection
aoss:CreateSecurityPolicy
aoss:APIAccessAll
aoss:DashboardsAccessAll
s3:*
textract:*
aoss:*
osis:Ingest
secretsmanager:*
bedrock:*
- If you would like to create a username (not recommended), ensure you set `username`; if not, leave it empty. The `username` is used for the role.
- Update the `config.ini` with the `task_prefix` and update other entries as needed. The `task_prefix` is used to define the Amazon OpenSearch Serverless policies.
A sample extract of the `config.ini` is shown below. Update the placeholder `xxxxxxxxxxxx` values with yours. A minimal sketch of loading this configuration follows the note after the sample.
Important
It is crucial that you maintain the same order for the `ticker_list` and the `ticker_company_name`, as this must match the sequence of the `data` folder. For example, `AAPL` is in the first position of the `ticker_list`, so `Apple` must be in the first position of `ticker_company_name` also.
All the needed files have been provided in the `data` folder in `YYYY.pdf` format, e.g. `2024.pdf` under `data/AAPL`, et cetera.
For a quicker run, you can use just one ticker, which means there must only be that `ticker` folder in the `data` folder. This has been done to maintain a proper structure for indexing and query search.
[ticker_list]
...
ticker_list = AAPL, AMZN, BR, GOOGL, MSFT
ticker_company_name = Apple, Amazon, Broadridge, Alphabet, Microsoft
start_year = 2021
[aws_params]
...
profile_name =
region_name = us-east-1
runtime = True
[s3_buckets]
bucket_prefix = xxxxxxxxxxxx
[secret]
aoss_credentials_name = xxxxxxxxxxxx
[identity_access_role_policy]
username =
role_name = GFSAOSSRoleBespoke
policy_name = GFSAOSSPolicyBespoke
[runtime_action]
run_permision = True
aoss_action = True
[opensearch]
task_prefix = xxxxxxxxxxxx
Note that the `start_year` must also cover the earliest date of the documents.
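As a quick sanity check before Step 1, the following is a minimal sketch (not part of the repo) of loading `config.ini` with `configparser` and opening a `boto3` session with the configured profile and region; the Bedrock model-access check at the end is illustrative and assumes the section and key names from the sample above.

```python
# Minimal sketch: load config.ini and open an AWS session with the configured
# profile/region. Section and key names follow the sample config above; this is
# not part of the repo's scripts.
import configparser

import boto3

config = configparser.ConfigParser()
config.read("config.ini")

profile_name = config.get("aws_params", "profile_name", fallback="") or None
region_name = config.get("aws_params", "region_name", fallback="us-east-1")

# An empty profile_name falls back to the default profile / environment credentials.
session = boto3.Session(profile_name=profile_name, region_name=region_name)
print("Caller identity:", session.client("sts").get_caller_identity()["Arn"])

# Optional: confirm that Titan Embeddings is visible to your account in Bedrock.
bedrock = session.client("bedrock")
model_ids = [m["modelId"] for m in bedrock.list_foundation_models()["modelSummaries"]]
print("Titan Embeddings listed:", any("titan-embed" in m for m in model_ids))
```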
There are four main stages for running the app. Instead of running the first three stages individually, you can use the provided `run_setup_processes.sh` to make a single run of steps 1, 2 and 3. Use `run_setup_processes.bat` if you are a Windows user.
Warning
If at any point there is an error, please check the logs and fix the issues. You will have to do a clean-up before restarting the script, so run `04_run_cleanup.py` and then run `run_setup_processes.sh` or `run_setup_processes.bat` again. Be mindful that there are intentional time delays in the scripts to allow for cool-offs.
You do not need to run steps 1, 2 and 3 again if you have had a successful run of the provided script.
Step 1. General Settings and Permissions: This involves creating all the necessary settings, access and permissions. It also creates the `s3` buckets and the OpenSearch Serverless credential. This stage is done in the file `01_run_aoss.py`. You only need to run this stage once.
Important
When you provide your temporary credentials on the terminal, be mindful of the expiration time and ensure you also set `AWS_DEFAULT_REGION=us-east-1`.
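For reference only, the sketch below illustrates the kind of AOSS calls this stage makes (an encryption security policy plus a vector-search collection); the names are placeholders and this is not the actual logic of `01_run_aoss.py`.

```python
# Illustrative sketch: create an AOSS encryption policy and a vector-search
# collection with boto3. Names are placeholders, not those used by 01_run_aoss.py.
import json

import boto3

aoss = boto3.Session(region_name="us-east-1").client("opensearchserverless")
collection_name = "proxybot-demo"  # placeholder

# An encryption security policy must exist before the collection is created.
aoss.create_security_policy(
    name=f"{collection_name}-enc",
    type="encryption",
    policy=json.dumps({
        "Rules": [{"ResourceType": "collection",
                   "Resource": [f"collection/{collection_name}"]}],
        "AWSOwnedKey": True,
    }),
)

# Create the vector-search collection itself (network and data-access policies
# are also required before it can be used; omitted here for brevity).
response = aoss.create_collection(name=collection_name, type="VECTORSEARCH")
print("Collection ARN:", response["createCollectionDetail"]["arn"])
```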
Step 2. Running Textract: This stage is done by `02_run_textract.py`. Sample data have been provided with respective sub-folders per company `ticker`. For this use case, we have added Apple (`AAPL`), Amazon (`AMZN`), Broadridge (`BR`), Alphabet (`GOOGL`) and Microsoft (`MSFT`). Each ticker folder has SEC forms with dates. This is done for demo purposes, as this structure helps with interactions with other AWS services like `s3` and OpenSearch Serverless (`AOSS`). If you need to add more tickers, ensure you update the `[ticker_list]` accordingly and add the necessary files in the `data` folder.
The SEC documents are first uploaded to the `s3` bucket suffixed `-pdf-data`. Your full `s3` bucket name should start with the `bucket_prefix` specified in the `config.ini` file.
Below is a sample of the `data` folder structure.
├─ data
├── AAPL/
├── AMZN/
├── BR/
├── GOOGL/
├── MSFT/
After the original documents are uploaded to the `s3` bucket, the next step is the Textract extraction. This is currently set to run and save the extracted files locally (you can modify this to write directly to `s3`).
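To illustrate the extraction step, here is a minimal sketch of asynchronous text detection with Textract against a PDF in `s3`; the bucket and object key are placeholders, and the repo's `02_run_textract.py` drives the real run.

```python
# Minimal sketch: asynchronous Textract text detection on a PDF stored in S3.
# Bucket and key are placeholders; 02_run_textract.py performs the actual run.
import time

import boto3

textract = boto3.Session(region_name="us-east-1").client("textract")

job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-prefix-pdf-data",   # placeholder
                                   "Name": "AAPL/2024.pdf"}}         # placeholder
)
job_id = job["JobId"]

# Poll until the job finishes, then collect the detected lines.
while True:
    result = textract.get_document_text_detection(JobId=job_id)
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Only the first batch of blocks is read here; page through NextToken for all of them.
lines = [b["Text"] for b in result.get("Blocks", []) if b["BlockType"] == "LINE"]
print(f"Extracted {len(lines)} lines")
```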
Step 3. Data Ingestion to OpenSearch Serverless (AOSS): This stage handles the ingestion of the processed, chaptered text files (stored locally for demo purposes), the creation of vector embeddings, and the indexing and storing of those embeddings into the earlier created AOSS vector index. This stage is executed by running `03_run_ingestion.py`. Allow some time for the ingestion to complete. You can check the created `logs` folder for the process logs.
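As a rough illustration of what this stage does, the sketch below embeds one text chunk with Amazon Titan Embeddings and indexes it into an AOSS collection using `opensearch-py`; the model ID, collection endpoint, index name and field names are assumptions rather than the repo's actual values.

```python
# Illustrative sketch: embed a text chunk with Titan Embeddings and index it into
# an AOSS vector index. Endpoint, index and field names are placeholders.
import json

import boto3
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection

session = boto3.Session(region_name="us-east-1")

# 1) Create the embedding with Amazon Titan via the Bedrock runtime.
bedrock_rt = session.client("bedrock-runtime")
chunk = "Broadridge 2024 proxy statement: board of directors overview ..."
resp = bedrock_rt.invoke_model(
    modelId="amazon.titan-embed-text-v1",        # assumed Titan embedding model ID
    body=json.dumps({"inputText": chunk}),
)
embedding = json.loads(resp["body"].read())["embedding"]

# 2) Index the chunk and its embedding into the AOSS vector index.
host = "xxxxxxxxxxxx.us-east-1.aoss.amazonaws.com"   # your collection endpoint
auth = AWSV4SignerAuth(session.get_credentials(), "us-east-1", "aoss")
client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)
client.index(
    index="proxybot-index",                           # placeholder index name
    body={"ticker": "BR", "year": 2024, "text": chunk, "vector_field": embedding},
)
```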
Step 4. Running the Streamlit App: See the following section.
To run the code locally, run `app.py` as follows:
streamlit run app.py
This should pop up in your browser; if not, type the localhost link shown below:
http://localhost:8501/
To use Cloud9, please ensure the following:
- Follow the entire setup processes above.
- Ensure your Cloud9 has sufficient memory and disk storage; if not, you won't be able to run the code. This link can help. Be patient with this, as it might take a few minutes, and reboot for the changes to take effect.
- It is recommended to have a minimum of `40 GB` disk space and a minimum of `16 GB` RAM for the Cloud9.
- Run the `conda_installation.sh` script and restart the terminal to have `conda` active.
- With a new Cloud9, temporary default credentials are usually set, but to be on the safe side, please provide your temporary AWS credentials and make sure you set `AWS_DEFAULT_REGION=us-east-1`.
chmod +x conda_installation.sh
bash conda_installation.sh
- Then create and activate your `conda` environment. Check above for more details on `conda` environment creation and activation.
- Follow all the other setup steps above to continue.
Warning
If you get an `InvalidTokenId` error and you are using AWS managed temporary credentials, your access might be restricted. For more information, see the temporary managed credentials URL.
It is recommended to run this app in a container.
- Build the container image.
docker-compose build --no-cache
- Start the container.
docker-compose up
The ProxyBot app server will start automatically, but note that you need to copy the link into your browser. If you are running Docker locally, please use http://localhost:8501/; if on EC2, access the public URL provided (http://xxx.xxx.xxx.xxx:8501) in a browser.
You can also run the entire code on an EC2 instance. The instructions to get this demo running on an EC2 instance are similar to Cloud9's.
You will be required to:
- Use the EC2 console to open up the security group's inbound rules to allow IPv4 traffic (0.0.0.0/0) to the port Streamlit returned (8501), as shown in the sketch below.
- Access the public URL it provided (http://xxx.xxx.xxx.xxx:8501) in a browser.
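If you prefer to script the inbound rule instead of using the console, here is a minimal sketch with `boto3`; the security group ID is a placeholder.

```python
# Minimal sketch: open TCP port 8501 to all IPv4 addresses on an EC2 security
# group so the Streamlit app is reachable. The group ID is a placeholder.
import boto3

ec2 = boto3.Session(region_name="us-east-1").client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",   # placeholder security group ID
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8501,
        "ToPort": 8501,
        "IpRanges": [{"CidrIp": "0.0.0.0/0",
                      "Description": "Streamlit demo access; restrict in production"}],
    }],
)
```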
├─ intelligent-retrieval-augmented-generation-for-proxy-documents-proxybot
├── aoss/
├── assets/
├── awsmanager/
├── data/
├── data_pipeline/
├── logics/
├── parameters/
├── textract/
├── utils/
├── .dockerignore
├── .env
├── .gitignore
├── .gitkeep
├── 01_run_aoss.py
├── 02_run_textract.py
├── 03_run_ingestion.py
├── 04_run_cleanup.py
├── app.py
├── config.ini
├── docker-compose.yml
├── Dockerfile
├── installation.sh
├── poetry.lock
├── pyproject.toml
├── requirements-dev.txt
├── requirements.txt
├── run_reviews.py
├── run_setup_processes.sh
├── run_setup_processes.bat
└── README.md
- Cloud9: You might experience issues with memory and disk storage limitations on Cloud9, so please ensure you have the minimum requirements; if not, you will likely get the following:
Unpacking payload ...
concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
File "concurrent/futures/process.py", line 392, in wait_result_broken_or_wakeup
File "multiprocessing/connection.py", line 251, in recv
TypeError: InvalidArchiveError.__init__() missing 1 required positional argument: 'msg'
'''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "entry_point.py", line 294, in <module>
File "entry_point.py", line 286, in main
File "entry_point.py", line 194, in _constructor_subcommand
File "entry_point.py", line 159, in _constructor_extract_conda_pkgs
File "concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
File "concurrent/futures/_base.py", line 621, in result_iterator
File "concurrent/futures/_base.py", line 319, in _result_or_cancel
File "concurrent/futures/_base.py", line 458, in result
File "concurrent/futures/_base.py", line 403, in __get_result
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
[5588] Failed to execute script 'entry_point' due to unhandled exception!
conda_script.sh: line 95: conda: command not found
Check your directory space to confirm storage size.
df -h /home/
To remove all the created resources, run the `04_run_cleanup.py` file as shown below:
python 04_run_cleanup.py
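For context, a clean-up of this kind typically deletes the AOSS collection and empties the demo buckets; the sketch below is illustrative only and is not the logic of `04_run_cleanup.py` (the collection and bucket names are placeholders).

```python
# Illustrative sketch of typical clean-up calls (not the repo's 04_run_cleanup.py):
# delete the AOSS collection and empty/remove a demo S3 bucket.
import boto3

session = boto3.Session(region_name="us-east-1")

# Look up the collection ID from its (placeholder) name, then delete it.
aoss = session.client("opensearchserverless")
details = aoss.batch_get_collection(names=["proxybot-demo"])["collectionDetails"]
for collection in details:
    aoss.delete_collection(id=collection["id"])

# Empty and delete the demo bucket (placeholder name).
bucket = session.resource("s3").Bucket("my-prefix-pdf-data")
bucket.objects.all().delete()
bucket.delete()
```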