This project implements a semantic similarity search system for transaction descriptions using transformer-based language models. It allows users to find transactions similar to a given input description based on semantic meaning rather than exact text matching.
- Overview
- Features
- Setup Instructions
- Usage
- Directory Structure
- Technologies Used
- Contributing
- License
The goal of this project is to provide a scalable and efficient method to find transactions with similar descriptions using transformer-based language models. It involves tokenizing transaction descriptions, computing embeddings using a pre-trained model, and measuring semantic similarity using cosine similarity.
- Pattern Matching: Extracts users' names from descriptions in the transactions table and matches them against the user records.
- Semantic Search: Find transactions similar to a given description based on semantic meaning.
- Efficient Embedding: Utilize transformer models for computing embeddings efficiently.
- Python 3.10+
- pip package manager
- Git
- Clone the repository:
    git clone https://github.com/DrUkachi/transaction-handling-API
    cd transaction-handling-API
- Install dependencies:
    pip install -r requirements.txt
- Prepare Transaction Data: Ensure your transaction data is available in a suitable format (CSV).
- Start the FastAPI application by running the app.py script:
    python app.py
- Use the Streamlit app to access the API. Run the main.py script with Streamlit to interact with the API:
    streamlit run main.py
- You can also visit the local Swagger UI to browse the API documentation; FastAPI serves it at the /docs path by default.
transaction-handling-api/
│
├── app.py            # Main script for the API
├── README.md         # Project overview, instructions, and description of the solution
├── requirements.txt  # Python dependencies
├── transactions.csv  # Transaction data
└── users.csv         # Users data (example)
- Transformers Library: Hugging Face Transformers for transformer-based models.
- Python Libraries: pandas, numpy, scikit-learn for data handling and computation.
- Git: Version control for project management and collaboration.
- FastAPI: Exposes the model as a web API service. See the deployed link.
Contributions are welcome! Please fork the repository and create a pull request for any enhancements or bug fixes.
This project is licensed under the MIT License.
- Endpoint Definition:
  - The endpoint /match-users/ is defined as a POST method that expects a transaction_id as input.
  - It returns a JSON response with matched users sorted by relevance and includes the total number of matches.
- Dependencies:
  - The endpoint uses a dependency (get_loaded_data) to fetch the data (trans_data and users_data) needed for processing. This assumes that get_loaded_data is a function returning loaded data, likely from some external source such as a database or file.
- Processing Logic:
  - It first checks whether the transaction with the provided transaction_id exists in trans_data. If it is not found, it raises a 404 HTTPException.
  - It extracts the transaction description and attempts to identify the user's name using several methods (extract_name, extract_name_spacy, extract_name_distilbert). These methods likely employ different techniques (such as regex or NLP libraries) to extract names from the transaction description.
  - If a user name is successfully extracted, it matches this name against users_data using fuzzy matching (fuzz.ratio from the fuzzywuzzy library) to calculate a similarity score (match_metric).
  - Matches are sorted in descending order of match_metric to prioritize more relevant matches.
- Response:
  - The response JSON includes:
    - "users": A list of dictionaries containing id and match_metric for each matched user, sorted by relevance.
    - "total_number_of_matches": An integer indicating the total number of users matched.
- Error Handling:
  - Handles exceptions and returns appropriate HTTP status codes (404 for not found, 500 for internal server errors) with detailed error messages.
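The matching and response steps above can be sketched as follows. This is a minimal illustration, not the project's code: it uses the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio from fuzzywuzzy (the scores are similar in spirit but not guaranteed to be identical), and the function name match_users, the threshold parameter, the name field, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (stand-in for fuzzywuzzy's fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_users(extracted_name: str, users: list[dict], threshold: int = 60) -> dict:
    """Score every user against the extracted name and build the response."""
    matches = [
        {"id": user["id"], "match_metric": similarity(extracted_name, user["name"])}
        for user in users
    ]
    # Keep only plausible matches (the threshold is an assumption), best first.
    matches = [m for m in matches if m["match_metric"] >= threshold]
    matches.sort(key=lambda m: m["match_metric"], reverse=True)
    return {"users": matches, "total_number_of_matches": len(matches)}


# Hypothetical sample data for illustration only.
users = [
    {"id": "u1", "name": "John Smith"},
    {"id": "u2", "name": "Jon Smyth"},
    {"id": "u3", "name": "Maria Garcia"},
]
response = match_users("John Smith", users)
```

Here the exact match for u1 ranks first, the near match u2 follows, and u3 falls below the assumed threshold and is dropped.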
- Language Limitation:
  - The solution assumes that all names are in Latin alphabet characters. It does not handle names in other languages or character sets, which could lead to mismatches if non-Latin characters are present.
- Name Extraction Issues:
  - It might struggle with transaction descriptions that have unconventional formatting or where names are not clearly separated. For example, names written without spaces (e.g., "JohnSmith") could be challenging to extract accurately.
- Performance Considerations:
  - Iterating through every user could become inefficient if users_data is very large. Optimization techniques such as indexing or filtering on initial criteria could improve performance.
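One way to read the "indexing or filtering" suggestion above is a blocking index: group users by a cheap key so that only a small candidate bucket is scored with the fuzzy matcher. The sketch below assumes each user record has a name field and uses the first letter of each name token as the key. Both choices, and the helper names build_index and candidates, are illustrative rather than taken from the project.

```python
from collections import defaultdict


def build_index(users: list[dict]) -> dict[str, list[dict]]:
    """Map the first letter of each name token to the users containing it."""
    index: dict[str, list[dict]] = defaultdict(list)
    for user in users:
        keys = {token[0].lower() for token in user["name"].split()}
        for key in keys:
            index[key].append(user)
    return index


def candidates(name: str, index: dict[str, list[dict]]) -> list[dict]:
    """Return users sharing at least one token initial with the query name."""
    seen: dict[str, dict] = {}
    for token in name.split():
        for user in index.get(token[0].lower(), []):
            seen[user["id"]] = user  # de-duplicate by id
    return list(seen.values())


# Hypothetical sample data for illustration only.
users = [
    {"id": "u1", "name": "John Smith"},
    {"id": "u2", "name": "Maria Garcia"},
    {"id": "u3", "name": "Sam Jones"},
]
index = build_index(users)
shortlist = candidates("Jon Smyth", index)
```

The trade-off is recall: a typo in the first letter of every token would make the true match miss its bucket, so the key should be chosen with the expected noise in mind.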
- Multilingual Support: Handle names in different languages by using appropriate NLP libraries or techniques for name recognition and fuzzy matching, for example by training small, efficient LLMs for this particular task.
- Enhanced Name Extraction: Develop more robust techniques to handle diverse formats of transaction descriptions, such as names without spaces or with punctuation.
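As a small illustration of handling names without spaces or with punctuation, the sketch below normalizes a candidate name before matching. It is an assumption about one possible preprocessing step, not part of the existing extract_name functions, and the helper name normalize_name is hypothetical.

```python
import re


def normalize_name(raw: str) -> str:
    """Split fused CamelCase names and replace punctuation with spaces."""
    # Insert a space at each lowercase-to-uppercase boundary ("JohnSmith").
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", raw)
    # Treat separators such as ".", "_", "-", "," and "/" as spaces.
    spaced = re.sub(r"[._\-,/]+", " ", spaced)
    # Collapse repeated whitespace and trim the ends.
    return " ".join(spaced.split())
```

For example, normalize_name("JohnSmith") returns "John Smith". The rule is deliberately simple and has known failure cases: it would also split names such as "McDonald" and hyphenated names such as "Mary-Jane", so it should be treated as a starting point only.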
The provided solution extends the FastAPI application with a new endpoint /similar-transactions/ that uses language model embeddings to find transactions with descriptions similar to a given input string. Here’s a breakdown:
- Endpoint Definition:
  - The endpoint /similar-transactions/ is defined as a POST method that expects an input_string as input.
  - It returns a JSON response with transactions sorted by relevance based on their similarity to the input_string.
- Dependencies:
  - It uses the dependency get_loaded_data to fetch the transaction data (trans_data) needed for processing. The second element of the returned tuple (_) is unused here because only the transaction data is needed.
- Processing Logic:
  - Tokenization and Embedding Creation:
    - Converts the input_string into tokens using a tokenizer (tokenizer), which is likely based on a pre-trained language model (e.g., BERT, RoBERTa). The tokenizer returns token IDs suitable for the model, wrapped in a PyTorch tensor.
    - Utilizes a pre-trained language model (model) to obtain embeddings for the input_string. It computes the mean of the last hidden state embeddings (input_output.last_hidden_state.mean(dim=1)) to get a fixed-size representation of the input.
  - Compute Similarity:
    - Tokenizes and embeds all transaction descriptions (trans_data['description']). This involves padding and truncation to ensure consistent input lengths for efficient batch processing using PyTorch tensors.
    - Computes cosine similarity between the embedding of input_string and each transaction description embedding (cosine_similarity).
    - Assigns similarity scores to each transaction and adds them as a new column (similarity_score) to trans_data.
  - Sorting and Response:
    - Sorts trans_data based on similarity_score in descending order to prioritize transactions most similar to input_string.
    - Constructs the response JSON:
      - "transactions": A list of dictionaries containing id and description of transactions sorted by similarity.
      - "total_number_of_tokens_used": Number of tokens used in the input string's tokenizer representation (input_tokens['input_ids'][0]).
Error Handling:
- Catches exceptions and returns a detailed error message with a
500HTTP status code if any error occurs during processing.
- Catches exceptions and returns a detailed error message with a
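The embedding and ranking steps above can be illustrated without loading a model. The sketch below uses plain Python lists in place of PyTorch tensors and tiny made-up token vectors in place of last_hidden_state. The mean_pool helper mirrors last_hidden_state.mean(dim=1) for a single sequence, and all function names and numbers here are illustrative assumptions.

```python
import math


def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average token vectors into one fixed-size embedding."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query: list[float], rows: list[dict]) -> list[dict]:
    """Attach a similarity_score to each row and sort best first."""
    scored = [
        {**row, "similarity_score": cosine_similarity(query, row["embedding"])}
        for row in rows
    ]
    return sorted(scored, key=lambda r: r["similarity_score"], reverse=True)


# Made-up 2-dimensional "token embeddings" for illustration only.
query_embedding = mean_pool([[1.0, 0.0], [1.0, 2.0]])
rows = [
    {"id": 1, "embedding": [0.0, 1.0]},
    {"id": 2, "embedding": [2.0, 2.0]},
    {"id": 3, "embedding": [-1.0, -1.0]},
]
ranked = rank(query_embedding, rows)
```

Row 2 points in the same direction as the query and ranks first, while row 3 points the opposite way and ranks last. With padded batches, a plain mean also averages the padding positions, so a mean weighted by the attention mask is a common alternative.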
- Contextual Understanding:
  - Semantic Meaning: Although the solution uses language model embeddings, it may not capture nuanced semantic similarities effectively. Language models like BERT or RoBERTa encode text based on context, but they might not perfectly capture the intended semantic similarity for all use cases.
- Performance Considerations:
  - Computational Cost: Calculating embeddings and similarity scores for potentially large datasets (trans_data) can be computationally intensive, especially in real-time applications or with large input strings.
- Model and Tokenizer Dependency:
  - The solution assumes the availability of, and compatibility with, a specific tokenizer and language model (model). Changing these components (for example, switching to a different model or tokenizer) could require substantial modifications.
- Handling Long Texts:
  - The approach may struggle with very long input strings or descriptions, as transformers like BERT have a maximum token limit (512 tokens for the standard BERT models). Long texts may need to be segmented or truncated, potentially affecting similarity accuracy.
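One common way to work around the token limit is to split a long token sequence into overlapping windows, embed each window, and combine the results (for example by averaging). The sketch below shows only the windowing step on a list of token IDs. The default window size of 512 reflects the usual BERT limit, while the stride value and the function name sliding_windows are assumptions.

```python
def sliding_windows(
    token_ids: list[int], max_len: int = 512, stride: int = 256
) -> list[list[int]]:
    """Split token_ids into windows of at most max_len tokens.

    Each window starts stride tokens after the previous one, so
    consecutive windows overlap by max_len - stride tokens.
    """
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("require max_len > 0 and 0 < stride <= max_len")
    if len(token_ids) <= max_len:
        return [token_ids]
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return windows


chunks = sliding_windows(list(range(10)), max_len=4, stride=2)
```

In practice the usable window is slightly smaller than the model limit, because tokenizers add special tokens to each sequence.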
- Fine-tuning or Transfer Learning: Consider fine-tuning a pre-trained language model on transaction-specific data to better capture domain-specific semantics.
- Advanced Similarity Metrics: Explore other similarity metrics beyond cosine similarity (e.g., BERTScore) that might better align with semantic similarity tasks.
- Efficiency Enhancements: Implement batching or parallelization strategies to optimize computation, especially for large datasets.
- Handling Long Texts: Develop strategies to handle long texts more effectively, such as hierarchical or attention-based approaches.
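For the batching suggestion, a minimal sketch is a generator that yields fixed-size slices of the descriptions, so that each slice can be tokenized and embedded in one model call. The embed_batch callable is a hypothetical placeholder for the tokenizer-plus-model step. Here it is replaced by a dummy function so that the example runs on its own.

```python
from collections.abc import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(
    descriptions: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    """Embed descriptions batch by batch and concatenate the results in order."""
    embeddings: list[list[float]] = []
    for batch in batched(descriptions, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


def dummy_embed(batch: list[str]) -> list[list[float]]:
    """Placeholder for the real model call: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_all(["coffee", "rent", "salary", "gym"], dummy_embed, batch_size=3)
```

Bounding the batch size keeps peak memory predictable. A suitable value depends on the model and hardware, so the default of 32 is only a placeholder.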
By addressing these limitations and considering improvements, the solution can be made more robust and suitable for a wider range of real-world applications where semantic similarity in transaction descriptions is crucial.
Taking a proof of concept (PoC) to production involves several key steps and considerations to ensure reliability, scalability, and maintainability. Below are some suggested improvements for moving the Transaction Handling application, built with FastAPI, from a PoC to a production-ready state.
- Containerization: Use Docker to containerize your FastAPI application along with its dependencies. This ensures consistent deployment across different environments.
- Orchestration: Deploy containers using Kubernetes or Docker Compose for managing scalability, load balancing, and fault tolerance.
- Deployment Automation: Implement CI/CD pipelines (e.g., using Jenkins, GitLab CI/CD, or GitHub Actions) for automated testing, building, and deploying updates to production.
- Cloud Provider: Choose a cloud provider (AWS, Azure, Google Cloud) based on your specific needs for scalability, region availability, compliance, and cost.
- Database Management: Use managed database services (e.g., AWS RDS, Azure SQL Database) for data storage, ensuring backups, replication, and scalability are managed effectively.
- Monitoring and Logging: Implement monitoring tools (e.g., Prometheus, Grafana) and centralized logging (e.g., ELK stack, Fluentd) to track application performance, errors, and usage metrics.
- Authentication and Authorization: Implement OAuth2 authentication for securing API endpoints. Use JWT (JSON Web Tokens) for secure token-based authentication.
- Data Encryption: Encrypt data in transit with SSL/TLS and data at rest with storage-level encryption. Use an encryption library (e.g., Python's cryptography package) for sensitive data handling.
- API Rate Limiting: Implement rate limiting to prevent abuse and ensure fair usage of the API.
- Caching: Use caching mechanisms (e.g., Redis) for frequently accessed data to improve response times and reduce load on the database.
- Query Optimization: Optimize database queries through indexing, query-plan analysis, and database performance tuning.
- Asynchronous Tasks: Offload long-running work to background tasks (e.g., FastAPI's BackgroundTasks or Celery) to keep the API responsive.
- Unit Testing: Write comprehensive unit tests using tools like pytest to validate individual components and functions.
- Integration Testing: Perform integration tests to ensure different parts of the application work together correctly.
- Load Testing: Conduct load testing (e.g., using tools like Apache JMeter, Locust) to simulate real-world usage and identify performance bottlenecks.
- API Documentation: Provide clear, comprehensive API documentation using tools like Swagger UI or FastAPI's built-in documentation generator.
- Operational Documentation: Document deployment procedures, environment variables, configuration settings, and troubleshooting guides.
- Health Checks: Implement health checks for the application and its dependencies to monitor availability and performance.
- Version Control: Use version control (e.g., Git) to manage model and code changes, branching strategies, and release management.
- Incident Response: Establish procedures for incident response and monitoring alerts to promptly address and mitigate issues.
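To make the caching suggestion in the list above concrete for this application, the sketch below memoizes embeddings so that a repeated description is not embedded twice. It uses functools.lru_cache as an in-process stand-in for an external cache such as Redis. The compute_embedding function is a hypothetical placeholder for the real tokenizer-plus-model call, and the cache size is an arbitrary assumption.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive step actually runs


def compute_embedding(text: str) -> tuple[float, ...]:
    """Placeholder for the real tokenizer-plus-model call."""
    calls["count"] += 1
    return (float(len(text)), float(text.count(" ")))


@lru_cache(maxsize=1024)
def cached_embedding(text: str) -> tuple[float, ...]:
    """Return the embedding for text, computing it only on a cache miss."""
    return compute_embedding(text)


first = cached_embedding("monthly rent payment")
second = cached_embedding("monthly rent payment")  # served from the cache
```

An lru_cache lives inside a single process, so each worker keeps its own copy and loses it on restart. A shared cache is the usual choice once several workers serve the API.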
