
added 3 files #10

Open · wants to merge 3 commits into main
84 changes: 84 additions & 0 deletions APPROACH.md
@@ -0,0 +1,84 @@
# Documentation: Generating Meeting Transcript Data for Model Development

## Introduction

This documentation outlines the process followed to generate additional meeting transcript data for model development. The goal is to expand the existing dataset and provide a more comprehensive training sample.

## Steps for Generating Meeting Transcript Data
1. Identifying the Need: Recognize the requirement to expand the dataset and generate more meeting transcript data.

2. Data Format: Use the CSV format with columns: start_time, end_time, speaker, and text.

3. Dialogue Generation: Create artificial dialogue lines for the meeting transcript (a small generation sketch follows these steps).

Example:

"00:00:00","00:00:15","Alice","Good afternoon, everyone!"
"00:00:15","00:00:40","Bob","Good afternoon, Alice. How was your weekend?"
"00:00:40","00:01:00","Alice","It was great, Bob. How about yours?"

4. Time Stamps: Assign hypothetical time stamps in "HH:MM:SS" format to each dialogue line.

Example:

"00:00:00","00:00:15","Alice","Good afternoon, everyone!"

5. Speaker Attribution: Use short forms for speaker names, such as Alice (A), Bob (B), and Carol (C), to differentiate between speakers.

Example:

"00:00:15","00:00:40","B","Good afternoon, Alice. How was your weekend?"

6. Text Crafting: Generate text for each speaker, using the speaker short forms, to simulate a meeting discussion.

Example:

"00:00:15","00:00:40","B","Good afternoon, Alice. How was your weekend?"

7. Relevance and Coherence: Ensure the generated dialogue lines remain coherent and relevant, even when speakers are referenced by their short forms.

Example:

"00:00:40","00:01:00","A","It was great, B. How about yours?"
## Limitations and Considerations
1. Artificial Nature of the Data: Generated meeting transcript data is synthetic and doesn't represent real meetings.

2. Limited Variability: The provided data sample is relatively small and may not cover the full range of real meeting conversations.

3. Contextual Considerations: The data lacks specific meeting topics, agendas, or participant roles.

4. Training Data Quality: Manual verification/validation of the generated text was not performed (a small validation sketch follows this list).
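
One way to address point 4 would be an automated sanity check over the generated rows. A minimal sketch, assuming the CSV layout above and the `transcript.csv` file included in this pull request:

```python
# Hypothetical validation pass: checks the "HH:MM:SS" timestamp format and that
# each dialogue line starts when the previous one ends (true of the sample data).
import csv
import re

TIME_RE = re.compile(r"^\d{2}:\d{2}:\d{2}$")

def validate(path="transcript.csv"):
    problems, previous_end = [], None
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            if len(row) != 4:  # skip blank lines and "//meeting N" separators
                previous_end = None
                continue
            start, end, _speaker, _text = row
            if not (TIME_RE.match(start) and TIME_RE.match(end)):
                problems.append((lineno, "bad timestamp format"))
            if previous_end is not None and start != previous_end:
                problems.append((lineno, "gap or overlap with the previous line"))
            previous_end = end
    return problems

print(validate())
```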

## What did you stop considering, and why?

1. Computationally expensive: Using an NLP library or a machine learning model to extract action items would be too computationally expensive for large meeting transcripts.

2. Requires a large amount of training data: Using a machine learning model to extract action items from a meeting transcript would require a large amount of training data, which I did not have access to.

Instead, I decided to use a simple regular expression to extract action items from a meeting transcript. This approach is not as sophisticated as an NLP library or a machine learning model, but it is much simpler and more efficient. A minimal sketch of this idea is shown below.
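
The pattern and helper below are illustrative assumptions, not the exact code submitted in `ml_model_1.py`; they flag utterances containing common commitment phrases and treat the speaker of the line as the assignee:

```python
# Illustrative regex-based extractor for action items.
import re

ACTION_PATTERN = re.compile(r"\b(I'll|I will|we need to|let's|please)\b", re.IGNORECASE)

def extract_action_items(rows):
    """rows: iterable of (start_time, end_time, speaker, text) tuples."""
    items = []
    for start, _end, speaker, text in rows:
        if ACTION_PATTERN.search(text):
            items.append({"text": text, "assignee": speaker, "ts": start})
    return items

print(extract_action_items([
    ("00:07:00", "00:07:30", "Bob", "I'll coordinate with the development team to kickstart the project."),
]))
```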

## What did you pick first, and why?

I picked this approach first because I am already familiar with the process.

## LIMITATIONS

The current approach in the code above has a few limitations:

1. Language Dependency: The code relies on a BERT model that is primarily trained on English-language data. It may not perform optimally on transcripts or texts in other languages.

2. Contextual Understanding: The code treats each sentence or phrase independently for classification. It does not take into account the context of the entire conversation or the relationships between different statements. This may lead to misclassifications or missing action items that require understanding the conversation as a whole.

3. Action Item Detection: The code relies on a text classification model to identify action items. While it can provide reasonable predictions, it may not be perfect in distinguishing between action items and regular conversation. There is a possibility of false positives or false negatives in the extracted action items.

4. Fine-tuning and Customization: The code uses a pre-trained BERT model that is not fine-tuned specifically for action item extraction from meeting transcripts. Fine-tuning the model on a specific domain or dataset may yield better results. Additionally, the code may require customization to handle specific nuances or variations in meeting transcripts.

5. Performance and Scalability: The code processes the meeting transcripts sequentially and may not be optimized for large-scale or real-time processing of extensive transcripts. For scenarios involving a substantial amount of data or frequent updates, the code may need to be optimized for better performance and scalability (a batched-inference sketch follows this list).

6. False Positives and Assignee Identification: The code assumes that action items are correctly identified based on the text classification model. However, there is a possibility of false positives, where non-action items are classified as action items. Additionally, the code assigns action items to assignees based on simple heuristics or patterns in the text. It may not accurately determine the intended assignee, especially in complex or ambiguous situations.
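
Regarding limitation 5, one possible optimization is to batch several utterances through the model in a single forward pass. A minimal sketch, using the same pre-trained model and tokenizer as `ml_model_1.py`:

```python
# Hypothetical batched inference: classify several utterances per forward pass
# instead of calling the model once per line.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def classify_batch(texts, batch_size=16):
    """texts: list of utterance strings; returns a list of predicted labels."""
    labels = []
    model.eval()
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        labels.extend(torch.argmax(logits, dim=1).tolist())
    return labels
```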

In order to train and test machine learning models that can analyze meeting transcripts, it is necessary to gather a large dataset of meeting transcripts. This data can be gathered from a variety of sources, including:

- Publicly available meeting transcripts: A number of websites make meeting transcripts publicly available for download. These transcripts can be used to train and test machine learning models without the need to obtain consent from the participants in the meetings.

- Internal meeting transcripts: Many organizations record and transcribe their meetings. These transcripts can also be used, but it is important to obtain consent from the participants in the meetings before using their data.

- Transcripts created by researchers: Researchers have created a number of meeting-transcript datasets for training and testing machine learning models. These datasets are often more difficult to obtain than publicly available transcripts, but they can provide a more diverse and representative sample.

Once a dataset of meeting transcripts has been gathered, it is important to clean and prepare the data for machine learning. This process may involve:

- Removing identifying information: Remove any identifying information from the transcripts, such as the names of the participants, the organizations they work for, and the dates and times of the meetings (a small anonymization sketch follows this list).

- Standardizing the format of the transcripts: The format may vary from meeting to meeting, so standardize it to make the transcripts easy to process with machine learning algorithms.

- Labeling the transcripts: If the goal is to train a model to identify the speaker, the topic of discussion, or the sentiment of the conversation, label the transcripts with this information, either manually or automatically using natural language processing techniques.
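
As noted in the first bullet above, identifying information can be stripped automatically. A minimal sketch over `transcript.csv` (the participant list and the output file name are assumptions for illustration):

```python
# Hypothetical anonymization pass: replace speaker names with neutral IDs and
# mask in-text mentions of the known participants before training.
import csv
import re

PARTICIPANTS = ["Alice", "Bob", "Carol"]
NAME_PATTERN = re.compile(r"\b(" + "|".join(PARTICIPANTS) + r")\b")

with open("transcript.csv", newline="") as src, \
        open("transcript_anonymized.csv", "w", newline="") as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    for row in csv.reader(src):
        if len(row) != 4:  # skip blank lines and "//meeting N" separators
            continue
        start, end, speaker, text = row
        speaker_id = (f"SPEAKER_{PARTICIPANTS.index(speaker) + 1}"
                      if speaker in PARTICIPANTS else "SPEAKER_X")
        writer.writerow([start, end, speaker_id, NAME_PATTERN.sub("SPEAKER", text)])
```
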
117 changes: 117 additions & 0 deletions SUBMISSION.md
@@ -0,0 +1,117 @@

# ML MODEL


## Video Of Running Model
https://www.loom.com/share/f6cb0537486a4e6bb703ef42fb1061a1?sid=a7b98369-0d14-45fb-ade5-11c4f055702b


## Description
The project aims to extract action items from meeting transcripts using a pre-trained BERT model for text classification. It identifies sentences or phrases that represent tasks to be done and assigns them to the appropriate assignee. The output is a list of action items with their corresponding assignees.

## Prerequisites
- Python (version 3.11.1 or higher)
- Libraries:
- torch
- transformers

## Installation
1. Install Python:
- Download and install Python from the official Python website (python.org).

2. Install the required libraries:
- Open a terminal or command prompt.
- Run the following commands:
```
pip install torch
pip install transformers
```
## Running the Code in a Python Virtual Environment

To run the code in a Python virtual environment, follow these steps:

1. Set up a virtual environment: Open a terminal or command prompt and navigate to your project directory. Create a new virtual environment by running the following command:
```
python -m venv myenv
```

2. Activate the virtual environment: Activate the virtual environment by running the appropriate command based on your operating system:
- For Windows:
```
myenv\Scripts\activate
```
- For macOS/Linux:
```
source myenv/bin/activate
```

3. Install the necessary libraries: Make sure you have the required libraries installed. In this case, you need to have Python installed along with the `pandas` library for data manipulation and the `torch` and `transformers` libraries for BERT model usage. You can install them using `pip` by running the following command:
```
pip install pandas torch transformers
```

4. Set up the code: Copy the provided code into a Python file such as `ml_model_1.py` (the name used in this repository), using a text editor or an integrated development environment (IDE) like Visual Studio Code or PyCharm.

5. Run the code: Execute the script from the activated environment:
```
python ml_model_1.py
```

6. View the output: The code will process the meeting transcripts and extract the action items. The extracted action items will be printed to the console or terminal as dictionaries in the format `{"text": ..., "assignee": ..., "ts": ...}`. Review the printed output to see the identified action items.

Ensure that you have the necessary permissions and access to the required data or files mentioned in the code. Modify the code as needed to fit your specific use case or requirements.

Remember to deactivate the virtual environment once you're done by running the command `deactivate` in the terminal or command prompt.



## Usage
1. Clone or download the project repository.

2. Prepare the Meeting Transcripts:
- Open the project folder.
- Locate the `transcript.csv` file (note that `ml_model_1.py` currently uses its own hard-coded `transcripts` list, so update that list as well if you change the data; a pandas loading sketch follows this section).
- Update the file with the meeting transcripts in the following format:
```
"start_time","end_time","speaker","text"
"00:00:00","00:00:15","Alice","Good afternoon, everyone!"
"00:00:15","00:00:40","Bob","Good afternoon, Alice. How was your weekend?"
...
```
- Save the file with the updated meeting transcripts.

3. Run the Code:
- Open a terminal or command prompt.
- Navigate to the project folder.
- Execute the following command:
```
python ml_model_1.py
```

4. Review the Results:
- The extracted action items will be displayed in the console or output window.
- Each action item will be shown in the format: `{"text": ..., "assignee": ..., "ts": ...}`.
- The `ts` field holds the start timestamp of the utterance in which the action item was detected.
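
For reference, here is a minimal sketch of loading `transcript.csv` with pandas (assuming the file keeps the layout shown in step 2 but without a header row, as in the sample data in this pull request); each row's text could then be passed to `classify_action_item` from `ml_model_1.py`:

```python
# Minimal loading sketch: read the transcript CSV into a DataFrame.
import pandas as pd

df = pd.read_csv(
    "transcript.csv",
    names=["start_time", "end_time", "speaker", "text"],
    comment="/",  # skips the "//meeting N" separator lines (assumes "/" never appears inside utterances)
)

for _, row in df.iterrows():
    print(row["start_time"], row["speaker"], row["text"])
```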

## Customization and Further Steps
- Customize the code to meet specific requirements, such as modifying the text classification model or integrating with other tools.
- Explore enhancements, such as fine-tuning the model for improved accuracy (a minimal fine-tuning sketch follows this section) or integrating with project management tools.
- Refer to the code comments for guidance on customizing or extending the functionality.
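
As mentioned above, fine-tuning could improve accuracy. A minimal sketch using the Hugging Face `Trainer` follows; the two labelled examples stand in for a real labelled dataset, which this repository does not include:

```python
# Hypothetical fine-tuning sketch: trains the classification head on a tiny
# in-memory dataset; a real run would need many labelled (text, label) pairs.
import torch
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

texts = [
    "I'll coordinate with the development team to kickstart the project.",  # action item
    "Good afternoon, everyone!",                                            # not an action item
]
labels = [1, 0]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

class ActionItemDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-action-items",
                           num_train_epochs=3, per_device_train_batch_size=8),
    train_dataset=ActionItemDataset(texts, labels),
)
trainer.train()
model.save_pretrained("finetuned-action-items")
tokenizer.save_pretrained("finetuned-action-items")
```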


## Troubleshooting
- If encountering any errors, ensure that all prerequisites are properly installed.
- Verify that the meeting transcripts are correctly formatted in the `transcript.csv` file (and in the hard-coded `transcripts` list in `ml_model_1.py`).

## License
This project is licensed under the MIT License.

## Acknowledgments
We would like to acknowledge the following resources and projects that contributed to the development of this project:

- BERT - Pre-trained model for text classification.
- Transformers - Library for natural language processing tasks.


## References
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- The Illustrated Transformer
- Hugging Face Transformers Documentation
67 changes: 67 additions & 0 deletions ml_model_1.py
@@ -0,0 +1,67 @@
import re
import torch
from transformers import BertTokenizer, BertForSequenceClassification
# from transformers import RobertaTokenizer, RobertaForSequenceClassification
# Note: the num_labels=2 classification head is randomly initialised here (no fine-tuning), so predictions are not meaningful out of the box.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model.eval()  # inference only

def classify_action_item(text):
    """Classify a single utterance; returns (predicted_label, class probabilities)."""
    inputs = tokenizer.encode_plus(text, add_special_tokens=True, truncation=True, padding='max_length', max_length=512, return_tensors='pt')
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    probabilities = torch.softmax(logits, dim=1).squeeze().tolist()
    predicted_label = torch.argmax(logits, dim=1).item()

    return predicted_label, probabilities
transcripts = [
["00:00:00", "00:00:15", "Alice", "Good afternoon, everyone!"],
["00:00:15", "00:00:40", "Bob", "Good afternoon, Alice. How was your weekend?"],
["00:00:40", "00:01:00", "Alice", "It was great, Bob. How about yours?"],
["00:01:00", "00:01:30", "Bob", "I had a relaxing weekend, thanks for asking."],
["00:01:30", "00:02:00", "Carol", "Hello, team! We have some exciting news to share."],
["00:02:00", "00:02:45", "Alice", "Hello, Carol. Please go ahead."],
["00:02:45", "00:03:30", "Carol", "We have secured a new client for our project!"],
["00:03:30", "00:04:00", "Bob", "That's fantastic, Carol! Who is the new client?"],
["00:04:00", "00:04:30", "Carol", "The new client is a leading tech company in our industry."],
["00:04:30", "00:05:00", "Alice", "That's a significant achievement for our team."],
["00:05:00", "00:05:30", "Carol", "Absolutely, Alice. We need to start preparations immediately."],
["00:05:30", "00:06:00", "Bob", "What's the timeline for the project, Carol?"],
["00:06:00", "00:06:30", "Carol", "We have six months to complete the project successfully."],
["00:06:30", "00:07:00", "Alice", "Let's ensure we allocate resources accordingly for a timely delivery."],
["00:07:00", "00:07:30", "Bob", "I'll coordinate with the development team to kickstart the project."],
["00:07:30", "00:08:00", "Carol", "Great! Let's have a follow-up meeting next week to discuss further details."],
["00:08:00", "00:08:15", "Alice", "Sounds good, Carol."],
["00:08:15", "00:08:30", "Bob", "I'll have a look at the project timeline shortly."],
["00:08:30", "00:08:45", "Carol", "Thanks, Bob. I'll see you all next week."],
["00:08:45", "00:09:00", "Alice", "Have a great day, everyone!"],
["00:09:00", "00:09:15", "Bob", "You too, Alice. See you next week."],
["00:09:15", "00:09:30", "Carol", "Have a great day, Alice."],
["00:09:30", "00:09:45", "Bob", "You too, Carol. See you next week."],
]

action_items = []

for transcript in transcripts:
    text = transcript[3]
    # Heuristic assignee guess: first word immediately followed by a comma
    assignee = re.findall(r"([A-Za-z]+),", text)
    assignee = assignee[0] if assignee else "UNKNOWN"
    predicted_label, probabilities = classify_action_item(text)

    if predicted_label == 1:  # Action item predicted
        # Keep the leading alphabetic span of the utterance as the action text
        action_item = re.findall(r"[A-Za-z\s]+", text)
        action_item = action_item[0].strip() if action_item else ""

        if action_item:
            action_items.append({"text": action_item, "assignee": assignee, "ts": transcript[0]})

for item in action_items:
    print(item)




57 changes: 57 additions & 0 deletions transcript.csv
@@ -0,0 +1,57 @@
//meeting 1

"00:00:00","00:00:15","Alice","Good afternoon, everyone!"
"00:00:15","00:00:40","Bob","Good afternoon, Alice. How was your weekend?"
"00:00:40","00:01:00","Alice","It was great, Bob. How about yours?"
"00:01:00","00:01:30","Bob","I had a relaxing weekend, thanks for asking."
"00:01:30","00:02:00","Carol","Hello, team! We have some exciting news to share."
"00:02:00","00:02:45","Alice","Hello, Carol. Please go ahead."
"00:02:45","00:03:30","Carol","We have secured a new client for our project!"
"00:03:30","00:04:00","Bob","That's fantastic, Carol! Who is the new client?"
"00:04:00","00:04:30","Carol","The new client is a leading tech company in our industry."
"00:04:30","00:05:00","Alice","That's a significant achievement for our team."
"00:05:00","00:05:30","Carol","Absolutely, Alice. We need to start preparations immediately."
"00:05:30","00:06:00","Bob","What's the timeline for the project, Carol?"
"00:06:00","00:06:30","Carol","We have six months to complete the project successfully."
"00:06:30","00:07:00","Alice","Let's ensure we allocate resources accordingly for a timely delivery."
"00:07:00","00:07:30","Bob","I'll coordinate with the development team to kickstart the project."
"00:07:30","00:08:00","Carol","Great! Let's have a follow-up meeting next week to discuss further details."
"00:08:00","00:08:15","Alice","Sounds good, Carol."

//meeting 2

"00:08:15","00:08:30","Bob","Before we end this meeting, let's clarify the action items."
"00:08:30","00:09:00","Carol","Good point, Bob. Alice, can you take note of the action items?"
"00:09:00","00:09:30","Alice","Of course, Carol. I'll make sure we capture all the necessary tasks."
"00:09:30","00:10:00","Bob","One action item is to finalize the project budget by the end of the week."
"00:10:00","00:10:30","Carol","Agreed. Another action item is to schedule a kickoff meeting with the new client."
"00:10:30","00:11:00","Alice","Got it. I'll coordinate with the client to find a suitable time for the kickoff meeting."
"00:11:00","00:11:30","Bob","We should also assign team leads for different project tasks."
"00:11:30","00:12:00","Carol","Yes, let's identify the team leads and inform them about their responsibilities."
"00:12:00","00:12:30","Alice","I'll create a task assignment document and share it with the team leads."
"00:12:30","00:13:00","Bob","Lastly, we need to prepare a project timeline and share it with the entire team."
"00:13:00","00:13:30","Carol","Agreed. Alice, please make sure the timeline includes all major milestones."
"00:13:30","00:14:00","Alice","Will do, Carol. I'll compile all the action items and circulate them to the team."
"00:14:00","00:14:15","Bob","Great! Once we have the action items documented, we can adjourn the meeting."
"00:14:15","00:14:30","Carol","Thank you all for your contributions. Let's make this project a success!"


//meeting 3

"17:00:00","17:00:15","Alice","Good evening, everyone!"
"17:00:15","17:00:40","Bob","Good evening, Alice. How was your day?"
"17:00:40","17:01:00","Alice","It was productive, Bob. How about yours?"
"17:01:00","17:01:30","Bob","I had a busy day, but I managed to accomplish my tasks."
"17:01:30","17:02:00","Carol","Hello, team! We have some updates to discuss."
"17:02:00","17:02:45","Alice","Hello, Carol. Please go ahead."
"17:02:45","17:03:30","Carol","We have received feedback from the client regarding our project."
"17:03:30","17:04:00","Bob","That's important, Carol! What are the key points from the feedback?"
"17:04:00","17:04:30","Carol","The client expressed satisfaction with our progress so far."
"17:04:30","17:05:00","Alice","That's great to hear. Did they mention any specific areas for improvement?"
"17:05:00","17:05:30","Carol","Yes, they suggested enhancing the user interface for better user experience."
"17:05:30","17:06:00","Bob","We should prioritize addressing their suggestions in the upcoming tasks."
"17:06:00","17:06:30","Carol","Absolutely, Bob. I'll assign the UI improvements to the design team."
"17:06:30","17:07:00","Alice","Let's ensure we incorporate the client's feedback effectively."
"17:07:00","17:07:30","Bob","I'll coordinate with the development team to implement the necessary changes."
"17:07:30","17:08:00","Carol","Great! Let's schedule a meeting next week to review the updated UI."
"17:08:00","17:08:15","Alice","Sounds good, Carol. We should also discuss the project timeline."