- Purpose: Stream audio from the client to the backend service.
- WebSocket Protocol: Allows for bidirectional communication between the client and server.
- Audio Streaming: Client sends raw audio data every 250 milliseconds.
- Transcription Frequency: Transcriptions are performed every second (4 chunks) to achieve near real-time speech-to-text conversion.
- Results: Return the transcribed text to the client in real time (see the sketch below).
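A minimal sketch of this loop, assuming the `websockets` package (>= 11, single-argument handler), openai-whisper, and 16 kHz mono 16-bit PCM audio from the client; the port and handler name are placeholders:

```python
import asyncio
import numpy as np
import websockets
import whisper

model = whisper.load_model("base")     # base/small to start, per the notes below
CHUNKS_PER_TRANSCRIPTION = 4           # 4 x 250 ms = 1 s of audio

async def handle_client(ws):
    buffer = []
    async for message in ws:           # each message: 250 ms of raw PCM bytes
        buffer.append(message)
        if len(buffer) == CHUNKS_PER_TRANSCRIPTION:
            # Whisper expects float32 samples in [-1, 1] at 16 kHz.
            pcm = np.frombuffer(b"".join(buffer), dtype=np.int16)
            result = model.transcribe(pcm.astype(np.float32) / 32768.0, fp16=False)
            await ws.send(result["text"])  # stream text back in real time
            buffer.clear()

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()         # run until cancelled

asyncio.run(main())
```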
- Purpose: Perform speech-to-text recognition.
- Model: Whisper, for efficient and accurate transcription.
- Model Size: Start with base or small.
- Deployment: Run locally on the host machine to avoid using OpenAI's API (quite expensive?)
- Other option: Hugging Face's model (sketch below)
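If we take the Hugging Face route, a rough sketch using the transformers ASR pipeline; the model ID, device, and file name are assumptions:

```python
from transformers import pipeline

# Whisper via Hugging Face transformers instead of openai-whisper.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",   # or whisper-base, per the sizing note
    device=0,                       # CUDA GPU index; use "cpu" without a GPU
)
print(asr("sample.wav")["text"])    # "sample.wav" is a placeholder recording
```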
- Purpose: Perform error analysis on the transcriptions after the user has finished speaking.
- Output: Generate an error report and provide recommendations for improvement.
- Model Size: Start with 8B now; consider fine-tuning or upgrading to 70B later.
- Other option: Hugging Face's model
- Open question: how to handle poorly transcribed inputs gracefully (the pipeline sketch below includes a crude guard).
- Purpose: Orchestrate the NLP tasks and manage the workflow.
- Functionality: Chain Whisper and Llama 3 into a pipeline that first transcribes the audio with Whisper, then analyzes the text with Llama 3 (see the sketch after this list).
- Prompt Engineering: Craft prompts for Llama 3 to ensure accurate and relevant error analysis.
- RAG (Retrieval-Augmented Generation): Future integration with a customized curriculum database for enhanced feedback and learning resources.
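A minimal sketch of the Whisper-to-Llama 3 chain, assuming a local Llama 3 served by ollama (`pip install ollama`, `ollama pull llama3`); the prompt wording and the short-transcript guard are illustrative, and RAG retrieval would slot in before the LLM call once the curriculum database exists:

```python
import whisper
import ollama

whisper_model = whisper.load_model("base")

def analyze_recording(audio_path: str) -> str:
    # Step 1: Whisper turns the recording into text.
    transcript = whisper_model.transcribe(audio_path)["text"]
    if len(transcript.split()) < 3:
        # Crude guard for poorly transcribed input (open question above).
        return "Transcription too short or unclear to analyze."
    # Step 2: Llama 3 analyzes the transcript for errors.
    prompt = (
        "You are an English tutor. Analyze this transcribed spoken answer "
        "for grammar errors and suggest improvements:\n\n" + transcript
    )
    response = ollama.chat(model="llama3",
                           messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]
```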
- Connection: Accept any WebSocket connection.
- JWT Token: Require the client to send their JWT token as the first message.
- Verification: Authenticate the token with the backend.
- Access Control: Allow further communication only if the JWT token is valid (handshake sketch below).
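A sketch of that first-message check, assuming PyJWT (`pip install pyjwt`) and an HS256 secret shared with the backend; SECRET_KEY and the close code are placeholders:

```python
import jwt  # PyJWT

SECRET_KEY = "replace-me"  # placeholder; load from config in practice

async def authenticate(ws) -> bool:
    token = await ws.recv()  # the first message must be the JWT
    try:
        jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
        return True          # valid token: allow further communication
    except jwt.InvalidTokenError:
        await ws.close(code=4401, reason="invalid token")
        return False
```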
- Encryption: Use SSL/TLS to secure the WebSocket connection.
- Certificates: Implement proper SSL certificates.
- Configuration: Configure the WebSocket server to use secure WebSockets (wss://), as sketched below.
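With the `websockets` server, enabling wss:// might look like this; the certificate paths are placeholders, and handle_client is the handler from the streaming sketch above:

```python
import asyncio
import ssl
import websockets

ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ssl_context.load_cert_chain("fullchain.pem", "privkey.pem")  # placeholder paths

async def main():
    # Passing an SSL context makes the server speak wss:// instead of ws://.
    async with websockets.serve(handle_client, "0.0.0.0", 8765, ssl=ssl_context):
        await asyncio.Future()
```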
- The frontend should display the 10 easy questions.
- Should capture the user's answers in real time with Whisper.
- Should only save the final transcriptions for error analysis.
- Should count the number of words spoken and record it.
- When users are done answering the questions:
- The 10 questions will be used as part of the prompt to give Llama 3 more context.
- The Llama 3-generated report should include (see the prompt sketch after this list):
- Grammar errors (e.g., you should have used "is" instead of "am")
- A check that the answers to the questions actually make sense
- Suggestions for improvement
- Focus on past tense, subject-verb agreement, etc.
- Should send the word count to the backend for record-keeping.
- Attach date too?
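A sketch of how that prompt might pack the questions, answers, and focus areas together; the function name and wording are illustrative:

```python
def build_report_prompt(questions: list[str], answers: list[str]) -> str:
    # Pair each question with its final transcription for context.
    qa_block = "\n".join(
        f"Q{i + 1}: {q}\nA{i + 1}: {a}"
        for i, (q, a) in enumerate(zip(questions, answers))
    )
    return (
        "You are an English tutor. The student answered these questions aloud; "
        "the answers were transcribed automatically and may contain noise.\n\n"
        f"{qa_block}\n\n"
        "Produce a report that:\n"
        "1. Lists grammar errors with corrections (e.g., 'you should have "
        "used \"is\" instead of \"am\"').\n"
        "2. Checks whether each answer actually makes sense for its question.\n"
        "3. Suggests improvements, focusing on past tense and subject-verb "
        "agreement."
    )
```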
- Set up the Python environment
- Download CUDA 12.1 (PyTorch doesn't support 12.5, the latest version, yet):
https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exe_local
- Download the CUDA build of PyTorch:
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
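- Sanity check (optional): confirm the CUDA build can see the GPU; this should print True:
python -c "import torch; print(torch.cuda.is_available())"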
- Log in to huggingface_hub using the CLI tool:
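huggingface-cli login
(this prompts for an access token from the Hugging Face account settings)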
- Getting this warning:
UserWarning: 1Torch was not compiled with flash attention.
Maybe it's because FlashAttention-2 isn't available on Windows yet; Linux is faster (Windows Subsystem for Linux?)
- Download ffmpeg (https://phoenixnap.com/kb/ffmpeg-windows), then install the Python bindings:
pip install ffmpeg-python
- Quantization doesn't work on Mac because it's not compatible with CUDA
- BitsAndBytes needs CUDA
- Instead, we could use a GGUF model (which is apparently Mac-friendly)
- We could also use mlx_lm, or just ollama :) (rough sketch below)
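e.g., a minimal mlx_lm sketch for Apple Silicon; the `pip install mlx-lm` step and the mlx-community model ID are assumptions:

```python
from mlx_lm import load, generate

# Load a 4-bit quantized Llama 3 that runs on Apple Silicon without CUDA.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
print(generate(model, tokenizer,
               prompt="Correct this sentence: I is happy.",
               max_tokens=100))
```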