A fine-tuning implementation that teaches LLaMA 3 to mimic personal writing styles. The model analyzes writing samples and generates text that matches the author's characteristic vocabulary, sentence structure, and tone.
Writing Style Analysis: Analyzes the input text for:
- Vocabulary richness and common word usage
- Sentence structure and complexity
- Parts of speech distribution
- Readability metrics (Flesch-Kincaid)
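A minimal sketch of how these measurements can be computed with NLTK (the function below is illustrative, not the project's actual API, and the syllable heuristic is deliberately crude):

```python
import re
from collections import Counter

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def analyze_style(text: str) -> dict:
    sentences = nltk.sent_tokenize(text)
    words = [w for w in nltk.word_tokenize(text) if w.isalpha()]
    pos_counts = Counter(tag for _, tag in nltk.pos_tag(words))
    # Crude syllable estimate: count vowel groups per word (at least one each).
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return {
        # Type-token ratio as a rough vocabulary-richness measure.
        "vocab_richness": len({w.lower() for w in words}) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "pos_distribution": {tag: n / len(words) for tag, n in pos_counts.items()},
        # Flesch-Kincaid grade level.
        "fk_grade": 0.39 * len(words) / len(sentences)
                    + 11.8 * syllables / len(words) - 15.59,
    }
```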
Data Processing:
- Supports multiple input formats (TXT, DOC, DOCX, PDF)
- Converts writing samples into formatted training data
- Implements custom chat templates for LLaMA 3
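For illustration, a single training example in LLaMA 3's chat format might be assembled like this (the system-prompt wording is an assumption, not the project's exact template):

```python
def format_example(style_summary: str, passage: str) -> str:
    # LLaMA 3 instruct format: turns are delimited by header and <|eot_id|> tokens.
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"You write in the following style: {style_summary}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        "Continue writing in the author's voice.<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{passage}<|eot_id|>"
    )
```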
Model Fine-tuning:
- Uses QLoRA/LoRA for efficient fine-tuning
- Implements 4-bit quantization for reduced memory usage
- Integrates WandB for training monitoring
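A sketch of what the 4-bit QLoRA setup looks like with Unsloth (the hyperparameters shown are common defaults, not necessarily the ones this project uses):

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit; QLoRA keeps these weights frozen and
# trains small low-rank adapters on top of them.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projection layers.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```

Passing report_to="wandb" in the Hugging Face TrainingArguments is enough to stream loss and learning-rate curves to the WandB dashboard.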
Evaluation:
- Evaluates style similarity using multiple metrics (vocabulary, POS, sentence length, n-gram overlaps)
- Computes model perplexity
- Compares baseline and fine-tuned models
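As one example of such a metric, n-gram overlap between the original and generated text can be measured with a Jaccard score (illustrative only; the project's evaluator may weight metrics differently):

```python
import nltk

def ngram_overlap(original: str, generated: str, n: int = 2) -> float:
    """Jaccard overlap between the n-gram sets of two texts."""
    def ngrams(text: str) -> set:
        tokens = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
        return set(zip(*(tokens[i:] for i in range(n))))
    a, b = ngrams(original), ngrams(generated)
    return len(a & b) / len(a | b) if a | b else 0.0
```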
Requirements:
- Python 3.9+
- PyTorch
- Transformers
- Unsloth
- NLTK
- CUDA-compatible GPU (recommended)
- Minimum 16GB RAM
- At least 8GB free disk space for model weights
Install the required libraries using:
pip install -r requirements.txt
Running on Google Colab:
If you're using a Jupyter Notebook or Google Colab environment, follow the steps in the Run_in_Notebook.ipynb file.
Running on Desktop:
If you're running the project on your desktop via the terminal, install the latest version of Unsloth directly from its GitHub repository and follow the steps below:
pip install "unsloth[colab-new]@git+https://github.com/unslothai/unsloth.git"
Prepare Your Writing Sample:
Ensure you have a writing sample (minimum 3000 words) in one of the supported formats (TXT, DOC, DOCX, PDF).
Preprocessing and Training: Run the following command to preprocess your data and fine-tune the model:
python main.py --mode train --input_file path/to/sample.txt
This command will:
- Analyze the writing style and create a style summary.
- Prepare the training dataset.
- Fine-tune the LLaMA 3 model using QLoRA with 4-bit quantization.
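For orientation, here is a hypothetical sketch of how main.py might dispatch these two modes (the stub functions stand in for the project's actual pipeline and are not its real API):

```python
import argparse

def run_training(input_file: str) -> None:
    ...  # analyze style, build the dataset, fine-tune (see ModelTrainer below)

def run_inference(prompt: str) -> None:
    ...  # load the fine-tuned adapter and generate (see ModelInference below)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Writing-style fine-tuning CLI")
    parser.add_argument("--mode", choices=["train", "inference"], required=True)
    parser.add_argument("--input_file", help="path to the writing sample")
    parser.add_argument("--prompt", help="prompt text for generation")
    args = parser.parse_args()
    if args.mode == "train":
        run_training(args.input_file)
    else:
        run_inference(args.prompt)
```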
Monitoring:
Check the console logs and the WandB dashboard for training metrics such as loss, learning rate, and accuracy.
Inference:
Generate text in your personalized writing style by providing a prompt:
python main.py --mode inference --prompt "Your prompt text here"
The generated text is printed to the console and saved in the generated_texts folder with a timestamped filename.
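Under the hood, generation with an Unsloth-loaded model typically looks like the following sketch (the adapter path and decoding parameters are assumptions):

```python
from unsloth import FastLanguageModel

# "outputs/lora_adapter" is an assumed save path for the fine-tuned adapter.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/lora_adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast decoding path

inputs = tokenizer("Your prompt text here", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```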
The project includes an evaluation module that measures:
- Style Similarity: Compares the original writing style with the generated text.
- Perplexity: Computes a perplexity score to assess model performance.
- Baseline Comparison: Compares output from the base and fine-tuned models.
To run the evaluation script, use:
python src/evaluation/evaluate.py --original path/to/original.txt --generated path/to/generated.txt --prompt "Your prompt" --output evaluation_results/result.json
Examine the output JSON file for detailed metrics on style similarity, perplexity, and comparison between the models.
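For reference, perplexity is the exponentiated average token-level cross-entropy; with a Hugging Face causal language model it can be computed roughly as follows (an illustrative sketch, not the project's exact evaluator):

```python
import torch

def perplexity(model, tokenizer, text: str) -> float:
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # With labels equal to input_ids, Hugging Face causal LMs return the
        # mean cross-entropy over predicted tokens as .loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()
```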
Model Configuration:
- Base Model: unsloth/llama-3-8b-Instruct
- Training Method: QLoRA fine-tuning
- Quantization: 4-bit for a lower memory footprint
- Max Sequence Length: 2048 tokens
DataPreprocessor
Prepares and analyzes input documents, extracting a style summary.
- analyze_writing_style(text: str): Generates a style summary for the provided text.
ModelTrainer
Handles the fine-tuning of LLaMA 3 using custom datasets.
ModelInference
Loads the fine-tuned model to generate text based on a prompt.
- generate_text(prompt, style_summary): Produces text matching the writing style.
Evaluation Modules
- StyleEvaluator: Computes similarity metrics (vocabulary, POS, sentence length, n-grams).
- PerplexityEvaluator: Calculates model perplexity.
- BaselineComparator: Compares performance between the base and fine-tuned models.
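A hypothetical end-to-end use of these components (import paths, constructors, and the compare() method are assumptions inferred from the names above, not the project's verified API):

```python
# Import paths and method names below are assumed for illustration.
from src.preprocessing import DataPreprocessor
from src.inference import ModelInference
from src.evaluation import StyleEvaluator

sample = open("path/to/sample.txt", encoding="utf-8").read()
style_summary = DataPreprocessor().analyze_writing_style(sample)

generated = ModelInference().generate_text("Your prompt text here", style_summary)
print(StyleEvaluator().compare(sample, generated))  # hypothetical method name
```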
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgments:
- LLaMA Team: For the cutting-edge base model.
- Unsloth: For the optimization tools.
- Open Source Community: For invaluable contributions and feedback.