This project implements:
- Extractive summarization using TF‑IDF features and multiple classifiers:
  - RandomForest, Naive Bayes, SVM, Logistic Regression, Linear Regression.
- Abstractive summarization using:
  - A Groq‑hosted LLM baseline.
  - A fine‑tuned seq2seq model (e.g., BART) on the CNN/DailyMail dataset.
- Evaluation with sentence‑level classification metrics and document‑level ROUGE‑1 / ROUGE‑L.
The main entrypoint for extractive models is `main.py`.
The LLM fine‑tuning and evaluation script is `llm_cnn.py`.
From the project root:
```bash
cd /Users/shloknanani/Desktop/ai_project

# Using venv (Python 3.x)
python3 -m venv .venv
source .venv/bin/activate   # macOS / Linux
# .venv\Scripts\activate    # Windows PowerShell

python -m pip install --upgrade pip
python -m pip install -r requirements.txt
```

`requirements.txt` includes:
- scikit-learn: classical ML models and metrics
- nltk: sentence tokenization
- datasets: Hugging Face datasets (CNN/DailyMail, etc.)
- rouge-score: ROUGE‑1 / ROUGE‑L metrics
- joblib: saving/loading trained models
- requests: Groq API calls
- transformers, torch: LLM fine‑tuning and inference
The code already tries to download `punkt` if missing, but you can also run:

```bash
python -c "import nltk; nltk.download('punkt')"
```

This project uses:
- The CNN/DailyMail summarization dataset via Hugging Face `abisee/cnn_dailymail`.
- Local lecture notes stored under `data/` (e.g., `lecture_notes.txt`, `lecture_notes_ml_intro.txt`, etc.).
The Hugging Face datasets are downloaded automatically on first use and cached under your HF cache directory.
`main.py` has three modes:

- `train` – train and save an extractive summarization model.
- `test` – load a saved model, evaluate it, and show example summaries.
- `llm` – evaluate a Groq LLM baseline on CNN/DailyMail.
You can also choose the classifier with `--classifier`:

- `rf` (default) – RandomForest
- `nb` – Naive Bayes
- `svm` – Linear SVM
- `logreg` – Logistic Regression
- `linreg` – Linear Regression (regression turned into binary via a threshold)
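To illustrate the `linreg` option, here is a minimal sketch (with toy data, not the project's actual features) of turning a regression model's continuous scores into binary "include in summary" labels via a 0.5 threshold:

```python
# Hedged sketch: threshold a LinearRegression's continuous scores into 0/1 labels.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.1], [0.4], [0.6], [0.9]])  # toy sentence features
y = np.array([0, 0, 1, 1])                  # binary relevance labels

reg = LinearRegression().fit(X, y)
scores = reg.predict(X)                      # continuous predictions
labels = (scores >= 0.5).astype(int)         # threshold into 0/1
```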
From the project root (with your virtualenv activated):
```bash
# Example: RandomForest (default)
python main.py --mode train --classifier rf

# Example: Logistic Regression
python main.py --mode train --classifier logreg
```

This will:

- Load labeled sentence‑level data via `get_training_data` in `src/data_loader.py` (CNN/DailyMail + optional HF dataset / local notes).
- Split into train/test (internally).
- Fit a `TextPreprocessor` (TF‑IDF with English stop words).
- Train the chosen classifier.
- Save:
  - the model to `artifacts/summary_model_<classifier>.joblib` (e.g., `summary_model_rf.joblib`);
  - the TF‑IDF vectorizer to `artifacts/tfidf_preprocessor.joblib`.
```bash
# Evaluate RandomForest model
python main.py --mode test --classifier rf

# Evaluate SVM model
python main.py --mode test --classifier svm
```

This will:

- Load the appropriate model + vectorizer from `artifacts/`.
- Rebuild a train/test split for evaluation and print a classification report (precision, recall, F1 for labels 0/1).
- Compute document‑level ROUGE‑1 and ROUGE‑L on a subset of CNN/DailyMail test articles.
- Generate and print an extractive summary for a local lecture note file (see `lecture_file_to_summarize` in `main.py`).
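The project computes ROUGE with the `rouge-score` package; for intuition about what ROUGE‑1 F1 measures, here is a from‑scratch sketch (whitespace tokenization, no stemming, so numbers will differ slightly from the package's):

```python
# Hedged sketch: ROUGE-1 F1 from scratch, for illustration only.
from collections import Counter

def rouge1_f1(reference, candidate):
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

rouge1_f1("the cat sat on the mat", "the cat was on the mat")  # → 0.8333...
```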
Make sure you’ve run `--mode train` at least once before `--mode test`
so the artifacts exist.
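The extractive summary step ranks sentences by predicted probability and keeps the best ones; a minimal sketch of that ranking (function name illustrative, not the project's exact code):

```python
# Hedged sketch: pick the k highest-probability sentences, restore document order.
import numpy as np

def top_k_summary(sentences, probs, num_sentences=2):
    top = np.argsort(probs)[::-1][:num_sentences]  # indices of best sentences
    return [sentences[i] for i in sorted(top)]     # keep original order

top_k_summary(["A", "B", "C", "D"], [0.1, 0.9, 0.3, 0.8])  # → ["B", "D"]
```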
`main.py --mode llm` evaluates an off‑the‑shelf Groq LLM on CNN/DailyMail.

Set your Groq API key (and, optionally, the model name) as environment variables (shell example):

```bash
export GROQ_API_KEY="your_api_key_here"

# Optional: override default model
export GROQ_MODEL_NAME="llama-3.1-8b-instant"
```

Then run:

```bash
python main.py --mode llm
```

This will:
- Fetch a small subset of CNN/DailyMail test articles.
- Call the Groq API to generate abstractive summaries.
- Compute and print ROUGE‑1 / ROUGE‑L F1 scores for the LLM baseline.
If GROQ_API_KEY is not set, the script will print a warning and skip scoring.
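The Groq call via `requests` can be sketched as below. The endpoint and payload shape are assumptions based on Groq's OpenAI‑compatible chat completions API, and `summarize_with_groq` is an illustrative helper, not the project's exact code:

```python
# Hedged sketch: endpoint/payload assume Groq's OpenAI-compatible API.
import os

import requests

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def summarize_with_groq(article, model=None):
    api_key = os.environ.get("GROQ_API_KEY")
    if not api_key:
        raise RuntimeError("GROQ_API_KEY is not set")
    payload = {
        "model": model or os.environ.get("GROQ_MODEL_NAME", "llama-3.1-8b-instant"),
        "messages": [
            {"role": "user", "content": "Summarize in 3 sentences:\n\n" + article}
        ],
    }
    resp = requests.post(
        GROQ_URL,
        headers={"Authorization": "Bearer " + api_key},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```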
`llm_cnn.py` handles fine‑tuning and evaluating a Hugging Face seq2seq model
(e.g., BART) on CNN/DailyMail.
Example (smaller run for local machines):
```bash
python llm_cnn.py \
    --mode train \
    --model_name facebook/bart-base \
    --output_dir artifacts/llm_cnn \
    --max_train_samples 20000 \
    --max_val_samples 2000 \
    --num_train_epochs 1
```

This will:

- Load the CNN/DailyMail dataset (version `3.0.0`).
- Select up to `max_train_samples` training examples and `max_val_samples` validation examples.
- Tokenize `article` (inputs) and `highlights` (targets).
- Fine‑tune `model_name` using `Seq2SeqTrainer`.
- Save the fine‑tuned model and tokenizer to `output_dir`.
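The tokenization step can be sketched as below. The column names (`article`, `highlights`) come from CNN/DailyMail; the helper name and length limits are illustrative assumptions, and `text_target` is the transformers‑style way of encoding labels:

```python
# Hedged sketch of seq2seq tokenization (helper name and limits are assumptions).
def preprocess(batch, tokenizer, max_input_len=1024, max_target_len=128):
    model_inputs = tokenizer(
        batch["article"], max_length=max_input_len, truncation=True
    )
    labels = tokenizer(
        text_target=batch["highlights"], max_length=max_target_len, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

In `llm_cnn.py` this kind of function would typically be applied with `dataset.map(..., batched=True)`.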
After training, evaluate on a small test subset:
```bash
python llm_cnn.py \
    --mode eval \
    --output_dir artifacts/llm_cnn \
    --split "test[:20]"
```

This will:

- Load the tokenizer and model from `output_dir`.
- Run generation on the specified `split` of CNN/DailyMail.
- Compute and print average ROUGE‑1 and ROUGE‑L F1 scores.
- `main.py`
  - Orchestrates data loading, preprocessing, training/testing of extractive models, the Groq LLM baseline, and ROUGE evaluation.
- `src/data_loader.py`
  - Loads CNN/DailyMail and optional HF summarization datasets.
  - Uses ROUGE‑L overlap with gold summaries to label the top‑k “important” sentences (label 1 vs 0).
  - Loads local lecture notes and manual labels.
- `src/preprocessor.py`
  - `TextPreprocessor` wrapping `TfidfVectorizer(stop_words="english")`.
- `src/model.py`
  - `SummaryModel` wrapping a `RandomForestClassifier` with `class_weight="balanced"`, OOB scoring, etc.
- `src/nb_model.py`, `src/svm_model.py`, `src/logreg_model.py`, `src/linreg_model.py`
  - Wrapper classes for the Naive Bayes, SVM, Logistic Regression, and Linear Regression summary models.
- `src/summarizer.py`
  - `Summarizer` that:
    - Splits a document into sentences.
    - Computes probabilities (if available) or predictions.
    - Ranks sentences by probability and selects the top `num_sentences` as the summary.
- `llm_cnn.py`
  - Fine‑tuning and evaluation script for a seq2seq LLM (e.g., BART) on CNN/DailyMail.
- `data/lecture_notes*.txt`
  - Local lecture notes used as additional training and demo data.
- `requirements.txt`
  - Python dependencies.
- `.gitignore`
  - Ignores `artifacts/`, large Parquet shards, etc., so model checkpoints and dataset caches are not committed.
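The ROUGE‑L‑based sentence labeling in `src/data_loader.py` can be sketched as follows (function names and the score normalization are illustrative assumptions, not the project's exact code):

```python
# Hedged sketch: score sentences by token-level LCS overlap with the gold
# summary, then label the top-k sentences 1 and the rest 0.
def lcs_len(a, b):
    # classic dynamic-programming longest common subsequence over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def label_sentences(sentences, gold_summary, k=2):
    gold = gold_summary.split()
    scores = [lcs_len(s.split(), gold) / max(len(s.split()), 1) for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [1 if i in top else 0 for i in range(len(sentences))]
```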
1. Set up the environment
   - Create a venv, install `requirements.txt`, and ensure `nltk` and `datasets` work.
2. Train an extractive model
   - `python main.py --mode train --classifier rf`
3. Evaluate the extractive model and see example summaries
   - `python main.py --mode test --classifier rf`
4. (Optional) Compare alternative classifiers
   - Train/test with `--classifier nb`, `svm`, `logreg`, `linreg`.
5. (Optional) Run the Groq LLM baseline
   - Set `GROQ_API_KEY`, then run `python main.py --mode llm`.
6. (Optional) Fine‑tune BART on CNN/DailyMail
   - Train with `llm_cnn.py --mode train ...`, then evaluate with `--mode eval`.
8. Utilizing the YouTube Lecture Transcript Dataset: Train the Respective Models and Produce Evaluation Metrics
This uses the environment set up in class. If packages/dependencies are not installed, use pip or conda. Certain installations have been commented out in the .py files themselves.
1. Navigate to the project's home directory.
2. Run `clean_preprocess.py` using the YouTube lecture CSV in the `cited_datasets` directory and the CSV containing the handwritten summaries in the `cleaned_datasets` directory.

   ```bash
   python clean_preprocess.py
   ```

   This cleans and preprocesses the data, generating pickle files for later use.
3. To run the improved logistic regression model and the FFN, run `log_reg_and_ffn.py` using the pickle files found in the `pkl_files` directory.

   ```bash
   python log_reg_and_ffn.py
   ```

   If you want to use your locally generated files from the previous step, modify the source code and change dependencies accordingly, or replace the existing pkl files.
4. Run `rouge_and_summarization.py` using the respective pickle, npy, and pth files found in their respective directories.

   ```bash
   python rouge_and_summarization.py
   ```

   Again, if you want to use your locally generated files from the previous step, modify the source code and change dependencies accordingly, or replace the existing files.
Dataset source: https://www.kaggle.com/datasets/jfcaro/5000-transcripts-of-youtube-ai-related-videos