Group: 33 | Project Number: 8
Faculty Advisor: Dr. Waseem Asghar
Sponsors: Dr. Ali Ibrahim, Dr. Hanqi Zhuang
This project classifies samples as lung cancer or healthy using miRNA expression data and machine learning models. Our goal is to identify robust, interpretable miRNA biomarkers for early diagnosis.
While the primary focus is on binary diagnosis, we also explored:
- Stage: Pathologic stage (I–IV)
- Subtype: Adenocarcinoma vs. Squamous Cell Carcinoma
Stage labels are derived from the `ajcc_pathologic_stage` field and encoded as follows:
- 0 → Healthy/Non-cancerous
- 1 → Cancer Stage I (IA, IB)
- 2 → Cancer Stage II (IIA, IIB)
- 3 → Cancer Stage III (IIIA, IIIB)
- 4 → Cancer Stage IV (IVA, IVB)
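
The stage encoding above can be sketched as a small helper. This is a minimal illustration, not the project's actual preprocessing code: the function name `encode_stage`, the `is_healthy` flag, and the assumed input format (strings such as "Stage IIIA") are all hypothetical.

```python
import re

# Hypothetical helper: maps an AJCC pathologic stage string to the integer
# labels listed above. The "Stage IIIA" input format is an assumption.
_ROMAN_TO_LABEL = {"I": 1, "II": 2, "III": 3, "IV": 4}


def encode_stage(raw, is_healthy=False):
    """Return 0 for healthy samples, 1-4 for stages I-IV, None if unparseable."""
    if is_healthy:
        return 0
    if not raw:
        return None
    # Capture the Roman numeral and ignore an optional A/B/C substage suffix.
    match = re.fullmatch(r"(?:stage\s+)?(iv|iii|ii|i)[abc]?", raw.strip().lower())
    if match is None:
        return None
    return _ROMAN_TO_LABEL[match.group(1).upper()]
```

Returning `None` for unparseable values (rather than guessing a stage) keeps samples with missing staging out of the stage task without affecting the binary diagnosis labels.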
Subtype labels use the following encoding:
- 0 → Healthy
- 1 → Adenocarcinoma
- 2 → Squamous Cell Carcinoma
Both the stage and subtype tasks were strongly affected by class imbalance and overlapping biological signals, so their results are reported as exploratory findings only.
We applied a comprehensive suite of eight feature selection methods to reduce dimensionality and identify stable, high-impact biomarkers:
| Method | Purpose |
|---|---|
| Fold-Change | Detects differentially expressed miRNAs. |
| Chi-Squared | Assesses statistical dependence on labels. |
| Information Gain | Measures reduction in class uncertainty. |
| LASSO | Applies L1 penalty to remove irrelevant features. |
| Recursive Feature Elimination (RFE) | Eliminates least useful features iteratively. |
| Neighborhood Component Analysis (NCA) | Learns a distance metric to weight features. |
| Random Forest Importance | Uses tree-based impurity reduction for ranking. |
| VTFS | Deep learning method using attention weights (Transformer-based). |
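
As an illustration of the simplest method in the table, fold-change filtering can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the project's notebook code: the function names, the pseudocount of 1.0, the threshold of |log2 FC| >= 1, and the toy expression values are all invented for the example.

```python
import math


def log2_fold_change(cancer_values, healthy_values, pseudocount=1.0):
    """log2 ratio of mean expression (cancer over healthy) for one miRNA."""
    mean_cancer = sum(cancer_values) / len(cancer_values)
    mean_healthy = sum(healthy_values) / len(healthy_values)
    # The pseudocount (an assumption) avoids division by zero for silent miRNAs.
    return math.log2((mean_cancer + pseudocount) / (mean_healthy + pseudocount))


def select_by_fold_change(expression, labels, threshold=1.0):
    """Keep miRNAs whose absolute log2 fold change meets the threshold.

    expression: dict mapping miRNA name -> list of values aligned with labels
    labels:     list of 1 (cancer) / 0 (healthy)
    """
    selected = {}
    for name, values in expression.items():
        cancer = [v for v, y in zip(values, labels) if y == 1]
        healthy = [v for v, y in zip(values, labels) if y == 0]
        lfc = log2_fold_change(cancer, healthy)
        if abs(lfc) >= threshold:
            selected[name] = lfc
    return selected


# Toy data, invented for illustration only.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "miR-A": [7.0, 7.0, 7.0, 1.0, 1.0, 1.0],  # higher in cancer
    "miR-B": [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],  # unchanged
}
selected = select_by_fold_change(expression, labels)
```

Here only `miR-A` passes the filter, since its mean expression differs between groups while `miR-B` is flat.
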
VTFS was used for feature selection only. Subsets as small as 1–3 of the miRNAs it selected still yielded high performance when passed to classifiers such as SVM.
SAINT is a transformer-based classifier designed for tabular data. It achieved the highest F1-score and best specificity among all tested models for binary classification.
We trained and evaluated the following classifiers on various tasks:
| Model | Role |
|---|---|
| Support Vector Machine (SVM) | Classical kernel-based model. |
| Random Forest (RF) | Tree-based ensemble classifier. |
| SAINT | Deep transformer-based classifier for tabular input. |
Each model was trained on:
- Diagnosis (primary)
- Stage classification (exploratory)
- Subtype prediction (exploratory)
All models were evaluated using the following metrics:
- Accuracy
- Sensitivity (Recall)
- Specificity
- F1-Score
- Confusion Matrix
All evaluations were performed after correcting for label leakage.
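
The metrics listed above can all be derived from the binary confusion matrix. The sketch below is a plain-Python illustration (the function name `binary_metrics` and the toy labels are assumptions, not project code). It also shows why specificity is reported alongside accuracy: on heavily imbalanced data, a classifier that predicts cancer for every sample scores high accuracy while its specificity is zero.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and F1 (1 = cancer, 0 = healthy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a class is absent.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: true 0, true 1
    }


# Toy example (invented): 99 cancer samples, 1 healthy, everything predicted as cancer.
y_true = [1] * 99 + [0]
y_pred = [1] * 100
metrics = binary_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.99 and sensitivity is 1.0, yet specificity is 0.0 because the single healthy sample is misclassified.
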
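
Diagnosis (cancer vs. healthy):

| Model | Accuracy | Recall | Specificity | F1-Score |
|---|---|---|---|---|
| SVM | 99.0% | 100% | 0% | ~0.99 |
| RF | 99.1% | 100% | 0% | ~0.99 |
| SAINT | 99.5% | 100% | 99.0% | 0.9976 |

SVM and RF pair 100% recall with 0% specificity, which means they labeled every healthy sample as cancer. Their ~99% accuracy therefore reflects the share of cancer samples in the evaluation data rather than an ability to separate the two classes.

Stage classification (exploratory):

| Model | Accuracy | Macro Recall |
|---|---|---|
| SVM | 52.1% | Low |
| RF | 51.6% | Slightly Better |
| SAINT | 59.7% | Moderate |

Subtype prediction (exploratory):

| Model | Accuracy | Recall (SqCC) |
|---|---|---|
| SVM | 49.1% | Collapsed |
| RF | 53.9% | Improved |
| SAINT | 47.0% | Balanced |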
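
Macro recall, reported for the stage task, averages per-class recall so that every class counts equally regardless of its size. The sketch below is a plain-Python illustration with invented labels (the function name `macro_recall` is an assumption); it shows how a model that collapses onto the majority class can keep moderate accuracy while macro recall drops.

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


# Toy stage labels (invented): class 1 dominates, and the model predicts it everywhere.
y_true = [0, 1, 1, 1, 1, 1, 1, 2, 2, 3]
y_pred = [1] * 10

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
macro = macro_recall(y_true, y_pred)
```

Here accuracy is 0.6 but macro recall is only 0.25, because three of the four classes are never predicted.
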
- Apply consensus features in external validations.
- Extend with additional data types (e.g., DNA methylation, proteomics).
- Build visual dashboards for model explainability.
- Expand SAINT and VTFS evaluation across other cancer datasets.
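
One possible way to derive consensus features from the individual selection methods is a simple vote: keep any miRNA chosen by at least a given number of methods. This is a hypothetical sketch, not the project's documented procedure; the function name `consensus_features`, the `min_votes` threshold, and the miRNA names are all invented for illustration.

```python
from collections import Counter


def consensus_features(selections, min_votes):
    """Return features selected by at least `min_votes` methods, sorted by name.

    selections: dict mapping method name -> iterable of selected feature names
    """
    votes = Counter()
    for features in selections.values():
        # set() ensures each method contributes at most one vote per feature.
        votes.update(set(features))
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Toy selections, invented for illustration only.
selections = {
    "fold_change": ["miR-A", "miR-B", "miR-C"],
    "chi_squared": ["miR-A", "miR-C"],
    "lasso": ["miR-A", "miR-D"],
}
consensus = consensus_features(selections, min_votes=2)
```

With these toy inputs the consensus set is `miR-A` and `miR-C`. Raising `min_votes` trades a smaller panel for stronger agreement across methods.
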
# Clone the repo
$ git clone https://github.com/your-repo/lung-cancer-classification.git
$ cd lung-cancer-classification
# Set up environment
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
# Run Feature Selection
$ jupyter notebook feature_selection/fold_change.ipynb
$ jupyter notebook feature_selection/chi_squared.ipynb
...
# Train Models
$ jupyter notebook classification/svm_classifier.ipynb
$ jupyter notebook classification/random_forest_classifier.ipynb
$ jupyter notebook classification/saint_classifier.ipynb