Skip to content

This project focuses on predicting lung cancer risk using a dataset that has been rigorously preprocessed, encompassing over 1,000 records. Through the implementation of multinomial logistic regression and the application of AutoML techniques, the project evaluates more than 50 distinct models.

Notifications You must be signed in to change notification settings

bharathbhaskr/LungCancerPredictor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 

Repository files navigation

Lung Cancer Prediction Project

Overview

This project focuses on predicting lung cancer risk using a dataset that has been rigorously preprocessed, encompassing over 1,000 records. Through the implementation of multinomial logistic regression and the application of AutoML techniques, the project evaluates more than 50 distinct models. Special attention was given to Gradient Boosting Machine (GBM) and XGBoost models due to their robust performance in predictive analytics.

Key Features

Multinomial Logistic Regression: Initiated the analysis with a foundational statistical approach to categorize the data based on various predictors of lung cancer. AutoML Evaluation: Leveraged AutoML to systematically and efficiently assess over 50 distinct predictive models, focusing on performance and scalability. Advanced Model Analysis: Emphasized Gradient Boosting Machine (GBM) and XGBoost for their proven efficacy, guided by a comprehensive suite of metrics including accuracy, AUC, RMSE, and MSE. Performance Metrics: Achieved an accuracy rate surpassing 95%, indicating a high level of precision in lung cancer risk prediction. Attained an Area Under the Curve (AUC) greater than 0.99, showcasing the model's excellent capability in distinguishing between patients with and without lung cancer. Minimized Root Mean Square Error (RMSE) and Mean Squared Error (MSE) to less than 0.05, reflecting the model's accuracy in prediction. Key Predictors Identification: Through meticulous analysis, key predictors of lung cancer risk were identified. Notably, passive smoking and air pollution exposure were highlighted as significant risk factors. For instance, air pollution exposure was associated with a 1.5x increase in lung cancer risk among non-smokers.

Project Structure

Data Preprocessing: Details the steps taken to clean and prepare the dataset for analysis, ensuring the quality and integrity of the data. Model Training and Evaluation: Explains the methodologies used in training the models, along with a detailed comparison of performance metrics to select the top-performing models. Results and Interpretation: Provides a thorough analysis of the results, including the impact of key predictors on lung cancer risk and the statistical significance of the findings. Conclusion: Summarizes the insights gained from the project and suggests directions for future research.

Technologies Used

Python: The primary programming language for data processing and model implementation. Pandas & NumPy: Used for data manipulation and numerical calculations. matplotlib: For visualizing the data and model performance metrics. scikit-learn: Employed for implementing multinomial logistic regression and evaluating model performance. AutoML: Automated machine learning for efficient model evaluation and selection.

How to Use

To replicate this analysis or apply the methodology to a new dataset, follow these steps:

Prepare Your Dataset: Ensure your dataset is in a compatible format (CSV, Excel, etc.) and contains variables relevant to lung cancer risk. Data Preprocessing: Run the preprocessing scripts to clean and prepare your data. Model Training: Execute the model training scripts, selecting your desired algorithms and parameters. Evaluation and Selection: Use the provided evaluation metrics to select the best-performing model for your needs. Prediction: Apply the selected model to make predictions on new data.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

About

This project focuses on predicting lung cancer risk using a dataset that has been rigorously preprocessed, encompassing over 1,000 records. Through the implementation of multinomial logistic regression and the application of AutoML techniques, the project evaluates more than 50 distinct models.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published