This project is my practice in basic machine learning through submissions to the Kaggle Titanic competition. The goal of the competition is to predict whether passengers aboard the Titanic survived, based on features such as age, gender, and passenger class.
The dataset can be downloaded from the Kaggle competition page. It consists of two CSV files: train.csv and test.csv. The train.csv file is used to train and build the machine learning model, while test.csv is used to make predictions on unseen data.
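Loading the two files with pandas is the usual first step. A minimal sketch, using a hypothetical in-memory sample that mimics the layout of train.csv (the column names match the competition data, but the rows here are made up):

```python
import io
import pandas as pd

# Hypothetical in-memory sample standing in for the downloaded train.csv;
# in the real project you would pass the file path instead.
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare\n"
    "1,0,3,male,22.0,7.25\n"
    "2,1,1,female,38.0,71.2833\n"
)
train = pd.read_csv(sample)  # e.g. pd.read_csv("train.csv")

# train.csv carries the Survived target column; test.csv does not.
print(train.shape)
print(train.columns.tolist())
```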
- Python 3.7+
- pandas
- numpy
- seaborn
- matplotlib
- missingno
- scikit-learn
- xgboost
- lightgbm
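Assuming a working Python 3.7+ environment, the dependencies above can be installed in one step:

```shell
pip install pandas numpy seaborn matplotlib missingno scikit-learn xgboost lightgbm
```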
- Exploring libraries for reading, analysing, and visualising data.
- Finding and removing outliers and NaNs with different methods (dropping rows or replacing values with the mean).
- Generating new features and encoding categorical features with LabelEncoder.
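The preprocessing steps above can be sketched as follows. The mini-DataFrame is hypothetical (the real project works on the full Titanic columns); the outlier rule shown here is the common 1.5×IQR criterion, one of several possible choices:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-frame standing in for the Titanic training data
df = pd.DataFrame({
    "Age":  [22.0, None, 38.0, 26.0, 80.0],
    "Fare": [7.25, 71.28, 8.05, 520.0, 30.0],
    "Sex":  ["male", "female", "female", "male", "male"],
})

# Replace missing Age values with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Drop Fare outliers using the 1.5 * IQR rule
q1, q3 = df["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["Fare"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()

# Encode the categorical Sex column as integers
df["Sex"] = LabelEncoder().fit_transform(df["Sex"])
print(df)
```

Dropping versus imputing is a trade-off: dropping loses rows (costly on a small dataset like Titanic's), while imputing with the mean can distort the distribution.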
Here are the models I used and their accuracy:
- Random Forest 0.842697
- SVC 0.831461
- XGBoost tuned 0.823970
- KNeighborsClassifier 0.812734
- Logistic Regression 0.801498
- XGBoost 0.801498
- Linear SVC 0.797753
Best public leaderboard score: Random Forest (0.77511)
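A minimal sketch of how such accuracies can be measured with the best-scoring model, Random Forest. Since the real CSVs are not included here, this uses a synthetic dataset as a stand-in for the preprocessed Titanic features; the hyperparameters shown are illustrative, not the project's tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for the preprocessed features
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Estimate accuracy with 5-fold cross-validation
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

The gap between a cross-validation score and the public leaderboard score (0.842697 vs 0.77511 here) is expected: the public test set is unseen data, and some overfitting to the training split is hard to avoid.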
- Kaggle setup
- Kaggle ML competition tutorial
- Data preprocessing and model fitting tutorial
- ChatGPT Assistance
- Learn more about feature encoding;
- Tune the remaining untuned models;
- Refactor the code.