Introduction to machine learning projects

This repository consists of four projects that were part of the Introduction to Machine Learning lecture given by Prof. Dr. Andreas Krause and Prof. Dr. Fan Yang at ETH Zurich in the spring semester of 2023. The projects counted for 30% of the final course grade; the remaining 70% came from a final written exam. Full details of the course can be found on the official webpage.

The projects were done in groups of up to three students, and all of the code in this repository was written by our group, consisting of Benjamin Dupont and Mateo Hamel. The code was written in Python using libraries such as numpy, pandas, scikit-learn, and PyTorch, sometimes starting from a base template provided by the course staff.

Summary of Projects

  1. Project 1: Linear Regression

    • Implemented cross-validation for ridge regression and explored feature transformations to improve model performance.
  2. Project 2: Prediction of electricity price in Switzerland

    • Developed a model to predict electricity prices in Switzerland using price data from other countries across different seasons of the year. Implemented an imputer to fill in missing data and a Gaussian process regressor to predict the price on a test set.
  3. Project 3: Classification of food preferences

    • Created a classification model to determine food preferences based on image triplets. The project involved extracting embeddings from a pre-trained model and training a custom classifier neural network to determine for each image triplet (A, B, C) if A was closer to B or to C.
  4. Project 4: Predicting HOMO-LUMO Gaps for Solar Cell Material Discovery

    • Developed a pipeline combining transfer learning and an autoencoder-based feature extraction to predict HOMO-LUMO gaps for solar cell materials. Leveraged a large pretraining dataset for embedding generation and fine-tuned a Ridge regression model on a limited labeled dataset for accurate predictions.

Project 1: Linear Regression

This first project was an introduction to linear regression and ridge regression and was subdivided into two tasks:

1a - Cross-validation for ridge regression

Implemented K-fold cross-validation with 10 folds to evaluate ridge regression for 5 different values of the regularization parameter lambda. For each lambda, the data was split into training and validation folds, the model was fit on the training folds and evaluated on the held-out fold, and the resulting RMSE was averaged over the 10 folds. The lambda with the lowest average RMSE was selected.
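
Below is a minimal, self-contained sketch of this procedure (the lambda grid and the stand-in data are illustrative; the actual values and CSV loading live in the task1a code):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Lambda grid is illustrative; the task prescribed its own 5 values.
lambdas = [0.1, 1.0, 10.0, 100.0, 200.0]

# Stand-in data shaped like the task's CSV (y plus 13 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 13))
y = X @ rng.normal(size=13)

kf = KFold(n_splits=10)
avg_rmse = []
for lam in lambdas:
    fold_rmse = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], pred)))
    avg_rmse.append(np.mean(fold_rmse))

best_lambda = lambdas[int(np.argmin(avg_rmse))]  # lambda minimizing average RMSE
```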

Training Data Format

The training data is provided in a CSV format with the following columns:

y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13
  • y: Target variable
  • x1, x2, ..., x13: Feature variables

The exact implementation and datasets can be found in the task1a folder.

1b - Ridge regression with transformed features

The goal was to transform the original features from the data and then use ridge regression to predict the target variable by finding the optimal weights of the regression model on these transformed features. Specifically, the transformation of the 5 input features of matrix $X$ (where $x_i$ denotes the $i$-th component of $X$) into 21 new features $\phi(X)$ was done as follows:

  1. Linear Features:

    • $\phi_1(X) = x_1$
    • $\phi_2(X) = x_2$
    • $\phi_3(X) = x_3$
    • $\phi_4(X) = x_4$
    • $\phi_5(X) = x_5$
  2. Quadratic Features:

    • $\phi_6(X) = x_1^2$
    • $\phi_7(X) = x_2^2$
    • $\phi_8(X) = x_3^2$
    • $\phi_9(X) = x_4^2$
    • $\phi_{10}(X) = x_5^2$
  3. Exponential Features:

    • $\phi_{11}(X) = \exp(x_1)$
    • $\phi_{12}(X) = \exp(x_2)$
    • $\phi_{13}(X) = \exp(x_3)$
    • $\phi_{14}(X) = \exp(x_4)$
    • $\phi_{15}(X) = \exp(x_5)$
  4. Cosine Features:

    • $\phi_{16}(X) = \cos(x_1)$
    • $\phi_{17}(X) = \cos(x_2)$
    • $\phi_{18}(X) = \cos(x_3)$
    • $\phi_{19}(X) = \cos(x_4)$
    • $\phi_{20}(X) = \cos(x_5)$
  5. Constant Feature:

    • $\phi_{21}(X) = 1$

The ridge regression model was then trained on these transformed features to predict the target variable $y$.

The implementation involved the following steps (sketched in code after the list):

  1. Feature Transformation: The original features were transformed as described above to create a new feature matrix $\Phi$.
  2. Model Training: Ridge regression was applied to the transformed features. The model was trained using the training data provided.
  3. Cross-Validation: Cross-validation was performed to select the regularization parameter lambda: the RMSE for each candidate lambda was averaged over the folds, and the lambda with the lowest average RMSE gave the final model.
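
The feature map and fit can be sketched as follows (the stand-in data and the alpha value are illustrative; the actual training code is in task1b):

```python
import numpy as np
from sklearn.linear_model import Ridge

def transform(X):
    """Map the 5 raw features to the 21 features phi_1..phi_21 defined above."""
    return np.hstack([
        X,                         # phi_1..phi_5:   linear
        X ** 2,                    # phi_6..phi_10:  quadratic
        np.exp(X),                 # phi_11..phi_15: exponential
        np.cos(X),                 # phi_16..phi_20: cosine
        np.ones((X.shape[0], 1)),  # phi_21:         constant
    ])

# Stand-in data shaped like the task's CSV (y plus 5 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(700, 5))
y = X @ np.arange(1.0, 6.0) + 1.0

# phi_21 already supplies the bias term, so the model is fit without an
# intercept; coef_ then contains the 21 weights of the regression model.
phi = transform(X)
weights = Ridge(alpha=1.0, fit_intercept=False).fit(phi, y).coef_
```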

The code for this task can be found in the task1b folder. The training data is provided in a CSV format with the following columns:

Id, y, x1, x2, x3, x4, x5
  • Id: Identifier for each data point
  • y: Target variable
  • x1, x2, ..., x5: Feature variables

Project 2: Prediction of electricity price in Switzerland

This project involved predicting the electricity price in Switzerland using data from other countries. The dataset included electricity prices from various countries across different seasons, with some missing values. The implementation steps, sketched in code after the list, were as follows:

  1. Data Imputation: Missing values in the dataset were imputed using statistical methods to ensure a complete dataset for model training.
  2. Model Training: A Gaussian Process Regressor with a RationalQuadratic kernel was trained on the imputed dataset to predict the electricity prices in Switzerland.
  3. Evaluation: The model was evaluated on the test set, and the predictions were saved in a results file.
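
A minimal sketch of steps 1 and 2 (the stand-in DataFrame, mean imputation, and integer season encoding are assumptions; the actual code in task2 may make different choices and predicts on a separate test file):

```python
import numpy as np
import pandas as pd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic
from sklearn.impute import SimpleImputer

# Stand-in for the real CSV; column names follow the table below.
rng = np.random.default_rng(0)
cols = ["price_AUS", "price_CHF", "price_CZE", "price_GER", "price_ESP",
        "price_FRA", "price_UK", "price_ITA", "price_POL", "price_SVK"]
train = pd.DataFrame(rng.normal(size=(200, len(cols))), columns=cols)
train[train > 1.5] = np.nan                          # sprinkle in missing values
train.insert(0, "season", rng.integers(0, 4, 200))   # season already encoded 0-3

features = [c for c in train.columns if c != "price_CHF"]

# Step 1: impute missing values so the regressor sees a complete matrix.
imputer = SimpleImputer(strategy="mean")
X = imputer.fit_transform(train[features])

# Step 2: fit the Gaussian process with a RationalQuadratic kernel on the
# rows where the Swiss price is observed, then predict where it is missing.
mask = train["price_CHF"].notna().to_numpy()
gpr = GaussianProcessRegressor(kernel=RationalQuadratic(), normalize_y=True)
gpr.fit(X[mask], train.loc[mask, "price_CHF"])
y_pred = gpr.predict(X[~mask])
```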

The code for this project can be found in the task2 folder. The training data is provided in a CSV format with the following columns:

| Column | Description |
| --- | --- |
| season | Season of the year |
| price_AUS | Electricity price in Australia |
| price_CHF | Electricity price in Switzerland (to be predicted in the test set) |
| price_CZE | Electricity price in the Czech Republic |
| price_GER | Electricity price in Germany |
| price_ESP | Electricity price in Spain |
| price_FRA | Electricity price in France |
| price_UK | Electricity price in the United Kingdom |
| price_ITA | Electricity price in Italy |
| price_POL | Electricity price in Poland |
| price_SVK | Electricity price in Slovakia |

Project 3: Classification of food preferences

This project aimed to classify food preferences based on image triplets. The task was to determine which of two images (B or C) is more similar to a reference image (A). The dataset consisted of images of food items, and the training data was provided in the form of triplets, where it was always true that image A was closer to image B than to image C.

Data Format

The data was organized in the following structure:

  • Dataset Folder: /food
  • Training Triplets: Provided in a CSV file with columns A, B, and C, representing the file paths of the images in each triplet.


Goal

The goal was to train a model that could predict, for each triplet in the test set, whether image A is closer to image B than to image C. The predictions were binary, with 1 indicating that A is closer to B, and 0 indicating that A is closer to C.

Implementation

The implementation involved the following steps; a sketch of the feature extraction and classifier follows the list:

  1. Data Preprocessing:

    • Loaded the images from the dataset folder.
    • Resized and normalized the images to ensure consistency in input dimensions.
  2. Feature Extraction:

    • Used a pre-trained convolutional neural network (CNN) to extract embeddings from the images. The embeddings captured the high-level features of the images.
  3. Model Training:

    • Constructed a custom neural network classifier that took the embeddings of the triplets as input.
    • The network was trained to minimize a loss function that encouraged the model to correctly predict the similarity relationship in the training triplets.
  4. Evaluation:

    • Evaluated the model on the test set by predicting the similarity relationship for each triplet.
    • Annotated the test triplets with 1 (true) or 0 (false) based on the model's predictions.
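
The sketch below illustrates the pipeline (the ResNet-50 backbone, embedding size, and layer widths are assumptions for illustration; the actual choices are in task3/main.py):

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Pre-trained CNN used as a frozen feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()   # drop the classifier head -> 2048-d embeddings
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                 # consistent input dimensions
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

class TripletClassifier(nn.Module):
    """Logit > 0 means image A is closer to B than to C (label 1)."""
    def __init__(self, emb_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * emb_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, a, b, c):
        # a, b, c: (batch, emb_dim) embeddings of images A, B and C
        return self.net(torch.cat([a, b, c], dim=1)).squeeze(1)
```

Since every training triplet carries the label 1, a common way to obtain negative examples (an assumption here, not stated in the original description) is to additionally feed each swapped triplet (A, C, B) with label 0 during training.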

The code for this project can be found in the task3 folder. The main implementation details are in the main.py file, which includes the data loading, model definition, training loop, and evaluation logic.

The training data is provided in a CSV format with the following columns:

| Column | Description |
| --- | --- |
| A | Number/file path of the reference image |
| B | Number/file path of the first comparison image |
| C | Number/file path of the second comparison image |

The training data label is assumed to be 1 for all triplets, indicating that image A is closer to image B than to image C.

The test data follows the same format, and the goal is to predict whether image A is closer to image B than to image C for each triplet.


Project 4: Predicting HOMO-LUMO Gaps for Solar Cell Material Discovery

Overview

This project aims to identify materials suitable for solar cell manufacturing by predicting a molecule's HOMO-LUMO gap, a key feature that determines its usefulness in such applications. Directly testing for the HOMO-LUMO gap through computational chemistry simulations is time-consuming and computationally expensive. Instead, this project leverages machine learning to predict the HOMO-LUMO gap based on molecular structure.

The challenge lies in the limited availability of labeled data: the dataset contains only 100 examples of molecules with their corresponding HOMO-LUMO gap. However, a separate pretraining dataset with 50,000 molecule-energy pairs (with LUMO energy values but not the HOMO-LUMO gap) is available. To address this, a transfer learning approach is employed using an autoencoder strategy.


Problem Statement

Predicting the HOMO-LUMO gap of molecules is a computationally intensive task. The gap determines a material's suitability for solar cell applications. Given:

  • Pretraining data: 50,000 molecule-to-LUMO-energy pairs (not labeled with HOMO-LUMO gaps).
  • Training data: Only 100 molecule-to-HOMO-LUMO gap pairs (labeled).
  • Test data: 10,000 molecules for which the HOMO-LUMO gap needs to be predicted.

The challenge lies in utilizing the pretraining dataset to improve predictions on the downstream task of HOMO-LUMO gap prediction.


Approach

To solve this problem, we employed a two-step pipeline combining transfer learning and regression:

1. Pretraining: Feature Extraction

We use a feedforward neural network (MLP) with the following architecture (sketched in code after the training details below):

  • Input layer: Matches the feature size of the input molecules.
  • Hidden layers:
    • Layer 1: 256 neurons, ReLU activation.
    • Layer 2: 64 neurons, ReLU activation (penultimate layer).
  • Output layer: 1 neuron (to predict energy values in the pretraining task).

The penultimate layer (64 features) is used as the reduced feature representation for the downstream task.

Training Details:

  • Loss Function: Mean Squared Error (MSE).
  • Optimizer: Adam with a learning rate of 0.001.
  • Batch Size: 200.
  • Validation: 10% of the pretraining dataset is used for validation.
  • Stopping Criteria: Early stopping based on validation loss.
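
A minimal sketch of this network in PyTorch (the input dimension is a placeholder and the training loop is reduced to its setup; the full details are in task4/main.py):

```python
import torch
import torch.nn as nn

class MLPRegressor(nn.Module):
    """Pretraining network; the 64-d penultimate layer doubles as the feature extractor."""
    def __init__(self, in_dim):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),   # penultimate layer: 64 features
        )
        self.head = nn.Linear(64, 1)         # energy prediction for pretraining

    def forward(self, x):
        return self.head(self.features(x)).squeeze(1)

    def embed(self, x):
        # Reduced 64-d representation used by the downstream task.
        return self.features(x)

model = MLPRegressor(in_dim=1000)  # placeholder for the real feature size
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
# Iterate over batches of 200, hold out 10% for validation, and stop
# early once the validation loss stops improving.
```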

2. Downstream Task: Ridge Regression

Once the features are extracted using the pretrained model, we:

  1. Normalize the features using StandardScaler from sklearn.
  2. Train a Ridge regression model on the training dataset with the extracted 64 features.
  3. Hyperparameter tuning: Ridge regression's alpha parameter is optimized using cross-validation over a predefined range ([0.1, 1.0, 10.0]).

The test dataset is processed using the same pipeline, and predictions are made using the trained Ridge regression model.
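
A minimal sketch of this downstream step (stand-in arrays replace the real 64-d embeddings, and using RidgeCV rather than a manual loop is an assumption):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

# Stand-ins for the embeddings produced by the pretrained network.
rng = np.random.default_rng(0)
X_train_emb, y_train = rng.normal(size=(100, 64)), rng.normal(size=100)
X_test_emb = rng.normal(size=(10_000, 64))

# Step 1: normalize the extracted features.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_emb)
X_test_scaled = scaler.transform(X_test_emb)

# Steps 2-3: cross-validated choice of alpha over the predefined grid.
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0])
ridge.fit(X_train_scaled, y_train)
y_pred = ridge.predict(X_test_scaled)  # written to task4/results.csv in the real code
```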


Results

The final output of the pipeline is saved as a CSV file (task4/results.csv), containing the predicted HOMO-LUMO gaps for the test molecules.


Files and Structure

  • Code:

    • main.py: Contains the complete implementation of the pretraining and regression pipeline.
    • MLPRegressor: Custom class defining the feedforward neural network.
    • pretrain_model: Function for training the neural network on the pretraining dataset.
    • Ridge regression and feature normalization implemented using sklearn.
  • Data:

    • pretrain_features.csv.zip: Features for pretraining.
    • pretrain_labels.csv.zip: Labels (energies) for pretraining.
    • train_features.csv.zip: Features for training.
    • train_labels.csv.zip: Labels (HOMO-LUMO gaps) for training.
    • test_features.csv.zip: Features for testing.
  • Output:

    • results.csv: Predictions for the test molecules.

Key Advantages

  • Transfer Learning: Efficiently utilizes the large pretraining dataset to learn meaningful features, reducing the dependency on a large labeled training dataset.
  • Feature Extraction: The penultimate layer (64 features) provides a compact, meaningful representation of molecular data for downstream tasks.
