An application that uses machine learning techniques to make predictions on certain protein properties, such as solubility.
Explore the docs »
View Demo . Report Bug . Request Feature
- About the Project
- Built With
- Getting Started
- Development Information
- Future Work
- Contributing
- Authors
- Acknowledgements
The food-processing industry uses a variety of (artificial) additives to improve the preservation, taste, texture, and appearance of food products, among other reasons. Over the past decade, various studies have shown that the additives used in processed foods may contain proteins that are potentially harmful for human consumption.
In recent years, there has been an increased interest in using better, safer, and more sustainable sources of food additives, ingredients, and other materials.
Performing this research manually is time- and resource-consuming. To reduce that effort, various institutes have invested in research on predicting protein-based elements and in creating various databases.
Previous research has mainly focused on predicting protein structures (AlphaFold), protein alignments (PSI-BLAST), secondary structures (SPIDER3 and ProteinUnet), protein alignments with respect to locations (HH-suite3) and tertiary protein structure prediction (SPOT-Contact).
In addition, the UniProt Consortium and its host institutes (EMBL-EBI, SIB and PIR) have invested effort and resources in preserving knowledge about proteins through the development of UniProt.
Most of this research does not aim to provide information about behavioral similarities, such as similarity in a specific property. Different approaches have been explored to achieve this, including graph machine learning (GraphSol). Others only studied these properties for E. coli (eSOL).
Unfortunately, eSOL only covers E. coli, while GraphSol relies on state-of-the-art feature techniques such as PSI-BLAST, SPIDER3, and more to find similar proteins, which is an extremely time- and resource-consuming approach, although it seems to perform well.
This application tries to combine the previously performed research with existing datasets to create the building blocks for protein property prediction, using different machine learning models that are ultimately to be combined in an ensemble learning protocol, while providing an easy-to-use interface for non-technical personnel.
The application makes use of various frameworks.
To get a local copy up and running follow these simple steps:
The development of this project is done in Python 3 with the use of PIP (package manager) and venv (virtual environment).
You can choose to install it manually through the terminal or use an IDE such as PyCharm to handle the installation process for you.
Install prerequisites through terminal:
- Install PIP
- Install Venv
- Create the Virtual Environment
- Activate the Virtual Environment
- Install the dependencies through requirements.txt
In PyCharm:
After having gone through the prerequisite steps, you can execute the following steps to get the application to work:
Terminal:
- Run "manage.py migrate app" to create a local database instance (SQLite).
- Run "manage.py runserver" to start the web server.
PyCharm:
- In the top menu, click "Tools" and then "Task manage.py". This will open a terminal.
- Type "migrate app" and press Enter to create a local database instance (SQLite).
- In the top right, click the green play button to start the application.
If there is no green play button, click the dropdown next to it and add a new configuration.
Make sure this configuration is set to "Django Server" with the environment variables set to PYTHONUNBUFFERED=1;DJANGO_SETTINGS_MODULE=EatingInsects.settings; the remaining default settings should work fine.
Once the server is running, go to "http://localhost:8000" in your browser to see the application. Note: the port (8000) may differ depending on which port is in use. The terminal should print the correct link once the application is running.
The application makes use of the following machine learning models:
- Sequential Neural Network
- Random Forests
- Decision Tree
- Linear Regression
Details of these implementations can be found in the research paper.
A dataset example can be found in resources/private/datasets/sibo_dataset_example.csv.
The following columns are required to be present within these datasets:
- yield_um
- yield_ml
- calculated_mw
- calculated_pi
- sequence_length
- sequence_mass
- steric
- polarizability
- volume
- hydrophobicity
- helix
- sheet
The prediction column is also required:
- solubility
Validation errors are shown when a user uploads a dataset that is missing any of these fields.
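The check described above can be illustrated with a minimal sketch. It is not the application's actual validation code: the function name `find_missing_columns` is hypothetical, and it assumes the dataset is a comma-separated file whose first row is the header. Only the column names are taken from the lists above.

```python
import csv
import io

# Column names taken from the lists above.
REQUIRED_FEATURE_COLUMNS = [
    "yield_um", "yield_ml", "calculated_mw", "calculated_pi",
    "sequence_length", "sequence_mass", "steric", "polarizability",
    "volume", "hydrophobicity", "helix", "sheet",
]
REQUIRED_PREDICTION_COLUMNS = ["solubility"]


def find_missing_columns(csv_file):
    """Return the required column names that are absent from the CSV header row.

    Hypothetical helper: assumes a comma-separated file with a header row.
    """
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    required = REQUIRED_FEATURE_COLUMNS + REQUIRED_PREDICTION_COLUMNS
    return [name for name in required if name not in present]


# Usage: a header that lacks the "sheet" and "solubility" columns.
sample = io.StringIO(
    "yield_um,yield_ml,calculated_mw,calculated_pi,sequence_length,"
    "sequence_mass,steric,polarizability,volume,hydrophobicity,helix\n"
)
missing = find_missing_columns(sample)
print(missing)  # ['sheet', 'solubility']
```

A non-empty result would correspond to the validation error shown to the user; how the application reports it is not covered by this sketch.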
New models can be added in a standardized way. This can be done as follows:
- Create a new directory named after your model in `app/networks`, e.g. `app/networks/DT`.
- Create a Python file with the same name inside the newly created directory, e.g. `app/networks/DT/dt.py`.
- Create a new class named after this file and have it extend `app.networks.shared.BaseNetwork.BaseNetwork`.
- Override the `compile` method with your model logic and make sure the model is returned.
An example model looks as follows:
```python
from tensorflow import keras

from app.networks.shared.BaseNetwork import BaseNetwork


class SNN(BaseNetwork):
    def compile(self):
        # Guard against compiling the same network twice.
        if self.model:
            raise Exception("Model has already been compiled")

        # Two hidden layers followed by a single regression output.
        self.model = keras.Sequential([
            keras.layers.Dense(64, activation='relu'),
            keras.layers.Dense(64, activation='relu'),
            keras.layers.Dense(1)
        ])

        self.model.compile(
            loss=keras.metrics.RootMeanSquaredError(name="RMSE"),
            optimizer=keras.optimizers.Adam(float(self.predictor.learning_rate)),
            metrics=[
                keras.metrics.MeanAbsoluteError(name="MAE")
            ]
        )

        return self.model
```
Note: The following methods can be overridden if needed: `train`, `test`, `predict`. An example of this can be seen in the DT model.
Note: When using TensorFlow, the loss and metrics are defined during model compilation and are used for evaluation later on.
Now that the model is present, it needs to be coupled to the system. This can be done as follows:
- In `app.business.predictors.load_network(...)` you must define the newly created model.
- If needed, add the model specification to `app.models.ModelType`.
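A minimal, self-contained sketch of this coupling is shown here. Everything in it is a stand-in so the snippet runs without Django or the project: `ModelType` is rebuilt on the standard library `Enum` (the real class lives in `app.models`), while `Predictor`, `BaseNetwork`, `SNN` and `DT` are simplified placeholders. The `ValueError` fallback is an assumption, not documented behavior.

```python
from dataclasses import dataclass
from enum import Enum


class ModelType(Enum):
    """Stand-in for app.models.ModelType: each member has a value and a label."""

    def __new__(cls, value, label):
        obj = object.__new__(cls)
        obj._value_ = value  # what is compared against predictor.model_type
        obj.label = label    # human-readable name
        return obj

    SNN = "SNN", "Sequential Neural Network"
    DT = "DT", "Decision Tree"


@dataclass
class Predictor:
    # Hypothetical stand-in for the project's predictor object.
    model_type: str
    learning_rate: float = 0.001


class BaseNetwork:
    # Simplified placeholder for app.networks.shared.BaseNetwork.BaseNetwork.
    def __init__(self, predictor):
        self.predictor = predictor
        self.model = None


class SNN(BaseNetwork):
    pass


class DT(BaseNetwork):
    pass


def load_network(predictor):
    """Return the network class instance matching the predictor's model type."""
    if predictor.model_type == ModelType.SNN.value:
        return SNN(predictor)
    if predictor.model_type == ModelType.DT.value:
        return DT(predictor)
    # Assumed fallback; the real function may handle unknown types differently.
    raise ValueError(f"Unknown model type: {predictor.model_type!r}")


network = load_network(Predictor(model_type="DT"))
print(type(network).__name__)  # DT
```

Registering a new model then comes down to two additions: one member in `ModelType` and one branch in `load_network`.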
Example of adding a new model to `load_network`:
```python
if predictor.model_type == ModelType.DT.value:
    return DT(predictor)
```
Example of adding a new type to `ModelType`:
```python
DT = "DT", 'Decision Tree'
```
The rest of the application structure depends more on the implementation and thus won't be fully explained. The main takeaways are the following:
- `app.forms` contains the form specifications that can be seen in the GUI.
- `app.business` contains all the core business logic of the application.
- `app.models` contains the model classes of the application.
- `app.migrations` contains the database migrations.
  - On a new install or when new migrations are added, run `manage.py migrate app`.
  - When changes are made to any of the classes in `app.models` that may change the database structure, new migrations must be generated. This can easily be done by calling `manage.py makemigrations app`. After that, make sure to migrate again (see the previous item).
- `app.networks` contains all the machine learning models and their shared components.
- `app.static` contains files that won't change at runtime, mainly CSS, JS and static images.
- `app.template` contains the HTML code of the GUI.
- `app.views` contains the frontend logic of the GUI.
Django makes use of an MVT structure (Model-View-Template), which is similar to MVC (Model-View-Controller). The View is practically the Controller, while the Template is the View.
Because datasets can be uploaded and models can be created at runtime, a dynamic media pipeline is in place.
- The database can be found in `db.sqlite3` and is a local, file-based SQLite database. Read the Django documentation on databases in case you want to switch to a production-grade database.
- When a user creates a new model and uploads a dataset, the dataset is stored in `resources/public/datasets`.
- When a user trains a new model, the model is stored in `resources/public/networks` once training has completed.
Personal data is not stored in any of the databases and, since the application works with proteins, should not be included in the datasets either.
- Dynamic datasets and protein fields
- Dynamic machine learning model parameter fields
- Calculate PI and MW from amino acid sequence
- Compare model evaluations side-by-side or in visualized graphs
- Synthetic data generation or manual retrieval of more data
- Predict different kinds of protein properties (instead of just solubility)
- Protein Alignments and Similarities based on predicted property values (and maybe amino acid sequence)
- Distinguish between beneficial and harmful proteins for recommendation
- Graph Machine Learning (GraphSol)
- Ensemble Learning Model
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- If you have suggestions for adding or removing projects, feel free to open an issue to discuss it, or directly create a pull request after you edit the README.md file with necessary changes.
- Please make sure you check your spelling and grammar.
- Create an individual PR for each suggestion.
- Please also read through the Code Of Conduct before posting your first idea as well.
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Noah Scharrenberg - BSc DSAI Student & Software Engineer - Project Lead
- Bunyamin Thijssen - BSc DSAI Student - Research & Developer
- Claudia Sanchez - BSc DSAI Student - Research & Developer
- Parand Mohri - BSc DSAI Student - Research & Developer
- Mohammad Fayazi - BSc DSAI Student - Research & Developer
- Laurence Manoukian - BSc DSAI Student - Research & Developer
