Commit
feat: initial release for GT4SD project.
Signed-off-by: Matteo Manica <[email protected]>
Matteo Manica committed Feb 11, 2022
1 parent 9e311bd, commit 8a65ba4
Showing 191 changed files with 22,435 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
cff-version: 1.2.0 | ||
message: "If you use GT4SD, please consider citing as below." | ||
authors: | ||
  - family-names: Team | ||
    given-names: GT4SD | ||
title: "GT4SD (Generative Toolkit for Scientific Discovery)" | ||
version: 0.22.0 | ||
url: "https://github.com/GT4SD/gt4sd-core" | ||
# doi: TBD | ||
date-released: 2022-02-11 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
# Contributing | ||
|
||
<!-- add missing CLA --> | ||
|
||
## Contributing to GT4SD codebase | ||
|
||
If you would like to contribute to the package, we recommend the following development setup. | ||
|
||
1. Create a copy of the [repository](https://github.com/GT4SD/gt4sd-core) via the ‘Fork’ button. | ||
|
||
2. Clone the gt4sd-core repository: | ||
|
||
```sh | ||
git clone git@github.com:${GH_ACCOUNT_OR_ORG}/gt4sd-core.git | ||
``` | ||
|
||
3. Create a dedicated branch: | ||
|
||
```sh | ||
cd gt4sd-core | ||
git checkout -b a-super-nice-feature-we-all-need | ||
``` | ||
|
||
4. Create and activate a dedicated conda environment: | ||
|
||
```sh | ||
conda env create -f conda.yml | ||
conda activate gt4sd | ||
``` | ||
|
||
5. Install `gt4sd` in editable mode: | ||
|
||
```sh | ||
pip install -e . | ||
``` | ||
|
||
6. Implement your changes and, once you are ready, run the tests: | ||
|
||
```sh | ||
python -m pytest -sv | ||
``` | ||
|
||
And the style checks: | ||
|
||
```sh | ||
# blacking and sorting imports | ||
python -m black src/gt4sd | ||
python -m isort src/gt4sd | ||
# checking flake8 and mypy | ||
python -m flake8 --disable-noqa --per-file-ignores="__init__.py:F401" src/gt4sd | ||
python -m mypy src/gt4sd | ||
``` | ||
|
||
7. Once the tests and checks pass and, most importantly, you are happy with the implemented feature, commit your changes: | ||
|
||
```sh | ||
# add the changes | ||
git add . | ||
# commit them | ||
git commit -s -m "feat: implementing super nice feature." -m "A feature we all need." | ||
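# check upstream changes (requires an `upstream` remote, e.g. added once via: git remote add upstream https://github.com/GT4SD/gt4sd-core.git) | ||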
git fetch upstream | ||
git rebase upstream/main | ||
# push changes to your fork | ||
git push -u origin a-super-nice-feature-we-all-need | ||
``` | ||
|
||
8. Open a PR via the "Pull request" button; the maintainers will be happy to review it. | ||
|
||
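The formatting and lint commands from step 6 can also be chained from a single script. The following is a minimal sketch and not part of the repository: the helper names `build_check_commands` and `run_checks` are hypothetical, and it assumes `black`, `isort`, `flake8` and `mypy` are installed in the active environment.

```python
import subprocess
import sys
from typing import List


def build_check_commands(package_dir: str = "src/gt4sd") -> List[List[str]]:
    """Return the style-check invocations from step 6, in the documented order."""
    python = sys.executable
    return [
        [python, "-m", "black", package_dir],
        [python, "-m", "isort", package_dir],
        [
            python,
            "-m",
            "flake8",
            "--disable-noqa",
            # no shell quoting needed: the value is passed as a single argv entry
            "--per-file-ignores=__init__.py:F401",
            package_dir,
        ],
        [python, "-m", "mypy", package_dir],
    ]


def run_checks(package_dir: str = "src/gt4sd") -> int:
    """Run each check in turn; return the first non-zero exit code, or 0."""
    for command in build_check_commands(package_dir):
        completed = subprocess.run(command)
        if completed.returncode != 0:
            return completed.returncode
    return 0
```

Calling `run_checks()` from the repository root stops at the first failing tool, which keeps the output focused on one problem at a time.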
## Contributing to GT4SD documentation | ||
|
||
We recommend the "Python Docstring Generator" extension in VSCode. | ||
|
||
However, types should not be duplicated in the docstrings. | ||
Sphinx picks them up from the [type annotations](https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html#type-annotations). | ||
Unfortunately, a custom template is required to keep the extension from adding any types at all. | ||
|
||
Its settings are: | ||
|
||
```json | ||
"autoDocstring.docstringFormat": "google", | ||
"autoDocstring.startOnNewLine": false, | ||
"autoDocstring.customTemplatePath": "/absolute_path_to/.google_pep484.mustache" | ||
``` | ||
|
||
where the last line points to the custom template file (e.g. in your user home) | ||
with the following content (only the placeholders for types are removed): | ||
|
||
```tpl | ||
{{! Google Docstring Template }} | ||
{{summaryPlaceholder}} | ||
{{extendedSummaryPlaceholder}} | ||
{{#parametersExist}} | ||
Args: | ||
{{#args}} | ||
{{var}}: {{descriptionPlaceholder}} | ||
{{/args}} | ||
{{#kwargs}} | ||
{{var}}: {{descriptionPlaceholder}}. Defaults to {{&default}}. | ||
{{/kwargs}} | ||
{{/parametersExist}} | ||
{{#exceptionsExist}} | ||
Raises: | ||
{{#exceptions}} | ||
{{type}}: {{descriptionPlaceholder}} | ||
{{/exceptions}} | ||
{{/exceptionsExist}} | ||
{{#returnsExist}} | ||
Returns: | ||
{{#returns}} | ||
{{descriptionPlaceholder}} | ||
{{/returns}} | ||
{{/returnsExist}} | ||
{{#yieldsExist}} | ||
Yields: | ||
{{#yields}} | ||
{{descriptionPlaceholder}} | ||
{{/yields}} | ||
{{/yieldsExist}} | ||
``` |
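To illustrate the intended result, here is a sketch of a function documented in the style the template produces: the types live only in the annotations, and the docstring sections describe the values without repeating them. The function `chunk` is a hypothetical example and does not come from the GT4SD codebase.

```python
from typing import Iterator, List


def chunk(items: List[str], size: int = 2) -> Iterator[List[str]]:
    """Split a list into consecutive chunks.

    Args:
        items: values to split.
        size: maximum number of values per chunk. Defaults to 2.

    Raises:
        ValueError: if size is not positive.

    Yields:
        the next chunk, in the original order.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]
```

Because the summary starts on the same line as the opening quotes, the layout matches the `"autoDocstring.startOnNewLine": false` setting above.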
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,164 @@ | ||
# gt4sd-core | ||
GT4SD (Generative Toolkit for Scientific Discovery) an open-source platform to accelerate hypothesis generation in the scientific discovery process. | ||
# GT4SD (Generative Toolkit for Scientific Discovery) | ||
|
||
<!-- commented badges to be re-enabled once the functionalities are active --> | ||
<!-- [](https://badge.fury.io/py/gt4sd) --> | ||
<!-- [](https://github.com/gt4sd/gt4sd-core/actions) --> | ||
[](https://opensource.org/licenses/MIT) | ||
[](https://github.com/psf/black) | ||
<!-- [](https://pepy.tech/project/gt4sd) --> | ||
<!-- [](https://pepy.tech/project/gt4sd) --> | ||
[]() | ||
<!-- [](https://pages.github.com/GT4SD/gt4sd-core/)) --> | ||
|
||
<img src="./docs/_static/gt4sd_logo.png" alt="logo" width="500"/> | ||
|
||
The GT4SD (Generative Toolkit for Scientific Discovery) is an open-source platform to accelerate hypothesis generation in the scientific discovery process. It provides a library for making state-of-the-art generative AI models easier to use. | ||
|
||
<!-- enable once docs are there --> | ||
<!-- For full details on the library API and examples see the [docs](https://pages.github.com/GT4SD/gt4sd-core/). --> | ||
|
||
## Installation | ||
|
||
### pip | ||
|
||
<!-- uncomment once the package is there --> | ||
<!-- If you simply want to use `gt4sd` in your projects, install it via `pip` from [PyPI](https://pypi.org/project/gt4sd/): | ||
```sh | ||
pip install gt4sd | ||
``` --> | ||
|
||
You can install `gt4sd` directly from GitHub: | ||
|
||
```sh | ||
pip install git+https://github.com/GT4SD/gt4sd-core | ||
``` | ||
|
||
### Development setup & installation | ||
|
||
If you would like to contribute to the package, we recommend the following development setup. | ||
Clone the gt4sd-core repository, create the conda environment and install the package in editable mode: | ||
|
||
```sh | ||
git clone git@github.com:GT4SD/gt4sd-core.git | ||
cd gt4sd-core | ||
conda env create -f conda.yml | ||
conda activate gt4sd | ||
pip install -e . | ||
``` | ||
|
||
Learn more in [CONTRIBUTING.md](./CONTRIBUTING.md) | ||
|
||
## Supported packages | ||
|
||
Beyond implementing various inference and training pipelines for generative modeling, GT4SD is designed to provide a high-level API that implements a harmonized interface for several existing packages: | ||
|
||
- [GuacaMol](https://github.com/BenevolentAI/guacamol): inference pipelines for the baseline models. | ||
- [MOSES](https://github.com/molecularsets/moses): inference pipelines for the baseline models. | ||
- [TAPE](https://github.com/songlab-cal/tape): encoder modules compatible with the protein language models. | ||
- [PaccMann](https://github.com/PaccMann/): inference pipelines for all algorithms of the PaccMann family as well as training pipelines for the generative VAEs. | ||
- [transformers](https://huggingface.co/transformers): training and inference pipelines for generative models from the [HuggingFace Models](https://huggingface.co/models). | ||
|
||
## Using GT4SD | ||
|
||
### Running inference pipelines | ||
|
||
Running an algorithm is as easy as typing: | ||
|
||
```python | ||
from gt4sd.algorithms.conditional_generation.paccmann_rl.core import ( | ||
    PaccMannRLProteinBasedGenerator, PaccMannRL | ||
) | ||
target = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT' | ||
# algorithm configuration with default parameters | ||
configuration = PaccMannRLProteinBasedGenerator() | ||
# instantiate the algorithm for sampling | ||
algorithm = PaccMannRL(configuration=configuration, target=target) | ||
items = list(algorithm.sample(10)) | ||
print(items) | ||
``` | ||
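The call `list(algorithm.sample(10))` above suggests that `sample` hands back an iterable that is consumed item by item. The toy class below sketches that calling pattern only; `ToySampler`, its alphabet and its seed are assumptions made for illustration and say nothing about how GT4SD algorithms generate items internally.

```python
import random
from typing import Iterator


class ToySampler:
    """Stand-in for an algorithm object: yields random strings lazily."""

    def __init__(self, alphabet: str = "CNO", length: int = 5, seed: int = 0) -> None:
        self.alphabet = alphabet
        self.length = length
        self._random = random.Random(seed)

    def sample(self, number_of_items: int) -> Iterator[str]:
        """Yield the requested number of items, one at a time."""
        for _ in range(number_of_items):
            yield "".join(
                self._random.choice(self.alphabet) for _ in range(self.length)
            )


# wrapping the generator in list() materializes all items at once
items = list(ToySampler().sample(10))
print(items)
```

With a lazy `sample`, a caller can also iterate directly and stop early instead of building the full list.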
|
||
Or you can use the `ApplicationsRegistry` to run an algorithm instance using a | ||
serialized representation of the algorithm: | ||
|
||
```python | ||
from gt4sd.algorithms.registry import ApplicationsRegistry | ||
target = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT' | ||
algorithm = ApplicationsRegistry.get_application_instance( | ||
    target=target, | ||
    algorithm_type='conditional_generation', | ||
    domain='materials', | ||
    algorithm_name='PaccMannRL', | ||
    algorithm_application='PaccMannRLProteinBasedGenerator', | ||
    generated_length=32, | ||
    # include additional configuration parameters as **kwargs | ||
) | ||
items = list(algorithm.sample(10)) | ||
print(items) | ||
``` | ||
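The idea of resolving an algorithm from a serialized description (type, domain, name and application, plus keyword parameters) can be sketched with a small lookup table. This is a toy illustration under stated assumptions, not the GT4SD implementation: `ToyRegistry`, `ToyAlgorithm` and `ToyGenerator` are hypothetical names, and the real registry may work differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple, Type

Key = Tuple[str, str, str, str]


class ToyRegistry:
    """Maps (type, domain, algorithm name, application name) to a configuration class."""

    _configurations: Dict[Key, Type[Any]] = {}

    @classmethod
    def register(
        cls, algorithm_type: str, domain: str, algorithm_name: str
    ) -> Callable[[Type[Any]], Type[Any]]:
        def decorator(configuration_class: Type[Any]) -> Type[Any]:
            # the application name is taken from the class name
            key = (algorithm_type, domain, algorithm_name, configuration_class.__name__)
            cls._configurations[key] = configuration_class
            return configuration_class

        return decorator

    @classmethod
    def get_configuration(
        cls,
        algorithm_type: str,
        domain: str,
        algorithm_name: str,
        algorithm_application: str,
        **kwargs: Any,
    ) -> Any:
        key = (algorithm_type, domain, algorithm_name, algorithm_application)
        if key not in cls._configurations:
            raise ValueError(f"no application registered for {key}")
        # remaining keyword arguments become configuration parameters
        return cls._configurations[key](**kwargs)


@ToyRegistry.register("conditional_generation", "materials", "ToyAlgorithm")
@dataclass
class ToyGenerator:
    generated_length: int = 100
```

The benefit of this design is that a caller only needs plain strings and numbers, so a request can be stored as JSON or sent over a network and resolved later.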
|
||
### Running training pipelines via the CLI command | ||
|
||
GT4SD provides a trainer client based on the `gt4sd-trainer` CLI command. The trainer currently supports training pipelines for language modeling (`language-modeling-trainer`), PaccMann (`paccmann-vae-trainer`) and Granular (`granular-trainer`, multimodal compositional autoencoders). | ||
|
||
```console | ||
$ gt4sd-trainer --help | ||
usage: gt4sd-trainer [-h] --training_pipeline_name TRAINING_PIPELINE_NAME | ||
[--configuration_file CONFIGURATION_FILE] | ||
|
||
optional arguments: | ||
-h, --help show this help message and exit | ||
--training_pipeline_name TRAINING_PIPELINE_NAME | ||
Training type of the converted model, supported types: | ||
granular-trainer, language-modeling-trainer, paccmann- | ||
vae-trainer. (default: None) | ||
--configuration_file CONFIGURATION_FILE | ||
Configuration file for the trainining. It can be used | ||
to completely by-pass pipeline specific arguments. | ||
(default: None) | ||
``` | ||
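The help output lists only two general options, while pipeline-specific arguments are given on the same command line. One way such a two-stage interface could be built with `argparse` is sketched below. It is an assumption for illustration, not the actual `gt4sd-trainer` code; the function `parse_trainer_arguments` and the program name `toy-trainer` are hypothetical.

```python
import argparse
from typing import List, Optional, Tuple


def parse_trainer_arguments(
    argv: Optional[List[str]] = None,
) -> Tuple[argparse.Namespace, List[str]]:
    """Parse the general options and return the unparsed remainder separately."""
    parser = argparse.ArgumentParser(
        prog="toy-trainer",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        allow_abbrev=False,  # avoid matching pipeline arguments by prefix
    )
    parser.add_argument(
        "--training_pipeline_name",
        required=True,
        help="Name of the training pipeline to run.",
    )
    parser.add_argument(
        "--configuration_file",
        default=None,
        help="Optional file holding the training parameters.",
    )
    # parse_known_args leaves unknown arguments untouched for a second parser
    return parser.parse_known_args(argv)


known, remaining = parse_trainer_arguments(
    ["--training_pipeline_name", "language-modeling-trainer", "--type", "mlm"]
)
print(known.training_pipeline_name, remaining)
```

The remainder can then be handed to a parser that knows the arguments of the selected pipeline.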
|
||
To launch a training you have two options. | ||
|
||
You can either specify the training pipeline and the path of a configuration file that contains the needed training parameters: | ||
|
||
```sh | ||
gt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --configuration_file ${CONFIGURATION_FILE} | ||
``` | ||
|
||
Or you can provide the needed parameters directly as arguments: | ||
|
||
```sh | ||
gt4sd-trainer --training_pipeline_name language-modeling-trainer --type mlm --model_name_or_path mlm --training_file /path/to/train_file.jsonl --validation_file /path/to/valid_file.jsonl | ||
``` | ||
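When trainings are launched from a script, the two invocation styles above can be assembled programmatically. The helper below is a minimal sketch: `build_trainer_command` is a hypothetical name, and only flags shown in this README are used in the example call.

```python
import shlex
from typing import Dict, List, Optional


def build_trainer_command(
    training_pipeline_name: str,
    configuration_file: Optional[str] = None,
    arguments: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble the argv list for a gt4sd-trainer call without running it."""
    command = ["gt4sd-trainer", "--training_pipeline_name", training_pipeline_name]
    if configuration_file is not None:
        command += ["--configuration_file", configuration_file]
    for name, value in (arguments or {}).items():
        command += [f"--{name}", str(value)]
    return command


command = build_trainer_command(
    "language-modeling-trainer",
    arguments={
        "type": "mlm",
        "model_name_or_path": "mlm",
        "training_file": "/path/to/train_file.jsonl",
        "validation_file": "/path/to/valid_file.jsonl",
    },
)
# print a shell-safe version of the command for inspection
print(" ".join(shlex.quote(part) for part in command))
```

Passing the resulting list to `subprocess.run(command, check=True)` would start the training, provided `gt4sd-trainer` is on the `PATH`.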
|
||
To get more information on the arguments of a specific training pipeline, simply type: | ||
|
||
```sh | ||
gt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --help | ||
``` | ||
|
||
<!-- Adding examples and notebooks is a must here --> | ||
|
||
<!-- Having a list of all supported algorithms would be nice! --> | ||
|
||
## References | ||
|
||
If you use `gt4sd` in your projects, please consider citing the following: | ||
|
||
```bib | ||
@misc{GT4SD, | ||
author = {GT4SD Team}, | ||
title = {GT4SD (Generative Toolkit for Scientific Discovery)}, | ||
year = {2022}, | ||
publisher = {GitHub}, | ||
journal = {GitHub repository}, | ||
howpublished = {\url{https://github.com/GT4SD/gt4sd-core}}, | ||
commit = {main} | ||
} | ||
``` | ||
|
||
## License | ||
|
||
The `gt4sd` codebase is under the MIT license. | ||
For individual model usage, please refer to the model licenses found in the original packages. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
name: gt4sd | ||
dependencies: | ||
  - python>=3.7,<3.8 | ||
  - pip>=19.1,<20.3 | ||
  - pip: | ||
    - -r requirements.txt | ||
    # development | ||
    - -r dev_requirements.txt |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
flake8==3.8.4 | ||
mypy==0.800 | ||
pytest==6.1.1 | ||
pytest-cov==2.10.1 | ||
black==20.8b1 | ||
isort==5.7.0 | ||
sphinx==3.4.3 | ||
sphinx-autodoc-typehints==1.11.1 | ||
better-apidoc==0.3.1 | ||
sphinx_rtd_theme==0.5.1 | ||
myst-parser==0.13.3 | ||
flask==1.1.2 | ||
flask_login==0.5.0 | ||
docutils==0.17.1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
# Minimal makefile for Sphinx documentation | ||
# | ||
|
||
# You can set these variables from the command line, and also | ||
# from the environment for the first two. | ||
SPHINXOPTS ?= | ||
SPHINXBUILD ?= sphinx-build | ||
SOURCEDIR = . | ||
BUILDDIR = _build | ||
|
||
# Put it first so that "make" without argument is like "make help". | ||
help: | ||
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) | ||
|
||
.PHONY: help Makefile | ||
|
||
clean: | ||
	@-rm -rf $(BUILDDIR)/* | ||
	@-rm -rf api/*.rst | ||
|
||
# Catch-all target: route all unknown targets to Sphinx using the new | ||
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). | ||
%: Makefile | ||
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) |