feat: initial release for GT4SD project.
Signed-off-by: Matteo Manica <[email protected]>
Matteo Manica committed Feb 11, 2022
1 parent 9e311bd commit 8a65ba4
Showing 191 changed files with 22,435 additions and 2 deletions.
12 changes: 12 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -70,6 +70,7 @@ instance/

# Sphinx documentation
docs/_build/
docs/api/*

# PyBuilder
target/
@@ -127,3 +128,14 @@ dmypy.json

# Pyre type checker
.pyre/

# Visual Studio Code settings
.vscode/

# PyCharm settings
.idea/

# custom
logs
test
.DS_Store
10 changes: 10 additions & 0 deletions CITATION.cff
@@ -0,0 +1,10 @@
cff-version: 1.2.0
message: "If you use GT4SD, please consider citing as below."
authors:
  - family-names: Team
    given-names: GT4SD
title: "GT4SD (Generative Toolkit for Scientific Discovery)"
version: 0.22.0
url: "https://github.com/GT4SD/gt4sd-core"
# doi: TBD
date-released: 2022-02-11
125 changes: 125 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,125 @@
# Contributing

<!-- add missing CLA -->

## Contributing to GT4SD codebase

If you would like to contribute to the package, we recommend the following development setup.

1. Create a copy of the [repository](https://github.com/GT4SD/gt4sd-core) via the ‘Fork’ button.

2. Clone the gt4sd-core repository:

```sh
git clone git@github.com:${GH_ACCOUNT_OR_ORG}/gt4sd-core.git
```

3. Create a dedicated branch:

```sh
cd gt4sd-core
git checkout -b a-super-nice-feature-we-all-need
```

4. Create and activate a dedicated conda environment:

```sh
conda env create -f conda.yml
conda activate gt4sd
```

5. Install `gt4sd` in editable mode:

```sh
pip install -e .
```

6. Implement your changes and once you are ready run the tests:

```sh
python -m pytest -sv
```

And the style checks:

```sh
# blacking and sorting imports
python -m black src/gt4sd
python -m isort src/gt4sd
# checking flake8 and mypy
python -m flake8 --disable-noqa --per-file-ignores="__init__.py:F401" src/gt4sd
python -m mypy src/gt4sd
```

7. Once the tests and checks pass and, most importantly, you are happy with the implemented feature, commit your changes.

```sh
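# add the changes (stage all modified and new files)
git add -A
# commit them
git commit -s -m "feat: implementing super nice feature." -m "A feature we all need."
# check upstream changes (assumes a remote named "upstream" pointing to GT4SD/gt4sd-core)
git fetch upstream
git rebase upstream/main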
# push changes to your fork
git push -u origin a-super-nice-feature-we-all-need
```

8. Open a PR via the "Pull request" button. The maintainers will be happy to review it.

## Contributing to GT4SD documentation

We recommend the "Python Docstring Generator" extension in VSCode.

However, types should not be duplicated in the docstrings:
the Sphinx documentation picks them up from [type annotations](https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html#type-annotations).
Unfortunately, the extension requires a custom template to omit types entirely.

Its settings are:

```json
"autoDocstring.docstringFormat": "google",
"autoDocstring.startOnNewLine": false,
"autoDocstring.customTemplatePath": "/absolute_path_to/.google_pep484.mustache"
```

where the last line points to the custom template file (e.g. in your user home)
with the following content (only the placeholders for types are removed):

```tpl
{{! Google Docstring Template }}
{{summaryPlaceholder}}
{{extendedSummaryPlaceholder}}
{{#parametersExist}}
Args:
{{#args}}
    {{var}}: {{descriptionPlaceholder}}
{{/args}}
{{#kwargs}}
    {{var}}: {{descriptionPlaceholder}}. Defaults to {{&default}}.
{{/kwargs}}
{{/parametersExist}}
{{#exceptionsExist}}
Raises:
{{#exceptions}}
    {{type}}: {{descriptionPlaceholder}}
{{/exceptions}}
{{/exceptionsExist}}
{{#returnsExist}}
Returns:
{{#returns}}
    {{descriptionPlaceholder}}
{{/returns}}
{{/returnsExist}}
{{#yieldsExist}}
Yields:
{{#yields}}
    {{descriptionPlaceholder}}
{{/yields}}
{{/yieldsExist}}
```
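
For illustration, a function documented in this style might look like the sketch below. The function `scale_values` and its parameters are hypothetical and not part of `gt4sd`. The point is only that types appear in the signature, while the docstring sections carry descriptions without types.

```python
from typing import List


def scale_values(values: List[float], factor: float = 2.0) -> List[float]:
    """Scale a list of values by a constant factor.

    Args:
        values: numbers to scale.
        factor: multiplier applied to each value. Defaults to 2.0.

    Raises:
        ValueError: if the list of values is empty.

    Returns:
        the scaled values, in the original order.
    """
    if not values:
        raise ValueError("values must not be empty")
    return [value * factor for value in values]


print(scale_values([1.0, 2.5]))
```

With type annotations in the signature, Sphinx can render the parameter and return types without repeating them in the docstring text.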
166 changes: 164 additions & 2 deletions README.md
@@ -1,2 +1,164 @@
# gt4sd-core
GT4SD (Generative Toolkit for Scientific Discovery) an open-source platform to accelerate hypothesis generation in the scientific discovery process.
# GT4SD (Generative Toolkit for Scientific Discovery)

<!-- commented badges to be re-enabled once the functionalities are active -->
<!-- [![PyPI version](https://badge.fury.io/py/gt4sd.svg)](https://badge.fury.io/py/gt4sd) -->
<!-- [![build](https://github.com/gt4sd/gt4sd-core/workflows/build/badge.svg)](https://github.com/gt4sd/gt4sd-core/actions) -->
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
<!-- [![Downloads](https://pepy.tech/badge/gt4sd)](https://pepy.tech/project/gt4sd) -->
<!-- [![Downloads](https://pepy.tech/badge/gt4sd/month)](https://pepy.tech/project/gt4sd) -->
[![Contributions](https://img.shields.io/badge/contributions-welcome-blue)]()
<!-- [![Docs](https://img.shields.io/badge/website-live-brightgreen)](https://pages.github.com/GT4SD/gt4sd-core/) -->

<img src="./docs/_static/gt4sd_logo.png" alt="logo" width="500"/>

The GT4SD (Generative Toolkit for Scientific Discovery) is an open-source platform to accelerate hypothesis generation in the scientific discovery process. It provides a library for making state-of-the-art generative AI models easier to use.

<!-- enable once docs are there -->
<!-- For full details on the library API and examples see the [docs](https://pages.github.com/GT4SD/gt4sd-core/). -->

## Installation

### pip

<!-- uncomment once the package is there -->
<!-- If you simply want to use `gt4sd` in your projects, install it via `pip` from [PyPI](https://pypi.org/project/gt4sd/):
```sh
pip install gt4sd
``` -->

You can install `gt4sd` directly from GitHub:

```sh
pip install git+https://github.com/GT4SD/gt4sd-core
```

### Development setup & installation

If you would like to contribute to the package, we recommend the following development setup.
Clone the gt4sd-core repository, create the conda environment, and install the package in editable mode:

```sh
git clone git@github.com:GT4SD/gt4sd-core.git
cd gt4sd-core
conda env create -f conda.yml
conda activate gt4sd
pip install -e .
```

Learn more in [CONTRIBUTING.md](./CONTRIBUTING.md).

## Supported packages

Beyond implementing various generative modeling inference and training pipelines, GT4SD is designed to provide a high-level API that implements a harmonized interface for several existing packages:

- [GuacaMol](https://github.com/BenevolentAI/guacamol): inference pipelines for the baseline models.
- [MOSES](https://github.com/molecularsets/moses): inference pipelines for the baseline models.
- [TAPE](https://github.com/songlab-cal/tape): encoder modules compatible with the protein language models.
- [PaccMann](https://github.com/PaccMann/): inference pipelines for all algorithms of the PaccMann family as well as training pipelines for the generative VAEs.
- [transformers](https://huggingface.co/transformers): training and inference pipelines for generative models from the [HuggingFace Models](https://huggingface.co/models).

## Using GT4SD

### Running inference pipelines

Running an algorithm is as easy as typing:

```python
from gt4sd.algorithms.conditional_generation.paccmann_rl.core import (
    PaccMannRLProteinBasedGenerator, PaccMannRL
)
target = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'
# algorithm configuration with default parameters
configuration = PaccMannRLProteinBasedGenerator()
# instantiate the algorithm for sampling
algorithm = PaccMannRL(configuration=configuration, target=target)
items = list(algorithm.sample(10))
print(items)
```

Or you can use the `ApplicationsRegistry` to run an algorithm instance using a
serialized representation of the algorithm:

```python
from gt4sd.algorithms.registry import ApplicationsRegistry
target = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'
algorithm = ApplicationsRegistry.get_application_instance(
    target=target,
    algorithm_type='conditional_generation',
    domain='materials',
    algorithm_name='PaccMannRL',
    algorithm_application='PaccMannRLProteinBasedGenerator',
    generated_length=32,
    # include additional configuration parameters as **kwargs
)
items = list(algorithm.sample(10))
print(items)
```
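
The idea behind a registry lookup like the one above can be sketched in plain Python. This is a hypothetical illustration of the pattern, not the `ApplicationsRegistry` implementation: `MiniRegistry`, `ToyGenerator`, and their methods are invented names, and the real class may work differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, Tuple

# Key identifying an application:
# (algorithm_type, domain, algorithm_name, algorithm_application).
Key = Tuple[str, str, str, str]


@dataclass
class ToyGenerator:
    """Hypothetical stand-in for a configured algorithm."""

    target: str
    generated_length: int = 8

    def sample(self, number_of_items: int) -> Iterator[str]:
        # Yield deterministic placeholder items instead of running a model.
        for index in range(number_of_items):
            yield f"{self.target[: self.generated_length]}-{index}"


class MiniRegistry:
    """Hypothetical registry mapping a serialized description to a factory."""

    _factories: Dict[Key, Callable[..., Any]] = {}

    @classmethod
    def register(cls, key: Key, factory: Callable[..., Any]) -> None:
        cls._factories[key] = factory

    @classmethod
    def get_instance(
        cls,
        target: str,
        algorithm_type: str,
        domain: str,
        algorithm_name: str,
        algorithm_application: str,
        **kwargs: Any,
    ) -> Any:
        key = (algorithm_type, domain, algorithm_name, algorithm_application)
        if key not in cls._factories:
            raise KeyError(f"no application registered for {key}")
        # Extra keyword arguments are forwarded as configuration parameters.
        return cls._factories[key](target=target, **kwargs)


MiniRegistry.register(
    ("conditional_generation", "materials", "Toy", "ToyGenerator"), ToyGenerator
)
algorithm = MiniRegistry.get_instance(
    target="MVLSPADK",
    algorithm_type="conditional_generation",
    domain="materials",
    algorithm_name="Toy",
    algorithm_application="ToyGenerator",
    generated_length=4,
)
print(list(algorithm.sample(2)))
```

Keying the lookup on plain strings is what makes the description serializable: the same dictionary of arguments could be stored as JSON and replayed later.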

### Running training pipelines via the CLI command

GT4SD provides a trainer client based on the `gt4sd-trainer` CLI command. The trainer currently supports training pipelines for language modeling (`language-modeling-trainer`), PaccMann (`paccmann-vae-trainer`) and Granular (`granular-trainer`, multimodal compositional autoencoders).

```console
$ gt4sd-trainer --help
usage: gt4sd-trainer [-h] --training_pipeline_name TRAINING_PIPELINE_NAME
                     [--configuration_file CONFIGURATION_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --training_pipeline_name TRAINING_PIPELINE_NAME
                        Training type of the converted model, supported types:
                        granular-trainer, language-modeling-trainer, paccmann-
                        vae-trainer. (default: None)
  --configuration_file CONFIGURATION_FILE
                        Configuration file for the trainining. It can be used
                        to completely by-pass pipeline specific arguments.
                        (default: None)
```

To launch a training you have two options.

You can either specify the training pipeline and the path of a configuration file that contains the needed training parameters:

```sh
gt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --configuration_file ${CONFIGURATION_FILE}
```

Or you can provide the needed parameters directly as arguments:

```sh
gt4sd-trainer --training_pipeline_name language-modeling-trainer --type mlm --model_name_or_path mlm --training_file /path/to/train_file.jsonl --validation_file /path/to/valid_file.jsonl
```

To get more information on the arguments of a specific training pipeline, simply type:

```sh
gt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --help
```
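
One way a CLI can accept either a configuration file or pipeline-specific flags is two-stage parsing with `argparse`. The sketch below is an assumption about the general pattern, not the `gt4sd-trainer` source: `parse_trainer_arguments`, the `toy-trainer` program name, and the JSON configuration format are all hypothetical.

```python
import argparse
import json
from typing import Any, Dict, List, Tuple


def parse_trainer_arguments(argv: List[str]) -> Tuple[str, Dict[str, Any]]:
    """Return the pipeline name and its parameters (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="toy-trainer", add_help=False, allow_abbrev=False
    )
    parser.add_argument("--training_pipeline_name", required=True)
    parser.add_argument("--configuration_file", default=None)
    # Stage one: read the shared options and keep the rest untouched.
    known, remaining = parser.parse_known_args(argv)

    if known.configuration_file is not None:
        # A configuration file bypasses pipeline-specific flags entirely.
        with open(known.configuration_file) as handle:
            return known.training_pipeline_name, json.load(handle)

    # Stage two: treat the leftovers as "--name value" pairs.
    if len(remaining) % 2 != 0:
        raise ValueError(f"expected --name value pairs, got {remaining}")
    parameters = {
        name.lstrip("-"): value
        for name, value in zip(remaining[::2], remaining[1::2])
    }
    return known.training_pipeline_name, parameters


name, parameters = parse_trainer_arguments(
    ["--training_pipeline_name", "language-modeling-trainer", "--type", "mlm"]
)
print(name, parameters)
```

In this sketch the leftover values stay as strings. A real trainer would presumably validate and convert them against the arguments that the selected pipeline declares.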

<!-- Adding examples and notebooks is a must here -->

<!-- Having a list of all supported algorithms would be nice! -->

## References

If you use `gt4sd` in your projects, please consider citing the following:

```bib
@misc{GT4SD,
author = {GT4SD Team},
title = {GT4SD (Generative Toolkit for Scientific Discovery)},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/GT4SD/gt4sd-core}},
commit = {main}
}
```

## License

The `gt4sd` codebase is under MIT license.
For individual model usage, please refer to the model licenses found in the original packages.
8 changes: 8 additions & 0 deletions conda.yml
@@ -0,0 +1,8 @@
name: gt4sd
dependencies:
  - python>=3.7,<3.8
  - pip>=19.1,<20.3
  - pip:
    - -r requirements.txt
    # development
    - -r dev_requirements.txt
14 changes: 14 additions & 0 deletions dev_requirements.txt
@@ -0,0 +1,14 @@
flake8==3.8.4
mypy==0.800
pytest==6.1.1
pytest-cov==2.10.1
black==20.8b1
isort==5.7.0
sphinx==3.4.3
sphinx-autodoc-typehints==1.11.1
better-apidoc==0.3.1
sphinx_rtd_theme==0.5.1
myst-parser==0.13.3
flask==1.1.2
flask_login==0.5.0
docutils==0.17.1
24 changes: 24 additions & 0 deletions docs/Makefile
@@ -0,0 +1,24 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

clean:
	@-rm -rf $(BUILDDIR)/*
	@-rm -rf api/*.rst

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
Binary file added docs/_static/gt4sd_logo.png