GSoC 2025 Project Ideas
IMPORTANT: THIS PAGE IS ONLY A DRAFT. Given that there is an application process for organizations — such as MDAnalysis — to participate in the GSoC program, there is no guarantee that MDAnalysis will be selected by Google as a participating GSoC organization. Once we know more, we will share updates on our public communication channels.
Hello, and welcome to MDAnalysis!
Please see our Google Summer of Code wiki page for general information, including advice on application writing, and our GSoC FAQ for commonly asked questions.
If you just found out about the MDAnalysis Python package from the GSoC website, you can watch the MDAnalysis 2021 Trailer [YouTube] to get an overview of the scope of the MDAnalysis package.
MDAnalysis is a Python library for the analysis of computer simulations of many-body systems at the molecular scale, spanning use cases from interactions of drugs with proteins to novel materials. For Google Summer of Code, we are also collaborating with other organizations and software projects that use MDAnalysis. Our GSoC projects generally require basic knowledge of and hands-on experience in specific areas, so please check the descriptions of our suggested projects carefully to see the associated desirable skills. Broadly speaking, we have found that applicants with experience in molecular dynamics (MD) simulations and the associated analyses -- or equivalent experience in simulations and modeling of molecular systems (physics, biophysics, chemistry, or materials) -- are very successful.
If you are interested in taking part, please get in touch on the GSoC with MDAnalysis Discussion Forum. Given the GSoC program structure (small, medium, and large projects), letting us know of your intentions to apply and getting acquainted with the project early will be very helpful.
MDAnalysis welcomes new mentors; please get in touch on the developer forum if you are interested in taking part. We typically expect mentors to be familiar with our development process, as evidenced by contributions to the code base and interactions on the developer forum.
See below for a list of project ideas for Google Summer of Code 2025.
The currently proposed projects are:
- Integrating MDAnalysis streaming analysis within WESTPA propagators
- Dashboard for tracking WESTPA simulation progress
- Lazy trajectory analysis with Dask and a Lazy Timeseries API
- Better interfacing of Blender and MDAnalysis
- HBond interactions from implicit hydrogens
- Continuous (i.e., non-binary) Interaction Fingerprints (IFPs)
- Improving ProLIF's 2D interaction visualizations
- Benchmarking and performance optimization
- Integrating OpenFold’s structural prediction confidence metrics into the topology system
Or work on your own idea! Get in contact with us to propose an idea and we will work with you to flesh it out into a full project. Contact us via the GSoC with MDAnalysis Discussion Forum or, if your project is a specific feature you'd want to add, raise an issue in the Issue Tracker.
Look at the list of all available mentors for MDAnalysis for potential mentors for your project. Please send all communications to the discussion forum (and don't contact mentors privately). You can certainly ask for the opinion of a specific mentor if you know that their expertise is particularly suitable for your project.
WESTPA (The Weighted Ensemble Simulation Toolkit with Parallelization and Analysis) is a high-performance Python framework for applying the weighted ensemble (WE) path sampling strategy, which enables simulations of processes that are orders of magnitude longer than the simulations themselves. A WE simulation involves an iterative process: many MD simulations are executed in parallel and periodically evaluated to be replicated or terminated based on a set of WE resampling criteria. To rigorously apply WE resampling, the MD simulations are analyzed during run time with tools such as MDAnalysis to determine the state of the simulated system.
Read the WE Overview; install the WESTPA package; and work through tutorials 7.1 and 7.5 of our tutorials suite to start learning more about WESTPA.
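To make the replicate/terminate step described above more concrete, here is a minimal sketch of split-and-merge resampling within a single bin. It is a simplified illustration and not WESTPA's actual resampler: the function name `resample_bin`, the `(state, weight)` tuple representation, and the choice of which walkers to split or merge are all assumptions made for this example. The key property it demonstrates is that the total statistical weight is conserved.

```python
import random


def resample_bin(walkers, target_count, rng=random):
    """Split or merge (state, weight) walkers in one bin until there are
    exactly target_count of them, conserving the total weight.

    Hypothetical helper for illustration only; not part of WESTPA.
    """
    if target_count < 1:
        raise ValueError("target_count must be at least 1")
    walkers = list(walkers)
    if not walkers:
        return walkers

    # Split (replicate): the heaviest walker becomes two copies with half
    # of its weight each.
    while len(walkers) < target_count:
        walkers.sort(key=lambda sw: sw[1])
        state, weight = walkers.pop()
        walkers += [(state, weight / 2), (state, weight / 2)]

    # Merge (terminate): the two lightest walkers are combined; the
    # survivor is chosen with probability proportional to its weight and
    # inherits the summed weight.
    while len(walkers) > target_count:
        walkers.sort(key=lambda sw: sw[1])
        (s1, w1), (s2, w2) = walkers[0], walkers[1]
        survivor = s1 if rng.random() < w1 / (w1 + w2) else s2
        walkers = [(survivor, w1 + w2)] + walkers[2:]

    return walkers


start = [("a", 0.5), ("b", 0.3), ("c", 0.2)]
more = resample_bin(start, target_count=5, rng=random.Random(0))
fewer = resample_bin(start, target_count=2, rng=random.Random(0))
```

In a real WE simulation, the bin assignment that precedes this step is where run-time analysis (for example with MDAnalysis) comes in, since it determines the state of each walker.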
Built as an add-on for the popular, industry-leading 3D software Blender, Molecular Nodes (MN) enables the import and visualization of complex molecular datasets inside Blender. Many formats are supported, such as static structures, electron density maps, EM (electron microscopy) tomography data and, importantly, molecular dynamics simulation trajectories, powered by MDAnalysis. Blender is primarily intended for use via a GUI by artists, but scripting via the Python API is also possible, with many potential avenues for automated animation and rendering.
A great overview of the project is the talk given at the Blender Conference in 2022. The MN documentation includes a lot of information about how to get started, including written and video tutorials, with one specific to MD trajectory import. Workshop materials are also publicly available for an online Introduction to MDAnalysis and Molecular Nodes workshop held in February 2024, which includes an interactive tutorial for visualizing imported MDAnalysis data in Molecular Nodes.
It’s important to first familiarize yourself with using Blender and Molecular Nodes via a GUI and some of the quirks that go along with it, before trying to write code for it.
The table summarizes the project ideas; long descriptions come after the table (or click on the links under each project name). The difficulty is a somewhat subjective ranking, where easy means that we know pretty much what needs to be done, medium requires some additional research into best solutions as part of the project, and hard is high risk/high reward where we think a solution exists but we will have to work with the student to find it and implement it. Each project is sized at 90 hours (small), 175 hours (medium), or 350 hours (large).
project | name | difficulty | project size | description | skills | mentors |
---|---|---|---|---|---|---|
1 | Integrating MDAnalysis streaming analysis within WESTPA propagators | medium | 175 hours | Integrate MDAnalysis with WESTPA to analyze streamed trajectory data | Python (multiprocessing), Networking (TCP/IP), MD Engines | @jeremyleung521, @ltchong, @fiona-naughton, @orbeckst |
2 | Dashboard for tracking WESTPA simulation progress | easy | 90 hours | Create a graphical user interface to report MD trajectory progress | Python (frontend UI, multiprocessing), Networking (TCP/IP) | @jeremyleung521, @ltchong, @fiona-naughton, @talagayev |
3 | Lazy trajectory analysis with Dask and a Lazy Timeseries API | medium | 175 hours | Build out a lazy reader and timestep interface | Dask or lazy computation paradigm, Object-oriented programming, Writing analysis code classes/scripts, Experience with a numpy-like-interface | @ljwoods2, @orbeckst, @yuxuanzhuang |
4 | Better interfacing of Blender and MDAnalysis | medium | 350 hours | Improve how Blender and Molecular Nodes interface with MDAnalysis to import and animate MD trajectories | Python, MDAnalysis, Blender (and programming via its Python API) | @yuxuanzhuang, @bradyajohnston |
5 | HBond interactions from implicit hydrogens | medium | 175/350 hours | Make interaction fingerprints analysis with ProLIF (an MDAKit) more accessible and faster to run | Python, RDKit, SMARTS, compchem | @cbouy, @talagayev |
6 | Continuous (i.e., non-binary) interaction fingerprints (IFPs) | hard | 350 hours | Define thresholds for interactions and implement continuous encoding for interactions | Python, RDKit, compchem | @cbouy, @talagayev |
7 | Improving ProLIF's 2D interaction visualizations | medium | 90/175 hours | Improve ProLIF's "LigNetwork" plot and add 2D visualizations to summarize information in IFPs | Python, JavaScript | @cbouy, @talagayev |
8 | Benchmarking and performance optimization | easy/medium | 175/350 hours | Write benchmarks for automated performance analysis and address performance bottlenecks | Python/ASV, Cython | @orbeckst, @ljwoods2 |
9 | Integrating OpenFold’s structural prediction confidence metrics into the topology system | easy/medium | 90 hours | Expose the predicted local distance difference test metric (pLDDT) via the MDAnalysis topology system | OpenFold or structural prediction tools more generally, Python, Solving parsing problems | @ljwoods2, @orbeckst |
This project aims to integrate MDAnalysis with WESTPA to exploit MDAnalysis’s ability to analyze streamed trajectory data generated by WESTPA. This integration will reduce I/O bottlenecks and minimize the runtime needed to analyze a WESTPA simulation before intermittent restarting of the short, completed MD simulations.
MDAnalysis is currently an option for extracting and analyzing simulation data for WESTPA simulations. This project aims to expand those capabilities by integrating streaming directly into WESTPA’s propagator executables and work managers for analysis, reducing the need for users to configure the networking. This will include stress testing MDAnalysis’s streaming capabilities, as analysis might involve networking configurations such as streaming data from many nodes to many nodes or from many nodes to one node.
- Stress-testing MDAnalysis streaming
- Additional propagator executables for integrating MDAnalysis streaming within WESTPA
- Python (multiprocessing)
- Networking (TCP/IP)
- MD Engines
Medium (175 hours)
Medium
WESTPA simulations involve running multiple MD trajectories in parallel, which makes it hard to track progress. This project aims to create a graphical user interface that exploits MDAnalysis’s streaming ability and WESTPA’s work managers to monitor the progress of a WESTPA simulation.
While WESTPA simulations report status at regular intervals, these iterations can last minutes to hours, leaving users unsure of the intermediate progress or time estimate. The task here will involve creating a graphical user interface that reports trajectory progress and completion time estimates through MDAnalysis’s streaming abilities and by extracting relevant information from WESTPA’s work managers (ZMQ, Python multiprocessing) and data managers.
- New CLI tool for tracking WESTPA simulation progress
- MDAnalysis module for aggregating/tracking multiple simulations
- Python (frontend UI, multiprocessing)
- Networking (TCP/IP)
Not applicable
Small (90 hours)
Easy
This project aims to improve MDAnalysis’s viability in high-performance clusters and high-throughput environments by building out a lazy (rather than eager) reader and timestep interface along with a sample H5MD implementation and basic analysis classes.
MDAnalysis’s core data structure for holding trajectory data, the timestep, provides a highly uniform interface for various readers. However, its eager approach to memory management, where trajectory frames are loaded into the object one step at a time, constrains analysis speed compared to the increasingly popular lazy-loading paradigm used in packages like Dask and Polars. In HPC or cloud computing environments, where minimizing analysis time is necessary for MDAnalysis to be viable at scale, a lazy interface would give new readers a target and existing readers something to adapt to. Together with a sample H5MD implementation and basic analysis classes that build on it, this interface would provide immediate benefits for HPC MDAnalysis users and a future platform for ensuring that MDAnalysis can scale with its users’ projects.
MDAnalysis already has a `timeseries` API for readers, which provides a natural starting place for a similar `lazy_timeseries` interface that would include an additional argument to select between coordinates, velocities, and forces. Existing readers can implement `lazy_timeseries` by first simply passing the `numpy.ndarray` result of calling `timeseries` into a Dask array, but certain readers like the `H5MDReader` or `ZarrH5MDReader` can receive a proper lazy implementation.
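The simple fallback described above (wrapping an eagerly loaded array in a Dask array) could look roughly like the sketch below. This is an illustration under stated assumptions, not an existing MDAnalysis API: the helper name `lazy_timeseries` and its `frames_per_chunk` argument are hypothetical, and a random NumPy array of shape `(n_frames, n_atoms, 3)` stands in for the result of a reader's `timeseries` call so that the example runs without a trajectory file.

```python
import dask.array as da
import numpy as np


def lazy_timeseries(eager_array, frames_per_chunk=100):
    """Wrap an eager (n_frames, n_atoms, 3) array in a chunked Dask array.

    Hypothetical helper: a proper lazy reader would read chunks from disk
    on demand instead of wrapping data that is already in memory.
    """
    n_frames, n_atoms, n_dim = eager_array.shape
    return da.from_array(eager_array, chunks=(frames_per_chunk, n_atoms, n_dim))


# Stand-in for the numpy.ndarray returned by a reader's timeseries call.
rng = np.random.default_rng(0)
coords = rng.random((250, 10, 3))

lazy = lazy_timeseries(coords, frames_per_chunk=100)

# Building the computation is lazy; nothing is evaluated until compute().
mean_positions = lazy.mean(axis=0)
result = mean_positions.compute()
```

Chunking along the frame axis means that downstream reductions over frames can be scheduled chunk by chunk, which is the property a proper lazy implementation for a file-backed reader would build on.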
- MDAKit with lazy reader and timeseries interface code
- A working H5MD implementation of the interface
- A lazy timeseries analysis base
- Implementation of at least one basic analysis algorithm using the interface (like RMSD)
- Experience with dask or lazy computation paradigm
- Knowledge of object-oriented programming
- Experience writing analysis code classes/scripts
- Any experience with a numpy-like-interface is useful
- https://github.com/MDAnalysis/mdanalysis/issues/4713
- https://github.com/MDAnalysis/mdanalysis/issues/4598
- https://github.com/MDAnalysis/mdanalysis/issues/4561
- https://github.com/MDAnalysis/mdanalysis/issues/2865
Medium (175 hours)
Medium
This project aims to improve how Blender and Molecular Nodes interface with MDAnalysis, which powers the import and animation of MD trajectories inside of Blender. Simple import is currently available when using the GUI in Blender, but there is still a lot of potential for improvements in scriptability, automated rendering, and using Blender as an analysis tool for MD trajectories.
Blender is industry-leading 3D modeling and animation software. Through the add-on Molecular Nodes, MDAnalysis universes can be imported into the 3D scene, enabling advanced rendering of molecular dynamics trajectories that is not possible in any other molecule viewer. Scripting and automating this rendering is possible but limited, with lots of room for improvement for visualizing many common MD datasets. Blender also provides a great platform for implementing a potential GUI to enable interactive analysis of MD trajectories with stunning visuals, all powered by MDAnalysis under the hood.
- Prototype improved API for scripting and working with Molecular Nodes from Jupyter Notebooks or other similar environments
- Prototyping common analysis and visualization tasks that could be performed from within Blender via the GUI
- Proficiency with Python
- Working knowledge of MDAnalysis
- Familiarity with Blender and programming via its Python API
- https://github.com/MDAnalysis/mdanalysis/discussions/4862
- https://github.com/yuxuanzhuang/ggmolvis
- https://github.com/yuxuanzhuang/ggmolvis/issues/11
- https://www.mdanalysis.org/2024/12/12/sdg_molecularnodes/
- https://github.com/BradyAJohnston/MolecularNodes/pull/719
Large (350 hours)
Medium
This project makes interaction fingerprint analysis with ProLIF (an MDAKit) more accessible and faster to run from PDB files for machine-learning (ML) practitioners.
Interaction fingerprints (IFPs) are a common strategy to filter docking poses that aren't able to recapitulate known interactions in molecular complexes, but their use requires explicit hydrogens to model hydrogen bonds. While ML-based docking and cofolding tools have seen increased usage over recent years, the files that these methods generate only contain heavy atoms. Although it's possible to add hydrogens to a complex and optimize its hydrogen-bond network, this significantly slows down and ultimately hampers the use of IFPs to evaluate the quality of ML-based molecular complexes. By adding to ProLIF the ability to evaluate hydrogen bond interactions solely based on heavy atom positions (and with the assumption that hydrogens are positioned ideally for such interactions), it would be possible to directly compare co-crystallized complexes (e.g., from the PDB) with ML-based poses without requiring the intermediate use of protonation tools (PDB2PQR, reduce, etc.).
- New set of hydrogen-bond interaction classes available using implicit hydrogens
- Helper function to load PDBs with heavy atoms only and common non-standard residues (HSD, HSE, HID…etc.) appropriately
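As a rough illustration of what a hydrogen-bond test based only on heavy atoms might involve, the sketch below combines a donor-acceptor distance cutoff with an angle check at the donor. Everything here is an assumption made for the example: the function names, the 3.5 Å cutoff, and the 90° minimum angle are illustrative values, not ProLIF's API or parameters, and the actual geometric criteria would be defined during the project.

```python
import math


def angle_deg(a, b, c):
    """Angle at point b formed by a-b-c, in degrees."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def implicit_hbond(donor, acceptor, donor_neighbor,
                   max_distance=3.5, min_angle=90.0):
    """Heavy-atom-only hydrogen-bond test (illustrative thresholds).

    donor_neighbor is the heavy atom covalently bonded to the donor. The
    angle neighbor-donor-acceptor is used as a proxy for whether an
    ideally placed hydrogen could point towards the acceptor.
    """
    close_enough = math.dist(donor, acceptor) <= max_distance
    well_oriented = angle_deg(donor_neighbor, donor, acceptor) >= min_angle
    return close_enough and well_oriented


donor = (0.0, 0.0, 0.0)
neighbor = (-1.5, 0.0, 0.0)
in_line = implicit_hbond(donor, (2.9, 0.0, 0.0), neighbor)
too_far = implicit_hbond(donor, (4.0, 0.0, 0.0), neighbor)
wrong_side = implicit_hbond(donor, (-2.9, 0.5, 0.0), neighbor)
```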
- Python
- RDKit
- SMARTS
- Computational Chemistry
Not applicable
Medium/Large (175/350 hours)
Medium
IFPs use cutoff values to define the distances and angles that interactions must satisfy, so cases that fall slightly outside of these thresholds are not detected at all. This project aims to bring an alternative (continuous) encoding for interactions to ProLIF (an MDAKit) so that such cases can still be accounted for in subsequent analysis.
IFPs are inherently binary and based on distance and angle thresholds that can be hard to define robustly across the wide diversity of use cases. The idea here is to define "ideal" and "suboptimal" thresholds for distances and angles. Interacting atoms that satisfy the ideal thresholds are encoded as 1 (as in a "traditional" IFP), those that don't satisfy the suboptimal thresholds are encoded as 0, and anything in between is encoded through a sigmoid function that outputs a real value between 0 and 1. This would allow non-ideal geometries for interacting atoms to be handled more gracefully. Note that the sigmoid function that transforms the input distances and angles into a real value will have to be determined during the project.
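The three-regime encoding described above can be sketched for a single distance as follows. Since the actual sigmoid is to be determined during the project, the logistic form, the `steepness` parameter, the function name `continuous_score`, and the example thresholds (3.0 and 3.5) are all assumptions chosen purely for illustration.

```python
import math


def continuous_score(value, ideal, suboptimal, steepness=10.0):
    """Map a distance to a score in [0, 1].

    - value <= ideal:       1.0 (same as a binary fingerprint)
    - value >  suboptimal:  0.0 (interaction not detected)
    - in between:           logistic curve centred on the midpoint

    Hypothetical helper; the real functional form is an open question.
    """
    if not ideal < suboptimal:
        raise ValueError("ideal must be smaller than suboptimal")
    if value <= ideal:
        return 1.0
    if value > suboptimal:
        return 0.0
    midpoint = 0.5 * (ideal + suboptimal)
    width = suboptimal - ideal
    # Scaling by the width keeps the curve shape independent of units.
    return 1.0 / (1.0 + math.exp(steepness * (value - midpoint) / width))


ideal_case = continuous_score(2.8, ideal=3.0, suboptimal=3.5)
borderline = continuous_score(3.25, ideal=3.0, suboptimal=3.5)
missed = continuous_score(3.6, ideal=3.0, suboptimal=3.5)
```

With this particular form the score is not exactly continuous at the two thresholds (it jumps from 1 to about 0.99 just past the ideal threshold); whether that matters, and how angles should be combined with distances, are among the design questions the project would need to settle.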
- The interaction classes have the ability to specify multiple thresholds for distances and angles, and additional metadata is returned when these classes are used to detect interactions
- The conversion of the resulting IFPs to a pandas DataFrame outputs continuous values between 0 and 1 instead of being binary
- Users should only have to toggle a new parameter in the Fingerprint object initialization to enable this analysis
- Python
- RDKit
- Computational Chemistry
Not applicable
Large (350 hours)
Hard
This project involves improvements to the current “LigNetwork” plot produced by ProLIF to make it easier to use and more publication-ready, and leaves some room to add other 2D visualizations that can help summarize the information contained in interaction fingerprints.
When creating a "LigNetwork" plot with ProLIF, the protein residues are placed randomly and left for the user to drag and drop to make the figure readable. It would be beneficial to automate the placement of residues on the plot in a way that minimizes the overlap between residues and the ligand, and minimizes crossing between interaction edges.
Depending on the size of the project, additional visualizations could be added. One option is a "heatmap" of ligand atoms involved in different interactions, highlighting such atoms with a bivariate Gaussian distribution based on an atomic interaction-aware weighting (see doi.org/10.1186/1758-2946-5-43 for reference). Another is to generate a figure similar to the LigNetwork with InteractionDrawer by converting the ProLIF metadata to follow their JSON schema. For protein-protein systems and residue network analysis, Flareplot would be very beneficial.
- The placement of residues on the LigNetwork plot is made to be readable out of the box
- If time allows, additional kinds of plots are added to further improve ProLIF’s visualization capabilities
- Python
- JavaScript
Not applicable
Small/Medium (90/175 hours)
Medium
The goal of this project is to increase the performance assessment coverage (using the existing ASV framework), identify code that should be improved, and optimize code.
The MDAnalysis Roadmap emphasizes performance improvement. The performance of the MDAnalysis library is assessed by automated nightly benchmarks with ASV (see https://github.com/MDAnalysis/benchmarks/wiki) but coverage of the code base is low. The goal of this project is to substantially increase the performance assessment coverage, identify code that should be improved, and possibly implement performance optimizations.
- Write ASV benchmark cases for all major functionality in the core library
- Write ASV benchmark cases for often-used analysis tools
- Analyze performance history and generate a priority list of code that should be improved
- Document writing benchmarks with a short tutorial
- Optional: Optimize performance for at least one discovered performance bottleneck
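For applicants unfamiliar with ASV, a benchmark is a plain Python class that ASV discovers by naming convention: `setup` runs before each measurement, methods prefixed with `time_` are timed, and `params`/`param_names` define a parameter sweep. The sketch below uses a pure-Python pairwise-distance loop as a stand-in workload so that it runs without MDAnalysis; the class name and workload are hypothetical and not taken from the MDAnalysis benchmark suite.

```python
import math
import random


class PairwiseDistanceSuite:
    """Hypothetical ASV benchmark suite with a stand-in workload."""

    # ASV runs each benchmark once per parameter value.
    params = [10, 100]
    param_names = ["n_atoms"]

    def setup(self, n_atoms):
        # Executed before timing; its cost is not included in the result.
        rng = random.Random(42)
        self.coords = [
            (rng.random(), rng.random(), rng.random()) for _ in range(n_atoms)
        ]

    def time_pairwise_distances(self, n_atoms):
        # Only the body of a time_* method is timed by ASV.
        coords = self.coords
        for i in range(n_atoms):
            for j in range(i + 1, n_atoms):
                math.dist(coords[i], coords[j])


# Manual smoke test, independent of the asv command-line tool.
suite = PairwiseDistanceSuite()
suite.setup(10)
suite.time_pairwise_distances(10)
```

A real benchmark for this project would replace the stand-in loop with calls into MDAnalysis core or analysis code, keeping any file loading in `setup`.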
- Python/ASV
- Cython
- https://github.com/MDAnalysis/mdanalysis/issues/1023
- https://github.com/MDAnalysis/mdanalysis/issues/1721
- https://github.com/MDAnalysis/mdanalysis/issues/4577
Medium/Large (175/350 hours)
Easy/Medium
The goal of this project is to expose per-molecule and inter-molecular structural prediction confidence metrics to users via the MDAnalysis topology system.
Structural prediction tools like OpenFold are increasingly important for in-silico estimations of protein structure and of binding probability between molecules. However, working with the raw output of prediction tools is challenging and often requires bespoke, researcher-written tools that are prone to inefficiency and errors. MDAnalysis’s topology system provides a robust, natural interface for working with per-residue (like pLDDT) and between-residue/chain metrics (like a contact probabilities matrix). This project seeks to build the foundation for working with structural prediction confidence metrics in MDAnalysis.
After this project is complete, users will be able to access confidence metrics via NumPy arrays associated with (or between) each `AtomGroup` in a way that is consistent with current MDA atom selection.
- Modifications of the `PDBParser` to associate confidence metrics with and between each `AtomGroup`
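To give a feel for the parsing side of this project, the sketch below extracts per-residue pLDDT values from PDB text using only the standard library. It assumes, as is common for AlphaFold-style PDB output, that pLDDT is written to the B-factor column (columns 61-66 of `ATOM` records). The function name, the synthetic sample records, and the threshold of 70 used to flag low-confidence residues are illustrative assumptions, not part of MDAnalysis.

```python
def plddt_per_residue(pdb_text):
    """Return {(chain_id, resid): pLDDT} read from the CA atom of each residue.

    Assumes pLDDT is stored in the B-factor column of ATOM records.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":   # atom name, columns 13-16
            continue
        chain_id = line[21]               # chain identifier, column 22
        resid = int(line[22:26])          # residue number, columns 23-26
        scores[(chain_id, resid)] = float(line[60:66])  # B-factor column
    return scores


# Synthetic records for illustration only (not real prediction output).
SAMPLE_PDB = """\
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00 92.15           N
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 92.15           C
ATOM      3  N   GLY A   2      25.112  24.880   3.649  1.00 48.30           N
ATOM      4  CA  GLY A   2      24.005  25.770   3.991  1.00 48.30           C
"""

scores = plddt_per_residue(SAMPLE_PDB)
low_confidence = [key for key, value in scores.items() if value < 70.0]
```

In MDAnalysis the B-factor column of a PDB file is already available as the `tempfactors` attribute; part of this project would be deciding how to expose such values under a dedicated, clearly named attribute and how to represent between-residue metrics.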
- OpenFold experience or structural prediction tools more generally
- Python
- Experience solving parsing problems
Small (90 hours)
Easy/Medium