GSoC 2025 Project Ideas

IMPORTANT: THIS PAGE IS ONLY A DRAFT. Given that there is an application process for organizations — such as MDAnalysis — to participate in the GSoC program, there is no guarantee that MDAnalysis will be selected by Google as a participating GSoC organization. Once we know more, we will share updates on our public communication channels.

Hello, and welcome to MDAnalysis!

Please see our Google Summer of Code wiki page for general information, including advice on application writing, and our GSoC FAQ for commonly asked questions.

If you just found out about the MDAnalysis Python package from the GSoC website, you can watch the MDAnalysis 2021 Trailer [YouTube] to get an overview of the scope of the MDAnalysis package.

Prerequisites

MDAnalysis is a Python library for the analysis of computer simulations of many-body systems at the molecular scale, spanning use cases from interactions of drugs with proteins to novel materials. For Google Summer of Code, we are also collaborating with other organizations and software projects that use MDAnalysis. Our GSoC projects generally require basic knowledge of and hands-on experience in specific areas, so please check the project descriptions carefully to see the desirable skills for each suggested project. Broadly speaking, we have found that applicants with experience in molecular dynamics (MD) simulations and the associated analyses -- or equivalent experience in simulations and modeling of molecular systems (physics, biophysics, chemistry, or materials) -- are very successful.

To Prospective Applicants

If you are interested in taking part, please get in touch on the GSoC with MDAnalysis Discussion Forum. Given the GSoC program structure (small, medium, and large projects), letting us know of your intentions to apply and getting acquainted with the project early will be very helpful.

To Prospective Mentors

MDAnalysis welcomes new mentors; please get in touch on the developer forum if you are interested in taking part. We typically expect mentors to be familiar with our development process, as evidenced by contributions to the code base and interactions on the developer forum.

Overview

See below for a list of project ideas for Google Summer of Code 2025.

The currently proposed projects are:

Integrating MDAnalysis streaming analysis within WESTPA propagators
Dashboard for tracking WESTPA simulation progress
Lazy trajectory analysis with Dask and a Lazy Timeseries API
Better interfacing of Blender and MDAnalysis
HBond interactions from implicit hydrogens
Continuous (i.e., non-binary) interaction fingerprints (IFPs)
Improving ProLIF's 2D interaction visualizations
Benchmarking and performance optimization
Integrating OpenFold’s structural prediction confidence metrics into the topology system

Or work on your own idea! Get in contact with us to propose an idea and we will work with you to flesh it out into a full project. Contact us via the GSoC with MDAnalysis Discussion Forum or, if your project is a specific feature you'd like to add, raise an issue in the Issue Tracker.

Look at the list of all available mentors for MDAnalysis for potential mentors for your project. Please send all communications to the discussion forum (and don't contact mentors privately). You can certainly ask for the opinion of a specific mentor if you know that their expertise is particularly suitable for your project.

Collaborations

WESTPA

WESTPA (The Weighted Ensemble Simulation Toolkit with Parallelization and Analysis) is a high-performance Python framework for applying the weighted ensemble (WE) path sampling strategy, which enables simulations of processes that are orders of magnitude longer than the simulations themselves. A WE simulation involves an iterative process: many MD simulations are executed in parallel and periodically evaluated to be replicated or terminated based on a set of WE resampling criteria. To rigorously apply WE resampling, the MD simulations are analyzed at run time with tools such as MDAnalysis to determine the state of the simulated system.
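To make the replicate-or-terminate step more concrete, here is a minimal Python sketch of resampling the trajectory weights inside a single bin. It is a simplified illustration and not WESTPA's implementation: the function name `resample_bin`, the per-bin target count, and the exact split and merge rules are assumptions chosen for brevity.

```python
def resample_bin(weights, target_count):
    """Resample the trajectory weights of one bin to `target_count` walkers.

    Hypothetical helper for illustration only; it is not part of WESTPA.
    The total statistical weight in the bin is conserved.
    """
    if not weights or target_count < 1:
        raise ValueError("need at least one walker and a positive target count")
    walkers = sorted(weights)
    # Replicate ("split"): too few walkers, so halve the heaviest one.
    while len(walkers) < target_count:
        heaviest = walkers.pop()
        walkers.extend([heaviest / 2, heaviest / 2])
        walkers.sort()
    # Terminate ("merge"): too many walkers, so combine the two lightest.
    # In a full WE scheme one of the two trajectories would be kept, chosen
    # with probability proportional to its weight; only weights are tracked here.
    while len(walkers) > target_count:
        first, second = walkers.pop(0), walkers.pop(0)
        walkers.append(first + second)
        walkers.sort()
    return walkers


new_weights = resample_bin([0.5, 0.3, 0.15, 0.05], target_count=6)
```

In this example the two heaviest walkers are split, giving six walkers whose weights still sum to 1.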

Read the WE Overview; install the WESTPA package; and work through tutorials 7.1 and 7.5 of our tutorials suite to start learning more about WESTPA.

Molecular Nodes

Built as an add-on for the popular and industry-leading 3D software Blender, Molecular Nodes (MN) enables import and visualization of complex molecular datasets inside of Blender. Many formats are supported, such as static structures, electron density maps, EM (electron microscopy) tomography data and, importantly, molecular dynamics simulation trajectories, powered by MDAnalysis. Blender is primarily intended for use via a GUI by artists, but scripting via the Python API is also possible, with many potential avenues for automated animation and rendering.

A great overview of the project is the talk given at the Blender Conference in 2022. The MN documentation includes a lot of information about how to get started, including written and video tutorials, with one specific to MD trajectory import. Workshop materials are also publicly available for an online Introduction to MDAnalysis and Molecular Nodes workshop held in February 2024, which includes an interactive tutorial for visualizing imported MDAnalysis data in Molecular Nodes.

It’s important to first familiarize yourself with using Blender and Molecular Nodes via a GUI and some of the quirks that go along with it, before trying to write code for it.

Project summary

The table summarizes the project ideas; long descriptions come after the table (or click on the links under each project name). The difficulty is a somewhat subjective ranking, where easy means that we know pretty much what needs to be done, medium requires some additional research into best solutions as part of the project, and hard is high risk/high reward where we think a solution exists but we will have to work with the student to find it and implement it. The project size is 90 hours (small), 175 hours (medium), or 350 hours (large).

project	name	difficulty	project size	description	skills	mentors
1	Integrating MDAnalysis streaming analysis within WESTPA propagators	medium	175 hours	Integrate MDAnalysis with WESTPA to analyze streamed trajectory data	Python (multiprocessing), Networking (TCP/IP), MD Engines	@jeremyleung521, @ltchong, @fiona-naughton, @orbeckst
2	Dashboard for tracking WESTPA simulation progress	easy	90 hours	Create a graphical user interface to report MD trajectory progress	Python (frontend UI, multiprocessing), Networking (TCP/IP)	@jeremyleung521, @ltchong, @fiona-naughton, @talagayev
3	Lazy trajectory analysis with Dask and a Lazy Timeseries API	medium	175 hours	Build out a lazy reader and timestep interface	Dask or lazy computation paradigm, Object-oriented programming, Writing analysis code classes/scripts, Experience with a numpy-like-interface	@ljwoods2, @orbeckst, @yuxuanzhuang
4	Better interfacing of Blender and MDAnalysis	medium	350 hours	Improve how Blender and Molecular Nodes interface with MDAnalysis to import and animate MD trajectories	Python, MDAnalysis, Blender (and programming via its Python API)	@yuxuanzhuang, @bradyajohnston
5	HBond interactions from implicit hydrogens	medium	175/350 hours	Make interaction fingerprints analysis with ProLIF (an MDAKit) more accessible and faster to run	Python, RDKit, SMARTS, compchem	@cbouy, @talagayev
6	Continuous (i.e., non-binary) interaction fingerprints (IFPs)	hard	350 hours	Define thresholds for interactions and implement continuous encoding for interactions	Python, RDKit, compchem	@cbouy, @talagayev
7	Improving ProLIF's 2D interaction visualizations	medium	90/175 hours	Improve ProLIF's "LigNetwork" plot and add 2D visualizations to summarize information in IFPs	Python, JavaScript	@cbouy, @talagayev
8	Benchmarking and performance optimization	easy/medium	175/350 hours	Write benchmarks for automated performance analysis and address performance bottlenecks	Python/ASV, Cython	@orbeckst, @ljwoods2
9	Integrating OpenFold’s structural prediction confidence metrics into the topology system	easy/medium	90 hours	Expose the predicted local distance difference test metric (pLDDT) via the MDAnalysis topology system	OpenFold or structural prediction tools more generally, Python, Solving parsing problems	@ljwoods2, @orbeckst

Project 1: Integrating MDAnalysis streaming analysis within WESTPA propagators

Summary

This project aims to integrate MDAnalysis with WESTPA to exploit MDAnalysis’s ability to analyze streamed trajectory data generated by WESTPA. This integration will reduce I/O bottlenecks and minimize the runtime needed to analyze a WESTPA simulation before intermittent restarting of the short, completed MD simulations.

Detailed Description

MDAnalysis is currently an option for extracting and analyzing simulation data for WESTPA simulations. This project aims to expand those capabilities by integrating streaming directly into WESTPA’s propagator executables and work managers for analysis, reducing the need for users to configure the networking. This will include stress testing MDAnalysis’s streaming capabilities, as analysis might involve networking configurations such as streaming data from many nodes to many nodes and from many nodes to one node.

Expected Outcomes

Stress-testing MDAnalysis streaming
Additional propagator executables for integrating MDAnalysis streaming within WESTPA

Relevant Skills

Python (multiprocessing)
Networking (TCP/IP)
MD Engines

Related issues/PRs/etc.:

https://github.com/jeremyleung521/westpa/pull/28

Possible Mentors

@jeremyleung521
@ltchong
@fiona-naughton
@orbeckst

Expected Size of Project

Medium (175 hours)

Difficulty Rating

Medium

Project 2: Dashboard for tracking WESTPA simulation progress

Summary

WESTPA simulations involve running multiple MD trajectories in parallel, which makes it hard to track progress. This project aims to create a graphical user interface that exploits MDAnalysis’s streaming ability and WESTPA’s work managers to monitor the progress of a WESTPA simulation.

Detailed Description

While WESTPA simulations report status at regular intervals, these iterations can last minutes to hours, leaving users unsure of the intermediate progress or the estimated time to completion. The task here will involve creating a graphical user interface that reports trajectory progress and completion time estimates using MDAnalysis’s streaming abilities and relevant information extracted from WESTPA’s work managers (ZMQ, Python multiprocessing) and data managers.
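As a rough illustration of how a completion time estimate could be derived from streamed progress, the sketch below extrapolates linearly from the number of frames received so far. The function name `estimate_remaining_seconds` and its inputs are hypothetical; how frame counts would actually be obtained from the stream or from WESTPA's work managers is part of the project.

```python
def estimate_remaining_seconds(frames_done, frames_total, elapsed_seconds):
    """Linearly extrapolate the time left in the current iteration.

    `frames_done` and `frames_total` hold one entry per running trajectory.
    Returns None when no estimate is possible yet.
    Hypothetical helper for illustration only.
    """
    done = sum(frames_done)
    total = sum(frames_total)
    if done <= 0 or elapsed_seconds <= 0:
        return None  # nothing received yet, so no rate can be computed
    if done >= total:
        return 0.0
    # remaining frames divided by the observed rate (done / elapsed)
    return (total - done) * elapsed_seconds / done


# Three trajectories of 100 frames each, one minute into the iteration.
remaining = estimate_remaining_seconds([50, 25, 25], [100, 100, 100], 60.0)
```

Here one third of all frames arrived in 60 seconds, so the estimate for the remaining two thirds is 120 seconds.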

Expected Outcomes

New CLI tool for tracking WESTPA simulation progress
MDAnalysis module for aggregating/tracking multiple simulations

Relevant Skills

Python (frontend UI, multiprocessing)
Networking (TCP/IP)

Related issues/PRs/etc.:

Not applicable

Possible Mentors

@jeremyleung521
@ltchong
@fiona-naughton
@talagayev

Expected Size of Project

Small (90 hours)

Difficulty Rating

Easy

Project 3: Lazy trajectory analysis with Dask and a Lazy Timeseries API

Summary

This project aims to improve MDAnalysis’s viability in high-performance clusters and high-throughput environments by building out a lazy (rather than eager) reader and timestep interface along with a sample H5MD implementation and basic analysis classes.

Detailed Description

MDAnalysis’s core data structure for holding trajectory data, the timestep, provides a highly uniform interface for various readers. However, its eager approach to memory management, where trajectory frames are loaded into the object one step at a time, constrains analysis speed compared to the increasingly popular lazy-loading paradigm used in packages like Dask and Polars. In HPC or cloud computing environments, where minimizing analysis time is necessary for MDAnalysis to be viable at scale, a lazy interface would give new readers something to target and existing readers something to adapt to. Together with a sample H5MD implementation and basic analysis classes built on it, this would provide immediate benefits for HPC MDAnalysis users and a platform for ensuring that MDAnalysis can scale with its users’ projects.

MDAnalysis already has a timeseries API for readers which provides a natural starting place for a similar lazy_timeseries interface which would include an additional argument to select between coordinates, velocities, and forces. Existing readers can implement lazy_timeseries by first simply passing the numpy.ndarray result of calling timeseries into a Dask array, but certain readers like the H5MDReader or ZarrH5MDReader can receive a proper lazy implementation.
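The idea of a lazy interface with a quantity selector can be sketched as follows. This is only an illustration of deferred evaluation in plain Python, used here in place of Dask so that the example has no dependencies; an actual implementation would more likely return a Dask array (for example via `dask.array.from_array`). The names `LazyTimeseries`, `ToyReader`, and the signature of `lazy_timeseries` are assumptions, not existing MDAnalysis API.

```python
class LazyTimeseries:
    """Deferred view over per-frame data; nothing is read until compute()."""

    def __init__(self, read_frame, frames):
        self._read_frame = read_frame
        self._frames = frames  # a range, so slicing it stays lazy as well

    def __len__(self):
        return len(self._frames)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slicing")
        return LazyTimeseries(self._read_frame, self._frames[key])

    def compute(self):
        """Read the selected frames now and return them as a list."""
        return [self._read_frame(i) for i in self._frames]


class ToyReader:
    """Stand-in for a trajectory reader that counts frame reads."""

    def __init__(self, n_frames):
        self.n_frames = n_frames
        self.reads = 0

    def read(self, frame, quantity):
        self.reads += 1
        offset = {"positions": 0.0, "velocities": 100.0, "forces": 200.0}[quantity]
        return [offset + frame] * 3  # fake data for one atom


def lazy_timeseries(reader, quantity="positions"):
    """Hypothetical lazy counterpart of a reader's timeseries method."""
    if quantity not in ("positions", "velocities", "forces"):
        raise ValueError(f"unknown quantity: {quantity}")
    return LazyTimeseries(lambda i: reader.read(i, quantity), range(reader.n_frames))


reader = ToyReader(n_frames=1000)
every_100th = lazy_timeseries(reader, "velocities")[::100]
reads_before = reader.reads        # still 0: slicing did not touch the data
data = every_100th.compute()       # only now are 10 frames read
```

The point of the design is that selecting and slicing are free, and I/O happens only for the frames that are finally requested.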

Expected Outcomes

MDAKit with lazy reader and timeseries interface code
A working H5MD implementation of the interface
A lazy timeseries analysis base
Implementation of at least one basic analysis algorithm using the interface (like RMSD)

Relevant Skills

Experience with Dask or lazy computation paradigm
Knowledge of object-oriented programming
Experience writing analysis code classes/scripts
Any experience with a numpy-like-interface is useful

Related issues/PRs/etc.:

Possible Mentors

@ljwoods2
@orbeckst
@yuxuanzhuang

Expected Size of Project

Medium (175 hours)

Difficulty Rating

Medium

Project 4: Better interfacing of Blender and MDAnalysis

Summary

This project improves how Blender and Molecular Nodes interface with MDAnalysis, which powers the import and animation of MD trajectories inside of Blender. Simple import is currently available when using the GUI in Blender, but there is still a lot of potential for improvements in scriptability, automated rendering, and using Blender as an analysis tool for MD trajectories.

Detailed Description

Blender is industry-leading 3D modeling and animation software. Through the add-on Molecular Nodes, MDAnalysis universes can be imported into the 3D scene, enabling advanced rendering of molecular dynamics trajectories that is not possible inside of any other molecule viewer. Scripting and automating this rendering is possible but limited, with lots of room for improvement for visualizing many common MD datasets. Blender also provides a great platform for implementing a potential GUI to enable interactive analysis of MD trajectories with stunning visuals, all powered by MDAnalysis under the hood.

Expected Outcomes

Prototype improved API for scripting and working with Molecular Nodes from Jupyter Notebooks or other similar environments
Prototyping common analysis and visualization tasks that could be performed from within Blender via the GUI

Relevant Skills

Proficiency with Python
Working knowledge of MDAnalysis
Familiarity with Blender and programming via its Python API

Related issues/PRs/etc.:

Possible Mentors

@yuxuanzhuang
@bradyajohnston

Expected Size of Project

Large (350 hours)

Difficulty Rating

Medium

Project 5: HBond interactions from implicit hydrogens

Summary

This project makes interaction fingerprint analysis with ProLIF (an MDAKit) more accessible and faster to run from PDB files for machine-learning (ML) practitioners.

Detailed Description

Interaction fingerprints (IFPs) are a common strategy to filter docking poses that aren't able to recapitulate known interactions in molecular complexes, but their use requires explicit hydrogens to model hydrogen bonds. While ML-based docking and cofolding tools have seen increased usage over recent years, the files that these methods generate only contain heavy atoms. It is possible to add hydrogens to a complex and optimize its hydrogen-bond network, but this significantly slows down and ultimately hampers the use of IFPs to evaluate the quality of ML-based molecular complexes. By adding to ProLIF the ability to evaluate hydrogen bond interactions solely based on heavy atom positions (and with the assumption that hydrogens are positioned ideally for such interactions), it would be possible to directly compare co-crystallized complexes (e.g., from the PDB) with ML-based poses without requiring the intermediate use of protonation tools (PDB2PQR, reduce, etc.).
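A minimal sketch of what a heavy-atom-only hydrogen bond check could look like is given below. It is not ProLIF code: the function name `find_implicit_hbonds`, the dictionary-based inputs, and the 3.5 Å donor-acceptor cutoff are assumptions for illustration, and choosing appropriate geometric criteria is part of the project.

```python
import math
from itertools import product


def find_implicit_hbonds(donors, acceptors, max_distance=3.5):
    """Return (donor, acceptor, distance) for heavy-atom pairs within a cutoff.

    `donors` and `acceptors` map an atom label to its (x, y, z) position in Å.
    Hydrogens are not needed: they are assumed to be ideally placed, so only
    the donor-acceptor heavy-atom distance is tested.
    Hypothetical helper; the 3.5 Å default is an assumed value.
    """
    hits = []
    for (d_name, d_xyz), (a_name, a_xyz) in product(donors.items(), acceptors.items()):
        distance = math.dist(d_xyz, a_xyz)
        if distance <= max_distance:
            hits.append((d_name, a_name, round(distance, 2)))
    return hits


ligand_donors = {"LIG:N1": (0.0, 0.0, 0.0)}
protein_acceptors = {"ASP85:OD1": (2.9, 0.0, 0.0), "GLU10:OE1": (6.0, 0.0, 0.0)}
hbonds = find_implicit_hbonds(ligand_donors, protein_acceptors)
```

Only the ASP85 pair at 2.9 Å is reported; the GLU10 oxygen at 6.0 Å lies outside the cutoff.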

Expected Outcomes

New set of hydrogen-bond interaction classes available using implicit hydrogens
Helper function to load PDBs with heavy atoms only and common non-standard residues (HSD, HSE, HID, etc.) appropriately

Relevant Skills

Python
RDKit
SMARTS
Computational Chemistry

Related issues/PRs/etc.:

Not applicable

Possible Mentors

@cbouy
@talagayev

Expected Size of Project

Medium/Large (175/350 hours)

Difficulty Rating

Medium

Project 6: Continuous (i.e., non-binary) interaction fingerprints (IFPs)

Summary

IFPs use cutoff values to define the distances and angles that interactions must satisfy, so cases slightly outside of these thresholds are not detected at all. This project aims to bring an alternative (continuous) encoding for interactions in ProLIF (an MDAKit) so that such cases may still be accounted for in subsequent analysis.

Detailed Description

IFPs are inherently binary and based on distance and angle thresholds that can be hard to define robustly across the wide diversity of use cases. The idea here is to define "ideal" and "suboptimal" thresholds for distances and angles: interacting atoms that satisfy the ideal thresholds are encoded as 1 (as in a "traditional" IFP), those that don't satisfy the suboptimal thresholds are encoded as 0, and anything in between is encoded through a sigmoid function that outputs a real value between 0 and 1. This would handle cases with non-ideal geometries for interacting atoms more gracefully. Note that the sigmoid function that transforms the input distances and angles into a real value will have to be determined during the project.
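One possible shape of such an encoding for a distance is sketched below. The logistic form, the `steepness` parameter, and the function name `continuous_score` are assumptions made for illustration; as noted above, the actual function is to be determined during the project.

```python
import math


def continuous_score(value, ideal, suboptimal, steepness=10.0):
    """Map a distance to a score in [0, 1].

    Distances up to `ideal` score 1, distances at or beyond `suboptimal`
    score 0, and values in between follow a logistic curve centered on the
    midpoint of the two thresholds. Hypothetical helper for illustration.
    """
    if ideal >= suboptimal:
        raise ValueError("ideal threshold must be smaller than suboptimal")
    if value <= ideal:
        return 1.0
    if value >= suboptimal:
        return 0.0
    midpoint = (ideal + suboptimal) / 2
    # Express the offset in units of the window width so that `steepness`
    # does not depend on the absolute scale of the thresholds.
    x = (value - midpoint) / (suboptimal - ideal)
    return 1.0 / (1.0 + math.exp(steepness * x))


# Example thresholds (assumed values, in Å): ideal 3.5, suboptimal 4.5.
scores = [continuous_score(d, 3.5, 4.5) for d in (3.0, 4.0, 5.0)]
```

With this particular form the score is 0.5 at the midpoint between the thresholds. It also has small jumps at the two thresholds (the logistic curve does not reach exactly 1 or 0 there), which is one of the design questions a real implementation would need to settle.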

Expected Outcomes

The interaction classes have the ability to specify multiple thresholds for distances and angles, and additional metadata is returned when these classes are used to detect interactions
The conversion of the resulting IFPs to a pandas DataFrame outputs continuous values between 0 and 1 instead of being binary
Users should only have to toggle on a new parameter in the Fingerprint object initialization to enable this analysis

Relevant Skills

Python
RDKit
Computational Chemistry

Related issues/PRs/etc.:

Not applicable

Possible Mentors

@cbouy
@talagayev

Expected Size of Project

Large (350 hours)

Difficulty Rating

Hard

Project 7: Improving ProLIF's 2D interaction visualizations

Summary

This project involves improvements to the current “LigNetwork” plot produced by ProLIF to make it easier to use and more publication-ready, and leaves some room to add other 2D visualizations that can help summarize the information contained in interaction fingerprints.

Detailed Description

When creating a "LigNetwork" plot with ProLIF, the protein residues are placed randomly and left for the user to drag and drop to make the figure readable. It would be beneficial to automate the placement of residues on the plot in a way that minimizes the overlap between residues and the ligand, and minimizes crossing between interaction edges.
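As a starting point for thinking about automated placement, the sketch below uses a simple heuristic: residues are ordered by the direction of the ligand atom they interact with and then spread evenly on a circle around the ligand. Even spacing keeps residues from overlapping, and the angular ordering tends to reduce edge crossings. This is not how ProLIF currently works; the function name `place_residues`, the input format, and the heuristic itself are assumptions.

```python
import math


def place_residues(ligand_atoms, interactions, radius=5.0):
    """Place residues on a circle around the ligand centroid.

    `ligand_atoms` maps ligand atom names to 2D (x, y) coordinates and
    `interactions` maps each residue to the ligand atom it interacts with.
    Returns a dict of residue -> (x, y). Hypothetical helper for illustration.
    """
    if not interactions:
        return {}
    cx = sum(x for x, _ in ligand_atoms.values()) / len(ligand_atoms)
    cy = sum(y for _, y in ligand_atoms.values()) / len(ligand_atoms)

    def preferred_angle(residue):
        x, y = ligand_atoms[interactions[residue]]
        return math.atan2(y - cy, x - cx)

    ordered = sorted(interactions, key=preferred_angle)
    start = preferred_angle(ordered[0])  # first residue keeps its own direction
    step = 2 * math.pi / len(ordered)    # even spacing avoids overlaps
    return {
        residue: (
            cx + radius * math.cos(start + i * step),
            cy + radius * math.sin(start + i * step),
        )
        for i, residue in enumerate(ordered)
    }


ligand = {"C1": (1.0, 0.0), "N1": (-1.0, 0.0), "O1": (0.0, 1.0), "C2": (0.0, -1.0)}
layout = place_residues(ligand, {"ASP85": "N1", "TYR32": "C1", "SER12": "O1"})
```

A real solution would also need to handle residues with several interactions and the ligand's own extent, for instance with a force-directed or optimization-based layout.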

Depending on the size of the project, additional visualizations could be added, such as: a "heatmap" of ligand atoms involved in different interactions by highlighting such atoms with a bivariate Gaussian distribution based on an atomic interaction-aware weighting (see doi.org/10.1186/1758-2946-5-43 for reference). Another one could be to generate a figure similar to the LigNetwork with InteractionDrawer, converting the ProLIF metadata to follow their JSON schema. For protein-protein systems and residue network analysis, Flareplot would be very beneficial.

Expected Outcomes

The placement of residues on the LigNetwork plot is made to be readable out of the box
If time allows, additional kinds of plots are added to further improve ProLIF’s visualization capabilities

Relevant Skills

Python
JavaScript

Related issues/PRs/etc.:

Not applicable

Possible Mentors

@cbouy
@talagayev

Expected Size of Project

Small/Medium (90/175 hours)

Difficulty Rating

Medium

Project 8: Benchmarking and performance optimization

Summary

The goal of this project is to increase the performance assessment coverage (using the existing ASV framework), identify code that should be improved, and optimize code.

Detailed Description

The MDAnalysis Roadmap emphasizes performance improvement. The performance of the MDAnalysis library is assessed by automated nightly benchmarks with ASV (see https://github.com/MDAnalysis/benchmarks/wiki) but coverage of the code base is low. The goal of this project is to substantially increase the performance assessment coverage, identify code that should be improved, and possibly implement performance optimizations.
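For applicants who have not used ASV before, a benchmark is a plain Python class: ASV calls `setup` before timing and measures every method whose name starts with `time_`, once per value listed in `params`. The sketch below follows that convention but uses a pure-Python distance loop as a stand-in workload, so the class name and the workload are assumptions; a benchmark for this project would call MDAnalysis functionality instead.

```python
import math
import random


class PairwiseDistanceBench:
    """ASV-style benchmark with a stand-in workload (not MDAnalysis code)."""

    # ASV runs the benchmark once for each parameter value.
    params = [100, 400]
    param_names = ["n_atoms"]

    def setup(self, n_atoms):
        # Runs before timing, so data generation is excluded from the result.
        rng = random.Random(42)
        self.coords = [
            (rng.random(), rng.random(), rng.random()) for _ in range(n_atoms)
        ]

    def time_pairwise_distances(self, n_atoms):
        # Only the body of this method is timed.
        coords = self.coords
        for i in range(len(coords)):
            for j in range(i + 1, len(coords)):
                math.dist(coords[i], coords[j])
```

Keeping data preparation in `setup` is the main design point: it ensures that the reported time reflects the code under test and not the construction of its inputs.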

Expected Outcomes

Write ASV benchmark cases for all major functionality in the core library
Write ASV benchmark cases for often-used analysis tools
Analyze performance history and generate a priority list of code that should be improved
Document writing benchmarks with a short tutorial
Optional: Optimize performance for at least one discovered performance bottleneck

Relevant Skills

Python/ASV
Cython

Related issues/PRs/etc.:

Possible Mentors

@orbeckst
@ljwoods2

Expected Size of Project

Medium/Large (175/350 hours)

Difficulty Rating

Easy/Medium

Project 9: Integrating OpenFold’s structural prediction confidence metrics into the topology system

Summary

The goal of this project is to expose per-molecule and inter-molecular structural prediction confidence metrics to users via the MDAnalysis topology system.

Detailed Description

Structural prediction tools like OpenFold are increasingly important for in-silico estimations of protein structure and of binding probability between molecules. However, working with the raw output of prediction tools is challenging and often requires bespoke, researcher-written tools that are prone to inefficiency and errors. MDAnalysis’s topology system provides a robust, natural interface for working with per-residue metrics (like pLDDT) and between-residue/chain metrics (like a contact probability matrix). This project seeks to build the foundation for working with structural prediction confidence metrics in MDAnalysis.

After this project is complete, users will be able to access confidence metrics via NumPy arrays associated with (or between) each AtomGroup in a way that is consistent with current MDA atom selection.
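To give a feel for the parsing side of the problem, the sketch below collects a per-residue confidence value from PDB `ATOM` records. It assumes that pLDDT is stored in the B-factor column (columns 61-66), as AlphaFold-style PDB output does; whether that is the right source for every metric is something the project would need to establish. The function names are hypothetical, and this standalone parser is only an illustration, not a proposal for how the PDBParser should be modified.

```python
from collections import defaultdict


def per_residue_plddt(pdb_lines):
    """Average the B-factor column per (chain, residue number).

    Assumes pLDDT was written into the B-factor field of ATOM records.
    Hypothetical helper for illustration only.
    """
    values = defaultdict(list)
    for line in pdb_lines:
        if line.startswith("ATOM"):
            chain = line[21]
            resid = int(line[22:26])
            values[(chain, resid)].append(float(line[60:66]))
    return {key: sum(v) / len(v) for key, v in values.items()}


def atom_line(serial, name, resname, chain, resid, bfactor):
    """Build a fixed-column ATOM record with dummy coordinates."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resid:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{bfactor:6.2f}"
    )


sample = [
    atom_line(1, "N", "MET", "A", 1, 91.5),
    atom_line(2, "CA", "MET", "A", 1, 91.5),
    atom_line(3, "N", "LYS", "A", 2, 62.25),
]
plddt = per_residue_plddt(sample)
```

For this sample the result is 91.5 for residue 1 and 62.25 for residue 2 of chain A. Between-residue metrics such as contact probabilities do not fit into a PDB column and would need a separate input and data structure.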

Expected Outcomes

Modifications of the PDBParser to associate confidence metrics with and between each AtomGroup

Relevant Skills

OpenFold experience or structural prediction tools more generally
Python
Experience solving parsing problems

Related issues/PRs/etc.:

https://github.com/MDAnalysis/mdanalysis/issues/4134

Possible Mentors

@ljwoods2
@orbeckst

Expected Size of Project

Small (90 hours)

Difficulty Rating

Easy/Medium