
Initial commit for DNA datasets #209

Open
wants to merge 17 commits into
base: main

Conversation

cristianregep
Collaborator

@cristianregep cristianregep commented Sep 27, 2024

No description provided.

@cristianregep cristianregep changed the title initial commit for DNA datasets Initial commit for DNA datasets Sep 27, 2024
@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from e447987 to 701a343 Compare September 28, 2024 00:02
@jstjohn
Collaborator

jstjohn commented Sep 28, 2024

/build-ci

@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from c443454 to 271a3fc Compare September 28, 2024 00:28
@jstjohn
Collaborator

jstjohn commented Oct 2, 2024

/build-ci

@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from 271a3fc to 0aa8f6e Compare October 3, 2024 21:10
@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from 0aa8f6e to cea867c Compare October 3, 2024 21:15
Signed-off-by: cristianregep <[email protected]>
@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from cea867c to e663ca7 Compare October 3, 2024 21:44
@jstjohn
Collaborator

jstjohn commented Oct 3, 2024

/build-ci

@jstjohn
Collaborator

jstjohn commented Oct 4, 2024

/build-ci

Signed-off-by: cristianregep <[email protected]>
@malcolmgreaves
Collaborator

Hey @cristianregep, in trying to get this MR to pass through CI more quickly, I took the liberty of committing to your branch again. I renamed the test modules to prevent the name collision bug. I also renamed some things in ESM2 to (hopefully) prevent this same issue in the future too. 🤞 Thank you for your contribution & your patience! 🙏

@malcolmgreaves
Collaborator

/build-ci

@malcolmgreaves
Collaborator

Hey @cristianregep that was the only pytest error -- CI is passing now! 🎉

@malcolmgreaves
Collaborator

/build-ci

@cristianregep
Collaborator Author

Hey @cristianregep that was the only pytest error -- CI is passing now! 🎉

Thanks a lot. I'm going to sort out some of the other comments that came in the meantime, update to the latest changes in main, and resubmit.

@cristianregep
Collaborator Author

@malcolmgreaves did something change with the permissions for the repo? I can't seem to be able to push anymore. I've updated to the new git URL

@jstjohn
Collaborator

jstjohn commented Oct 18, 2024

@malcolmgreaves did something change with the permissions for the repo? I can't seem to be able to push anymore. I've updated to the new git URL

Yes, we have a new policy that users external to NVIDIA can't have write access to private repos. I re-enabled you for now, though, so we can get this in. In the future you can fork and open a PR from the fork. Also, we're about a week away from open-sourcing the whole thing and going public.

Signed-off-by: cristianregep <[email protected]>
@skothenhill-nv
Collaborator

We should find an alternative for PyFaidx. I don't think there is a straightforward solution to integrating it in a way that works well with PyTorch DataLoaders.

Reading in the pyfaidx inside the workers I think resolves the issue with pyfaidx (although I'm not 100% sure). But it has two problems:

  1. A race condition is introduced if the .fai file does not already exist: each worker will try to create it at once.
  2. Creation of the pyfaidx.Fasta object is quite slow (order of seconds, tens of seconds for gigabase-scale fastas).

Fortunately, in the last year the pyfaidx authors actually commented on the issue: it seems fundamental to the C implementation of faidx, as htslib is similarly not process safe.

mdshw5/pyfaidx#211
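One mitigation for the handle-sharing part of this is sketched below. This is a hypothetical pattern, not code from this PR: defer opening the fasta handle until first access in each process, so forked DataLoader workers never share a file handle. A plain dict stands in for pyfaidx.Fasta so the sketch is self-contained; it does not address the .fai creation race, which would still need the index built once before workers fork.

```python
import os


class LazyFastaDataset:
    """Defers opening a (not process-safe) fasta handle until first use.

    Each process re-opens its own handle, keyed by pid, so forked
    DataLoader workers never share one.
    """

    def __init__(self, fasta_path):
        self.fasta_path = fasta_path
        self._handle = None
        self._owner_pid = None

    def _fasta(self):
        # Re-open on first access in the current process; after a fork
        # the pid changes, so each worker builds its own handle.
        if self._handle is None or self._owner_pid != os.getpid():
            # In real code this would be pyfaidx.Fasta(self.fasta_path).
            self._handle = {"path": self.fasta_path}
            self._owner_pid = os.getpid()
        return self._handle
```

The per-pid check also keeps the dataset picklable before first use, which matters because DataLoader workers receive a copy of the dataset object.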

We should look for an alternative. I have two suggestions:

Check what we did in bionemo1; we have a branch on GitHub with an example. In that case, we used NeMo's CSVMemMapDataset with '>' as the key delimiter. To make this work, we had to drop the newlines from sequences in the underlying fasta file. Not a huge fan of this solution in general, but it has a very light memory footprint.

https://github.com/NVIDIA/bionemo-framework/blob/bionemo1/bionemo/data/fasta_dataset.py
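For reference, the newline-dropping preprocessing mentioned above can be sketched like this (a minimal in-memory version; the real bionemo1 code streams from disk and feeds a memory-mapped dataset):

```python
def flatten_fasta(lines):
    """Collapse multi-line fasta records so each sequence is one line.

    With newlines removed from sequences, '>' can act as a record
    delimiter for a memory-mapped line-oriented dataset.
    """
    out, seq = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if seq:  # flush the previous record's sequence
                out.append("".join(seq))
                seq = []
            out.append(line)
        else:
            seq.append(line)
    if seq:  # flush the final record
        out.append("".join(seq))
    return out
```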

Alternatively, we could implement a wrapper around a different fasta index implementation. I've been looking at the implementation here for some time; it's unclear if it's process safe, but it's relatively straightforward to wrap Rust objects as PyObjects (PyO3 does this, and it takes very little Rust knowledge to implement).

https://github.com/zaeleus/noodles

Let me know if we can be of help with this. A safe fasta reader has been on our radar for a while.

@skothenhill-nv
Collaborator

Alternatively, we could implement a wrapper around a different fasta index implementation. I've been looking at the implementation here for some time; it's unclear if it's process safe, but it's relatively straightforward to wrap Rust objects as PyObjects (PyO3 does this, and it takes very little Rust knowledge to implement).

https://github.com/zaeleus/noodles

I went ahead and got started on this; the noodles-faidx reader is thread safe. I need to do a little more profiling, but I should be able to get you an update soon @cristianregep

@cristianregep
Collaborator Author

Alternatively, we could implement a wrapper around a different fasta index implementation. I've been looking at the implementation here for some time; it's unclear if it's process safe, but it's relatively straightforward to wrap Rust objects as PyObjects (PyO3 does this, and it takes very little Rust knowledge to implement).
https://github.com/zaeleus/noodles

I went ahead and got started on this; the noodles-faidx reader is thread safe. I need to do a little more profiling, but I should be able to get you an update soon @cristianregep

Oh wow, this is great. Let me know if I can help, really happy to do so

@skothenhill-nv
Collaborator

@cristianregep feel free to work on it with us! effort is happening here (and in the target branch)

@skothenhill-nv
Collaborator

@cristianregep hey Cristian, we just merged a Python interface for a safe fasta index backed by mmaps. It should also help with memory usage.

Check out sub-packages/bionemo-noodles; the tests have some examples of using it with a dataset class. The interface is quite similar, but if there is missing functionality feel free to add it in this PR.

If you have any issues just let me know!

@cristianregep
Collaborator Author

@cristianregep hey Cristian, we just merged a Python interface for a safe fasta index backed by mmaps. It should also help with memory usage.

Check out sub-packages/bionemo-noodles; the tests have some examples of using it with a dataset class. The interface is quite similar, but if there is missing functionality feel free to add it in this PR.

If you have any issues just let me know!

Thanks, sorry to have been MIA, it's been intense delivery on the Relation side. This looks great

@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from 9ed84bd to b618829 Compare December 10, 2024 22:35
@cristianregep
Collaborator Author

@skothenhill-nv @ntadimeti I've updated my branch to use NvFaidx. I think this is good to go.

Collaborator

@skothenhill-nv skothenhill-nv left a comment

Left comments almost entirely around docstrings and providing more visibility into the assumptions and limitations of these dataloaders. I love the work, and let me know if you need or want help with any of these things. Cheers.

class Genome:
"""Class that creates a genome object from a fasta file. It can be used to extract sequences from the genome."""

def __init__(self, fasta_file: str | os.PathLike):
Collaborator

Suggested change
def __init__(self, fasta_file: str | os.PathLike):
def __init__(self, fasta_file: str | Path, faidx_file: Optional[str | Path]):

I think we prefer pathlib for this kind of thing. Also, passing in a faidx file would be nice if users want a little initialization-time boost; it just needs to be passed into NvFaidx.

"""Instantiate the class."""
fasta_file = Path(fasta_file)
assert fasta_file.exists(), "path to fasta file must exist"
self.seqs = NvFaidx(str(fasta_file))
Collaborator

Suggested change
self.seqs = NvFaidx(str(fasta_file))
self.seqs = NvFaidx(str(fasta_file), faidx_path=faidx_path)

Comment on lines +83 to +97
class GenomeDataset(Dataset, ABC):
"""Abstract class for datasets that return a DNA sequence."""

def __init__(self, tokenizer: DNATokenizer):
"""Instantiate the class."""
self.tokenizer = tokenizer

def __getitem__(self, ind: int) -> torch.tensor:
"""Get an item from the dataset."""
sequence = self.get_dna_sequence(ind)
return self.tokenizer(sequence)

def get_dna_sequence(self, ind: int) -> str:
"""Get the DNA sequence."""
...
Collaborator

Could you expand the docstrings here? No specific feedback other than adding arguments, but here's my perspective as a reader:

  • the constructor implies this is doing a transformation with the tokenizer, like a MappedDataset
  • it looks like ind is meant to provide a deterministic mapping from dataset 'index' to some DNA sequence.
  • I can think of how I might want to use this, but maybe providing a few concise examples in text would help future implementors (or just point them at your implementations below).
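As one way to make the contract concrete, a hypothetical implementor might look like the sketch below. The torch.utils.data.Dataset base and the real DNATokenizer type are omitted so it stands alone, and FixedListDataset is invented for illustration: `ind` deterministically selects a sequence, and `__getitem__` tokenizes it.

```python
from abc import ABC, abstractmethod


class GenomeDataset(ABC):
    """Abstract dataset: subclasses map an integer index to a DNA sequence."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __getitem__(self, ind):
        # The tokenizer transforms the raw sequence, MappedDataset-style.
        return self.tokenizer(self.get_dna_sequence(ind))

    @abstractmethod
    def get_dna_sequence(self, ind):
        """Deterministically map a dataset index to a DNA sequence."""


class FixedListDataset(GenomeDataset):
    """Hypothetical implementor: ind indexes a fixed list of sequences."""

    def __init__(self, sequences, tokenizer):
        super().__init__(tokenizer)
        self.sequences = sequences

    def get_dna_sequence(self, ind):
        return self.sequences[ind]
```

Here `str.lower` can stand in for a tokenizer to see the flow: `FixedListDataset(["ACGT"], tokenizer=str.lower)[0]` yields the tokenized form of the sequence at index 0.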

Comment on lines +55 to +62
def adjust_interval(self, interval: GenomeInterval, context_length: int) -> GenomeInterval:
"""Used to extend an interval by a fixed amount on both sides."""
start, end = interval.start, interval.end
interval_length = interval.end - interval.start
chromosome_length = self.seqs[interval.chromosome].length
if interval_length >= context_length:
return interval

Collaborator

This docstring really needs help. I arrived here from the method below:

It's clear that context_length expands the interval and that it is truncated if it reaches the max. What are the bounds of the sequence length this can return? It seems like it could return a region of any length, which is okay, but it would be best to be clear about that in the docstring.
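For illustration, here is a sketch of the symmetric-expansion behavior being discussed, with explicit clipping at contig boundaries. This is an assumption about the intended semantics, not the PR's exact code, and it makes the bounds question visible: near a contig end the returned interval can be shorter than context_length, and intervals already longer than context_length pass through unchanged.

```python
def adjust_interval(start, end, context_length, chromosome_length):
    """Extend [start, end) symmetrically toward context_length bases.

    Clips at 0 and at chromosome_length, so the result may be shorter
    than context_length near contig boundaries. Longer intervals are
    returned unchanged.
    """
    if end - start >= context_length:
        return start, end
    pad = context_length - (end - start)
    new_start = max(0, start - pad // 2)
    new_end = min(chromosome_length, end + (pad - pad // 2))
    return new_start, new_end
```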

Comment on lines +103 to +119
def __init__(
self,
genome: Genome,
bed_file_df: pl.DataFrame,
context_length: int | None = None,
chr_bed_to_fasta_map: dict = {},
**kwargs: Any,
):
"""Instantiate the class.

Args:
genome: Genome object
bed_file_df: A bed file in the form of a polars DataFrame with chrom, start, and end columns at least
context_length: Size of the window
chr_bed_to_fasta_map: Mapping between chromosome names in the bed file and the fasta file
**kwargs: Additional arguments
"""
Collaborator

I'm confused here. context_length is used to set the window size, but the returned sequence seems to be a function of both the interval size (which could vary within the bed file) and the context length (and potentially the end of the contig). Is this intentional? Or, more specifically, are there restrictions on the contents of bed_file_df that you're expecting?

If I were to use this dataset out of the box, depending on the model, I think I would need to transform the outputs to have a fixed size, with padding or truncation.
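The fixed-size transform described above could be as simple as this sketch. pad_id is a placeholder: a real pipeline would use the tokenizer's PAD id and would likely operate on tensors at collation time rather than on Python lists.

```python
def to_fixed_length(token_ids, size, pad_id=0):
    """Pad or truncate a token list to exactly `size` entries.

    Truncation keeps the leading tokens; padding appends pad_id.
    """
    if len(token_ids) >= size:
        return token_ids[:size]
    return token_ids + [pad_id] * (size - len(token_ids))
```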

Collaborator Author

This is actually from the previous BioNeMo, but I think I know where it comes from, because we also have this issue in-house. Back in the day, when language modelling wasn't a thing in DNA, there weren't tokenizers and tokens (e.g. Enformer): you one-hot encoded the sequence. For a batch you needed windows of the same size, and you didn't have PAD tokens. It was more biologically relevant to include more of the actual DNA sequence than to add Ns (if the context_length was longer than the intervals specified in your file), so you had to deal with this at the Dataset level rather than at the point of collation. The name is not very informative; I can change it to force_window_size, but let's align on a strategy here.

Comment on lines +29 to +38
"""Get the maximum allele index from a tuple of allele indices.

`sample.allele_indices` is a tuple of two indices that associate the sample with alleles
listed in the list 'variant.alleles', where the zeroth entry is the REF allele. A sample
could have one or both entries being 0, indicating that one or both strands are REF at
the specified position. Taking max(sample.allele_indices) ensures you always extract
an alternative allele if present, or REF otherwise.

Sometimes, None is contained in sample.allele_indices. This is treated as if it refers
to the REF allele (i.e. 0).
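The behavior that docstring describes reduces to a one-liner; a sketch (hypothetical function name, assuming a two-entry tuple as described, with None treated as REF):

```python
def max_allele_index(allele_indices):
    """Return the max allele index, treating None entries as REF (0).

    Picks an ALT allele whenever either strand carries one.
    """
    return max(0 if ind is None else ind for ind in allele_indices)
```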
Collaborator

nice docstring, super helpful!



def _match_contig_name(chrm: str, available_contigs: set[int | str]) -> str:
"""Let's try either the standardized version that I expect to be "chr22" or only the number string "22"."""
Collaborator

My opinion: we should just fail in this case. It's too easy to end up with a reference that has the wrong number of Ns or something else stupid.

Collaborator Author

Happy to do that, but FYI: internally we did this because hg19 and hg38 VCF files used different conventions (https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19). We wanted it to work with both. Theoretically you could say that the user needs to take care of this, but it becomes less user-friendly if you get into situations where windows on top of TSSs come from Gencode, which uses a specific standard.
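For concreteness, the prefix-tolerant lookup being debated might look like this sketch (a hypothetical helper; the PR's _match_contig_name may differ). It tries the name as given, then with or without the "chr" prefix, and raises rather than silently guessing when neither form exists:

```python
def match_contig_name(chrm, available_contigs):
    """Resolve a contig name across 'chr22' vs '22' naming conventions.

    Tries the name as given, then the alternate prefixed/unprefixed
    form; raises KeyError if neither is present in the reference.
    """
    candidates = [chrm]
    if chrm.startswith("chr"):
        candidates.append(chrm[3:])
    else:
        candidates.append("chr" + chrm)
    for name in candidates:
        if name in available_contigs:
            return name
    raise KeyError(f"contig {chrm!r} not found in reference")
```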

Collaborator

Fine with me. Per our call: add a warning if the seqids don't match between the VCF and the reference.

Comment on lines +81 to +83
# TODO: sometimes the variant overlaps the boundary of a window. The fetch function is zero-based,
# so the interval works fine; the problem is that sometimes a variant can indeed
# overlap the region even if its start is outside it.
Collaborator

:( perhaps we should make this a param?

allow_partial_overlap

I'd set it to False by default. Also you should move this comment to the docstring.

Collaborator Author

good point

Comment on lines +179 to +181
def get_dna_sequence(self, ind: int) -> str:
"""Get an item from the dataset."""
interval_index = ind // len(self.sample_ids)
Collaborator

This is a great place to surface and document all the edge cases you handle when mutating the reference sequence.
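The index arithmetic in get_dna_sequence above flattens (interval, sample) pairs into one index; a sketch of the mapping (hypothetical helper name, assuming samples vary fastest as the `ind // len(self.sample_ids)` line implies):

```python
def split_index(ind, n_samples):
    """Recover (interval_index, sample_index) from a flat dataset index.

    The flat index interleaves samples within each interval, so divmod
    by the number of samples splits it back into both components.
    """
    interval_index, sample_index = divmod(ind, n_samples)
    return interval_index, sample_index
```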

Comment on lines +23 to +26
def test_mutate():
sequence = "AAAAAAAAAAAAAAAAAAAA"
interval = GenomeInterval("chr1", 1, 10)
variants_df = pd.DataFrame([{"POS": 5, "REF": "A", "ALT": "G"}, {"POS": 2, "REF": "AA", "ALT": "TC"}])
Collaborator

Can we add tests for the _mutate function directly?

I think we should also cover:

  • variants that overlap region boundaries
  • variants that are not allowed (indels!)
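A sketch of what such tests might exercise, using a hypothetical stand-in for _mutate (the real signature may differ): it applies same-length substitutions, skips variants whose start falls outside the window, and rejects indels.

```python
def apply_snvs(sequence, window_start, variants):
    """Apply (pos, ref, alt) substitutions to a window of the reference.

    Same-length substitutions only; variants starting outside the
    window are skipped, and indels raise ValueError.
    """
    seq = list(sequence)
    window_end = window_start + len(sequence)
    for pos, ref, alt in variants:
        if len(ref) != len(alt):
            raise ValueError("indels are not supported")
        if not (window_start <= pos < window_end):
            continue  # variant starts outside the window
        offset = pos - window_start
        seq[offset:offset + len(alt)] = list(alt)
    return "".join(seq)
```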

Collaborator Author

With regards to _mutate, that was the case previously, but I was asked to change it: #209 (comment). I can add the corner cases at any level of abstraction, but it would be useful to have a general guideline on this.

Collaborator

Ah, I missed that. Feel free to follow Peter's advice, or to add the test.

7 participants