
Initial commit for DNA datasets #209

Open
wants to merge 17 commits into
base: main

Conversation

cristianregep
Collaborator

@cristianregep cristianregep commented Sep 27, 2024

No description provided.

@cristianregep cristianregep changed the title initial commit for DNA datasets Initial commit for DNA datasets Sep 27, 2024
@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from e447987 to 701a343 Compare September 28, 2024 00:02
@jstjohn
Collaborator

jstjohn commented Sep 28, 2024

/build-ci

@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from c443454 to 271a3fc Compare September 28, 2024 00:28
@jstjohn
Collaborator

jstjohn commented Oct 2, 2024

/build-ci

@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from 271a3fc to 0aa8f6e Compare October 3, 2024 21:10
@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from 0aa8f6e to cea867c Compare October 3, 2024 21:15
Signed-off-by: cristianregep <[email protected]>
@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from cea867c to e663ca7 Compare October 3, 2024 21:44
@jstjohn
Collaborator

jstjohn commented Oct 3, 2024

/build-ci

@jstjohn
Collaborator

jstjohn commented Oct 4, 2024

/build-ci

Signed-off-by: cristianregep <[email protected]>
@malcolmgreaves
Collaborator

Hey @cristianregep, in trying to get this MR to pass through CI more quickly, I took the liberty of committing to your branch again. I renamed the test modules to prevent the name collision bug. I also renamed some things in ESM2 to (hopefully) prevent this same issue in the future too. 🤞 Thank you for your contribution & your patience! 🙏

@malcolmgreaves
Collaborator

/build-ci

@malcolmgreaves
Collaborator

Hey @cristianregep that was the only pytest error -- CI is passing now! 🎉

@malcolmgreaves
Collaborator

/build-ci

@cristianregep
Collaborator Author

Hey @cristianregep that was the only pytest error -- CI is passing now! 🎉

Thanks a lot. I'm going to sort out some of the other comments that came in the meantime, update to the latest changes in main, and resubmit.

@cristianregep
Collaborator Author

@malcolmgreaves did something change with the permissions for the repo? I can't seem to be able to push anymore. I've updated to the new git URL

@jstjohn
Collaborator

jstjohn commented Oct 18, 2024

@malcolmgreaves did something change with the permissions for the repo? I can't seem to be able to push anymore. I've updated to the new git URL

Yes, we have a new policy that users external to NVIDIA can't have write access to private repos. I re-enabled you for now, though, so we can get this in. In the future you can fork and open a PR from the fork. Also, we're about a week away from open-sourcing the whole thing and going public.

Signed-off-by: cristianregep <[email protected]>
@skothenhill-nv
Collaborator

We should find an alternative for PyFaidx. I don't think there is a straightforward solution to integrating it in a way that works well with PyTorch DataLoaders.

Reading in the pyfaidx inside the workers I think resolves the issue with pyfaidx (although I'm not 100% sure). But it has two problems:

  1. A race condition is introduced if the .fai file does not already exist: each worker will try to create it at once.
  2. Creation of the pyfaidx.Fasta object is quite slow (order of seconds, tens of seconds for gigabase-scale fastas).

Fortunately, in the last year the pyfaidx authors actually commented on the issue: it seems fundamental to the C implementation of faidx, as htslib is similarly not process safe.

mdshw5/pyfaidx#211
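One mitigation for the handle-sharing part of this is sketched below. This is a hypothetical pattern, not code from this PR: defer opening the fasta handle until first access in each process, so forked DataLoader workers never share a file handle. A plain dict stands in for pyfaidx.Fasta so the sketch is self-contained; it does not address the .fai creation race, which would still need the index built once before workers fork.

```python
import os


class LazyFastaDataset:
    """Defers opening a (not process-safe) fasta handle until first use.

    Each process re-opens its own handle, keyed by pid, so forked
    DataLoader workers never share one.
    """

    def __init__(self, fasta_path):
        self.fasta_path = fasta_path
        self._handle = None
        self._owner_pid = None

    def _fasta(self):
        # Re-open on first access in the current process; after a fork
        # the pid changes, so each worker builds its own handle.
        if self._handle is None or self._owner_pid != os.getpid():
            # In real code this would be pyfaidx.Fasta(self.fasta_path).
            self._handle = {"path": self.fasta_path}
            self._owner_pid = os.getpid()
        return self._handle
```

The per-pid check also keeps the dataset picklable before first use, which matters because DataLoader workers receive a copy of the dataset object.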

We should look for an alternative. I have two suggestions:

Check what we did in bionemo1; we have a branch on GitHub with an example. In that case, we used NeMo's CSVMemMapDataset with '>' as the key delimiter. To make this work, we had to drop the newlines from sequences in the underlying fasta file. Not a huge fan of this solution in general, but it has a very light memory footprint.

https://github.com/NVIDIA/bionemo-framework/blob/bionemo1/bionemo/data/fasta_dataset.py
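For reference, the newline-dropping preprocessing mentioned above can be sketched like this (a minimal in-memory version; the real bionemo1 code streams from disk and feeds a memory-mapped dataset):

```python
def flatten_fasta(lines):
    """Collapse multi-line fasta records so each sequence is one line.

    With newlines removed from sequences, '>' can act as a record
    delimiter for a memory-mapped line-oriented dataset.
    """
    out, seq = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if seq:  # flush the previous record's sequence
                out.append("".join(seq))
                seq = []
            out.append(line)
        else:
            seq.append(line)
    if seq:  # flush the final record
        out.append("".join(seq))
    return out
```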

Alternatively, we could implement a wrapper around a different fasta index implementation. I've been looking at the implementation here for some time; it's unclear if it's process safe, but it's relatively straightforward to wrap Rust objects as PyObjects (PyO3 does this, and it takes very little Rust knowledge to implement).

https://github.com/zaeleus/noodles

Let me know if we can be of help with this. A safe fasta reader has been on our radar for a while.

@skothenhill-nv
Collaborator

Alternatively, we could implement a wrapper around a different fasta index implementation. I've been looking at the implementation here for some time; it's unclear if it's process safe, but it's relatively straightforward to wrap Rust objects as PyObjects (PyO3 does this, and it takes very little Rust knowledge to implement).

https://github.com/zaeleus/noodles

I went ahead and got started on this; the noodles-faidx reader is thread safe. I need to do a little more profiling, but I should be able to get you an update soon @cristianregep

@cristianregep
Collaborator Author

Alternatively, we could implement a wrapper around a different fasta index implementation. I've been looking at the implementation here for some time; it's unclear if it's process safe, but it's relatively straightforward to wrap Rust objects as PyObjects (PyO3 does this, and it takes very little Rust knowledge to implement).
https://github.com/zaeleus/noodles

I went ahead and got started on this; the noodles-faidx reader is thread safe. I need to do a little more profiling, but I should be able to get you an update soon @cristianregep

Oh wow, this is great. Let me know if I can help, really happy to do so

@skothenhill-nv
Collaborator

@cristianregep feel free to work on it with us! effort is happening here (and in the target branch)

@skothenhill-nv
Collaborator

@cristianregep hey Cristian, we just merged a Python interface for a safe fasta index backed by mmaps. It should also help with memory usage.

Check out sub-packages/bionemo-noodles; the tests have some examples of using it with a dataset class. The interface is quite similar, but if there is missing functionality feel free to add it in this PR.

If you have any issues just let me know!

@cristianregep
Collaborator Author

@cristianregep hey Cristian, we just merged a Python interface for a safe fasta index backed by mmaps. It should also help with memory usage.

Check out sub-packages/bionemo-noodles; the tests have some examples of using it with a dataset class. The interface is quite similar, but if there is missing functionality feel free to add it in this PR.

If you have any issues just let me know!

Thanks, sorry to have been MIA, it's been intense delivery on the Relation side. This looks great

@cristianregep cristianregep force-pushed the cristianregep/dna_models_vcf_dataloader branch from 9ed84bd to b618829 Compare December 10, 2024 22:35
@cristianregep
Collaborator Author

@skothenhill-nv @ntadimeti I've updated my branch to use NvFaidx. I think this is good to go.

Collaborator

@skothenhill-nv skothenhill-nv left a comment

Left comments almost entirely around docstrings and providing more visibility into the assumptions and limitations of these dataloaders. I love the work, and let me know if you need or want help with any of these things. Cheers.

class Genome:
"""Class that creates a genome object from a fasta file. It can be used to extract sequences from the genome."""

def __init__(self, fasta_file: str | os.PathLike):
Collaborator

Suggested change
def __init__(self, fasta_file: str | os.PathLike):
def __init__(self, fasta_file: str | Path, faidx_file: Optional[str | Path]):

I think we prefer pathlib for this kind of thing. Also, passing in a faidx file would be nice if users want a little initialization-time boost; it just needs to be passed into NvFaidx.

"""Instantiate the class."""
fasta_file = Path(fasta_file)
assert fasta_file.exists(), "path to fasta file must exist"
self.seqs = NvFaidx(str(fasta_file))
Collaborator

Suggested change
self.seqs = NvFaidx(str(fasta_file))
self.seqs = NvFaidx(str(fasta_file), faidx_path=faidx_path)

Comment on lines +83 to +97
class GenomeDataset(Dataset, ABC):
"""Abstract class for datasets that return a DNA sequence."""

def __init__(self, tokenizer: DNATokenizer):
"""Instantiate the class."""
self.tokenizer = tokenizer

def __getitem__(self, ind: int) -> torch.tensor:
"""Get an item from the dataset."""
sequence = self.get_dna_sequence(ind)
return self.tokenizer(sequence)

def get_dna_sequence(self, ind: int) -> str:
"""Get the DNA sequence."""
...
Collaborator

Could you expand the docstrings here? No specific feedback other than adding arguments, but here's my perspective as a reader:

  • the constructor implies this is doing a transformation with the tokenizer, like a MappedDataset
  • it looks like ind is meant to provide a deterministic mapping from dataset 'index' to some DNA sequence.
  • I can think of how I might want to use this, but maybe providing a few concise examples in text would help future implementors (or just point them at your implementations below).
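As one way to make the contract concrete, a hypothetical implementor might look like the sketch below. The torch.utils.data.Dataset base and the real DNATokenizer type are omitted so it stands alone, and FixedListDataset is invented for illustration: `ind` deterministically selects a sequence, and `__getitem__` tokenizes it.

```python
from abc import ABC, abstractmethod


class GenomeDataset(ABC):
    """Abstract dataset: subclasses map an integer index to a DNA sequence."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __getitem__(self, ind):
        # The tokenizer transforms the raw sequence, MappedDataset-style.
        return self.tokenizer(self.get_dna_sequence(ind))

    @abstractmethod
    def get_dna_sequence(self, ind):
        """Deterministically map a dataset index to a DNA sequence."""


class FixedListDataset(GenomeDataset):
    """Hypothetical implementor: ind indexes a fixed list of sequences."""

    def __init__(self, sequences, tokenizer):
        super().__init__(tokenizer)
        self.sequences = sequences

    def get_dna_sequence(self, ind):
        return self.sequences[ind]
```

Here `str.lower` can stand in for a tokenizer to see the flow: `FixedListDataset(["ACGT"], tokenizer=str.lower)[0]` yields the tokenized form of the sequence at index 0.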

Comment on lines +55 to +62
def adjust_interval(self, interval: GenomeInterval, context_length: int) -> GenomeInterval:
"""Used to extend an interval by a fixed amount on both sides."""
start, end = interval.start, interval.end
interval_length = interval.end - interval.start
chromosome_length = self.seqs[interval.chromosome].length
if interval_length >= context_length:
return interval

Collaborator

This docstring really needs help. I arrived here from the method below:

It's clear that context_length expands the interval and that it is truncated if it reaches the max. What are the bounds of the sequence length this can return? It seems like it could return a region of any length, which is okay, but it would be best to be clear about that in the docstring.
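For illustration, here is a sketch of the symmetric-expansion behavior being discussed, with explicit clipping at contig boundaries. This is an assumption about the intended semantics, not the PR's exact code, and it makes the bounds question visible: near a contig end the returned interval can be shorter than context_length, and intervals already longer than context_length pass through unchanged.

```python
def adjust_interval(start, end, context_length, chromosome_length):
    """Extend [start, end) symmetrically toward context_length bases.

    Clips at 0 and at chromosome_length, so the result may be shorter
    than context_length near contig boundaries. Longer intervals are
    returned unchanged.
    """
    if end - start >= context_length:
        return start, end
    pad = context_length - (end - start)
    new_start = max(0, start - pad // 2)
    new_end = min(chromosome_length, end + (pad - pad // 2))
    return new_start, new_end
```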

Comment on lines +103 to +119
def __init__(
self,
genome: Genome,
bed_file_df: pl.DataFrame,
context_length: int | None = None,
chr_bed_to_fasta_map: dict = {},
**kwargs: Any,
):
"""Instantiate the class.

Args:
genome: Genome object
bed_file_df: A bed file in the form of a polars DataFrame with chrom, start, and end columns at least
context_length: Size of the window
chr_bed_to_fasta_map: Mapping between chromosome names in the bed file and the fasta file
**kwargs: Additional arguments
"""
Collaborator

I'm confused here. context_length is used to set the window size, but the returned sequence seems to be a function of both the interval size (which could vary within the bed file) and the context length (and potentially the end of the contig). Is this intentional? Or, more specifically, are there restrictions on the contents of bed_file_df that you're expecting?

If I were to use this dataset out of the box, depending on the model, I think I would need to transform the outputs to have a fixed size, with padding or truncation.
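The fixed-size transform described above could be as simple as this sketch. pad_id is a placeholder: a real pipeline would use the tokenizer's PAD id and would likely operate on tensors at collation time rather than on Python lists.

```python
def to_fixed_length(token_ids, size, pad_id=0):
    """Pad or truncate a token list to exactly `size` entries.

    Truncation keeps the leading tokens; padding appends pad_id.
    """
    if len(token_ids) >= size:
        return token_ids[:size]
    return token_ids + [pad_id] * (size - len(token_ids))
```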

Collaborator Author

This is actually from the previous BioNeMo, but I think I know where it comes from, because we also have this issue in-house. Back in the day, when language modelling wasn't a thing in DNA, there weren't tokenizers and tokens (e.g. Enformer): you one-hot encoded the sequence. For a batch you needed windows of the same size, and you didn't have PAD tokens. It was more biologically relevant to include more of the actual DNA sequence than to add Ns (if the context_length was longer than the intervals specified in your file), so you had to deal with this at the Dataset level rather than at the point of collation. The name is not very informative; I can change it to force_window_size, but let's align on a strategy here.

Comment on lines +29 to +38
"""Get the maximum allele index from a tuple of allele indices.

`sample.allele_indices` is a tuple of two indices that associate the sample with alleles
listed in the list 'variant.alleles', where the zeroth entry is the REF allele. A sample
could have one or both entries being 0, indicating that one or both strands are REF at
the specified position. Taking max(sample.allele_indices) ensures you always extract
an alternative allele if present, or REF otherwise.

Sometimes, None is contained in sample.allele_indices. This is treated as if it refers
to the REF allele (i.e. 0).
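The behavior that docstring describes reduces to a one-liner; a sketch (hypothetical function name, assuming a two-entry tuple as described, with None treated as REF):

```python
def max_allele_index(allele_indices):
    """Return the max allele index, treating None entries as REF (0).

    Picks an ALT allele whenever either strand carries one.
    """
    return max(0 if ind is None else ind for ind in allele_indices)
```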
Collaborator

nice docstring, super helpful!



def _match_contig_name(chrm: str, available_contigs: set[int | str]) -> str:
"""Let's try either the standardized version that I expect to be "chr22" or only the number string "22"."""
Collaborator

My opinion: we should just fail in this case. It's too easy to end up with a reference that has the wrong number of Ns or something else stupid.

Collaborator Author

Happy to do that, but FYI: internally we did this because hg19 and hg38 VCF files used different conventions (https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19). We wanted it to work with both. Theoretically you could say that the user needs to take care of this, but it becomes less user-friendly if you get into situations where windows on top of TSSs come from Gencode, which uses a specific standard.
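For concreteness, the prefix-tolerant lookup being debated might look like this sketch (a hypothetical helper; the PR's _match_contig_name may differ). It tries the name as given, then with or without the "chr" prefix, and raises rather than silently guessing when neither form exists:

```python
def match_contig_name(chrm, available_contigs):
    """Resolve a contig name across 'chr22' vs '22' naming conventions.

    Tries the name as given, then the alternate prefixed/unprefixed
    form; raises KeyError if neither is present in the reference.
    """
    candidates = [chrm]
    if chrm.startswith("chr"):
        candidates.append(chrm[3:])
    else:
        candidates.append("chr" + chrm)
    for name in candidates:
        if name in available_contigs:
            return name
    raise KeyError(f"contig {chrm!r} not found in reference")
```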

Collaborator

Fine with me. Per our call: add a warning if the seqids don't match between the VCF and the reference.

Comment on lines +81 to +83
# TODO: sometimes the variant overlaps the boundary of a window. The fetch function is zero-based,
# so the interval works fine; the problem is that sometimes a variant can indeed
# overlap the region even if its start is outside it.
Collaborator

:( perhaps we should make this a param?

allow_partial_overlap

I'd set it to False by default. Also you should move this comment to the docstring.

Collaborator Author

good point

Comment on lines +179 to +181
def get_dna_sequence(self, ind: int) -> str:
"""Get an item from the dataset."""
interval_index = ind // len(self.sample_ids)
Collaborator

This is a great place to surface and document all the edge cases you handle when mutating the reference sequence.
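The index arithmetic in get_dna_sequence above flattens (interval, sample) pairs into one index; a sketch of the mapping (hypothetical helper name, assuming samples vary fastest as the `ind // len(self.sample_ids)` line implies):

```python
def split_index(ind, n_samples):
    """Recover (interval_index, sample_index) from a flat dataset index.

    The flat index interleaves samples within each interval, so divmod
    by the number of samples splits it back into both components.
    """
    interval_index, sample_index = divmod(ind, n_samples)
    return interval_index, sample_index
```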

Comment on lines +23 to +26
def test_mutate():
sequence = "AAAAAAAAAAAAAAAAAAAA"
interval = GenomeInterval("chr1", 1, 10)
variants_df = pd.DataFrame([{"POS": 5, "REF": "A", "ALT": "G"}, {"POS": 2, "REF": "AA", "ALT": "TC"}])
Collaborator

Can we add tests for the _mutate function directly?

I think we should also cover:

  • variants that overlap region boundaries
  • variants that are not allowed (indels!)
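A sketch of what such tests might exercise, using a hypothetical stand-in for _mutate (the real signature may differ): it applies same-length substitutions, skips variants whose start falls outside the window, and rejects indels.

```python
def apply_snvs(sequence, window_start, variants):
    """Apply (pos, ref, alt) substitutions to a window of the reference.

    Same-length substitutions only; variants starting outside the
    window are skipped, and indels raise ValueError.
    """
    seq = list(sequence)
    window_end = window_start + len(sequence)
    for pos, ref, alt in variants:
        if len(ref) != len(alt):
            raise ValueError("indels are not supported")
        if not (window_start <= pos < window_end):
            continue  # variant starts outside the window
        offset = pos - window_start
        seq[offset:offset + len(alt)] = list(alt)
    return "".join(seq)
```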

Collaborator Author

With regards to _mutate, that was the case previously, but I was asked to change it: #209 (comment). I can add the corner cases at any level of abstraction, but it would be useful to have a general guideline on this.

Collaborator

Ah, I missed that. Feel free to follow Peter's advice, or to add the test.

7 participants