Initial commit for DNA datasets #209
base: main
Conversation
Force-pushed from e447987 to 701a343
/build-ci
Force-pushed from c443454 to 271a3fc
/build-ci
Force-pushed from 271a3fc to 0aa8f6e
Force-pushed from 0aa8f6e to cea867c
Signed-off-by: cristianregep <[email protected]>
Force-pushed from cea867c to e663ca7
/build-ci
Signed-off-by: cristianregep <[email protected]>
/build-ci
Signed-off-by: cristianregep <[email protected]>
Hey @cristianregep, in trying to get this PR to pass through CI more quickly, I took the liberty of committing to your branch again. I renamed the test modules to prevent the name collision bug. I also renamed some things in ESM2 to (hopefully) prevent this same issue in the future too. 🤞 Thank you for your contribution & your patience! 🙏
/build-ci
Hey @cristianregep that was the only
/build-ci
sub-packages/bionemo-dnadl/tests/bionemo/dnadl/io/test_dnadl_dataset.py
Thanks a lot. I'm going to sort out some of the other comments that came in the meantime, update to the latest changes in main, and resubmit.
@malcolmgreaves did something change with the permissions for the repo? I can't seem to push anymore. I've updated to the new git URL.
Yes, we have a new policy that users external to NVIDIA can't have write access to private repos. I've re-enabled you for now, though, so we can get this in. In the future you can fork and open a PR from the fork. Also, we're about a week away from open-sourcing the whole thing and going public.
Signed-off-by: cristianregep <[email protected]>
We should find an alternative for PyFaidx. I don't think there is a straightforward solution for integrating it in a way that works well with PyTorch DataLoaders. Opening the pyfaidx handle inside the workers, I think, resolves the issue with pyfaidx (although I'm not 100% sure). But it has two problems:
Fortunately, in the last year the pyfaidx authors actually commented on the issue; it seems fundamental to the C implementation of faidx, as htslib is similarly not process-safe. We should look for an alternative. I have two suggestions:
- Check what we did in https://github.com/NVIDIA/bionemo-framework/blob/bionemo1/bionemo/data/fasta_dataset.py
- Alternatively, we could implement a wrapper on a different fasta index implementation. I've been looking at the implementation here for some time. It's unclear if it's process-safe, but it's relatively straightforward to wrap Rust objects as PyObjects (PyO3 does this, and takes very little Rust knowledge to implement): https://github.com/zaeleus/noodles

Let me know if we can be of help with this. A safe fasta reader has been on our radar for a while.
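One commonly discussed workaround for the worker-sharing problem is to open the index lazily inside each process, so a fork never inherits a live handle. A minimal sketch of that pattern (the class name and `opener` hook are hypothetical stand-ins for e.g. `pyfaidx.Fasta` or NvFaidx, not this PR's API):

```python
import os


class LazyPerProcessHandle:
    """Sketch of the worker-local-handle pattern: the underlying index
    is (re)opened on first use in each process, so a forked DataLoader
    worker never shares a file handle with the parent process."""

    def __init__(self, path, opener):
        self.path = path
        self.opener = opener      # e.g. pyfaidx.Fasta (assumption, injected here)
        self._handle = None
        self._pid = None          # pid that owns the current handle

    def get(self):
        # Re-open if no handle exists yet, or if we are in a new process.
        if self._handle is None or self._pid != os.getpid():
            self._handle = self.opener(self.path)
            self._pid = os.getpid()
        return self._handle
```

This avoids pickling the handle entirely: only the path crosses the process boundary, and each worker pays one open per process.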
I went ahead and got started on this; the noodles-faidx reader is thread-safe. I need to do a little more profiling, but I should be able to get you an update soon @cristianregep
Oh wow, this is great. Let me know if I can help, really happy to do so
@cristianregep feel free to work on it with us! Effort is happening here (and in the target branch)
@cristianregep hey Cristian, we just merged a Python interface for a safe fasta index backed by mmaps. It should also help with memory usage. Check it out, and if you have any issues just let me know!
Thanks, and sorry to have been MIA; it's been an intense delivery period on the Relation side. This looks great
Signed-off-by: cristianregep <[email protected]>
Force-pushed from 9ed84bd to b618829
@skothenhill-nv @ntadimeti I've updated my branch to use NvFaidx. I think this is good to go
Left comments almost entirely around docstrings and providing more visibility to the assumptions and limitations of these dataloaders. I love the work, and let me know if you need or want help with any of these things. Cheers.
```python
class Genome:
    """Class that creates a genome object from a fasta file. It can be used to extract sequences from the genome."""

    def __init__(self, fasta_file: str | os.PathLike):
```
```diff
- def __init__(self, fasta_file: str | os.PathLike):
+ def __init__(self, fasta_file: str | Path, faidx_file: Optional[str | Path]):
```
I think we prefer pathlib for this kind of thing. Also, passing in a faidx file would be nice if users want a little initialization-time boost; it just needs to be passed into NvFaidx.
```python
        """Instantiate the class."""
        fasta_file = Path(fasta_file)
        assert fasta_file.exists(), "path to fasta file must exist"
        self.seqs = NvFaidx(str(fasta_file))
```
```diff
- self.seqs = NvFaidx(str(fasta_file))
+ self.seqs = NvFaidx(str(fasta_file), faidx_path=faidx_path)
```
```python
class GenomeDataset(Dataset, ABC):
    """Abstract class for datasets that return a DNA sequence."""

    def __init__(self, tokenizer: DNATokenizer):
        """Instantiate the class."""
        self.tokenizer = tokenizer

    def __getitem__(self, ind: int) -> torch.tensor:
        """Get an item from the dataset."""
        sequence = self.get_dna_sequence(ind)
        return self.tokenizer(sequence)

    def get_dna_sequence(self, ind: int) -> str:
        """Get the DNA sequence."""
        ...
```
Could you expand the docstrings here? No specific feedback other than adding arguments, but here's my perspective as a reader:
- the constructor implies this is doing a transformation with the tokenizer, like a MappedDataset
- it looks like `ind` is meant to provide a deterministic mapping from dataset 'index' to some DNA sequence
- I can think of how I might want to use this, but maybe providing a few concise examples in text would help future implementors (or just point them at your implementations below).
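For illustration, the `ind` → sequence contract described above could be shown to implementors with a minimal concrete subclass along these lines (names like `FixedIntervalDataset`, `intervals`, and `fetch` are hypothetical stand-ins, not this PR's real API, and the PyTorch base class is omitted to keep the sketch self-contained):

```python
class FixedIntervalDataset:
    """Sketch: `ind` deterministically selects a genomic interval, and
    __getitem__ tokenizes the sequence fetched for that interval."""

    def __init__(self, intervals, fetch, tokenizer):
        self.intervals = intervals   # list of (chrom, start, end) tuples
        self.fetch = fetch           # callable: (chrom, start, end) -> str
        self.tokenizer = tokenizer   # callable: str -> tokens

    def __len__(self):
        return len(self.intervals)

    def get_dna_sequence(self, ind):
        chrom, start, end = self.intervals[ind]
        return self.fetch(chrom, start, end)

    def __getitem__(self, ind):
        return self.tokenizer(self.get_dna_sequence(ind))
```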
```python
def adjust_interval(self, interval: GenomeInterval, context_length: int) -> GenomeInterval:
    """Used to extend an interval by a fixed amount on both sides."""
    start, end = interval.start, interval.end
    interval_length = interval.end - interval.start
    chromosome_length = self.seqs[interval.chromosome].length
    if interval_length >= context_length:
        return interval
```
This docstring really needs help. I arrived here from the method below: it's clear that context_length expands the interval, and that the result is truncated if it reaches the max. What are the bounds of the sequence length this can return? It seems like it could return a region of any length, which is okay, but it would be best to be clear about that in the docstring.
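To make the truncation-near-contig-ends question concrete, here is a standalone sketch of symmetric expansion with clamping (a hypothetical stand-in for `adjust_interval`, working on plain coordinates rather than `GenomeInterval` objects; the exact splitting of the padding is an assumption):

```python
def expand_interval(start, end, context_length, chromosome_length):
    """Grow [start, end) to context_length, roughly centered on the original
    interval, clamped to [0, chromosome_length). Note the result can be
    shorter than context_length near contig ends -- exactly the bound the
    docstring should state explicitly."""
    length = end - start
    if length >= context_length:
        return start, end
    pad = context_length - length
    new_start = max(0, start - pad // 2)
    new_end = min(chromosome_length, new_start + context_length)
    return new_start, new_end
```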
```python
def __init__(
    self,
    genome: Genome,
    bed_file_df: pl.DataFrame,
    context_length: int | None = None,
    chr_bed_to_fasta_map: dict = {},
    **kwargs: Any,
):
    """Instantiate the class.

    Args:
        genome: Genome object
        bed_file_df: A bed file in the form of a polars DataFrame with chrom, start, and end columns at least
        context_length: Size of the window
        chr_bed_to_fasta_map: Mapping between chromosome names in the bed file and the fasta file
        **kwargs: Additional arguments
    """
```
I'm confused here. Context-length is used to set the window size, but the sequence returned seems to be a function of the interval size (which could vary within the bed file), the context length, and potentially the end of the contig. Is this intentional? Or, more specifically, are there restrictions on the contents of bed_file_df that you're expecting?
If I were to use this dataset out of the box, depending on the model, I think I would need to transform the outputs to have a fixed size, with padding or truncation.
This is actually from the previous BioNeMo, but I think I know where it comes from, because we also have this issue in-house. Back in the day, when language modelling wasn't a thing in DNA, there weren't tokenizers and tokens (e.g. Enformer); you one-hot encoded the sequence. For a batch you needed windows of the same size, and you didn't have PAD tokens. It was more biologically relevant to include more of the actual DNA sequence than to add Ns (if the context_length was longer than the intervals specified in your file), so you had to deal with this at the Dataset level rather than at the point of collation. The name is not very informative; I can change it to force_window_size, but let's align on a strategy here.
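The fixed-size transform mentioned above (padding or truncating tokenized outputs so a batch can be stacked) could be sketched as follows. This is a generic illustration, not this PR's code; `pad_id` is a hypothetical PAD token id, and right-padding/right-truncation is an assumed convention:

```python
def fix_length(tokens, target_len, pad_id=0):
    """Pad (on the right) or truncate a token list to exactly target_len,
    so variable-length dataset outputs can be collated into a batch."""
    if len(tokens) >= target_len:
        return tokens[:target_len]
    return tokens + [pad_id] * (target_len - len(tokens))
```

Whether this belongs in the Dataset, a wrapper transform, or a collate function is exactly the strategy question raised in the thread.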
```python
"""Get the maximum allele index from a tuple of allele indices.

`sample.allele_indices` is a tuple of two indices that associate the sample with alleles
listed in the list 'variant.alleles', where the zeroth entry is the REF allele. A sample
could have one or both entries being 0, indicating that one or both strands are REF at
the specified position. By taking max(sample.allele_indices) you ensure to always extract
an alternative allele if present, or REF otherwise.

Sometimes, None is contained in sample.allele_indices. This is treated as if it refers
to the REF allele (i.e. 0).
"""
```
nice docstring, super helpful!
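The rule that docstring describes is compact enough to state as a one-liner. A sketch (standalone function name is hypothetical; the PR may implement this inline):

```python
def max_allele_index(allele_indices):
    """Treat None entries as REF (0), then take the max so an ALT allele
    is preferred when present and REF (0) is returned otherwise."""
    return max(0 if i is None else i for i in allele_indices)
```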
```python
def _match_contig_name(chrm: str, available_contigs: set[int | str]) -> str:
    """Let's try either the standardized version that I expect to be "chr22" or only the number string "22"."""
```
My opinion: we should just fail in this case. It's too easy to end up with a reference that has the wrong amount of Ns or something else stupid.
Happy to do that, but FYI, internally we did this because hg19 and hg38 VCF files used different conventions (https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19), and we wanted it to work with both. Theoretically you could say that the user needs to take care of this, but it becomes less user-friendly in situations where, say, windows on top of TSSs come from Gencode, which uses a specific standard.
Fine with me. Per our call: add a warning if the seqids don't match between the VCF and the reference.
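The fallback-plus-warning behavior agreed on above might look roughly like this (a sketch, not the PR's `_match_contig_name`; it assumes string contig names, so integer contigs in the set would need a `str()` pass first):

```python
import warnings


def match_contig_name(chrm, available_contigs):
    """Try the name as given, then the other convention ("chr22" <-> "22").
    Warn on a convention mismatch, and fail loudly if neither form exists."""
    if chrm in available_contigs:
        return chrm
    alt = chrm[3:] if chrm.startswith("chr") else f"chr{chrm}"
    if alt in available_contigs:
        warnings.warn(
            f"Contig {chrm!r} not found; falling back to {alt!r}. "
            "Check that your VCF and reference use the same naming convention."
        )
        return alt
    raise KeyError(f"Contig {chrm!r} not found under either naming convention.")
```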
```python
# TODO: sometimes the variant overlaps the boundary of a window. The fetch function is
# zero-based, so the interval works fine; the problem is that sometimes a variant can
# indeed overlap the region even if its start is outside.
```
:( perhaps we should make this a param, `allow_partial_overlap`? I'd set it to False by default. Also, you should move this comment to the docstring.
good point
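The proposed flag reduces to a small predicate. A sketch (function and parameter names are hypothetical; half-open, zero-based coordinates are assumed, matching the zero-based fetch mentioned in the TODO):

```python
def variant_in_window(variant_start, variant_end, window_start, window_end,
                      allow_partial_overlap=False):
    """With the flag off (the suggested default), keep only variants fully
    inside the window; with it on, any overlap counts -- covering variants
    whose start lies outside the window but whose span reaches into it."""
    if allow_partial_overlap:
        return variant_start < window_end and variant_end > window_start
    return variant_start >= window_start and variant_end <= window_end
```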
```python
def get_dna_sequence(self, ind: int) -> str:
    """Get an item from the dataset."""
    interval_index = ind // len(self.sample_ids)
```
this is a great place to surface and document all the edge-cases you handle when mutating the reference sequence.
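For the docstring, the index arithmetic above amounts to an interval-major layout, which could be documented with a small helper like this (a sketch; the PR computes it inline):

```python
def decompose_index(ind, n_samples):
    """Items are laid out interval-major: (interval 0, sample 0),
    (interval 0, sample 1), ..., so divmod recovers both coordinates
    from the flat dataset index."""
    interval_index, sample_index = divmod(ind, n_samples)
    return interval_index, sample_index
```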
```python
def test_mutate():
    sequence = "AAAAAAAAAAAAAAAAAAAA"
    interval = GenomeInterval("chr1", 1, 10)
    variants_df = pd.DataFrame([{"POS": 5, "REF": "A", "ALT": "G"}, {"POS": 2, "REF": "AA", "ALT": "TC"}])
```
Can we add tests for the `_mutate` function directly? I think we should also cover:
- variants that overlap region boundaries
- variants that are not allowed (indels!)
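To make those two edge cases concrete, here is a toy reference implementation plus the kind of assertions the reviewer is asking for. This is NOT the PR's `_mutate`: positions are treated as 0-based offsets for simplicity, and out-of-window variants are silently skipped, both of which are assumptions for illustration only:

```python
def mutate_snvs(sequence, variants):
    """Toy stand-in for _mutate: apply SNVs given as (pos, ref, alt),
    skip variants whose position falls outside the sequence window,
    and reject indels (multi-base REF or ALT) outright."""
    seq = list(sequence)
    for pos, ref, alt in variants:
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError(f"indels are not supported: {ref}>{alt}")
        if 0 <= pos < len(seq):   # boundary-overlap case: out-of-window -> skipped
            seq[pos] = alt
    return "".join(seq)
```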
With regards to _mutate, that was the case previously, but I was asked to change it: #209 (comment). I can add the corner cases at any level of abstraction, but it would be useful to have a general guideline on this.
Ah, I missed that. Feel free to follow Peter's advice, or to add the test.
No description provided.