Error after not finding pre-training models (ir.dill) #21

Open
ivandatasci opened this issue Apr 25, 2022 · 5 comments · May be fixed by JakeLehle/CellO#1 or #31

Comments

@ivandatasci

Hello everyone.

I just installed the latest cello-classify version with pip. Version 2.0.3. Installation went fine.

When I try to use it, I get an error.

I execute this cell in a JupyterLab notebook:

cello.scanpy_cello(adata0a,
                   clust_key='leiden',
                   rsrc_loc='/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130',
                   out_prefix='/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130/cello_model00',
                   log_dir='/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130')

and this is the error:

       Could not find the CellO resources directory called
        'resources' in '/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130'. Will download resources to current 
        directory.
        
Running command: curl http://deweylab.biostat.wisc.edu/cell_type_classification/resources_v2.0.0.tar.gz > /mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130/resources_v2.0.0.tar.gz
Running command: tar -C /mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130 -zxf resources_v2.0.0.tar.gz
Running command: rm /mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130/resources_v2.0.0.tar.gz
Checking if any pre-trained model is compatible with this input dataset...
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-45-2570c77cb08a> in <module>
      4 #sc.pp.neighbors(adata0a, n_neighbors=15)
      5 #sc.tl.leiden(adata0a, resolution=2.0)
----> 6 cello.scanpy_cello(adata0a,
      7                    clust_key='leiden',
      8                    rsrc_loc='/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130',

/usr/local/lib/python3.9/site-packages/cello/scanpy_cello.py in cello(adata, clust_key, rsrc_loc, algo, out_prefix, model_file, log_dir, term_ids, remove_anatomical_subterms)
    127     else:
    128         # Load or train a model
--> 129         mod = ce._retrieve_pretrained_model(adata, algo, rsrc_loc)
    130         if mod is None:
    131             mod = ce.train_model(

/usr/local/lib/python3.9/site-packages/cello/cello.py in _retrieve_pretrained_model(ad, algo, rsrc_loc)
    329                 model_fname
    330             )
--> 331             with open(model_f, 'rb') as f:
    332                 mod = dill.load(f)
    333             feats = mod.classifier.features

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130/resources/trained_models/ir.dill'

Could somebody offer some guidance?

Thank you.

Ivan

@alexisthermofisher

Hi Ivan,

Were you ever able to figure it out? I am also having issues with my CellO resources directory.

Thanks!

@mbernste
Member

mbernste commented Jul 8, 2023

My apologies for the trouble. I just fixed the bug. If you re-install CellO via pip install -U cello-classify, then the new version should hopefully work now. Let me know if you still experience an issue.

@TomSmithCGAT

I'm getting the same error. I downloaded the resources and unpacked with:

wget https://deweylab.biostat.wisc.edu/cell_type_classification/resources_v2.0.0.tar.gz; tar -xvf resources_v2.0.0.tar.gz

which gives me the following output:

x resources/
x resources/gene_metadata/
x resources/trained_models/
x resources/README
x resources/training_set/
x resources/training_set/labels.json
x resources/training_set/experiment_to_study.json
x resources/training_set/experiment_to_tags.json
x resources/training_set/log_tpm.h5: truncated gzip input
tar: Error exit delayed from previous errors.

and the following files:

 ls resources/* 
resources/README

resources/gene_metadata:

resources/trained_models:

resources/training_set:
experiment_to_study.json	experiment_to_tags.json		labels.json			log_tpm.h5

I installed cello-classify v 2.1.1 in my conda env with pip:

$conda list|grep cello
cello-classify            2.1.1                    pypi_0    pypi
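
The "truncated gzip input" message in the tar output above suggests the archive did not download completely, which would also explain the empty trained_models directory. A minimal sketch for checking a downloaded archive before unpacking it, using only the Python standard library; the helper name gzip_is_complete is made up for this example and is not part of CellO:

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses to the end.

    Hypothetical helper (not part of CellO): it reads the whole stream
    and reports False if the file is missing, ends early, or is corrupt.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Read in chunks so a multi-GB archive is never held in memory.
            while f.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended before the end-of-stream marker.
        # OSError covers a missing file and gzip.BadGzipFile.
        # zlib.error: corrupt compressed data.
        return False
    return True
```

If gzip_is_complete("resources_v2.0.0.tar.gz") comes back False, re-downloading the archive is probably the first thing to try.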

@JakeLehle

Circling back to this issue as I hit it too today. I was able to get around it but in a rather dumb way.

So the curl command doesn't respect the cello_resource_loc setting: even if you set it, the download commands can't find the dir and default to downloading the resource dir into the current dir. The tar command then tries to unpack the archive from cello_resource_loc rather than from the current dir, can't find it, and the script errors out. Here is how I was able to get around this issue.

import cello
import os

path = os.path.join(os.environ['HOME'], "CellO/test_cello")

try:
    os.makedirs(path, exist_ok=True)
    print(f"Successfully created the directory {path}")
except Exception as e:
    print(f"An error occurred: {e}")

# Change the current dir to where you want the model to be trained to avoid curl issue
os.chdir(path)

# Set the working dir for the CellO resources
# Note, these resources require approximately 5GB of disk space.
cello_resource_loc = os.path.join(os.environ['HOME'], "CellO/test_cello/")

model_prefix = "Trained_CellO_Model" # <-- The trained model will be stored in a file called Trained_CellO_Model.model.dill 

cello.scanpy_cello(
    adata, 
    'leiden',
    cello_resource_loc, 
    out_prefix=model_prefix
)
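
One drawback of the workaround above is that os.chdir changes the working directory for the rest of the session. A small sketch of the same idea that restores the original directory afterwards; the name working_directory is made up for this example (Python 3.11+ ships contextlib.chdir, which does much the same):

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the current directory to `path` (hypothetical helper)."""
    previous = os.getcwd()
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    try:
        yield path
    finally:
        # Always go back, even if the wrapped call raises.
        os.chdir(previous)


# Intended usage, assuming `adata`, `cello_resource_loc` and `model_prefix`
# are set up as in the snippet above:
# with working_directory(cello_resource_loc):
#     cello.scanpy_cello(adata, 'leiden', cello_resource_loc,
#                        out_prefix=model_prefix)
```

Since out_prefix is a relative name here, the trained model would still be written inside the resource location, as in the original workaround.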

@JakeLehle

Looks like this function in the cello.py file is the source of the issue:

def _download_resources(rsrc_loc):
    if not os.path.isdir(join(rsrc_loc, "resources")):
        msg = """
        Could not find the CellO resources directory called
        'resources' in '{}'. Will download resources to current 
        directory.
        """.format(rsrc_loc)
        print(msg)
        download_resources.download(rsrc_loc)
    else:
        print("Found CellO resources at '{}'.".format(join(rsrc_loc, 'resources')))

I believe rsrc_loc is the dir where the resource folder lives, or at least that's what the rest of the script uses it as. The download function, however, effectively treats it not as the actual cello_resource_loc but as the working dir, which is how users often end up using it before they hit this issue. The function works, but only if you run things in the current dir the first time you run the pipeline and then set cello_resource_loc to the resource dir that was created on every run after that.

As I type this out and read it back, it does make sense for cello_resource_loc to name the location of the resource dir, but first-time users will find this confusing when looking at the documentation. It also sets users up for a hard time, because they will never have the resource dir until after they run the pipeline the first time, and unless they run it in the current dir it will crash.
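
To make the distinction concrete: going by the traceback earlier in this thread, the code looks for resources/trained_models/ir.dill underneath rsrc_loc. A rough pre-flight check along those lines, where resources_ready is a hypothetical helper and ir.dill is the only model file name taken from this thread (other required files are not checked):

```python
import os


def resources_ready(rsrc_loc):
    """Check that `rsrc_loc` contains resources/trained_models/ir.dill.

    Hypothetical helper, not part of CellO. `rsrc_loc` is the *parent*
    of the `resources` directory, not the `resources` directory itself.
    """
    model_path = os.path.join(rsrc_loc, "resources", "trained_models", "ir.dill")
    return os.path.isfile(model_path)
```

Calling this before cello.scanpy_cello would at least show whether the directory being passed already holds an unpacked model.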

If I have time, I can rewrite things and make a pull request that changes download and scanpy_cello to take a new input, the "work dir", which is what people are currently setting cello_resource_loc to. However, that will likely throw other errors and might be a pain, because rsrc_loc is used for so much when you train the local model, so users would likely still be confused about what to set cello_resource_loc to. The easier thing would be to change nothing and just update the documentation to explicitly tell people that the first time they run this they have to run it in the local dir, and then set cello_resource_loc to the resource dir in the future. Or, if all else fails, kindly point people to the STAR Protocols paper you put out for more information on the resource dir they will be downloading.
https://pmc.ncbi.nlm.nih.gov/articles/PMC8379521/
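
For illustration, here is one way a download step could build every path from the target directory, so that it does not matter where the script is launched from. This is a sketch, not CellO's actual code: fetch_resources and extract_archive are hypothetical names, and the URL is the one quoted earlier in this thread.

```python
import os
import tarfile
import urllib.request

RESOURCES_URL = (
    "https://deweylab.biostat.wisc.edu/cell_type_classification/"
    "resources_v2.0.0.tar.gz"
)


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz into `dest_dir`, independent of the current directory."""
    archive_path = os.path.abspath(archive_path)
    dest_dir = os.path.abspath(dest_dir)
    os.makedirs(dest_dir, exist_ok=True)
    # Newer Python versions support extraction filters; use the safe one if present.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir, **kwargs)
    return dest_dir


def fetch_resources(dest_dir, url=RESOURCES_URL):
    """Download the archive into `dest_dir`, unpack it there, then delete it."""
    dest_dir = os.path.abspath(dest_dir)
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, archive_path)  # needs network access
    try:
        extract_archive(archive_path, dest_dir)
    finally:
        # Remove the archive whether or not unpacking succeeded.
        os.remove(archive_path)
    return os.path.join(dest_dir, "resources")
```

With something like this, the download, unpack, and cleanup steps all refer to the same absolute archive path, so the first run would not depend on the current directory.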

JakeLehle added a commit to JakeLehle/CellO that referenced this issue Mar 26, 2025
Fixed issue deweylab#21 with not being able to download resource dir for first-time users
JakeLehle added a commit to JakeLehle/CellO that referenced this issue Mar 26, 2025
Fixed deweylab#21 with not being able to download resource dir for first-time users