
End-to-end example for generating embeddings with get_embeddings.py #12

Open
justincosentino opened this issue Aug 19, 2024 · 0 comments


Hi @AnnaChristina,

I am interested in generating nicheformer embeddings using the available checkpoint. Is there an end-to-end tutorial showing how to tokenize a data source into the format expected by the model? I've tried the following, but it seems that the data/model_means/model.h5ad file contains only a single observation, which appears inconsistent with what the related tokenization notebooks expect.

  1. I installed the nicheformer package and downloaded the pretrained model weights from Mendeley.
  2. I downloaded and preprocessed the exemplar spatial and dissociated datasets using the download_* and preprocess_* scripts in data/spatialcorpus-110M/spatial/examplary-Xenium and data/spatialcorpus-110M/dissociated/Lu_2021, respectively. I updated the default paths in the constants file.
  3. Following nicheformer/tree/main/notebooks/tokenization/xenium_human_lung.ipynb, I tried to run the tokenization process for .../spatial/preprocessed/Xenium_Preview_Human_Non_diseased_Lung_With_Add_on_FFPE_outs.h5ad. I mapped DATA_PATH to this h5ad (corresponds to healthy in the notebook?), xenium_mean to data/model_means/xenium_mean_script.npy, and model to data/model_means/model.h5ad.

My xenium object contains the expected obs, var, etc. sub-objects, albeit with fewer samples than the shapes logged in the notebook. However, model seems to be missing observations:

AnnData object with n_obs × n_vars = 1 × 20310
    obs: 'soma_joinid', 'is_primary_data', 'dataset_id', 'donor_id', 'assay', 'cell_type', 'development_stage', 'disease', 'tissue', 'tissue_general', 'specie', 'technology', 'dataset', 'x', 'y', 'assay_ontology_term_id', 'sex_ontology_term_id', 'organism_ontology_term_id', 'tissue_ontology_term_id', 'suspension_type', 'condition_id', 'tissue_type', 'library_key', 'organism', 'sex', 'niche', 'region', 'nicheformer_split', 'author_cell_type', 'batch'

I believe this breaks the inner join in the following block: the post-join xenium object ends up with a shape of n_obs × n_vars = 295883 × 391 rather than the notebook's logged n_obs × n_vars = 827048 × 20310:

adata = ad.concat([model, xenium], join='inner', axis=0)
# dropping the first observation
xenium = adata[1:].copy()
# for memory efficiency
del adata

Would it be possible to add an updated model.h5ad file and a more detailed end-to-end example so that we can try to format our datasets to match the tokenized representations expected by the Nicheformer.get_embeddings() method?
