
End-to-end example for generating embeddings with get_embeddings.py #12

Open
justincosentino opened this issue Aug 19, 2024 · 0 comments


Hi @AnnaChristina,

I am interested in generating nicheformer embeddings using the available checkpoint. Is there an end-to-end tutorial showing how to tokenize a data source into the format expected by the model? I've tried the following, but it seems that the data/model_means/model.h5ad file contains only a single observation, which appears inconsistent with what the related tokenization notebooks expect.

  1. I installed the nicheformer package and downloaded the pretrained model weights from Mendeley.
  2. I downloaded and preprocessed the exemplar spatial and dissociated datasets using the download_* and preprocess_* scripts in data/spatialcorpus-110M/spatial/examplary-Xenium and data/spatialcorpus-110M/dissociated/Lu_2021, respectively. I updated the default paths in the constants file.
  3. Following nicheformer/tree/main/notebooks/tokenization/xenium_human_lung.ipynb, I tried to run the tokenization process for .../spatial/preprocessed/Xenium_Preview_Human_Non_diseased_Lung_With_Add_on_FFPE_outs.h5ad. I mapped DATA_PATH to this h5ad (corresponds to healthy in the notebook?), xenium_mean to data/model_means/xenium_mean_script.npy, and model to data/model_means/model.h5ad.

My xenium object contains the expected obs, var, etc. sub-objects, albeit with fewer samples than the shapes logged in the notebook. However, model seems to be missing observations:

AnnData object with n_obs × n_vars = 1 × 20310
    obs: 'soma_joinid', 'is_primary_data', 'dataset_id', 'donor_id', 'assay', 'cell_type', 'development_stage', 'disease', 'tissue', 'tissue_general', 'specie', 'technology', 'dataset', 'x', 'y', 'assay_ontology_term_id', 'sex_ontology_term_id', 'organism_ontology_term_id', 'tissue_ontology_term_id', 'suspension_type', 'condition_id', 'tissue_type', 'library_key', 'organism', 'sex', 'niche', 'region', 'nicheformer_split', 'author_cell_type', 'batch'

I believe this breaks the inner join in the following block: the post-join xenium object ends up with a shape of n_obs × n_vars = 295883 × 391 rather than the notebook's logged n_obs × n_vars = 827048 × 20310:

adata = ad.concat([model, xenium], join='inner', axis=0)
# dropping the first observation
xenium = adata[1:].copy()
# for memory efficiency
del adata

Would it be possible to add an updated model.h5ad file and a more detailed end-to-end example so that we can try to format our datasets to match the tokenized representations expected by the Nicheformer.get_embeddings() method?
