Positive examples of using microscopy image-based chemical screening #12

Open
agitter opened this issue Feb 27, 2019 · 8 comments
Labels
related work Related manuscript

Comments

@agitter
Member

agitter commented Feb 27, 2019

Our latest results in #9 and #7 have given no indication that the cell images are meaningful for predicting chemical effects. There seems to be very little signal in this type of data. We may need to find some positive success stories of how this type of imaging data has been used for chemical screening, drug discovery, etc. to convince ourselves there is a meaningful way to link ChEMBL assays to these images or the Sanger drug sensitivity to these images.

https://www.recursionpharma.com/ works specifically in this area, so reminding ourselves of their successes may be a good place to start.

@agitter
Member Author

agitter commented Mar 10, 2019

Here is a paper Anne Carpenter shared recently that reports positive results:
Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery
https://doi.org/10.1016/j.chembiol.2018.01.015

@agitter
Member Author

agitter commented Mar 19, 2019

My notes on Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery

  • Use a "three-channel glucocorticoid receptor (GCR) HTI assay" from which they extract 842 Cell Profiler features
  • Normalize with mean and standard deviation of each feature, then compute median of normalized values for all cells in an image to get an image-derived fingerprint
  • Screen 524,371 proprietary compounds
  • Do not directly apply CNNs to the images; this is left as future work
  • Focus on multi-task feed forward neural networks and Bayesian matrix factorization for supervised learning
  • The NNs are 1-3 layers with 1024-4096 hidden units per layer.
  • Random forest and k-NN in supplement
  • Use AUC-ROC to find assays that produce reliable models, keep those for which 3-fold CV gives AUC-ROC > 0.9. This selects < 10% of the 535 assays in their dataset (Table 1).
  • Start with 1200 assays but require 25 actives and 25 inactives along with other criteria to get 535 assays
  • Careful design of cross validation folds using ECFP6 fingerprint clustering
  • Use the matrix factorization model predictions to do in vitro validation for an oncology project and a central nervous system (CNS) project. Not explained why they prioritize those two projects.
  • Their image-based prioritization gives more chemically diverse compounds than a random selection (Figure 5).
  • Oncology project screened 342 top-ranked compounds, 124 are hits. A 36.3% hit rate is excellent in my opinion, and much better than the full screen hit rate of 0.725%. They only considered 60k compounds.
  • They also train a matrix factorization model on the ECFP fingerprints but it is unclear whether they experimentally tested all of these. They look at where the top image-based compounds and hits lie in the ECFP-based ranked list. They claim "This shows that the image fingerprints clearly provided an additional source of information that is not encoded in the chemical fingerprints." but do the chemical fingerprints yield a better hit rate?
  • For the CNS project they consider all 500k compounds and select those predicted to be highly active and apply a PAINS filter and CNS filter.
  • Use a prioritization strategy that explicitly promotes chemical diversity. Select 141 compounds and find 36 hits (25.5% hit rate). Much better than initial screen with a 0.088% hit rate.
  • No discussion of ECFP based predictions in this part. "We leave for future work the head-to-head comparison of chemistry-based and image-based fingerprints... In the case of a well-covered chemical space, we would not expect image-based fingerprints to outperform a well-designed chemical fingerprint like ECFP".
  • They argue image-based features would be better for scaffold hopping, which is possibly true. Similar arguments have been made when using high-throughput assay activity as the chemical feature (HTS fingerprints).
  • Image-based fingerprinting is applicable to RNAi, antibodies, and other perturbations that are not small molecules.
  • Full AUC-ROC values are in Data S1. Their assays look better than what we have because many of them have thousands of compounds screened. We may be able to learn something by plotting their performance in the same style you use to evaluate your performance. Here's a quick look at the NN performance for all assays:
    [image: NN AUC-ROC for all assays]
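
The fingerprint construction described in the notes (normalize each feature by its mean and standard deviation, then take the per-image median over cells) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `zscore_features` and `image_fingerprint` are hypothetical, the toy matrix stands in for the 842 Cell Profiler features, and the choice of which cells supply the mean and standard deviation is an assumption.

```python
from statistics import mean, median, pstdev


def zscore_features(cells, reference=None):
    """Z-score every feature column of `cells`.

    `cells` is a list of per-cell feature vectors. `reference` (hypothetical
    parameter) is the population used to estimate the mean and standard
    deviation; it defaults to `cells` itself.
    """
    ref = cells if reference is None else reference
    n_features = len(ref[0])
    mus = [mean([row[j] for row in ref]) for j in range(n_features)]
    # Guard against constant features by falling back to a scale of 1.0.
    sds = [pstdev([row[j] for row in ref]) or 1.0 for j in range(n_features)]
    return [[(row[j] - mus[j]) / sds[j] for j in range(n_features)]
            for row in cells]


def image_fingerprint(normalized_cells):
    """Median of each normalized feature over all cells in one image."""
    n_features = len(normalized_cells[0])
    return [median([row[j] for row in normalized_cells])
            for j in range(n_features)]


# Toy image with three cells and two features.
cells = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
fingerprint = image_fingerprint(zscore_features(cells))
print(fingerprint)
```

The median makes the fingerprint robust to a few outlier cells, which is presumably why it is used instead of the mean.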

Comments:

  • We have intentionally avoided AUC-ROC in this domain because it is easy to get a good score even if the classifier or regression model performs poorly.
  • The idea to focus only on assays that can be predicted well from the images is very interesting. There is no need for the images to be useful for predicting activity on all assays.
  • Half a million compounds is a huge screen compared to the Cell Painting data.
  • The things we are interested in - CNNs and benchmarking with ECFP fingerprints - are both areas of future work they explicitly call out. Hopefully they are not too far along in testing these things.
  • The direct pharma involvement shows. They are not descriptive about their compounds or assays and have a very large initial screen to start from. "Due to the proprietary nature of the drug development process, we are unable to disclose specific information related to the chemical compounds and specific protein targets."
  • Therefore, we could not replicate or build upon this study.
  • This paper is somewhat encouraging. We may not need to have good performance on all assays. If we spend more time building strong Cell Profiler baselines and compare them to the ECFP baseline and something with a CNN, that would address a lot of important future work that they did not cover.
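
To make the AUC-ROC concern above concrete, here is a small synthetic example. It assumes a made-up screen with 10 actives among 1,000 compounds and invented scores; none of it comes from the paper. The helper names `auc_roc` and `precision_at_k` are hypothetical.

```python
def auc_roc(scores, labels):
    """AUC-ROC as the probability that a random active outscores a random
    inactive, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_k(scores, labels, k):
    """Fraction of actives among the k top-scoring compounds."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    return sum(y for _, y in ranked[:k]) / k


# Synthetic screen: 50 inactives score above every active,
# and the remaining 940 inactives score below every active.
scores = [0.9] * 50 + [0.8] * 10 + [0.1] * 940
labels = [0] * 50 + [1] * 10 + [0] * 940

print(round(auc_roc(scores, labels), 3))   # 940 / 990
print(precision_at_k(scores, labels, 50))  # no actives in the top 50
```

In this toy case the AUC-ROC is about 0.95 even though the 50 top-ranked compounds contain no actives, which is the situation that matters when only a few hundred compounds can be tested.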

@xiaohk
Member

xiaohk commented Mar 20, 2019

Some additional comments:

  • Their image features are based on single cells. "For each compound, we compute a vector of feature medians across all cells in its image, producing a single image-based fingerprint". There are two fields for each well, so there are at least two images for each compound. Not sure if they aggregate across all images, or have multiple feature vectors for one compound.
  • They used negative control images to z-score normalize features within each plate before aggregating.
  • Cross-validation should not use random splits, since structurally similar compounds are correlated. They used a stratified sampling method based on compound clusters.
  • Compound diversity is greatly valued for screening. They use ECFP fingerprints to compute compound similarities.
  • As you noted, one ECFP fingerprint model is implemented, but not fully validated.
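
A minimal sketch of cluster-aware fold assignment, to illustrate the cross-validation point above. It assumes the compounds have already been clustered (for example on ECFP similarity) and that each compound carries a cluster label. The function `cluster_folds` is hypothetical, and the greedy balancing is a simple stand-in, not necessarily the paper's exact sampling procedure.

```python
from collections import defaultdict


def cluster_folds(cluster_ids, n_folds=3):
    """Assign each compound to a fold so that no cluster is split across folds.

    `cluster_ids[i]` is the (precomputed, assumed) cluster label of compound i.
    Returns a list with the fold index of every compound.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    fold_sizes = [0] * n_folds
    fold_of = [None] * len(cluster_ids)
    # Place the largest clusters first, each into the currently smallest fold.
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        target = fold_sizes.index(min(fold_sizes))
        for idx in members[cid]:
            fold_of[idx] = target
        fold_sizes[target] += len(members[cid])
    return fold_of


clusters = ["a", "a", "a", "b", "b", "c", "d", "d", "e"]
print(cluster_folds(clusters, n_folds=3))
```

Keeping whole clusters together means a model is never evaluated on a close analog of a training compound, which gives a more honest estimate than random splits.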

@agitter
Member Author

agitter commented Mar 20, 2019

Paper, supplement, and Data S1
1-s2.0-S2451945618300370-mmc3.pdf
1-s2.0-S2451945618300370-mmc2.xlsx

@agitter
Member Author

agitter commented Mar 20, 2019

I can contact the authors to see if there is any chance they'd be willing to share the data. It is unlikely, but there is nothing to lose. Maybe they could anonymize the assay labels.

@agitter
Member Author

agitter commented Mar 20, 2019

Their supplement describes the three channels:

Hoechst 33258 (Invitrogen H3569, dilution 1/5000) to label the nucleus, CellMask Deep Red (Invitrogen H32721, dissolved in 100 ml DMSO, then diluted 1/4000) to delineate cell boundaries, and an Alexa-568 labeled goat anti-rabbit secondary antibody (Invitrogen A11011, 1/500) to detect the GCR.

For Hoechst, a 405-nm laser was used and a 445/45 bandpass emission filter; for Alexa 568 a 561-nm excitation and a 600/37 filter, and for CellMask Deep Red a 635-nm laser and a 676/29 filter.

  • Hoechst is for DNA staining
  • Alexa 568 is used to detect the GCR
  • CellMask Deep Red is for cell boundaries

We also noted that they are specifically targeting the glucocorticoid receptor (GCR), a single protein target. This may mean that the "hit ratio" of the image-based screen is more like the hit ratio of traditional assays. In addition, it is likely that there is a stronger contrast between the hits and the controls.
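
For reference, the hit rates quoted in the notes above can be turned into fold enrichments over the original screens with a few lines of arithmetic. The counts and baseline rates are the ones reported in the notes; the helper names are hypothetical.

```python
def hit_rate(hits, tested):
    """Fraction of tested compounds that were hits."""
    return hits / tested


def fold_enrichment(rate, baseline_rate):
    """How many times higher the prioritized hit rate is than the baseline."""
    return rate / baseline_rate


oncology = hit_rate(124, 342)  # image-based selection, oncology project
cns = hit_rate(36, 141)        # image-based selection, CNS project

# Baseline hit rates of the initial screens: 0.725% and 0.088%.
print(f"oncology: {oncology:.1%}, {fold_enrichment(oncology, 0.00725):.0f}x")
print(f"CNS: {cns:.1%}, {fold_enrichment(cns, 0.00088):.0f}x")
```

This works out to roughly a 50-fold enrichment for the oncology project and roughly 290-fold for the CNS project.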

@xiaohk
Member

xiaohk commented Mar 20, 2019

Here are the five channels used in our U2OS cell-painting dataset.

| Dye | Alternative | Position |
| --- | --- | --- |
| ERSyto | ER | Endoplasmic reticulum |
| ERSytoBleed | RNA | RNA |
| Hoechst | DNA | Nucleus |
| Mito | Mito | Mitochondria |
| Ph_golgi | AGP | Plasma membrane |

@agitter
Member Author

agitter commented Mar 23, 2019

I contacted the last author of this paper asking about data availability but received an out of office response. I can follow up in a week or two.

@agitter agitter added the related work Related manuscript label Aug 26, 2022