High-throughput run over all plates (continued) #9
Heatmap for DMSO images

Since there are many corrupted raw images, we only had features extracted from 317 plates. Then, for each plate, I randomly sampled 5 images from random DMSO wells and random fields of view. That gave me 317 * 5 = 1585 rows, so we can visualize this 1585 * 4096 matrix.

Heatmap for Non-DMSO images

Following the same procedure, I sampled 1585 non-DMSO images. We can also visualize this 1585 * 4096 matrix.
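A minimal sketch of this sampling step, assuming the per-plate CNN features are already loaded into pandas DataFrames keyed by plate ID in a hypothetical `plate_tables` dict, with an `is_dmso` column and feature columns `f0..f4095` (all of these names are placeholders):

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

def sample_images(plate_df, n=5, dmso=True):
    """Randomly pick n rows (images) from one plate's feature table."""
    pool = plate_df[plate_df["is_dmso"] == dmso]
    return pool.loc[rng.choice(pool.index, size=n, replace=False)]

# plate_tables: hypothetical dict mapping plate ID -> per-plate DataFrame
sampled = pd.concat(sample_images(df, n=5, dmso=True) for df in plate_tables.values())
features = sampled.filter(regex=r"^f\d+$").to_numpy()  # 317 plates * 5 images = (1585, 4096)

# Rows are the sampled DMSO images, columns are the CNN features.
sns.heatmap(features, cmap="viridis")
```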
Also see our batch grouping in the DMSO distribution.
I subsampled features of 5 plates in batch 3 and 5 plates in batch 4 to test batch effect adjustment algorithms. This gave me a (41410, 4096) feature matrix. Then, I applied the adjustment and visualized a (4096, 4096) heat map of the feature matrix. In each heat map, the top 2048 rows are randomly sampled from batch 3, while the bottom 2048 rows are from batch 4. Both heat maps use the same sampling index.

Before adjustment / After adjustment
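A sketch of how the paired heat maps with a shared sampling index could be produced, assuming hypothetical arrays `features_raw` and `features_adjusted` of shape (41410, 4096) and a `batch_labels` vector marking batch 3 versus batch 4:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Shared sampling index: 2048 rows from each batch, reused for both heat maps.
idx_b3 = rng.choice(np.where(batch_labels == 3)[0], size=2048, replace=False)
idx_b4 = rng.choice(np.where(batch_labels == 4)[0], size=2048, replace=False)
row_idx = np.concatenate([idx_b3, idx_b4])  # batch 3 on top, batch 4 below

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
for ax, mat, title in zip(axes,
                          [features_raw, features_adjusted],
                          ["Before adjustment", "After adjustment"]):
    ax.imshow(mat[row_idx], aspect="auto", cmap="viridis")
    ax.set_title(title)
    ax.set_xlabel("feature")
    ax.set_ylabel("sampled image (batch 3 top, batch 4 bottom)")
plt.tight_layout()
plt.show()
```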
Some other ideas for normalization
We can try those. We will need a way to evaluate the three current normalization strategies:
One approach would be to normalize, run UMAP on the feature matrix, and visualize the DMSO images in the new 2D space from UMAP. If the normalization worked well, we might hope to see a single DMSO cluster instead of the dual clusters we saw for some batches in #5. We will also continue thinking about the calibration transfer strategies, possibly including the CNN approach we discussed before. We could also consider encoding the batch (or plate?) as a feature and using that when predicting drug responses. The batches show clear visual differences, but they are groupings we created, whereas the plates are the direct technical units.
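A minimal sketch of that UMAP check using umap-learn, assuming a normalized feature matrix `features` with parallel `is_dmso` and `batch` arrays (hypothetical names):

```python
import matplotlib.pyplot as plt
import umap

# Fit UMAP on the normalized feature matrix and plot the DMSO images colored
# by batch. A successful normalization should give a single DMSO cluster
# rather than one cluster per batch.
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(features)

dmso_xy = embedding[is_dmso]
plt.scatter(dmso_xy[:, 0], dmso_xy[:, 1], c=batch[is_dmso], s=2, cmap="tab10")
plt.colorbar(label="batch")
plt.title("DMSO images after normalization (UMAP)")
plt.show()
```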
Use UMAP to measure normalization performance

bc2547b adds UMAP plots for CombatP (Parametric Combat) normalization. Surprisingly, two batches are separated in the UMAP 2D space after normalization, even though we have seen a consistent heat map in #9 (comment).
Other normalization
The standard batch effect correction methods are either too slow or unsatisfactory. The network-based normalization could still possibly work but would be hard to implement and run. We could also contact the Broad team again to point out what we perceive as a batch effect and ask for suggestions. Instead of going to the authors directly, we can post to the meta image-analysis board that supports CellProfiler, ImageJ, etc.
The imaging forum (image.sc) is very helpful. Someone actually asked questions about the Cell Painting data we are using. Based on the replies, the Carpenter Lab has noticed the batch effects. For the provided CellProfiler profiling features, they applied per-plate normalization using a z-score method. The Carpenter Lab has also published a very nice paper discussing the cell image processing procedure in their lab. For batch effect detection, it suggests aggregating the negative control features for each plate (the median is recommended) and then visualizing the correlation matrix. For batch effect removal, it recommends using DMSO (negative control) wells as the "normalizing population," then applying z-score normalization (similar to the mean method we have used) within each plate, directly on every feature.
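A sketch of the recommended per-plate z-score normalization with DMSO as the normalizing population; the DataFrame layout and the `is_dmso` column are assumptions about how the metadata is stored:

```python
import pandas as pd

def normalize_plate(plate_df, feature_cols):
    """Per-plate z-score normalization using the DMSO wells of that plate
    as the normalizing population, applied directly to every feature.

    plate_df is assumed to contain a boolean 'is_dmso' column; feature_cols
    lists the CNN feature columns (placeholder names).
    """
    dmso = plate_df.loc[plate_df["is_dmso"], feature_cols]
    mu = dmso.mean(axis=0)
    sigma = dmso.std(axis=0)  # can be 0 for some features; see the fallback later in the thread
    out = plate_df.copy()
    out[feature_cols] = (plate_df[feature_cols] - mu) / sigma
    return out
```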
Using the recommended batch effect detection method, we get the following correlation heatmap. I used the element-wise median of the DMSO CNN features (4096-dimensional) in each plate to compute this 317*317 correlation matrix. The horizontal and vertical axes are the 317 sorted plate IDs.
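A sketch of how this detection heatmap can be built, assuming a hypothetical `plate_dmso` dict mapping each plate ID to its (n_dmso_images, 4096) DMSO feature array:

```python
import numpy as np
import seaborn as sns

# Element-wise median DMSO profile per plate, rows sorted by plate ID.
plate_ids = sorted(plate_dmso)
profiles = np.stack([np.median(plate_dmso[p], axis=0) for p in plate_ids])  # (317, 4096)

# Pearson correlation between per-plate median profiles -> (317, 317)
corr = np.corrcoef(profiles)
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1,
            xticklabels=False, yticklabels=False)
```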
Next step:
I have tried the z-score normalization on all extracted features. The results do not look very good, so I tried some other methods. One issue I encountered is that the variance of some features in the normalizing population is 0, even though it is not 0 on treated images. I have discussed it with Shantanu on the image.sc forum; I believe he knows we are working on the Cell Painting data. My solution is to use the feature variance of all images instead of only DMSO images when the latter variance is 0. If the variance over all images in that plate is also 0, I skip normalizing that feature (treating mean = 0, std = 1) and report it. This idea is applied to all three normalizations below (using the superset to compute the std).
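A sketch of this zero-variance fallback for a single plate (the function and argument names are just for illustration):

```python
import numpy as np

def robust_zscore(features, is_dmso, eps=0.0):
    """Per-plate z-score with the zero-variance fallback described above.

    features: (n_images, n_features) array for one plate; is_dmso: boolean mask.
    Uses the DMSO mean/std by default, falls back to the std over all images
    when the DMSO std is 0, and skips the feature (mean=0, std=1) when both
    stds are 0. Returns the normalized array and the skipped feature indices.
    """
    mu = features[is_dmso].mean(axis=0)
    std_dmso = features[is_dmso].std(axis=0)
    std_all = features.std(axis=0)

    std = np.where(std_dmso > eps, std_dmso, std_all)  # fallback to superset std
    skipped = std <= eps                               # both stds are 0
    mu = np.where(skipped, 0.0, mu)
    std = np.where(skipped, 1.0, std)

    return (features - mu) / std, np.flatnonzero(skipped)
```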
Next Steps
These next steps make sense. UMAP has been helpful in visualizing the batch effects in the past. We are comfortable sharing these heatmaps on the image.sc forum if it can help us determine the best normalization.
I asked for suggestions here, but Shantanu didn't reply. I guess these questions are too detailed and not related to CellProfiler. Then I tried to re-visualize the normalization results using UMAP with some slight visualization changes.

Workflow
I agree. Now that we are happy with the 2D UMAP representations after this normalization, we can spot-check individual images to confirm the extracted features and UMAP 2D coordinates make sense with respect to the original images. If we fit a 2D Gaussian to the DMSO images in the 2D space, we can score each treated image by the likelihood that it was generated from that Gaussian. Then we can look at the 10-100 images with the highest likelihood and the 10-100 with the lowest. Perhaps we do this for more than one batch. One hypothesis is still that the outliers here are those where the chemicals have the greatest effect on the cancer cells. That may be a research question we can explore next. Before trying to systematically validate the outliers, we could look at some that are strong outliers and manually search ChEMBL or other databases to see what evidence exists that we can use to support our predictions.
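A minimal sketch of the Gaussian likelihood scoring, assuming `dmso_xy` and `treated_xy` hold the 2D UMAP coordinates (hypothetical names):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Fit a single 2D Gaussian to the DMSO points in UMAP space.
mu = dmso_xy.mean(axis=0)
cov = np.cov(dmso_xy, rowvar=False)
dmso_model = multivariate_normal(mean=mu, cov=cov)

# Log-likelihood of each treated image under the DMSO Gaussian.
loglik = dmso_model.logpdf(treated_xy)

# Indices of the ~100 most DMSO-like and the ~100 most outlying treated images.
most_typical = np.argsort(loglik)[-100:]
most_outlying = np.argsort(loglik)[:100]
```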
I was preparing for an admission interview this week. To show the professor that I am capable of designing and implementing visual analytics systems, I designed a visualization for flexibly viewing a million UMAP dots. The link above should be private. He gave us lots of suggestions about this visualization. The plot is still sketchy. I can give you a demo in our meeting, and we can discuss whether this plot is helpful.
Very nice! I'm curious to hear the suggestions. Did you do this directly in D3 or does it use some library?
UMAP Viewer Visualization

Yes, I am directly using D3.js and HTML SVG. It might not be our main interest, but I got many fun suggestions from my interviewer and Michael Gleicher.

Functions and goals
Limitations
Feedback
Gaussian Mixture 2D Model

I have tried to fit a 2D Gaussian distribution on all normalized DMSO UMAP values, so that I can automatically detect outliers among the non-DMSO UMAP points. I fit the mixture model with two different settings and compared the results. I manually picked outliers from 3 regions: far top (y > 10), far right (x > 10), and the cluster bottom edge area (y < -5), to inspect the raw images. I am fixing a bug in my image inspection code.
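A sketch of the mixture-model scoring and the manual region picks, assuming `dmso_xy` and `non_dmso_xy` hold the normalized UMAP coordinates; the number of mixture components is a placeholder since the original settings were not preserved:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit the mixture model on the normalized DMSO UMAP points.
gmm = GaussianMixture(n_components=1, covariance_type="full", random_state=0)
gmm.fit(dmso_xy)

# Low log-likelihood under the DMSO model = candidate outlier.
scores = gmm.score_samples(non_dmso_xy)

# Manually defined regions used to pick raw images to inspect.
x, y = non_dmso_xy[:, 0], non_dmso_xy[:, 1]
far_top = np.flatnonzero(y > 10)
far_right = np.flatnonzero(x > 10)
bottom_edge = np.flatnonzero(y < -5)
```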
Even with these outliers, it ends up being difficult for us to manually evaluate whether they are reasonable or not. Another direction would be to look for positive controls: known effective drugs. If these are reported in the GigaScience paper, we could find the corresponding 2D coordinates and look at the images. The Broad Institute and Sanger have projects that screen drugs against many types of cancer cell lines. If they have U2OS data, it could also give us positive controls.
Based on #7, we have concerns about over-normalization and removing the signal. We may be able to design more direct tests of the normalization. For instance, if we select some training batches and a different test batch, can we predict DMSO versus non-DMSO on the test batch successfully? We could directly compare results for the original features and normalized features. The normalization is within-plate, so it should be fair to normalize even the test batch images.
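A sketch of that cross-batch test, assuming `features`, `is_dmso`, and `batch` arrays cover all images (hypothetical names); the same function can be run on raw and normalized features for a direct comparison:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def cross_batch_auc(features, is_dmso, batch, test_batch):
    """Train a DMSO-vs-non-DMSO classifier on all other batches and report
    AUC on the held-out batch. All argument names are placeholders."""
    train = batch != test_batch
    test = ~train
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features[train], is_dmso[train])
    return roc_auc_score(is_dmso[test], clf.predict_proba(features[test])[:, 1])

# Compare original vs. normalized features on the same held-out batch, e.g.:
# auc_raw = cross_batch_auc(raw_features, is_dmso, batch, test_batch=4)
# auc_norm = cross_batch_auc(norm_features, is_dmso, batch, test_batch=4)
```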
Both of these include U2OS drug screening data and can serve as positive controls of drugs that kill the cancer cells if we need those. We should also check whether the GigaScience paper discusses these screening datasets and any overlapping chemicals. The Broad data was originally reported in https://www.nature.com/articles/nature11003#supplementary-information. I checked the supplementary Excel file to confirm U2OS is there. It may be more direct to register for their data portal to access the data, though: https://portals.broadinstitute.org/ccle/about. The Sanger drug sensitivity data is at https://www.cancerrxgene.org/translation/CellLine/909776. They used z-scores to assess sensitivity and resistance. From the plot, it looks like no drugs passed their -2.0 threshold for sensitivity. We could still potentially learn what the most sensitive versus the most resistant chemicals do to the cells.
e10df9b adds analysis of these positive control compounds.
Below is their UMAP projection on the pre-normalized DMSO space:

Below is their UMAP projection on the normalized DMSO space:
The jobs currently running to look at the images from the max and min z-scores may still be informative. Otherwise, these results are somewhat confusing. We might expect that the images for the sensitive drugs with the most positive z-scores have fewer cells or show some sign that the cells are being affected or killed. In the 2D plots, however, the yellow and dark blue points are sometimes adjacent and the yellow points are distributed throughout the 2D space.
We have 11 compounds with known effects overlapping with our image dataset. This corresponds to 504 images. I randomly sampled 100 images from channel 123 and channel 45, respectively. Then, I sorted these raw images by their drug sensitivity z-scores (a sketch of the sampling and sorting is below).

Channel 123

Channel 45
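A sketch of the sampling and sorting step, assuming a hypothetical `meta` DataFrame with one row per image for the 504 overlapping images and columns for the channel group, compound, and drug sensitivity z-score:

```python
import pandas as pd

# meta: hypothetical DataFrame with columns 'image_path',
# 'channel_group' ('123' or '45'), 'compound', and 'z_score'.
for group in ["123", "45"]:
    subset = meta[meta["channel_group"] == group].sample(n=100, random_state=0)
    ordered = subset.sort_values("z_score")  # images ordered by their compound's z-score
    print(group, ordered["image_path"].head().tolist())  # first few paths to display
```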
My first observations were the same as yours, which is not encouraging. However, there may be more subtle biological cues that we both lack the expertise to pick up. For instance, the cell shape could be changing in a consistent way even though I can't see any trends.
Continuing the discussion from #5.
We will select two batches that have two DMSO clusters and a subset of plates to explore normalization options. Three normalization ideas are:
For the extracted features, we may also try visualizing the data matrix as a heat map if it is feasible. We would keep the plate ordering and use hierarchical clustering of the features to group similar features.
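If we do try this, seaborn's clustermap can cluster only the feature columns while leaving the plate-ordered rows untouched; `data` here is a placeholder for the plate-ordered feature matrix:

```python
import seaborn as sns

# data: (n_images, 4096) feature matrix with rows already ordered by plate.
# Cluster only the columns (features) so the plate ordering of rows is kept.
sns.clustermap(data, row_cluster=False, col_cluster=True,
               method="average", metric="euclidean", cmap="viridis")
```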