High-throughput run over all plates #5
I tried to extract features from the first 243 plates (out of about 360 plates). This uses 1.6 TB of the 2 TB available in Gluster. The extracted features are only 5 GB in total.
|
Some of the SQL files were corrupted, even after multiple attempts to transfer the files. We cannot find md5 checksums, so we may need to ask Anne Carpenter for help. We may have sufficient plates to continue our exploratory analysis and can follow up about the corrupted files later. We believe that UMAP should work on all of the extracted images. Hierarchical clustering may not. There are special techniques for clustering large datasets, and we may want to look for Python implementations of those. AnHai Doan has worked on this problem. I will contact him. We will also consider extracting features from earlier layers based on what we observed with the T cells. |
AnHai and Adel's clustering algorithm is for bit vectors only. We can look at the clustering program matrix from Adel to see if he has identified more general approximate clustering packages. We could approximate the hierarchical clustering ourselves by doing an initial clustering per plate, merging compounds in the same cluster, and then clustering across plates. We prefer to use an off-the-shelf method, though. CHTC has high-memory machines that could be necessary for a full hierarchical clustering. We can contact them to learn what special considerations there may be for using these machines. We can check whether scikit-learn uses multiple cores during hierarchical clustering. This may be controlled by NumPy or a lower-level library. We can confirm it works with our small-scale tests by scaling from 1 to 2 to 4 cores. UMAP can likely handle the full dataset, but if it fails, it may be able to learn a stable mapping from the original feature space to the lower-dimensional space. If so, we could run UMAP on X% of the data and then map the remaining (100 - X)% into the existing lower-dimensional space. Then we could visualize compounds by plate or in smaller groups. |
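For reference, the per-plate-then-across-plates approximation mentioned above could be sketched roughly as follows. This is only an illustration, not an off-the-shelf method and not code from this project: the function name `two_stage_clusters` and its parameters are hypothetical, and average linkage with cosine distance (via SciPy's `linkage` and `fcluster`) is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def two_stage_clusters(features, plate_ids, per_plate_k, final_k):
    """Approximate a global hierarchical clustering in two stages.

    Stage 1 clusters each plate separately and merges every local cluster
    into its centroid. Stage 2 clusters the centroids across plates.
    Assumes no all-zero rows (cosine distance is undefined for them) and
    at least two centroids overall.
    """
    features = np.asarray(features, dtype=float)
    plate_ids = np.asarray(plate_ids)
    centroids = []
    owner = np.empty(len(features), dtype=int)  # centroid index for each row

    for plate in np.unique(plate_ids):
        rows = np.flatnonzero(plate_ids == plate)
        if len(rows) == 1:
            local = np.array([1])  # a single image forms its own cluster
        else:
            tree = linkage(features[rows], method="average", metric="cosine")
            k = min(per_plate_k, len(rows))
            local = fcluster(tree, t=k, criterion="maxclust")
        for label in np.unique(local):
            members = rows[local == label]
            owner[members] = len(centroids)
            centroids.append(features[members].mean(axis=0))

    # Stage 2: cluster the merged representatives across all plates.
    tree = linkage(np.vstack(centroids), method="average", metric="cosine")
    top = fcluster(tree, t=final_k, criterion="maxclust")
    return top[owner]  # 1-based cluster label for every original row
```

The second stage only sees one centroid per local cluster, so its memory cost depends on the number of plates times `per_plate_k` rather than on the number of images. Whether the result is close enough to a full hierarchical clustering would still need to be checked on a small subset.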
@xiaohk for the UMAP dimension reduction, what distance measure are you using? Should we use cosine similarity there as well to be consistent with the hierarchical clustering decision in #4? |
Good suggestion! I was using the default |
The link above (https://umap-learn.readthedocs.io/en/latest/parameters.html#metric) had different UMAP metrics that are supported. Are those available in the version you have? |
Oops, sorry. I was trying to say it "does support". |
I tried to do UMAP feature visualization and hierarchical clustering on all combined features. Each image has 4,068 features and we have 340,304 images from 150 plates. The total feature size is 5 GB.
UMAP
I use … The resulting plots look weird. In half of the plates, the majority of points are in the upper cluster, while in the other half the points are located in the lower region.
Hierarchical Clustering
I use … It seems tricky to run hierarchical clustering with multiple threads. |
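A quick back-of-the-envelope estimate may help when deciding whether high-memory machines are needed. The sketch below uses the image and feature counts reported in this thread. It assumes float32 features and a float64 condensed pairwise distance matrix, which is what an implementation that materializes all pairwise distances would need; the helper name is hypothetical.

```python
def condensed_distance_bytes(n_samples, bytes_per_value=8):
    """Bytes needed to store all n*(n-1)/2 pairwise distances."""
    n_pairs = n_samples * (n_samples - 1) // 2
    return n_pairs * bytes_per_value


n_images = 340_304   # images from 150 plates, as reported above
n_features = 4_068   # features per image, as reported above

# Assumption: features are stored as 4-byte floats.
feature_gb = n_images * n_features * 4 / 1e9
distance_gb = condensed_distance_bytes(n_images) / 1e9

print(f"feature matrix:     {feature_gb:.1f} GB")
print(f"pairwise distances: {distance_gb:.0f} GB")
```

Under these assumptions the feature matrix comes out at about 5.5 GB, consistent with the 5 GB figure above, while the pairwise distances alone would need roughly 463 GB.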
It is possible that these DMSO clusters represent technical or "batch" effects and that the signal in the image-based features is mostly driven by those instead of something biological. We could cluster the 2D UMAP representation into 3 groups and then ask for each plate
Hopefully the representative images may show us whether there are obvious factors, such as the number of cells in the image, that dominate the signal and have a stronger effect than the morphology of the cells in the image. If there are batch effects, we would need to work on a plate normalization. This could involve comparing images only to the DMSO images on the same plate. |
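One simple form of the plate normalization mentioned above is to standardize each feature against the DMSO images on the same plate. The following is a minimal sketch under that assumption. The function name, the boolean `is_dmso` mask, and the `eps` term are all hypothetical, and other normalizations could be more appropriate.

```python
import numpy as np


def normalize_by_plate_dmso(features, plate_ids, is_dmso, eps=1e-8):
    """Standardize features per plate using that plate's DMSO images.

    features:  (n_images, n_features) array
    plate_ids: (n_images,) plate identifier for each image
    is_dmso:   (n_images,) boolean mask, True for DMSO control images
    """
    features = np.asarray(features, dtype=float)
    plate_ids = np.asarray(plate_ids)
    is_dmso = np.asarray(is_dmso, dtype=bool)
    normalized = np.empty_like(features)

    for plate in np.unique(plate_ids):
        on_plate = plate_ids == plate
        controls = features[on_plate & is_dmso]
        if len(controls) == 0:
            raise ValueError(f"plate {plate!r} has no DMSO images")
        center = controls.mean(axis=0)
        scale = controls.std(axis=0) + eps  # eps avoids division by zero
        normalized[on_plate] = (features[on_plate] - center) / scale

    return normalized
```

After this step, the DMSO images of every plate have mean zero in each feature, so any remaining separation of DMSO points in the UMAP plot would point to a batch effect that a per-feature shift and scale cannot remove.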
This is an important observation. We can see whether the entropy or total intensity (per channel) distributions are also bimodal. That could indicate one of the UMAP clusters is empty images. Revising the plan from #5 (comment), now it looks like there are only 2 clusters instead of 3. |
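A rough way to compute the two per-channel quantities mentioned above is sketched here. It assumes each image is an `(H, W, C)` array with pixel values scaled to `[0, 1]`, and it uses the Shannon entropy of a 256-bin intensity histogram. The function name, bin count, and value range are assumptions, not settings from this project.

```python
import numpy as np


def channel_stats(image, bins=256, value_range=(0.0, 1.0)):
    """Return per-channel total intensity and histogram entropy (in bits).

    Assumes all pixel values fall inside value_range.
    """
    image = np.asarray(image, dtype=float)
    totals, entropies = [], []
    for c in range(image.shape[-1]):
        channel = image[..., c]
        counts, _ = np.histogram(channel, bins=bins, range=value_range)
        p = counts[counts > 0] / counts.sum()  # drop empty bins
        entropies.append(float(-(p * np.log2(p)).sum()))
        totals.append(float(channel.sum()))
    return np.array(totals), np.array(entropies)
```

An empty or nearly uniform image should show up with both low total intensity and entropy close to zero, so plotting these values per image, colored by UMAP cluster, could reveal whether one cluster is dominated by such images.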
To keep things simple, we could use 2 clusters at first. The clustering is very robust and we would not have to spend more time trying to perfect the clusters. My main interest here is to see whether there is a cluster bias plate-by-plate and whether random examples from each cluster are obviously different from each other in some way. Those should be attainable even with 2 clusters. |
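The plate-by-plate bias check described above could look something like the sketch below. It assumes each image already has a non-negative integer cluster label (for example from clustering the 2D UMAP coordinates) and a plate identifier. The function name is hypothetical.

```python
import numpy as np


def cluster_fraction_by_plate(plate_ids, cluster_labels, n_clusters=2):
    """Map each plate to the fraction of its images in each cluster."""
    plate_ids = np.asarray(plate_ids)
    cluster_labels = np.asarray(cluster_labels, dtype=int)
    fractions = {}
    for plate in np.unique(plate_ids):
        labels = cluster_labels[plate_ids == plate]
        counts = np.bincount(labels, minlength=n_clusters)
        fractions[plate.item()] = counts / counts.sum()
    return fractions


# Toy usage: plate "A" is mostly cluster 0, plate "B" is entirely cluster 1.
example = cluster_fraction_by_plate(
    ["A", "A", "A", "A", "B", "B", "B", "B"],
    [0, 0, 0, 1, 1, 1, 1, 1],
)
```

If most plates have fractions near 0 or 1 rather than a similar mix of both clusters, that would support the batch-effect explanation.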
1bf4f4a adds the inspection code for 2 clusters. I randomly sampled 100 non-DMSO shots from 150 plates and got the following inspection results.
[Figures: Group 0 Channel 123; Group 1 Channel 123; Group 0 Channel 45; Group 1 Channel 45]
Comments
|
Some ideas for systematically exploring the trends noted above:
|
Since the changes we have inspected are related to channel intensities, I first explored the relationship with … Below is a plot of the five channels for 80 randomly sampled plates (out of 244 in total). From this sample, we can infer that intensities vary widely across plates.
|
In the short term, we can send the final figure to Alex and Melissa to see if these intensity biases are common in microscopy and whether there are standard corrections. There are standard ways to normalize images to have the same mean intensity, but if other attributes of the images are affected by the batch effect, then a simple intensity correction will not work. Anne Carpenter has work on weak supervision that could be relevant. In this dataset, we could constrain the representation of DMSO-treated cells to be the same in all instances. That would force the network to learn how to remove the batch effect. We can assess whether it worked by inspecting the UMAP visualization. There is also a less-related paper on batch normalization for single-cell RNA-seq data: https://www.biorxiv.org/content/early/2018/08/27/237065.1 For expediency, we could start by subsampling the DMSO-treated images and an equal number of non-DMSO images. We can also regenerate a version of the last plot for only the DMSO images. |
Alex suggested normalizing images by their nearest DMSO image or by the cell count. Before attempting this normalization, should we look at the ratio of intensity / cell count for all plates to see if that is more homogeneous? |
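One way to make "more homogeneous" concrete is to compare the coefficient of variation (standard deviation divided by mean) of the per-plate mean intensity with that of the intensity / cell count ratio. The sketch below assumes one mean intensity and one cell count per plate. The function names and the choice of coefficient of variation are assumptions.

```python
import numpy as np


def coefficient_of_variation(values):
    """Standard deviation divided by mean; lower means more homogeneous."""
    values = np.asarray(values, dtype=float)
    return float(values.std() / values.mean())


def compare_homogeneity(mean_intensity, cell_count):
    """Compare spread of raw intensity with intensity per cell across plates."""
    mean_intensity = np.asarray(mean_intensity, dtype=float)
    cell_count = np.asarray(cell_count, dtype=float)
    ratio = mean_intensity / cell_count  # assumes no plate has zero cells
    return {
        "intensity_cv": coefficient_of_variation(mean_intensity),
        "ratio_cv": coefficient_of_variation(ratio),
    }
```

If `ratio_cv` is clearly smaller than `intensity_cv`, cell count explains much of the plate-to-plate intensity variation and normalizing by cell count looks promising.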
Here is the distribution of cell count over 244 plates:
[Figure: cell count]
The distribution of mean intensity of DMSO over 244 plates:
[Figure: mean intensity of DMSO]
The ratio of intensity / cell count over 244 plates:
[Figure: intensity / cell count]
The averages of ratios across all plates look more consistent than the intensity plot. The distribution of outliers matches the pattern we have seen in the intensity plot. |
We have downloaded all the plates. Here are the new bias inspection plots for all plates. As expected, the batch effects still exist. The plots also reveal some problematic plates.
[Figures: Mean intensity; Mean intensity of DMSO; Cell number] |
I have checked and re-downloaded the corrupted mega data files. Even though we can technically download 406 plates, only 349 entries are listed on the provided …
[Figures: Mean Intensity; Mean Intensity of DMSO]
Comments
|
Closed by #9 |
We can discuss whether this is an appropriate time to try a large-scale run over all the images. We could see if our conclusions about how few compounds affect the cells hold once we look at all of the compounds. This would also give us an initial estimate of how long it takes to do any processing on the entire dataset.