This repository has been archived by the owner on Aug 25, 2019. It is now read-only.
2017.12.10
dmtr13 edited this page Dec 11, 2017 · 13 revisions
Written as it progresses
What I have done so far today (general stuff):
- Messed up my git, so had to spend 2 hours just trying to fix it (local repo was 6 commits ahead?!)
- Been reading about and trying to understand pandas DataFrames.
What I have done so far today (scripting-related):
- Noticed that the number of columns did not match between samples: the way I had excluded the low-qual samples still retained their full-length counts. Now fixing this in all the scripts written so far: SampleID_mapping.py, ensembl_reference_dict.py, sum_transcripts_to_genes.py.
- Now extending sum_transcripts_to_genes.py so that I can get to the point where the data is split into 3 comparisons:
- Healthy <--> Steatosis,
- Steatosis <--> Non-alcoholic steatohepatitis (NASH), and
- NASH <--> Hepatocellular carcinoma (HCC).
In hindsight, I now realise that splitting it into 3 files may not be necessary; I just need to better understand how to manipulate data frames in R.
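The idea of keeping one table instead of three files can be sketched on the Python side as well. This is only a minimal illustration, not the project's actual script: the sample names, counts, the `condition` mapping and the `select_pair` helper are all made up for the example. Each pairwise comparison is just a column selection driven by the sample labels.

```python
import pandas as pd

# Hypothetical gene-by-sample count table (names and values are invented).
counts = pd.DataFrame(
    {"s1": [10, 0], "s2": [12, 3], "s3": [7, 5], "s4": [1, 9]},
    index=["geneA", "geneB"],
)

# Hypothetical mapping from sample ID to condition.
condition = pd.Series(
    {"s1": "Healthy", "s2": "Steatosis", "s3": "NASH", "s4": "HCC"}
)


def select_pair(counts, condition, a, b):
    """Return the count columns and labels of samples in condition a or b."""
    keep = condition[condition.isin([a, b])]
    return counts.loc[:, keep.index], keep


# The three comparisons from the list above, taken from the same table.
pairs = [("Healthy", "Steatosis"), ("Steatosis", "NASH"), ("NASH", "HCC")]
subsets = {pair: select_pair(counts, condition, *pair) for pair in pairs}
```

Because every subset is derived from the same `counts` table, a fix to the table (for example, dropping a low-qual sample) carries over to all three comparisons without regenerating files.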
- Up to this point, I have finally managed to sum all the rows that have the same gene ID. Next step is to remove the low-qual samples and sum/merge the ones that have replicates. Success!! Now need to find out how to write out a pandas DataFrame into a CSV file (which I think should be very easy, I hope). Sample processed data. Obviously I then need to load it into R (which I honestly am not very proficient at) for DESeq2 analysis. I will also try to check how different the result is if processed with DESeq1; from what I have learnt so far, DESeq is much more conservative.
- Created and finished the presentation. Have asked others to give comments. Not sure if we have any problems (the hardest part, i.e. preprocessing data, is over); the timeline/workplan is still pretty much the same as originally presented in S1. Some differences or new ideas have been discussed in diary entry 2017.12.06.
- Successfully loaded the data into R (RStudio), now writing that part in an R Markdown notebook. After loading, the data is split into 4 variables: H(ealthy), S(teatosis), N(on-alcoholic steatohepatitis), and C (hepatocellular carcinoma). And that's it for today.
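The preprocessing steps described in this entry (sum rows per gene ID, drop low-qual samples, merge replicates, write a CSV) could look roughly like the sketch below. It is not the actual sum_transcripts_to_genes.py: the table contents, the `low_qual` list and the `_rep` naming scheme for replicates are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical transcript-level counts (IDs, sample names, values are invented).
tx = pd.DataFrame(
    {
        "gene_id": ["g1", "g1", "g2"],
        "A_rep1": [5, 3, 2],
        "A_rep2": [1, 1, 4],
        "B_rep1": [0, 2, 7],
        "bad_rep1": [9, 9, 9],
    },
    index=["t1", "t2", "t3"],
)

# 1. Sum the transcript rows that share a gene ID.
genes = tx.groupby("gene_id").sum()

# 2. Drop the low-qual samples as whole columns, so no leftover counts remain.
low_qual = ["bad_rep1"]  # hypothetical list of excluded samples
genes = genes.drop(columns=low_qual)

# 3. Sum replicate columns. Assumption: the sample name is the part of the
#    column name before "_rep".
merged = genes.T.groupby(lambda name: name.split("_rep")[0]).sum().T

# 4. Write the result as CSV. A text buffer is used here to keep the sketch
#    self-contained; a file path can be passed to to_csv instead.
buf = io.StringIO()
merged.to_csv(buf)
csv_text = buf.getvalue()
```

Dropping the excluded samples by column name before any merging keeps the column count identical for every downstream step, which is the mismatch noted earlier in this entry.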
This makes me hungry. (C)2017-2018 • DMTR13