2017.12.10

Written as it progresses
What I have done so far today (general stuff):

Messed up my git repo, so I had to spend 2 hours just trying to fix it (the local repo was 6 commits ahead?!)
Been reading about and trying to understand pandas DataFrames.

What I have done so far today (scripting-related):

Noticed column-number mismatches between samples, because I had not realised that the way I excluded the low-quality samples still retained their full-length counts. So I am fixing this in all the scripts made so far: SampleID_mapping.py, ensembl_reference_dict.py, sum_transcripts_to_genes.py.
Now extending sum_transcripts_to_genes.py to the point where the data is split into 3 comparisons:
- Healthy <--> Steatosis,
- Steatosis <--> Non-alcoholic steatohepatitis (NASH), and
- NASH <--> Hepatocellular carcinoma (HCC).
  In hindsight, I now realise that splitting it into 3 files may not be necessary; I just need to better understand how to manipulate data frames in R.
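
The idea of keeping one table and selecting columns per comparison, instead of writing 3 files, can be sketched in pandas (the same column subsetting should be possible on an R data frame). This is only a minimal sketch: the sample names, counts, `condition_of` mapping, and `select_pair` helper are all hypothetical and not taken from the actual scripts.

```python
import pandas as pd

# Hypothetical sample sheet: sample ID -> condition (made-up IDs).
condition_of = {
    "A": "Healthy",
    "B": "Steatosis",
    "C": "NASH",
    "D": "HCC",
    "E": "Steatosis",
}

# Hypothetical gene-level counts: one row per gene, one column per sample.
counts = pd.DataFrame(
    [[5, 3, 0, 2, 4], [1, 0, 7, 9, 2]],
    index=["G1", "G2"],
    columns=["A", "B", "C", "D", "E"],
)


def select_pair(counts, condition_of, first, second):
    """Return only the columns whose sample belongs to one of two conditions."""
    keep = [s for s in counts.columns if condition_of[s] in (first, second)]
    return counts[keep]


# The three pairwise comparisons listed above.
comparisons = [("Healthy", "Steatosis"), ("Steatosis", "NASH"), ("NASH", "HCC")]
subsets = {pair: select_pair(counts, condition_of, *pair) for pair in comparisons}
```

Each entry in `subsets` is a view of the same table restricted to two conditions, so nothing has to be written to disk separately.
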
~~Until this point, I have finally managed to sum all the genes if they have the same ID. Next step is to remove the low-qual samples and sum/merge the ones that have replicates.~~ Success!! ~~Now need to find out how to write out pandas dataframe into a csv file (which I think should be very easy - I hope).~~ Sample processed data
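
For reference, the steps above (sum rows with the same gene ID, drop low-quality samples, sum replicates, write CSV) can be sketched in pandas roughly as follows. This is not the actual content of sum_transcripts_to_genes.py: the toy counts, the column names, the `low_quality` list, and the `replicate_of` mapping are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one row per transcript, one count column per sample.
counts = pd.DataFrame(
    {
        "gene_id": ["G1", "G1", "G2", "G3", "G3"],
        "S1_rep1": [10, 5, 7, 0, 2],
        "S1_rep2": [8, 4, 6, 1, 1],
        "S2": [3, 3, 9, 4, 4],
        "S3_lowqual": [0, 1, 0, 0, 0],
    },
    index=["T1", "T2", "T3", "T4", "T5"],
)

# 1. Sum all transcripts that share a gene ID.
gene_counts = counts.groupby("gene_id").sum()

# 2. Remove the low-quality samples (hypothetical list), so their
#    columns do not linger in the table.
low_quality = ["S3_lowqual"]
gene_counts = gene_counts.drop(columns=low_quality)

# 3. Merge replicates by summing the columns that map to the same sample.
#    Transposing first avoids the deprecated groupby(axis=1).
replicate_of = {"S1_rep1": "S1", "S1_rep2": "S1", "S2": "S2"}
merged = gene_counts.T.groupby(replicate_of).sum().T

# 4. Write out as CSV. With no path, to_csv() returns the text;
#    passing a file name would write the file instead.
csv_text = merged.to_csv()
```

Whether replicates should be summed or handled differently depends on the downstream analysis; summing is used here only because that is what the entry describes.
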
Obviously I then need to load it into R (which I am honestly not very proficient at) for DESeq2 analysis. Will also try to check how different the results are if processed with DESeq1; from what I have learnt so far, DESeq is much more conservative.
Created and finished the presentation. Have asked others to give comments. Not sure if we have any problems (the hardest part, i.e. preprocessing the data, is over); the timeline/workplan is still pretty much the same as originally presented in S1. Some differences or new ideas have been discussed in diary entry 2017.12.06.
Successfully loaded the data into R(studio), now writing that part in an R Markdown notebook. After loading, the data is split into 4 variables: H(ealthy), S(teatosis), N(on-alcoholic steatohepatitis), and C (hepatocellular carcinoma). And that's it for today.
