This repository has been archived by the owner on Aug 25, 2019. It is now read-only.

2017.12.10

dmtr13 edited this page Dec 11, 2017 · 13 revisions

Written as it progresses
What I have done so far today (general stuff):

  • Messed up my git repo, so I had to spend 2 hours just trying to fix it (the local repo was 6 commits ahead?!)
  • Been reading about and trying to understand pandas DataFrames.

What I have done so far today (scripting-related):

  1. Noticed that the number of columns did not match between samples: I had not realised that the way I excluded the low-quality samples still retained their full length counts. Now fixing this in all the scripts written so far: SampleID_mapping.py, ensembl_reference_dict.py, sum_transcripts_to_genes.py.
  2. Now extending sum_transcripts_to_genes.py so that the data can be split into 3 comparisons:
    • Healthy <--> Steatosis,
    • Steatosis <--> Non-alcoholic steatohepatitis (NASH), and
    • NASH <--> Hepatocellular carcinoma (HCC).
      In hindsight, I now realise that splitting it into 3 files may not be necessary; I just need to better understand how to manipulate data frames in R.
  3. So far, I have finally managed to sum all the entries that have the same gene ID. Next step is to remove the low-quality samples and sum/merge the ones that have replicates. Success!! Now I need to find out how to write a pandas DataFrame out to a CSV file (which I think should be very easy - I hope). Sample processed data
  4. Obviously, I then need to load it into R (which I am honestly not very proficient at) for DESeq2 analysis. Will also try to check how different the results are if processed with DESeq1; from what I have learnt so far, DESeq is much more conservative.
  5. Created and finished the presentation. Have asked the others to give comments. Not sure if we have any problems (the hardest part, i.e. preprocessing the data, is over); the timeline/workplan is still pretty much the same as originally presented in S1. Some differences and new ideas have been discussed in diary entry 2017.12.06.
  6. Successfully loaded the data into R (RStudio); now writing that part up in an R Markdown notebook. After loading, the data is split into 4 variables: H(ealthy), S(teatosis), N(on-alcoholic steatohepatitis), and C (Hepatocellular carcinoma). And that's it for today.
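
For reference, the pandas steps described in items 1 to 3 and 6 (drop low-quality samples, sum entries per gene ID, merge replicates, write a CSV, split by condition) could look roughly like the sketch below. This is not the actual code in sum_transcripts_to_genes.py: the table, the column names, the `low_quality` list, the `replicate_of` mapping and the one-letter condition prefixes are all made up for illustration.

```python
import io

import pandas as pd

# Hypothetical transcript-level counts: one row per transcript, one column per
# sample, plus the gene ID each transcript belongs to. All names are invented.
counts = pd.DataFrame(
    {
        "gene_id": ["G1", "G1", "G2", "G3", "G3"],
        "H1": [10, 5, 3, 0, 7],
        "S1_rep1": [2, 1, 8, 4, 4],
        "S1_rep2": [3, 0, 2, 1, 1],
        "N1_lowqual": [9, 9, 9, 9, 9],
    },
    index=["T1", "T2", "T3", "T4", "T5"],
)

# Assumed bookkeeping: which samples to exclude, and which columns are
# replicates of the same biological sample.
low_quality = ["N1_lowqual"]
replicate_of = {"H1": "H1", "S1_rep1": "S1", "S1_rep2": "S1"}

# 1. Drop the low-quality samples entirely, so every later table has the
#    same set of columns.
kept = counts.drop(columns=low_quality)

# 2. Sum all transcripts that share a gene ID.
per_gene = kept.groupby("gene_id").sum()

# 3. Merge replicates: rename each column to its sample name, then sum the
#    columns that now share a name (done on the transpose, since groupby
#    works on rows).
per_sample = per_gene.rename(columns=replicate_of).T.groupby(level=0).sum().T

# 4. Write the result as CSV. A string buffer is used here to avoid touching
#    the disk; passing a file path instead writes a file.
buffer = io.StringIO()
per_sample.to_csv(buffer, index_label="gene_id")
csv_text = buffer.getvalue()

# 5. Split the columns by condition, assuming (hypothetically) that sample
#    names start with H, S, N or C.
by_condition = {
    code: per_sample.loc[:, per_sample.columns.str.startswith(code)]
    for code in "HSNC"
}
```

Doing the replicate merge on the transpose avoids relying on column-wise `groupby`, which newer pandas versions deprecate. The resulting CSV can then be read in R with `read.csv`.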