This is a step-by-step guide on how to reproduce exactly the analysis performed in the paper.
Note #1: To run all steps, you can type `bash run_all_steps.sh` at the command line while inside this directory. This script presumes that you've already run Fatiscan and FatiGO. If you do not have the `fatiscan_output` and `fatigo_output` directories, the script currently breaks at step 7 or step 9, respectively.
Note #2: Each step's bash script assumes that it is run from its own directory. The scripts will break otherwise.
Note #3: The annotations and the TCGA data are not included with this repository because of their size (several GB total). This pipeline assumes that you do step 1a and 1b.
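Because each step script expects to be started from its own directory (Note #2), one way to call a step from elsewhere is to wrap it in a subshell that changes directory first. The helper below is a minimal sketch; `run_in_own_dir` is a hypothetical name and is not part of this repository.

```shell
# Hypothetical helper (not part of the repository): run a script from the
# directory that contains it, without changing the caller's directory.
run_in_own_dir() {
    script="$1"
    # The parentheses start a subshell, so the cd only affects this one call.
    ( cd "$(dirname "$script")" && bash "$(basename "$script")" )
}

# Example call (the path is illustrative only):
# run_in_own_dir path/to/step_3.sh
```

The subshell means the calling shell stays in its original directory even if the step fails partway through.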
- run `bash step_0.sh` to install the necessary packages (use `sudo bash step_0.sh` to install system-wide)
- this presumes that R is already installed: to run this particular analysis, you need to install R-3.1.3
  - If you need to install it, follow these instructions:
    - download R-3.1.3 from here: https://cran.r-project.org/src/base/R-3/R-3.1.3.tar.gz
    - run `tar -xzf R-3.1.3.tar.gz` in the directory with the downloaded tarball.
    - navigate into the `R-3.1.3` directory, and then run either of the following sets of commands:
      - if you have no other versions of R, run `./configure`, then `make`, then `make install`
      - if you have other versions of R, run `./configure --prefix=/some/other/path` before running `make` and `make install`
        - be sure to use the right path to Rscript in the following steps
Note: you may need to change the top line of the `install_packages.R` script to reflect the proper path to Rscript. If there is an error, you can check the path by typing `which Rscript` at the command line.
Note: currently, you will need to change the directory paths at the beginning of each bash script to run this on your own machine.
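The side-by-side R installation described above can be sketched as a single function. This is only an illustration: `PREFIX` and `build_r` are hypothetical names, the default prefix is an arbitrary choice, and the function is defined but not called, so nothing is built until you invoke it yourself in the directory holding the tarball.

```shell
# Hypothetical install location; any writable directory would do.
PREFIX="${PREFIX:-$HOME/opt/R-3.1.3}"

# Build R-3.1.3 into $PREFIX so it does not clash with other R versions.
# Run this from the directory that contains R-3.1.3.tar.gz.
build_r() {
    (
        tar -xzf R-3.1.3.tar.gz &&
        cd R-3.1.3 &&
        ./configure --prefix="$PREFIX" &&
        make &&
        make install
    )
}

# With a custom prefix, Rscript is installed under $PREFIX/bin.
RSCRIPT="$PREFIX/bin/Rscript"
echo "Rscript path to use in later steps: $RSCRIPT"
```

The printed path is the one to compare against the top line of `install_packages.R` and against the output of `which Rscript`.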
- LUAD: 441 (19 control samples, 422 tumor samples)
  - use `luad_file_manifest.txt` with the GDC data transfer tool to download the same files yourself
- LUSC: 367 (37 control samples, 330 tumor samples)
  - use `lusc_file_manifest.txt` with the GDC data transfer tool to download the same files yourself
The easiest way to do this is to follow the steps in the `tcga_file_manifests` directory using the GDC data transfer tool.
If you download the data manually, please keep in mind the following note:
NOTE: For the downstream steps to work without modification, you must rename the top-level directories for this data as `LUAD_data` and `LUSC_data`, respectively, and place these directories in this repository.
If you wish to use different names, or place the directories elsewhere on your computer, you need to modify the `LUAD_DIR` and `LUSC_DIR` variables at the top of the following scripts: `step_2c.sh`, `step_3.sh`, `step_4.sh`, `step_5.sh`, `step_7.sh`, and `step_9.sh`
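Before editing `LUAD_DIR` and `LUSC_DIR`, it can help to list where they are currently set. The sketch below searches a checkout for the six scripts named above and prints the matching lines. It assumes the variables are assigned in the plain `NAME=value` form at the start of a line, which has not been verified against the scripts; `show_dir_vars` is a hypothetical helper name.

```shell
# Scripts named above as defining LUAD_DIR and LUSC_DIR.
STEP_SCRIPTS="step_2c.sh step_3.sh step_4.sh step_5.sh step_7.sh step_9.sh"

# Print file:line:assignment for every LUAD_DIR/LUSC_DIR assignment found.
# Assumption: assignments look like NAME=value at the start of a line.
show_dir_vars() {
    repo_dir="${1:-.}"
    for s in $STEP_SCRIPTS; do
        # find locates the script wherever it sits below repo_dir;
        # grep -Hn prefixes each hit with the file name and line number.
        find "$repo_dir" -type f -name "$s" \
            -exec grep -HnE '^(LUAD_DIR|LUSC_DIR)=' {} +
    done
    return 0
}

# Example: show_dir_vars /path/to/this/repository
```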
- RNA: UCSC, hg19, June 2011; miRNA: miRBase v21 and miRanda 08/2010 release
- run `step_1b.sh` to do two things:
  - download UCSC tables, the miRNA annotations and initial miRanda miRNA-target predictions
  - generate the table that matches pathway IDs to intelligible pathway names

  NOTE: this download takes up approximately 880MB of space
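Since the step 1b download needs roughly 880MB, a quick free-space check beforehand can save a failed run. The following is a minimal sketch; `has_free_kb` is a hypothetical helper, and the threshold simply restates the approximate figure above in kilobytes.

```shell
# Succeed if the filesystem holding directory $2 (default: .) has at
# least $1 kilobytes available.
has_free_kb() {
    need_kb="$1"
    dir="${2:-.}"
    # df -P gives one line per filesystem in the POSIX layout, and -k
    # reports 1024-byte units; column 4 is the available space.
    avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
    [ "$avail_kb" -ge "$need_kb" ]
}

# 880 MB expressed in KB (approximate figure from the note above).
if has_free_kb $((880 * 1024)); then
    echo "enough free space for the step 1b downloads"
else
    echo "less than ~880 MB free; make room before running step_1b.sh"
fi
```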
- run `step_2a.sh` to prepare the UCSC annotations to facilitate downstream analyses
  Output: a combined kgXref table to facilitate mapping UCSC IDs to gene names
- run `step_2b.sh`
  Outputs:
  - a list of all miRNAs (and accessions) for both the target matrix and miRBase v21
  - a list of all targets
  - a matrix of interactions (1 if yes, 0 if no)
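To make the last output concrete, here is a toy illustration of a 1/0 interaction matrix built from miRNA-target pairs. It is not the code in `step_2b.sh`: the input format (one tab-separated `miRNA target` pair per line), the orientation (miRNAs as rows, targets as columns), and the name `pairs_to_matrix` are all assumptions made for the example.

```shell
# Turn tab-separated "miRNA<TAB>target" pairs on stdin into a 0/1 matrix.
# Assumed layout: one row per miRNA, one column per target.
pairs_to_matrix() {
    awk -F'\t' '
        {
            # Remember first-seen order of miRNAs and targets.
            if (!($1 in mi)) { mi[$1] = 1; mi_order[++nm] = $1 }
            if (!($2 in tg)) { tg[$2] = 1; tg_order[++nt] = $2 }
            hit[$1, $2] = 1
        }
        END {
            printf "miRNA"
            for (j = 1; j <= nt; j++) printf "\t%s", tg_order[j]
            printf "\n"
            for (i = 1; i <= nm; i++) {
                printf "%s", mi_order[i]
                for (j = 1; j <= nt; j++) {
                    # 1 if this miRNA-target pair was seen, otherwise 0.
                    v = ((mi_order[i], tg_order[j]) in hit) ? 1 : 0
                    printf "\t%d", v
                }
                printf "\n"
            }
        }'
}

# Made-up pairs, purely for illustration.
printf 'miR-1\tgeneA\nmiR-1\tgeneB\nmiR-2\tgeneB\n' | pairs_to_matrix
```

With the made-up pairs above, the `miR-2` row has a 0 under `geneA` and a 1 under `geneB`.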
- run `step_2c.sh`
  - this automatically runs on both the LUAD and LUSC directories

  Outputs:
  - the miRNA count matrix ready for step 3 is in the `(LUAD|LUSC)_data` directory
  - miRNA counts ready for step 4 are in the `(LUAD|LUSC)_data/miRNASeq/BCGSC__50/Level_3/compressed_miRNAcounts` directory
  - RNA normalized isoform counts ready for steps 3 and 4 are in the `(LUAD|LUSC)_data/RNASeqV2/UNC__58/Level_3/TCGA_isoform_normalized_results` directory
- run `step_3.sh`
  Output: DESeq2 results are in the `(LUAD|LUSC)_data/deseq_results` directories. The `*Rank.txt` file is the one used in step 6 for the Fatiscan analysis.
- run `step_4.sh`
  Output: the ProMISe RData will be stored in a directory called `output`
- run `step_5.sh` to process the ProMISe results
  Output: this produces a GMT gene sets file with transcripts, and an extended gene set file. The latter is used for step 6, along with the rank file generated by step 3.
This requires manual manipulation of files
- Make an account on Babelomics v4 (http://v4.babelomics.org)
- Upload the extended miRNA-target gene set file with data type `Annotation > Extended Annotation`
- Upload the DESeq2 rank list of transcripts with data type `ID List > Ranked`
- Choose `Functional Analysis`, then `Gene Set Analysis` under `Set Enrichment Analysis`
- Choose the rank list for input data; choose `fatiscan` for test, and choose `two-tailed`
- Choose `your own annotations` and select the extended list
Precomputed results from Fatiscan are found in the `fatiscan_output` directory.
- first download the results from the Babelomics website. On the bottom of the results page, click `Download Job`
- move the resulting job folder to the `(LUAD|LUSC)_data` directory and rename it `fatiscan_output`
- run `step_7.sh`
  Output: a `post_babelomics_results` directory with processed results for up-regulated and down-regulated genes separately
  - in these directories (specifically, `downregulatedGenes/up-regulated_miRNAs/` and `upregulatedGenes/down-regulated_miRNAs`), you'll find `*FatigoInput.txt` lists of the genes we'll use as input for FatiGO in step 8.
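The move-and-rename of the downloaded Fatiscan job folder, which has to happen before `step_7.sh` runs, can be sketched as a small helper. `place_fatiscan_job` is a hypothetical name, and the sketch assumes the download is already an unpacked folder; it refuses to overwrite an existing `fatiscan_output`.

```shell
# Hypothetical helper: move a downloaded Babelomics job folder into a data
# directory under the name fatiscan_output, without overwriting anything.
place_fatiscan_job() {
    job_dir="$1"     # folder obtained via "Download Job"
    data_dir="$2"    # e.g. LUAD_data or LUSC_data
    if [ ! -d "$job_dir" ] || [ ! -d "$data_dir" ]; then
        echo "not a directory: $job_dir or $data_dir" >&2
        return 1
    fi
    if [ -e "$data_dir/fatiscan_output" ]; then
        echo "$data_dir/fatiscan_output already exists" >&2
        return 1
    fi
    mv "$job_dir" "$data_dir/fatiscan_output"
}

# Example (the job folder name is illustrative only):
# place_fatiscan_job ~/Downloads/some_job_folder LUAD_data
```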
Like step 6, this also requires manual input of data
- Upload the up-regulated and down-regulated list of mRNA targets as `ID List > Gene`
- Choose `Functional Analysis`, then `FatiGO` under `Single Enrichment Analysis`
- Choose `Id list vs Rest of Genome`
- Choose the up- or down-regulated list (run separate analysis for each)
- `Remove duplicates?` should be set to `Remove from list2 those appearing in list1 (complementary list)`
- For databases, choose `human` for organism, and check boxes for `GO biological process`, `GO cellular component`, `GO molecular function`, `KEGG pathways`, `Reactome`, and `Biocarta`.

Note: For this step, if you do both LUSC and LUAD, you'll run four separate jobs.
Precomputed results from this step are found in the `fatigo_output` directory.
- first download the results from the Babelomics website. Like before, you click on `Download job`
- in the `(LUAD|LUSC)_data` directory, create a directory called `fatigo_out`
- move the resulting job folder to `fatigo_out` and rename it either `upGene` or `downGene`, depending on whether the FatiGO job focused on up-regulated or down-regulated genes.
- run `step_9.sh`
  Output: PDF files of the hive plots.
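The FatiGO job folders have to be in place before `step_9.sh` is run. Their placement, as described in the list above, can be sketched as follows. `place_fatigo_job` is a hypothetical helper that creates `fatigo_out` if needed and accepts only the two folder names the instructions mention.

```shell
# Hypothetical helper: file a downloaded FatiGO job folder under
# <data_dir>/fatigo_out/<upGene|downGene>.
place_fatigo_job() {
    job_dir="$1"      # folder obtained via "Download job"
    data_dir="$2"     # e.g. LUAD_data or LUSC_data
    direction="$3"    # upGene or downGene
    case "$direction" in
        upGene|downGene) ;;
        *) echo "direction must be upGene or downGene" >&2; return 1 ;;
    esac
    if [ ! -d "$job_dir" ]; then
        echo "not a directory: $job_dir" >&2
        return 1
    fi
    if [ -e "$data_dir/fatigo_out/$direction" ]; then
        echo "$data_dir/fatigo_out/$direction already exists" >&2
        return 1
    fi
    mkdir -p "$data_dir/fatigo_out" &&
    mv "$job_dir" "$data_dir/fatigo_out/$direction"
}

# Example (job folder names are illustrative only):
# place_fatigo_job ~/Downloads/luad_up_job   LUAD_data upGene
# place_fatigo_job ~/Downloads/luad_down_job LUAD_data downGene
```

Running both LUAD and LUSC would mean four such calls, one per job.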
Note: You'll have to manually add in annotations afterward using the tables and Adobe, Preview, or PowerPoint.
Note 2: If there are any issues with reproducing this analysis, please submit an issue on GitHub.