Preprocessing of RNA-seq data using salmon and tximport

ATpoint/rnaseq_preprocess

rnaseq_preprocess

Introduction

rnaseq_preprocess is a Nextflow pipeline for RNA-seq quantification with salmon. The processing steps are, in order: quality control with fastqc, optional trimming with seqtk, quantification with salmon, aggregation to gene level with tximport, and a small summary report with MultiQC. Multiple fastq files per sample are supported; these technical replicates are merged prior to quantification. The input is a set of fastq files, provided via a samplesheet:

sample,r1,r2,libtype
sampleA,/path/to/r1.fq.gz,/path/to/r2.fq.gz,A
(...and...so...on)

Sample is a user-chosen name for this set of fastq files. Fastq files with the same sample entry are concatenated before quantification. Libtype is the library type argument from salmon. "A" means automatic detection.
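
To illustrate how rows sharing a sample entry end up as one unit, here is a minimal Python sketch that groups a samplesheet by its sample column. The file paths and the function name group_by_sample are hypothetical and only serve the example; the pipeline's own implementation is not shown here and may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical samplesheet: sampleA was sequenced on two lanes.
SAMPLESHEET = """sample,r1,r2,libtype
sampleA,/path/to/A_L1_r1.fq.gz,/path/to/A_L1_r2.fq.gz,A
sampleA,/path/to/A_L2_r1.fq.gz,/path/to/A_L2_r2.fq.gz,A
sampleB,/path/to/B_r1.fq.gz,/path/to/B_r2.fq.gz,A
"""


def group_by_sample(handle):
    """Collect the r1/r2 files and libtype values of every sample, in file order."""
    groups = defaultdict(lambda: {"r1": [], "r2": [], "libtype": set()})
    for row in csv.DictReader(handle):
        entry = groups[row["sample"]]
        entry["r1"].append(row["r1"])
        entry["r2"].append(row["r2"])
        entry["libtype"].add(row["libtype"])
    return dict(groups)


groups = group_by_sample(io.StringIO(SAMPLESHEET))
for sample, entry in groups.items():
    print(sample, len(entry["r1"]), "r1 file(s)", sorted(entry["libtype"]))
```

In this sketch the two sampleA rows form one group whose r1 and r2 lists would each be concatenated before quantification, while sampleB stays a single pair.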

Details

Indexing

De novo indexing is supported and assumes that a gentrome (genome-decoyed transcriptome) is to be created. For this at minimum we need a genome and transcriptome fasta file and a GTF:

NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile singularity,slurm --only_idx \
    --genome path/to/genome.fa.gz --txtome path/to/txtome.fa.gz --gtf path/to/foo.gtf.gz \
    -with-report indexing_report.html -with-trace indexing_report.trace -bg > indexing_report.log

The indexing step must be run first and separately using the --only_idx flag.
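
As a rough illustration of what a gentrome is, the Python sketch below follows the general decoy approach described in the salmon documentation: the genome sequence names become the decoy list, and the genome records are appended after the transcriptome records. The function names and the toy sequences are hypothetical, and the pipeline's own indexing code may do this differently.

```python
def decoy_names(genome_fasta_lines):
    """Return the sequence names (first word of each header) of a genome FASTA."""
    return [
        line[1:].split()[0]
        for line in genome_fasta_lines
        if line.startswith(">")
    ]


def build_gentrome(txtome_lines, genome_lines):
    """Place transcript records first and genome (decoy) records after them."""
    return list(txtome_lines) + list(genome_lines)


# Toy data, for illustration only.
TXTOME = [">T1 gene=G1", "ACGTACGT"]
GENOME = [">chr1 toy chromosome", "ACGTACGTACGTACGT", ">chr2", "TTTTGGGG"]

decoys = decoy_names(GENOME)
gentrome = build_gentrome(TXTOME, GENOME)
print(decoys)
print(len(gentrome), "lines in the gentrome")
```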

--only_idx: trigger the indexing process
--idx_name: name of the produced index, default idx
--idx_dir: name of the directory inside rnaseq_preprocess_results/ storing the index, default salmon_idx
--idx_additional: additional arguments to salmon index beyond the defaults which are --no-version-check -t -d -i -p --gencode
--txtome: path to the gzipped transcriptome fasta
--genome: path to the gzipped genome fasta
--gtf: path to the gzipped GTF file
--transcript_id: name of GTF column storing transcript ID, default transcript_id
--transcript_name: name of GTF column storing transcript name, default transcript_name
--gene_id: name of GTF column storing gene ID, default gene_id
--gene_name: name of GTF column storing gene name, default gene_name
--gene_type: name of GTF column storing gene biotype, default gene_type
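
The five GTF-related options above name keys in the attribute field (the ninth tab-separated field) of the GTF. A minimal Python sketch of looking up those keys in a single line is shown below; the example record and the function name parse_attributes are hypothetical, and this is not necessarily how the pipeline itself parses the GTF.

```python
import re

# Matches `key "value";` pairs in the ninth (attribute) field of a GTF line.
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(gtf_line):
    """Return the quoted attribute key/value pairs of one GTF line as a dict."""
    fields = gtf_line.rstrip("\n").split("\t")
    return dict(ATTRIBUTE.findall(fields[8]))


# Hypothetical transcript record, for illustration only.
LINE = "\t".join([
    "chr1", "SOURCE", "transcript", "100", "900", ".", "+", ".",
    'gene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; '
    'gene_name "GeneOne"; transcript_name "GeneOne-201";',
])

attrs = parse_attributes(LINE)
# Keys correspond to the defaults of the options listed above.
wanted = ["transcript_id", "transcript_name", "gene_id", "gene_name", "gene_type"]
record = {key: attrs.get(key) for key in wanted}
print(record)
```

If an annotation uses different key names (for example a biotype key other than gene_type), the corresponding option would be set to that name instead of the default.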

The indexing process is hardcoded to request 30GB of RAM and 6 CPUs.

Quantification/tximport

Quantification command line:

NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile singularity,slurm \
    --idx path/to/idx/folder/ --tx2gene path/to/tx2gene.txt --samplesheet path/to/samplesheet.csv \
    -with-report quant_report.html -with-trace quant_report.trace -bg > quant_report.log

--idx: path to the salmon index folder
--tx2gene: path to the tx2gene map matching transcripts to genes
--samplesheet: path to the input samplesheet
--trim_reads: logical, whether to trim reads to a fixed length
--trim_length: numeric, length for trimming
--quant_additional: additional options to salmon quant beyond --gcBias --seqBias --posBias
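
To make the effect of --trim_reads and --trim_length concrete, the following Python sketch truncates FASTQ records to a fixed length. It assumes that trimming keeps the first N bases of each read and leaves shorter reads unchanged; the pipeline performs trimming with seqtk, and its exact invocation is not shown here, so treat this as a conceptual illustration with hypothetical function names.

```python
def trim_record(record, length):
    """Truncate the sequence and quality of a 4-line FASTQ record to `length`."""
    header, seq, plus, qual = record
    return (header, seq[:length], plus, qual[:length])


def trim_fastq(lines, length):
    """Yield trimmed 4-line records from an iterable of FASTQ lines."""
    stripped = [line.rstrip("\n") for line in lines]
    for i in range(0, len(stripped), 4):
        yield trim_record(tuple(stripped[i:i + 4]), length)


# Toy reads, for illustration only.
READS = [
    "@read1", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@read2", "ACGT", "+", "IIII",
]

trimmed = list(trim_fastq(READS, 6))
for header, seq, _, qual in trimmed:
    print(header, seq, qual)
```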

We hardcoded 30GB RAM and 6 CPUs for the quantification. On our HPC we use:

Other available options

--merge_keep: logical, whether to keep the merged fastq files
--merge_dir: folder inside the output directory to store the merged fastq files
--trim_keep: logical, whether to keep the trimmed fastq files
--trim_dir: folder inside the output directory to store the trimmed fastq files
--skip_fastqc: logical, whether to skip fastqc
--only_fastqc: logical, whether to only run fastqc and skip quantification
--skip_multiqc: logical, whether to skip multiqc
--skip_tximport: logical, whether to skip the tximport process downstream of the quantification
--fastqc_dir: folder inside the output directory to store the fastqc results
--multiqc_dir: folder inside the output directory to store the multiqc results

Output is a folder "rnaseq_preprocess_results" with self-explanatory content. See the misc folder, which contains the software versions used in the pipeline and the exact command lines. When the pipeline is run, this information is written to the pipeline_info folder of the output directory.
