rnaseq_preprocess is a Nextflow pipeline for RNA-seq quantification with salmon. The processing steps are fastqc first, optional trimming with seqtk, then quantification with salmon, aggregation to gene level with tximport, and a small summary report with MultiQC. Multiple fastq files per sample are supported; these technical replicates are merged prior to quantification. The input is a set of fastq files, provided via a samplesheet:
sample,r1,r2,libtype
sampleA,/path/to/r1.fq.gz,/path/to/r2.fq.gz,A
(...and...so...on)
The sample column holds a user-chosen name for a set of fastq files. Fastq files that share the same sample entry are concatenated before quantification. The libtype column is the library type argument passed to salmon; "A" means automatic detection.
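As a small illustration of the samplesheet rules above, the sketch below writes a samplesheet in which sampleA is listed twice (two technical replicates) and then counts the fastq pairs per sample name. All file paths and the file name samplesheet.csv are placeholders chosen for this example, not files shipped with the pipeline.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so nothing in the current folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical samplesheet: sampleA has two fastq pairs, sampleB has one.
cat > samplesheet.csv <<'EOF'
sample,r1,r2,libtype
sampleA,/path/to/A_lane1_r1.fq.gz,/path/to/A_lane1_r2.fq.gz,A
sampleA,/path/to/A_lane2_r1.fq.gz,/path/to/A_lane2_r2.fq.gz,A
sampleB,/path/to/B_r1.fq.gz,/path/to/B_r2.fq.gz,A
EOF

# Count fastq pairs per sample name, skipping the header line.
# Samples with a count above 1 are the ones that would be merged.
pairs_per_sample=$(awk -F',' 'NR > 1 { n[$1]++ } END { for (s in n) print s "\t" n[s] }' samplesheet.csv | sort)
printf '%s\n' "$pairs_per_sample"
```

A check like this can be handy before launching a run, since a typo in a sample name silently creates a separate sample instead of a merged one.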
Indexing
De novo indexing is supported and assumes that a gentrome (genome-decoyed transcriptome) is to be created. For this, we need at minimum a genome fasta file, a transcriptome fasta file, and a GTF:
NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile singularity,slurm --only_idx \
--genome path/to/genome.fa.gz --txtome path/to/txtome.fa.gz --gtf path/to/foo.gtf.gz \
-with-report indexing_report.html -with-trace indexing_report.trace -bg > indexing_report.log
The indexing step must be run first and separately using the --only_idx flag.
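To make the term gentrome more concrete, here is a minimal sketch of how a decoy list and a gentrome can be assembled by hand from a genome and a transcriptome fasta. It mirrors the general decoy-aware recipe from the salmon documentation; whether the pipeline does exactly this internally is an assumption. The toy sequences and the file names decoys.txt and gentrome.fa.gz are made up for the example, and the salmon call is shown only as a comment.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Toy inputs standing in for real references (placeholder data).
printf '>chr1 toy genome\nACGTACGTAC\n>chr2\nGGGGCCCC\n' | gzip > genome.fa.gz
printf '>tx1|gene1\nACGTAC\n' | gzip > txtome.fa.gz

# Decoy names are the genome sequence names: header lines, cut at the
# first space, without the leading ">".
gunzip -c genome.fa.gz | grep '^>' | cut -d ' ' -f 1 | sed 's/^>//' > decoys.txt

# The gentrome is the transcriptome followed by the genome. Concatenated
# gzip files form a valid gzip stream, so no decompression is needed.
cat txtome.fa.gz genome.fa.gz > gentrome.fa.gz

# The index call would then look roughly like this (not executed here;
# the index name and thread count are placeholders):
# salmon index --no-version-check -t gentrome.fa.gz -d decoys.txt -i idx -p 6 --gencode

decoys=$(paste -sd, decoys.txt)
n_seqs=$(gunzip -c gentrome.fa.gz | grep -c '^>')
echo "decoys: $decoys; sequences in gentrome: $n_seqs"
```

The order matters in this recipe: transcripts come first and the decoy (genome) sequences follow, with decoys.txt naming the decoy entries.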
--only_idx: trigger the indexing process
--idx_name: name of the produced index, default idx
--idx_dir: name of the directory inside rnaseq_preprocess_results/ storing the index, default salmon_idx
--idx_additional: additional arguments to salmon index beyond the defaults, which are --no-version-check -t -d -i -p --gencode
--txtome: path to the gzipped transcriptome fasta
--genome: path to the gzipped genome fasta
--gtf: path to the gzipped GTF file
--transcript_id: name of GTF column storing transcript ID, default transcript_id
--transcript_name: name of GTF column storing transcript name, default transcript_name
--gene_id: name of GTF column storing gene ID, default gene_id
--gene_name: name of GTF column storing gene name, default gene_name
--gene_type: name of GTF column storing gene biotype, default gene_type
The indexing process requires 30GB of RAM and 6 CPUs; these values are hardcoded.
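The GTF-related options above (--transcript_id, --gene_id and so on) name the fields that link transcripts to genes. As a rough sketch of what such a mapping involves, the example below extracts a two-column transcript-to-gene table from the attribute field of a toy GTF with awk. The two-column, tab-separated layout follows what tximport expects for its tx2gene argument; the exact layout of the file the pipeline produces or expects is an assumption here, and the toy GTF records are invented.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy GTF (tab-separated, 9 columns) with one transcript and one exon record.
gtf=$(mktemp)
printf 'chr1\ttoy\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "Abc"; gene_type "protein_coding"; transcript_name "Abc-201";\n' > "$gtf"
printf 'chr1\ttoy\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$gtf"

# For every "transcript" record, split the attribute field into key/value
# pairs and keep the values of the two keys of interest.
# tid and gid play the role of --transcript_id and --gene_id.
tx2gene=$(awk -F'\t' -v tid="transcript_id" -v gid="gene_id" '
  $3 == "transcript" {
    t = "NA"; g = "NA"
    n = split($9, kv, /; */)
    for (i = 1; i <= n; i++) {
      split(kv[i], p, " ")
      gsub(/"/, "", p[2])
      if (p[1] == tid) t = p[2]
      if (p[1] == gid) g = p[2]
    }
    print t "\t" g
  }' "$gtf")
printf '%s\n' "$tx2gene"
```

Passing the key names as variables means a GTF that uses different attribute names only needs different values for tid and gid, which is the same idea as the pipeline options.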
Quantification/tximport
Quantification command line:
NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile singularity,slurm \
--idx path/to/idx/folder/ --tx2gene path/to/tx2gene.txt --samplesheet path/to/samplesheet.csv \
-with-report quant_report.html -with-trace quant_report.trace -bg > quant_report.log
--idx: path to the salmon index folder
--tx2gene: path to the tx2gene map matching transcripts to genes
--samplesheet: path to the input samplesheet
--trim_reads: logical, whether to trim reads to a fixed length
--trim_length: numeric, length for trimming
--quant_additional: additional options to salmon quant beyond --gcBias --seqBias --posBias
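To illustrate what trimming reads to a fixed length (--trim_reads with --trim_length) means for the data, the sketch below cuts every read in a toy fastq file to at most 5 bases. The pipeline itself uses seqtk for this step; the awk one-liner here is only a stand-in that shows the effect, and the assumption is that bases are kept from the start of the read. With seqtk, a comparable call might be seqtk trimfq -L 5 reads.fq, though how the pipeline invokes seqtk is not stated in this document.

```shell
#!/usr/bin/env bash
set -euo pipefail

trim_length=5   # stands in for the value given to --trim_length

# Toy fastq with one long and one short read (placeholder data).
reads=$(mktemp)
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nACG\n+\nIII\n' > "$reads"

# In a 4-line fastq record, lines 2 and 4 hold the bases and the qualities.
# Both are cut to the first trim_length characters; shorter reads are unchanged.
trimmed=$(awk -v len="$trim_length" 'NR % 2 == 0 { print substr($0, 1, len); next } { print }' "$reads")
printf '%s\n' "$trimmed"
```

Bases and quality strings are cut to the same length so that each record stays valid fastq.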
We hardcoded 30GB RAM and 6 CPUs for the quantification. On our HPC we use:
Other available options
--merge_keep: logical, whether to keep the merged fastq files
--merge_dir: folder inside the output directory to store the merged fastq files
--trim_keep: logical, whether to keep the trimmed fastq files
--trim_dir: folder inside the output directory to store the trimmed fastq files
--skip_fastqc: logical, whether to skip fastqc
--only_fastqc: logical, whether to only run fastqc and skip quantification
--skip_multiqc: logical, whether to skip multiqc
--skip_tximport: logical, whether to skip the tximport process downstream of the quantification
--fastqc_dir: folder inside the output directory to store the fastqc results
--multiqc_dir: folder inside the output directory to store the multiqc results
The output is a folder "rnaseq_preprocess_results" with self-explanatory content. See the misc folder, which contains the software versions used in the pipeline and the exact command lines. When you run the pipeline, this information is written to the pipeline_info folder of the output directory.