In this module we will learn:
- about the cutadapt software and its uses
- how to use the cutadapt tool for trimming adapters
- how to trim all of our samples in a for-loop
- about the MultiQC tool and its capabilities
- how to run multiQC on a remote system, transfer and view the reports locally
As a reminder, our overall differential expression workflow is shown below. In this lesson, we will go over the bold part of the workflow.
Step | Task |
---|---|
1 | Experimental Design |
2 | Biological Samples / Library Preparation |
3 | Sequence Reads |
4 | Assess Quality of Reads |
5 | Splice-aware Mapping to Genome |
6 | Count Reads Associated with Genes |
:--: | ---- |
7 | Organize project files locally |
8 | Initialize DESeq2 and fit DESeq2 model |
9 | Assess expression variance within treatment groups |
10 | Specify pairwise comparisons and test for differential expression |
11 | Generate summary figures for comparisons |
12 | Annotate differential expression result tables |
Cutadapt is a very widely used read trimming and fastq processing software, cited several thousands of times. It's written in python, and is user-friendly and reasonably fast.
It is used for removing adapter sequences, primers, and poly-A tails, for trimming based on quality thresholds, for filtering reads based on characteristics, etc.
It can operate on both FASTA and FASTQ file formats, and it supports compressed or raw inputs and outputs.
Notably, cutadapt's error-tolerant adapter trimming likely contributed greatly to its early popularity. We will use it to trim the adapters from our reads. As usual, we'll view the help page to get a sense for how to structure our command.
Cutadapt Exercise:
- View the help page of the cutadapt tool
- Construct a cutadapt command to trim the adapters from paired-end reads
- View the output of cutadapt, and verify that it's correct
- Construct commands to trim the reads for all of our samples
Click here for solution - cutadapt exercise
-
Log back in to aws instance with
ssh <username>@50.17.210.255
-
View cutadapt help page
cutadapt --help | less # Will need to type `q` to exit from `less`
-
Trim the adapters from sample_01 with cutadapt
# Need to create directory for trimmed outputs mkdir ~/analysis/trimmed cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/sample_01.trimmed.fastq.gz -p ~/analysis/trimmed/sample_01_R2.trimmed.fastq.gz ~/data/reads/sample_01_R1.fastq.gz ~/data/reads/sample_01_R2.fastq.gz
-
View cutadapt output
ls -l ~/analysis/trimmed
-
Construct commands to trim the reads for all of our samples
Note: We're re-using the same command. We can update $SAMPLE
, then press 'up' to re-run cutadapt command with newly defined variable.
SAMPLE=sample_02
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz -p ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/reads/${SAMPLE}_R1.fastq.gz ~/data/reads/${SAMPLE}_R2.fastq.gz
SAMPLE=sample_03
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz -p ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/reads/${SAMPLE}_R1.fastq.gz ~/data/reads/${SAMPLE}_R2.fastq.gz
SAMPLE=sample_04
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz -p ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/reads/${SAMPLE}_R1.fastq.gz ~/data/reads/${SAMPLE}_R2.fastq.gz
Now that we've run cutadapt and trimmed the adapters from our reads, we will quickly re-run FastQC on these trimmed read FASTQs. This will confirm that we've successfully trimmed the adapters, and we'll see that our FASTQ files are ready for sequencing.
Re-running FastQC Exercise:
- Create directory for new fastqc results
- Construct and execute FastQC command to evaluate trimmed read FASTQ files
- View the output (filenames)
Click here for solution - re-running FastQC exercise
-
Create directory for new fastqc results
mkdir ~/analysis/fastqc_trimmed
-
FastQC command to evaluate trimmed FASTQ files
fastqc -o ~/analysis/fastqc_trimmed ~/analysis/trimmed/*.fastq.gz
FastQC is an excellent tool, but it can be tedious to look at the report for each sample separately, while keeping track of what trends emerge. It would be much easier to look at all the FastQC reports compiled into a single report. MultiQC is a tool that does exactly this.
MultiQC is designed to interpret and aggregate reports from various tools and output a single report as an HTML document.
MultiQC Exercise:
- View the multiQC help page
- Construct a MultiQC command to aggregate our QC results into a single report
- Transfer the MultiQC report to personal computer using scp
- View the MultiQC report
MultiQC solution
-
View MultiQC help page
multiqc --help | less # Will need to type `q` to exit from `less`
-
MultiQC command to process our trimmed read results
multiqc --outdir ~/analysis/multiqc ~/analysis/fastqc_trimmed/
-
Log out of aws instance and use scp to transfer MultiQC report to local computer
exit # log out from remote # Now on local scp <username>@50.17.210.255:~/analysis/multiqc/multiqc_report.html ~/workshop_rsd/
-
View MultiQC report Use GUI file manager to find your ~/workshop_rsd folder Double-click multiqc_report.html (open it with an internet browser)
Opening the HTML report, we see it is organized by the same modules and each plot has all samples for which FastQC was run. We can see the report confirms that the adapters have been trimmed from our sequence.
These materials have been adapted and extended from materials created by the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.