Pipeline for taking STAR's SJ.out files and parsing the counts for a given BED file of named splice junctions.
This fork deviates from the original repository in the following ways:
- Inputs and parameters are specified in a config YAML file
- Software dependencies are managed using a conda environment
- Only spliced read counting/extraction from STAR SJ.out files is performed. If you wish to calculate Percent Spliced In (PSI) values with respect to annotation, see the parent repository.
## Dependencies

- Snakemake (version 7.8.2 has been most recently tested, but earlier versions should also work)
- `cat` and `awk` (usually satisfied with a bash installation)
- `<sample>.SJ.out.tab` files generated by STAR
- BED file of splice junction coordinates of interest (see [Input BED file specifications](#input-bed-file-specifications) for the required coordinate convention)

Any remaining software dependencies are installed automatically via the pipeline's conda environment.
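For context on the coordinate convention discussed below: each line of a STAR `SJ.out.tab` file is tab-separated and, per the STAR manual, reports the intron in 1-based, fully inclusive coordinates. The columns are: chromosome, intron start (first base of intron), intron end (last base of intron), strand (0: undefined, 1: +, 2: -), intron motif, annotation status, uniquely-mapping read count, multi-mapping read count, and maximum spliced alignment overhang. A representative line (values hypothetical):

```
chr19	7168095	7170536	2	2	1	49	3	38
```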
## Input BED file specifications

The coordinate format is slightly non-conventional, owing to differences in how STAR defines intron/SJ coordinates and how they are processed by the pipeline's custom scripts. The simplest way (in my opinion) to generate coordinates that will work with this pipeline is as follows:
- Assume you are generating BED coordinates of the intron sequence i.e. the interval stretches from the first base of the intron up to and including the last base of the intron.
- Add 1 to all 'End' (3rd field) values irrespective of strand
It is also highly recommended to add informative, unique identifiers to the 'Name' (4th) field of each junction in the input BED file. This field will be appended to the final output BED file alongside the coordinates, sample name and spliced read counts. The 'Score' (5th) field of the input BED file is ignored, but should be populated to maintain consistency with the BED6 format.
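As a worked example (junction name and coordinates hypothetical): suppose an intron whose first base sits at 1-based position 1001 and whose last base at position 2000. Standard BED coordinates for the intron sequence would be `start=1000`, `end=2000` (0-based, half-open); after adding 1 to the End field, the line supplied to this pipeline becomes:

```
chr1	1000	2001	MYGENE_intron1	0	+
```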
## Running the pipeline

```bash
git clone https://github.com/SamBryce-Smith/bedops_parse_star_junctions.git
cd bedops_parse_star_junctions
```
Before running the pipeline, make sure to update the `config.yaml` file with your run-specific information (see the comments in `config.yaml` for more details):
- `project_dir`: Top-level directory where pipeline outputs will be stored.
- `out_spot`: Subdirectory under `project_dir` where sorted BEDs and output will appear.
- `bam_spot`: Folder containing the BAM/SJ.out files. The pipeline uses wildcards to match samples in this folder.
- `pt1_sj_suffix`: Suffix of the STAR splice junction tables, used for sample name extraction/definition.
- `bed_file`: Path to the BED file of junctions you want to quantify.
- `final_output_name`: Name for your final output BED file.
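For illustration, a filled-in `config.yaml` might look like the sketch below (all paths and values are hypothetical placeholders; defer to the comments in the repository's own `config.yaml` for the authoritative format):

```yaml
project_dir: /path/to/my_project      # top-level output directory
out_spot: junction_counts             # created under project_dir
bam_spot: /path/to/star_alignments    # folder containing <sample>.SJ.out.tab files
pt1_sj_suffix: .SJ.out.tab            # stripped from filenames to infer sample names
bed_file: /path/to/junctions_of_interest.bed
final_output_name: my_junctions
```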
Before running the pipeline, it is recommended to perform a dry run to ensure everything is set up correctly. If you haven't updated the config with your own inputs and just want to check that the pipeline is structured correctly, you can generate empty input files matching the base config file with the `create_dryrun_data.sh` script:

```bash
bash create_dryrun_data.sh
```
Either way, you can perform a dry run with the following command:

```bash
snakemake -n -p -s parse_star_junctions.smk --use-conda
```
If everything looks good with the dry run, you can minimally execute the pipeline locally with the following command (note that the `-n` dry-run flag is dropped):

```bash
snakemake -p -s parse_star_junctions.smk --use-conda --cores <number_of_cores>
```

replacing `<number_of_cores>` with your desired number of cores if you wish to run in parallel. Please note that the first time you run the pipeline, the conda environment will be installed, which can be a little slow.
If you wish to submit the pipeline to the UCL Computer Science cluster, please see the `submit_parse.sh` submission script:

```bash
bash submit_parse.sh
```
For submission to other compute systems, we recommend studying the Snakemake profiles documentation (GitHub, docs).
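As a rough sketch (the profile name `myprofile` is a hypothetical placeholder for a profile you would create following that documentation), a profile-driven run could look like:

```bash
snakemake -p -s parse_star_junctions.smk --use-conda --profile myprofile
```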
## Output

`{project_dir}/{out_spot}/{final_output_name}.aggregated.clean.annotated.bed` - Main output BED-like file

This contains junctions from the input BED file and the associated counts in each sample, e.g.:
| chromosome | start | end | filename_this_count_comes_from | count | strand | name_of_junction_in_your_input |
| --- | --- | --- | --- | --- | --- | --- |
| chr19 | 7168094 | 7170537 | Cont-B_S2.SJ.out | 49 | - | INSR_annotated |
| chr19 | 7168094 | 7170537 | Cont-C_S3.SJ.out | 30 | - | INSR_annotated |
| chr19 | 7168094 | 7170537 | Cont-D_S4.SJ.out | 35 | - | INSR_annotated |
| chr19 | 7168094 | 7170537 | control_fluorescent_2.SJ.out | 9 | - | INSR_annotated |
| chr19 | 7168094 | 7170537 | control_fluorescent_3.SJ.out | 5 | - | INSR_annotated |
| chr19 | 7168094 | 7170537 | control_none_1.SJ.out | 20 | - | INSR_annotated |
In effect, this is a standard BED file with the 'Name' field from the input BED file appended as a 7th field. Note that the header row above is for illustration only and is not included in the file.
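Because the output is plain tab-separated text, it is straightforward to post-process with standard command-line tools. For example, a hypothetical one-liner (assuming `final_output_name: my_junctions`, as in the config sketch above) to sum spliced read counts per junction across all samples:

```bash
# Field 5 holds the spliced read count, field 7 the junction name from the input BED.
awk -F'\t' '{ sum[$7] += $5 } END { for (j in sum) print j "\t" sum[j] }' \
    my_junctions.aggregated.clean.annotated.bed
```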
`{project_dir}/{out_spot}/{final_output_name}.aggregated.bed` - All junctions that overlap the junctions in the input BED file

Useful as a sanity check to determine whether junctions missing from the main output file are absent due to no expression or due to, e.g., one-off coordinate errors.

Here, the 'Score' field (5th) corresponds to the read count for the given sample, and the 'Name' field (4th) corresponds to the inferred sample name (essentially the filename with `pt1_sj_suffix` removed). A single file is generated for each sample inferred from the input directory.
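For illustration, a line in one of these per-sample files might look like the following (coordinates and values hypothetical, using the BED6 layout described above):

```
chr19	7168094	7170538	Cont-B_S2	49	-
```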