Snakemake Pipeline: parse_star_junctions

Overview

Pipeline for taking STAR's SJ.out files and parsing the counts for a given bed of named spliced junctions.

This fork deviates from the original repository in the following ways:

  • Inputs and parameters are specified in a config YAML file
  • Software dependencies are managed using a conda environment
  • Only spliced read counting/extraction from STAR SJ.out files is performed. If you wish to calculate Percent Spliced In (PSI) values with respect to annotation, see the parent repository

Prerequisites

  • Snakemake (version 7.8.2 has been most recently tested, but earlier versions should also work)
  • cat and awk (usually satisfied with a bash installation)
  • <sample>.SJ.out.tab files generated by STAR
  • BED file of splice junction coordinates of interest (see the Input BED file specifications section below for the required coordinate convention)

The pipeline's remaining software dependencies are installed automatically using a conda environment.

Input BED file specifications

The coordinate format is slightly non-conventional owing to differences in how STAR defines intron/SJ coordinates and how they are processed using custom scripts. The simplest way (in my opinion) to generate coordinates that will work with this pipeline is to do the following:

  • Assume you are generating BED coordinates of the intron sequence i.e. the interval stretches from the first base of the intron up to and including the last base of the intron.
  • Add 1 to all 'End' (3rd field) values irrespective of strand

It is also highly recommended to add informative, unique identifiers to the 'Name' (4th) field of each junction in the input BED file. This field is appended to the final output BED file alongside the coordinates, sample name and spliced read counts. The 'Score' (5th) field of the input BED file is ignored, but should be populated to maintain consistency with the BED6 format.
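For illustration, a minimal shell sketch of the 'End + 1' adjustment, assuming you start from a BED6 file of intron coordinates (the file names introns.bed and junctions_input.bed are hypothetical):

# Hypothetical one-liner: shift the 'End' (3rd) field of a BED6 intron file
# up by 1 to match the coordinate convention expected by this pipeline.
awk 'BEGIN{FS=OFS="\t"} {$3 = $3 + 1; print}' introns.bed > junctions_input.bed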

Usage

1. Clone the repository

git clone https://github.com/SamBryce-Smith/bedops_parse_star_junctions.git
cd bedops_parse_star_junctions

2. Configure the pipeline

Before running the pipeline, make sure to update the config.yaml file with your run-specific information (see comments in config.yaml for more details; an illustrative example follows the list below):

  • project_dir: Top-level directory where pipeline outputs will be stored.
  • out_spot: Subdirectory underneath project_dir where sorted BEDs and outputs will appear.
  • bam_spot: Folder containing the BAM/SJ.out files. The pipeline will use wildcards to match samples in this folder.
  • pt1_sj_suffix: Suffix of the STAR splice junction tables for sample name extraction/definition.
  • bed_file: Path to the BED file of junctions you want to quantify.
  • final_output_name: Name for your final output BED file.
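As an illustration, a filled-in config.yaml might look like the following (the keys match those described above, but all values are hypothetical placeholders):

# Hypothetical example values - replace with your own paths and names
project_dir: /path/to/project
out_spot: junction_counts
bam_spot: /path/to/star_output        # contains <sample>.SJ.out.tab files
pt1_sj_suffix: .SJ.out.tab
bed_file: /path/to/junctions_input.bed
final_output_name: my_junctions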

3. Perform a dry-run

Before running the pipeline, it is recommended to perform a dry-run to ensure everything is set up correctly. If you haven't updated the config with your own inputs and just want to check the pipeline is structured correctly, you can generate empty input files to match the base config file with the create_dryrun_data.sh script:

bash create_dryrun_data.sh

Either way, you can perform a dry run with the following command:

snakemake -n -p -s parse_star_junctions.smk --use-conda

4. Run the pipeline

If everything looks good with the dry run, you can execute the pipeline locally with the following minimal command:

snakemake -p -s parse_star_junctions.smk --use-conda --cores <number_of_cores>

replacing <number_of_cores> with your desired number of cores if you wish to run in parallel. Please note that the first time you run the pipeline, the conda environment will be installed, which can be a little slow.

If you wish to submit the pipeline to the UCL Computer Science cluster, please see the submit_parse.sh submit script:

bash submit_parse.sh

For submission to other compute systems, we recommend studying the 'Snakemake Profiles' documentation (github, docs).
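As a sketch, once a profile has been set up for your scheduler, the invocation might look like the following (the profile name my_cluster is hypothetical):

snakemake -p -s parse_star_junctions.smk --use-conda --profile my_cluster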

Pipeline outputs

{project_dir}/{out_spot}/{final_output_name}.aggregated.clean.annotated.bed - Main output BED-like file

This file contains each junction from the input BED file together with its spliced read count in a given sample, e.g.

chromosome start end filename_this_count_comes_from count strand name_of_junction_in_your_input
chr19 7168094 7170537 Cont-B_S2.SJ.out 49 - INSR_annotated
chr19 7168094 7170537 Cont-C_S3.SJ.out 30 - INSR_annotated
chr19 7168094 7170537 Cont-D_S4.SJ.out 35 - INSR_annotated
chr19 7168094 7170537 control_fluorescent_2.SJ.out 9 - INSR_annotated
chr19 7168094 7170537 control_fluorescent_3.SJ.out 5 - INSR_annotated
chr19 7168094 7170537 control_none_1.SJ.out 20 - INSR_annotated

In effect, this is a standard BED file with the 'name' field from the input BED file appended as a 7th field. Note that the header shown above is for illustration only and is not included in the file.
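For example, to extract the counts for a single junction across all samples (the junction name is taken from the example above; the output file name is illustrative):

# The junction name sits in the 7th field of the main output file
awk -F'\t' '$7 == "INSR_annotated"' my_junctions.aggregated.clean.annotated.bed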

{project_dir}/{out_spot}/{final_output_name}.aggregated.bed - All junctions that overlap with the junctions in the input BED file

Useful as a sanity check to determine whether junctions missing from the main output file are absent due to no expression or to, e.g., off-by-one coordinate errors.

{project_dir}/{out_spot}/{sample_name}.bed - STAR SJ.out.tab files converted to BED format

Here, the score field (5th) corresponds to the read count for the given sample, and the name field (4th) corresponds to the inferred sample name (essentially the filename with pt1_sj_suffix removed). A single file is generated for each sample inferred from the input directory.
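To illustrate the idea behind this conversion (a sketch only, not the pipeline's actual script): STAR's SJ.out.tab reports 1-based, inclusive intron coordinates, encodes strand as 0/1/2 in column 4, and stores the uniquely mapped spliced read count in column 7. Assuming the coordinate convention described above (intron start - 1, intron end + 1), an equivalent conversion for one sample might look like:

# Hypothetical sketch: convert one STAR SJ.out.tab to the BED convention above.
# $2/$3 are STAR's 1-based intron start/end; $7 is the unique spliced read count.
awk -v sample="Cont-B_S2" 'BEGIN{OFS="\t"} {
    strand = ($4 == 1) ? "+" : (($4 == 2) ? "-" : ".")
    print $1, $2 - 1, $3 + 1, sample, $7, strand
}' Cont-B_S2.SJ.out.tab > Cont-B_S2.bed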
