Pipeline for taking STAR's SJ.out files and parsing the counts for a given BED file of named splice junctions.
This fork deviates from the original repository in the following ways:
- Inputs and parameters are specified in a config YAML file
- Software dependencies are managed using a conda environment
- Only spliced read counting/extraction from STAR SJ.out files is performed. If you wish to calculate Percent Spliced In (PSI) values with respect to annotation, see the parent repository.
## Dependencies

- Snakemake (version 7.8.2 has been most recently tested, but earlier versions should also work)
- `cat` and `awk` (usually satisfied with a bash installation)
- `<sample>.SJ.out.tab` files generated by STAR
- BED file of splice junction coordinates of interest (see [Input BED file specifications](#input-bed-file-specifications) for the required coordinate convention)

Any remaining software dependencies are installed automatically via the pipeline's conda environment.
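For context on the coordinate convention discussed below: each line of a STAR `SJ.out.tab` file is tab-separated and, per the STAR manual, reports the intron in 1-based, fully inclusive coordinates. The columns are: chromosome, intron start (first base of intron), intron end (last base of intron), strand (0: undefined, 1: +, 2: -), intron motif, annotation status, uniquely-mapping read count, multi-mapping read count, and maximum spliced alignment overhang. A representative line (values hypothetical):

```
chr19	7168095	7170536	2	2	1	49	3	38
```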
## Input BED file specifications

The coordinate format is slightly non-conventional, owing to differences in how STAR defines intron/SJ coordinates and how they are processed by the pipeline's custom scripts. The simplest way (in my opinion) to generate coordinates that will work with this pipeline is as follows:
- Assume you are generating BED coordinates of the intron sequence i.e. the interval stretches from the first base of the intron up to and including the last base of the intron.
- Add 1 to all 'End' (3rd field) values irrespective of strand
It is also highly recommended to add informative, unique identifiers to the 'Name' (4th) field of each junction in the input BED file. This field will be appended to the final output BED file alongside the coordinates, sample name and spliced read counts. The 'Score' (5th) field of the input BED file is ignored, but should be populated to maintain consistency with the BED6 format.
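As a worked example (junction name and coordinates hypothetical): suppose an intron whose first base sits at 1-based position 1001 and whose last base at position 2000. Standard BED coordinates for the intron sequence would be `start=1000`, `end=2000` (0-based, half-open); after adding 1 to the End field, the line supplied to this pipeline becomes:

```
chr1	1000	2001	MYGENE_intron1	0	+
```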
## Running the pipeline

```bash
git clone https://github.com/SamBryce-Smith/bedops_parse_star_junctions.git
cd bedops_parse_star_junctions
```
Before running the pipeline, make sure to update the `config.yaml` file with your run-specific information (see the comments in `config.yaml` for more details):
- `project_dir`: Top-level directory where pipeline outputs will be stored.
- `out_spot`: Subdirectory under `project_dir` where sorted BEDs and output will appear.
- `bam_spot`: Folder containing the BAM/SJ.out files. The pipeline uses wildcards to match samples in this folder.
- `pt1_sj_suffix`: Suffix of the STAR splice junction tables, used for sample name extraction/definition.
- `bed_file`: Path to the BED file of junctions you want to quantify.
- `final_output_name`: Name for your final output BED file.
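For illustration, a filled-in `config.yaml` might look like the sketch below (all paths and values are hypothetical placeholders; defer to the comments in the repository's own `config.yaml` for the authoritative format):

```yaml
project_dir: /path/to/my_project      # top-level output directory
out_spot: junction_counts             # created under project_dir
bam_spot: /path/to/star_alignments    # folder containing <sample>.SJ.out.tab files
pt1_sj_suffix: .SJ.out.tab            # stripped from filenames to infer sample names
bed_file: /path/to/junctions_of_interest.bed
final_output_name: my_junctions
```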
Before running the pipeline, it is recommended to perform a dry run to ensure everything is set up correctly. If you haven't updated the config with your own inputs and just want to check that the pipeline is structured correctly, you can generate empty input files matching the base config file with the `create_dryrun_data.sh` script:

```bash
bash create_dryrun_data.sh
```
Either way, you can perform a dry run with the following command:

```bash
snakemake -n -p -s parse_star_junctions.smk --use-conda
```
If everything looks good with the dry run, you can minimally execute the pipeline locally with the following command (note that the `-n` dry-run flag is dropped):

```bash
snakemake -p -s parse_star_junctions.smk --use-conda --cores <number_of_cores>
```

replacing `<number_of_cores>` with your desired number of cores if you wish to run in parallel. Please note that the first time you run the pipeline, the conda environment will be installed, which can be a little slow.
If you wish to submit the pipeline to the UCL Computer Science cluster, please see the `submit_parse.sh` submission script:

```bash
bash submit_parse.sh
```
For submission to other compute systems, we recommend studying the Snakemake profiles documentation (GitHub, docs).
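As a rough sketch (the profile name `myprofile` is a hypothetical placeholder for a profile you would create following that documentation), a profile-driven run could look like:

```bash
snakemake -p -s parse_star_junctions.smk --use-conda --profile myprofile
```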
## Output

`{project_dir}/{out_spot}/{final_output_name}.aggregated.clean.annotated.bed` - Main output BED-like file

This contains junctions from the input BED file and the associated counts in each sample, e.g.:
| chromosome | start | end | filename_this_count_comes_from | count | strand | name_of_junction_in_your_input |
| --- | --- | --- | --- | --- | --- | --- |
| chr19 | 7168094 | 7170537 | Cont-B_S2.SJ.out | 49 | - | INSR_annotated |
| chr19 | 7168094 | 7170537 | Cont-C_S3.SJ.out | 30 | - | INSR_annotated |
| chr19 | 7168094 | 7170537 | Cont-D_S4.SJ.out | 35 | - | INSR_annotated |
| chr19 | 7168094 | 7170537 | control_fluorescent_2.SJ.out | 9 | - | INSR_annotated |
| chr19 | 7168094 | 7170537 | control_fluorescent_3.SJ.out | 5 | - | INSR_annotated |
| chr19 | 7168094 | 7170537 | control_none_1.SJ.out | 20 | - | INSR_annotated |
In effect, this is a standard BED file with the 'Name' field from the input BED file appended as a 7th field. Note that the header row above is for illustration only and is not included in the file.
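Because the output is plain tab-separated text, it is straightforward to post-process with standard command-line tools. For example, a hypothetical one-liner (assuming `final_output_name: my_junctions`, as in the config sketch above) to sum spliced read counts per junction across all samples:

```bash
# Field 5 holds the spliced read count, field 7 the junction name from the input BED.
awk -F'\t' '{ sum[$7] += $5 } END { for (j in sum) print j "\t" sum[j] }' \
    my_junctions.aggregated.clean.annotated.bed
```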
`{project_dir}/{out_spot}/{final_output_name}.aggregated.bed` - All junctions that overlap the junctions in the input BED file

Useful as a sanity check to determine whether junctions missing from the main output file are absent due to no expression or due to, e.g., one-off coordinate errors.

Here, the 'Score' field (5th) corresponds to the read count for the given sample, and the 'Name' field (4th) corresponds to the inferred sample name (essentially the filename with `pt1_sj_suffix` removed). A single file is generated for each sample inferred from the input directory.
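For illustration, a line in one of these per-sample files might look like the following (coordinates and values hypothetical, using the BED6 layout described above):

```
chr19	7168094	7170538	Cont-B_S2	49	-
```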