INPUT_PROCESSING

Types of Input
--------------

1. Reads in fastq format.
1a. Unpaired reads
1b. Paired reads in single files
1c. Paired reads in separate files
2. Mappings in sam/sam.gz/bam format.

In every run, there can be only 1 type of input: 1a or 1b or 1c or 2. Several
files of the same input type can be given. To combine several types, you must
first bring them to a common type "by hand" (usually 2).


Details
-------

1. Reads in fastq format.

Whenever input consists of several sets of reads in fastq format, alu-detect
will first map all reads to the reference using bowtie2. It will then proceed
with the workflow as if the input was of type 2.

The read group of each set of reads must be specifed. (Read groups may be
identical, but there must be as many read groups given as there are fastq
files.) Also, a common INPUT_PHRED value must be given. There are several types
of fastq inputs.


1a. Unpaired reads:

$ ORIG_UNPAIRED_READS="file1.fq file2.fq" ORIG_READ_GROUPS="rg1 rg2" INPUT_PHRED=33 \
[...] ./alu-detect detect [...]


1b. Paired reads in single files:

$ ORIG_PAIRED_READS="file1.fq file2.fq" ORIG_READ_GROUPS="rg1 rg2" INPUT_PHRED=33 \
[...] ./alu-detect detect [...]

Here, file1.fq contains consecutive reads from each pair:

@read1/1
AAAA
+
####
@read1/2
AAAA
+
####
@read2/1
AAAA
+
####
@read2/2
AAAA
+
####


1c. Paired reads in separate files:

$ ORIG_READS_1="file1.1.fq file2.1.fq" ORIG_READS_1="file1.2.fq file2.2.fq" \
ORIG_READ_GROUPS="rg1 rg2" INPUT_PHRED=33 [...] ./alu-detect detect [...]

Here, file1.1.fq contains:

@read1/1
AAAA
+
####
@read2/1
AAAA
+
####

and file1.2.fq contains:

@read1/2
AAAA
+
####
@read2/2
AAAA
+
####


2. Mappings in sam/sam.gz/bam format.

$ ORIG_MAPPINGS="file1.bam file2.sam.gz" [...] ./alu-detect detect [...]


Read groups: Downstream processing steps need information about what read group
each read belongs to. For every read found in a SAM record, its read group is
determined from the RG:Z:<read_group> tag. If that is missing, a default value
of "00" is used. This is ok if and only if the pairing information is very
similar for all reads missing their RG tag.

To specify a default read group for some of the sam/sam.gz/bam files, you can
also set the variable ORIG_READ_GROUPS. Unlike the case of fastq input, there
may be fewer read groups specifed than there are input sam/sam.gz/bam files. In
that case, read group 1 is used as the default for sam/sam.gz/bam file 1, read
group 2 for file 2, and so on. Files with no corresponding read group will use
"00" as the default read group. E.g.:

$ ORIG_MAPPINGS="file1.bam file2.sam.gz" ORIG_READ_GROUPS="rg1" [...] \
  ./alu-detect detect [...]

All reads missing the RG tag from file1.bam will be assigned to the read group
"rg1", and all reads the RG tag from file2.sam.gz will be assigned the read
group "00".


Ordering: When dealing with paired reads, downstream processing steps require
that mappings for both reads in each read pair be found in consecutive SAM
records. A common situation when this is not the case is when a sam/sam.gz/bam
file is sorted by coordinate. This is usually (but not always) specified by the
"SO:coordiante" tag on the first line of the file (see SAM spec). If that is the
case, alu-detect will re-sort the file appropriately.

Unfortunately, some programs do not honour this tag, notably samtools. So,
samtools may produce bam files sorted by coordinate, but without the
"SO:coordinate" tag. This breaks alu-detect. To fix it, set the additional RSORT
variable to a non-empty value. This will cause alu-detect to sort all input
files by read name, regardless of the contents of the SO tag:

$ ORIG_MAPPINGS="file1.bam file2.sam.gz" RSORT=1 [...] ./alu-detect detect [...]


3. Mappings in sam/bam format, sorted by read name.

The input is given as, e.g.:
$ ORIG_MAPPINGS_SAM=file.sam alu-detect

The program performs steps B-E:

- (B, in stage 2) detect pairing information, using detect-pairing.

- (C, in stage 3) extract discordant mappings. For this, alu-detect
  uses: add-extra-sam-flags and filter-mappings. Both take pairing
  information into account.

  add-extra-sam-flags adds the following flags to SAM field #2:

  - 0x1000: mqv (of this read only) >= minimum mqv (default: 5)

  - 0x2000: mapping left end has tail of >= min_tail_insert_size bp
    (default: 15bp) in cigar ops H/S/I.

  - 0x4000: mapping right end has tail of >= min_tail_insert_size bp
    (default: 15bp) in cigar ops H/S/I.

  - 0x8000: paired mapping is concordant (both reads mapped, correct
    strand position, within min/max fragment size)

  - 0x10000: read length < min_read_len (default: 20bp)

  filter-mappings is used to extract discordant mappings as follows:

  - for unpaired reads, extract mappings: that are unmapped (0x4) or
    have large left/right tail (0x2000 or 0x4000)

  - for paired reads, extract paired mappings where: the pair is not
    mapped concordantly (0/0x8000), or either read has large
    left/right tail (0x2000 or 0x4000).