GMA - Genome Mappability Analyzer
The latest version is 0.1.3 You can download the original gma-0.1.3.tar.gz on the Sourceforge download page. You can download this modfied version of gma-0.1.3.tar.gz from [this Github repository] (https://github.com/cbergman/gma-0.1.3/archive/master.zip)
GMA is developed under Linux environment. Details are described below, and thus it will work in similar environments that we support. If you're Windows user, you may need cygwin.
$ uname -a
Linux mshadoop1.cshl.edu 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux
GMA outputs ".gms" extension files, which consists of 6 columns. A detailed description of each column is as follows:
field | explanation |
---|---|
chr | This shows chromosomes if it is species that have multi chromosomes. Otherwise it means the species. |
pos | Position. Note that it begins with "1" not "0". |
base | Base from aligning results. Note that if it's not a,c,g,t/A,C,G,T, then the alignment tool failed to map reads to this position. In this case, "*" will be shown. |
maq | By design, each position can have 4 mapping quality scores at maximum. All slots are initialized with "-1". If any mapping quality scores are assigned, non-negative numbers should be shown. |
total | This will give you rough idea of GMS. Total means how many reads are mapped to the position. The first number after "=" means the number of reads more than threshold that a user can set as a GMA parameters. Thus they are considered reliable reads. However the second number means the number of reads that is less than threshold so they are unreliable reads. Based on such information guide you rough GMS. |
GMS | This is the score that shows how reliably each base of the genome can be mapped. |
This is a snippet of E. coli gms file.
E. coli
#chr pos base maq total = >= + <(um) GMS
NC_000913 000000001 A (60,-1)(-1,-1) 1 = 1 + 0(0) 99.9999
NC_000913 000000002 G (60,-1)(-1,-1) 2 = 2 + 0(0) 99.9999
NC_000913 000000003 C (60,-1)(-1,-1) 3 = 3 + 0(0) 99.9999
NC_000913 000000004 T (60,-1)(-1,-1) 4 = 4 + 0(0) 99.9999
NC_000913 000000005 T (60,-1)(-1,-1) 5 = 5 + 0(0) 99.9999
NC_000913 000000006 T (60,-1)(-1,-1) 6 = 6 + 0(0) 99.9999
NC_000913 000000007 T (60,-1)(-1,-1) 7 = 7 + 0(0) 99.9999
NC_000913 000000008 C (60,-1)(-1,-1) 8 = 8 + 0(0) 99.9999
NC_000913 000000009 A (60,-1)(-1,-1) 9 = 9 + 0(0) 99.9999
NC_000913 000000010 T (60,-1)(-1,-1) 10 = 10 + 0(0) 99.9999
NC_000913 000000011 T (60,-1)(-1,-1) 11 = 11 + 0(0) 99.9999
NC_000913 000000012 C (60,-1)(-1,-1) 12 = 12 + 0(0) 99.9999
NC_000913 000000013 T (60,-1)(-1,-1) 13 = 13 + 0(0) 99.9999
NC_000913 000000014 G (60,-1)(-1,-1) 14 = 14 + 0(0) 99.9999
NC_000913 000000015 A (60,-1)(-1,-1) 15 = 15 + 0(0) 99.9999
NC_000913 000000016 C (60,-1)(-1,-1) 16 = 16 + 0(0) 99.9999
NC_000913 000000017 T (60,-1)(-1,-1) 17 = 17 + 0(0) 99.9999
NC_000913 000000018 G (60,-1)(-1,-1) 18 = 18 + 0(0) 99.9999
NC_000913 000000019 C (60,-1)(-1,-1) 19 = 19 + 0(0) 99.9999
NC_000913 000000020 A (60,-1)(-1,-1) 20 = 20 + 0(0) 99.9999
This is a snippet of hg19 gms file.
hg19 (Homo Sapiens ver.19)
#chr pos base maq total = >= + <(um) GMS
chrX 000093411 G (20,-1)(37,-1) 200 = 47 + 153(0) 38.7653
chrX 000093412 G (20,-1)(37,-1) 200 = 49 + 151(0) 39.7602
chrX 000093413 C (20,-1)(37,-1) 200 = 51 + 149(0) 40.7551
chrX 000093414 A (20,-1)(37,-1) 200 = 53 + 147(0) 41.7500
chrX 000093415 G (20,-1)(43,-1) 200 = 55 + 145(0) 42.7450
chrX 000093416 T (20,-1)(43,-1) 200 = 57 + 143(0) 43.7399
chrX 000093417 T (20,-1)(43,-1) 200 = 59 + 141(0) 44.7349
chrX 000093418 A (20,-1)(43,-1) 200 = 61 + 139(0) 45.7299
chrX 000093419 G (20,-1)(43,-1) 200 = 63 + 137(0) 46.7249
chrX 000093420 A (18,-1)(41,-1) 200 = 64 + 136(0) 47.7169
chrX 000093421 G (18,-1)(41,-1) 200 = 65 + 135(0) 48.7089
chrX 000093422 A (18,-1)(41,-1) 200 = 66 + 134(0) 49.7010
chrX 000093423 G (18,-1)(41,-1) 200 = 67 + 133(0) 50.6930
chrX 000093424 C (18,-1)(41,-1) 200 = 68 + 132(0) 51.6850
chrX 000093425 A (18,-1)(41,-1) 200 = 69 + 131(0) 52.6771
chrX 000093426 G (18,-1)(41,-1) 200 = 70 + 130(0) 53.661
chrX 000093427 A (18,-1)(41,-1) 200 = 71 + 129(0) 54.6612
chrX 000093428 C (18,-1)(41,-1) 200 = 72 + 128(0) 55.6532
chrX 000093429 T (18,-1)(41,-1) 200 = 73 + 127(0) 56.6452
chrX 000093430 G (18,-1)(41,-1) 200 = 74 + 126(0) 57.6373
1. Download the modfied version of the latest version of GMA here.
$ wget -O gma-0.1.3.zip https://github.com/cbergman/gma-0.1.3/archive/master.zip
2. Extract files.
$ unzip gma-0.1.3.zip
If you go to "gma-0.1.3-master" directory, there are 6 directories.
$ cd gma-0.1.3-master/
$ ls
bin example inc lib script src README.md
3. Now you have two options to run GMA
(1) Pre-compiled binary files
You use pre-compiled binary file in bin directory. Note that mapper and reducer are our responsibility. Since bwa and samtools are needed to run GMA, we included the precompiled versions of them into GMA package. This modified distribution also includes bedGraphToBigWig to allow conversion of GMA output to bigWig format.
$ cd gma-0.1.3-master/bin
$ ls
bedGraphToBigWig bwa mapper reducer samtools
(2) Compiling source code.
You may want to get binary files from compiling source code. Then go to src/ directory and "make".
$ cd gma-0.1.3-master/src
$ make
If you get errors below, you need to install libz library.
$ make
gcc -g -Wall -L../lib main_mapper.o genfa.o mapper.o genfq.o runall.o -o mapper -lz -lbam -lm
/usr/bin/ld: cannot find -lz
collect2: ld returned 1 exit status
make: *** [mapper] Error 1
If compiling works smoothly, you can find new binary files in bin/ directory. Note that there will be no binary files newly generated in src/ directory. Makefile script moves them to bin/ automatically.
For quick start, there are two important directories that you must know. One is example and the other is script. In the example directory, there is sub-directory called "ecoli", which has pre-computed indexes and a pre-processed fasta file named NC_000913.fa.ppd inside the input/index and input/ppd directories, respectively.
$ cd gma-0.1.3-master/
$ ls
bin example inc lib script src
ls example
ecoli yeast
$ ls example/ecoli
input output
$ ls example/ecoli/input
index ppd
$ ls example/ecoli/input/index
NC_000913.fna NC_000913.fna.ann NC_000913.fna.pac NC_000913.fna.rpac NC_000913.fna.sa
NC_000913.fna.amb NC_000913.fna.bwt NC_000913.fna.rbwt NC_000913.fna.rsa
$ ls example/ecoli/input/ppd
NC_000913.fa.ppd
Now, all you need is to run script in script/ directory.
$ cd gma-0.1.3-master/script/
$ ./expr.local.ecoli.sh
.
.
Now, you can see a new directory, named "ecoli.l100.o0.qA.s0.02.i0.d0 " is generated with "timestamp.txt". The result file name is "mapred.txt".
$ cd gma-0.1.3-master/example/ecoli/output/local/ecoli.l100.o0.qA.s0.02.i0.d0
$ ls
aln_se.bam aln_se.sort.bam map.txt ref.fa.ann ref.fa.pac ref.fa.rsa
aln_se.sai mapred.txt ref.fa ref.fa.bwt ref.fa.rbwt ref.fa.s-100-A-0.0200-0.0000-0.0000.fq
aln_se.sam mapsort.txt ref.fa.amb ref.fa.fai ref.fa.rpac ref.fa.sa
In the script directory there are 4 example script files to run GMA on single machine:
- expr.local.ecoli.sh
- expr.local.yeast.sh
- expr.local.ecoli.tech.sh
- expr.local.yeast.tech.sh
The scripts with "tech" in their filenames run with pre-set options for different NGS technologies.
GMA provides two main binary files -- mapper and reducer -- whose names are more intuitive under MapReduce. The local version of GMA is run using a the combination of two programs. Help menus for the mapper and reducer programs are as follows.
$ cd gma-0.1.3-master/bin/
$ mapper -h
Program : GMA (Genome Mappability Analyzer) MAPPER
Version : 0.0.1
Usage : gma [COMMAND] [OPTION] [FASTA]
[COMMAND] runall | genfqse | genfqpe | genfa
runall : Proceed genfqse/genfqpe/analyze.
-l Length [1-200]
-q Quality value. Refer http://en.wikipedia.org/wiki/FASTQ_format
-s Substitution rate. [0.0000 - 1.0000] 1 means 100%, 0.25 means 25%
-i Insertion rate. [0.0000 - 1.0000] 1 means 100%, 0.25 means 25%
-d Deletion rate. [0.0000 - 1.0000] 1 means 100%, 0.25 means 25%
-o library distance bp [100-250] between end of start and start of end
If -o option is specified, pair-end aligment will be forced unless it is 0(zero)
If -o option is zero, single-end reads will be generated
-t Threshold for map quality to analyze sam file and generatemap result.
-f FASTA file name to be generated, e.g. ref.fa
-b The number of bases per line in FASTA file before it is preprocessed
-x index file path
-p use if you use Hadoop and should specify the directory that includes bwa and samtools and can be accessed by hadoop.
genfqse : Generate FastQ file for single ends.
-l Length [1-200]
-q Quality value. Refer http://en.wikipedia.org/wiki/FASTQ_format
-s Substitution rate. [0.0000 - 1.0000] 1 means 100%, 0.25 means 25%
-i Insertion rate. [0.0000 - 1.0000] 1 means 100%, 0.25 means 25%
-d Deletion rate. [0.0000 - 1.0000] 1 means 100%, 0.25 means 25%
genfqpe : Generate FastQ file for mate pair.
-l Length [1-200]
-q Quality value. Refer http://en.wikipedia.org/wiki/FASTQ_format
-s Substitution rate. [0.0000 - 1.0000] 1 means 100%, 0.25 means 25%
-i Insertion rate. [0.0000 - 1.0000] 1 means 100%, 0.25 means 25%
-d Deletion rate. [0.0000 - 1.0000] 1 means 100%, 0.25 means 25%
-o library distance bp [100-250] between end of start and start of end
If -o option is specified, pair-end aligment will be forced unless it is 0(zero)
If -o option is zero, single-end reads will be generated
-b The number of bases per line in FASTA file before it is preprocessed
genfa : Generate FastA file from preprocessed file.
-f FASTA file name to be generated, e.g. ref.fa
-b The number of bases in a line of FASTA file
extract : extract information from bam file then make an intermediate file
-l Length [1-200]
-f FASTA file name to be generated, e.g. ref.fa
-m BAM file name
$ reducer -h
Program : GMA (Genome Mappability Analyzer) REDUCER
Version : 0.0.1
Usage : reducer [COMMAND] [OPTION]
[COMMAND] analyzer
analyzer : anlayze map results, calculate genome mappabilityr.
[OPTION] l | t
-l Length [1-200]
-t Threshold of map quality from sam file
For a quick run, we provide parameter settings 5 kinds of the most popular sequencing technologies; Illumina, Solid, Ion Torrent, Roche-454 and PacBio. We also support pacbio-ec, which assumes PacBio error corrected case. You can use one of these options to pass to the mapper tech command, similar to how the illumina option is used in the command below to generate GMS files for 100bp single-ended reads on the E. coli genome.
$ cd gma-0.1.3-master/example/ecoli/output/local/ecoli.l100.o0.qA.s0.02.i0.d0
$ cat ../../../input/ppd/NC_000913.fa.ppd | ../../../../../bin/mapper tech --illumina -b 70 -x ../../../input/index/NC_000913.fna -p ../../../../../bin 1> map.txt
option | Description |
---|---|
--illumina | Settings for Illumina |
--solid | Settings for Solid |
--iontorrent | Settings for Ion Torrent |
--roche | Settings for Roche/454 |
--pacbio | Settings for PacBio |
--pacbio-ec | Settings for PacBio error corrected version |
The scripts listed above such as expr.local.ecoli.sh follow basic Hadoop steps: map -> sort/shuffle -> reduce. Hence there are 3 important lines in these scripts that you must modify to run GMA with your own settings. As a pre-processing step, you need to generate .ppd file(s) before running map step. Such files are included for E. coli and yeast, but you will need to prepare your own .ppd files for different species. This examples shows how you can generate pre-processed .ppd files for from a multi-fasta reference genome and then generate 100bp single-ended GMS files for the yeast genome.
$ cd gma-0.1.3-master/example/yeast/input/ppd
$ ../../../../script/prepro.chr.py ../index/yeast.fa
chr1.fa.ppd is being processed
chr2.fa.ppd is being processed
chr3.fa.ppd is being processed
chr4.fa.ppd is being processed
chr5.fa.ppd is being processed
chr6.fa.ppd is being processed
chr7.fa.ppd is being processed
chr8.fa.ppd is being processed
chr9.fa.ppd is being processed
chr10.fa.ppd is being processed
chr11.fa.ppd is being processed
chr12.fa.ppd is being processed
chr13.fa.ppd is being processed
chr14.fa.ppd is being processed
chr15.fa.ppd is being processed
chr16.fa.ppd is being processed
chrMT.fa.ppd is being processed
$ cd gma-0.1.3-master/example/yeast/output/local
$ mkdir yeast.l100.o0.qA.s0.02.i0.d0
$ cd yeast.l100.o0.qA.s0.02.i0.d0
$ cat ../../../input/ppd/*.ppd | ../../../../../bin/mapper runall -l 100 -q A -s 0.02 -i 0 -d 0 -o 0 -t 20 -f ref.fa -b 70 -x ../../../input/index/yeast.fa -p ../../../../../bin 1> map.txt
In this command, we can learn that mapper takes its input as standard input(stdin). The format of command is as follows:
mapper <command> [options] > map.txt
Options and descriptions for the mapper runnall command are described in below table.
option | Description |
---|---|
-l | Read length |
-q | Phred-scaled quality value for each base |
-s | Error rate of substitution |
-i | Error rate of insertion |
-d | Error rate of deletion |
-o | Expected distance between two reads if paired-end. |
-t | Threshold |
-f | File path of a reference fasta |
-b | The number of bases in a line when re-building fasta file from pre-processed file. It is important to use the same number of the original fasta file. |
-x | File path of index files |
-m | File path of bam file |
-p | File path for programs such as BWA and SAMtools |
The Linux/Unix commond "sort" is used to this purpose.
$ cat map.txt | sort -k1,1 -t"|" > mapsort.txt
For reduce step, the "analyzer" command is used with the following 3 options.
cat mapsort.txt | ../../../../../bin/reducer analyzer -l 100 -t 20 -o 0 1> mapred.txt 2> log.txt
option | Description |
---|---|
-l | Read length |
-t | Threshold |
-o | Expected distance between two reads if paired-end. |