<html>
<head>
<title>QUAST 2.3 manual</title>
<style type="text/css">
body {
margin-top: 30px;
margin-left: 50px;
max-width: 800px;
font-family: Tahoma,sans-serif;
font-size: 14px;
}
pre.code {
background-color: #EEE;
padding: 5px 0px;
margin-top: 5px;
margin-bottom: 15px;
}
table td {
vertical-align: top;
}
h2, h3, h4 {
margin-bottom: -10px;
}
h2 {
margin-top: 40px;
}
h3 {
margin-top: 50px;
}
h4 {
margin-top: 30px;
font-size: 1.1em;
}
ul {
margin-top: -12px;
}
ul li, ol li {
margin-bottom: 5px;
}
.options {
margin-left: 34px;
}
.option {
margin-top: 15px;
margin-bottom: 5px;
}
.options .option:first-child {
margin-top: 4px;
}
.metric-name, .metric-ref {
font-family: Georgia,serif;
}
.metric-name {
font-weight: bold;
}
.metrics_description p {
margin-bottom: 25px;
}
.hs { /* 10<span class="hs"></span>000 */
margin-left: .2em;
}
.rhs { /* 1<span class="rhs"> </span>kb */
font-size: 50%;
line-height: 1;
}
</style>
</head>
<body>
<h1>QUAST 2.3 manual</h1>
<p> QUAST stands for <u>QU</u>ality <u>AS</u>sessment <u>T</u>ool.
The tool evaluates genome assemblies by computing various metrics.
<br>
<br>
You can find all project news and the latest version of the tool
at <a href="http://sourceforge.net/projects/quast/">http://sourceforge.net/projects/quast</a>.
<br>
<br>
QUAST utilizes <a href="http://mummer.sourceforge.net/">MUMmer</a>, <a href="http://exon.gatech.edu/GeneMark/">GeneMarkS</a>, <a href="http://exon.gatech.edu/GeneMark/">MetaGeneMark</a>,
<a href="http://cbcb.umd.edu/software/glimmerhmm/">GlimmerHMM</a>
and <a href="http://gage.cbcb.umd.edu/index.html">GAGE</a>. These tools are built in, so you do not need to install
them separately.
<br>
<br>
Version 2.3 of QUAST was released under GPL v2 (see <a href="LICENSE">LICENSE</a> for details) on 17 January 2014.
</p>
<h2>Contents</h2>
<ol>
<li><a href="#sec1">Installation</a></li>
<li><a href="#sec2">Running QUAST</a>
<ol>
<li><a href="#sec2.1">For impatient people</a></li>
<li><a href="#sec2.2">Input data</a></li>
<li><a href="#sec2.3">GAGE mode</a></li>
<li><a href="#sec2.4">Command line options</a></li>
<li><a href="#sec2.5">Metagenomic assemblies</a></li>
</ol>
</li>
<li><a href="#sec3">QUAST output</a>
<ol>
<li><a href="#sec3.1">Metrics description</a>
<ol>
<li><a href="#sec3.1.1">Summary report</a></li>
<li><a href="#sec3.1.2">Misassemblies report</a></li>
<li><a href="#sec3.1.3">Unaligned report</a></li>
</ol>
</li>
<li><a href="#sec3.2">Plots descriptions</a></li>
</ol>
</li>
<li><a href="#sec4">Adjusting QUAST reports and plots</a></li>
<li><a href="#sec5">Citation</a></li>
<li><a href="#sec6">Feedback and bug reports</a></li>
<li><a href="#sec7">FAQ</a></li>
</ol>
<a name="sec1"></a>
<h2>1. Installation</h2>
<p>
QUAST can be run on Linux or Mac OS.
</p>
<p>
It requires:
</p>
<ul>
<li>python 2 (2.5 or higher)</li>
<li>perl 5.6.0 or higher</li>
<li>g++</li>
<li>make</li>
<li>sh</li>
<li>csh</li>
<li>sed</li>
<li>awk</li>
<li>ar</li>
</ul>
All of these tools are usually preinstalled on Linux.<br>
Mac OS, however, does not ship with <code>make</code>, <code>g++</code> and <code>ar</code>, so you will have to install
<a href='https://developer.apple.com/xcode/'>Xcode</a> (or only
<a href='https://developer.apple.com/downloads/index.action?name=Command%20Line%20Tools'>Command Line Tools for Xcode</a>) to make them available.
<br>
<br>
It is also highly recommended to install the <a href="http://matplotlib.sourceforge.net/">Matplotlib</a> Python library for drawing plots.
We recommend using Matplotlib version 1.0 or higher. QUAST was tested with Matplotlib v.1.3.1.
<br>
Installation can be done with Python <a href="http://www.pip-installer.org/">pip-installer</a>:
<pre class="code">
pip install matplotlib
</pre>
Or with the <a href="http://peak.telecommunity.com/DevCenter/EasyInstall">Easy Install</a> Python module:
<pre class="code">
easy_install matplotlib
</pre>
Or on Ubuntu by typing:
<pre class="code">
sudo apt-get install python-matplotlib
</pre>
<br>
To download the <a href="https://downloads.sourceforge.net/project/quast/quast-2.3.tar.gz">QUAST source code tarball</a> and extract it, type:
<pre class="code">
wget https://downloads.sourceforge.net/project/quast/quast-2.3.tar.gz
tar -xzf quast-2.3.tar.gz
cd quast-2.3
</pre>
<p>
QUAST automatically compiles all its sub-parts when needed (on the first use).
Thus, there is no special installation command for QUAST.
However, we recommend that you run:
<pre class="code">
python quast.py --test        # if you plan to use only quast.py
</pre>
or
<pre class="code">
python metaquast.py --test    # if you plan to use only metaquast.py
</pre>
or both.
These commands run all QUAST and metaQUAST modules and check that they
work correctly on your platform.
</p>
<p>
Note: you should place the quast-2.3 directory in its final destination before
the first use (e.g. before running with --test). If you want to move QUAST to
a new location after it has already been used, start from a clean copy of quast-2.3.
This limitation is caused by auto-generation of absolute paths in compiled
modules of QUAST.
</p>
<a name="sec2"></a>
<h2 style='margin-bottom: -40px;'>2. Running QUAST</h2>
<a name="sec2.1"></a>
<h3>2.1 For impatient people</h3><br>
Running QUAST on test data from the installation tarball (reference genome, gene and operon annotations, and two assemblies of the first 10<span class="rhs"> </span>kbp of <i>E. coli</i>):
<pre class="code">
./quast.py test_data/contigs_1.fasta \
test_data/contigs_2.fasta \
-R test_data/reference.fasta.gz \
-G test_data/genes.txt \
-O test_data/operons.txt
</pre>
View the summary of the evaluation results with the <a href="http://www.greenwoodsoftware.com/less/">less</a> utility:
<pre class="code">
less quast_results/latest/report.txt
</pre>
<a name="sec2.2"></a>
<h3>2.2 Input data</h3>
<p>
The <code>test_data</code> directory contains examples of assembly, reference, gene and operon files.<br>
<br>
<b>Sequences</b><br>
The tool accepts assemblies and references in FASTA format. Files may be compressed with zip, gzip, or bzip2.<br>
Multiple reference chromosomes can be provided as separate sequences in a single FASTA file.<br>
<span style='line-height: 50%;'> </span><br>
Maximum assembly length is 4.29<span class="rhs"> </span>Gbp.<br>
Maximum length of a reference sequence (e.g. a chromosome) is 536<span class="rhs"> </span>Mbp. The number of sequences in a reference file is not limited.<br>
<span style='line-height: 50%;'> </span><br>
These restrictions are imposed by Nucmer, the tool that QUAST uses to align contigs to a reference genome.
The metrics that do not require alignment are computed in any case.<br>
<br>
<b>Genes and operons</b><br>
One can also specify files with gene and operon positions in the reference. QUAST will count fully and partially aligned regions,
and output <a href='#genes'>total values</a> and <a href='#gene_plot'>cumulative plots</a>.<br>
<span style='line-height: 50%;'> </span><br>
The following file formats are supported:
<ul>
<li>GFF, versions <a href="http://www.sanger.ac.uk/resources/software/gff/spec.html">2</a> and <a href="http://www.sequenceontology.org/gff3.shtml">3</a>
(note: the &lt;feature&gt;/&lt;type&gt; field should be either "gene" or "operon");
<li>the <a href="http://www.ncbi.nlm.nih.gov/gene">format used by NCBI</a> for genes ("Summary (text)");
<li>four tab-separated columns: sequence name, gene/operon id, start position, end position.
</ul>
Note that the sequence name has to match a name in the reference file.<br>
<span style='line-height: 50%;'> </span><br>
Coordinates are 1-based, i.e. the first nucleotide in the reference has position 1, not 0.
If a <i>start position</i> is less than the corresponding <i>end position</i>, the gene or operon is on the positive strand; otherwise it is on the negative strand.
</p>
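<p>
As an illustration of the four-column format and the strand rule, here is a minimal Python sketch. The function name <code>parse_gene_line</code>, the returned field names and the sample record are hypothetical; this is not QUAST's own parser.
</p>
<pre class="code">
```python
def parse_gene_line(line):
    """Parse one record: sequence name, gene/operon id, start, end (tab-separated)."""
    seq_name, feature_id, start, end = line.rstrip("\n").split("\t")
    start, end = int(start), int(end)
    # Start before end means the positive strand, otherwise the negative strand.
    strand = "+" if end > start else "-"
    return {
        "seq": seq_name,
        "id": feature_id,
        "start": min(start, end),
        "end": max(start, end),
        "strand": strand,
        # Coordinates are 1-based and both end points are included.
        "length": abs(end - start) + 1,
    }


# Hypothetical record: a gene on the negative strand of sequence "chr1".
record = parse_gene_line("chr1\tgeneA\t900\t400")
print(record["strand"], record["length"])
```
</pre>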
<a name="sec2.3"></a>
<h3>2.3 GAGE mode</h3>
<p>
<a href="http://gage.cbcb.umd.edu/index.html">GAGE</a> is a well-known assessment tool. However, it has limitations:
<ul>
<li>Only one assembly can be evaluated per run, which complicates assembly comparison.
<li>Fixed threshold for a minimum contig length (200<span class="rhs"> </span>bp).
</ul>
These issues are solved by QUAST in GAGE mode (run with a <code>--gage</code> option). QUAST filters contigs according to a specified threshold and runs GAGE on every assembly.
GAGE statistics (see the <a href="http://gage.cbcb.umd.edu/index.html">GAGE site</a> and the <a href="http://genome.cshlp.org/content/early/2012/01/12/gr.131383.111">GAGE paper</a> for descriptions)
are reported in addition to a standard QUAST report.<br>
<span style='line-height: 50%;'> </span><br>
Note:<br><br>
<ul>
<li>GAGE requires a reference genome.</li>
<li>Java and a Java compiler must be installed on your machine. Tested with OpenJDK 6.</li>
</ul>
</p>
<a name="sec2.4"></a>
<h3>2.4 Command line options</h3>
<br>
QUAST runs from a command line as follows:
<pre class='code'>
python quast.py [options] &lt;contig_file(s)&gt;
</pre>
Options:
<div class='options'>
<div class='option'>
<code><b>-o</b> &lt;output_dir&gt;</code>
</div>
Output directory. The default value is <code>quast_results/results_&lt;date_time&gt;</code>.<br>
Also, a symlink <code>quast_results/latest</code> is created.<br>
<br>
Note: QUAST reuses Nucmer alignments if run repeatedly on the same directory.
Thus, you can efficiently reuse already computed results when running QUAST with different parameters, or adding more assemblies to an existing comparison.
<div class='option'>
<code><b>-R</b> &lt;path&gt;</code>
</div>
Reference genome file. Optional. Many metrics can't be evaluated without a reference. If this is omitted, QUAST will only report the metrics that can be evaluated without a reference.
<div class='option'>
<code><b>-G</b> &lt;path&gt;</code> (or <code>--genes &lt;path&gt;</code>)
</div>
File with gene positions in reference. See details about the file format in <a href="#sec2.2">section 2.2</a>.<br>
<span style='line-height: 50%;'> </span><br>
If you do not have gene positions, you can make QUAST predict genes with the <a href='#gene_finding'><code>--gene-finding</code> option</a>.<br>
<div class='option'>
<code><b>-O</b> &lt;path&gt;</code> (or <code>--operons &lt;path&gt;</code>)
</div>
File with operon positions in reference. See details about the file format in <a href="#sec2.2">section 2.2</a>.
<div class='option'>
<a name='min_contig'></a>
<code><b>--min-contig</b> &lt;int&gt;</code>
</div>
Lower threshold for a contig length. Shorter contigs won't be taken into account
(except for some metrics, see <a href="#sec3">section 3</a>). The default value is 500.
</div>
<br>
Advanced options:
<div class='options'>
<div class='option'>
<code><b>-t</b></code> (or <code>--threads</code>) <code>&lt;int&gt;</code>
</div>
Maximum number of threads. The default value is the number of CPUs. If QUAST fails to determine the number of CPUs, the number is set to 4.
<div class='option'>
<code><b>--labels</b></code> (or <code>-l</code>) <code>&lt;label,label...&gt;</code>
</div>
Human-readable assembly names. Those names will be used in reports, plots and logs. For example:<br>
<div style='margin-left: 30px; margin-top: 5px; margin-bottom: 10px;'>
<code>-l SPAdes,IDBA-UD</code>
</div>
If your labels include spaces, use quotes:<br>
<div style='margin-left: 30px; margin-top: 5px; margin-bottom: 10px;'>
<code>-l SPAdes,"Assembly 2",Assembly3</code>
</div>
<div style='margin-left: 30px; margin-top: 5px; margin-bottom: 10px;'>
<code>-l "SPAdes 2.5, SPAdes 2.4, IDBA-UD"<br></code>
</div>
<div class='option'>
<code><b>-L</b></code>
</div>
Take assembly names from their parent directory names.
<div class='option'>
<a name='gene_finding'></a><code><b>--gene-finding</b></code>
</div>
Enables gene finding. Affects performance, thus disabled by default.<br>
<span style='line-height: 50%;'> </span><br>
By default, we assume that the genome is prokaryotic, and apply GeneMark.hmm for gene finding.
If the genome is eukaryotic, add the <a href='#eukaryote'><code>--eukaryote</code></a> option to enable GlimmerHMM instead.
If it is a metagenome, add the <a href='#meta'><code>--meta</code></a> option.<br>
<span style='line-height: 50%;'> </span><br>
If a <a href='#sec2.2'>gene file</a> is provided by <code>-G</code> as well, both
<span class='metric_ref'><a href='#genes'># genes</a></span> in the file covered by the assembly, and
<span class='metric_ref'><a href='#predicted_genes'># predicted genes</a></span> are reported. Note that operons are not predicted,
but a file of known operon positions can be provided instead.<br>
<div class='option'>
<code><b>--gene-thresholds</b> &lt;int,int,...&gt;</code>
</div>
Comma-separated list of length thresholds for genes predicted by the gene finding tool. The default value is 0,300,1500,3000.
Note: this list is used only if <code>--gene-finding</code> option is specified.
<div class='option'>
<a name="eukaryote"></a>
<code><b>--eukaryote</b></code>
</div>
Genome is eukaryotic. Affects gene finding and contig alignment:<br>
<ol class='my_ol' style='margin-top: 0;'>
<li>For prokaryotes (the default), GeneMark.hmm is used. For eukaryotes, GlimmerHMM is used.
<li>By default, QUAST assumes that a genome is circular and correctly processes its linear representation.
This option indicates that the genome is not circular.
</ol>
<div class='option'>
<a name="meta"></a>
<code><b>--meta</b></code>
</div>
Use MetaGeneMark for gene finding, if the <code>--gene-finding</code> option is specified.
If the <code>--eukaryote</code> option is also provided, MetaGeneMark will still be used.<br>
<span style='line-height: 50%;'> </span><br>
Note: if you have multiple references, <a href='#sec2.5'>use metaquast.py</a> instead
(it is in the same directory as quast.py).
<div class='option'>
<code><b>--est-ref-size</b> &lt;int&gt;</code>
</div>
Estimated reference size (in bases) for computing <span class='metric-ref'>NGx</span> statistics. This value will be used only if a reference genome file is
not specified (see the <code>-R</code> option).
<div class='option'>
<code><b>--gage</b></code>
</div>
Starts QUAST in "GAGE mode" (see <a href="#sec2.3">section 2.3</a>).
Note: in this case, you also have to set the <code>-R</code> option.
<div class='option'>
<code><b>--contig-thresholds</b> &lt;int,int,...&gt;</code>
</div>
Comma-separated list of contig length thresholds. Used in <span class='metric-ref'># contigs ≥ x</span> and
<span class='metric-ref'>total length (≥ x)</span> metrics (see <a href="#sec3">section 3</a>). The default value is 0,1000.
<div class='option'>
<code><b>--scaffolds</b></code>
</div>
The assemblies are scaffolds (rather than contigs). QUAST will add split versions
of assemblies to the comparison. Assemblies are split by continuous fragments of N's of length ≥ 10.
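<p>
The splitting rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated (runs of 10 or more N's separate contigs), not QUAST's actual code; the name <code>split_scaffold</code> is hypothetical, and only uppercase N is treated as a gap character here, which is an assumption.
</p>
<pre class="code">
```python
import re

# A run of 10 or more uppercase N's is treated as a gap between contigs.
GAP = re.compile(r"N{10,}")


def split_scaffold(sequence):
    """Return the non-empty fragments left after cutting out long runs of N's."""
    return [part for part in GAP.split(sequence) if part]


# The 12-base run is a gap; the 3-base run is too short and stays in the contig.
scaffold = "ACGT" + "N" * 12 + "GGCC" + "N" * 3 + "TTAA"
print(split_scaffold(scaffold))
```
</pre>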
<div class='option'>
<code><b>--use-all-alignments</b></code>
</div>
Compute <span class='metric-ref'>genome fraction, # genes, # operons</span> metrics in the manner used in QUAST v.1.*.
By default, QUAST v.2.0 and higher filters out ambiguous and redundant alignments, keeping only one alignment per contig
(or one set of non-overlapping or slightly overlapping alignments). This option makes QUAST count all alignments.
<div class='option'>
<code><b>--ambiguity-usage</b> &lt;<b>none</b>|<b>one</b>|<b>all</b>&gt;</code>
</div>
Way of processing equally good alignments (probably repeats):<br>
<table style="margin-left: 30px; font-size: 1em; vertical-align: bottom;">
<tr><td style='width: 40px;'><code>none</code></td><td>skip all such alignments;</td></tr>
<tr><td><code>one</code></td><td>take only one (the first one);</td></tr>
<tr><td><code>all</code></td><td>use all alignments. Can cause a significant increase in <span class='metric-ref'># mismatches</span>
(repeats are almost always inexact due to accumulated SNPs, indels, etc.).</td></tr>
</table>
The default value is 'one'.
<div class='option'>
<code><b>--strict-NA</b></code>
</div>
Break contigs at every misassembly event (including local ones) to compute <span class='metric-ref'>NAx</span> and
<span class='metric-ref'>NGAx</span> statistics. By default, QUAST breaks contigs <i>only at extensive</i> misassemblies (not local ones).
<div class='option'>
<code><b>--no-plots</b></code>
</div>
Do not draw plots. This will speed up computation but you will get only text reports as a result.
<div class='option'>
<code><b>--test</b></code>
</div>
Run the tool on data from the <code>test_data</code> folder and check correctness of the evaluation process. Output is saved in quast_test_output.
<div class='option'>
<code><b>-h</b></code> (or <code>--help</code>)
</div>
Print help.
</div>
<a name="sec2.5"></a>
<h3>2.5 Metagenomic assemblies</h3>
<p>
The <code>metaquast.py</code> script accepts multiple references. One can provide several files, a merged FASTA file with multiple sequences, or a combination.
The tool partitions all contigs in groups aligned to each reference.
Then it runs quast.py several times:<br>
<ul>
<li>for all references in combination,
<li>for each reference separately, by using corresponding contigs,
<li>for the rest of the contigs that were not aligned anywhere.
</ul>
<!-- <table style="margin-left: 20px;">
<tr>
<td style="padding-right: 20px; padding-bottom: 5px;">- for all references in combination,</td>
<td></td>
</tr>
<tr>
<td style="padding-right: 20px; padding-bottom: 5px;">- for each reference separately, by using corresponding contigs,</td>
<td></td>
</tr>
<tr>
<td style="padding-right: 20px; padding-bottom: 5px;">- for the rest of the contigs that were not aligned anywhere.</td>
<td></td>
</tr>
</table> -->
<p>All outputs are in separate directories inside the directory provided by <code>-o</code> (or in quast_results/latest).</p>
Usage:
<pre class="code">
python metaquast.py contigs_1 contigs_2 ... -R reference_1,reference_2,reference_3,...
</pre>
All options are the same as for quast.py, except for <code>-R</code>: it can accept multiple references.</p>
<a name="sec3"></a>
<h2>3. QUAST output</h2>
<p>If an output path was not specified manually, QUAST puts its output into the directory <code>quast_results/results_&lt;date_time&gt;</code> and creates a symlink <code>latest</code> to it inside the directory <code>quast_results/</code>.
<br>
<br>
QUAST output contains:
<table style="margin-left: 20px; font-size: 1em;">
<tr>
<td style="padding-right: 20px;">report.txt</td>
<td>an assessment summary in a simple text format,</td>
</tr>
<tr>
<td style="padding-right: 20px;">report.tsv</td>
<td>a tab-separated version of the summary, suitable for spreadsheets (Google Docs, Excel, etc),</td>
</tr>
<tr>
<td style="padding-right: 20px;">report.tex</td>
<td>a LaTeX version of the summary,</td>
</tr>
<tr>
<td style="padding-right: 20px;">alignment.svg</td>
<td>a contig alignment plot (file is created if the matplotlib python library is installed),</td>
</tr>
<tr>
<td style="padding-right: 20px;">report.pdf</td>
<td>all other plots combined with all tables (file is created if the matplotlib python library is installed),</td>
</tr>
<tr>
<td style="padding-right: 20px;">report.html</td>
<td>an HTML version of the report with interactive plots inside it,</td>
</tr>
<tr>
<td style="padding-right: 20px;">contigs_reports/</td>
<td></td>
</tr>
<tr>
<td style="padding-right: 20px; padding-left: 20px;">misassemblies_report</td>
<td>a detailed report on misassemblies. See <a href="#sec3.1.2">section 3.1.2</a> for details,</td>
</tr>
<tr>
<td style="padding-right: 20px; padding-left: 20px;">unaligned_report</td>
<td>a detailed report on unaligned and partially unaligned contigs. See <a href="#sec3.1.3">section 3.1.3</a> for details.</td>
</tr>
</table>
<br>
Note:
<ul style="margin-top: -12px;">
<li>metrics based on a reference genome are computed only if a reference is provided (see <a href="#sec2.4">section 2.4</a>), </li>
<li>metrics based on genes and operons are computed only if proper annotations are provided (see <a href="#sec2.4">section 2.4</a>). </li>
</ul>
</p>
<div class='metrics_description'>
<a name="sec3.1"></a>
<h3 style='margin-bottom: -15px;'>3.1 Metrics description</h3>
<a name="sec3.1.1"></a>
<h4>3.1.1 Summary report</h4>
<p><span class='metric-name'># contigs (≥<span class="rhs"> </span>x<span class="rhs"> </span>bp)</span>
is the total number of contigs of length <code>≥ x<span class="rhs"> </span>bp</code>.
Not affected by the <code>--min-contig</code> parameter (see <a href="#sec2.4">section 2.4</a>).</p>
<p><span class='metric-name'>Total length (≥<span class="rhs"> </span>x<span class="rhs"> </span>bp)</span>
is the total number of bases in contigs of length <code>≥ x<span class="rhs"> </span>bp</code>.
Not affected by the <code>--min-contig</code> parameter (see <a href="#sec2.4">section 2.4</a>).<br>
<br>
<i>All remaining metrics are computed only for the contigs that exceed the threshold specified by the
<code>--min-contig</code> option (see <a href="#sec2.4">section 2.4</a>, default is 500).</i>
</p>
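<p>
A minimal Python sketch of the two threshold-based metrics, assuming the contig lengths are already known. The function name <code>threshold_stats</code> and the sample lengths are hypothetical; the default thresholds mirror the documented default of <code>--contig-thresholds</code> (0,1000).
</p>
<pre class="code">
```python
def threshold_stats(contig_lengths, thresholds=(0, 1000)):
    """For each threshold x, count contigs of length >= x and sum their lengths."""
    stats = {}
    for x in thresholds:
        kept = [length for length in contig_lengths if length >= x]
        stats[x] = (len(kept), sum(kept))
    return stats


# Hypothetical assembly with four contigs.
lengths = [250, 800, 1000, 4200]
for x, (count, total) in threshold_stats(lengths).items():
    print(f"# contigs (>= {x} bp) = {count}, total length (>= {x} bp) = {total}")
```
</pre>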
<p><span class='metric-name'># contigs</span> is the total number of contigs in the assembly.</p>
<p><a name='largest_contig'></a><span class='metric-name'>Largest contig</span> is the length of the longest contig in the assembly.</p>
<p><span class='metric-name'>Total length</span> is the total number of bases in the assembly.</p>
<p><span class='metric-name'>Reference length</span> is the total number of bases in the reference.</p>
<p><a name='GC'></a><span class='metric-name'>GC (%)</span> is the total number of G and C nucleotides in the assembly,
divided by the total length of the assembly.</p>
<p><span class='metric-name'>Reference GC (%)</span> is the percentage of G and C nucleotides in the reference.</p>
<p><a name='N50'></a><span class='metric-name'>N50</span> is the length for which the collection of all contigs of that length or longer
covers at least half an assembly.<br>
<p><a name='NG50'></a><span class='metric-name'>NG50</span> is the length for which the collection of all contigs of that length or longer
covers at least half a reference genome.<br> This metric is computed only if
a reference genome is provided.</p>
<p><span class='metric-name'>N75 and NG75</span> are defined similarly with 75<span class="rhs"> </span>% instead of 50<span class="rhs"> </span>%.</p>
<p><span class='metric-name'>L50 (L75, LG50, LG75)</span> is the number of contigs that are at least as long as N50 (N75, NG50, NG75).<br>
In other words, L50, for example, is the minimal number of contigs that cover half the assembly.</p>
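<p>
The Nx and Lx definitions above can be sketched as follows. This is an illustrative Python function under the stated definitions, not QUAST's implementation; the name <code>nx_and_lx</code>, the sample contig lengths and the reference length of 400 are hypothetical. Passing the assembly length as the target gives N50/L50, and passing a reference length gives NG50/LG50.
</p>
<pre class="code">
```python
def nx_and_lx(contig_lengths, target_length, fraction=0.5):
    """Return (Nx, Lx): the contig length and the contig count at which the
    longest contigs together reach `fraction` of `target_length`."""
    covered = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        covered += length
        if covered >= target_length * fraction:
            return length, count
    # The contigs never reach the requested fraction (possible for NGx).
    return None, None


lengths = [80, 70, 50, 40, 30, 20]
n50, l50 = nx_and_lx(lengths, sum(lengths))   # target is the assembly length
ng50, lg50 = nx_and_lx(lengths, 400)          # target is a hypothetical reference length
n75, l75 = nx_and_lx(lengths, sum(lengths), fraction=0.75)
print(n50, l50, ng50, lg50, n75, l75)
```
</pre>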
<p><span class='metric-name'># misassemblies</span> is the number of positions in the contigs that satisfy one of the following criteria:<br>
<ul style='margin-top: -24px;'>
<li>the left flanking sequence aligns over 1<span class="rhs"> </span>kbp away from the right flanking sequence on the reference;
<li>flanking sequences overlap by more than 1<span class="rhs"> </span>kbp;
<li>flanking sequences align to different strands or different chromosomes.
</ul>
This metric requires a reference genome.</p>
<p><span class='metric-name'># misassembled contigs</span> is the number of contigs that contain misassembly events.</p>
<p><span class='metric-name'>Misassembled contigs length</span> is the total number of bases in misassembled contigs.</p>
<p><span class='metric-name'># local misassemblies</span> is the number of breakpoints that satisfy the following conditions:
<ol class='my_ol' style='margin-top: -24px;'>
<li>Two or more distinct alignments cover the breakpoint.</li>
<li>The gap between left and right flanking sequences is less than 1<span class="rhs"> </span>kbp.</li>
<li>The left and right flanking sequences both are on the same strand of the same chromosome of the reference genome.</li>
</ol> </p>
<p><span class='metric-name'># unaligned contigs</span> is the number of contigs that have no alignment to the
reference sequence. The value "X<span class='rhs'> </span>+<span class='rhs'> </span>Y part" means X totally unaligned contigs plus Y partially unaligned contigs. </p>
<p><span class='metric-name'>Unaligned length</span> is the total length of all unaligned regions in the assembly
(sum of lengths of fully unaligned contigs and unaligned parts of partially unaligned ones). </p>
<!--p><span class='metric-name'># ambiguous contigs</span> is the number of contigs which have reference alignments
of equal quality in multiple locations on the reference. </p>
<p><span class='metric-name'>Ambiguous contigs length</span> is the number of total bases contained in all ambiguous contigs. </p-->
<p><span class='metric-name'>Genome fraction (%)</span> is the percentage of aligned bases in the reference.
A base in the reference is aligned if there is at least one contig with at least one alignment to this base.
Contigs from repetitive regions may map to multiple places, and thus may be counted multiple times.</p>
<p><span class='metric-name'>Duplication ratio</span> is the total number of aligned bases in the assembly divided by the total number of aligned bases in the
reference (see <span class='metric-ref'>Genome fraction (%)</span> for the 'aligned base' definition). If the assembly contains many contigs that cover the same
regions of the reference, its <span class='metric-ref'>duplication ratio</span> may be much larger than 1. This may occur due to overestimating
repeat multiplicities and due to small overlaps between contigs, among other reasons.</p>
<p><span class='metric-name'># N's per 100<span class="rhs"> </span>kbp</span> is the average number of uncalled bases (N's) per 100<span class='hs'></span>000 assembly bases.</p>
<p><span class='metric-name'># mismatches per 100<span class="rhs"> </span>kbp</span> is the average number of mismatches
per 100<span class='hs'></span>000 aligned bases. True SNPs and sequencing errors are not distinguished and are counted equally.
</p>
<p><span class='metric-name'># indels per 100<span class="rhs"> </span>kbp</span> is the average number of indels per 100<span class='hs'></span>000 aligned bases.
Several consecutive single nucleotide indels are counted as one indel.</p>
<p><a name='genes'></a><span class='metric-name'># genes</span> is the number of genes in the assembly (complete and partial), based on a user-provided
list of gene positions in the reference. A gene is considered 'partially covered' if the assembly contains at least 100<span class="rhs"> </span>bp
of this gene but not the whole one.<br>
<span style='line-height: 50%;'> </span><br>
This metric is computed only if a reference genome and an annotated list of gene positions are provided (see <a href="#sec2.4">section 2.4</a>).</p>
<p><span class='metric-name'># operons</span> is defined similarly to <span class='metric-ref'># genes</span>, but an operon positions file is required instead.</p>
<p><a name='predicted_genes'></a><span class='metric-name'># predicted genes</span> is the number of genes in the assembly
found by GeneMark.hmm, GlimmerHMM or MetaGeneMark. See the description of the <a href='#gene_finding'><code>--gene-finding</code></a> option for details.</p>
<p><span class='metric-name'>Largest alignment</span> is the length of the largest continuous alignment in the assembly.
This value can be smaller than the value of <a href='#largest_contig'><span class='metric-ref'>largest contig</span></a> if the largest contig is misassembled.</p>
<p><a name='NAx'></a><a name='NGAx'></a><span class='metric-name'>NA50, NGA50, NA75, NGA75, LA50, LA75, LGA50, LGA75</span> ("A" stands for "aligned") are similar to
the corresponding metrics without "A", but in this case aligned blocks instead of contigs are considered.<br>
Aligned blocks are obtained by breaking contigs at misassembly events and removing all unaligned bases.</p>
<a name="sec3.1.2"></a>
<h4>3.1.2 Misassemblies report</h4>
<p><span class='metric-name'># misassemblies</span> is the same as <span class='metric-ref'># misassemblies</span> from <a href="#sec3.1.1">section 3.1.1</a>.
However, this report also contains a classification of all misassemblies into three groups: <span class='metric-ref'>relocations</span>, <span class='metric-ref'>translocations</span>,
and <span class='metric-ref'>inversions</span> (see below).</p>
<p><span class='metric-name'>Relocation</span> is a misassembly where the left flanking sequence aligns over 1<span class="rhs"> </span>kbp away from the right flanking
sequence on the reference, or they overlap by more than 1<span class="rhs"> </span>kbp, and both flanking sequences align on the same chromosome. </p>
<p><span class='metric-name'>Translocation</span> is a misassembly where the flanking sequences align on different chromosomes. </p>
<p><span class='metric-name'>Inversion</span> is a misassembly where the flanking sequences align on
opposite strands of the same chromosome. </p>
<p><span class='metric-name'># misassembled contigs</span> and <span class='metric-ref'>misassembled contigs length</span> are the same as the metrics from
<a href="#sec3.1.1">section 3.1.1</a> and are counted among all contigs with any type of misassembly
(relocation, translocation or inversion). </p>
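<p>The three definitions above can be sketched as a small classifier. This is a simplified illustration, not QUAST's actual implementation: the dictionary keys, the order in which the checks are applied, and the way the gap or overlap on the reference is computed are all assumptions made for this example.</p>

```python
def classify_misassembly(left, right, threshold=1000):
    """Classify the junction between two flanking alignments.

    left, right: dicts with keys 'chrom', 'strand', 'start', 'end'
                 (reference coordinates, start <= end). Hypothetical format.
    Returns 'translocation', 'inversion', 'relocation' or None.
    """
    if left["chrom"] != right["chrom"]:
        return "translocation"
    if left["strand"] != right["strand"]:
        return "inversion"
    # Positive value: distance between the alignments on the reference.
    # Negative value: the alignments overlap by that many bases.
    gap = max(left["start"], right["start"]) - min(left["end"], right["end"]) - 1
    if abs(gap) > threshold:
        return "relocation"
    return None  # not an extensive misassembly under these rules


a = {"chrom": "chr1", "strand": "+", "start": 1, "end": 1000}
far = {"chrom": "chr1", "strand": "+", "start": 5000, "end": 6000}
flipped = {"chrom": "chr1", "strand": "-", "start": 1200, "end": 2000}
other = {"chrom": "chr2", "strand": "+", "start": 1, "end": 1000}
near = {"chrom": "chr1", "strand": "+", "start": 1100, "end": 2000}
```

<p>With this sample data, <code>classify_misassembly(a, far)</code> reports a relocation, while the junction between <code>a</code> and <code>near</code> is below the 1<span class="rhs"> </span>kbp threshold.</p>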
<p><span class='metric-name'># local misassemblies</span> is the same as <span class='metric-ref'># local misassemblies</span> from
<a href="#sec3.1.1">section 3.1.1</a>.
</p>
<p><span class='metric-name'># mismatches</span> is the number of mismatches in all aligned bases.</p>
<p><span class='metric-name'># indels</span> is the number of indels in all aligned bases.</p>
<p><span class='metric-name'># short indels (≤<span class="rhs"> </span>5<span class="rhs"> </span>bp)</span> is the number of indels of length <code>≤<span class="rhs"> </span>5<span class="rhs"> </span>bp</code>.</p>
<p><span class='metric-name'># long indels (><span class="rhs"> </span>5<span class="rhs"> </span>bp)</span> is the number of indels of length <code>><span class="rhs"> </span>5<span class="rhs"> </span>bp</code>.</p>
<p><span class='metric-name'>Indels length</span> is the total number of bases contained in all indels.</p>
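<p>The relation between the indel metrics above can be shown with a short sketch. It assumes that the lengths of all indels have already been extracted from the alignments; the function name and the sample list are hypothetical.</p>

```python
def indel_summary(indel_lengths, short_max=5):
    """Summarise indels given their lengths in bp."""
    short = sum(1 for n in indel_lengths if n <= short_max)
    return {
        "# indels": len(indel_lengths),
        "# short indels": short,
        "# long indels": len(indel_lengths) - short,
        "Indels length": sum(indel_lengths),
    }


# Hypothetical indel lengths (bp) found in the aligned bases.
summary = indel_summary([1, 1, 2, 5, 6, 12])
```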
<a name="sec3.1.3"></a>
<h4>3.1.3 Unaligned report</h4>
<p><span class='metric-name'># fully unaligned contigs</span> is the number of contigs that have no alignment to the reference sequence.</p>
<p><span class='metric-name'>Fully unaligned length</span> is the total number of bases in all unaligned contigs.</p>
<p><span class='metric-name'># partially unaligned contigs</span> is the number of contigs that are not fully unaligned, but have fragments
with no alignment to the reference sequence.</p>
<p><span class='metric-name'># with misassembly</span> is the number of partially unaligned contigs that have a misassembly in their aligned fragment.
Note that such misassemblies are not counted in <span class='metric_ref'># misassemblies</span> and other <span class='metric_ref'>misassemblies</span> statistics.</p>
<p><span class='metric-name'># both parts are significant</span> is the number of partially unaligned contigs that have both aligned and unaligned fragments
longer than the value of <a href='#min_contig'><code>--min-contig</code></a>.</p>
<p><span class='metric-name'>Partially unaligned length</span> is the total number of unaligned bases in all partially unaligned contigs.</p>
<p><span class='metric-name'># N's</span> is the total number of uncalled bases (N's) in the assembly.</p>
<a name="sec3.2"></a>
<h3>3.2 Plots description</h3>
<p><span class='metric-name'>Contig alignment plot</span> shows alignment of contigs to the reference genome and the positions of misassemblies in these contigs.
Contigs that align correctly are colored blue if the boundaries agree (within 2 kbp on each side, for contigs larger than 10 kbp) in at least half of the assemblies,
and green otherwise.
Blocks of misassembled contigs are colored orange if the boundaries agree in at least half of the assemblies, and red otherwise.
Contigs are staggered vertically and are shown in different shades of their color in order to distinguish the separate contigs, including small ones. If the reference
file consists of several sequences, all of them are drawn on a single plot, horizontally next to each other.</p>
<p><span class='metric-name'>Cumulative length plot</span> shows the growth of contig lengths. On the x-axis, contigs are ordered from the largest
to smallest. The y-axis gives the total size of the x largest contigs in the assembly.</p>
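<p>As a rough illustration of what this plot shows, the y values can be computed as running sums over the contig lengths sorted from the largest to the smallest. The function name and the sample lengths below are hypothetical.</p>

```python
from itertools import accumulate


def cumulative_lengths(contig_lengths):
    """Return the y values of the cumulative length plot.

    Element i (0-based) is the total size of the i + 1 largest contigs.
    """
    ordered = sorted(contig_lengths, reverse=True)
    return list(accumulate(ordered))


points = cumulative_lengths([100, 400, 300])
```

<p>The same computation applied to aligned block lengths gives the cumulative length plot for aligned contigs.</p>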
<p><span class='metric-name'>Nx plot</span> shows <a href='#N50'><span class='metric-ref'>Nx</span></a> values as x varies from 0 to 100<span class='rhs'> </span>%.</p>
<p><span class='metric-name'>NGx plot</span> shows <a href='#NG50'><span class='metric-ref'>NGx</span></a> values as x varies from 0 to 100<span class='rhs'> </span>%.</p>
<p><span class='metric-name'>GC content plot</span> shows the distribution of <a href='#GC'>GC content</a> in the contigs.<br>
<span style='line-height: 50%;'> </span><br>
The x value is the GC percentage (0 to 100<span class='rhs'> </span>%).<br>
The y value is the number of non-overlapping 100<span class="rhs"> </span>bp windows whose GC content equals x<span class='rhs'> </span>%.<br>
<span style='line-height: 50%;'> </span><br>
For a single genome, the distribution is typically Gaussian. However, for assemblies with contaminants, the GC distribution
appears to be a superposition of Gaussian distributions, giving a plot with multiple peaks.
</p>
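<p>The window counting described above can be sketched as follows. This is only an illustration: it assumes that a trailing window shorter than 100<span class="rhs"> </span>bp is ignored and that any base other than G or C (including N) counts as non-GC, which may differ from what QUAST does.</p>

```python
def gc_histogram(sequence, window=100):
    """Count non-overlapping windows by their GC percentage.

    Returns a list of 101 counts, where index x holds the number of
    windows whose GC content rounds to x %.
    """
    counts = [0] * 101
    seq = sequence.upper()
    # Step through full windows only; a shorter tail is skipped (assumption).
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        counts[round(100.0 * gc / window)] += 1
    return counts


# Hypothetical contig: three full windows plus a 40 bp tail.
contig = "G" * 100 + "A" * 100 + "GC" * 25 + "AT" * 25 + "ACGT" * 10
hist = gc_histogram(contig)
```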
<p><span class='metric-name'>Cumulative length plot for aligned contigs</span> shows the growth of lengths of aligned blocks.
If a contig has a misassembly, QUAST breaks it into smaller pieces called aligned blocks.<br>
<span style='line-height: 50%;'> </span><br>
On the x-axis, blocks are ordered from the largest
to smallest. The y-axis gives the total size of the x largest aligned blocks.<br>
This plot is created only if a reference genome is provided.</p>
<p><span class='metric-name'>NAx and NGAx plots</span><br>
These plots are similar to the <span class='metric-ref'>Nx</span> and <span class='metric-ref'>NGx</span> plots but for the <a href='#NAx'>NAx and NGAx</a> metrics respectively.
These plots are created only if a reference genome is provided.</p>
<p><a name='gene_plot'></a><span class='metric-name'>Genes plot</span> shows the growth rate of full genes in assemblies.<br>
The y-axis is the number of full genes in the assembly, and the x-axis is the number of contigs in the assembly (from the largest one to the smallest one).<br>
This plot is created only if <a href='#sec2.2'>reference and gene annotation files</a> are given.</p>
<p><span class='metric-name'>Operons plot</span> is similar to the previous one but for operons.</p>
</div>
<a name="sec4"></a><p>
<h2>4. Adjusting QUAST reports and plots</h2>
<p> You can easily change the content, order of metrics, and metric names in all QUAST reports. To do this,
edit the <code>CONFIGURABLE PARAMETERS</code> section in <code>libs/reporting.py</code>. It contains many informative comments
that will help you adjust QUAST reports even if you are new to Python.
</p>
<p> You can also adjust plot colors, style and width of lines, legend font, plot output format, etc.
Please see the <code>CONFIGURABLE PARAMETERS</code> section in <code>libs/plotter.py</code>.
</p>
<p> Note: if you restart QUAST on the same directory with new parameters, it will reuse alignments and run much faster.
See the description of the <code>-o</code> option in <a href="#sec2.4">section 2.4</a>.
</p>
<a name="sec5"></a><p>
<h2>5. Citation</h2><br>
If you use QUAST in your research, please include <i>Gurevich et al., 2013</i> in your reference list:<br>
<div style='margin-top: 5px; padding-left: 30px; padding-top: 5px; padding-bottom: 5px; background-color: #EEE'>
Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi and Glenn Tesler, <br>
QUAST: quality assessment tool for genome assemblies, <br>
Bioinformatics (2013) 29 (8): 1072-1075. doi: <a href="http://dx.doi.org/10.1093/bioinformatics/btt086">10.1093/bioinformatics/btt086</a><br>
First published online: February 19, 2013
</div>
<a name="sec6"></a><p>
<h2>6. Feedback and bug reports</h2>
<p> We will be thankful if you help us make QUAST better by sending your comments, bug reports, and suggestions
to <a href="mailto:[email protected]">[email protected]</a>.
</p>
<p>
We kindly ask you to attach the <code>quast.log</code> file from the output directory (or an archive of the entire folder) if you have trouble running QUAST.
<br><br>
Note that if you didn't specify the output directory manually, it is automatically set to <code>quast_results/results_&lt;date_time&gt;</code>, with a symbolic link <code>quast_results/latest</code> to that directory.
</p>
<a name="sec7"></a><p>
<h2>7. FAQ</h2>
<p>
This section contains the most popular questions about QUAST output. Read the answers for a deeper understanding of the results generated by the tool.
</p>
<p>
Several answers describe files under the <code>&lt;quast_output_dir&gt;</code> directory.
<br>
If you use the command-line version of QUAST, you specify <code>&lt;quast_output_dir&gt;</code> with the <code>-o</code> option; by default it is <code>"quast_results/latest"</code>.
<br>
If you use http://quast.bioinf.spbau.ru/, download the full report by pressing the <code>"Download report"</code> button (in the top-right corner),
decompress the result and go to the <code>"full_report"</code> subdirectory.
</p>
<br>
<p><b><i>
Q1. It seems that QUAST is giving me a differing number of misassemblies
and misassembled contigs. Does this imply that QUAST looks for
multiple misassemblies within one contig?
</i></b></p>
<p>
Yes, you are right: QUAST looks for multiple misassemblies within one contig. Thus, the
number of misassembled contigs is always less than or equal to the number of misassemblies.
</p>
<br>
<p><b><i>
Q2. Is there a way to get only misassembled contigs of the assembly?
</i></b></p>
<p>
Yes, there is.
<br>
QUAST copies all misassembled contigs of the <code>"&lt;assembly_name&gt;"</code> assembly into the
<code>&lt;quast_output_dir&gt;/contigs_reports/&lt;assembly_name&gt;.mis_contigs.fa</code> file.
<br>
E.g. if your assembly is "contigs.fasta", the file is "contigs.mis_contigs.fa";
if your assembly is "ecoli_assembly_1.fasta", the file is "ecoli_assembly_1.mis_contigs.fa".
<br>
</p>
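<p>For scripting, the file location can be derived along these lines. The helper below is hypothetical and simply mirrors the two examples above; it assumes the assembly name is the input file name with its last extension removed, which may not hold for every input (for example, compressed files).</p>

```python
import os


def mis_contigs_path(output_dir, assembly_file):
    """Build the expected path of the misassembled contigs file."""
    # Assumption: assembly name = file name without its last extension.
    name = os.path.splitext(os.path.basename(assembly_file))[0]
    return os.path.join(output_dir, "contigs_reports", name + ".mis_contigs.fa")


path = mis_contigs_path("quast_results/latest", "ecoli_assembly_1.fasta")
```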
<br>
<p><b><i>
Q3. Is it possible to find which misassembly corresponds to each contig and which kind of a misassembly it is?
</i></b></p>
<p>
Yes, it is possible.
<br>
Open the <code>&lt;quast_output_dir&gt;/contigs_reports/contigs_report_&lt;assembly_name&gt;.stdout</code> file.
<br>
E.g. if your assembly is "contigs.fasta", the file is "contigs_report_contigs.stdout"; if your assembly is "ecoli_assembly_1.fasta", the file is "contigs_report_ecoli_assembly_1.stdout".
<br>
<br>
Then search for "Extensive misassembly" in the file and check the surrounding lines to find the name of the contig that contains this misassembly.
<br><br>
Let's look at the following example:
<br>
<code>
<pre>
CONTIG: <b>NODE_772 (575bp)</b>
Top Length: 296 Top ID: 100.0
Skipping redundant alignment 1096745 1096882 | 138 1 | 138 138 | 98.55 | Escherichia_coli NODE_772
This contig is misassembled. 3 total aligns.
Real Alignment 1: 924846 925134 | 287 575 | 289 289 | 100.0 | Escherichia_coli NODE_772
<b>Extensive misassembly ( inversion )</b> between these two alignments
Real Alignment 2: 924906 925201 | 296 1 | 296 296 | 100.0 | Escherichia_coli NODE_772
</pre>
</code>
In this example, the contig name is <b>NODE_772</b> and its length is <b>575 bp</b>.
This contig has two alignments and one misassembly. The type of the misassembly is <b>inversion</b>. QUAST also reports relocations and translocations,
see <a href="#sec3.1.2">section 3.1.2</a> for details.
<br><br>
Let's look at another example:
<br>
<code>
<pre>
CONTIG: <b>Contig_753 (140518bp)</b>
Top Length: 121089 Top ID: 99.98
Skipping redundant alignments after choosing the best set of alignments
Skipping redundant alignment 273398 273468 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
<i>....</i>
Skipping redundant alignment 3363797 3363867 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
This contig is misassembled. 14 total aligns.
Real Alignment 1: 1425621 1426074 | 19431 18978 | 454 454 | 100.0 | Escherichia_coli Contig_753
<i>Gap between these two alignments (local misassembly)</i>. Inconsistency = 148
Real Alignment 2: 1426295 1426818 | 18905 18382 | 524 524 | 100.0 | Escherichia_coli Contig_753
<b>Extensive misassembly ( relocation, inconsistency = 2224055 )</b> between these two alignments
Real Alignment 3: 3650278 3650348 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
<b>Extensive misassembly ( relocation, inconsistency = 236807 )</b> between these two alignments
Real Alignment 4: 3765544 3886652 | 140518 19430 | 121109 121089 | 99.98 | Escherichia_coli Contig_753
<b>Extensive misassembly ( relocation, inconsistency = -1052 )</b> between these two alignments
Real Alignment 5: 3886649 3905037 | 18381 1 | 18389 18381 | 99.96 | Escherichia_coli Contig_753
</pre>
</code>
This contig is <b>Contig_753</b> of length <b>140518 bp</b>. It has <b>3 extensive misassemblies</b> (all three are <b>relocations</b>) and one local misassembly.
</p>
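<p>The manual search described above can also be scripted. The sketch below relies only on the line shapes visible in the two examples (<code>CONTIG: name (length bp)</code> and <code>Extensive misassembly ( type ...</code>); it assumes the bold and italic markup shown in this manual is not present in the real file, and the report format may differ between QUAST versions. The function name is hypothetical.</p>

```python
import re
from collections import defaultdict

CONTIG_RE = re.compile(r"^CONTIG:\s+(\S+)\s+\((\d+)bp\)")
MISASSEMBLY_RE = re.compile(r"Extensive misassembly \(\s*(\w+)")


def extensive_misassemblies(lines):
    """Map each contig name to the list of its extensive misassembly types."""
    result = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        contig = CONTIG_RE.match(line)
        if contig:
            current = contig.group(1)
            continue
        mis = MISASSEMBLY_RE.search(line)
        if mis and current is not None:
            result[current].append(mis.group(1))
    return dict(result)


# Lines copied from the first example above (markup removed).
sample = [
    "CONTIG: NODE_772 (575bp)",
    "This contig is misassembled. 3 total aligns.",
    "Real Alignment 1: 924846 925134 | 287 575 | 289 289 | 100.0 | Escherichia_coli NODE_772",
    "  Extensive misassembly ( inversion ) between these two alignments",
    "Real Alignment 2: 924906 925201 | 296 1 | 296 296 | 100.0 | Escherichia_coli NODE_772",
]
found = extensive_misassemblies(sample)
```

<p>To process a real report, pass an open file object instead of the sample list. Local misassemblies are deliberately ignored here because their lines do not start with "Extensive misassembly".</p>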
<br>
<p><b><i>
Q4. Could you explain the format of Real Alignments in contigs report files (see the answer for Q3)?
</i></b></p>
<p>
Yes, sure. Let's look at the following example:
<br>
<code>
<pre>
Real Alignment 1: 19796 20513 | 29511 30228 | 718 718 | 100.0 | ENA|U00096|U00096.2_Escherichia_coli contig-710
</pre>
</code>
The first two numbers are the start and end positions on the target (reference), and the next two are the start and end positions on the query (contig).
Note that positions on the target are always ascending, while positions on the query can be
ascending (positive strand) or descending (negative strand).
<br><br>
The next two numbers (in this case: 718 718) mean "the number of aligned bases on the target" and "the number of aligned bases on the query".
They are usually equal, but they can differ slightly because of short insertions and deletions. Strictly speaking, these numbers are redundant because they can be easily
calculated from the first two pairs of numbers (positions on the target and positions on the query). However, it is sometimes convenient to have them at hand.
<br><br>
The last number (in this case: 100.0) is the Nucmer aligner quality metric. It is called "identity %" (IDY %) and it describes the quality of the alignment
(the number of mismatches and indels between the target and the query). If IDY% = 100.0, the alignment is perfect,
i.e. all bases on the target and on the query are equal to each other. If IDY% is less than 100.0,
the target and the query differ slightly. QUAST has a threshold on IDY% of 95%,
so alignments with IDY% below 95% are not used (they are relatively bad).
<br><br>
And finally, the last two columns are the name of the target sequence (i.e. reference name) and the name of the query (i.e. contig name).
</p>
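<p>The format described above can be parsed with a few lines of code. This is a minimal sketch that assumes the columns are separated by <code>" | "</code> (a vertical bar surrounded by spaces), as in the example; splitting on the bare bar would break reference names that contain it, such as the one shown. The function name and the dictionary keys are hypothetical.</p>

```python
def parse_real_alignment(line):
    """Parse one 'Real Alignment' line from a contigs report file."""
    body = line.split(":", 1)[1].strip()
    # At most 4 splits, so bars inside the reference name stay intact.
    ref_pos, query_pos, lengths, idy, names = body.split(" | ", 4)
    ref_start, ref_end = map(int, ref_pos.split())
    query_start, query_end = map(int, query_pos.split())
    ref_len, query_len = map(int, lengths.split())
    ref_name, contig_name = names.rsplit(None, 1)
    return {
        "ref_start": ref_start,
        "ref_end": ref_end,
        "query_start": query_start,
        "query_end": query_end,
        "ref_len": ref_len,
        "query_len": query_len,
        "idy": float(idy),
        # Ascending query positions mean the positive strand.
        "strand": "+" if query_start <= query_end else "-",
        "ref_name": ref_name,
        "contig_name": contig_name,
    }


aln = parse_real_alignment(
    "Real Alignment 1: 19796 20513 | 29511 30228 | 718 718 | 100.0 | "
    "ENA|U00096|U00096.2_Escherichia_coli contig-710"
)
```

<p>For this line, the aligned length on the target can be re-derived from the positions as <code>ref_end - ref_start + 1</code>, which matches the reported value of 718.</p>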
<br>
<p><b><i>
Q5. Where does QUAST save information about SNPs?
</i></b></p>
<p>
There are two output files concerning SNPs. Both of them are saved in the <code>&lt;quast_output_dir&gt;/contigs_reports/nucmer_output/</code> directory.
<br>
The first one has the extension ".all_snps" and is raw Nucmer aligner output. Its format is:
<br>
<code>
<pre>
[P1] [SUB] [SUB] [P2] [BUFF] [DIST] [R] [Q] [FRM] [TAGS]
15383 T G 3339560 1 15383 3 2 1 -1 Escherichia_coli contig_15
</pre>
</code>
where P1 is the position on the reference, the first SUB is the nucleotide in the reference, the second SUB is the nucleotide in the contig, P2 is the position on the contig,
BUFF is the distance from this SNP to the nearest mismatch (end of alignment, indel, SNP, etc.) in the same alignment, and DIST
is the distance from this SNP to the nearest sequence end.
<br>
R and Q specify the number of other alignments that overlap this position (in the reference and the query (i.e. contig), respectively).
FRM and TAGS are not documented in the Nucmer help message; the last two columns are the reference name and the contig name.
<br>
<br>
The second file ("*.used_snps") is generated by QUAST.
<br>
We analyse all alignments, filter them by skipping "uninformative" (redundant or duplicated) ones, and then include in the ".used_snps" file
only those SNPs that actually appear in the filtered alignments. Thus, the "# mismatches per 100 kbp" and "# indels per 100 kbp" values reported by QUAST are based
on USED SNPs, not ALL SNPs.
<br>
In addition, we use our own format for the ".used_snps" file:
<br>
<code>
<pre>
Escherichia_coli contig_15 728803 C . 3217983
</pre>
</code>
where the columns are: reference name, contig name, position on the reference, nucleotide in the reference, nucleotide in the contig
(in this case it is ".", i.e. the absence of a nucleotide in the contig, which means a deletion), and position on the contig.
</p>
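<p>A line of the ".used_snps" file can be read as sketched below, assuming the six columns are separated by whitespace as in the example. The text above only states that "." in the contig column means a deletion; treating "." in the reference column as an insertion is an assumption made here by symmetry. The function name and the dictionary keys are hypothetical.</p>

```python
def parse_used_snp(line):
    """Parse one line of a '.used_snps' file into a dictionary."""
    ref_name, contig_name, ref_pos, ref_base, contig_base, contig_pos = line.split()
    if contig_base == ".":
        kind = "deletion"   # nucleotide is absent in the contig
    elif ref_base == ".":
        kind = "insertion"  # assumption: symmetric to the deletion case
    else:
        kind = "mismatch"
    return {
        "ref_name": ref_name,
        "contig_name": contig_name,
        "ref_pos": int(ref_pos),
        "ref_base": ref_base,
        "contig_base": contig_base,
        "contig_pos": int(contig_pos),
        "kind": kind,
    }


snp = parse_used_snp("Escherichia_coli contig_15 728803 C . 3217983")
```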
<br>
<br>
</body>
</html>