Could Additional Coverage (Reads) Have Improved the Assembly?

Genome Biol. 2009; 10(8): R88.

Assisted assembly: how to improve a de novo genome assembly by using related species

Sante Gnerre

¹Broad Institute of Harvard and MIT, Cambridge Center, Cambridge, Massachusetts 02142, USA

Eric S Lander

¹Broad Institute of Harvard and MIT, Cambridge Center, Cambridge, Massachusetts 02142, USA

Kerstin Lindblad-Toh

¹Broad Institute of Harvard and MIT, Cambridge Center, Cambridge, Massachusetts 02142, USA

²Department of Medical Biochemistry and Microbiology, Uppsala University, Husarg.3, Uppsala 751 23, Sweden

David B Jaffe

¹Broad Institute of Harvard and MIT, Cambridge Center, Cambridge, Massachusetts 02142, USA

Received 2009 Apr 7; Revised 2009 Jul 8; Accepted 2009 Aug 27.

Abstract

We describe a new assembly algorithm, where a genome assembly with low sequence coverage, either throughout the genome or locally, due to cloning bias, is considerably improved through an assisting process via a related genome. We show that the information provided by aligning the whole-genome shotgun reads of the target against a reference genome can be used to substantially improve the quality of the resulting assembly.

Background

How completely one can reconstruct a genome sequence from whole-genome shotgun (WGS) reads depends on the depth of sequence coverage generated [1]. Additionally, longer reads and better base quality in reads provide more information and, therefore, allow any assembler to perform a better job, resulting in both the generation of bigger contigs/scaffolds and improvements in the quality of the assembly. The genomes of many species, including the mammals Mus musculus [2], Canis familiaris [3], and Monodelphis domestica [4], have been assembled from Sanger-chemistry WGS reads at, respectively, 6.1×, 7.6×, and 6.7× coverage, yielding drafts that represent nearly all of the genomes' euchromatic parts. These drafts are of high quality, and although imperfect, have served as references for the community.

However, at times, the cost of genome sequencing or the biological properties of a genome sequence will force a genome to be sequenced at lower coverage. Since mammalian genomes are large, cost was a major factor when, in 2004, the idea was conceived to annotate the human genome using the genome sequence of many mammals [5]. A lower coverage of the genome was then considered since, theoretically, at 2× coverage $1 - e^{-2} \approx 86\%$ of the genome is represented [1].
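The coverage estimate above follows from treating the number of reads covering a base as Poisson distributed with mean equal to the coverage depth. The sketch below is only an illustration of that arithmetic; the function names are ours and are not part of the ARACHNE code.

```python
import math


def fraction_covered(coverage: float) -> float:
    """Expected fraction of bases covered by at least one read (Poisson model)."""
    return 1.0 - math.exp(-coverage)


def fraction_at_most(coverage: float, max_reads: int) -> float:
    """Poisson probability that a base is covered by max_reads reads or fewer."""
    return sum(
        math.exp(-coverage) * coverage ** k / math.factorial(k)
        for k in range(max_reads + 1)
    )


# At 2x coverage roughly 86% of the genome is expected to be represented.
print(f"{fraction_covered(2.0):.1%}")
```

The same model gives the fraction of a genome expected to fall at or below a given depth, which is useful when judging how much of a low-coverage region is due to chance rather than bias.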

When theoretically considering the challenge behind low coverage assembly, we note that low coverage (either global or local) makes the assembly problem much harder to deal with, since it affects our capability of both distinguishing true from false read-read alignments and building a list of confirmed non-chimeric read pair links. Since an important step of the assembly process is to generate a set of read-read alignments, errors introduced in this step will have a major effect on the final product. If we somehow could generate only perfect data in this step (that is, the set of all and only the 'true' alignments, where 'true' means that two reads align if, and only if, they come from overlapping regions in the genome), then we could produce the optimal assembly of the sequence data. In general, however, we are not even close to the 'perfect' set, and we end up with both missing alignments (true alignments that are not detected), and with 'false' alignments (alignments of reads that really belong to different regions of the genome). In addition, poor sequence quality, polymorphism and repetitiveness are reasons why true alignments may not be detected.

In principle, one could overcome this problem by introducing a method whereby low-coverage de novo assemblies may be improved via assistance from genome sequences of related species. If two species are very closely related, the problem is trivial since the overall genome structure is similar and read alignments to the related species will give the true position of reads also in the novel genome. However, in many cases no very similar genome exists as a template. As genomes become more diverged, two issues arise: reads may be more difficult to accurately align to the reference genome, and biological differences in genome structure (that is, conserved synteny breakpoints, repeat insertions, and segmental duplications) may mean that the read placements on the reference are not reflective of the novel genome sequence. In terms of read placement, Margulies and co-workers established that the BLASTZ algorithm [6] aligns reads reliably when the genomes diverge by up to approximately 0.45 substitutions per site. In addition, increased divergence usually correlates with increased amounts of genomic rearrangement.

We therefore conceived an assisted assembly method that works by reinforcing information that is already present in the reads. For example, consider two contigs connected by a single read pair. Because a small fraction (perhaps approximately 1%) of read pairs are chimeric - that is, result from a random ligation in the library construction process - joining the contigs would carry a roughly 1% chance of introducing a false join into the assembly. Now suppose both reads of the pair align consistently to a related genome. Because the odds that a chimeric read pair would align consistently are extremely low, we can safely join the contigs. Similarly, other information in a low-coverage data set may be suitably leveraged. We first tested this approach on the cat genome [7].

Here we describe the assisted assembly algorithms in detail, then test them on a low-coverage subset of a previously assembled high-coverage data set (C. familiaris), so that we can rigorously assess the effect of assistance on assembly accuracy, continuity and completeness. We then apply the method to several low-coverage mammals and the 8× Plasmodium falciparum HB3 assembly, which, due to cloning bias, is reduced to 2× or less over about 20% of the genome [8]. The assisted assembly method gives marked improvements in all cases.

The source code for the assisted assembly algorithms and the assemblies themselves are available online [9].

Assisted assembly algorithm

The assisted assembly process starts by simultaneously building a de novo assembly from the reads and by aligning the same reads to one or more related genomes. These alignments provide proximity relationships between the reads, which then seed changes to the assembly - for example, by adding in reads that had not been previously assembled. In the simplest case, a read has not been placed in a contig because its overlap with the contig is short. Now, with the additional evidence provided by cross-species proximity, the read can be placed with sufficient confidence. Similarly, alignment of a read pair to a related genome can validate the soundness of the read pair - nearly guaranteeing that it is not a chimera - thus allowing for a single read pair to join two scaffolds in the assembly. Once the initial assist has been performed, the algorithm iteratively carries out a series of standard assembly steps, such as adding in mate pairs, which can improve the quality of the assembly. This process may even correct errors introduced by the assistance process itself. Below and in Figure 1 we describe the key components of the assisted assembly algorithm.

Figure 1

Assisted assembly principle. (a) In this case, five reads align uniquely to the reference genome, and the two leftmost of these (purple) also appear as the two rightmost reads in an existing de novo contig. We can then extend the de novo contig by using the three unassembled reads (green), even if there is no supporting linking evidence (in general, ARACHNE requires a read to be linked to the contig it overlaps before using it to extend the contig). (b) Two scaffolds (blue and purple) are mapped and oriented on the reference genome by the trusted green reads. Furthermore, the two scaffolds are joined by a single link (black dotted line), although this is not trusted per se. The ARACHNE scaffolding algorithm would not normally join the two scaffolds; however, in this case the separation of the two scaffolds implied by the link is consistent with the separation implied by the mapping on the reference genome, and we thus implicitly validate the black dotted link and join the two scaffolds. (c) Trusted read placements anchor portions of a single scaffold onto two distant parts of the reference genome, suggesting either a bona fide syntenic break or a misassembly. To test for the latter, the contested region on the scaffold is subject to a stringent test for misassembly, and broken if it fails. The same level of stringency of misassembly testing could not be applied to the entire assembly because, at low coverage, there would be too many false positives.

Placing reads on a reference genome

Reads are separately aligned to the reference sequence for each related species. These alignments are local: a read is not required to align from end to end. This allows for reads to be placed in spite of evolutionary events, such as insertion of transposable elements, which are large relative to the read length. Reads may be placed multiply. Thus, if a region in the sample species' genome has been duplicated in the reference species, we can still use the related species to improve the assembly of the region.

Grouping reads (building proto-contigs)

For each read placement, we infer the read's start and stop points on the related genome, even if the placement does not extend from end to end. We then group read placements by continuity: we put reads together so long as their inferred start/stop intervals on the related genome overlap by at least 1 base. This overlap threshold is somewhat arbitrary: for purposes of grouping it could be increased or even made negative without conceptually altering the method.
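The grouping step can be pictured as a sweep over placements sorted by start position. The following is a minimal sketch under our own assumptions, not the ARACHNE implementation: `Placement`, `group_placements` and `min_overlap` are hypothetical names, coordinates are half-open intervals, and all placements are assumed to lie on a single reference sequence.

```python
from typing import List, NamedTuple


class Placement(NamedTuple):
    read_id: str
    start: int  # inferred start on the related genome (inclusive)
    stop: int   # inferred stop on the related genome (exclusive)


def group_placements(placements: List[Placement],
                     min_overlap: int = 1) -> List[List[Placement]]:
    """Chain placements whose inferred intervals overlap by >= min_overlap bases."""
    groups: List[List[Placement]] = []
    current: List[Placement] = []
    current_stop = 0
    for p in sorted(placements, key=lambda q: (q.start, q.stop)):
        # Overlap between this placement and the furthest-reaching read so far.
        if current and min(current_stop, p.stop) - p.start >= min_overlap:
            current.append(p)
            current_stop = max(current_stop, p.stop)
        else:
            if current:
                groups.append(current)
            current = [p]
            current_stop = p.stop
    if current:
        groups.append(current)
    return groups
```

Raising `min_overlap`, or setting it to a negative value to tolerate small gaps, changes how aggressively reads are chained without changing the structure of the sweep.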

Enlarging contigs

The reads in the groups are now used to enlarge the preexisting de novo assembly contigs (Figure 1a) and, in some cases, to start new contigs. To do this, we attempt to assign each group to a contig, by first finding all contigs that the group shares reads with. If there is one contig, we assign the group to that contig. If there are two contigs, as would happen if the group bridged a gap between them, we assign the group to the contig that it shares the most reads with. If there are more than two contigs, we do not assign the group. If there are no contigs, we extract one read from the group, call it a new contig, and assign the group to this new contig. Supposing that the group is assigned to a contig, we then take all the reads from the group that are not already in the contig, and align the reads one by one to the contig. If there is an end-to-end alignment between the read and the contig of at least a minimum length (24 nucleotides), the read is placed in the contig and the contig is modified if appropriate (for example adding bases on one end).
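The assignment rule just described can be written as a small decision function. This is an illustrative sketch only: `assign_group` and the `read_to_contig` mapping are hypothetical, ties between two contigs are broken arbitrarily, and the subsequent end-to-end alignment test (minimum length 24 nucleotides) is left to the caller.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def assign_group(group_reads: List[str],
                 read_to_contig: Dict[str, str]) -> Optional[Tuple[str, List[str]]]:
    """Return (contig, reads still to be aligned to it), or None if unassigned."""
    # Count how many reads the group shares with each existing contig.
    shared = Counter(read_to_contig[r] for r in group_reads if r in read_to_contig)
    if len(shared) > 2:
        return None  # more than two contigs: the group is not assigned
    if not shared:
        # No contig shares a read: seed a new contig from one read of the group.
        seed = group_reads[0]
        return f"new_contig_from_{seed}", [r for r in group_reads if r != seed]
    # One or two contigs: choose the one sharing the most reads with the group.
    contig, _ = shared.most_common(1)[0]
    candidates = [r for r in group_reads if read_to_contig.get(r) != contig]
    return contig, candidates
```

Each candidate read returned here would then be aligned to the chosen contig and placed only if the alignment passes the minimum-length requirement.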

Joining scaffolds

In a de novo assembly, single read pair links cannot be used to join scaffolds, because even with a low rate of chimerism (for example, 1%) in libraries, there would still be too many incorrect joins. Given an assisting genome, however, we can define a single link as 'trusted' if it has a valid and unique alignment to the reference genome, and then use such single trusted links to join scaffolds. Allowing trusted links to join scaffolds would work - but inefficiently - because in practice only a fraction of the links are actually trusted. Instead, we first use the trusted links to place and orient the de novo scaffolds onto the reference genome, and then we join nearby scaffolds, provided that there is a single logical link (not necessarily trusted on its own) that goes from one scaffold to the other consistently with the placement of the scaffolds on the reference (Figure 1b).

Correcting misassemblies

Consider a scaffold for which part aligns to one place on the reference genome and an adjacent part aligns to another place. This could be due to an evolutionary rearrangement or to misassembly. To allow for both possibilities, we first define a window around the juncture in the scaffold, then apply a consistency check algorithm (see Materials and methods for details) localized to the window itself (Figure 1c). If this check fails, we break the scaffold. The idea is that we do not want to run the consistency check algorithm on the whole assembly, since the regions at low coverage would yield a very large number of false positives.

Smoothing the assembly

Once the operations just described - that use the reference genome - have been run, a series of de novo assembly operations can be carried out, without using the reference genome. These operations move reads to better homes within the assembly, join contigs when possible, break contigs where needed, and so forth.

Results

Validation of the assisted assembly algorithm

We tested the performance and accuracy of our assisted assembly algorithms against the 7.6× high quality draft assembly of C. familiaris [3]. To do that, we first randomly selected whole plates from the original data set up to twofold coverage on high-quality bases (Q20, per-base error rate = 1%). With this 2× data set we performed a de novo assembly followed by an assisted assembly against the human genome (build 36), which has an average divergence from dog of 0.35 substitutions per site. The assisted assembly had a 7% net increase in reads assembled, an 8% improvement of total contig length, and an almost threefold improvement of scaffold length (Table 1).

Table 1

Comparison between initial, assisted, and theoretical 2× canine assemblies

	Canis familiaris - 2× assembly

	Initial draft	Assisted	Theoretical
Bases assembled (%)	81.0	86.5	94.1
Total contig length (Mb)	1,697	1,823	1,969
N50 contig (kb)	2.5	2.8	3.3
N50 scaffold gapped (kb)	18.6	53.1	4,039.7
N50 scaffold ungapped (kb)	10.3	36.8	3,519.1

In parallel, we generated a 'theoretical 2× assembly' by taking as input the high quality draft assembly and removing all the reads that were not present in the randomly selected set used to generate the canine 2× assembly. This represents a theoretical upper limit assembly - that is, the ideal best possible assembly for the 2× data set. Comparison of the real and theoretical 2× assemblies shows that the assisted assembly greatly improves the initial de novo assembly in terms of genomic content: total contig length in the initial assembly is 1.70 Gb, which improves to 1.82 Gb after assist, versus 1.97 Gb of total contig length in the theoretical assembly. Assisted assembly also dramatically improves the N50 (length-weighted median) scaffold length (from 18.6 kb to 53.1 kb), but does not reach the theoretical limit (4.0 Mb). The large discrepancy between assisted and theoretical scaffold length is largely due to the fact that 'holes' in the assembly - that is, regions that were not recovered by the assisting algorithm - greatly increased fragmentation at the scaffolding level.
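For readers unfamiliar with the N50 statistic, the following sketch shows one common way to compute it as a length-weighted median: the largest length L such that contigs (or scaffolds) of length at least L account for at least half of the total. The function name is ours, and the exact tie-handling convention used by ARACHNE is not specified here.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length-weighted median of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First length at which the cumulative sum reaches half of the total.
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one length")
```

Because the statistic is weighted by length, a few long scaffolds dominate it, which is why joining scaffolds moves the scaffold N50 far more than the contig N50.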

We then devised the following statistical validation test to determine the quality of any given assembly against a finished or high quality draft assembly. We randomly selected a large number of high quality oriented k-mers from the 2× assembly (in practice, we used k = 24), and then we ascertained the frequency at which k-mers at distance d from each other in the 2× assembly (for various values of d) appeared to be misassembled with respect to the high quality draft (Figure 2, Table 2).

Table 2

Accuracy of initial and assisted assemblies, estimated using the assembly proximity test*

	1 kb	2 kb	6 kb	10 kb	20 kb	60 kb	100 kb
Initial draft	97.9%	97.5%	97.4%	97.1%	96.2%	95.3%	94.4%
Assisted	98.2%	98.1%	98.1%	98.0%	98.0%	97.9%	97.9%

*Random paired k-mers were selected from the 2× canine assemblies and then matched against the high quality draft assembly. The table shows the success rate for various values of d (the distance between the pairs).

Figure 2

Validation test. From the target assembly, we randomly select a pair of high-quality k-mers at distance d from each other. The pair is declared valid if the two k-mers are both present in the reference genome, with the same orientation and a separation d', approximately equal to d. This operation is repeated for many pairs. We report the fraction of such pairs that are valid.

We applied the validation test to the de novo and the assisted assemblies of C. familiaris (we could not apply the test to the other assemblies, since it requires a finished or high quality draft assembly to use as the 'truth'). We found that the assembly after assist is the more accurate of the two, notwithstanding the fact that scaffolds are much longer in the assisted version. For example, the fraction of pairs of k-mers 100 kb apart that were confirmed by the high quality assembly was 94.4% in the initial 2× draft and 97.9% in the 2× assisted assembly.

2× mammalian assemblies

A major application for the assisted assembly algorithm is the 2× mammalian genomes sequenced for annotation of the human genome [5,9]. To date, 21 2× assemblies have been generated using these algorithms, with human and dog as references. One of these, the assembly of the cat genome, has also been mapped to the chromosomes using an existing radiation hybrid map [7].

These reference genomes were selected based on their high genome quality, their positions in two different groups of the eutherian tree, and their relatively low divergence from the common ancestor of mammals. The mouse genome, although more complete than the dog, was not used as a reference genome because of its high divergence rate.

The assist process had a clear effect on all the original 2× mammalian assemblies (see Materials and methods): read usage and total contig length improved, on average, about 10%; N50 contig length increased, on average, from 2.8 kb to 3.0 kb; and scaffold N50 size increased by up to a factor of 5. Table 3 shows data from four examples that were assembled with the exact same version of the code. As expected, the impact of the assisting procedure is larger when the branch length between the assisted genome and the reference genome is shorter: after assistance, for example, the N50 scaffold length for bushbaby, Otolemur garnetti, was approximately 72 kb, almost twice the N50 scaffold length of the elephant, Loxodonta africana (Table 3).

Table 3

Assembly statistics for initial drafts and assisted assemblies for a selection of 2× mammal assemblies

	Four projects from Mammal24 - 2× assemblies

	Otolemur garnetti (bushbaby)		Loxodonta africana (African elephant)		Oryctolagus cuniculus (rabbit)		Cavia porcellus (guinea pig)

	Initial	Assisted*	Initial	Assisted*	Initial	Assisted*	Initial	Assisted*
Bases assembled (%)	76.1	85.7	77.5	84.2	80.1	85.3	75.6	82.4
Total contig length (Mb)	1,672	1,905	2,089	2,314	1,925	2,080	1,658	1,853
N50 contig (kb)	2.6	2.9	2.7	2.7	2.7	2.9	2.5	2.6
N50 scaffold gapped (kb)	13.6	71.6	11.8	37.0	13.3	53.9	11.0	44.5
N50 scaffold ungapped (kb)	9.1	37.6	8.4	15.9	9.5	20.1	7.6	12.2

*All assemblies were assisted against two references, Homo sapiens and C. familiaris.

Assisting high coverage data sets with cloning bias

In theory, the assisted assembly should work equally well to rescue genomes with severe cloning bias resulting in low coverage sequence in certain portions of the genome. We therefore applied the same algorithms to the malaria strain P. falciparum HB3. It was sequenced to 8× [8], but the resulting assembly had surprisingly low connectivity and shorter-than-expected total contig length. In fact, cloning bias reduced the coverage to 2× or less for about 20% of the 24 Mb genome, which is considerably more than the roughly 1.4% expected under a Poisson model for an average 8× assembly.

The reference strain P. falciparum 3D7 was used as a reference [10]. This is of nearly finished quality, and is 0.12 substitutions per site diverged from the HB3 strain [8]. The assisting procedure recovered almost 4 Mb of low coverage regions (17% of the genome), while the N50 scaffold length increased by about a factor of three (Table 4).

Table 4

Assembly statistics for initial drafts and assisted assemblies for the 8× assembly of P. falciparum HB3, which has severe cloning bias

	P. falciparum HB3 - 8× assembly

	Initial draft	Assisted
Bases assembled (%)	85.6	93.4
Total contig length (Mb)	19.8	23.5
N50 contig (kb)	13.7	15.4
N50 scaffold gapped (kb)	17.0	48.8
N50 scaffold ungapped (kb)	16.8	47.5

Discussion

We show that the assisted assembly process significantly improves contiguity and quality of low coverage mammalian assemblies and that it can be successfully applied to genomes with locally low coverage caused by cloning bias, such as P. falciparum HB3 [8]. While some previous work has described the use of information such as optical maps or draft assemblies of the same species to inform the assembly process [11-13], we believe that the algorithms described here stand out, as they carefully use the conserved synteny information of reads aligned to a reference genome to leverage information already existing within the target genome sequence data.

The choice of reference genome(s) is critical when performing assisted assembly. Clearly, using a closely related genome to improve an initial draft assembly will have a bigger impact on the final draft assembly, and the accuracy and completeness of a reference genome also contribute. In the assemblies we generated, the number of validated pairs aligning uniquely to the reference varied from 18.5% of the alignments of the guinea pig against the human reference, to 74.3% of the alignments of strain HB3 of Plasmodium against the reference strain 3D7 (Table 5).

Table 5

Statistics of the alignments of reads onto the reference genomes

	Assisted on	Reads aligning target uniquely	Valid pairs aligning target uniquely
Plasmodium falciparum HB3	Plasmodium falciparum 3D7	79.1%	74.3%
Canis familiaris - 2× assembly	Homo sapiens	64.1%	35.1%
Loxodonta africana	Homo sapiens	51.1%	22.7%
Oryctolagus cuniculus	Homo sapiens	55.3%	25.2%
Otolemur garnetti	Homo sapiens	68.8%	38.0%
Cavia porcellus	Homo sapiens	47.8%	18.5%
Loxodonta africana	Canis familiaris	49.3%	28.8%
Oryctolagus cuniculus	Canis familiaris	48.8%	29.8%
Otolemur garnetti	Canis familiaris	59.6%	43.9%
Cavia porcellus	Canis familiaris	41.6%	22.4%

The projects from the Mammal24 set were assisted against both human and canine references.

However, the most critical factor is the ability to uniquely align target reads to the reference genome. The BLASTZ algorithm [6] aligns reads reliably when the genomes are up to approximately 0.45 substitutions per site apart, as was determined as a prerequisite for the project to annotate the human genome using 24 low coverage mammals [5].

Many of the parameters that affect the accuracy of the read to reference genome alignments are generally less favorable for new sequencing technologies, where short reads with a higher error rate are more common. This means that, with new short-read sequencing technologies, the current methodology can only be used on very closely related species.

Materials and methods

Code and assemblies

We used ARACHNE [14,15] to generate initial draft assemblies, and all the assisted assembly tools were developed within the framework provided by ARACHNE. The code is available for download from [16], as well as the assemblies generated for this paper, together with the set of 'lab notes' used to generate the assemblies. All the assemblies reported in Table 3 were generated with the same frozen code. The original set of 21 projects in Mammal24 is publicly available from [17].

Placing reads on a reference genome

We used the aligner BLASTZ [6] with default arguments to align the 2× mammalian assemblies against both human and canine references. At the end of the process we filtered the alignments from BLASTZ by discarding those with an alignment score lower than a given threshold (3,000), hence allowing for a read to be multiply placed.

We used the aligner QueryLookupTable with parameters MF = 5000 SH = True MC = 0.15 to align the WGS reads of P. falciparum HB3 against the strain 3D7. The aligner is part of the standard distribution of the ARACHNE code and is distributed together with the assisting code.

Enlarging contigs

The procedure of enlarging contigs consists of allowing groups of reads that appear to overlap based on their position on the reference to extend existing de novo contigs. This is realized in practice as an assisted improvement of the layout code: reads that are adjacent to each other in their group on the reference are tested for read-read alignment, and if a read-read alignment exists, this is used to seed the positioning of the new read onto the existing layout (hence extending the layout of the contig). After assisted layout, the de novo consensus module is called with standard arguments.

Joining scaffolds

Scaffolds are anchored to the reference genome by using the set of pairs that align uniquely and validly onto the reference genome. A pair aligning uniquely onto the reference genome is called a 'validated pair' if the absolute value of its stretch (defined as the difference between observed separation and given separation divided by the given standard deviation) does not exceed 5. The end reads of validated pairs are called 'validated reads'.
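The stretch criterion can be stated compactly in code. This is a sketch for illustration, not the ARACHNE implementation; the function and parameter names are ours, and uniqueness of the alignment is passed in as a flag because how it is determined depends on the aligner.

```python
def stretch(observed_sep: float, given_sep: float, given_stdev: float) -> float:
    """(observed separation - given separation) / given standard deviation."""
    return (observed_sep - given_sep) / given_stdev


def is_validated_pair(observed_sep: float, given_sep: float, given_stdev: float,
                      aligns_uniquely: bool = True,
                      max_stretch: float = 5.0) -> bool:
    """A uniquely aligning pair is validated if |stretch| does not exceed 5."""
    if not aligns_uniquely:
        return False
    return abs(stretch(observed_sep, given_sep, given_stdev)) <= max_stretch
```

For instance, a pair from a library with a given separation of 4,000 bases and a standard deviation of 100 would be validated at an observed separation of 4,400 (stretch 4) but not at 4,600 (stretch 6).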

For a given scaffold, we look at all the validated reads: each of these reads implicitly maps and orients the scaffold on the reference genome. We then sort the validated reads by their start on the scaffold. Two adjacent validated reads are defined to be 'consistent' if they map and orient the scaffold on the same reference sequence, and if the absolute value of the compression rate c (that is, the ratio between the distance of the two reads on the scaffold and on the reference genome) is such that 1/3 < c < 3.

A scaffold is anchored to the reference genome if there are at least two validated reads in the scaffold, and if all the pairs of consecutive validated reads in the scaffold are consistent. In practice, we found that most scaffolds contain at least a few validated reads, even when only a fraction of the reads was actually validated.
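The anchoring rule can be sketched as follows. The `Anchor` record and both function names are hypothetical, and the sketch makes two simplifying assumptions of its own: each validated read is reduced to a single position on the scaffold and on the reference, and two reads at the same reference position are treated as inconsistent.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    scaffold_pos: int          # start of the validated read on the scaffold
    ref_name: str              # reference sequence the read aligns to
    ref_pos: int               # start of the read on the reference
    scaffold_is_forward: bool  # orientation the read implies for the scaffold


def consistent(a: Anchor, b: Anchor) -> bool:
    """Same reference sequence and orientation, and 1/3 < |c| < 3."""
    if a.ref_name != b.ref_name or a.scaffold_is_forward != b.scaffold_is_forward:
        return False
    ref_dist = b.ref_pos - a.ref_pos
    if ref_dist == 0:
        return False
    compression = abs((b.scaffold_pos - a.scaffold_pos) / ref_dist)
    return 1 / 3 < compression < 3


def is_anchored(anchors: List[Anchor]) -> bool:
    """At least two validated reads, and every consecutive pair consistent."""
    ordered = sorted(anchors, key=lambda x: x.scaffold_pos)
    if len(ordered) < 2:
        return False
    return all(consistent(a, b) for a, b in zip(ordered, ordered[1:]))
```

Counting the inconsistent consecutive pairs in `ordered`, instead of requiring that there be none, would give the starting point for locating a candidate misassembly window.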

Correcting misassemblies

We now focus on scaffolds for which the following happens: the scaffold contains several validated reads (which are sorted by their start on the scaffold), and the validated reads are divided in two 'clean' sets - that is, there is one, and only one, non-consistent pair of consecutive validated reads, say r1 and r2. We then define a window of possible misassembly as the interval [a, b), where a is the start on the scaffold of r1, and b the end on the scaffold of the read r2.

We then apply the following consistency check to the window of possible misassembly: if there exists a point in the window with read coverage < 3 and no insert coverage, then the contig is broken at the juncture, and eventually the scaffold is broken in its connected components. In other words, the contig is broken if at any point the window is 'held together' by a single read-read overlap.
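A minimal sketch of this localized check is given below, assuming per-base read and insert coverage have already been computed for the scaffold. The function name, the list-based coverage representation and the `min_reads` parameter are our assumptions for illustration.

```python
from typing import Sequence, Tuple


def should_break(read_coverage: Sequence[int],
                 insert_coverage: Sequence[int],
                 window: Tuple[int, int],
                 min_reads: int = 3) -> bool:
    """True if some point in [a, b) has read coverage < 3 and no insert coverage."""
    a, b = window
    return any(
        read_coverage[i] < min_reads and insert_coverage[i] == 0
        for i in range(a, b)
    )
```

Restricting the scan to the window is what keeps the stringent threshold usable: applied to a whole 2× assembly, the same test would flag a large share of correctly assembled positions.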

Validation: assembly proximity test

This section defines what a 'valid' pair of k-mers is, for the proximity validation test. We start by fixing a target assembly (for example, one of the 2× dog assemblies) together with a reference finished grade assembly of the same species (for instance, the full coverage draft assembly of dog).

We then randomly select from the target assembly a high quality pair of oriented k-mers at distance d from each other. This is defined as a pair of k-mers, such that: all the bases in the two k-mers have quality 50; and the separation between the two k-mers is d. Next, we define the standard deviation of such a pair. If the two k-mers belong to the same contig, then this is defined as the maximum between k and d/100. Otherwise, the square of the standard deviation of the pair is defined as the sum of the squares of the standard deviations of the gaps between the two contigs containing the k-mers.

We now look for the pair in the reference assembly. The pair is 'valid' if we can find at least one instance of the pair on the reference assembly, such that: the relative orientation of the two k-mers in the pair is the same as in the target assembly; and the stretch of the pair does not exceed three, where stretch is defined as (d' - d)/stdev, where d' is the distance between the k-mers on the reference, and stdev the standard deviation of the pair defined above.
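The definitions in this section can be summarized in a short sketch. The names below are ours, and one interpretation is assumed rather than stated in the text: the stretch is compared in absolute value, so that pairs that are too close on the reference are rejected as well as pairs that are too far apart.

```python
import math
from typing import Iterable, Optional, Sequence, Tuple


def pair_stdev(k: int, d: float,
               gap_stdevs: Optional[Sequence[float]] = None) -> float:
    """Standard deviation assigned to a pair of k-mers at distance d."""
    if not gap_stdevs:
        # Both k-mers lie in the same contig.
        return max(k, d / 100)
    # Different contigs: combine the standard deviations of the gaps between them.
    return math.sqrt(sum(s * s for s in gap_stdevs))


def is_valid_pair(d: float, reference_hits: Iterable[Tuple[float, bool]],
                  stdev: float, max_stretch: float = 3.0) -> bool:
    """reference_hits holds (d_ref, same_relative_orientation) for each instance."""
    return any(
        same_orientation and abs(d_ref - d) / stdev <= max_stretch
        for d_ref, same_orientation in reference_hits
    )
```

The reported accuracy for a given d is then the fraction of sampled pairs for which `is_valid_pair` returns true.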

Abbreviations

N50: length-weighted median; WGS: whole-genome shotgun.

Authors' contributions

ESL and KLT proposed the assisted assembly concept. SG carried out the research and wrote the code. DBJ proposed the validation methodology. SG, DBJ and KLT wrote the paper. All authors read and approved the final manuscript.

Acknowledgements

We thank the Sequencing platform of the Broad Institute at Harvard and MIT and the Whole Genome Assembly Team. We thank Leslie Gaffney for help with figures. This work was supported in part by NHGRI. DJ has support for 'Whole-genome shotgun sequencing strategy and assembly' and KLT has a EURYI from ESF.

References

Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988;ii:231–239. doi: 10.1016/0888-7543(88)90007-ix. [PubMed] [CrossRef] [Google Scholar]
Mouse Genome Sequencing Consortium. Waterston RH, Lindblad-Toh K, Birney East, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Brook Southward, Berry E, Birren B, Bloom T, Bork P, Botcherby G, Bray Due north, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, et al. Initial sequencing and analysis of the mouse genome. Nature. 2002;420:520–562. doi: 10.1038/nature01262. [PubMed] [CrossRef] [Google Scholar]
Lindblad-Toh G, Wade CM, Mikkelsen TS, Karlsson EK, Jaffe DB, Kamal M, Clamp M, Chang JL, Kulbokas EJ, tertiary, Zody MC, Mauceli E, Xie X, Breen M, Wayne RK, Ostrander EA, Ponting CP, Galibert F, Smith DR, DeJong PJ, Kirkness E, Alvarez P, Biagi T, Brockman West, Butler J, Chin CW, Cook A, Cuff J, Daly MJ, DeCaprio D, Gnerre South, et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature. 2005;438:803–819. doi: 10.1038/nature04338. [PubMed] [CrossRef] [Google Scholar]
Mikkelsen TS, Wakefield MJ, Aken B, Amemiya CT, Chang JL, Knuckles S, Garber 1000, Gentles AJ, Goodstadt Fifty, Heger A, Jurka J, Kamal M, Mauceli E, Searle SM, Sharpe T, Baker ML, Batzer MA, Benos PV, Belov 1000, Clamp Grand, Melt A, Cuff J, Das R, Davidow 50, Deakin JE, Fazzari MJ, Glass JL, Grabherr M, Greally JM, Gu Due west, et al. Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature. 2007;447:167–177. doi: ten.1038/nature05805. [PubMed] [CrossRef] [Google Scholar]
Margulies EH, NISC Comparative Sequencing Program. Maduro VV, Thomas PJ, Tomkins JP, Amemiya CT, Luo M, Green ED. Comparative sequencing provides insights about the structure and conservation of marsupial and monotreme genomes. Proc Natl Acad Sci USA. 2005;102:3354–3359. doi: 10.1073/pnas.0408539102. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
Schwartz S, Kent W, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003;13:103–107. doi: 10.1101/gr.809403. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
Pontius JU, Mullikin JC, Smith DR, Agencourt Sequencing Team. Lindblad-Toh K, Gnerre S, Clamp M, Chang J, Stephens R, Neelam B, Volfovsky N, Schäffer AA, Agarwala R, Narfström K, Murphy WJ, Giger U, Roca AL, Antunes A, Menotti-Raymond M, Yuhki N, Pecon-Slattery J, Johnson WE, Bourque G, Tesler G, NISC Comparative Sequencing Program. O'Brien SJ. Initial sequence and comparative analysis of the cat genome. Genome Res. 2007;17:1675–1689. doi: 10.1101/gr.6380007. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA, Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, Mboup S, Duraisingh MT, Lukens A, Derr A, Stange-Thomann N, Waggoner S, Onofrio R, Ziaugra L, Mauceli E, Gnerre S, Jaffe DB, Zainoun J, Wiegand RC, Birren BW, Hartl DL, Galagan JE, Lander ES, Wirth DF. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 2007;39:113–119. doi: 10.1038/ng1930. [PubMed] [CrossRef] [Google Scholar]
Broad Institute: Assisted Assembly ftp Site ftp://ftp.broadinstitute.org/pub/papers/comprd/assisted_assembly
Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DM, et al. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002;419:498–511. doi: 10.1038/nature01097. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
Nagarajan N, Read TD, Pop M. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics. 2008;24:1229–1235. doi: 10.1093/bioinformatics/btn102. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
Soderlund C, Longden I, Mott R. FPC: a system for building contigs from restriction fingerprinted clones. Comput Appl Biosci. 1997;13:523–535. [PubMed] [Google Scholar]
Sundquist A, Ronaghi M, Tang H, Pevzner P, Batzoglou S. Whole-genome sequencing and assembly with high throughput, short-read technologies. PLoS ONE. 2007;2:e484. doi: 10.1371/journal.pone.0000484. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES. ARACHNE: a whole-genome shotgun assembler. Genome Res. 2002;12:177–189. doi: 10.1101/gr.208902. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
Jaffe DB, Butler J, Gnerre S, Mauceli E, Lindblad-Toh K, Mesirov JP, Zody MC, Lander ES. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 2003;13:91–96. doi: 10.1101/gr.828403. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
Broad Institute: Computational Research and Development http://www.broadinstitute.org/science/programs/genome-biology/crd
Mammalian Genome Project: Data Release Summary http://www.broadinstitute.org/science/projects/mammals-models/data-release-summary

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2745769/