An introduction to get started in genome assembly and annotation
Contributors
Author(s) |
![]() |
Editor(s) |
![]() ![]() ![]() ![]() |
Questions
What are the guidelines before starting a Genome Assembly and Annotation project?
Steps before starting a genome project
.left[
-
Step 1: Build a wide community for the project if possible
-
Step 2: Gather information about the target genome
-
Step 3: Select the best possible DNA source and extraction
-
Step 5: Choose an appropriate sequencing technology
-
Step 6: Check the computational resources and requirements ]
Build a wide community for the project (if it’s possible)
.left[ The aim of a genome project is to sequence the entire target genome for a wide range of genomics applications. ]
.left[ Analyses, reanalyses and integration of genomic and other phenotype information require: ]
- Facilities: Wet lab, sequencing, bioinformatics,…
- Personnel: Skill intensive
- Software: Knowledge intensive
.left[ warning Data storage, maintenance, transfer, and analysis costs will also likely remain substantial and represent an increasing proportion of overall sequencing costs in the future. ]
Genome information: Genome size
.pull-left[ How to collect informations?
- Flux cytometry
- Databases:
- Fungi: http://www.zbi.ee/fungalgenomesize
- Animals: http://www.genomesize.com
- Plants: http://data.kew.org/cvalues
- Bibliography ]
.pull-right[
.image-100[
]]
.footnote[https://commons.wikimedia.org/w/index.php?curid=19537795]
Genome information: GC content
.pull-left[ Why?
.left[ Sequencing is not random! GC and AT rich regions are under-represented. ]
How to solve?
- Chemistry quirks
- Increase the sequencing depth
- Technologies combination (both long and short reads) ]
.pull-right[
.image-100[
]]
.footnote[Chaisson et al. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015).]
Genome information: Ploidy level
.pull-right[
.image-55[
]]
.pull-left[
Ploidy (N):
Number of sets of chromosomes in a cell
Organism | Ploidy |
---|---|
Bacteria | 1N |
Human, mouse, rat | 2N |
Amphibians (Xenopus) | 2N to 12N |
Plants (wheat) | 6N |
Autopolyploid | . |
Hybrids | . |
]
Higher ploidy -> harder to assemble => Increase of sequencing depth
.footnote[Daniel Hartl. Essential Genetics: A Genomics Perspective. Jones & Bartlett Learning. p. 177. ISBN 978-0-7637-7364-9. (2011).]
Genome information: Heterozygosity level
.pull-left[ .left[ Heterozygous: Locus-specific, diploid (2N) organism has two different alleles of a particular gene at the same locus ]]
.pull-right[
.image-100[
]]
.left[ Heterozygosity is a metric used to denote the probability an individual will be heterozygous for a particular allele ]
Higher heterozygosity -> harder to assemble => Increase of sequencing depth
.footnote[https://www.genome.gov/genetics-glossary/heterozygous]
Genome information: Heterozygosity level
.image-125[
]
.footnote[Heng Li’s blog: lh3.github.io/2021/04/17/concepts-in-phased-assemblies]
Genome information: Complexity aka repeats elements
.left[ It is impossible to resolve repeats of length L unless you have reads longer than L ]
Most common source of assembly errors:
.pull-left[
.image-65[
]]
.pull-right[
.image-65[
]]
Genome information: Others
- Karyotype: chromosome number
- Sex chromosome system: None, XY, ZW, UV,…
- Purity: possible presence of contaminants and/or symbionts?
- Is there any other useful data (NCBI, SRA, ENA, etc) that could improve my assembly?
Genome information: Tips
.pull-left[
- Flux cytometry : genome size and ploidy level
- k-mer frequency from Illumina reads : genome size, ploidy level, GC content, heterozygosity and even repeats composition! ]
.pull-right[
.image-20[
]]
The best possible DNA
.left[ Select the best possible DNA source and extraction. The extraction of high-quality DNA is the most important aspect of a successful genome project
The lack of a good starting material will limit the choice of sequencing technology and will affect the quality of obtained data ]
The best possible DNA: Chemical purity of DNA
.left[ Sample-related contaminants:
- Polysaccharides
- Proteoglycans
- Proteins
- Secondary metabolites
- Polyphenols
- Humic acids
- Pigments
- Etc,…
All these contaminants can impair the efficacy of library preparation in any technology, especially true for PCR-free libraries (both PacBio and ONT) ]
The best possible DNA: Quantity of DNA
.left[ Different technologies require different amount of DNA:
- Illumina and 10x > 3 ng
- BioNano > 200 ng
- ONT > 1 μg
- Dovetail (Hic) > 5 μg
- PacBio > 15 μg ]
The best possible DNA: Structural integrity of DNA
.left[ High Molecular Weight (HMW) for Nanopore/PacBio (obtained mainly from fresh material) ]
The best possible DNA: Tips
.left[
- Many DNA extraction protocol available for a lot of species/taxa (VGP, Darwin Tree of Life, Nanopore, PacBio, etc)
- Keep DNA samples from the same individual in case of library preparation/sequencing fail, need more coverage, new sequencing technology, etc
- Use a single individual and sequence a haploid, highly inbred diploid organism, or isogenic individual ]
Appropriate sequencing technology
.left[ Mostly dependant on the quantity and quality of DNA and the cost but many parameters need to be considered prior to running an NGS experiment:
- Short versus long reads or both
- Read length
- Read quality/error rate
- Genome read coverage/depth
- Library preparation
- Downstream applications ]
Appropriate sequencing technology: Primary assembly
.left[
- Illumina: short reads (up to 2x250bp) with high quality reads. Sequencing bias with AT/GC rich regions
- IonTorrent: short reads (up to 500bp) with medium quality reads
- Nanopore: long reads (average ~15kbp) with low quality reads. Errors are not randomly distributed!
- PacBio:
- CLR: long reads (average ~20kbp) with low quality reads
- HiFi: long reads (average ~15kbp) with high quality reads ]
Appropriate sequencing technology: Scaffolding
.left[
- HiC: single or double restriction enzyme fragmentation. Need huge amount of coverage.
- Omni-C: sequence-independent endonuclease ]
Appropriate sequencing technology: Short vs long reads
.pull-left[ Depending on the sequencing technology reads length and sequencing depth are different.
- More sequencing depth but shorter reads
- Longer reads but less sequencing depth ]
.pull-rigth[
.image-40[
]]
.footnote[https://github.com/alexcorm/developments-in-next-generation-sequencing]
Appropriate sequencing technology: Short vs long reads
.pull-left[ Reads accuracy differs depending on the sequencing technology:
- Illumina and PacBio HiFi: more accurate
- ONT and PacBio CLR: less accurate (but longer) ]
.pull-rigth[
.image-40[
]]
Appropriate sequencing technology
.image-100[
]
.footnote[Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat Rev Genetics 20, 631–656 (2019).]
Appropriate sequencing technology: Coverage versus depth
.left[ Coverage: fraction of the genome sequenced at least by one read
Depth: average number of reads that cover any given region
Intuitively, more reads should increase coverage and depth. ]
.image-40[
]
.footnote[Chaisson et al. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015).]
Computational resources and requirements
.left[ To succeed you need to have sufficient compute resources (CPUS, RAM, walltime and storage).
- The resource are different between:
- Assembly
- Annotation
- Other analysis tools
- For genome assembly:
- Running times and RAM increase with the type and amount of data
- More data for large genomes, increase running time/RAM/Storage
- Most of tools running on a single node: parallelized but not distributed
- For genome annotation:
- Mapping/alignment of external data (RNA-seq, proteins) can be parallelized and distributed
- Annotation process can be parallelized and distributed ]
Typical sequencing strategies: Bacterial genomes
.left[
- PacBio CLR or Oxford Nanopore reads at 40-50x coverage, self-correction and/or hybrid correction (using Illumina data)
- PacBio HiFi reads at 20-30x coverage
- Illumina 2x250bp paired-end reads from MiSeq ]
Typical sequencing strategies: Larger genomes
.left[
- PacBio CLR or Oxford Nanopore reads at 100x coverage, hybrid correction using Illumina data and scaffolding using HiC or Omni-C
- PacBio HiFi reads at 30x coverage and scaffolding using HiC or Omni-C
- PacBio HiFi reads at 30x coverage, 120x Oxford Nanopore ultra long reads ]
Thank you!
This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!
Funding
These individuals or organisations provided funding support for the development of this resource
This project (2020-1-NL01-KA203-064717
) is funded with the support of the Erasmus+ programme of the European Union. Their funding has supported a large number of tutorials within the GTN across a wide array of topics.