Welcome to BecA bioinformatics page

Introduction to NGS

What is Next-Generation Sequencing (NGS)?

Next-generation sequencing (NGS), also known as high-throughput sequencing, is the catch-all term used to describe a number of different modern sequencing technologies including:
  • Illumina (Solexa) sequencing
  • Roche 454 sequencing
  • Ion Torrent sequencing (Proton / PGM)
  • SOLiD sequencing
These recent technologies have revolutionised the study of genomics and molecular biology.

Why NGS?

The four main advantages of NGS over classical Sanger sequencing are:
  1. speed
  2. cost
  3. sample size
  4. accuracy
NGS is significantly cheaper and quicker than Sanger sequencing and needs significantly less DNA. Its final sequences are also more accurate and reliable, because each region is covered by many reads. Sanger sequencing needs a large amount of template DNA for each read: a strand that terminates at each base is needed to construct the full sequence, so many copies of the template are needed for every base being sequenced (for a 100 bp sequence you would need many hundreds of copies, and for a 1000 bp sequence many thousands). In NGS, a sequence can be obtained from a single strand. In both kinds of sequencing, multiple staggered copies are taken for contig construction and sequence validation.
NGS is quicker than Sanger sequencing in two ways.
  1. Firstly, the chemical reaction may be combined with the signal detection in some versions of NGS, whereas in Sanger sequencing these are two separate processes.
  2. Secondly, and more significantly, only one read (maximum ~1 kb) can be taken at a time in Sanger sequencing, whereas NGS is massively parallel, allowing 300 Gb of DNA to be read in a single run on a single chip.
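To put the second point in perspective, the short Python sketch below works out how many maximum-length Sanger reads would be needed to match the output of one NGS run. It uses only the two figures quoted above; the function and constant names are illustrative, not part of any tool.

```python
# Figures quoted in the text above.
SANGER_MAX_READ_BP = 1_000        # one Sanger read, maximum ~1 kb
NGS_RUN_BP = 300 * 10**9          # 300 Gb from a single NGS run


def sanger_reads_needed(total_bp: int, read_bp: int = SANGER_MAX_READ_BP) -> int:
    """Number of reads of length read_bp needed to total total_bp bases (rounded up)."""
    return -(-total_bp // read_bp)  # ceiling division on integers


if __name__ == "__main__":
    reads = sanger_reads_needed(NGS_RUN_BP)
    print(f"{reads:,} maximum-length Sanger reads match one 300 Gb run")
```

Under these figures a single run corresponds to 300 million one-at-a-time Sanger reads.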
The reduced time, manpower and reagents in NGS mean that the costs are much lower. The first human genome sequence cost in the region of £300M. Using modern Sanger sequencing methods, aided by data from the known sequence, a full human genome would still cost £6M. Sequencing a human genome with Illumina today would cost only £6,000.

Repetition is intrinsic to NGS: each template fragment is amplified before sequencing, and the method relies on many short overlapping reads, so each section of DNA or RNA is sequenced multiple times. Because NGS is also so much quicker and cheaper, it is practical to do more repeats than with Sanger sequencing. More repeats mean greater coverage, which leads to a more accurate and reliable sequence, even though individual NGS reads are less accurate.
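The link between repeats, coverage and accuracy can be sketched in a few lines of Python. Mean coverage is commonly estimated as (number of reads × read length) / genome length. The majority-vote consensus below is a deliberately simplified illustration: it assumes the reads are already aligned and of equal length, and the example reads and genome size are made up. Real pipelines use proper aligners and quality-aware variant callers.

```python
from collections import Counter


def mean_coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times each base is sequenced: N * L / G."""
    return n_reads * read_len / genome_len


def consensus(aligned_reads: list[str]) -> str:
    """Majority-vote base at each column of equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


if __name__ == "__main__":
    # Hypothetical run: one million 100 bp reads over a 5 Mb genome.
    print(mean_coverage(1_000_000, 100, 5_000_000))  # 20.0

    # Four toy reads covering the same region; two contain a single error each.
    reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAA"]
    print(consensus(reads))  # ACGTAC
```

Each erroneous base is outvoted by the three correct reads at that position, which is why deeper coverage gives a more reliable sequence even when single reads contain errors.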
Sanger sequencing gives much longer individual reads. However, the parallel nature of NGS means that longer sequences (contigs) can be constructed from many overlapping short reads.
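As a toy illustration of how short reads can be joined into a longer sequence, the Python sketch below merges reads on their exact suffix/prefix overlaps. It assumes error-free reads that are already in left-to-right order, and the reads and the `min_len` threshold are invented for the example. Real assemblers build overlap or de Bruijn graphs and tolerate sequencing errors, so this is not how production tools work.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_in_order(reads: list[str], min_len: int = 3) -> str:
    """Extend a contig by appending the non-overlapping tail of each next read."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read, min_len)
        if k == 0:
            raise ValueError(f"read {read!r} does not overlap the contig")
        contig += read[k:]
    return contig


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACA", "ACAGGTT"]
    print(merge_in_order(reads))  # ATGGCGTACAGGTT
```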

Glossary of NGS terms and formats

  • SAM: Sequence Alignment/Map. A generic, tab-delimited text alignment format that supports short and long reads. BAM is the compressed binary equivalent of SAM: it is not human-readable, but it is smaller and faster for computation.
  • BED: Browser Extensible Data. Used for mapping and annotation of genomic intervals; its indexed binary equivalent is bigBed. bedGraph files are used to represent per-region scores, such as peak scores.
  • WIG: Wiggle. Used for visualising and summarising data, mostly count data or normalised count data (e.g. RPKM). Its indexed binary equivalent is bigWig.
  • GFF: General Feature Format. Used for annotating genetic / genomic features, e.g. genes in the Ensembl database, and in downstream analysis to assign annotation to regions and peaks.
  • VCF: Variant Call Format. Used to represent variants, such as SNPs, called after mapping.
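The text formats in this glossary are tab-delimited, so a single record can be split into named fields with plain Python. The sketch below does this for one SAM, one BED and one VCF line, using the mandatory column names from the respective specifications. The example records are invented for illustration, and the parser function names are hypothetical. For real data, an established library such as pysam is the safer choice.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_sam_line(line: str) -> dict:
    """Split one SAM alignment line into its 11 mandatory fields plus optional tags."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = parts[11:]
    return record


def parse_bed_line(line: str) -> dict:
    """Split one BED line; start is 0-based and end is exclusive."""
    chrom, start, end, *extra = line.rstrip("\n").split("\t")
    return {"chrom": chrom, "start": int(start), "end": int(end),
            "length": int(end) - int(start), "extra": extra}


def parse_vcf_line(line: str) -> dict:
    """Split one VCF data line into its 8 fixed fields (POS is 1-based)."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIELDS, parts[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are allowed
    return record


if __name__ == "__main__":
    # Invented example records, not taken from any real dataset.
    sam = parse_sam_line("read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII")
    bed = parse_bed_line("chr1\t99\t109\tfeature1")
    vcf = parse_vcf_line("chr1\t105\t.\tA\tG\t50\tPASS\tDP=20")
    print(sam["RNAME"], sam["POS"], sam["CIGAR"])
    print(bed["chrom"], bed["length"])
    print(vcf["CHROM"], vcf["POS"], vcf["REF"], vcf["ALT"])
```

Note the coordinate conventions: SAM and VCF positions are 1-based, whereas BED intervals are 0-based and half-open, so the BED record above spans the same ten bases as the SAM alignment starting at position 100.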

Next:

NGS Data analysis: De novo assembly vs Reference assembly.