This is an old revision of the document!

Studying population diversity using SNP chips

SNP chips (pronounce "snip chips") are genotyping platforms based on the presence of tens of thousands of markers on a small sequencing device, also called a "microarray" or "bead chip". For instance, the Illumina OvineSNP50 microarray features some 54,241 SNP probes, while their Porcine SNP60 beadchip features over 64,000 markers.

Using a SNP chip, one can genotype simultaneously many samples. The pricing of the chip obviously depends on the number of samples one can run on the chip. For instance, for the above-mentioned PorcineSNP60v2, the offer ranges from chips accommodating up to 48 samples to chips accommodating up to 1152 samples. In the case of a diploid species, each of the markers produces one or two signals for each of the samples: one signal only if the sample is homozygous for that marker, and two if the sample is heterozygous.

Using a SNP chip, one can therefore genotype the samples of interest, and also study the molecular diversity within the batch of samples submitted for sequencing.

Input data formats: Illumina final report, .ped, .map, .bsc files, etc

The data one gets from a microarray describes for each sample and each locus (aka marker) the set of alleles called by the sequencing operation. "Called" here means "seen" by the sequencing device at that position. Each marker is also fully described, usually giving its genomic position in some reference genome, as well as, possibly, the most common (dominant) allele seen in that species at that position. Different data formats have been standardized for that purpose.

Illumina "Final report"

These "Final report" files are large csv files. Being the market-dominating vendor of SNP arrays, Illumina have developed their own data format for the array-based genotyping reports. Those files are produced by Illumina's Genome Studio software, and after a header, they contain comma-separated data (or tab-separated, less frequently) with one row per sample and per marker: this is therefore a "long" format, not a "wide" one: there are as many rows of data as n*m, where n is the number of samples, and m the number of markers (SNPs). Below is an excerpt from such a file:

[Header]
GSGT Version,2.0.4
Processing Date,3/16/2020 11:12 AM
Content,,Goat_IGGC_65K_v2_15069617X365016_A1.bpm
Num SNPs,59727
Total SNPs,59727
Num Samples,720
Total Samples,1224
[Data]
Sample Name,SNP Name,Allele1 - Top,Allele2 - Top
WG9308435-DNA_A01_EY0041000,1_101941444_AF-PAKI,G,G
WG9308435-DNA_A01_EY0041000,1_10408764_AF-PAKI,G,G
WG9308435-DNA_A01_EY0041000,1_104453302_AF-PAKI,-,-
WG9308435-DNA_A01_EY0041000,1_107080965_AF-PAKI,G,G
WG9308435-DNA_A01_EY0041000,1_109839943_AF-PAKI,A,G
WG9308435-DNA_A01_EY0041000,1_112457804_AF-PAKI,A,A
WG9308435-DNA_A01_EY0041000,1_115032820_AF-PAKI,A,G
WG9308435-DNA_A01_EY0041000,1_117945786_AF-PAKI,A,G
WG9308435-DNA_A01_EY0041000,1_121035617_AF-PAKI,C,C

In such a file, the [Header] section provides information regarding the number of samples and the number of markers, and the [Data] section contains the genotyping information per se: each row contains at least the following information.

Sample Name
SNP Name → an identifier for the SNP marker, we will see it appear in .map files as well
Allele1 - Top → the first allele called by the sequencer, transformed to be read on the 'TOP', or 'reference' strand, even if the marker is actually in a coding sequence in the alternate strand
Allele2 - Top → the second allele called by the sequencer. It may well be the same as the first allele, in which case the sample is homozygous for that marker.

WARNING!! This is usually a very large ("long", actually) file, so although (apart from the header) it is actually a comma-separated text file, it will be cumbersome (and will necessitate lots of RAM) to read it using a traditional spreadsheet editor. Better browse it on the commandline using the less -S command.

.ped file format

The ped format is a quite compact, yet text-based file format that contains all the genotyping information. It contains one line per sample, each line or row containing the following tab- or whitespace-separated fields, in this order and with no header line. The first six columns are mandatory:

family or population ID. When no families/populations are defined, this first field will usually contain integers corresponding to sequential IDs for the samples, starting from 1: each sample is therefore the single member of its own family.
individual or sample ID. This is simply the sample name.
paternal ID. Left to 0 if unknown, or when the father is not in the sample set.
maternal ID. Left to 0 if unknown, or when the mother is not in the sample set.
sex. The coding is as follows: 1=male; 2=female; anything else (usually 0) means unknown.
phenotype of interest. A ped file must have one (and only one) phenotype column, for instance a quantitative trait or an "affected/unaffected" binary trait, where "affected" is coded 2 and "unaffected" is coded 1. In this column, missing values are coded either 0 or -9. See the plink documentation for more information about that.

The following 2m columns, where m is the number of markers, contain the genotype calls, i.e. single nucleotides. Missing information (when the genotyping device was unable to decide which allele was seen is encoded as 0.

A ped file must be biallelic, even for haploid species (for which the ped file will only contain pairs of identical alleles) or for polyploid species (where there will be some loss of information). In any case, all current (2020) SNP chips are designed for polyploid populations.

.map file format

The files with extension .map contain informations regarding the markers in use. They contain one line per marker (i.e. per SNP marker, most often), each line containing the following whitespace-separated fields:

chromosome number or contig name. That is the chromosome or contig where the marker is located. Set to 0 if unknown/unspecified.
variant identifier. That is the name of the marker, as found for instance in dbSNP or other variant databases.
position of the marker in morgans or centimorgans (optional; also safe to use dummy value of 0)
position of the marker in basepairs.

Fields 3 and 4 (positioning information) are relative to field 1: if field 1 is set to 0, then necessarily, fields 3 and 4 will also be 0. The position of markers on the genome is a very useful piece of information when one wants to study recombination hotspots, linkage disequilibrium decay, etc, or simply to be able to point which region of the genome correlates with a given trait of interest.

.bed files

"bed" stands for "binary ped": this is the compressed form of a ped file. A .bed file is therefore not human-readable, but machine-readable and much more compact (in terms of filesize) than the corresponding .ped file. .bed is the standard data format in use by Plink. A .bed file must always be accompanied by its companion .fam file, which describes the samples.

.fam file format

Such files contain information regarding the individuals (i.e. samples) that were genotyped. They are therefore sample information files. To a very large extent, the .fam file is actually a file made of the first six columns of the .ped file, in the same order. Compare the following (from the Plink2.0 documentation) with what we wrote above regarding the contents of the .ped file:

Family ID ("FID")
Within-family ID ("IID"; cannot be 0)
Within-family ID of father (0 if father isn't in dataset)
Within-family ID of mother (0 if mother isn't in dataset)
Sex code (1 = male, 2 = female, 0 = unknown)
Phenotype value (1 = control, 2 = case, -9/0/non-numeric = missing data if case/control)

.bsc files

Those are XML files generated by the Illumina Genome Studio software. They contain troubleshooting information telling the association between user samples and the cells in the microarray on which the samples have been genotyped.

You can safely ignore those .bsc files.

Data analysis workflow with Plink 1.9

PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner, Purcell S et al., 2007. Plink is purely an analysis tool for phenotype/genotype data.

Plink is available as a command line tool and as gPLINK, a Java-based graphical user interphase software package. On this tutorial we will use the command line tool.

Once logged into your /home/ on the hpc, several versions of PLINK are available and can be called into the working environment with module load plink/1.9 . To see the help simply type plink –help.

The basic data input formats for plink are the .ped and .map files as described above. In the input .ped file, within the first six obligatory columns, ensure that that naming in the first column; Family ID ("FID") and the second column; Within-family ID, do not have white spaces between them. These might cause the file to be read incorrectly by plink. If you need to separate fields in naming use an underscore _ instead.

We will run all the commands using slurm, so we prepare a script file to submit to run as batch jobs. To know more on this refer to the Using Slurm page.

We will begin with running basic statistics to explore the dataset

prepare your script

#!/bin/bash
#SBATCH -p batch
#SBATCH -n 2
#SBATCH -o /home/bngina/plink_work/batch_logs/plink.%N.%J.out
#SBATCH -e /home/bngina/plink_work/batch_logs/plink.%N.%J.err
#SBATCH -J plink

#load the tool

module load plink/1.9

#define file paths

in_file='/home/bngina/plink_work/orig_data/Caprin_60k'

ILRI Research Computing

Table of Contents