====== Studying population diversity using SNP chips ======

**SNP chips** (pronounce "snip chips") are genotyping platforms based on the presence of tens of thousands of markers on a small sequencing device, also called a "**microarray**" or "**bead chip**". For instance, the [[https://emea.illumina.com/products/by-type/microarray-kits/ovine-snp50.html|Illumina OvineSNP50]] microarray features some 54,241 SNP probes, while their [[https://emea.illumina.com/products/by-type/microarray-kits/porcine-snp60.html|Porcine SNP60]] beadchip features over 64,000 markers.

Using a SNP chip, one can genotype simultaneously many samples. The pricing of the chip obviously depends on the number of samples one can run on the chip. For instance, for the above-mentioned PorcineSNP60v2, the offer ranges from chips accommodating up to 48 samples to chips accommodating up to 1152 samples. In the case of a diploid species, each of the markers produces one or two signals for each of the samples: one signal only if the sample is homozygous for that marker, and two if the sample is heterozygous.

Using a SNP chip, one can therefore genotype the samples of interest, and also study the molecular diversity within the batch of samples submitted for sequencing.

===== Input data formats: Illumina final report, .ped, .map, .bsc files, etc =====

The data one gets from a microarray describes for each sample and each locus (aka marker) the set of alleles called by the sequencing operation. "Called" here means "seen" by the sequencing device at that position. Each marker is also fully described, usually giving its genomic position in some reference genome, as well as, possibly, the most common (dominant) allele seen in that species at that position. Different data formats have been standardized for that purpose.


==== Illumina "Final report" ====

These "Final report" files are **large csv files**. Being the market-dominating vendor of SNP arrays, Illumina have developed their own data format for the array-based genotyping reports. Those files are produced by Illumina's Genome Studio software, and after a header, they contain comma-separated data (or tab-separated, less frequently) with **one row per sample and per marker**: this is therefore a "long" format, not a "wide" one: there are as many rows of data as ''n*m'', where ''n'' is the number of samples, and ''m'' the number of markers (SNPs). Below is an excerpt from such a file:

<file>
[Header]
GSGT Version,2.0.4
Processing Date,3/16/2020 11:12 AM
Content,,Goat_IGGC_65K_v2_15069617X365016_A1.bpm
Num SNPs,59727
Total SNPs,59727
Num Samples,720
Total Samples,1224
[Data]
Sample Name,SNP Name,Allele1 - Top,Allele2 - Top
WG9308435-DNA_A01_EY0041000,1_101941444_AF-PAKI,G,G
WG9308435-DNA_A01_EY0041000,1_10408764_AF-PAKI,G,G
WG9308435-DNA_A01_EY0041000,1_104453302_AF-PAKI,-,-
WG9308435-DNA_A01_EY0041000,1_107080965_AF-PAKI,G,G
WG9308435-DNA_A01_EY0041000,1_109839943_AF-PAKI,A,G
WG9308435-DNA_A01_EY0041000,1_112457804_AF-PAKI,A,A
WG9308435-DNA_A01_EY0041000,1_115032820_AF-PAKI,A,G
WG9308435-DNA_A01_EY0041000,1_117945786_AF-PAKI,A,G
WG9308435-DNA_A01_EY0041000,1_121035617_AF-PAKI,C,C
</file>

In such a file, the ''[Header]'' section provides information regarding the number of samples and the number of markers, and the ''[Data]'' section contains the genotyping information //per se//: each row contains at least the following information.

  * ''Sample Name''
  * ''SNP Name'' -> an identifier for the SNP marker, we will see it appear in .map files as well
  * ''Allele1 - Top'' -> the first allele called by the sequencer, transformed to be read on the 'TOP', or 'reference' strand, even if the marker is actually in a coding sequence in the alternate strand
  * ''Allele2 - Top'' -> the second allele called by the sequencer. It may well be the same as the first allele, in which case the sample is //homozygous// for that marker.

WARNING!! This is usually a very large ("long", actually) file, so although (apart from the header) it is actually a comma-separated text file, it will be cumbersome (and will necessitate lots of RAM) to read it using a traditional spreadsheet editor. Better browse it on the commandline using the ''less -S'' command.


==== .ped file format ====

The ped format is a quite compact, yet text-based file format that contains all the genotyping information.  It contains **one line per sample**, each line or row containing the following tab- or whitespace-separated fields, in this order and with no header line. The first six columns are mandatory:
  - family or population ID. When no families/populations are defined, this first field will usually contain integers corresponding to sequential IDs for the samples, starting from 1: each sample is therefore the single member of its own family.
  - individual or sample ID. This is simply the sample name.
  - paternal ID. Left to 0 if unknown, or when the father is not in the sample set.
  - maternal ID. Left to 0 if unknown, or when the mother is not in the sample set.
  - sex. The coding is as follows: 1=male; 2=female; anything else (usually 0) means unknown.
  - phenotype of interest. A ped file must have one (and only one) phenotype column, for instance a quantitative trait or an "affected/unaffected" binary trait, where "affected" is coded 2 and "unaffected" is coded 1. In this column, missing values are coded either 0 or -9. See [[https://zzz.bwh.harvard.edu/plink/data.shtml#ped|the plink documentation]] for more information about that.

The following ''2m'' columns, where ''m'' is the number of markers, contain the genotype calls, i.e. single nucleotides. Missing information (when the genotyping device was unable to decide which allele was seen is encoded as ''0''.

A ped file **must be biallelic**, even for haploid species (for which the ped file will only contain pairs of identical alleles) or for polyploid species (where there will be some loss of information). In any case, all current (2020) SNP chips are designed for polyploid populations.


==== .map file format ====

The files with extension ''.map'' contain informations regarding the markers in use. They contain one line per marker (i.e. per SNP marker, most often), each line containing the following whitespace-separated fields:
  - chromosome number or contig name. That is the chromosome or contig where the marker is located. Set to ''0'' if unknown/unspecified.
  - variant identifier. That is the name of the marker, as found for instance in [[https://www.ncbi.nlm.nih.gov/snp/|dbSNP]] or other variant databases.
  - position of the marker in morgans or centimorgans (optional; also safe to use dummy value of ''0'')
  - position of the marker in basepairs.

Fields 3 and 4 (positioning information) are relative to field 1: if field 1 is set to ''0'', then necessarily, fields 3 and 4 will also be ''0''. The position of markers on the genome is a very useful piece of information when one wants to study recombination hotspots, linkage disequilibrium decay, etc, or simply to be able to point which region of the genome correlates with a given trait of interest.

==== .bed files ====

"bed" stands for "binary ped": this is the compressed form of a ped file. A .bed file is therefore not human-readable, but machine-readable and much more compact (in terms of filesize) than the corresponding .ped file. **.bed is the standard data format in use by Plink**. A .bed file must always be accompanied by its companion .fam file, which describes the samples.


==== .fam file format ====

Such files contain information regarding the individuals (i.e. samples) that were genotyped. They are therefore **sample information files**. To a very large extent, the .fam file is actually a file made of the first six columns of the .ped file, in the same order. Compare the following (from the [[https://www.cog-genomics.org/plink2/formats#fam|Plink2.0 documentation]]) with what we wrote above regarding the contents of the .ped file:
  - Family ID ("FID")
  - Within-family ID ("IID"; cannot be ''0'')
  - Within-family ID of father (''0'' if father isn't in dataset)
  - Within-family ID of mother (''0'' if mother isn't in dataset)
  - Sex code (''1'' = male, ''2'' = female, ''0'' = unknown)
  - Phenotype value (''1'' = control, ''2'' = case, ''-9''/''0''/non-numeric = missing data if case/control)

==== .bsc files ====

Those are XML files generated by the Illumina Genome Studio software. They contain troubleshooting information telling the association between user samples and the cells in the microarray on which the samples have been genotyped.

You can safely ignore those .bsc files.


===== Data analysis workflow with Plink 1.9 =====

[[http://zzz.bwh.harvard.edu/plink/index.shtml |PLINK]] is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner, [[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1950838/pdf/AJHGv81p559.pdf|Purcell S et al., 2007]]. Plink is purely an analysis tool for phenotype/genotype data. 

Plink is available as a command line tool and as [[http://zzz.bwh.harvard.edu/plink/gplink.shtml|gPLINK]], a Java-based graphical user interphase software package. On this tutorial we will use the command line tool.

Once logged into your ''/home/'' on the hpc, several versions of PLINK are available and can be called into the working environment with ''module load plink/1.9'' . To see the help simply type ''plink --help''.

The basic data input formats for plink are the ''.ped'' and ''.map'' files as described above. In the input ''.ped'' file, within the first six obligatory columns, ensure that that naming in the first column; Family ID ("FID") and the second column; Within-family ID, do not have white spaces between them. These might cause the file to be read incorrectly by plink. If you need to separate fields in naming use an underscore ''_'' instead.


We will run all the commands using slurm, so we prepare a script file to submit to run as batch jobs. To know more on this refer to the [[https://hpc.ilri.cgiar.org/using-slurm | Using Slurm page]].

We will begin with running basic statistics to explore the dataset, which will enable quality check of data leading to data filtering in particular for poorly genotyped individuals and poorly called marker across the individuals. 

==prepare your script ==

<code>
#!/bin/bash
#SBATCH -p batch
#SBATCH -n 2
#SBATCH -o /home/bngina/plink_work/batch_logs/plink.%N.%J.out
#SBATCH -e /home/bngina/plink_work/batch_logs/plink.%N.%J.err
#SBATCH -J plink

#load the tool

module load plink/1.9

#define file path variables

</code>

For the input ped and map files, you only need specify the path to the files and give the prefix used to name both files as is the norm, and plink will automatically fill the extension (.ped and .map).

<code>
in_file='/home/bngina/plink_work/orig_data/Caprin_60k'

out='/home/bngina/plink_work/plink_out_files'

#load tools to be used 

module load plink/1.9
</code>


It is recommended to compress the files in order to use them with plink, the ped and map files carry a lot information and are quite big, hence we convert them to binary files within plink for faster computation.

<code>
##########convert files to binary format ####################

#the (--file) tells plink where the file is, it automatically appends the extension)

plink --file ${file} \
 --out ${out}/bin_caprin_60k \
 --make-bed

</code>
Above creates three files in the specified output directory ''${out}'' with the specified prefix ''bin_caprin_60k''

  *//bin_caprin_60k.bed//
  *//bin_caprin_60k.fam//
   *//bin_caprin_60k.bim//

Now we use the created binary files, indicated to plink using ''--bfile'', to do some basic exploratory statistics of the data set. 

  -Look a the individuals with missing data and SNPs not typed in all the individuals

<code>


######### summary statistics ########

#missingness

plink --bfile ${out}/bin_caprin_60k --missing \
 --out ${out}/bin_caprin_60k \
 --noweb

</code>

This creates two files.

  *//bin_caprin_60k.imiss// - for the individuals
  *//bin_caprin_60k.lmiss// - for the loci


#The missing information found in the ''bin_caprin_60k.imiss'' for the individuals looks like below;
<code>
FID                               IID MISS_PHENO   N_MISS   N_GENO   F_MISS
   1          WG6694108-DNA_A01_110kin          Y     1325    53347  0.02484
   2         WG6694108-DNA_A02_105kin1          Y     1346    53347  0.02523
   3           WG6694108-DNA_A03_55kin          Y     1313    53347  0.02461
   4           WG6694108-DNA_A04_50kin          Y     1360    53347  0.02549
   5          WG6694108-DNA_A05_104kin          Y     1350    53347  0.02531
   6           WG6694108-DNA_A06_82kin          Y     1412    53347  0.02647
   7          WG6694108-DNA_A07_75kin1          Y     1387    53347    0.026
   8         WG6694108-DNA_A08_110kin1          Y     1312    53347  0.02459
   9           WG6694108-DNA_A09_77kin          Y     1356    53347  0.02542
  10           WG6694108-DNA_A10_Zkin2          Y     1349    53347  0.02529

</code>

The information in each header is as follows;
<code>
FID                Family ID
IID                Individual ID
MISS_PHENO         Missing phenotype? (Y/N)
N_MISS             Number of missing SNPs
N_GENO             Number of non-obligatory missing genotypes i.e total number of SNPs used
F_MISS             Proportion of missing SNPs (in percentage)
</code>

The information found in the ''bin_caprin_60k.lmiss'' for the SNPs is as below;
<code>
 CHR                           SNP   N_MISS   N_GENO   F_MISS
   0           snp1-scaffold1-2170        4      648 0.006173
   0      snp1-scaffold708-1421224        8      648  0.01235
   0        snp10-scaffold1-352655        2      648 0.003086
   0   snp1000-scaffold1026-533890        0      648        0
   0  snp10000-scaffold1356-652219        4      648 0.006173
   0  snp10001-scaffold1356-703514        9      648  0.01389
   0  snp10002-scaffold1356-766996       10      648  0.01543
   0  snp10003-scaffold1356-808120        5      648 0.007716
   0  snp10004-scaffold1356-853276        3      648  0.00463
   0  snp10005-scaffold1356-907019        2      648 0.003086
</code>   

The information in each column is as follows;
<code>
SNP                SNP identifier
CHR                Chromosome number
N_MISS             Number of individuals missing this SNP
N_GENO             Number of non-obligatory missing genotypes i.e total number of genotypes in the population
F_MISS             Proportion of sample missing for this SNP (in percentage)
</code>

We can generate a file with filters added for the rate missing data in individuals ''--mind'' and call rate for the SNPs ''--geno'' and also for the minor allele frequency //(MAF)// , with flag ''--maf''.

The thresholds for these filters should be adjusted accordingly to the different data sets.

<code>

#### filter data ###

plink --file ${file} \
 --geno 0.05 \   #95% call rate of SNPs
 --maf 0.01\     #SNPs with less than 1% minor allele frequencies
 --mind 0.25 \   #individuals with more than 25% missing data
 --out ${out}/bin_caprin_60k_fltrd \
 --make-bed

</code>
===== Data analysis workflow with R and adegenet =====

<code>
# Working on ped/map files with adegenet
library(adegenet)
library(pegas)
setwd("~/Work/ABCF_fellows/GoatDiv/Data_March2020/")


n_markers = 59727 # for now you have to get this number first, which can be easily gotten
# since it is the number of lines in the ped file (do 'wc -l' on that file in the shell)
# We first read the ped file into a genind object:
dat <- read.loci("13.Caprin 60Kv2.ped", header = FALSE, loci.sep = "\t", allele.sep = "\t", col.pop = 1,
                 row.names = 2, col.loci = 7:(6+2*n_markers))

# the above takes *lots* of time. It wasn't even able to finish in reasonable time on my computer.
# It is much better to first convert the .ped file to a .raw file using Plink, through the following command:
# plink1.9 --ped '13.Caprin 60Kv2.ped' --map '13.Caprin 60Kv2.map' --chr-set 29 --allow-extra-chr --recodeA
# and then import the resulting "raw" object into R with read.PLINK():

read.PLINK("plink.raw", map.file = "13.Caprin 60Kv2.map", parallel = T, n.cores = 2) -> dat
# the above is still long, but reasonably so.

save(dat, file = "March2020_genlight.Rdata")
# In any case, now we have a compact genlight object, better write it to the disk so that next time we will load it
# straight ahead, skipping the stuff above, and instead of that, next time we will do a simple:
load("March2020_genlight.Rdata") # this file is 11MB, compared to the 165MB of the ped file

# summary of the object, checking everything is ok
dat
indNames(dat)[1:10]
locNames(dat)[1:10]

</code>