Differences

This shows you the differences between two versions of the page.

--- tutorials:population-diversity:snp-chips [2020/09/11 11:35] – [Data analysis workflow with Plink 1.9] bngina
+++ tutorials:population-diversity:snp-chips [2020/09/22 10:21] (current) – [Data analysis workflow with Plink 1.9] bngina
@@ Line 109: / Line 109: @@
 We will run all the commands using slurm, so we prepare a script file to submit to run as batch jobs. To know more on this refer to the [[https://hpc.ilri.cgiar.org/using-slurm | Using Slurm page]].
-We will begin with running basic statistics to explore the dataset
+We will begin with running basic statistics to explore the dataset, which will enable quality check of data leading to data filtering in particular for poorly genotyped individuals and poorly called marker across the individuals.
 ==prepare your script ==
@@ Line 125: / Line 125: @@
 module load plink/1.9
-#define file paths
+#define file path variables
-#for the input ped and map files, you only need specify the path to the files and give the path to used for both files, plink will automatically fill the extension
+</code>
+For the input ped and map files, you only need specify the path to the files and give the prefix used to name both files as is the norm, and plink will automatically fill the extension (.ped and .map).
+<code>
 in_file='/home/bngina/plink_work/orig_data/Caprin_60k'
+out='/home/bngina/plink_work/plink_out_files'
+#load tools to be used
+module load plink/1.9
 </code>
+It is recommended to compress the files in order to use them with plink, the ped and map files carry a lot information and are quite big, hence we convert them to binary files within plink for faster computation.
+<code>
+##########convert files to binary format ####################
+#the (--file) tells plink where the file is, it automatically appends the extension)
+plink --file ${file} \
+ --out ${out}/bin_caprin_60k \
+ --make-bed
+</code>
+Above creates three files in the specified output directory ''${out}'' with the specified prefix ''bin_caprin_60k''
+  *//bin_caprin_60k.bed//
+  *//bin_caprin_60k.fam//
+   *//bin_caprin_60k.bim//
+Now we use the created binary files, indicated to plink using ''--bfile'', to do some basic exploratory statistics of the data set.
+  -Look a the individuals with missing data and SNPs not typed in all the individuals
+<code>
+######### summary statistics ########
+#missingness
+plink --bfile ${out}/bin_caprin_60k --missing \
+ --out ${out}/bin_caprin_60k \
+ --noweb
+</code>
+This creates two files.
+  *//bin_caprin_60k.imiss// - for the individuals
+  *//bin_caprin_60k.lmiss// - for the loci
+#The missing information found in the ''bin_caprin_60k.imiss'' for the individuals looks like below;
+<code>
+FID                               IID MISS_PHENO   N_MISS   N_GENO   F_MISS
+          WG6694108-DNA_A01_110kin          Y     1325    53347  0.02484
+         WG6694108-DNA_A02_105kin1          Y     1346    53347  0.02523
+           WG6694108-DNA_A03_55kin          Y     1313    53347  0.02461
+           WG6694108-DNA_A04_50kin          Y     1360    53347  0.02549
+          WG6694108-DNA_A05_104kin          Y     1350    53347  0.02531
+           WG6694108-DNA_A06_82kin          Y     1412    53347  0.02647
+          WG6694108-DNA_A07_75kin1          Y     1387    53347    0.026
+         WG6694108-DNA_A08_110kin1          Y     1312    53347  0.02459
+           WG6694108-DNA_A09_77kin          Y     1356    53347  0.02542
+           WG6694108-DNA_A10_Zkin2          Y     1349    53347  0.02529
+</code>
+The information in each header is as follows;
+<code>
+FID                Family ID
+IID                Individual ID
+MISS_PHENO         Missing phenotype? (Y/N)
+N_MISS             Number of missing SNPs
+N_GENO             Number of non-obligatory missing genotypes i.e total number of SNPs used
+F_MISS             Proportion of missing SNPs (in percentage)
+</code>
+The information found in the ''bin_caprin_60k.lmiss'' for the SNPs is as below;
+<code>
+ CHR                           SNP   N_MISS   N_GENO   F_MISS
+           snp1-scaffold1-2170        4      648 0.006173
+      snp1-scaffold708-1421224        8      648  0.01235
+        snp10-scaffold1-352655        2      648 0.003086
+   snp1000-scaffold1026-533890        0      648        0
+  snp10000-scaffold1356-652219        4      648 0.006173
+  snp10001-scaffold1356-703514        9      648  0.01389
+  snp10002-scaffold1356-766996       10      648  0.01543
+  snp10003-scaffold1356-808120        5      648 0.007716
+  snp10004-scaffold1356-853276        3      648  0.00463
+  snp10005-scaffold1356-907019        2      648 0.003086
+</code>
+The information in each column is as follows;
+<code>
+SNP                SNP identifier
+CHR                Chromosome number
+N_MISS             Number of individuals missing this SNP
+N_GENO             Number of non-obligatory missing genotypes i.e total number of genotypes in the population
+F_MISS             Proportion of sample missing for this SNP (in percentage)
+</code>
+We can generate a file with filters added for the rate missing data in individuals ''--mind'' and call rate for the SNPs ''--geno'' and also for the minor allele frequency //(MAF)// , with flag ''--maf''.
+The thresholds for these filters should be adjusted accordingly to the different data sets.
+<code>
+#### filter data ###
+plink --file ${file} \
+ --geno 0.05 \   #95% call rate of SNPs
+ --maf 0.01\     #SNPs with less than 1% minor allele frequencies
+ --mind 0.25 \   #individuals with more than 25% missing data
+ --out ${out}/bin_caprin_60k_fltrd \
+ --make-bed
+</code>
 ===== Data analysis workflow with R and adegenet =====
+<code>
+# Working on ped/map files with adegenet
+library(adegenet)
+library(pegas)
+setwd("~/Work/ABCF_fellows/GoatDiv/Data_March2020/")
+n_markers = 59727 # for now you have to get this number first, which can be easily gotten
+# since it is the number of lines in the ped file (do 'wc -l' on that file in the shell)
+# We first read the ped file into a genind object:
+dat <- read.loci("13.Caprin 60Kv2.ped", header = FALSE, loci.sep = "\t", allele.sep = "\t", col.pop = 1,
+                 row.names = 2, col.loci = 7:(6+2*n_markers))
+# the above takes *lots* of time. It wasn't even able to finish in reasonable time on my computer.
+# It is much better to first convert the .ped file to a .raw file using Plink, through the following command:
+# plink1.9 --ped '13.Caprin 60Kv2.ped' --map '13.Caprin 60Kv2.map' --chr-set 29 --allow-extra-chr --recodeA
+# and then import the resulting "raw" object into R with read.PLINK():
+read.PLINK("plink.raw", map.file = "13.Caprin 60Kv2.map", parallel = T, n.cores = 2) -> dat
+# the above is still long, but reasonably so.
+save(dat, file = "March2020_genlight.Rdata")
+# In any case, now we have a compact genlight object, better write it to the disk so that next time we will load it
+# straight ahead, skipping the stuff above, and instead of that, next time we will do a simple:
+load("March2020_genlight.Rdata") # this file is 11MB, compared to the 165MB of the ped file
+# summary of the object, checking everything is ok
+dat
+indNames(dat)[1:10]
+locNames(dat)[1:10]
+</code>