Differences

This shows you the differences between two versions of the page.

--- tutorials:population-diversity:snp-chips [2020/09/17 15:22] – [Data analysis workflow with Plink 1.9] bngina
+++ tutorials:population-diversity:snp-chips [2020/09/22 10:21] (current) – [Data analysis workflow with Plink 1.9] bngina
@@ Line 127: / Line 127: @@
 #define file path variables
-#for the input ped and map files, you only need specify the path to the files and give the prefix used to name both files as is the norm, and plink will automatically fill the extension (.ped and .map).
+</code>
-in_file='/home/bngina/plink_work/orig_data/Caprin_60k'
+For the input ped and map files, you only need to specify the path to the files and the shared prefix used to name both of them; plink automatically appends the extensions (.ped and .map).
-#directory to store output files, first create this directory(plink_out_files) in your home working directory in order to reference it here
+<code>
+in_file='/home/bngina/plink_work/orig_data/Caprin_60k'
 out='/home/bngina/plink_work/plink_out_files'
@@ Line 138: / Line 139: @@
 module load plink/1.9
+</code>
-##########convert files to binary format ####################
-#its recommended to compress the files in order to use them with plink, the ped and map files carry a lot information are quite big, hence we convert them to binary files within plink for faster computation'
+It is recommended to compress the files before using them with plink. The ped and map files carry a lot of information and are quite big, so we convert them to binary files within plink for faster computation.
+<code>
+##########convert files to binary format ####################
 #(--file) tells plink where the file is; it automatically appends the extension
@@ Line 149: / Line 153: @@
  --make-bed
+</code>
+The command above creates three files in the specified output directory ''${out}'' with the specified prefix ''bin_caprin_60k'':
+  * //bin_caprin_60k.bed//
+  * //bin_caprin_60k.fam//
+  * //bin_caprin_60k.bim//
+Now we use the created binary files, indicated to plink using ''--bfile'', to do some basic exploratory statistics of the data set.
+  - Look at the individuals with missing data and the SNPs not typed in all the individuals.
+<code>
+######### summary statistics ########
+#missingness
+plink --bfile ${out}/bin_caprin_60k --missing \
+ --out ${out}/bin_caprin_60k \
+ --noweb
 </code>
+This creates two files.
+  *//bin_caprin_60k.imiss// - for the individuals
+  *//bin_caprin_60k.lmiss// - for the loci
+The missing-data information for the individuals, found in ''bin_caprin_60k.imiss'', looks like this:
+<code>
+FID                               IID MISS_PHENO   N_MISS   N_GENO   F_MISS
+          WG6694108-DNA_A01_110kin          Y     1325    53347  0.02484
+         WG6694108-DNA_A02_105kin1          Y     1346    53347  0.02523
+           WG6694108-DNA_A03_55kin          Y     1313    53347  0.02461
+           WG6694108-DNA_A04_50kin          Y     1360    53347  0.02549
+          WG6694108-DNA_A05_104kin          Y     1350    53347  0.02531
+           WG6694108-DNA_A06_82kin          Y     1412    53347  0.02647
+          WG6694108-DNA_A07_75kin1          Y     1387    53347    0.026
+         WG6694108-DNA_A08_110kin1          Y     1312    53347  0.02459
+           WG6694108-DNA_A09_77kin          Y     1356    53347  0.02542
+           WG6694108-DNA_A10_Zkin2          Y     1349    53347  0.02529
+</code>
+The information in each column is as follows:
+<code>
+FID                Family ID
+IID                Individual ID
+MISS_PHENO         Missing phenotype? (Y/N)
+N_MISS             Number of missing SNPs
+N_GENO             Number of non-obligatory missing genotypes, i.e. the total number of SNPs considered
+F_MISS             Proportion of missing SNPs (N_MISS / N_GENO, a fraction between 0 and 1; 0.02484 means 2.484%)
+</code>
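A minimal sketch of how the ''.imiss'' table could be screened for individuals above a missingness cutoff. The demo file reuses three rows from the listing above; the cutoff value and the variable names (''imiss_demo'', ''max_f_miss'', ''flagged'') are assumptions made only for illustration.

```shell
# Hypothetical demo file: three rows copied from the .imiss listing above.
imiss_demo="$(mktemp)"
cat > "${imiss_demo}" <<'EOF'
FID IID MISS_PHENO N_MISS N_GENO F_MISS
WG6694108-DNA_A01_110kin Y 1325 53347 0.02484
WG6694108-DNA_A06_82kin Y 1412 53347 0.02647
WG6694108-DNA_A07_75kin1 Y 1387 53347 0.026
EOF

# Hypothetical cutoff, chosen only so that the demo flags one individual.
max_f_miss=0.026

# F_MISS is the last column and IID sits four columns before it.
# Indexing from the end works whether or not the FID column is filled in.
flagged="$(awk -v max="${max_f_miss}" \
  'NR > 1 && $NF + 0 > max + 0 { print $(NF-4) }' "${imiss_demo}")"

echo "${flagged}"
rm -f "${imiss_demo}"
```

On real data, the same ''awk'' call can be pointed at ''${out}/bin_caprin_60k.imiss'' instead of the demo file.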
+The information for the SNPs, found in ''bin_caprin_60k.lmiss'', looks like this:
+<code>
+ CHR                           SNP   N_MISS   N_GENO   F_MISS
+           snp1-scaffold1-2170        4      648 0.006173
+      snp1-scaffold708-1421224        8      648  0.01235
+        snp10-scaffold1-352655        2      648 0.003086
+   snp1000-scaffold1026-533890        0      648        0
+  snp10000-scaffold1356-652219        4      648 0.006173
+  snp10001-scaffold1356-703514        9      648  0.01389
+  snp10002-scaffold1356-766996       10      648  0.01543
+  snp10003-scaffold1356-808120        5      648 0.007716
+  snp10004-scaffold1356-853276        3      648  0.00463
+  snp10005-scaffold1356-907019        2      648 0.003086
+</code>
+The information in each column is as follows;
+<code>
+CHR                Chromosome number
+SNP                SNP identifier
+N_MISS             Number of individuals missing this SNP
+N_GENO             Number of non-obligatory missing genotypes, i.e. the total number of individuals considered for this SNP
+F_MISS             Proportion of individuals missing this SNP (N_MISS / N_GENO, a fraction between 0 and 1; 0.01235 means 1.235%)
+</code>
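As a rough check of the column meanings, the sketch below recomputes ''F_MISS'' as ''N_MISS / N_GENO'' and counts the SNPs above a cutoff. The demo file reuses four rows from the listing above; the cutoff and the variable names (''lmiss_demo'', ''n_above'', ''n_mismatch'') are assumptions for illustration only.

```shell
# Hypothetical demo file: four rows copied from the .lmiss listing above.
lmiss_demo="$(mktemp)"
cat > "${lmiss_demo}" <<'EOF'
CHR SNP N_MISS N_GENO F_MISS
snp1-scaffold1-2170 4 648 0.006173
snp1-scaffold708-1421224 8 648 0.01235
snp1000-scaffold1026-533890 0 648 0
snp10002-scaffold1356-766996 10 648 0.01543
EOF

# Hypothetical cutoff: count SNPs missing in more than 1% of individuals.
max_f_miss=0.01
n_above="$(awk -v max="${max_f_miss}" \
  'NR > 1 && $NF + 0 > max + 0 { n++ } END { print n + 0 }' "${lmiss_demo}")"

# Recompute N_MISS / N_GENO and count rows that differ from the reported
# F_MISS by more than a small rounding tolerance.
n_mismatch="$(awk 'NR > 1 {
    d = $(NF-2) / $(NF-1) - $NF
    if (d < 0) d = -d
    if (d > 0.00001) n++
  } END { print n + 0 }' "${lmiss_demo}")"

echo "SNPs above cutoff: ${n_above}; rows where F_MISS != N_MISS/N_GENO: ${n_mismatch}"
rm -f "${lmiss_demo}"
```

The columns are indexed from the end of each row so the commands still work when the ''CHR'' column is present.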
+We can generate a filtered file by adding thresholds for the rate of missing data per individual (''--mind''), the rate of missing data per SNP (''--geno'') and the minor allele frequency //(MAF)// (''--maf'').
+The thresholds for these filters should be adjusted to suit each data set.
+<code>
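+#### filter data ###
+#--geno 0.05 : remove SNPs with more than 5% missing genotypes (95% call rate)
+#--maf 0.01  : remove SNPs with a minor allele frequency below 1%
+#--mind 0.25 : remove individuals with more than 25% missing genotypes
+#(a comment after a trailing backslash breaks the line continuation, so the comments are listed here)
+plink --file ${in_file} \
+ --geno 0.05 \
+ --maf 0.01 \
+ --mind 0.25 \
+ --out ${out}/bin_caprin_60k_fltrd \
+ --make-bed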
+</code>
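One way to see how many records survive the filters is to count lines: a ''.fam'' file has one line per individual and a ''.bim'' file has one line per SNP. The helper name ''count_records'' and the three-line demo file below are hypothetical and only show that the helper works.

```shell
# Hypothetical helper: print the number of lines (records) in a file.
count_records() {
  wc -l < "$1" | tr -d '[:space:]'
}

# Placeholder .fam-style file with three made-up individuals (not real data).
demo_fam="$(mktemp)"
printf 'fam1 ind1 0 0 0 -9\nfam1 ind2 0 0 0 -9\nfam1 ind3 0 0 0 -9\n' > "${demo_fam}"

n_demo="$(count_records "${demo_fam}")"
echo "individuals in demo file: ${n_demo}"
rm -f "${demo_fam}"
```

On real output, comparing ''count_records "${out}/bin_caprin_60k.fam"'' with ''count_records "${out}/bin_caprin_60k_fltrd.fam"'' (and likewise for the ''.bim'' files) gives the number of individuals and SNPs removed.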
 ===== Data analysis workflow with R and adegenet =====