[[mkatari-bioinformatics-august-2013|Back to Manny's Bioinformatics Workshop HOME]]
====== Clean SNP file ======
First save the file as a tab-delimited text file.
Then read it, explicitly telling read.table that the delimiter is a tab and that the first line is a header. The first column is used for the row names.
read.table("Draft sent to Manny.txt", sep="\t", header=TRUE, row.names=1)->draft
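It can help to check that the file was parsed as intended before going further. The sketch below writes a small made-up file to a temporary path and reads it with the same arguments. The file contents, the names toy and tmp, and the column names are all hypothetical, not taken from the real data.

```r
# Toy tab-delimited file; names and values are made up for illustration
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tregion\tsnp1\tsnp2",
             "g1\tnorth\tA\tG",
             "g2\tsouth\tNA\tG",
             "g3\tnorth\tA\tNA"), tmp)

toy <- read.table(tmp, sep = "\t", header = TRUE, row.names = 1)

dim(toy)        # rows = genotypes, columns = region + SNPs
rownames(toy)   # taken from the first column of the file
is.na(toy)      # the string "NA" is read as a missing value by default
```

If the dimensions or row names look wrong, the usual cause is that the file was not saved as tab-delimited text.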
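To count missing values use is.na, which returns TRUE for every NA. Summing the TRUE values gives the count. Here the sum is taken over each column (margin 2), that is, per SNP.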
apply(is.na(draft), 2, sum) -> draft.snp.na.sum
Keep only the columns (SNPs) in which at most 7% of the values are missing.
draft[ ,which(draft.snp.na.sum <= 0.07*nrow(draft)) ] -> draft.goodsnps
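The column filter can be tried on a small example first. This is a minimal sketch on made-up data. The names toy, na.per.snp and max.missing are hypothetical, and the cutoff is set to 15% instead of 7% only so that a 10-row toy table shows both a kept and a dropped SNP.

```r
# Toy data: 10 genotypes (rows) x 3 SNPs (columns); values are made up
toy <- data.frame(snp1 = c("A", "A", "G", "A", "G", "A", "A", "G", "A", "A"),
                  snp2 = c(NA, NA, "C", "C", "T", "C", "C", "T", "C", "C"),
                  snp3 = c("G", "G", "G", NA, "G", "G", "G", "G", "G", "G"),
                  stringsAsFactors = FALSE)

na.per.snp  <- apply(is.na(toy), 2, sum)  # missing values per column
max.missing <- 0.15                       # illustrative cutoff, not the 7% used above

# drop = FALSE keeps a data frame even if only one column passes
toy.good <- toy[, which(na.per.snp <= max.missing * nrow(toy)), drop = FALSE]
names(toy.good)
```

Here snp2 has 2 missing values out of 10, which is above the cutoff, so it is dropped, while snp1 and snp3 are kept.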
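Do the same for the genotypes (rows): count the missing values per row (margin 1) and keep the rows with at most 7% missing.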
apply(is.na(draft.goodsnps), 1, sum) -> draft.goodsnps.na.sum
draft.goodsnps[draft.goodsnps.na.sum<=0.07*ncol(draft.goodsnps),]->draft.goodsnps.goodgen
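The row filter works the same way with the margin switched from 2 to 1. A minimal sketch on made-up data follows. The names toy, na.per.genotype and max.missing are hypothetical, and the 25% cutoff is chosen only to suit a 4-column toy table.

```r
# Toy data: 4 genotypes (rows) x 4 SNPs (columns); values are made up
toy <- data.frame(snp1 = c("A", NA, "G", "A"),
                  snp2 = c("C", NA, "C", "T"),
                  snp3 = c("G", "G", NA, "G"),
                  snp4 = c("T", NA, "T", "T"),
                  row.names = c("g1", "g2", "g3", "g4"),
                  stringsAsFactors = FALSE)

na.per.genotype <- apply(is.na(toy), 1, sum)  # missing values per row
max.missing     <- 0.25                       # illustrative cutoff, not the 7% used above

# Logical row index; drop = FALSE keeps a data frame in all cases
toy.goodgen <- toy[na.per.genotype <= max.missing * ncol(toy), , drop = FALSE]
rownames(toy.goodgen)
```

Genotype g2 is missing 3 of 4 calls and is removed. The other rows have at most one missing call and are kept.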
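Remove the regions column (the first column) and keep only the SNPs. The range 2:1259 is specific to this dataset, where the filtered table has 1259 columns.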
snponly=draft.goodsnps.goodgen[,2:1259]
row.names(snponly)=row.names(draft.goodsnps.goodgen)
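If the number of columns is not known in advance, the first column can be dropped without writing the last index by hand. This is a sketch on made-up data, and the names toy and toy.snponly are hypothetical. It also shows that the row names carry over from the subset.

```r
# Toy data: a region column followed by two SNP columns; values are made up
toy <- data.frame(region = c("north", "south"),
                  snp1 = c("A", "G"),
                  snp2 = c("C", "C"),
                  row.names = c("g1", "g2"),
                  stringsAsFactors = FALSE)

# A negative index drops column 1 whatever the total number of columns is
toy.snponly <- toy[, -1, drop = FALSE]

# Same result as an explicit range from 2 to the last column
identical(toy.snponly, toy[, 2:ncol(toy), drop = FALSE])

rownames(toy.snponly)   # row names are kept by the subset
```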
Check the frequency of the different alleles, here for the first SNP column.
table(as.factor(as.character(snponly[,1])))
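The same check can be extended to report missing calls, to give proportions, or to cover every SNP at once. A minimal sketch on made-up data follows. The names toy, counts, freqs and all.counts are hypothetical.

```r
# Toy data: 4 genotypes x 2 SNPs; values are made up
toy <- data.frame(snp1 = c("A", "A", "G", NA),
                  snp2 = c("C", "T", "T", "T"),
                  stringsAsFactors = FALSE)

# Counts for one SNP; useNA = "ifany" also reports missing values
counts <- table(toy[, 1], useNA = "ifany")

# Proportions among the non-missing calls (table drops NA by default)
freqs <- prop.table(table(toy[, 1]))

# One table per SNP column, returned as a named list
all.counts <- lapply(toy, table)
all.counts$snp2
```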