====== Clean SNP file ======

Save the file as a tab-delimited text file first. Then read it, explicitly setting the delimiter to tab and the header to true, and using the first column as row names.


draft <- read.table("Draft sent to Manny.txt", sep = "\t", header = TRUE, row.names = 1)
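As a rough illustration of the read step, the sketch below writes a tiny tab-delimited file to a temporary path and reads it back. The toy data, the name `toy.path`, and the `na.strings` setting (which treats empty cells as missing) are assumptions for the example, not part of the workshop file.

```r
# Write a tiny tab-delimited file to a temporary location (toy data only).
toy.path <- tempfile(fileext = ".txt")
writeLines(c("id\tregion\tsnp1\tsnp2",
             "g1\tA\tAA\tCT",
             "g2\tB\t\tCC"),
           toy.path)

# Assumption: empty cells should count as missing, hence "" in na.strings.
toy <- read.table(toy.path, sep = "\t", header = TRUE, row.names = 1,
                  na.strings = c("NA", ""), stringsAsFactors = FALSE)

dim(toy)        # rows and columns after the id column became row names
rownames(toy)   # genotype identifiers taken from the first column
```

Checking `dim()` and `rownames()` right after reading is a quick way to confirm that the delimiter and header were interpreted as intended.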

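To count missing values (NA), use is.na(). It returns a logical matrix, and summing each column counts the TRUE entries, which gives the number of missing values per column.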


# MARGIN = 2 applies sum() to each column
draft.snp.na.sum <- apply(is.na(draft), 2, sum)
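To see what the counting step produces, here is a minimal sketch on a made-up data frame (the `toy` data and the name `na.per.snp` are hypothetical). It also shows `colSums()`, which should give the same counts as the `apply()` form.

```r
# Toy genotype table: 3 genotypes (rows) by 2 SNPs (columns).
toy <- data.frame(snp1 = c("AA", NA, "AG"),
                  snp2 = c(NA, NA, "CC"),
                  row.names = c("g1", "g2", "g3"))

is.na(toy)                               # logical matrix, TRUE where a value is missing
na.per.snp <- apply(is.na(toy), 2, sum)  # missing values per column
na.per.snp.alt <- colSums(is.na(toy))    # equivalent shortcut
na.per.snp
```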

Keep only the columns in which at most 7% of the values are missing.


draft.goodsnps <- draft[, which(draft.snp.na.sum <= 0.07 * nrow(draft))]

Do the same for the genotypes (rows): keep those with at most 7% missing data across the remaining columns.


# MARGIN = 1 applies sum() to each row
draft.goodsnps.na.sum <- apply(is.na(draft.goodsnps), 1, sum)
draft.goodsnps.goodgen <- draft.goodsnps[draft.goodsnps.na.sum <= 0.07 * ncol(draft.goodsnps), ]
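The two filtering steps can be sketched together as one small function. This is only an illustration under stated assumptions: `filter.missing` and `max.frac` are hypothetical names, the toy data is made up, and the example uses a 25% threshold so that the effect is visible on four rows.

```r
# Hypothetical helper: drop columns, then rows, whose fraction of NA exceeds max.frac.
filter.missing <- function(x, max.frac = 0.07) {
  keep.cols <- colMeans(is.na(x)) <= max.frac   # fraction missing per column
  x <- x[, keep.cols, drop = FALSE]
  keep.rows <- rowMeans(is.na(x)) <= max.frac   # fraction missing per row, after column filtering
  x[keep.rows, , drop = FALSE]
}

toy <- data.frame(snp1 = c("AA", "AG", "AA", "GG"),
                  snp2 = c(NA, NA, NA, "CC"),
                  snp3 = c("TT", NA, "TC", "TT"),
                  row.names = paste0("g", 1:4))

filtered <- filter.missing(toy, max.frac = 0.25)
filtered
```

Using `drop = FALSE` keeps the result a data frame even when only one column or row survives the filter.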

Remove the regions column (column 1) and keep only the SNP columns.


snponly=draft.goodsnps.goodgen[,2:1259]
row.names(snponly)=row.names(draft.goodsnps.goodgen)
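If the number of SNP columns changes after filtering, a hard-coded range may no longer match. One possible alternative is to drop the column by name, as in the sketch below. The column name `regions` and the toy data are assumptions; the actual name in the workshop file may differ.

```r
# Toy table with a non-SNP column named "regions" (hypothetical name).
toy <- data.frame(regions = c("north", "south"),
                  snp1 = c("AA", "AG"),
                  snp2 = c("CC", "CT"),
                  row.names = c("g1", "g2"))

# Keep every column except "regions"; row names carry over automatically.
snp.cols <- setdiff(colnames(toy), "regions")
toy.snponly <- toy[, snp.cols, drop = FALSE]
toy.snponly
```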

Check the frequency of the different alleles in the first SNP column.


table(as.factor(as.character(snponly[,1])))
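The same check can be run over every column at once. The following is a minimal sketch on made-up data (the names `toy`, `counts`, and `freqs` are hypothetical); it tabulates each column with `lapply()` and also converts the counts to proportions.

```r
# Toy SNP table with one missing value.
toy <- data.frame(snp1 = c("AA", "AG", "AA", NA),
                  snp2 = c("CC", "CC", "CT", "TT"),
                  stringsAsFactors = FALSE)

# Counts per column, listing NA as its own category when present.
counts <- lapply(toy, table, useNA = "ifany")

# Proportions per column among the non-missing values.
freqs <- lapply(toy, function(col) prop.table(table(col)))

counts$snp1
freqs$snp1
```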