[[mkatari-bioinformatics-august-2013|Back to Manny's Bioinformatics Workshop HOME]]

====== Clean SNP file ======

First save the file as a tab-delimited text file. Then read it in, explicitly setting the delimiter to tab and the header to TRUE. The first column is used as the row names.

  read.table("Draft sent to Manny.txt", sep="\t", header=TRUE, row.names=1)->draft

To count missing values, use is.na. It returns TRUE for every NA, so summing the TRUE values in each column gives the number of missing values per SNP.

  apply(is.na(draft), 2, sum) -> draft.snp.na.sum

Keep only the columns (SNPs) that have <= 7% missing data.

  draft[ ,which(draft.snp.na.sum <= 0.07*nrow(draft)) ] -> draft.goodsnps

Do the same for the genotypes (rows): count the missing values per row and keep only the rows that have <= 7% missing data.

  apply(is.na(draft.goodsnps), 1, sum) -> draft.goodsnps.na.sum
  draft.goodsnps[draft.goodsnps.na.sum<=0.07*ncol(draft.goodsnps),]->draft.goodsnps.goodgen

Remove the regions column (column 1) and keep only the SNPs (columns 2 through 1259 in this dataset).

  snponly=draft.goodsnps.goodgen[,2:1259]
  row.names(snponly)=row.names(draft.goodsnps.goodgen)

Check the frequency of the different alleles for the first SNP.

  table(as.factor(as.character(snponly[,1])))
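
The filtering steps above can be tried on a small made-up table before running them on the real file. This is only a sketch: the toy data frame, its column names (region, snp1, snp2, snp3), and the 0.5 threshold are assumptions for illustration (the real data uses 0.07), and the region column is dropped by name instead of by position.

```r
# Toy table: rows are genotypes, columns are a region label plus three SNPs.
# All names and values are made up for illustration.
toy <- data.frame(
  region = c("A", "A", "B", "B"),
  snp1   = c("A", "G", "A", NA),
  snp2   = c(NA, NA, NA, "T"),
  snp3   = c("C", "C", "T", NA),
  row.names = c("g1", "g2", "g3", "g4"),
  stringsAsFactors = FALSE
)

# Hypothetical threshold for the toy data; the page above uses 0.07.
max.missing <- 0.5

# Count NAs per column and keep columns at or below the threshold.
col.na <- apply(is.na(toy), 2, sum)
good.cols <- toy[, which(col.na <= max.missing * nrow(toy)), drop = FALSE]

# Count NAs per row and keep rows at or below the threshold.
row.na <- apply(is.na(good.cols), 1, sum)
good.rows <- good.cols[row.na <= max.missing * ncol(good.cols), , drop = FALSE]

# Drop the region column by name so no column count is hard-coded.
snps <- good.rows[, setdiff(names(good.rows), "region"), drop = FALSE]

# Allele frequencies for the first remaining SNP.
print(table(snps[, 1], useNA = "ifany"))
```

Here snp2 is dropped because three of its four values are missing, and genotype g4 is dropped because two of its three remaining columns are NA. Using drop = FALSE keeps the result a data frame even if only one column or row survives the filter.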