Running structure, CLUMPP, Distruct

The program structure is a free software package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly used genetic markers, including SNPs, microsatellites, RFLPs and AFLPs.
The CLUMPP and distruct programs are used to produce graphical displays of structure results and to compute useful summary statistics.
Structure Harvester by Dent Earl provides additional tools for visualizing Structure output.

CLUMPAK: a program for identifying clustering modes and packaging population structure inferences across K.



How to Run Structure

1. Format your data so that it looks like the data in the example files.
2. Open Structure and create a directory where your Structure files will be stored (for example: My Documents/Structure)
3. Open your data file
4. Create a parameter set by clicking on the menu: Parameter Set, New...
5. Fill in the length of the burnin period (100,000 is usually more than enough; for more information please consult the Structure manual by Pritchard et al.) and the number of MCMC reps that you want (people use varying numbers of reps for this, from 100,000 to 2,000,000 or even more).
6. I usually follow the instructions and use the admixture model for the first run, leaving the other settings at their defaults.
7. Select the correlated allele frequencies model.
8. Select "Compute probability of the data (for estimating K)".
9. Click OK and name your parameter set.
10. On the menu: Project, Start a job. Click on the parameter set that you want to run and specify which values of K you want to test (K=1 to K=the maximum number of clusters you think your data might have).
11. Number of iterations: 5-10 or more
12. Click OK and let structure run.
13. After the run completes, you can continue with Structure Harvester.
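
To get a feel for how long a job will take before starting it, the settings above can be turned into a rough count of runs and MCMC sweeps. This is only a back-of-the-envelope sketch; the function name and the example values (K up to 5, 10 iterations, 1,000,000 reps) are illustrative assumptions, not recommendations.

```python
def plan_structure_job(k_max, iterations, burnin, mcmc_reps):
    """Return (number of runs, total MCMC sweeps) for K = 1..k_max.

    Hypothetical helper: structure itself does not provide this function.
    """
    runs = k_max * iterations            # one run per K per iteration
    sweeps_per_run = burnin + mcmc_reps  # burnin sweeps are computed too
    return runs, runs * sweeps_per_run


runs, sweeps = plan_structure_job(
    k_max=5, iterations=10, burnin=100_000, mcmc_reps=1_000_000
)
print(f"{runs} runs, {sweeps:,} sweeps in total")
# prints: 50 runs, 55,000,000 sweeps in total
```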

How to run Structure Harvester:

1. Go to the Results folder from your Structure results (for example: My Doc/Structure/ParameterName/Results)
2. Zip the Results folder. (Results.zip)
3. Upload that to Structure Harvester http://taylor0.biology.ucla.edu/structureHarvester/
4. Harvest!
5. Download the harvester output files.
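
If you prefer not to zip the folder by hand, step 2 can be done with a few lines of Python. This is a minimal sketch using only the standard library; the function name is made up for illustration, and the folder path is whatever your own Results folder is.

```python
import shutil
from pathlib import Path


def zip_results(results_dir):
    """Zip a structure Results folder; return the path of the new archive.

    The archive is written next to the folder (e.g. .../Results.zip) and
    keeps the top-level folder name inside the zip.
    """
    results_dir = Path(results_dir)
    archive = shutil.make_archive(
        str(results_dir),               # archive path without ".zip"
        "zip",
        root_dir=results_dir.parent,    # paths in the zip are relative to this
        base_dir=results_dir.name,      # only this folder is archived
    )
    return Path(archive)


# Example call (adjust the path to your own project):
# zip_results("My Doc/Structure/ParameterName/Results")
```

The resulting Results.zip is the file to upload in step 3.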

How to run CLUMPP:

1. Use the delta K method of Evanno et al. (2005) to identify your K.
2. Based on that K (for example K=2), take the indfile for that specific K (for K=2, take K2.indfile; if your K is 5, take K5.indfile) and move it to a new folder called (for example) FolderA
3. Edit the paramfile from the example files downloaded with the CLUMPP package: set Datatype to 0 and revise everything else accordingly (for example, change the indfile name to K2.indfile and so on). Move/copy this file to FolderA
4. Also copy/move CLUMPP into FolderA, so that the folder contains: CLUMPP, K2.indfile and the edited paramfile
5. Open Terminal, change the directory to where your FolderA is located
6. Then type ./clumpp paramfile
7. CLUMPP will produce several files in the folder. Take the output file (for example arabid.outfile) and rename it to arabid.indivq
8. Repeat steps 2-7 for K2.popfile, changing Datatype in the paramfile to 1 and putting those 3 files in FolderB; take the output file (for example: arabid.outfile) and rename it to arabid.popq
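
The copying and renaming in the steps above is easy to get wrong when it is repeated for the indfile and the popfile, so it can help to script it. The sketch below only handles the file shuffling (steps 2-4 and 7); CLUMPP itself is still run from the Terminal as in step 6. The function names are hypothetical, and the file names follow the examples used above.

```python
import shutil
from pathlib import Path


def stage_clumpp_inputs(harvester_dir, paramfile, work_dir, k, kind="indfile"):
    """Copy K<k>.<kind> and the edited paramfile into work_dir.

    kind is "indfile" (Datatype 0) or "popfile" (Datatype 1).
    Returns the path of the copied data file.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    data_file = Path(harvester_dir) / f"K{k}.{kind}"
    shutil.copy(data_file, work_dir / data_file.name)
    shutil.copy(paramfile, work_dir / "paramfile")
    return work_dir / data_file.name


def rename_clumpp_output(work_dir, outfile, new_suffix):
    """Rename e.g. arabid.outfile to arabid.indivq (new_suffix=".indivq")."""
    src = Path(work_dir) / outfile
    dst = src.with_suffix(new_suffix)
    src.rename(dst)
    return dst


# Example calls (paths are placeholders for your own folders):
# stage_clumpp_inputs("harvester_output", "paramfile_ind", "FolderA", k=2)
# ... run ./clumpp paramfile inside FolderA ...
# rename_clumpp_output("FolderA", "arabid.outfile", ".indivq")
```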

How to run Distruct:

1. Take your indivq and popq files (for example: arabid.indivq and arabid.popq) and move them to FolderC
2. Edit the drawparams file from the Distruct package accordingly. Create your arabid.names and arabid.perm files
3. Put (at least) those 5 files, arabid.indivq, arabid.popq, arabid.names, arabid.perm and drawparams, in FolderC together with distruct.
4. Run distruct. (For me, distruct does not work on my Mac, so I have to use a PC to run it: I just click on the Windows executable file and distruct produces the output.ps file.)
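
Before running distruct it is worth checking that all five input files from step 3 are actually in the folder, since a missing file is an easy mistake to make. A minimal sketch, assuming the example file names used above (the function name is made up for illustration):

```python
from pathlib import Path

# File names taken from the arabid example above; change them to match yours.
REQUIRED = [
    "arabid.indivq",
    "arabid.popq",
    "arabid.names",
    "arabid.perm",
    "drawparams",
]


def missing_distruct_inputs(folder, required=REQUIRED):
    """Return the names of required input files that are not in folder."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]


# Example call: an empty list means everything is in place.
# print(missing_distruct_inputs("FolderC"))
```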

How to use CLUMPAK Pipeline

CLUMPAK aids users in automating the process of analyzing the results of genotype clustering programs such as STRUCTURE. CLUMPAK separates groups of runs representing distinct solutions, and identifies an optimal cluster label alignment across different values of K, simplifying the comparison of clustering results across K. In addition, CLUMPAK implements a method for the identification of a preferred choice of K, and a comparison test for solutions obtained by different programs, models, or subsets of data.

Aims

CLUMPAK was designed to aid users in four main objectives:
(1) Separate distinct solutions obtained from STRUCTURE-like programs.
(2) Compare and align solutions obtained for different K values.
(3) Compare results obtained using different models/data subsets/programs.
(4) Indicate the preferred value of K according to Evanno et al.
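
For readers who want to see what the Evanno et al. criterion in objective (4) amounts to, here is a small sketch of the delta K calculation: the absolute second-order difference of the mean Ln P(D) across K, divided by the standard deviation of Ln P(D) over the replicate runs at that K. This is one straightforward reading of the published formula, not CLUMPAK's or Structure Harvester's own code; the use of the sample standard deviation is an assumption, and the numbers in the example are invented.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Compute delta K from replicate Ln P(D) values.

    lnp_by_k maps K -> list of Ln P(D) values, one per replicate run.
    Delta K is only defined where both K-1 and K+1 were also run, so the
    smallest and largest K are never in the result.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            sd = stdev(lnp_by_k[k])  # sample SD over the replicate runs
            if sd > 0:               # skip K where all runs were identical
                result[k] = second_diff / sd
    return result


# Invented Ln P(D) values, two replicate runs per K, for illustration only.
example = {
    1: [-1000.0, -1002.0],
    2: [-900.0, -902.0],
    3: [-895.0, -897.0],
    4: [-893.0, -895.0],
}
scores = delta_k(example)
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

In this made-up example the likelihood jumps sharply between K=1 and K=2 and then levels off, so delta K peaks at K=2.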

CLUMPAK offers four modes of action – the main pipeline, DISTRUCT for many K’s, Compare, and Best K by Evanno.

To run CLUMPAK, you must upload your structure results folder as a zipped file. For the main pipeline, ‘DISTRUCT for many K’s’, and ‘Compare’ features, there are a number of other optional input files:
  • labels_file - contains text labels for populations (see toy_data_labels.txt). This file is optional, and affects only the graphical representation of the results. If provided, the order of populations in the produced figure will reflect their order in the labels_file, and the labels will be used below the figure. If it is not provided, population codes will be extracted from the results files.

  • colors_file - contains colors to be used in the produced figures (see colors_file.txt). Colors recognized by CLUMPAK are those that are recognized by DISTRUCT; please consult the DISTRUCT manual for a full list of colors. Colors should be numbered and ordered, such that the number of colors is equal to or larger than the largest K value in the result files.

  • Drawparams file - conforms to the format of DISTRUCT’s drawparams file (see drawparams.txt). Please consult the DISTRUCT manual for additional details.
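
The requirement on the colors_file (numbered, ordered, and at least as many colors as the largest K) can be checked before uploading. The sketch below assumes that each non-blank line of the file holds a number followed by a color name, for example "1 orange"; that layout is an assumption made here for illustration, so compare it against the colors_file.txt example that comes with CLUMPAK before relying on it.

```python
def check_colors(lines, max_k):
    """Return True if the colors are numbered 1, 2, 3, ... in order and
    there are at least max_k of them.

    lines: iterable of text lines, each assumed to look like "1 orange".
    """
    entries = [line.split() for line in lines if line.strip()]
    numbers = [int(parts[0]) for parts in entries]
    ordered = numbers == list(range(1, len(numbers) + 1))
    enough = len(numbers) >= max_k
    return ordered and enough


# Example with three colors and results up to K=3.
sample = ["1 orange", "2 blue", "3 yellow"]
print(check_colors(sample, max_k=3))
# prints: True
```

To check a real file, pass the open file object instead of the list, for example check_colors(open("colors_file.txt"), max_k=5).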