Software for Analyzing Population Genetic Data

Adam Porter lab page

Here are some programs I wrote for analyzing data from real and simulated populations, most of which have shown up in my publications.

Right now there are three programs, some supporting algorithms, and the code library I use.

SexLinkedFstats

This program calculates Wright's F-statistics for sex-linked loci and haplodiploid organisms. It handles both codominant and dominant markers, and uses both least-squares and restricted maximum likelihood methods for each. You can use the program to

1) calculate F-statistics from a data file you supply;

2) make up a data file of sex-linked data with specified F-statistics;

3) run a simulation to try out the effects of different sampling patterns on sex-linked/haplodiploid F-statistics.

The underlying population-genetic and statistical theory for this program is in manuscript form:

Porter, AH. (MS submitted.) F-statistics for sex-linked and haplodiploid codominant and dominant markers: models and estimators.

Please email me if you need to see a copy prior to publication.

The file you want to check should be in the same folder as the program;

The file you want to check needs a filename with no spaces in it.

SexLinkedFstats MacOS 8-10.x (Carbon)

SexLinkedFstats Windows (not compiled yet)

Sample data set Use this to see the data formatting

IslandModelTest

This program implements a parametric bootstrap to determine whether genotypic data show patterns that deviate significantly from Wright's (1931, 1969) neutral infinite-island model. Under this model, allele frequencies in different populations settle into an equilibrium shape described by a multivariate beta-distribution. The parameters of this beta-distribution are the allele frequencies of the pooled set of populations (Q), and the standardized variance in allele frequencies (Fst). Random samples of allele frequencies are repeatedly drawn from this beta-distribution to generate the null distributions of Fst for each locus. The observed Fst scores for each locus are compared to this null distribution to see if they fall outside the 95% confidence limits. The analysis will therefore reveal which loci show significant deviations from the average, multilocus pattern.
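As a rough illustration of the final comparison step (a sketch only, not the program's actual code), the check of one locus against its null distribution could look like the following. The function names, the linear-interpolation percentile, and the two-tailed limits are assumptions made for the example.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    if not sorted_values:
        raise ValueError("need at least one value")
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def outside_null_limits(observed_fst, null_fsts, level=0.95):
    """Return (is_outside, low_limit, high_limit) for one locus.

    null_fsts holds the Fst values simulated under the null model;
    the limits are the central `level` fraction of that distribution.
    """
    ordered = sorted(null_fsts)
    tail = (1 - level) / 2
    low = percentile(ordered, tail)
    high = percentile(ordered, 1 - tail)
    return (observed_fst < low or observed_fst > high), low, high
```

A locus is flagged when the first element of the returned tuple is true; the limits are returned as well so they can be reported alongside the observed value.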

The null distribution includes sampling variation from two sources, the sampling pattern of the original data (the numbers of populations and individuals sampled) and the estimation error around the parameters used to create the beta-distribution, namely the observed values of Q, Fst and Fis. To handle the first source of error, the parametric beta-distribution is resampled following the same sampling pattern of the original data. The expected genotype frequencies are constructed from these allele frequencies using Fis, the within-population inbreeding coefficient, because this also influences sampling variation. To handle the second source of error, each replicate used to build the null distribution is generated from a different realization of the beta-distribution, created using unique values of Q, Fst and Fis. These values are obtained in turn by calculating them from a standard bootstrap sample from the original data set (resampling populations, then individuals within populations).
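The standard bootstrap mentioned at the end of this paragraph (resampling populations, then individuals within populations) can be sketched as below. This is a minimal illustration under stated assumptions: the data layout (a list of populations, each a list of individuals) and the choice to keep each resampled population at its original sample size are assumptions, not details taken from the program.

```python
import random


def bootstrap_sample(populations, rng):
    """Two-level bootstrap: populations first, then individuals.

    populations: list of lists; each inner list holds the individuals
    (genotypes, in any representation) sampled from one population.
    """
    resampled = []
    for _ in range(len(populations)):
        # Level 1: pick a population at random, with replacement.
        population = rng.choice(populations)
        # Level 2: redraw its individuals with replacement, keeping
        # the original sample size of that population.
        resampled.append([rng.choice(population) for _ in range(len(population))])
    return resampled
```

Each call yields one bootstrap data set, from which values such as Q, Fst and Fis would then be recalculated.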

Finding significance means that one or more of the island-model assumptions is violated, but it doesn't tell you which. It could be because the deviant locus is under natural selection, or it could be that the migration patterns are different from the island model's, or for a host of other reasons. It could even be that most loci are under similar selection regimes but the deviant locus is neutral! Independent studies are needed to properly ascribe a cause.

The program takes data in my own format (very similar to Swofford's BIOSYS) or Arlequin format. It supports only diploid genotypic data with codominant loci. Soon I will put out versions for haploid and for dominant data (such as RFLP or AFLP). Please lower any expectations you have about the convenience of the user interface!

Fixed in version 0.5: An interface bug that kept users from opting for a bootstrap resampling scheme when calculating the null Fst distributions. The bootstrap scheme is now a default setting.

Please note that this is an alpha-release, and you may encounter bugs, especially in the interface. In my experience, most bugs arise when users present the program with data having idiosyncrasies that I didn't expect. I would appreciate feedback before you give up on it!

IslandModelTest Carbon v0.5 (Macintosh)

IslandModelTest Win v0.5 (Windows)

IslandModelTest -- User's Guide (you'll be lost without it!)

IslandModelTest -- sample data (randomly generated, so no significant effects will be found). If you can't get this to read properly, it's quite possibly an issue involving end-of-line characters in text files, which differ on different platforms. Try either:

Running the file through my little reformatting program, WinMacNewLineFormatter, supplied below.

Opening the file in BBEdit and saving in the format of your favorite platform.

Source code (also requires AdamLibraries, below)

Supporting publication (PDF format)

Contact me if you can't get these to download properly.

Algorithm elements:

IslandModelTest relies on two main algorithmic components, one that draws allele frequencies from the island model's null distribution, and one that draws individual genotypes from an expected genotypic distribution.

IslandModelRandomAlleles provides random draws of allele frequencies from a multi-allele beta distribution (= Dirichlet distribution), given Fst and a list of expected allele frequencies.

IslandModelRandomAlleles (Macintosh)

IslandModelRandomAlleles (Windows)
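For readers who want to see the idea in code, here is a minimal sketch of a Dirichlet draw of this kind. It is not the downloadable program's code. It assumes the common parameterization in which allele i has Dirichlet parameter p_i(1 - Fst)/Fst, so that the variance of its frequency among populations is Fst * p_i(1 - p_i), and it uses the standard trick of normalizing independent gamma draws. The function name is hypothetical.

```python
import random


def draw_island_model_frequencies(mean_freqs, fst, rng):
    """One random set of allele frequencies for a single population.

    mean_freqs: expected (pooled) allele frequencies, summing to 1.
    fst: standardized variance in allele frequencies, 0 < fst < 1.
    """
    if not 0 < fst < 1:
        raise ValueError("fst must be strictly between 0 and 1")
    if any(p <= 0 for p in mean_freqs):
        raise ValueError("all expected allele frequencies must be positive")
    if abs(sum(mean_freqs) - 1) > 1e-9:
        raise ValueError("expected allele frequencies must sum to 1")
    # Assumed parameterization: alpha_i = p_i * (1 - Fst) / Fst.
    scale = (1 - fst) / fst
    gammas = [rng.gammavariate(p * scale, 1.0) for p in mean_freqs]
    total = sum(gammas)
    # Normalized gamma variates follow a Dirichlet distribution.
    return [g / total for g in gammas]
```

Calling the function once per population, with the same `mean_freqs` and `fst`, gives one simulated set of populations.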

ExpectedGenotype provides expected diploid genotype distributions, given a list of allele frequencies and an Fis value. Although Fis can range from -1 to 1, negative Fis values are actually constrained to be above Fis = -1 when allele frequencies are unequal (otherwise negative genotype frequencies can be returned). This algorithm takes the constraints into account. It is described in the appendix of the Molecular Ecology paper.

ExpectedGenotype (Macintosh)

ExpectedGenotype (Windows)
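The textbook version of this calculation can be sketched as follows. The actual algorithm is the one in the appendix of the Molecular Ecology paper; this sketch only assumes the usual formulas (homozygotes p_i^2 + Fis * p_i(1 - p_i), heterozygotes 2 p_i p_j (1 - Fis)) and handles the constraint by raising Fis to the lowest value that keeps every homozygote frequency non-negative. That clamping rule and the function names are assumptions.

```python
def min_fis(freqs):
    """Most negative Fis that keeps every homozygote frequency >= 0.

    From p^2 + Fis * p * (1 - p) >= 0, each allele requires
    Fis >= -p / (1 - p); the rarest allele is the binding one.
    """
    return max((-p / (1 - p) for p in freqs if 0 < p < 1), default=-1.0)


def expected_genotypes(freqs, fis):
    """Expected diploid genotype frequencies, keyed by allele-index pairs."""
    fis = max(fis, min_fis(freqs))  # assumed handling: clamp to the bound
    genotypes = {}
    for i, p in enumerate(freqs):
        genotypes[(i, i)] = p * p + fis * p * (1 - p)
        for j in range(i + 1, len(freqs)):
            genotypes[(i, j)] = 2 * p * freqs[j] * (1 - fis)
    return genotypes
```

With two alleles at equal frequency the bound is exactly -1; with unequal frequencies the bound is set by the rarer allele and lies above -1.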

ClineFit

This program fits genotypic data to equilibrium cline models developed by Nick Barton. It uses a numerical maximum likelihood algorithm (an MCMC method aka a Metropolis-Hastings algorithm aka a biased random walk), and returns maximum-likelihood estimates and 2-unit support limits. It supports diploid, haplodiploid or sex-linked genotypic data using codominant loci.
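To make the terms concrete, here is a toy sketch of a Metropolis-style random walk and of 2-unit support limits. It is not ClineFit's code: it fits a single allele frequency to binomial counts rather than a cline, and the step size, starting value and function names are all assumptions. The support limits are taken as the range of visited parameter values whose log-likelihood lies within 2 units of the best one found.

```python
import math
import random


def log_likelihood(p, successes, trials):
    """Binomial log-likelihood (constant term dropped) for frequency p."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def metropolis_fit(successes, trials, steps=20000, step_size=0.05, seed=0):
    """Random-walk search; returns (estimate, lower_limit, upper_limit)."""
    rng = random.Random(seed)
    current = 0.5
    current_ll = log_likelihood(current, successes, trials)
    visited = [(current, current_ll)]
    for _ in range(steps):
        proposal = current + rng.uniform(-step_size, step_size)
        if not 0 < proposal < 1:
            continue  # reject proposals outside the parameter range
        proposal_ll = log_likelihood(proposal, successes, trials)
        # Always accept uphill moves; accept downhill moves with
        # probability equal to the likelihood ratio.
        if proposal_ll >= current_ll or rng.random() < math.exp(proposal_ll - current_ll):
            current, current_ll = proposal, proposal_ll
            visited.append((current, current_ll))
    best_p, best_ll = max(visited, key=lambda item: item[1])
    # 2-unit support limits: extremes of the values within 2 lnL units.
    supported = [p for p, ll in visited if ll >= best_ll - 2.0]
    return best_p, min(supported), max(supported)
```

For example, `metropolis_fit(30, 100)` returns an estimate near the analytical maximum of 0.3, bracketed by its lower and upper support limits.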

ClineFit takes data in my own format (very similar to Swofford's BIOSYS) or Arlequin format. It requires that the location of each population sample be a number, placed as the last piece of information in the population's name. It also requires that you identify, as part of the name of each locus, the alleles that will be most frequent on the right side of the cline.

ClineFit gives you considerable flexibility in determining the models that you can use. You can fit:

--> clines with 2, 4, 6 & 8 primary shape parameters, including:

- center & width
- 4 parameters describing introgression tails on either side of the cline
- 2 parameters describing frequencies of asymptotic polymorphisms on each side of the cline, if they are not fixed.

--> single or multiple markers
--> sex linkage and haplodiploidy
--> disequilibrium estimates, from which dispersal and selection estimates are obtained in 6- & 8-parameter models
--> models that omit parameters of your choice, such as the introgression tail on one side of the cline
--> models that omit such parameters for some traits but not others, under your control
--> models that combine parameters in ways that you control. For example, you can estimate a single center for all traits, or a unique center for each trait, or any combination in between.

This last feature gives you considerable flexibility for hypothesis testing. For example, you can determine if one trait has a unique center (or other parameter value) by first estimating the shape with that trait's center estimated as unique, then estimating the model with that trait's center co-estimated with the remaining traits. That trait has a significantly unique center (at level alpha) if twice the difference of the log-likelihoods of these two estimates is greater than the tabled value for level alpha in a chi-square distribution with 1 degree of freedom. Generally, the degrees of freedom is the difference in the number of parameters estimated in fitting the two models. The motivation for these sorts of likelihood tests is well described in Hilborn & Mangel (1997), The Ecological Detective (Princeton Monographs).
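The one-degree-of-freedom test described above can be sketched in a few lines. This is an illustration, not part of ClineFit; it assumes the two inputs are log-likelihoods of nested models that differ by exactly one parameter, and the function name is hypothetical. For 1 degree of freedom the chi-square tail probability has the closed form erfc(sqrt(x/2)), so no statistical table or library is needed.

```python
import math


def likelihood_ratio_test_1df(lnl_full, lnl_reduced, alpha=0.05):
    """Compare two nested models that differ by one parameter.

    lnl_full: log-likelihood with the parameter estimated separately.
    lnl_reduced: log-likelihood with the parameter co-estimated.
    Returns (statistic, p_value, is_significant).
    """
    # Small negative values can arise from numerical error; treat as 0.
    statistic = max(0.0, 2 * (lnl_full - lnl_reduced))
    # Upper-tail probability of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, p_value < alpha
```

Models that differ by more than one parameter need the chi-square distribution with the matching degrees of freedom, which this sketch does not cover.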

If you want to measure only cline shape without underlying dispersal and selection parameters, you can also fit clines of
--> cytoplasmic markers (such as mitochondrial or chloroplast loci)
--> dominant markers (such as RFLP or AFLP)

ClineFit uses the method published in Evolution (Porter et al. 1997), which in turn follows Barton's methods very closely. Its main difference is that it fits genotypes to the cline shape directly, rather than fitting transformed data to a linearized model; this isn't such a big difference. I hope to put out versions for quantitative traits, dominant data (such as RFLP or AFLP), and analyses that incorporate cytonuclear disequilibrium.
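To show what fitting to the cline shape directly means, here is a minimal sketch of the simplest case, a 2-parameter cline with only a center and a width. It is not ClineFit's code. It assumes the tanh form that is standard in the hybrid-zone literature, p(x) = [1 + tanh(2(x - c)/w)]/2, where the width w is the inverse of the maximum slope, and it assumes allele counts are binomially distributed at each location. Introgression tails, asymptotic polymorphisms, Fis and disequilibrium are all left out, and the function names are hypothetical.

```python
import math


def cline_frequency(x, center, width):
    """Expected frequency of the right-side allele at location x."""
    return 0.5 * (1 + math.tanh(2 * (x - center) / width))


def cline_log_likelihood(samples, center, width):
    """Binomial log-likelihood of allele counts under the cline.

    samples: list of (location, count_of_right_side_allele, total_alleles).
    """
    total = 0.0
    for location, count, n in samples:
        p = cline_frequency(location, center, width)
        # Keep p away from 0 and 1 so the logarithms stay finite.
        p = min(max(p, 1e-12), 1 - 1e-12)
        total += count * math.log(p) + (n - count) * math.log(1 - p)
    return total
```

A fitting routine would search over `center` and `width` for the pair that maximizes `cline_log_likelihood`.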

Please lower any expectations you have about the convenience of the user interface! If your data format deviates even slightly from the sample data, you might well run into error messages, crashes or plain nonsense.

ClineFit is an alpha release (really pre-alpha). I've used the main algorithms for a while now, but the interface may still have bugs. It's possible that new data sets with extreme conditions will uncover inconsistencies that I haven't anticipated in the numerical estimators. It's a null hypothesis that any program is bug-free.

ClineFit_v0.2 MacOS 8-10.x (Carbon)

ClineFit_v0.2 Windows

sample cline allozyme data Use this data format. More on the formatting can be found above in the user guide for IslandModelTest. If you can't get this to read properly after combing it for formatting inconsistencies, it's likely an issue involving end-of-line characters in text files, which differ on different platforms. Try opening the file in BBEdit and saving in the format of your favorite platform.

Source code - v0.2. They rely on AdamLibraries20081116

User Guide -- not available yet. One thing: put your data file into a folder named ClineFitFiles. Then, the ClineFit program should be in the same directory as ClineFitFiles.

Obsolete versions:
ClineFit_v0.1 - has several interface bugs, and small rounding errors in the last digit of the output (the internal calculations were unaffected).

WinMacNewLineFormatter

This program is obsolete. Use BBEdit for this.

This text is just an archive:

This program fixes a niggling problem that arises periodically when downloading text files created on different platforms (Mac vs. Windows or Unix). The core issue is that these platforms use different conventions for file formats. On most operating systems, each line ends with a <newline> character denoted \n (backslash-n), which is not printed to the screen or page when the file is viewed. However, in MacOS, the end-of-line character is a <return>, denoted \r (backslash-r). Most internet protocols correct for this when files are transferred among operating systems, but occasionally there are problems. Since the characters are invisible, these problems are really hard to discover unless the file is read in a program (such as a debugger) that prints these characters to the screen.

This little program may clean up these problems, formatting files explicitly for the operating system of choice no matter what the current end-of-line formatting is. Run the program and choose the options you want. There are a few constraints, since I don't specialize in programming interfaces. If it doesn't work, then try pasting a text file into an email program, sending it to yourself, and then pasting it back into your computer into a new file.

The file you want to check should be in the same folder as the program;

The file you want to check needs a filename with no spaces in it.

WinMacNewLineFormatter MacOS 8-10.x (Carbon)

WinMacNewLineFormatter Windows
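For anyone who wants to do the same conversion by hand, here is a short sketch of the idea. It is not the program's code, and the function name is hypothetical. It works on raw bytes so that no platform-dependent newline translation interferes, first unifying every line ending to \n and then writing the requested convention.

```python
def convert_line_endings(data, target="\n"):
    """Return `data` (bytes) with every line ending replaced by `target`.

    target: "\n" (Unix), "\r\n" (Windows) or "\r" (classic MacOS).
    """
    if target not in ("\n", "\r\n", "\r"):
        raise ValueError("target must be '\\n', '\\r\\n' or '\\r'")
    # Replace \r\n first so the pair is not counted as two line endings.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", target.encode("ascii"))
```

To convert a file, read it with `open(path, "rb")`, pass the bytes through the function, and write the result back with `open(path, "wb")`.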

AdamLibraries

These are the core algorithms of my analyses and simulations, written in C++. I compile them using MetroWerks CodeWarrior, which is unfortunately no longer supported. With work, they are portable to the GCC compiler, and presumably others as well. Eventually, as CodeWarrior begins to fail on the newer operating systems, I'll have to switch too.

I wrote most of these algorithms myself, but some are based on public domain or restricted-use code. MemoryManager and the cumulative distribution functions are not mine. All my classes and algorithms are copyrighted, and you may not use them without my permission. I will provide prompt written permission for most non-profit uses. But, you have to ask.

I update these libraries occasionally, whenever I develop a new algorithm that I think I might want to re-use later, and whenever I find a bug or undesirable feature.

AdamLibraries 20081106

If you find a bug, please let me know!

Updated: 10 February 2010, A. Porter