Genomap 1.0 Perl Utility

Steve DiFazio

Oak Ridge National Laboratory

difazios@ornl.gov

Revised July 10, 2003

Introduction to PERL (skip if you know this stuff)

This program is written in the PERL (Practical Extraction and Report Language) scripting language. Each time the program is run, the PERL interpreter is automatically invoked; it compiles the script into an internal form and executes it. Because of this, the program will work equally well on any operating system as long as the proper PERL interpreter is installed. Fortunately, the interpreter is available for free!

Instructions for Installing The PERL Interpreter

UNIX/Linux

Unix and Linux systems should already have PERL installed. You may have to change the first line of the program to match the location of your PERL executable. Usually you can find this by typing

which perl

(in the tcsh shell, where perl also works).

Then change the line

#! /usr/local/bin/perl

to match the location of your executable. If by any chance PERL isn’t on your system, just follow the link under the Windows section and you’ll find it easily enough.

Windows

Go to the following web site and follow the directions to download the program (~9 MB):

http://www.activestate.com/Products/Download/Get.plex?id=ActivePerl

If you’re using Windows 2000 or XP, you should be able to install the ‘windows MSI’ version directly. If you’re using an earlier version of Windows, you may need to download and install the ‘windows installer’ first.

After downloading, double-click the file and the installation should be self-explanatory!

PERL will install and run completely transparently (i.e., no user interface).

Genomap Introduction

The genomap perl utility processes tabular output files from ABI Genotyper software and prepares them for genetic mapping using Mapmaker and other mapping software. The program works equally well with any type of molecular marker data, including AFLP, STS, RAPD, and SSR. The program performs the following functions:

Combines data from an unlimited number of Genotyper files
Creates various output formats:

Mapmaker: Ready for input to the Mapmaker program. Has a separate line for each allele and indicates presence/absence for each sample.
Genotype: Very useful for reviewing genotypes and discovering errors. Has a separate line for each locus and indicates the genotype of each sample using a single letter for each allele. Serves as input for utilities to convert to other formats (e.g., Joinmap).
Haplotype: Much like Mapmaker output, but contains 0’s and 1’s to indicate presence and absence of alleles, and indicates possible aneuploidy and nonparental alleles.
Distance: A Mapmaker input file that contains only one allele per locus. Useful for calculations of map distance.

Summarizes genotype data and identifies genotypes that may require further investigation:

Identifies alleles with segregation distortion
Lists suspected nulls by locus
Determines parental and species origins of alleles
Summarizes data by parent and species
Lists progeny that lack an allele from one or more parents
Lists progeny with more than two alleles
Lists progeny with nonparental alleles
Identifies missing data due to failed amplification
Compares repeated samples and flags those with different genotypes

Creating Data Files Using ABI Genotyper Software

Genotyper data should be scored using grouped categories to indicate alleles (i.e., ‘group’ indicates the locus, and alleles have unique category names within the group). Under the table options, make sure that the ‘low signal’ threshold is set to a very low number (e.g., 1). This is how the program distinguishes failed amplifications from double-nulls, a very important distinction. You can also enter ‘M’ in one of the ‘Peak’ columns to indicate missing data or failed amplifications. If you simply leave the ‘Peak’ columns blank, the sample will be scored as a double-null. Do not use ‘M’ as an allele name, because every sample carrying it will be scored as missing! All other allele names are fair game, and there is no length limitation. Just be sure to be consistent!

The data must be saved as tab-delimited text files using the ‘Table’ function. See the file input.dat for an example.

The following fields MUST be present. Fields can appear in any order, and extraneous fields will not cause problems.

File Name

Sample Info
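Before running genomap, it can save time to confirm that a table file actually contains the required columns. The following is a minimal sketch in Python (it is not part of genomap, which is written in PERL). It checks only the two field names listed above; the function name and the example table are made up for illustration.

```python
import csv
import io

# Only the field names listed in this manual; extend as needed.
REQUIRED_FIELDS = ("File Name", "Sample Info")

def missing_fields(table_text, required=REQUIRED_FIELDS):
    """Return the required column names absent from a tab-delimited table."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader, [])  # first row holds the column names
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]

# Hypothetical table: fields may appear in any order, extras are harmless.
example = "Sample Info\tFile Name\tExtra\nprogeny1\trun1.fsa\tx\n"
print(missing_fields(example))        # []
print(missing_fields("Extra\nx\n"))   # ['File Name', 'Sample Info']
```

To check a real file, read its contents with open(name).read() and pass the text to missing_fields.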

Genomap Configuration File

The file genomap.cfg must be present in the directory from which the program is being executed. This file contains information about the characteristics of your data, and provides a convenient means to alter the way genomap processes your data. The sample file provides the format and explanations of each entry. Comments are delineated by # at the beginning of lines. This is also a convenient way to disable features for different runs of the program. The components of this file are:

Sample Names: You must specify the names of the mother and father in the pedigree. The program will not function without genotypes for the mother and father for each locus. The names must be consistent in all files, though the program is case-insensitive.

Optionally, you can also specify the name of a blank sample as a negative control. The program will ignore these genotypes, though it will output a list of all blanks that have scored alleles.

The program also allows inclusion of grandparent genotypes for the purpose of tracing the species origin of alleles in interspecific crosses. These names are specified here as well.

All sample names other than those specified in genomap.cfg are considered to be progeny of the mother and father.

Parental Species (optional): If an interspecific cross is involved, you can specify the species identity of the mother, father, paternal grandparents, and/or maternal grandparents to facilitate tracing the species origin of alleles. Species origins are traced as far as the data provided allow. For example, for a testcross between an F1 hybrid mother and a pure father, quite a few species origins can be determined if only one maternal grandparent is provided. Paternal grandparents are not necessary in this case. Each species designation must consist of a single letter; these letters are then combined to indicate hybrids (parental crosses, backcrosses, and intercrosses are labeled equivalently with one letter from each species).

Exclusions: Any combination of individuals, loci, or alleles can be excluded from the output according to specifications in the configuration file. For example, a particularly problematic locus or a sample with a high rate of amplification failure can be excluded. This obviates the need to alter large numbers of input files and provides great flexibility in performing analyses such as bootstraps or jackknifes. It also provides a convenient way to construct maps for alleles originating from a particular species. Note that exclusion of individual alleles is only possible for Mapmaker or Distance output, because these are the only contexts in which this makes sense.

Preparing List of Files to Process

The names of all of the data files that you want to process should be saved in another file. The easiest way to do this is

Move all of the files to one directory (let’s call it datadir)
Open a command prompt (DOS window, MS-DOS prompt). This is usually found under

Start->Programs->Accessories

Change to the directory containing the files:

cd c:\datadir

Copy the directory list to a file:

dir *.dat > files.txt

(or ls *.dat > files.txt for Unix or Linux)

In practice you can replace ‘datadir’ and ‘files.txt’ with names of your choosing.

See ‘files.txt’ as an example of an input file. If this doesn’t work, or if you prefer, you can also manually create the input file by simply placing the name of the files to be processed on separate lines. If you want to run the program in a different directory from the location of the data files, include the full path with each file name. Also, remember that the filenames shouldn’t have any spaces, and they must end in “.dat”.
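If the dir or ls approach gives you trouble, the list can also be generated with a short script. This is a sketch in Python rather than part of genomap; the function name write_file_list and the example file names are hypothetical. It writes one name per line and enforces the two rules stated above (no spaces, names ending in “.dat”).

```python
import os
import tempfile

def write_file_list(data_dir, list_name="files.txt"):
    """Write the names of all .dat files in data_dir to list_name, one per line."""
    names = sorted(n for n in os.listdir(data_dir) if n.endswith(".dat"))
    bad = [n for n in names if " " in n]
    if bad:
        raise ValueError("file names must not contain spaces: " + ", ".join(bad))
    with open(os.path.join(data_dir, list_name), "w") as handle:
        for name in names:
            handle.write(name + "\n")
    return names

# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as data_dir:
    for name in ("plate1.dat", "plate2.dat", "notes.txt"):
        open(os.path.join(data_dir, name), "w").close()
    listed = write_file_list(data_dir)
    with open(os.path.join(data_dir, "files.txt")) as handle:
        contents = handle.read()

print(listed)  # ['plate1.dat', 'plate2.dat']
```

For real data, call write_file_list with the path to your own data directory. The names are written without a path, so genomap should then be run from that same directory.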

Running the Program

The program is run from a command prompt, so the first thing to do is open a DOS window (or a console if you’re using Unix or Linux).

For Unix or Linux, just make sure you put the program genomap.pl in a directory that is included in your system path, and all you’ll need to do is type genomap.pl to run the program from anywhere. Unfortunately, perl doesn’t seem to work so transparently for Windows, and as far as I can tell there’s no simple way to get the file associations to work right for all versions of Windows. Therefore, just put it in a directory called c:\bin and we’ll type the full path to it each time. Feel free to mess with getting the associations right, and you’ll be able to use it like a standard program.

Make sure the configuration file (genomap.cfg) and the input file (files.txt) are in the same directory as the data files to be processed.
Change to the directory containing the data files, input file, and configuration file

cd c:\datadir

If you type perl c:\bin\genomap.pl at the command prompt you should see:

_______________________________________________________________

Usage: perl genomap.pl -g -h -s -m -d -null=n files.txt > out.txt

where

-g provides output for genotypic mapping (ABCD genotypes for each locus),

-h provides output for haplotype mapping(1 or 0 genotypes per allele),

-m provides output in Mapmaker format

-d provides output in Mapmaker format for one allele per locus for distance

calculations

-s provides a summary of null alleles and progeny without parental alleles,

-null=n, where n is the maximum number of loci for which parental mismatches are

allowed before designating a segregating null allele. Default is 1.

<files.txt> is the name of a file containing a list of all genotyper table files to be

processed, each of which must have a ".dat" extension, and

<out.txt> is the name of the file to save the output

** You can choose unique names for these input and output files **

If the program produced the expected output, you’re ready to roll. Type

perl c:\bin\genomap.pl -s files.txt > data_summ.txt

This will produce a summary file, data_summ.txt, that you can then open with a spreadsheet program like Excel.

perl c:\bin\genomap.pl -g files.txt > data_geno.txt

produces a file with genotypes for each sample.

The format of these files is explained in detail below.

I encourage you to carefully review the summary file and genotype file, which will likely point out many genotyping errors. Correct the errors in the original data files (i.e., the table files generated by Genotyper), and continually regenerate the summary and genotype files until everything looks correct. Then it’s time to generate a Mapmaker input file:

perl c:\bin\genomap.pl -m files.txt > data_map.txt

This file contains data for the maternal, paternal, and intercross alleles, each of which should be analyzed separately in Mapmaker.
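If you regenerate these files often, the three runs can be scripted. The sketch below is in Python and is not part of genomap. It assumes the script lives at c:\bin\genomap.pl as described above, and the helper names genomap_command and run_genomap are hypothetical. It uses only the options listed in the usage message.

```python
import subprocess

GENOMAP = r"c:\bin\genomap.pl"  # location used in this manual; adjust as needed

def genomap_command(flag, list_file="files.txt", null_cutoff=None, script=GENOMAP):
    """Build the argument list for one genomap run (flag is -s, -g, -h, -m or -d)."""
    command = ["perl", script, flag]
    if null_cutoff is not None:
        command.append("-null=%d" % null_cutoff)
    command.append(list_file)
    return command

def run_genomap(flag, out_file, **kwargs):
    """Run genomap and save its standard output, like '> out_file' at the prompt."""
    with open(out_file, "w") as handle:
        subprocess.run(genomap_command(flag, **kwargs), stdout=handle, check=True)

# Print the commands for the summary, genotype, and Mapmaker runs.
for flag, out_file in (("-s", "data_summ.txt"),
                       ("-g", "data_geno.txt"),
                       ("-m", "data_map.txt")):
    print(" ".join(genomap_command(flag)), ">", out_file)
```

Calling run_genomap("-s", "data_summ.txt") from the data directory would perform the first run, provided perl and genomap.pl are installed where the sketch expects them.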

Genotype File

This file is generated to facilitate error checking and to give an easily-reviewed overview of the data.

Visible Alleles: The program assigns unique single letters to alleles for each locus (A-L).

Missing Data: The letter M designates missing data, either due to failed amplification (triggered by the ‘low signal’ warning or manually entered into the table), or to absence from the file. Note that if a sample is present in the file, it is scored as a double null when neither ‘M’ nor the low signal warning is present.

Null Alleles: Putative null alleles are designated as N in the genotype file. Presence of nulls is inferred by a lack of parental alleles in the progeny. The default threshold for declaring a null is a single mismatch between parents and progeny. If this criterion is too liberal, you may increase the cutoff by using the -null=n command line option, where n is the new cutoff value. In some cases it is unclear if an offspring contains a null or if it is homozygous (e.g., a cross between parents with genotypes AN x AB, and progeny with only allele A visible). In these cases the genotype is designated A_. In some cases the progeny contain two alleles from one parent and none from the other. In such cases a null is assumed to be present on a third chromosome, and an N is appended to the genotype. Similarly, if nulls are segregating and the parent has two visible alleles, an N is still appended to the genotype, so for example a parent might appear as ABN.

Nonparental Alleles: Alleles that are present in the progeny but absent in the parents are designated X, and an X is appended to the parental genotypes (mainly as a way of flagging the locus for further investigation). These are also flagged at the bottom of the output to facilitate identification of the anomalous progeny.

Aneuploidy: Individuals that possess more than one allele from a single parent are flagged as possible aneuploids at the bottom of the output. They can also be readily identified by having more than two letters for their genotypes.

Summary File

Segregation Analysis

This section presents a summary of observed alleles and putative nulls, and an assessment of departures from Mendelian segregation in the progeny. Expected segregation is based on parental genotypes, and presence of nulls is inferred from lack of parental alleles in the progeny. Expectations for the observed alleles are based on the number of progeny possessing the allele, whereas the expectation in the case of nulls is the number of progeny lacking parental alleles.

Allele: original allele name

Allele ID: identifier assigned by the program and used in genotype output.

Origin: indicates if the allele came from the father (Fthr), mother (Mthr), Both, or neither (Nonparental, NP).

Species: If adequate information is provided in genomap.cfg, indicates the species origin of the alleles. Single-letter designations must be provided for pure species, and combinations of these for hybrid parents. Unk indicates that information provided was inadequate to determine the origin. Both indicates that both species contributed the allele that is segregating in the cross. Shared indicates that both species possess the segregating allele, but there is inadequate information to determine which contributed it to the cross. In determining species designations, I assumed that aneuploidy was minimal to facilitate deduction of species origins in the absence of complete grandparent genotypes.

Mother: Genotype of the mother

Father: Genotype of the father

N: Number of progeny scored for that locus.

Expected: Expected number of observations of the allele, with Mendelian segregation (E)

Observed: Number of actual observations (O)

Chi-Sq: (O-E)²/E + ((N-O)-(N-E))²/(N-E) (cutoff for P=0.05: 3.84)

Mu: (p_o-p_e)/Mu_ex, where p_o=O/N, p_e=E/N, and Mu_ex=(p_e(1-p_e)/N)^(1/2) (cutoff for P=0.05: 1.96)
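As a worked example of the two statistics, here is a short Python sketch (not taken from genomap itself) that applies the Chi-Sq and Mu formulas exactly as written above. The function name and the numbers (100 progeny, 50 expected, 62 observed) are made up for illustration. With these formulas Mu squared equals Chi-Sq algebraically, so the two tests always agree.

```python
import math

def segregation_stats(n, expected, observed):
    """Return (Chi-Sq, Mu) for one allele, using the formulas in this section."""
    chi_sq = ((observed - expected) ** 2 / expected
              + ((n - observed) - (n - expected)) ** 2 / (n - expected))
    p_o = observed / n                      # observed proportion
    p_e = expected / n                      # expected proportion
    mu_ex = math.sqrt(p_e * (1 - p_e) / n)  # standard error under expectation
    mu = (p_o - p_e) / mu_ex
    return chi_sq, mu

# Hypothetical locus: 100 progeny scored, 50 expected to carry the allele,
# 62 observed to carry it.
chi_sq, mu = segregation_stats(n=100, expected=50, observed=62)
print(round(chi_sq, 2), round(mu, 2))  # 5.76 2.4
```

Here Chi-Sq is 144/50 + 144/50 = 5.76, which exceeds the 3.84 cutoff, so this hypothetical allele would be flagged for segregation distortion.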

Parental Summary: Summarizes the total number of loci analyzed, the number of alleles observed for each parent, % of loci for which parents were heterozygous, number of putative nulls originating from each parent, number of unreduced gametes (based on appearance of more than one allele from a single parent in the progeny), and number of alleles with significant segregation distortion based on the chi-squared test reported in the Segregation Analysis section. Only includes alleles that could be unambiguously assigned to one parent.

Species Summary:

Similar to the parental summary, but summarizes for alleles derived from each species in the cross, and for shared alleles and those that could not be assigned. The program will assign alleles to species even in cases where grandparent data are incomplete, as is often the case for forestry pedigrees. Unk indicates that information provided was inadequate to determine the origin. Both indicates that both species contributed the allele that is segregating in the cross. Shared indicates that both species possess the segregating allele, but there is inadequate information to determine which contributed it to the cross. In determining species designations, I assumed that aneuploidy was minimal to facilitate deduction of species origins in the absence of complete grandparent genotypes.

Aneuploidy

Lists observations of progeny that inherited more than one allele from a single parent. Names of original files are listed to facilitate correcting the data.

Nonparental Alleles

Lists alleles that appear in the progeny but do not appear in either parent.

Missing Data

Lists loci and samples for which observations are missing.

Progeny with Unexpected Lack of Parental Alleles

These progeny lacked parental alleles for loci at which no nulls were expected to be segregating. This will likely produce no output unless you change the minimum number of mismatches to be greater than the default (1). This can be done on the command line with the -null=n option, where n is the new cutoff.

Mismatching Progeny for Loci with Segregation Distortion

This is a list of loci and samples that did not match parents, but for which the number of mismatches did not match Mendelian expectations if a null was segregating. In many cases, these mismatches will be due to errors in the parental or progeny genotypes rather than true null segregation, so these should be carefully checked. This section will likely contain a great deal of output.

Repeated Samples with Different Genotypes

This section contains genotypes that don’t match between samples with the same name. Also included is the number of observations for each genotype to facilitate choosing a consensus genotype for oft-repeated, important samples (e.g., parents). Note that the program will always use the last genotype read, even if it conflicts with prior genotypes. Therefore, conflicts must be resolved in the original data files.
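The bookkeeping described here can be sketched as follows. This is a Python illustration, not the code genomap uses; the function name, the record layout (sample, locus, genotype), and the example data are assumptions. Sample names are compared case-insensitively, as described for genomap.cfg, and the last genotype read is the one reported as used.

```python
from collections import Counter, defaultdict

def repeated_sample_conflicts(records):
    """records: iterable of (sample, locus, genotype) in the order they are read."""
    counts = defaultdict(Counter)
    last = {}
    for sample, locus, genotype in records:
        key = (sample.lower(), locus)   # sample names are case-insensitive
        counts[key][genotype] += 1
        last[key] = genotype            # the last genotype read wins
    conflicts = {}
    for key, tally in counts.items():
        if len(tally) > 1:              # more than one distinct genotype seen
            conflicts[key] = {"counts": dict(tally), "used": last[key]}
    return conflicts

# Made-up data: the mother was run three times at locus L1.
records = [
    ("Mother", "L1", "AB"),
    ("mother", "L1", "AB"),
    ("MOTHER", "L1", "AC"),
    ("p1", "L1", "AA"),
]
conflicts = repeated_sample_conflicts(records)
print(conflicts)
```

In this example the mother is flagged with two observations of AB and one of AC, and AC is the genotype that would be used because it was read last. This is the situation where the original data files need correcting.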

Mapmaker, Haplotype, and Distance Outputs

Each of these outputs has a single allele on each line and information on whether the progeny possess the allele. In the case of the Haplotype output, presence and absence are indicated by 1 and 0 respectively. The respective designations are H and A in Mapmaker and Distance outputs. In addition, for Mapmaker output the second line contains the number of progeny, the number of loci, and the number of QTLs (not processed by genomap at this point). Finally, an s is appended to each locus name that begins with a number, and an * is appended to all locus names (a Mapmaker requirement). The Distance output has the same format as the Mapmaker output, but only one allele per locus is included to allow for calculation of the total map distance in centimorgans. Mapmaker output contains maternal alleles, paternal alleles, and intercross alleles. These should be saved to separate files for different analyses. The Distance output only contains maternal and paternal alleles. For all of these output files, the repulsion-phase allele is also generated, with an r appended to the end of the allele name.
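The relationship between the 1/0 and H/A codings, and the idea of a repulsion-phase allele, can be illustrated with a small Python sketch. This is not genomap's code. It assumes that the repulsion-phase allele is the presence/absence pattern with every score flipped, which is the usual convention but is not spelled out in this manual. It also assumes that any other symbol (here ‘-’ for missing data) is passed through unchanged. The allele name and scores are made up.

```python
PRESENCE_CODES = {"1": "H", "0": "A"}   # Haplotype coding -> Mapmaker coding
FLIPPED = {"1": "0", "0": "1"}

def to_mapmaker_codes(haplotype_scores):
    """Translate a string of 1/0 scores to H/A; other symbols are kept as-is."""
    return "".join(PRESENCE_CODES.get(s, s) for s in haplotype_scores)

def repulsion_allele(name, haplotype_scores):
    """Assumed definition: flip presence and absence, append r to the name."""
    flipped = "".join(FLIPPED.get(s, s) for s in haplotype_scores)
    return name + "r", flipped

scores = "1101-"                           # five progeny, the last one missing
print(to_mapmaker_codes(scores))           # HHAH-
print(repulsion_allele("loc1a", scores))   # ('loc1ar', '0010-')
```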
