Genomap 1.0 Perl Utility
Steve DiFazio
difazios@ornl.gov
Revised July 10, 2003
Introduction to PERL (skip if you know this stuff)
This program is written in the PERL (Practical Extraction and Report Language) scripting language. Each time the program is run, the PERL interpreter is automatically invoked; it compiles the script and executes it. Because of this, the program will work equally well on any operating system as long as the proper PERL interpreter is installed. Fortunately, the interpreter is available for free!
Instructions for Installing the PERL Interpreter
Unix and Linux systems should already have PERL installed. You may have to change the first line of the program to match the location of your PERL executable. Usually you can find this by typing
which perl
(in csh or tcsh, where perl also works).
Then change the line
#! /usr/local/bin/perl
to match the location of your executable. If by any chance PERL isn’t on your system, just follow the link in the Windows instructions below and you’ll find it easily enough.
Windows users should go to the following web site and follow the directions to download the program (~9 MB):
http://www.activestate.com/Products/Download/Get.plex?id=ActivePerl
If you’re using Windows 2000 or XP, you should be able to install the ‘windows MSI’ version directly. If you’re using an earlier version of Windows, you may need to download and install the ‘windows installer’ first.
After downloading, double-click the file and the installation should be self-explanatory!
PERL will install and run completely transparently (i.e., no user interface).
Genomap Introduction
The genomap perl utility processes tabular output files from ABI Genotyper software and prepares them for genetic mapping using Mapmaker and other mapping software. The program works equally well with any type of molecular marker data, including AFLP, STS, RAPD, and SSR. The program performs the following functions:
Creating Data Files Using ABI Genotyper Software
Genotyper data should be scored using grouped categories to indicate alleles (i.e., ‘group’ indicates the locus, and alleles have unique category names within the group). Under the table options, make sure that the ‘low signal’ threshold is set to a very low number (e.g., 1). This is how the program distinguishes failed amplifications from double-nulls, a very important distinction. You can also enter ‘M’ in one of the ‘Peak’ columns to indicate missing data or failed amplifications. If you simply leave the ‘Peak’ columns blank, the sample will be scored as a double-null. Do not use ‘M’ as an allele name, because any sample carrying that allele will be scored as missing! All other allele names are fair game, and there is no length limitation. Just be sure to be consistent!
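The scoring rules above (missing versus double-null versus scored alleles) can be sketched in a few lines of Python. This is only an illustration, not code from genomap.pl: the function name is hypothetical, and it assumes that any non-blank value in the ‘Low Signal’ field counts as a warning and that ‘M’ must be entered in upper case.

```python
def score_sample(peaks, low_signal=""):
    """Classify one table row from its Peak columns and Low Signal field.

    Returns "missing", "double-null", or a sorted list of allele names.
    Assumption: any non-blank Low Signal value counts as a warning.
    """
    cleaned = [p.strip() for p in peaks]
    if low_signal.strip() or "M" in cleaned:
        return "missing"          # failed amplification or manual 'M'
    alleles = sorted({p for p in cleaned if p})
    if not alleles:
        return "double-null"      # sample present, but no peaks scored
    return alleles


print(score_sample(["a1", "a2"]))      # ['a1', 'a2']
print(score_sample(["", ""]))          # double-null
print(score_sample(["M", ""]))         # missing
print(score_sample(["", ""], "Low"))   # missing
```

The third case shows why ‘M’ cannot double as an allele name: a row containing it is never read as a genotype.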
The data must be saved as tab-delimited text files using the ‘Table’ function. See the file input.dat for an example.
The following fields MUST be present. Fields can appear in any order, and extraneous fields will not cause problems.
File Name
Sample Info
Category
Peak1
Peak2
.....
Peak X (program can handle any number of peaks, which correspond to alleles)
Low Signal (for identifying unamplified samples)
The program checks the format of each input file and will generate appropriate error messages to the screen when format problems are encountered.
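If you want to check a table file yourself before running genomap, the required-field rule can be sketched as below. This is a hypothetical pre-check in Python, not the program's own validation; it assumes the first line of the table is a tab-delimited header using exactly the field names listed above, and that peak columns all start with ‘Peak’.

```python
REQUIRED = ["File Name", "Sample Info", "Category", "Low Signal"]


def missing_fields(header_line):
    """Return the required field names absent from a tab-delimited header."""
    columns = [c.strip() for c in header_line.rstrip("\r\n").split("\t")]
    missing = [name for name in REQUIRED if name not in columns]
    # At least one peak column is needed; order and extras do not matter.
    if not any(c.startswith("Peak") for c in columns):
        missing.append("Peak1")
    return missing


header = "File Name\tSample Info\tCategory\tPeak1\tPeak2\tLow Signal\n"
print(missing_fields(header))                             # []
print(missing_fields("File Name\tCategory\tPeak1\n"))     # ['Sample Info', 'Low Signal']
```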
********* The files should be saved to a single directory with filenames that contain NO SPACES and with a .dat extension. **************
Program Usage
Input files must first be prepared according to the formats specified above. The program requires a configuration file called genomap.cfg as well as input files created with ABI Genotyper, and an input file that lists the names of the genotyper files to be processed. The latter is to facilitate input using a Windows command prompt.
The file genomap.cfg must be present in the directory from which the program is being executed. This file contains information about the characteristics of your data, and provides a convenient means to alter the way genomap processes your data. The sample file provides the format and explanations of each entry. Comments are delineated by # at the beginning of lines. This is also a convenient way to disable features for different runs of the program. The components of this file are:
Sample Names: You must specify the names of the mother and father in the pedigree. The program will not function without genotypes for the mother and father for each locus. The names must be consistent in all files, though the program is case-insensitive.
Optionally, you can also specify the name of a blank sample as a negative control. The program will ignore these genotypes, though it will output a list of all blanks that have scored alleles.
The program also allows inclusion of grandparent genotypes for the purpose of tracing the species origin of alleles in interspecific crosses. These names are specified here as well.
All sample names other than those specified in genomap.cfg are considered to be progeny of the mother and father.
Parental Species (optional): If an interspecific cross is involved, you can specify the species identity of the mother, father, paternal grandparents, and/or maternal grandparents to facilitate tracing the species origin of alleles. Species origins are traced as much as possible with the amount of data that are provided. For example, for a testcross between an F1 hybrid mother and a pure father, quite a few species origins can be determined if only one maternal grandparent is provided. Paternal grandparents are not necessary in this case. Species designations must consist of a single letter; these letters are then combined to indicate hybrids (parental crosses, backcrosses, and intercrosses are labeled equivalently with one letter from each species).
Exclusions: Any combination of individuals, loci, or alleles can be excluded from the output according to specifications in the configuration file. For example, a particular problematic locus can be efficiently excluded from the analysis, or a sample with a high rate of amplification failure can be excluded. This obviates the need to alter large numbers of input files, and provides great flexibility in performing analyses such as bootstraps or jackknifes. It also provides a convenient way to construct maps for alleles originating from a particular species. Note that exclusion of individual alleles is only possible for Mapmaker or Distance output, because these are the only contexts in which this makes sense.
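The way sample names and exclusions interact can be pictured with a small Python sketch. It does not read genomap.cfg (whose exact syntax is given in the sample file, not here); the function names and the sample and locus names are hypothetical. It assumes case-insensitive matching for sample names, as described above, and exact matching for locus names.

```python
def sample_role(name, mother, father, blank=None, grandparents=()):
    """Return the pedigree role of a sample name, ignoring case."""
    key = name.strip().lower()
    if key == mother.lower():
        return "mother"
    if key == father.lower():
        return "father"
    if blank is not None and key == blank.lower():
        return "blank"
    if key in {g.lower() for g in grandparents}:
        return "grandparent"
    return "progeny"   # everything else is treated as offspring


def keep(sample, locus, excluded_samples=(), excluded_loci=()):
    """True if a (sample, locus) observation survives the exclusions."""
    if sample.lower() in {s.lower() for s in excluded_samples}:
        return False
    return locus not in excluded_loci


print(sample_role("mom01", mother="MOM01", father="DAD01"))   # mother
print(sample_role("kid17", mother="MOM01", father="DAD01"))   # progeny
print(keep("kid17", "locusA", excluded_loci={"locusA"}))      # False
```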
Preparing List of Files to Process
The names of all of the data files that you want to process should be saved in another file. The easiest way to do this is to open a command prompt (on Windows, look under Start->Programs->Accessories) and type
cd c:\datadir
dir *.dat > files.txt
(or ls *.dat > files.txt for Unix or Linux)
In practice you can replace ‘datadir’ and ‘files.txt’ with names of your choosing.
See ‘files.txt’ as an example of an input file. If this doesn’t work, or if you prefer, you can also manually create the input file by simply placing the names of the files to be processed on separate lines.
If you want to run the program in a different directory from the location of the data files, include the full path with each file name. Also, remember that the filenames shouldn’t have any spaces, and they must end in “.dat”.
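As an alternative to dir or ls, the list file can be built with a short script that also enforces the two filename rules (no spaces, “.dat” extension). The Python sketch below is one possible way to do it; the function names are hypothetical, and it writes one full path per line, which matches the manual format described above.

```python
import os


def check_names(names):
    """Split names into usable .dat files and names that contain spaces."""
    good, problems = [], []
    for name in sorted(names):
        if not name.endswith(".dat"):
            continue                 # not a Genotyper table file
        if " " in name:
            problems.append(name)    # must be renamed before processing
        else:
            good.append(name)
    return good, problems


def write_file_list(data_dir, list_path):
    """Write one full path per line and return the rejected names."""
    good, problems = check_names(os.listdir(data_dir))
    with open(list_path, "w") as out:
        for name in good:
            out.write(os.path.join(data_dir, name) + "\n")
    return problems


print(check_names(["b.dat", "a.dat", "my file.dat", "notes.txt"]))
# (['a.dat', 'b.dat'], ['my file.dat'])
```

Calling write_file_list with your data directory and a list file name of your choosing produces the list and returns any names that still need to be fixed.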
Running the Program
The program is run from a command prompt, so the first thing to do is open a DOS window (or a console if you’re using Unix or Linux) and change to the directory that contains your data files:
cd c:\datadir
_______________________________________________________________
Usage: perl genomap.pl -g -h -s -m -d -null=n files.txt > out.txt
where
-g provides output for genotypic mapping (ABCD genotypes for each locus),
-h provides output for haplotype mapping (1 or 0 genotypes per allele),
-s provides a summary of null alleles and progeny without parental alleles,
-m provides output in Mapmaker format,
-d provides output in Mapmaker format for one allele per locus for distance calculations,
-null=n, where n is the maximum number of loci for which parental mismatches are allowed before designating a segregating null allele. Default is 1.
<files.txt> is the name of a file containing a list of all genotyper table files to be processed, each of which must have a ".dat" extension, and
<out.txt> is the name of the file to save the output.
** You can choose unique names for these input and output files **
If the program produced the expected output, you’re ready to roll. Type
perl c:\bin\genomap.pl -s files.txt > data_summ.txt
This will produce a summary file, data_summ.txt, that you can then open with a spreadsheet program like Excel.
perl c:\bin\genomap.pl -g files.txt > data_geno.txt
produces a file with genotypes for each sample.
The format of these files is explained in detail below.
I encourage you to carefully review the summary file and genotype file, which will likely point out many genotyping errors. Correct the errors in the original data files (i.e., the table files generated by Genotyper), and continually regenerate the summary and genotype files until everything looks correct. Then it’s time to generate a Mapmaker input file:
perl c:\bin\genomap.pl -m files.txt > data_map.txt
This file contains data for the maternal, paternal, and intercross alleles, each of which should be analyzed separately in Mapmaker.
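Because the summary and genotype files are usually regenerated many times while errors are being corrected, it can help to script the commands above. The following Python sketch only assembles and launches the documented command lines; the helper names are hypothetical, and it assumes perl is on your path and that genomap.pl is given with a path valid on your system.

```python
import subprocess

FLAGS = {"-g", "-h", "-s", "-m", "-d"}


def genomap_command(flag, list_file, script="genomap.pl", null_cutoff=None):
    """Build the argument list for one documented output type."""
    if flag not in FLAGS:
        raise ValueError("unknown output flag: " + flag)
    cmd = ["perl", script, flag]
    if null_cutoff is not None:
        cmd.append("-null=%d" % null_cutoff)   # plain ASCII hyphen
    cmd.append(list_file)
    return cmd


def run_genomap(flag, list_file, out_path, **options):
    """Run genomap and save its screen output, like '> out.txt'."""
    with open(out_path, "w") as out:
        subprocess.run(genomap_command(flag, list_file, **options),
                       stdout=out, check=True)


print(genomap_command("-s", "files.txt"))
# ['perl', 'genomap.pl', '-s', 'files.txt']
```

For example, run_genomap("-s", "files.txt", "data_summ.txt") would correspond to the first command in this section.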
Genotype File
This file is generated to facilitate error checking and to give an easily-reviewed overview of the data.
Visible Alleles: The program assigns unique single letters to alleles for each locus (A-L).
Missing Data: The letter M designates missing data, either due to failed amplification (triggered by the ‘low signal’ warning or manually entered into the table), or to absence from the file. Note that a sample that is present in the file is scored as a double null unless ‘M’ or the low signal warning is present.
Null Alleles: Putative null alleles are designated as N in the genotype file. Presence of nulls is inferred by a lack of parental alleles in the progeny. The default threshold for declaring a null is a single mismatch between parents and progeny. If this criterion is too liberal, you may increase the cutoff by using the -null=N command line option, where N is the new cutoff value. In some cases it is unclear if an offspring contains a null or if it is homozygous (e.g., a cross between parents with genotypes AN x AB, and progeny with only allele A visible). In these cases the genotype is designated A_. In some cases the progeny contain two alleles from one parent and none from the other. In such cases a null is assumed to be present on a third chromosome, and an N is appended to the genotype. Similarly, if nulls are segregating and the parent has two visible alleles, an N is still appended to the genotype, so for example a parent might appear as ABN.
Nonparental Alleles: Alleles that are present in the progeny but absent in the parents are designated X, and an X is appended to the parental genotypes (mainly as a way of flagging the locus for further investigation). These are also flagged at the bottom of the output to facilitate identification of the anomalous progeny.
Aneuploidy: Individuals that possess more than one allele from a single parent are flagged as possible aneuploids at the bottom of the output. They can also be readily identified by having more than two letters for their genotypes.
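To see why the AN x AB cross mentioned under Null Alleles leads to the ambiguous A_ designation, it helps to enumerate the four possible progeny and group them by what is actually visible on a gel. The Python sketch below does this for single-letter allele codes; it is an illustration of the genetics with hypothetical function names, not the algorithm genomap uses.

```python
from itertools import product


def visible(genotype):
    """Alleles that can be seen: everything except the null allele N."""
    alleles = sorted(set(genotype) - {"N"})
    return "".join(alleles) or "null"


def progeny_classes(mother, father):
    """Group the possible progeny genotypes by their visible alleles."""
    classes = {}
    for m, f in product(mother, father):   # one allele from each parent
        classes.setdefault(visible(m + f), []).append(m + f)
    return classes


print(progeny_classes("AN", "AB"))
# {'A': ['AA', 'NA'], 'AB': ['AB'], 'B': ['NB']}
```

Progeny showing only allele A may be either AA or AN, which is exactly the case written as A_ in the genotype file.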
The Segregation Analysis section of the summary file presents a summary of observed alleles and putative nulls, and an assessment of departures from Mendelian segregation in the progeny. Expected segregation is based on parental genotypes, and presence of nulls is inferred from lack of parental alleles in the progeny. Expectations for the observed alleles are based on the number of progeny possessing the allele, whereas the expectation in the case of nulls is the number of progeny lacking parental alleles.
Allele: original allele name
Allele ID: identifier assigned by the program and used in genotype output.
Origin: indicates if the allele came from the father (Fthr), mother (Mthr), Both, or neither (Nonparental, NP).
Species: If adequate information is provided in genomap.cfg, indicates the species origin of the alleles. Single-letter designations must be provided for pure species, and combinations of these for hybrid parents. Unk indicates that information provided was inadequate to determine the origin. Both indicates that both species contributed the allele that is segregating in the cross. Shared indicates that both species possess the segregating allele, but there is inadequate information to determine which contributed it to the cross. In determining species designations, I assumed that aneuploidy was minimal to facilitate deduction of species origins in the absence of complete grandparent genotypes.
Mother: Genotype of the mother
Father: Genotype of the father
N: Number of progeny scored for that locus.
Expected: Expected number of observations of the allele, with Mendelian segregation (E)
Observed: Number of actual observations (O)
Mu: (po - pe)/Muex, where po = O/N, pe = E/N, and Muex = (pe(1 - pe)/N)^(1/2), i.e., the square root of pe(1 - pe)/N (cutoff for P = 0.05: 2.96)
Parental Summary: Summarizes the total number of loci analyzed, the number of alleles observed for each parent, % of loci for which parents were heterozygous, number of putative nulls originating from each parent, number of unreduced gametes (based on appearance of more than one allele from a single parent in the progeny), and number of alleles with significant segregation distortion based on the chi-squared test reported in the Segregation Analysis section. Only includes alleles that could be unambiguously assigned to one parent.
Species Summary:
Similar to the parental summary, but summarizes for alleles derived from each species in the cross, and for shared alleles and those that could not be assigned. The program will assign alleles to species even in cases where grandparent data are incomplete, as is often the case for forestry pedigrees. Unk indicates that information provided was inadequate to determine the origin. Both indicates that both species contributed the allele that is segregating in the cross. Shared indicates that both species possess the segregating allele, but there is inadequate information to determine which contributed it to the cross. In determining species designations, I assumed that aneuploidy was minimal to facilitate deduction of species origins in the absence of complete grandparent genotypes.
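The Mu statistic defined in the column list above can be reproduced by hand from the N, Expected, and Observed columns. The Python sketch below follows that definition; the function name is hypothetical and the numbers in the example are made up for illustration.

```python
import math


def mu_statistic(observed, expected, n):
    """Mu = (po - pe) / Muex, with Muex = sqrt(pe * (1 - pe) / n)."""
    po = observed / n          # observed proportion of progeny with the allele
    pe = expected / n          # proportion expected under Mendelian segregation
    mu_ex = math.sqrt(pe * (1 - pe) / n)
    return (po - pe) / mu_ex


# Hypothetical allele: 100 progeny scored, 50 expected, 65 observed.
mu = mu_statistic(observed=65, expected=50, n=100)
print(round(mu, 2))            # 3.0
```

Here the absolute value of Mu exceeds the 2.96 cutoff quoted above, so this allele would count as distorted. The function is undefined when the expected proportion is 0 or 1, because Muex is then zero.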
Aneuploidy
Lists observations of progeny that inherited more than one allele from a single parent. Names of original files are listed to facilitate correcting the data.
Nonparental Alleles
Lists alleles that appear in the progeny but do not appear in either parent.
Missing Data
Lists loci and samples for which observations are missing.
Progeny with Unexpected Lack of Parental Alleles
These progeny lacked parental alleles for loci at which no nulls were expected to be segregating. This will likely produce no output unless you change the minimum number of mismatches to be greater than the default (1). This can be done on the command line with the -null=N option, where N is the new cutoff.
Mismatching Progeny for Loci with Segregation Distortion
This is a list of loci and samples that did not match parents, but for which the number of mismatches did not match Mendelian expectations if a null was segregating. In many cases, these mismatches will be due to errors in the parental or progeny genotypes rather than true null segregation, so these should be carefully checked. This section will likely contain a great deal of output.
Repeated Samples with Different Genotypes
This section contains genotypes that don’t match between samples with the same name. Also included is the number of observations for each genotype to facilitate choosing a consensus genotype for oft-repeated, important samples (e.g., parents). Note that the program will always use the last genotype read, even if it conflicts with prior genotypes. Therefore, conflicts must be resolved in the original data files.
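The bookkeeping behind this section can be pictured as a tally of genotypes per sample and locus, together with the last genotype read. The Python sketch below is a simplified illustration with hypothetical names and data, not the program's own code; it assumes sample names are compared without regard to case.

```python
from collections import Counter, defaultdict


def repeated_conflicts(records):
    """records: (sample, locus, genotype) tuples in the order they were read.

    Returns {(sample, locus): (genotype counts, last genotype read)} for
    every sample/locus pair that was scored with more than one genotype.
    """
    counts = defaultdict(Counter)
    last = {}
    for sample, locus, genotype in records:
        key = (sample.lower(), locus)
        counts[key][genotype] += 1
        last[key] = genotype          # later reads overwrite earlier ones
    return {key: (dict(c), last[key])
            for key, c in counts.items() if len(c) > 1}


records = [("Mom", "L1", "AB"), ("mom", "L1", "AB"),
           ("MOM", "L1", "AC"), ("kid1", "L1", "AB")]
print(repeated_conflicts(records))
# {('mom', 'L1'): ({'AB': 2, 'AC': 1}, 'AC')}
```

In this made-up example the majority genotype is AB, yet AC is the one that would be used because it was read last, which is why conflicts have to be fixed in the original data files.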
Mapmaker, Haplotype, and Distance Outputs
Each of these outputs has a single allele on each line and information on whether the progeny possess the allele. In the case of the Haplotype output, presence and absence are indicated by 1 and 0 respectively. The respective designations are H and A in Mapmaker and Distance outputs. In addition, for Mapmaker output the second line contains the number of progeny, the number of loci, and the number of QTLs (not processed by genomap at this point). Finally, an s is appended to each locus name that begins with a number, and an * is appended to all locus names (a Mapmaker requirement). The Distance output has the same format as the Mapmaker output, but only one allele per locus is included to allow for calculation of the total map distance in centimorgans. Mapmaker output contains maternal alleles, paternal alleles, and intercross alleles. These should be saved to separate files for different analyses. The Distance output only contains maternal and paternal alleles. For all of these output files, the repulsion-phase allele is also generated, with an r appended to the end of the allele name.
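The presence/absence coding described above can be illustrated with a short Python sketch. The names are hypothetical, missing data are ignored, and the exact line layout of the output files is not reproduced. It also assumes that the repulsion-phase allele is simply the presence/absence pattern inverted, which is the usual convention but is not spelled out in this document.

```python
SYMBOLS = {
    "haplotype": ("1", "0"),   # present, absent
    "mapmaker": ("H", "A"),
    "distance": ("H", "A"),
}


def encode(presence, style):
    """Translate True/False allele presence into the codes for one output."""
    present, absent = SYMBOLS[style]
    return [present if p else absent for p in presence]


def repulsion(name, presence):
    """Assumed repulsion-phase allele: inverted pattern, 'r' added to name."""
    return name + "r", [not p for p in presence]


scores = [True, False, True, True]        # one allele across four progeny
print(encode(scores, "haplotype"))        # ['1', '0', '1', '1']
print(encode(scores, "mapmaker"))         # ['H', 'A', 'H', 'H']
rname, rscores = repulsion("locus1_a", scores)
print(rname, encode(rscores, "mapmaker")) # locus1_ar ['A', 'H', 'A', 'A']
```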