Finrep Documentation
By Stefano Leonardi
stefano@dsa.unipr.it
April 1997
FASTA processing added by Steve DiFazio
difazios@ornl.gov
July 2003
This program is called "finrep" (find repeats). It identifies the locations
of any di-, tri-, or tetra-nucleotide repeated more than four times
consecutively in a sequence.
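The kind of search described above can be sketched in a few lines of Python. This is only an illustration and not the algorithm finrep itself uses: the function name find_repeats is hypothetical, and reading "more than four times" as "at least five consecutive copies" is an assumption.

```python
import re

# Assumption: "more than four times" means five or more consecutive copies.
MIN_COPIES = 5


def find_repeats(seq, min_copies=MIN_COPIES):
    """Return sorted (start, unit, copies) tuples for di-, tri- and
    tetra-nucleotide units repeated at least `min_copies` times in a row.
    `start` is 1-based. Hypothetical helper, not part of finrep."""
    seq = seq.upper()
    hits = []
    for unit_len in (2, 3, 4):
        # ([ACGT]{n}) captures one unit; \1{k,} demands k or more extra copies.
        pattern = re.compile(
            r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1)
        )
        for match in pattern.finditer(seq):
            copies = len(match.group(0)) // unit_len
            hits.append((match.start() + 1, match.group(1), copies))
    return sorted(hits)
```

For example, find_repeats("GGCACACACACATT") returns [(3, 'CA', 5)]. A long dinucleotide run is also reported as a tetranucleotide repeat by this sketch, so its results overlap in much the same way as any exhaustive search would.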
The program has been compiled under DOS and UNIX (Solaris 2 and 9).
It reads data (sequences) in three formats:
text files in 'finrep' format,
GenBank format, and
FASTA format.
The command is:
    finrep datafile
for a datafile in finrep format (see below),
    finrep -genbank datafile
for a GenBank format datafile, or
    finrep -fasta datafile
for a FASTA format datafile.
If you want an output file, in DOS or UNIX you can easily redirect the output
with the following commands:
    finrep datafile > outputfile
or
    finrep -genbank datafile > outputfile
In one file there can be as many sequences as you want (up to about
1,000,000 sequences; see BUGS below).
Finrep format:
Each sequence should start with a line as follows:
SEQUENCE: "Name_of_the_sequence"
and should end with a line containing two slashes, as follows:
//
The sequence itself may contain spaces or numbers for easy reference.
Lines beginning with a '#' or a ';' are considered comments and are
ignored by the program.
Please take a look at the file test.txt as an example.
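To make the rules of the finrep format concrete, here is a minimal Python sketch of a reader that follows them. It is not taken from finrep's source code: the function name read_finrep and the example record are hypothetical, and treating the double quotes around the name as optional decoration is an assumption.

```python
def read_finrep(text):
    """Return a list of (name, sequence) pairs from finrep-format text.
    Hypothetical helper, not part of finrep."""
    records = []
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment line
        if line.startswith("SEQUENCE:"):
            # Assumption: surrounding double quotes are not part of the name.
            name = line[len("SEQUENCE:"):].strip().strip('"')
            parts = []
        elif line == "//":
            if name is not None:
                records.append((name, "".join(parts)))
            name, parts = None, []
        elif name is not None:
            # Drop the spaces and reference numbers, keep only the letters.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return records


# Made-up record for illustration only.
example = '''# a comment line
SEQUENCE: "demo_1"
 1 acgtacgtac gtacgtacgt
21 ttga
//
'''
```

Calling read_finrep(example) gives [('demo_1', 'acgtacgtacgtacgtacgtttga')].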
The GenBank data file is for sequences obtained from GenBank. The program
needs at least the line with the word LOCUS and the correct number of b.p.,
then the line with the word ACCESSION, then a line with ORIGIN, the sequence,
and the final line with "//".
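Because the LOCUS line must carry the correct number of b.p., it can help to check a record before running finrep on it. The Python sketch below compares the declared length with the number of letters between ORIGIN and "//". The function name check_genbank_length and the example record are hypothetical, and the sketch assumes the length appears on the LOCUS line as a number followed by "bp", as in standard GenBank flat files.

```python
import re


def check_genbank_length(text):
    """Return (declared_bp, actual_bp) for the first record in `text`.
    Hypothetical helper, not part of finrep."""
    declared, actual, in_origin = None, 0, False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # Assumption: the length is written as "<number> bp".
            match = re.search(r"(\d+)\s+bp", line)
            declared = int(match.group(1)) if match else None
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break  # end of the first record
        elif in_origin:
            # Count bases only; skip position numbers and spaces.
            actual += sum(ch.isalpha() for ch in line)
    return declared, actual


# Made-up minimal record for illustration only.
record = """LOCUS       DEMO0001      12 bp    DNA
ACCESSION   DEMO0001
ORIGIN
        1 acgtacgtac gt
//
"""
```

Here check_genbank_length(record) returns (12, 12); a mismatch between the two numbers suggests the record was edited or truncated.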
The FASTA format datafile need only contain the header line marked with '>',
followed by the sequence in lines of fewer than 256 characters.
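If a FASTA file has sequence lines of 256 characters or more, it can be rewrapped first. The following Python sketch shows one way to do this. The function name wrap_fasta and the default width of 60 are hypothetical choices, and the sketch assumes the length limit concerns sequence lines, so header lines are passed through unchanged.

```python
def wrap_fasta(text, width=60):
    """Rewrap FASTA sequence lines so that none is longer than `width`.
    Hypothetical helper, not part of finrep."""
    if not 0 < width < 256:
        raise ValueError("width must be between 1 and 255")
    out, seq = [], []

    def flush():
        # Emit the sequence collected so far in slices of `width` characters.
        joined = "".join(seq)
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))
        seq.clear()

    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()
            out.append(line)  # header line is kept as it is
        elif line:
            seq.append(line)
    flush()
    return "\n".join(out) + "\n"
```

The rewrapped text can then be written to a new file and passed to finrep with the -fasta option.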
The output is very redundant because I wanted to be sure that it picks up
every repeat. You can select only the information you actually need.
A script for processing the output and eliminating repetitive results,
finrep_parse.perl, is available from Steve DiFazio (difazios@ornl.gov). We
highly recommend that you process the finrep output with this Perl script.
Please let us know if you have any kind of problem with
this program.
BUGS:
Crashes when processing very large files (> 1,000,000 sequences) in
Solaris 9. A script for converting large FASTA files to a number of smaller
files in finrep format is available from Steve DiFazio.
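For readers who want to see what such a conversion involves, here is a minimal Python sketch. It is not Steve DiFazio's script: the function name fasta_to_finrep_chunks, the default of 1000 sequences per file, and the 60-character line width are all hypothetical choices, and the sketch reads the whole input into memory, which may not suit the largest files.

```python
def fasta_to_finrep_chunks(fasta_text, per_file=1000):
    """Yield finrep-format strings holding at most `per_file` sequences each.
    Hypothetical helper, not the conversion script mentioned above."""
    records, name, seq = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(seq)))
            name, seq = line[1:].strip(), []
        elif line and name is not None:
            seq.append(line)
    if name is not None:
        records.append((name, "".join(seq)))

    for start in range(0, len(records), per_file):
        lines = []
        for rec_name, rec_seq in records[start:start + per_file]:
            # Assumption: the name is written between double quotes.
            lines.append('SEQUENCE: "%s"' % rec_name)
            lines.extend(rec_seq[i:i + 60] for i in range(0, len(rec_seq), 60))
            lines.append("//")
        yield "\n".join(lines) + "\n"
```

Each string produced can be written to its own file (for example part_1.txt, part_2.txt, and so on) and processed with finrep one file at a time.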