Finrep Documentation

 

By Stefano Leonardi

stefano@dsa.unipr.it

April 1997

 

FASTA processing added by Steve DiFazio

difazios@ornl.gov

July 2003

 

This program is called "finrep" (find repeats).  It identifies locations

of any di-, tri-, or tetra-nucleotide motif repeated more than four times

consecutively in a sequence.
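
The search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the algorithm finrep itself uses: the function name find_repeats, the 1-based start positions, and the restriction to the letters A, C, G and T are all assumptions made for the example.

```python
import re

def find_repeats(sequence, min_copies=5):
    """Return (start, unit, copies) for di-, tri- and tetra-nucleotide runs.

    start is a 1-based position (an assumption of this sketch);
    min_copies=5 encodes "repeated more than four times".
    """
    seq = sequence.upper()
    hits = []
    for unit_len in (2, 3, 4):
        # Group 1 captures the whole run, group 2 the repeated unit.
        pattern = re.compile(
            r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_copies - 1)
        )
        for match in pattern.finditer(seq):
            run, unit = match.group(1), match.group(2)
            hits.append((match.start() + 1, unit, len(run) // unit_len))
    return hits

print(find_repeats("GGG" + "CA" * 6 + "TTT"))
print(find_repeats("AT" * 10))
```

The second call reports the same stretch twice, once as ten copies of AT and once as five copies of ATAT. A sketch like this also reports runs of a single base (AAAAAAAAAA counts as five copies of AA); whether finrep does the same is not stated here.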

 

The program has been compiled under DOS and UNIX (Solaris 2 and 9).

 

It reads data (sequences) in three formats:

text files in 'finrep' format,

GenBank format, and

FASTA format.

 

The command is:

 

finrep datafile

 

for a datafile in finrep format (see below),

 

finrep -genbank datafile

 

for a GenBank format datafile,

 

or

 

finrep -fasta datafile

 

for a FASTA format datafile.

 

If you want an output file, in DOS or UNIX you can easily redirect the output

with the following commands:

 

finrep datafile > outputfile

or

finrep -genbank datafile > outputfile
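
If finrep is driven from a script rather than typed at a prompt, the same commands and redirection can be sketched in Python as follows. The helper names finrep_command and run_finrep are made up for this example, and the sketch assumes an executable called finrep is on the search path. Only the -genbank and -fasta options described above are used.

```python
import subprocess

# Options as documented above; the plain finrep format needs none.
FORMAT_FLAGS = {"finrep": [], "genbank": ["-genbank"], "fasta": ["-fasta"]}

def finrep_command(datafile, fmt="finrep"):
    """Build the argument list for one finrep run (hypothetical helper)."""
    return ["finrep"] + FORMAT_FLAGS[fmt] + [datafile]

def run_finrep(datafile, outputfile, fmt="finrep"):
    """Run finrep and write its standard output to outputfile."""
    with open(outputfile, "w") as out:
        # Equivalent to: finrep [option] datafile > outputfile
        result = subprocess.run(finrep_command(datafile, fmt), stdout=out)
    return result.returncode

print(finrep_command("mydata.txt"))
print(finrep_command("mydata.gb", "genbank"))
```

The file names mydata.txt and mydata.gb are placeholders.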

 

A single file can hold as many sequences as you want (up to about

1,000,000 sequences; see BUGS below).

 

Finrep format:

 

Each sequence should start with a line as follows:

SEQUENCE: "Name_of_the_sequence"

and should end with a line with two slashes as follows:

//

 

The sequence itself may contain spaces or numbers for easy

reference.

 

Lines beginning with a '#' or a ';' are treated as comments and are

ignored by the program.

 

Please take a look at the file test.txt as an example.
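
To make the rules of the finrep format concrete, here is a small Python reader that follows them. It is a sketch written for this document, not code taken from finrep: the function name parse_finrep and the sample record are invented, and the reader accepts the sequence name with or without the surrounding double quotes because it is assumed here that either form may occur.

```python
def parse_finrep(text):
    """Return a list of (name, sequence) pairs from finrep-format text."""
    records = []
    name, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        if line.startswith("SEQUENCE:"):
            name = line[len("SEQUENCE:"):].strip().strip('"')
            chunks = []
        elif line == "//":
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = None, []
        elif name is not None:
            # Keep letters only: spaces and position numbers are dropped.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return records

sample = """# an invented record, for illustration only
SEQUENCE: "demo"
1 CACACA CACACA
13 GGTT
; comments may appear anywhere
//
"""
print(parse_finrep(sample))
```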

 

The GenBank data file is for sequences obtained from GenBank. The

program needs at least the line with the word LOCUS and the correct

number of base pairs (bp).  Then it needs the line with the word ACCESSION, then a

line with ORIGIN, the sequence and the final line with "//".
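
A record can be checked against these requirements before it is given to finrep. The Python sketch below looks for the four required lines in order and compares the bp count on the LOCUS line with the number of letters after ORIGIN. The function name check_genbank_record and the sample record (with a made-up locus name and accession) are assumptions of this example. How finrep itself reacts to a record that fails such a check is not covered here.

```python
import re

def check_genbank_record(text):
    """Return a list of problems found in one GenBank-style record."""
    problems = []
    expected = ["LOCUS", "ACCESSION", "ORIGIN", "//"]
    declared_bp = None
    in_origin = False
    length = 0
    for line in text.splitlines():
        if expected and line.startswith(expected[0]):
            marker = expected.pop(0)
            if marker == "LOCUS":
                found = re.search(r"(\d+)\s*bp", line)
                declared_bp = int(found.group(1)) if found else None
            in_origin = marker == "ORIGIN"
        elif in_origin:
            # Count letters only; position numbers and spaces are skipped.
            length += sum(ch.isalpha() for ch in line)
    problems += ["missing line: " + marker for marker in expected]
    if declared_bp is None:
        problems.append("no bp count on LOCUS line")
    elif declared_bp != length:
        problems.append(
            "LOCUS says %d bp, sequence has %d" % (declared_bp, length)
        )
    return problems

record = """LOCUS       DEMO01        24 bp    DNA
ACCESSION   X00000
ORIGIN
        1 cacacacaca cagtcagtca gtca
//
"""
print(check_genbank_record(record))
```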

 

A FASTA format datafile need only contain a header line marked

with '>', followed by the sequence in lines shorter than 256 characters.
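
Some FASTA files store each sequence on a single very long line. A minimal Python sketch for rewrapping such a file so that every sequence line stays under the 256-character limit is given below. The function name wrap_fasta and the default width of 70 are choices made for this example. Header lines are copied unchanged, so an overlong header would not be fixed by it.

```python
def wrap_fasta(text, width=70):
    """Rewrap FASTA sequence lines to `width` characters, keeping headers."""
    out, chunks = [], []

    def flush():
        # Write the sequence collected so far in pieces of `width` letters.
        seq = "".join(chunks)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
        chunks.clear()

    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()
            out.append(line)
        elif line:
            chunks.append(line)
    flush()
    return "\n".join(out) + "\n"

long_fasta = ">seq1\n" + "A" * 300 + "\n>seq2\nACGT\n"
wrapped = wrap_fasta(long_fasta)
print(max(len(line) for line in wrapped.splitlines()))
```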

 

The output is very redundant because I wanted to be sure that

it picks up every repeat. You can select only the information

you actually need.

 

A script for processing the output and eliminating repetitive results,

finrep_parse.perl, is available from Steve DiFazio

(difazios@ornl.gov).  We highly recommend that you process the finrep

output with this Perl script.

 

Please let us know if you have any kind of problem with this program.

 

BUGS:

 

The program crashes when processing very large files (> 1,000,000 sequences)

in Solaris 9.  A script for converting large FASTA files to a number of

smaller files in finrep format is available from Steve DiFazio.
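
For readers who want to see what such a conversion involves, here is an independent Python sketch. It is not the script mentioned above. The function name fasta_to_finrep_chunks, the default of 1000 sequences per chunk, and the 60-letter line width are arbitrary choices, and the sequence name is written inside double quotes on the assumption that the quotes shown in the finrep format description are literal. The sketch holds all records in memory, which may itself be a problem for very large inputs.

```python
def fasta_to_finrep_chunks(fasta_text, per_file=1000, width=60):
    """Yield finrep-format strings with at most per_file sequences each."""
    records = []
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))

    for start in range(0, len(records), per_file):
        parts = []
        for rec_name, seq in records[start:start + per_file]:
            body = "\n".join(
                seq[i:i + width] for i in range(0, len(seq), width)
            )
            parts.append('SEQUENCE: "%s"\n%s\n//\n' % (rec_name, body))
        yield "".join(parts)

fasta = ">a\nACGT\nAC\n>b\nGG\n>c\nTT\n"
for number, chunk in enumerate(fasta_to_finrep_chunks(fasta, per_file=2), 1):
    # Each chunk would be written to its own file, e.g. part1, part2, ...
    print("chunk", number)
    print(chunk)
```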