biology daily - the biology and biochemistry encyclopedia

FASTA format

In bioinformatics, FASTA format is a file format used to exchange sequence data between genetic sequence databases. A FASTA file looks like this:

>SEQUENCE_1
;comment line 1 (optional)
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
;comment line 1 (optional)
;comment line 2 (optional)
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Each record begins with a header line, marked by a leading '>', which gives a name and/or a unique identifier for the sequence and often carries additional information as well. Many sequence databases use standardized headers, which makes it easier to extract information from the header automatically.

After the header line, one or more comment lines may occur, each distinguished by a semicolon at the beginning of the line. Most databases and bioinformatics applications do not recognize these comments, so their use is discouraged, but they are part of the official format.

After the header line and any comments, one or more sequence lines may follow. Sequences may be protein sequences or DNA sequences; they can be of any length and can contain gaps or alignment characters (see sequence alignment).
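
The record structure described above (a '>' header, optional ';' comment lines, then one or more sequence lines) can be read with a few lines of code. The following Python sketch is only an illustration, not a reference parser: the function name parse_fasta is hypothetical, blank lines are skipped, any text before the first header is ignored, and the sample input is a truncated version of the example shown earlier.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, List[str], str]]:
    """Yield (header, comments, sequence) for each record in the input lines."""
    header: Optional[str] = None
    comments: List[str] = []
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, comments, "".join(chunks)
            header, comments, chunks = line[1:].strip(), [], []
        elif line.startswith(";"):
            # Comment lines are kept separately from the sequence.
            if header is not None:
                comments.append(line[1:].strip())
        elif header is not None:
            # Sequence lines are concatenated into a single string.
            chunks.append(line)
    if header is not None:
        yield header, comments, "".join(chunks)


# Truncated sample input, for illustration only.
example = """\
>SEQUENCE_1
;comment line 1 (optional)
MTEITAAMVK
ELRESTGAGM
>SEQUENCE_2
SATVSEINSE
"""

records = list(parse_fasta(example.splitlines()))
for header, comments, sequence in records:
    print(header, len(comments), len(sequence))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, so a large file does not have to be loaded into memory at once.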

FASTA format files often have file extensions such as .fa, .mpfa or .fsa, among many others.

The simple format of FASTA files makes them easy to manipulate using text processing tools and scripting languages like Perl.
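
As a small illustration of this kind of manipulation (in Python rather than Perl), the sketch below writes records back out as FASTA text with the sequence wrapped to a fixed line width. The function name format_fasta and the default width of 60 are assumptions made for the example; 60 matches the line length of the sample records above, but the format itself does not prescribe a width.

```python
from typing import Iterable, List, Tuple


def format_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Return FASTA text for (header, sequence) pairs, wrapping sequence lines."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    lines: List[str] = []
    for header, sequence in records:
        lines.append(">" + header)
        # Slice the sequence into consecutive chunks of at most `width` characters.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    # Terminate every line, including the last, with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical record: a 130-residue run of 'A', for illustration only.
text = format_fasta([("SEQUENCE_1", "A" * 130)])
print(text, end="")
```

Keeping the function pure (it returns a string instead of writing to a file) makes it easy to combine with other text processing steps.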

The NCBI has gone so far as to define a standard for its FASTA headers, although in practice the result is somewhat messy:

 GenBank                           gi|gi-number|gb|accession|locus
 EMBL Data Library                 gi|gi-number|emb|accession|locus
 DDBJ, DNA Database of Japan       gi|gi-number|dbj|accession|locus
 NBRF PIR                          pir||entry
 Protein Research Foundation       prf||name
 SWISS-PROT                        sp|accession|entry name
 Brookhaven Protein Data Bank      pdb|entry|chain
 Patents                           pat|country|number 
 GenInfo Backbone Id               bbs|number 
 General database identifier       gnl|database|identifier
 NCBI Reference Sequence           ref|accession|locus
 Local Sequence identifier         lcl|identifier
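
The table above lends itself to a simple lookup-driven parser: split the identifier on '|', read a database tag, then consume as many fields as that tag defines. The Python sketch below is a minimal illustration under stated assumptions: the names NCBI_FIELDS and parse_ncbi_id are hypothetical, the field names are transcribed from the table, the input is assumed to be the identifier only (without the leading '>' or any description text), and the sample identifiers are made up.

```python
from typing import Dict, List, Optional, Tuple

# Fields that follow each database tag, transcribed from the table above.
# None marks a position that the table leaves empty (as in "pir||entry").
NCBI_FIELDS: Dict[str, Tuple[Optional[str], ...]] = {
    "gi": ("gi-number",),
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "entry name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_ncbi_id(identifier: str) -> List[Tuple[str, Dict[str, str]]]:
    """Split an NCBI-style identifier into (tag, fields) pairs."""
    tokens = identifier.split("|")
    parsed: List[Tuple[str, Dict[str, str]]] = []
    pos = 0
    while pos < len(tokens):
        tag = tokens[pos]
        if tag not in NCBI_FIELDS:
            raise ValueError(f"unknown database tag: {tag!r}")
        names = NCBI_FIELDS[tag]
        values = tokens[pos + 1:pos + 1 + len(names)]
        # Pad with empty strings when trailing fields are missing.
        values += [""] * (len(names) - len(values))
        fields = {n: v for n, v in zip(names, values) if n is not None}
        parsed.append((tag, fields))
        pos += 1 + len(names)
    return parsed


# Made-up identifiers, for illustration only.
print(parse_ncbi_id("gi|12345|gb|XX000000|EXAMPLE"))
print(parse_ncbi_id("pir||ENTRY1"))
```

Handling the gi prefix as a tag of its own, with a single field, lets the same loop cover the combined forms in the first three rows of the table. The sketch rejects unknown tags instead of guessing how many fields they carry.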


07-14-2008 23:18:10
The contents of this article are licensed from Wikipedia.org under the GNU Free Documentation License.
BiologyDaily.com 2005.