
Suffix tree

The suffix tree is a data structure that gave one of the first linear-time solutions to the longest common substring problem. The concept was introduced by P. Weiner in 1973, and E.M. McCreight described a simpler, more space-economical construction in 1976. A suffix tree for an n-character string S is a Patricia trie (a compressed trie) containing all n suffixes of S.
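The definition can be made concrete with a small sketch. The code below is a naive quadratic-time construction, not McCreight's linear-time algorithm: it inserts the suffixes one at a time and splits an edge wherever a new suffix diverges from it. The names build_suffix_tree and spelled_suffixes are illustrative, and the text is assumed to end in a unique terminator character such as "$" so that no suffix is a prefix of another.

```python
def build_suffix_tree(text):
    """Naive O(n^2) construction of a compressed suffix trie.

    Each node is a dict mapping the first character of an outgoing edge
    to a tuple (start, end, child); the edge label is text[start:end].
    A leaf is an empty dict.
    """
    root = {}
    n = len(text)
    for i in range(n):                      # insert the suffix text[i:]
        node, pos = root, i
        while pos < n:
            ch = text[pos]
            if ch not in node:
                node[ch] = (pos, n, {})     # new leaf edge for the rest
                break
            start, end, child = node[ch]
            k = 0                           # matched length along this edge
            while (start + k < end and pos + k < n
                   and text[start + k] == text[pos + k]):
                k += 1
            if start + k == end:            # whole edge matched: descend
                node, pos = child, pos + k
            else:                           # mismatch inside edge: split it
                mid = {text[start + k]: (start + k, end, child)}
                node[ch] = (start, start + k, mid)
                node, pos = mid, pos + k
    return root


def spelled_suffixes(node, text, prefix=""):
    """Yield the string spelled along every root-to-leaf path."""
    if not node:
        yield prefix
    for start, end, child in node.values():
        yield from spelled_suffixes(child, text, prefix + text[start:end])


text = "banana$"
tree = build_suffix_tree(text)
# Labels of the edges leaving the root
print(sorted(text[s:e] for s, e, _ in tree.values()))
# -> ['$', 'a', 'banana$', 'na']
```

The root-to-leaf paths spell exactly the n suffixes of the text, which is the property the definition asks for.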

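With it, a large text can be searched, and common substrings can be extracted, very quickly: once the tree is built, a pattern of length m can be located in time proportional to m, regardless of the length of the text. Variants of the LZ77 compression scheme, such as LZSS, can use it to find repeated strings. Suffix trees are useful for string matching applications, such as those that arise when working with DNA sequences.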
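Both uses, searching and extracting common substrings, can be illustrated with a short sketch. For brevity it uses an uncompressed suffix trie built from nested dictionaries, which takes quadratic space, rather than a true suffix tree; the traversal logic is the same. The function names and the "owners" bookkeeping are assumptions made for this example.

```python
def build_suffix_trie(texts):
    """Insert every suffix of every text into one nested-dict trie.

    Each node records which texts (by index) pass through it.
    """
    root = {"children": {}, "owners": set()}
    for owner, text in enumerate(texts):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "owners": set()})
                node["owners"].add(owner)
    return root


def contains(root, pattern):
    """Return True if pattern occurs in any indexed text.

    Follows one child per pattern character, so the cost depends on
    the pattern length, not on the text length.
    """
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return False
    return True


def longest_common_substring(root, n_texts):
    """Return the deepest path whose nodes are shared by all texts."""
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) > len(best):
            best = prefix
        for ch, child in node["children"].items():
            if len(child["owners"]) == n_texts:
                stack.append((child, prefix + ch))
    return best


trie = build_suffix_trie(["GATTACA", "TACAGA"])
print(contains(trie, "ATTA"))             # True
print(contains(trie, "CAT"))              # False
print(longest_common_substring(trie, 2))  # TACA
```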

Each node in a suffix tree typically stores the following information: the label of the edge leading to it, in the form of a substring of the source string, represented by the start and end positions of the substring; a list of child nodes, often in the form of a linked list; a pointer to the next sibling node; and a suffix link, pointing to the node for the immediate suffix of the string represented by the current node (the same string with its first character removed). Suffix links are a key feature for linear-time construction of the tree, since they allow changes to propagate to all suffixes quickly.
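The record described above might be laid out as follows. This is only a sketch of the storage layout: the field names are illustrative, and the tiny tree for the string "aa$" is assembled by hand rather than by a construction algorithm.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(eq=False)
class Node:
    start: int = 0                          # edge label is text[start:end]
    end: int = 0
    first_child: Optional["Node"] = None    # head of the child linked list
    next_sibling: Optional["Node"] = None   # next node in the parent's list
    suffix_link: Optional["Node"] = None    # node for the string minus its first char


def add_child(parent: Node, child: Node) -> Node:
    """Prepend child to the parent's linked list of children."""
    child.next_sibling = parent.first_child
    parent.first_child = child
    return child


def children(node: Node) -> Iterator[Node]:
    """Walk the sibling list starting at node.first_child."""
    child = node.first_child
    while child is not None:
        yield child
        child = child.next_sibling


def spelled_suffixes(node: Node, text: str, prefix: str = "") -> Iterator[str]:
    """Yield the string spelled along every path from node to a leaf."""
    label = prefix + text[node.start:node.end]
    if node.first_child is None:
        yield label
    for child in children(node):
        yield from spelled_suffixes(child, text, label)


# Hand-built suffix tree for "aa$" (suffixes: "aa$", "a$", "$").
text = "aa$"
root = Node()
add_child(root, Node(2, 3))               # leaf "$"
inner_a = add_child(root, Node(0, 1))     # internal node for "a"
add_child(inner_a, Node(2, 3))            # leaf "a" + "$"
add_child(inner_a, Node(1, 3))            # leaf "a" + "a$"
inner_a.suffix_link = root                # "a" minus its first char is ""

print(sorted(spelled_suffixes(root, text)))  # ['$', 'a$', 'aa$']
```

Storing label positions instead of copies of the substring keeps each node a fixed size, but the several pointers per node are what drive the memory cost.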

The large amount of information at each node makes the suffix tree very memory-intensive, consuming some twenty times the memory size of the source text in common implementations. The suffix array reduces this requirement to a factor of four, and efforts have continued to find smaller indexing structures.
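For comparison, a suffix array stores only the starting positions of the suffixes in sorted order, one integer per character of the text (with 4-byte integers, this is consistent with the factor of four quoted above). The sketch below builds the array by plain sorting, which is simple but far from the fastest method, and locates a pattern by binary search; the function names are illustrative.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.

    Plain sorting of suffix slices: easy to read, not linear time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text.

    All suffixes that begin with pattern form one contiguous run
    in the suffix array, found here with two binary searches.
    """
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose m-character prefix is > pattern
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                  # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))   # [1, 3]
```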

References

  • E.M. McCreight. (1976). A space-economical suffix tree construction algorithm. Journal of the ACM 23(2):262-272.
  • E. Ukkonen. (1995). On-line construction of suffix trees. Algorithmica 14(3):249-260.

The contents of this article are licensed from Wikipedia.org under the GNU Free Documentation License.