Indeterminate String Factorizations and Degenerate Text TransformationsReport as inadecuate




Indeterminate String Factorizations and Degenerate Text Transformations - Download this document for free, or read online. Document in PDF available to download.

Mathematics in Computer Science

, Volume 11, Issue 2, pp 209–218

First Online: 02 February 2017Received: 22 April 2014Revised: 26 May 2014Accepted: 21 June 2014DOI: 10.1007-s11786-016-0285-x

Cite this article as: Daykin, J.W. & Watson, B. Math.Comput.Sci. 2017 11: 209. doi:10.1007-s11786-016-0285-x

Abstract

The data explosion problem continues to escalate requiring novel and ingenious solutions. Pattern inference focusing on repetitive structures in data is a vigorous field of endeavor aimed at shrinking volumes of data by means of concise descriptions. The Burrows–Wheeler transformation computes a permutation of a string of letters over an alphabet, and is well-suited to compression-related applications due to its invertability and data clustering properties. For space efficiency the input to the transform can be preprocessed into Lyndon factors. Rather than this classic deterministic approach for letter based strings, we consider scenarios with uncertainty regarding the data: a position in an indeterminate or degenerate string is a set of letters. We first define indeterminate Lyndon words and establish their associated unique string factorization; then we introduce the novel degenerate Burrows–Wheeler transformation which may apply the indeterminate Lyndon factorization. A core computation in Burrows–Wheeler type transforms is the linear sorting of all conjugates of the input string—we achieve this in the degenerate case with lex-extension ordering. Like the original forms, indeterminate Lyndon factorization and the degenerate transform and its inverse can all be computed in linear time and space with respect to total input size of degenerate strings. Regular molecular biological strings yield a wealth of applications of big data—an important motivation for generalizing to degenerate strings is their extensive use in expressing polymorphism in DNA sequences.

KeywordsDegenerate biological string Degenerate Burrows-Wheeler transform Indeterminate Lyndon word Indeterminate suffix array Inverse transform Lex-extension order Linear Full version of an extended abstract which appeared in the Proceedings of the 2nd International Conference on Algorithms for Big Data.

Mathematics Subject Classification68R15 Download to read the full article text



Author: Jacqueline W. Daykin - Bruce Watson

Source: https://link.springer.com/







Related documents