Do You Speak DNA?

by Matthias Gallé

“In [these huge molecules] all the wealth and variety of heredity transmissions can find expression just as all the words and concepts of all languages can find expression in twenty-four to thirty letters of the alphabet”. When Friedrich Miescher wrote this in 1892, the words were purely speculative: he was not aware of the hereditary function of DNA, even though he had been the first to identify and isolate nucleic acids.


Linguistic references pop up all the time when we talk about DNA. Ask your office neighbor what he thinks the “language of life” is (this was also the title of one of the first books on molecular biology for a broad audience [5]). Or ask former US President Bill Clinton, who announced the completion of the draft sequence of the human genome with the words “Today we are learning the language in which God created life” [2]. Or ask François Jacob, who coined the expression “linguistic model in biology” [6] and was also one of the discoverers of the transcription process, one of the two main mechanisms for transferring sequential information in the Central Dogma [3]. The other one is called translation. Guess which field inspired these two names.

Linguistics has provided great metaphors that inspired some of the brightest discoveries about DNA. This is understandable: both a text document and DNA can be described as a sequential stream of symbols over a known alphabet that transmits information from one individual to another.

A much harder question is how far we can push this metaphor and how it should influence a researcher’s work. In 1994 a controversial paper claimed that two typical characteristics of human language were present in non-coding DNA [9]. One of them was Zipf’s law, probably the most famous of the power laws that appear in many natural phenomena [7], which relates the frequency of a word to its rank in the frequency table of all words. The article concluded with the “possible existence of one (or more than one) structured biological language — present in non-coding DNA sequences”. Critics were quick to attack these findings, and one of the most determined concluded: “The inescapable conclusion is clear: DNA sequences show no linguistic properties” [14]. Another, very recent, article [15] proposes to interpret DNA as a language in order to perform the inductive leap that would permit humans to generalize the notion of language, with the final goal of being able to comprehend a new kind of language the day we encounter extraterrestrial life forms.
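
To make the rank-frequency idea concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the “words” of a sequence are its overlapping k-mers, and it fits a straight line to log(count) against log(rank); Zipf’s law corresponds to counts falling off roughly as 1/rank, that is, an exponent near 1. The function names, the choice k=3 and the random toy sequence are all assumptions of this sketch, not the procedure of [9].

```python
from collections import Counter
import math
import random


def kmer_rank_frequency(sequence, k=3):
    """Return (rank, count) pairs for overlapping k-mers, most frequent first."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))


def zipf_exponent(rank_freq):
    """Least-squares slope of log(count) against log(rank), with the sign flipped.

    Needs at least two ranks, otherwise the fit is undefined.
    """
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(count) for _, count in rank_freq]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Hypothetical input: a uniformly random sequence, used only so the sketch runs.
random.seed(0)
toy_sequence = "".join(random.choice("ACGT") for _ in range(10_000))
ranked = kmer_rank_frequency(toy_sequence, k=3)
exponent = zipf_exponent(ranked)
print(len(ranked), "distinct 3-mers; fitted exponent:", round(exponent, 3))
```

A fitted exponent alone says little: much of the debate around [9] and [14] is about whether such a fit reveals anything language-like at all.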

 

The question therefore remains how to use the common characteristics of DNA and languages without taking misleading or controversial paths. Recent research may indicate that the answer lies somewhere back in the ’50s. Around the same time that Watson, Crick and Franklin unveiled the mystery behind the structure of DNA, Noam Chomsky led a revolution in the field of linguistics by proposing a mathematical treatment of the structure of syntax. He proposed to model languages with formal generative grammars, a type of rewriting system whose rules specify which symbols can be replaced by which sequences of other symbols. These were the first steps towards huge developments in Natural Language Processing, which enabled accurate machine translation, speech recognition and a computer that this year won the Jeopardy! question-answering game.
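
A rewriting system of this kind is easy to sketch in a few lines of Python. The toy grammar below is an assumption made for illustration, not one taken from the cited papers: its single nonterminal S is rewritten so that every nucleotide added on the left is matched by its complement on the right, which yields stem-loop-like strings. Nested pairing of this sort is a classic example of structure that a context-free grammar can express and a regular grammar cannot.

```python
import random

# Toy grammar (illustrative assumption): "S" is the only nonterminal,
# lowercase letters are terminal nucleotides, "gaaa" is an arbitrary loop.
RULES = {
    "S": ["aSt", "tSa", "cSg", "gSc", "gaaa"],
}
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def derive(start="S", rules=RULES, max_steps=6, rng=None):
    """Rewrite the leftmost nonterminal until only terminals remain."""
    rng = rng or random.Random()
    string, steps = start, 0
    while any(symbol in rules for symbol in string):
        pos = next(i for i, symbol in enumerate(string) if symbol in rules)
        options = rules[string[pos]]
        if steps >= max_steps:
            # Force a rule without nonterminals so that the derivation ends.
            options = [o for o in options if not any(s in rules for s in o)]
        string = string[:pos] + rng.choice(options) + string[pos + 1:]
        steps += 1
    return string


def is_stem_loop(s, loop="gaaa"):
    """Check that s is a stem, the loop, then the reverse complement of the stem."""
    stem_len = (len(s) - len(loop)) // 2
    left = s[:stem_len]
    middle = s[stem_len:stem_len + len(loop)]
    right = s[stem_len + len(loop):]
    return middle == loop and right == "".join(COMPLEMENT[b] for b in reversed(left))


sentence = derive(rng=random.Random(1))
print(sentence, is_stem_loop(sentence))
```

Each call to derive applies one rule per step, exactly in the sense of “which symbols can be replaced by which sequences of other symbols”; the max_steps cap is only there to keep the toy derivations short.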

Nowadays, we have a plethora of rigorous mathematical formalisms that make it possible to describe and explain different types of languages, artificial ones (like programming languages) as well as natural ones. We know a great deal about them, and understand how to manipulate and work with them. The possibility of throwing all this machinery at DNA sequences is too tempting. So tempting that people have already started doing it. David Searls in particular analyzed how the most famous classes of grammars, those in the so-called Chomsky-Schützenberger hierarchy, capture particular properties of DNA. His overview papers [1], [10]–[13] are easy to read, introduce linguistics to biologists and give an idea of the opportunities.

Take, as an example, a study published in 2006 by Loose et al. [8]. They used naturally occurring antibiotics, called antimicrobial peptides (AmPs), to automatically learn grammars for the “language of AmPs”. Using these grammars they generated new valid sentences of this language. When these were tested, they successfully inhibited the growth of bacteria. The grammars this study uses are only “regular grammars”, the lowest class in expressive and structural power in the Chomsky-Schützenberger hierarchy. My doctoral dissertation [4] was precisely about considering more expressive grammars to detect meaningful hierarchical structures in DNA. The holding of the first “work-conference” on Linguistics, Biology and Computer Science this year testifies to the growing research interest in this triple intersection. Surely, there are great opportunities for any of these areas to shed light, from new perspectives, on open questions of the other two.
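
The learn-then-generate loop of the AmP example can be illustrated with a deliberately simple stand-in. The sketch below records which symbol may follow which in a set of example strings, which defines a very restricted kind of regular language, and then walks those transitions to produce strings that were never in the examples. It is not claimed to be the procedure of [8]; the function names are hypothetical and the example strings are made-up toy strings, not real peptides.

```python
import random
from collections import defaultdict

START, END = "^", "$"

# Made-up toy strings (NOT real antimicrobial peptides).
EXAMPLES = ["GKWL", "ARWF"]


def learn_transitions(examples):
    """Collect, for each symbol, the set of symbols seen right after it."""
    transitions = defaultdict(set)
    for word in examples:
        padded = [START, *word, END]
        for current, following in zip(padded, padded[1:]):
            transitions[current].add(following)
    return transitions


def accepts(transitions, word):
    """A word is valid if every adjacent pair of symbols was observed."""
    padded = [START, *word, END]
    return all(b in transitions.get(a, ()) for a, b in zip(padded, padded[1:]))


def generate(transitions, rng, max_len=12):
    """Random walk over the transitions; None if END is not reached in time."""
    word, current = [], START
    while len(word) <= max_len:
        current = rng.choice(sorted(transitions[current]))
        if current == END:
            return "".join(word)
        word.append(current)
    return None


grammar = learn_transitions(EXAMPLES)
novel = {generate(grammar, random.Random(seed)) for seed in range(50)} - set(EXAMPLES)
print(sorted(novel))
```

Here “valid” only means “accepted by the learned transitions”. Whether a generated sentence does anything in the lab is exactly what the study had to test experimentally.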

References:

1. D. Chiang, A. Joshi, and D. B. Searls. Grammatical representations of macromolecular structure. Journal of Computational Biology, 13(5):1077–1100, Jan. 2006.

2. W. Clinton. http://clinton3.nara.gov/wh/new/html/genome-20000626.html, June 26, 2000.

3. F. Crick. Central dogma of molecular biology. Nature, 227(5258):561–563, 1970.

4. M. Gallé. Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem. Université de Rennes 1, Feb. 2011.

5. G. Beadle and M. Beadle. The language of life: an introduction to the science of genetics. Doubleday Publishing Group, 1966.

6. F. Jacob. The linguistic model in biology. In D. Armstrong and C. V. Schooneveld, editors, Roman Jakobson: Echoes of His Scholarship, pages 185–192. Humanities Pr, 1977.

7. W. Li. Zipf’s law everywhere. Glottometrics, 5:14–21, 2002.

8. C. Loose, K. Jensen, I. Rigoutsos, and G. Stephanopoulos. A linguistic model for the rational design of antimicrobial peptides. Nature, 443(7113):867–869, Jan. 2006.

9. R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng, M. Simons, and H. E. Stanley. Linguistic features of noncoding DNA sequences. Physical Review Letters, 73(23):3169–3172, 1994.

10. D. B. Searls. The computational linguistics of biological sequences. In L. Hunter, editor, Artificial Intelligence and Molecular Biology, page 75. AAAI Press Copublications, Mar. 1993.

11. D. B. Searls. Linguistic approaches to biological sequences. Computer Applications in the Biosciences, 13(4):333–344, Jan. 1997.

12. D. B. Searls. Reading the book of life. Bioinformatics, Jan. 2001.

13. D. B. Searls. Linguistics: trees of life and of language. Nature, 426(6965):391–392, Nov. 2003.

14. A. A. Tsonis, J. B. Elsner, and P. A. Tsonis. Is DNA a language? Journal of Theoretical Biology, 184(1):25–29, 1997.

15. D. Waters. The linguistic model in biology: Implications for recognizing life and intelligence. In Astrobiology Science Conference: Evolution and Life: Surviving Catastrophes and Extremes on Earth and Beyond, page 5368, Jan. 2010.

 
