The Mathematica Journal
Volume 9, Issue 2

Tricks of the Trade
Edited by Paul Abbott

Symbolic Analysis of DNA Sequences

Jose Manuel Gutiérrez

gutierjm@unican.es

DNA chains can be viewed as symbolic sequences over four nucleotides, adenine (A), cytosine (C), thymine (T), and guanine (G), that encode the amino acid sequences of proteins. DNA sequences are available from the database of the National Center for Biotechnology Information (NCBI) at www.ncbi.nlm.nih.gov/entrez/viewer.fcgi as plain ASCII files that contain information about the organism as well as the sequence of nucleotides. Discovering the structure encoded in DNA sequences is of paramount importance.

Given a symbolic sequence $s_1 s_2 \ldots s_N$, consider the dynamical system $x_n = f_{s_n}(x_{n-1})$, $n = 1, \ldots, N$, which defines an orbit on the attractor of the iterated function system (IFS) $\{f_s\}$ from an arbitrary initial condition $x_0$. The orbit only visits those regions of the attractor associated with the words appearing in the symbolic sequence. Jeffrey [1] applied this method to visualize the attractor of the IFS represented by DNA sequences.
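One concrete choice, assumed here rather than taken from the article, is the chaos-game IFS in which each map contracts the plane by one half toward a corner $c_s$ of the unit square. With $x_n$ denoting the orbit and $s_n$ the $n$-th symbol (notation introduced for this sketch), the update rule and its unrolled form suggest why the orbit reaches only regions labeled by words that occur in the sequence:

```latex
\begin{align}
x_n &= f_{s_n}(x_{n-1}) = \tfrac{1}{2}\,\bigl(x_{n-1} + c_{s_n}\bigr), \\
x_n &= 2^{-k}\,x_{n-k} + \sum_{j=0}^{k-1} 2^{-(j+1)}\,c_{s_{n-j}} .
\end{align}
```

Provided the initial point lies in the unit square, so does every $x_{n-k}$, and the second line then places $x_n$ inside a sub-square of side $2^{-k}$ determined by the last $k$ symbols $s_{n-k+1} \ldots s_n$. A word of length $k$ that never occurs in the sequence leaves its sub-square empty.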

The following commands implement an efficient version of the chaos-game algorithm for an IFS consisting of four transformations. We map each of the four nucleotide symbols to its own corner of the unit square, whose corners are (0,0), (0,1), (1,0), and (1,1). FoldList is used to compute the orbit, and compilation is used to speed up the computation.
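As an illustration of the approach just described, here is a minimal sketch and not the article's own listing. The names `cornerRules`, `halfway`, and `chaosGameOrbit` are hypothetical, and both the corner assignment and the contraction factor of one half are assumptions.

```mathematica
(* Assumed assignment of symbols to corners; the article's mapping may differ. *)
cornerRules = {"A" -> {0., 0.}, "C" -> {0., 1.}, "G" -> {1., 1.}, "T" -> {1., 0.}};

(* One chaos-game step, compiled: move halfway from point x toward corner c. *)
halfway = Compile[{{x, _Real, 1}, {c, _Real, 1}}, (x + c)/2];

(* Orbit of a sequence, starting from the center of the unit square.
   Characters other than A, C, G, T are skipped. *)
chaosGameOrbit[seq_String] :=
  FoldList[halfway, {0.5, 0.5},
    StringCases[ToUpperCase[seq], "A" | "C" | "G" | "T"] /. cornerRules]

chaosGameOrbit["ACGT"]
(* {{0.5, 0.5}, {0.25, 0.25}, {0.125, 0.625}, {0.5625, 0.8125}, {0.78125, 0.40625}} *)
```

For a long sequence, passing the result to `ListPlot` with `AspectRatio -> 1` should display the visited regions of the square.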

Symbolic sequences can be extremely long, so it is important to load and process such data efficiently. Moreover, if the data is available over the Internet, there is no need to store it locally. Rolf Mertig supplied code for opening and reading from a URL stream using J/Link.
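Reading from a URL stream through J/Link might look roughly like the sketch below. It is not the code supplied by Rolf Mertig. The helper name `readURLLines` is hypothetical, and the sketch assumes a working Java installation and network access. It uses only the standard Java classes `java.net.URL`, `java.io.InputStreamReader`, and `java.io.BufferedReader`.

```mathematica
Needs["JLink`"];
InstallJava[];

(* Hypothetical helper: return the lines of text served at the given address. *)
readURLLines[address_String] :=
  JavaBlock[
    Module[{url, reader, line, lines},
      url = JavaNew["java.net.URL", address];
      reader = JavaNew["java.io.BufferedReader",
        JavaNew["java.io.InputStreamReader", url@openStream[]]];
      (* readLine returns Null at the end of the stream. *)
      lines = Flatten[Last[Reap[
        While[(line = reader@readLine[]) =!= Null, Sow[line]]]]];
      reader@close[];
      lines
    ]
  ]
```

Wrapping the work in `JavaBlock` releases the temporary Java objects once the lines have been copied into Mathematica strings. In more recent versions of Mathematica, `Import[address, "Text"]` is a simpler route to the same data.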

The following code reads in a DNA sequence as a string of characters and prints the descriptive information separately.
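A sketch of that separation, assuming the text follows the GenBank flat-file layout, in which descriptive lines precede a line beginning with ORIGIN and the numbered sequence lines follow it. The function name `readSequence` and the sample record are made up for illustration.

```mathematica
(* Made-up record in the assumed GenBank-style layout. *)
sample = "LOCUS       EXAMPLE  20 bp  DNA\nDEFINITION  Made-up record for illustration.\nORIGIN\n        1 gatcctccat atacaacggt\n//";

(* Print the descriptive header and return the nucleotides as one uppercase string.
   Assumes the text contains a line starting with ORIGIN. *)
readSequence[text_String] :=
  Module[{parts = StringSplit[text, "\nORIGIN", 2]},
    Print[First[parts]];
    ToUpperCase[StringJoin[StringCases[Last[parts], LetterCharacter]]]
  ]

readSequence[sample]
(* prints the LOCUS and DEFINITION lines, returns "GATCCTCCATATACAACGGT" *)
```

Keeping only letter characters after the ORIGIN marker discards the position numbers, the spaces, and the closing `//` in one pass.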

There are two types of sequence identification numbers: GI numbers (a series of digits that is assigned consecutively by NCBI to each sequence it processes) and version numbers (which consist of the accession number followed by a dot and a version number). Either format can be used. Now, we use the previous command to import the complete mitochondrial genomes of mouse and man.
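The two identifier formats described above can be told apart with string patterns. This is a loose sketch: the function name `idKind` is hypothetical, the identifiers in the examples are invented, and the patterns check only the overall shape, not whether a record exists.

```mathematica
(* Classify an identifier by shape: all digits, or accession "." version. *)
idKind[id_String] := Which[
  StringMatchQ[id, DigitCharacter ..], "GI number",
  StringMatchQ[id, ((WordCharacter | "_") ..) ~~ "." ~~ (DigitCharacter ..)],
    "accession.version",
  True, "unrecognized"
]

idKind /@ {"12345", "AB_000001.1", "not an id"}
(* {"GI number", "accession.version", "unrecognized"} *)
```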

Both genomes define similar patterns (they are close in the phylogenetic tree: mitochondrial eukaryotes, vertebrata). Perhaps it is, after all, hard to answer the question, "Are you a man or a mouse?"



     
Copyright © Wolfram Media, Inc. All rights reserved.