16 I'm confused: how could information equal entropy? (Information theory)




Description

This article is from the Biological Information Theory and Chowder Society FAQ, by Thomas D. Schneider toms@ncifcrf.gov.

If someone says that information = uncertainty = entropy, then they are
confused, or something was not stated that should have been. Those
equalities lead to a contradiction: the entropy of a system increases as
the system becomes more disordered, so according to this confusion
information would correspond to disorder.

If you always take information to be a decrease in uncertainty at the
receiver, you will get straightened out:

R = Hbefore - Hafter.

where H is the Shannon uncertainty:

H = - sum (from i = 1 to number of symbols) Pi log2 Pi (bits per symbol)

and Pi is the probability of the ith symbol. If you don't understand this,
please refer to "Is There a Quick Introduction to Information Theory
Somewhere?".
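As a rough illustration of the formula for H, here is a minimal Python sketch.
The function name shannon_uncertainty and the two example distributions are
assumptions made for this illustration; they are not part of the FAQ.

```python
from math import log2


def shannon_uncertainty(probabilities):
    """Return H = -sum(Pi * log2(Pi)) in bits per symbol."""
    # Symbols with Pi = 0 are skipped: p * log2(p) tends to 0 as p tends to 0.
    return sum(-p * log2(p) for p in probabilities if p > 0)


# Four equally likely symbols, for example the four DNA bases.
h_uniform = shannon_uncertainty([0.25, 0.25, 0.25, 0.25])

# One symbol is certain and the other three never occur.
h_certain = shannon_uncertainty([1.0, 0.0, 0.0, 0.0])

print("H, four equally likely symbols:", h_uniform)
print("H, one certain symbol:", h_certain)
```

With four equally likely symbols H is log2(4) = 2 bits per symbol, and when
one symbol is certain H is 0 bits.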

Imagine that we are in communication and that we have agreed on an alphabet.
Before I send you a bunch of characters, you are uncertain (Hbefore) as to
what I'm about to send. After you receive a character, your uncertainty goes
down (to Hafter). In a real communication system Hafter is never zero, because
of noise. Your decrease in uncertainty is the information (R) that you gain.

Since Hbefore and Hafter are state functions, this makes R a function of
state. It allows you to lose information (it's called forgetting). You can
put information into a computer and then remove it in a cycle.

Many of the statements in the early literature assumed a noiseless channel,
so the uncertainty after receipt is zero (Hafter=0). This leads to the
SPECIAL CASE where R = Hbefore. But Hbefore is NOT "the uncertainty", it is
the uncertainty of the receiver BEFORE RECEIVING THE MESSAGE.
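The difference between the general case and the noiseless special case can be
sketched in Python. The probabilities below are invented for illustration
only: a four-symbol alphabet with equally likely symbols before receipt, and
a hypothetical noisy channel that leaves the receiver 97% sure of the symbol
afterwards. The names information_gain, noisy_after and noiseless_after are
likewise assumptions of this sketch.

```python
from math import log2


def shannon_uncertainty(probabilities):
    """H = -sum(Pi * log2(Pi)) in bits per symbol (zero terms skipped)."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


def information_gain(before, after):
    """R = Hbefore - Hafter, in bits per symbol."""
    return shannon_uncertainty(before) - shannon_uncertainty(after)


# Receiver's view before a symbol arrives: four equally likely symbols.
before = [0.25, 0.25, 0.25, 0.25]

# Hypothetical noisy channel: some uncertainty remains after receipt.
noisy_after = [0.97, 0.01, 0.01, 0.01]

# Noiseless channel: the received symbol is known exactly, so Hafter = 0.
noiseless_after = [1.0, 0.0, 0.0, 0.0]

r_noisy = information_gain(before, noisy_after)
r_noiseless = information_gain(before, noiseless_after)

print("Hbefore     =", shannon_uncertainty(before))
print("R noisy     =", round(r_noisy, 3))
print("R noiseless =", r_noiseless)
```

Only in the noiseless case does R come out equal to Hbefore. With noise, R is
smaller than Hbefore by exactly Hafter.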

A way to see this is to work out the information in a bunch of DNA binding
sites.

Definition of "binding": many proteins stick to certain special spots on DNA
to control genes by turning them on or off. The only thing that
distinguishes one spot from another spot is the pattern of letters
(nucleotide bases) there. How much information is required to define this
pattern?

Here is an aligned listing of the binding sites for the cI and cro proteins
of the bacteriophage (i.e., virus) named lambda:

alist 5.66 aligned listing of:
* 96/10/08 19:47:44, 96/10/08 19:31:56, lambda cI/cro sites
piece names from:
* 96/10/08 19:47:44, 96/10/08 19:31:56, lambda cI/cro sites
The alignment is by delila instructions
The book is from: -101 to 100
This alist list is from: -15 to 15

                       ------                   ++++++
                       111111--------- +++++++++111111
                       5432109876543210123456789012345
                       ...............................
OL1 J02459  35599 +  1 tgctcagtatcaccgccagtggtatttatgt
    J02459  35599 -  2 acataaataccactggcggtgatactgagca
OL2 J02459  35623 +  3 tttatgtcaacaccgccagagataatttatc
    J02459  35623 -  4 gataaattatctctggcggtgttgacataaa
OL3 J02459  35643 +  5 gataatttatcaccgcagatggttatctgta
    J02459  35643 -  6 tacagataaccatctgcggtgataaattatc
OR3 J02459  37959 +  7 ttaaatctatcaccgcaagggataaatatct
    J02459  37959 -  8 agatatttatcccttgcggtgatagatttaa
OR2 J02459  37982 +  9 aaatatctaacaccgtgcgtgttgactattt
    J02459  37982 - 10 aaatagtcaacacgcacggtgttagatattt
OR1 J02459  38006 + 11 actattttacctctggcggtgataatggttg
    J02459  38006 - 12 caaccattatcaccgccagaggtaaaatagt
                                             ^

Each horizontal line represents a DNA sequence, starting with the 5' end on
the left, and proceeding to the 3' end on the right. The first sequence
begins with: 5' tgctcag ... and ends with ... tttatgt 3'. Each of these
twelve sequences is recognized by the lambda repressor protein (called cI)
and also by the lambda cro protein.

What makes these sequences special so that these proteins like to stick to
them? Clearly there must be a pattern of some kind.

Read the numbers on the top vertically. This is called a "numbar". Notice
that position +7 always has a T (marked with the ^). That is, according to
this rather limited data set, one or both of the proteins that bind here
always require a T at that spot. Since the frequency of T is 1 and the
frequencies of other bases there are 0, H(+7) = 0 bits. If information were
the same thing as uncertainty, this position would carry no information. But
that makes no sense whatsoever! This is a position where the protein requires
information to be there.

That is, what is really happening is that the protein has two states. In the
BEFORE state, it is somewhere on the DNA, and is able to probe all 4
possible bases. Thus the uncertainty before binding is Hbefore = log2(4) = 2
bits. In the AFTER state, the protein has bound and the uncertainty is
lower: Hafter(+7) = 0 bits. The information content, or sequence
conservation, of the position is Rsequence(+7) = Hbefore - Hafter = 2 bits.
That is a sensible answer. Notice that this gives Rsequence close to zero
outside the sites.
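The calculation can be tried on the twelve aligned sequences listed above.
The following Python sketch is not the program that produced the listing; the
names SITES, FIRST_POSITION, uncertainty_at and rsequence are assumptions of
this illustration. It uses the observed base frequencies directly and applies
no correction for the small number of sequences, so positions away from the
conserved bases can come out somewhat above zero.

```python
from collections import Counter
from math import log2

# The twelve aligned lambda cI/cro sites, copied from the listing above.
SITES = [
    "tgctcagtatcaccgccagtggtatttatgt",
    "acataaataccactggcggtgatactgagca",
    "tttatgtcaacaccgccagagataatttatc",
    "gataaattatctctggcggtgttgacataaa",
    "gataatttatcaccgcagatggttatctgta",
    "tacagataaccatctgcggtgataaattatc",
    "ttaaatctatcaccgcaagggataaatatct",
    "agatatttatcccttgcggtgatagatttaa",
    "aaatatctaacaccgtgcgtgttgactattt",
    "aaatagtcaacacgcacggtgttagatattt",
    "actattttacctctggcggtgataatggttg",
    "caaccattatcaccgccagaggtaaaatagt",
]

FIRST_POSITION = -15   # the listing runs from -15 to +15
H_BEFORE = log2(4)     # 2 bits: the protein can probe all 4 bases


def uncertainty_at(column):
    """Hafter for one alignment column, from observed base frequencies."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def rsequence(sites):
    """Map each position to Rsequence = Hbefore - Hafter, in bits."""
    result = {}
    # zip(*sites) walks the alignment column by column.
    for index, column in enumerate(zip(*sites)):
        result[FIRST_POSITION + index] = H_BEFORE - uncertainty_at(column)
    return result


r = rsequence(SITES)
for position in sorted(r):
    print(f"{position:+3d}  {r[position]:.2f} bits")
```

At position +7 every sequence has a t, so Hafter(+7) = 0 and Rsequence(+7)
= 2 bits, matching the value worked out in the text.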

If you have uncertainty and information and entropy confused, I don't think
you would be able to work through this problem. For one thing, one would get
high information OUTSIDE the sites. Some people have published graphs like
this.

A nice way to display binding site data so you can see them and grasp their
meaning rapidly is by the sequence logo method. The sequence logo for the
example above is at
http://www-lecb.ncifcrf.gov/~toms/gallery/hawaii.fig1.gif. More information
on sequence logos is in the section "What are Sequence Logos?".

More information about the theory of BEFORE and AFTER states is given in the
papers http://www-lecb.ncifcrf.gov/~toms/paper/nano2 ,
http://www-lecb.ncifcrf.gov/~toms/paper/ccmm and
http://www-lecb.ncifcrf.gov/~toms/paper/edmm.

 
