How much data is stored in our bodies?

This cannot be answered fully today--- it is the question of the bit-content of the relevant computation going on in the body. If you knew how many bits are needed to simulate that computation efficiently, excluding the bits that are effectively random, you could answer it.

Whenever you have a biological system, some bits are doing important things, like the bits that tell you which domains are bound on an active protein, and some bits are useless, like the bits that tell you the precise orientation of a subpart of the cytoskeleton, or the bit that describes the orientation of some water molecule. When you want to store the data in a person, you are generally speaking about the biologically relevant data. If you store this data, destroy the organism, and then restore the organism from new molecules arranged according to the instructions in this irreducible data, the organism will behave in a way that is statistically indistinguishable from the original.

One can say some general things about the size of this data:

  1. It isn't infinite

There are some people who claim that the computation is like an analog computer (or as close as one can come given quantum limitations). This means that there are molecules or large objects in the cell that store analog data in their positions.

This is not true at all, and the fallacy is persuasive enough that one must argue against it. When you have a system in a thermal bath, there is always diffusion going on between interactions. If you have a molecule which is storing data in some way relevant to the biology, it must store this data in a way that the rest of the computation can effectively retrieve. If the data is randomized by diffusion, it is not effective storage, and those bits may be discarded and replaced by a random number generator.

If you have a diffusing protein which interacts with other proteins every time $t$ on average, and which has diffusion constant $D$, the protein will be randomized into a Gaussian of size $\sqrt{2Dt}$ before the next interaction. This means that it is pointless and wasteful in terms of storage to specify the position to more accuracy than this size, and the number of possible positions is on a lattice of spacing $\sqrt{2Dt}$. The number of lattice points in a cell of volume $V$ grows as $V/(2Dt)^{3/2}$ as you shrink $t$, so the number of bits a position can encode is bounded by the log of this, and it only grows logarithmically. This means that no matter how absurdly quickly you try to make the protein interact, the number of bits it can store in its position is never very high. It would be better off adding 5 binding domains rather than trying to localize more precisely.
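To get a feel for the numbers, here is a rough back-of-the-envelope sketch; the diffusion constant, interaction time, and cell size are assumed order-of-magnitude values, not measurements:

```python
from math import log2, sqrt

# Assumed order-of-magnitude values, not measurements:
D = 10e-12     # diffusion constant, m^2/s (~10 um^2/s, typical for a small cytoplasmic protein)
t = 1e-3       # average time between interactions, s
L = 10e-6      # linear size of the cell, m (~10 um)

sigma   = sqrt(2 * D * t)       # position smears into a Gaussian of this size between interactions
n_sites = (L / sigma) ** 3      # distinguishable positions on a lattice of that spacing
bits    = log2(n_sites)         # upper bound on bits the position can encode

print(f"lattice spacing  ~ {sigma * 1e6:.2f} um")
print(f"positions        ~ {n_sites:.1e}")
print(f"position encodes <= {bits:.0f} bits")
# Making interactions 1000x faster only adds log2(1000**1.5) ~ 15 bits,
# which is why binding domains beat spatial localization as a storage mechanism.
```

With these inputs the position is worth roughly 18 bits, and the logarithm is what keeps it from ever getting much bigger.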

The result applies to all other continuous storage mechanisms you can dream up--- the position of untethered molecules diffuses, the angles of proteins randomize, the concentrations of atoms are only relevant to the extent that a localized channel or protein can discriminate between different concentrations. In all biological systems, the spatial resolution cutoff for all motions is coarse enough to make the dominant information storage mechanism molecular binding.

  2. It isn't that large

The binding of molecules at first glance includes a large number of bits, since each protein is a long sequence of amino acids. But this is also a false bit content. In order to be dynamical data, capable of computing, the data has to change in time. Even if you have a very complicated protein, if its only action is to bind to a ligand, then it has exactly two states, and carries one bit of information. If it binds to a polymer, then it can have many different bits, but these bits should be associated with the binding sites of the polymer. The protein has two states; the polymer has many different states of protein binding.

To store this protein state efficiently on a computer, I just have to name the proteins with a unique name (this only takes a few bits), and give the state of each one of each type. The proteins which carry 1 bit of information are fully specified by the number of 1-state proteins and the number of 0-state proteins. Specifying these counts only requires logarithmically growing information, so it has negligible bit content.
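A minimal sketch of the counting argument; the protein name, copy number, and bound fraction are made up for illustration:

```python
from math import log2

# A hypothetical 1-bit protein "X"; the copy number and bound fraction are made up.
n_copies = 1_000_000
n_bound  = 300_000                 # copies currently in the bound (1) state

naive_bits = n_copies              # one bit per copy, as if each copy were tracked individually
count_bits = log2(n_copies + 1)    # identical copies are interchangeable: just record the count

print(f"one bit per copy:  {naive_bits} bits")
print(f"count ({n_bound} of {n_copies} bound): ~{count_bits:.0f} bits")
```

A million interchangeable one-bit proteins collapse to about 20 bits of count, which is the sense in which the bit content is negligible.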

This is absurd, of course. If you include rough position information, you will always need about 30 bits per protein (to name a billion locations in the cell where it could be) to really specify the state. So there aren't going to be 1-bit proteins which are not tethered to one spot. But the point here is that the dynamical data that is doing the computation is far smaller than the data encoded in the protein amino-acid sequence, because that sequence data is ROM, not RAM: it can be specified ahead of time, and you don't need to simulate anything to know the types of proteins that are running around.
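A rough comparison of the dynamical (RAM) bits of one protein copy against the sequence (ROM) bits of its type, using assumed typical numbers:

```python
from math import log2

# Assumed typical numbers for one signaling protein (illustrative only):
n_locations      = 10**9      # resolvable positions in the cell (~30 bits, as above)
n_binding_states = 2          # bound / unbound
protein_length   = 300        # amino acids; 20 possible residues ~ 4.3 bits each

ram_bits = log2(n_locations) + log2(n_binding_states)   # dynamical state of one copy
rom_bits = protein_length * log2(20)                    # fixed sequence of the protein type

print(f"RAM (changes during the computation): ~{ram_bits:.0f} bits per copy")
print(f"ROM (amino-acid sequence):            ~{rom_bits:.0f} bits per type")
```

The sequence carries on the order of a thousand bits, but they never change during the computation; the handful of bits per copy that do change are the ones that count.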

  3. It's very small in proteins

There is a simple formalism (see here, I authored this: http://arxiv.org/abs/q-bio.MN/0503028 ) which allows you to estimate the bit-capacity of binding proteins. The formalism is also useful for describing how proteins bind. It turns out to be related to D. Harel's higraphs, but it extends the formalism nontrivially to include polymerization (Harel was only interested in finite state automata, and didn't consider binding molecules).

The point of this formalism is to easily estimate the bit-capacity of protein networks. The estimate is made simply by multiplying the bit-capacity of a typical protein by the total number of proteins. Excluding proteins of known function, metabolic stuff, and so on, you find a range of bit-values in a human cell from as low as 10kB to as high as 1MB, but the higher end is extremely optimistic, and assumes that every different binding state is functionally discriminable, so that you can tell apart every phosphorylation state of P53 from every other just by looking at the future dynamics. This is clearly false, and I tend to believe the lower estimates.
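The multiplication itself is trivial; here is a sketch with placeholder inputs, chosen only to reproduce the quoted 10 kB to 1 MB range rather than taken from any measurement:

```python
# Placeholder inputs chosen only to reproduce the quoted 10 kB -- 1 MB range
# for one human cell; none of these numbers are measured values.
n_relevant_proteins = 100_000   # copies of regulatory/signaling proteins in the cell (assumed)

scenarios = {
    "low (each protein is effectively bound/unbound)": 1,     # bits per protein copy
    "high (every binding/modification state distinct)": 80,   # bits per protein copy
}

for label, bits_per_protein in scenarios.items():
    total_bits = n_relevant_proteins * bits_per_protein
    print(f"{label}: {total_bits:.1e} bits ~ {total_bits / 8 / 1024:.0f} kB")
```

The spread between the two scenarios is entirely in how many of a protein's binding and modification states you believe the rest of the cell can actually tell apart.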

  4. It's rather big in RNA

On the other hand, RNA strands in the cell can do much more. RNA is self-binding in complementary pairs, and to predict the future behavior of a strand, you need to know the sequence. This is because the sequence can find another complementary sequence and bind, and this bound sequence can attach to a protein, and so on, in a closed loop computation.
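For concreteness, the base-pairing that makes this possible is simple to state in code (the example strand is made up):

```python
# Watson-Crick pairing for RNA: A-U and G-C.
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    """Return the sequence that binds antiparallel to the given RNA strand."""
    return "".join(PAIR[base] for base in reversed(strand))

s = "AUGGCUUAC"   # made-up example strand
print(s, "binds", reverse_complement(s))
```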

In order for this information to contribute to the bit-content, one must assume that there is a vast undiscovered network of interacting RNA. I will assume this without any compunctions. This explains many enduring mysteries, which I will not go into.

The memory capacity of an RNA computer in the cell is easy to estimate: it's twice the number of nucleotides, since each of the four bases carries two bits. Unlike for proteins, each nucleotide is carrying RAM, not ROM, if the interactions with other RNA are sufficiently complex. This isn't true of mRNA, it isn't true of tRNA, and it isn't true of ribosomal RNA in isolation, so it requires more roles for RNA than were ever imagined. This is a prediction which does not surprise biologists anymore.

  5. It all hinges on how much RNA computing is going on in the cells

The starting estimate for RNA is the memory content of the DNA, which is a few times 10^9 bits (about 3x10^9 base pairs at two bits per pair). The RNA can be an order of magnitude greater, or even two orders of magnitude greater, but no more, since you run out of space. That puts the upper end at about 10^11 bits per cell.
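A quick back-of-the-envelope version of that per-cell estimate; the RNA-to-genome multipliers are the assumed, speculative part:

```python
genome_bp = 3e9              # human genome, base pairs
dna_bits  = 2 * genome_bp    # two bits per base pair -> ~6e9 bits

# Assumed multipliers for how much more dynamical RNA sequence a cell could
# hold than one genome's worth (this factor is the speculative part).
for factor in (10, 100):
    print(f"RNA at {factor}x the genome: ~{factor * dna_bits:.0e} bits per cell")
```

The quoted 10^11 bits per cell sits near the top of this range.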

  6. The biggest thing is the brain

In the brain, there is more genetic material than anywhere else. If you just take the weight of the brain and consider that it is all RNA, you get a reasonable estimate of the bit content of a person. It is approximately 10^22 bits per person, or about a billion terabytes, in RNA, with much less in everything else (so the Wikipedia estimate is probably ok for everything else). This is the correct estimate of the memory capacity of a person, since it is ridiculous to think that the information in the vast unknown RNA in the brain is decoupled from the neuron activity, considering that the neuron activity is otherwise completely computationally pathetic.
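One way to cross-check the headline figure is per cell rather than by weight; the brain cell count below is a rough standard number, the per-cell capacity is the estimate from the previous section, and this is a consistency sketch, not the exact route taken above:

```python
bits_per_cell = 1e11     # upper per-cell RNA estimate from the previous section
brain_cells   = 1e11     # neurons plus glia, rough order of magnitude

total_bits = bits_per_cell * brain_cells    # ~1e22 bits
terabytes  = total_bits / 8 / 1e12          # 1 TB = 8e12 bits

print(f"total: ~{total_bits:.0e} bits ~ {terabytes:.1e} TB")   # ~1e9 TB, a billion terabytes
```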

I described this idea in more detail here: Could we build a super computer out of wires and switches instead of a microchip?