How did life on Earth begin?

There is a reasonable answer to this question, which is a synthesis and extension of some ideas of Stuart Kauffman (1969), Stephen Wolfram (1981), and other stuff that is not in the literature. These ideas are essentially updating the cybernetic approach due to Turing, von Neumann, and Wiener, which was marginalized and suppressed in biology once DNA was discovered.

The reason the computational ideas were put on the back burner is that DNA has an obvious replication mechanism, and the molecule gave people a picture of the origin of life immediately--- a nucleic acid formed, and began to replicate, and then evolution proceeded.

This idea is seductive, but I believe it is completely incorrect, and many others who thought about this, including Francis Crick, eventually came to the same conclusion--- self-replicating nucleic acids are not the likeliest candidate for the origin of life. Crick was mystified, and proposed half-jokingly that it was panspermia. I don't think this is a reasonable answer either, since it just pushes the question back to wherever the panspermia came from.


Computations in modern cells

The main characteristic that distinguishes living from nonliving systems is the ability to do Turing complete computation, in a finite approximation, with an essentially limitless memory capacity and an enormous amount of processing per unit volume. Each cell has an enormous store of stable memory, dwarfing the best solid-state memory chip in bit-density, and the processing happens at molecular speeds.

An RNA read and write can be done at thousands of bases per second, with error correction, and you can stuff millions of these things in a cell volume. So the potential RAM of a cell is tens of gigabytes, and the processing speed for this data is the rate at which the data can be copied and transformed, which can reach megabytes per second. These figures are comparable to a modern home computer.
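To check these orders of magnitude, here is a minimal back-of-envelope sketch in Python. Every number in it is an assumption plugged in for illustration (molecule count, average length, number of active read/write enzymes), not a measured quantity.

    # Back-of-envelope estimate of cellular memory and throughput.
    # All numbers below are hypothetical inputs for illustration, not measurements.

    RNA_MOLECULES = 1e7    # assumed count of RNA molecules in one cell
    BASES_PER_RNA = 1e4    # assumed average length of each molecule, in bases
    BITS_PER_BASE = 2      # four possible bases = 2 bits of information each

    memory_bytes = RNA_MOLECULES * BASES_PER_RNA * BITS_PER_BASE / 8
    print(f"potential RNA memory: ~{memory_bytes / 1e9:.0f} GB")        # tens of GB

    ACTIVE_HEADS  = 1e4    # assumed number of simultaneously active read/write enzymes
    BASES_PER_SEC = 1e3    # assumed read/write rate per enzyme, bases per second

    throughput_bytes = ACTIVE_HEADS * BASES_PER_SEC * BITS_PER_BASE / 8
    print(f"aggregate throughput: ~{throughput_bytes / 1e6:.1f} MB/s")  # megabytes per second

Plug in your own numbers; the point is only that plausible choices land in the range of an ordinary computer, not a pocket calculator.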

To see that modern life is potentially Turing complete is not very hard--- you can easily engineer a Turing complete system using bio-molecules, and it is easy to see that the storage capacity of DNA and RNA is sufficient for running software of the kind you have on a modern laptop. Further, we have at least one biological system that computes for sure--- our brain. The computer itself was originally defined to abstract out the information processing done by the brain of a mathematician. So biological systems can compute, and do compute.

But the processes that biologists recognize as happening in a cell are not always sufficient to produce a full computation, at least not one of significant size. If all that happens in a cell is the central dogma, then DNA produces RNA and the RNA produces proteins, and then only the proteins are computing anything. The proteins compute with a random access memory determined by the different chemical bonds they can potentially form with each other, and this is only a few kilobytes of RAM at the most. It's still computing, but it's a very small computation, compared to the amount of frozen data stored in the DNA.

It is unreasonable that a system with gigabytes of ROM should only have kilobytes of RAM, especially since the DNA has to get written and proofread in the process of evolution. I will argue later that the proper computational ideas demand that there are exceedingly complex RNA networks active in modern cells, which compute at the gigabyte/teraflop rate.
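To put rough numbers on the mismatch, here is a hedged sketch. The protein-species count and per-protein state count are invented for illustration, and the genome size is human-scale.

    # Crude comparison of protein-state "RAM" against genomic "ROM".
    # The protein numbers are invented for illustration; the genome size is human-scale.

    PROTEIN_SPECIES        = 5_000   # assumed number of distinct protein species in play
    STATE_BITS_PER_PROTEIN = 4       # assumed: a handful of binding/modification states each

    protein_ram = PROTEIN_SPECIES * STATE_BITS_PER_PROTEIN / 8
    print(f"protein-state RAM: ~{protein_ram / 1e3:.1f} kB")   # a few kilobytes

    GENOME_BASES = 3e9               # a human-scale genome, at 2 bits per base
    genome_rom = GENOME_BASES * 2 / 8
    print(f"genomic ROM: ~{genome_rom / 1e9:.2f} GB")          # getting on for a gigabyte

Whatever numbers you prefer, the gap between storage and working memory comes out at several orders of magnitude, which is the mismatch being called unreasonable here.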

This idea, that RNA networks are required and appear in modern cells in a way that can do gigabyte computations, is implicit in recent work of John Mattick. It becomes more experimentally certain every year, as new functions for RNA are discovered. I take this idea for granted, as it is the only way I can see to make sense of the computational capacity of modern cells.


Computations in non-living systems

If you start with a pre-biotic soup of molecules, it is very simple to make a naturally computing system. This became clear after Wolfram's work in 1981.

The basic idea of a cellular automaton is that it is a model for information transformation in a system which can store stable discrete data. An example is molecules, which store data in the pattern with which they are bound to one another. These molecular patterns are transformed by catalysis, using other molecules, and the result is that certain bit-patterns rewrite other bit-patterns in a rule-based way.

Bit rewrite rules were studied by von Neumann, Conway, and Wolfram, and in each case it was discovered that a relatively simple system will produce full Turing computation. Von Neumann had a many-state two-dimensional automaton with relatively complicated rules, but it was proved Turing complete relatively easily. Conway used a two-dimensional automaton with very simple rules, and this was proved Turing complete in the 1990s (although it was pretty clear that it should be Turing complete in the 1970s too). Wolfram found a very simple nearest-neighbor automaton, rule 110, which was proved Turing complete around 2000 by Cook, a Wolfram employee. The proofs are relatively difficult, because they require building a computer out of the information transformations in the cellular automaton. But the general program makes it clear that as long as an automaton shows "complex behavior", meaning that the system doesn't die out to a stable pattern, doesn't devolve to a simple fractal pattern, and doesn't wash out to completely random noise, so that there are identifiable structures that persist long enough to impress their data on other structures, then you have Turing completeness.
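For concreteness, here is a minimal simulation of rule 110, the nearest-neighbor automaton proved universal by Cook. The width, step count, and random starting row are arbitrary choices.

    # Minimal simulation of rule 110, a nearest-neighbor binary cellular automaton.
    # Each cell's next state depends only on itself and its two neighbors,
    # through the lookup table encoded in the bits of the number 110.

    import random

    RULE, WIDTH, STEPS = 110, 80, 40

    # Decode the rule number: neighborhood (left, center, right) -> new bit.
    table = {(l, c, r): (RULE >> (4 * l + 2 * c + r)) & 1
             for l in (0, 1) for c in (0, 1) for r in (0, 1)}

    row = [random.randint(0, 1) for _ in range(WIDTH)]   # random initial condition
    for _ in range(STEPS):
        print("".join("#" if cell else "." for cell in row))
        row = [table[row[i - 1], row[i], row[(i + 1) % WIDTH]]   # periodic boundary
               for i in range(WIDTH)]

Run from a random row, it produces persistent localized structures that move and collide against a regular background, which is the kind of "complex behavior" just described.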

That complexity implies computational universality is not a theorem, it is a principle, which Wolfram called the "Principle of Computational Equivalence". It says that whenever an automaton looks complex, whenever it isn't trivial, it is going to be Turing complete. I will accept this, because it is true in the simple examples, and because it is difficult to construct something complex which is intermediate in Turing degree: Friedberg and Muchnik needed to work hard to do it, starting from something that is already a universal computer.

So in order to make a computer, all you need is a system with information stored in molecules, with rewrite-rules in the form of allowed catalysis. The peptides produced on the early Earth from the atmosphere, together with primordial hydrocarbons from the Earth's formation, can produce polymers with these properties at the interface of the primordial oil and water, simply by joining peptides into longer polypeptides. This is the computing soup. I believe that a sufficiently large and sufficiently fast computing soup is necessary and sufficient to explain the origin of life; there is nothing more and nothing less required.
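Here is a toy sketch of such a soup, under crude invented assumptions: polymers are bit strings, and "catalysis" means that one polymer, on meeting another, rewrites a short pattern according to its own sequence. None of the chemistry is modeled; the point is only that molecule-encoded rewrite rules acting on other molecules already constitute a computation of sorts.

    # Toy "computing soup": polymers are bit strings, and catalysis is a rewrite
    # rule read off the catalyst's own sequence. Purely illustrative assumptions.

    import random
    from collections import Counter

    random.seed(0)

    def random_polymer(length=12):
        return "".join(random.choice("01") for _ in range(length))

    def catalytic_rule(polymer):
        # A catalyst rewrites occurrences of its first three bits into its next three.
        return polymer[:3], polymer[3:6]

    soup = [random_polymer() for _ in range(200)]

    for _ in range(5000):
        catalyst, substrate = random.sample(range(len(soup)), 2)
        pattern, replacement = catalytic_rule(soup[catalyst])
        soup[substrate] = soup[substrate].replace(pattern, replacement)

    # After many encounters the statistics of short blocks are no longer uniform.
    print(Counter(s[:3] for s in soup).most_common(5))

This is nowhere near a demonstration of Turing completeness, but it has the minimal ingredients: rules encoded in molecules rewriting data encoded in other molecules.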


Self and non-self

A computing soup is seeded with random data, but the data doesn't stay random. It gets reworked depending on the local environment to acquire the characteristics of the molecules surrounding it. These characteristics build up progressively, because the system does not reach any sort of statistical steady state, and different regions of the large computation produce different ecosystems of interacting molecules.

None of these molecules is self-replicating on its own, but together they are all self-replicating in a certain sense, in that they weed out and digest molecules which do not conform to the patterns compatible with the rest. This is a collective sort of replication.

Collective replication was proposed by Stuart Kauffman as an alternative to the self-replicating molecule idea, back in 1969. The idea was that a collection of molecules, none of them individually self-replicating, can each catalyze the production of others in the set, so that together they autocatalyze the whole set. Kauffman argued that such an autocatalytic set is inevitable given a large enough diversity of molecular species.
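The flavor of the argument can be caricatured in a few lines: give each ordered pair of species a small fixed probability p that the first catalyzes the formation of the second, and ask how often there is a nonempty subset in which every member is catalyzed by another member. With p held fixed, such a core goes from rare to essentially guaranteed as the number of species N grows. This is only a cartoon with invented numbers, not Kauffman's actual model.

    # Cartoon of Kauffman's autocatalytic-set argument: as molecular diversity N
    # grows at fixed per-pair catalysis probability P, a subset in which every
    # member is catalyzed by another member becomes essentially inevitable.

    import random

    def has_autocatalytic_core(n, p, rng):
        # catalysts[j] = set of species that catalyze the formation of species j
        catalysts = [{i for i in range(n) if i != j and rng.random() < p}
                     for j in range(n)]
        alive = set(range(n))
        removed_something = True
        while removed_something:
            removed_something = False
            for j in list(alive):
                if not (catalysts[j] & alive):   # j has no surviving catalyst
                    alive.discard(j)
                    removed_something = True
        return bool(alive)   # survivors are a set catalyzed entirely from within

    rng = random.Random(1)
    P = 0.02   # assumed per-pair catalysis probability, held fixed
    for n in (10, 50, 100, 200, 400):
        hits = sum(has_autocatalytic_core(n, P, rng) for _ in range(100))
        print(f"N = {n:3d}: autocatalytic core found in {hits}/100 random soups")

The trend is the point: no single molecule needs to close the loop on itself; closure arrives statistically, from diversity.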

Kauffman's claim is probably true, to a certain extent, though one must keep in mind that it is just as true inside a computing soup as outside one. But a simple autocatalytic set on its own suffers in general from the same problem as other replicators--- getting stuck in a rut.

In order for evolution to proceed, it is not enough to be replicating; there must also be a path for further evolution into ever more complex systems. The simplest replicators have the property that all they do is replicate themselves, and then the only evolution is a quick minimum-finding, where they settle on the quickest and most stable replicator.

An example of such a parasitic replicator is fire: fire metabolizes and reproduces itself, but it is incapable of evolution. Similarly, small self-replicating computer programs with noise are capable of filling up the computer memory with copies of themselves, but they don't evolve past this point.
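A toy version of the second example, with invented parameters: strings copy themselves into a fixed-size memory with a little mutational noise, and shorter strings copy faster. The memory fills up, the population collapses onto the shortest surviving replicator, and then nothing further happens: replication plus noise gives quick minimum-finding, not open-ended evolution.

    # Toy parasitic replicator: strings copy themselves into a fixed memory with
    # occasional point mutation and deletion, and shorter strings replicate faster.
    # All parameters are invented for illustration.

    import random

    rng = random.Random(2)
    SLOTS = 200
    memory = [list("GATTACA" * 4)]                    # one seed replicator, 28 symbols

    for _ in range(20000):
        parent = rng.choice(memory)
        if rng.random() < 10.0 / len(parent):         # shorter strings copy more often
            child = [c if rng.random() > 0.01 else rng.choice("GATC") for c in parent]
            if rng.random() < 0.2 and len(child) > 1:
                child.pop(rng.randrange(len(child)))  # occasional deletion
            if len(memory) < SLOTS:
                memory.append(child)                  # fill empty memory first
            else:
                memory[rng.randrange(SLOTS)] = child  # then overwrite random slots

    lengths = sorted(len(m) for m in memory)
    print(f"{len(memory)} replicators fill the memory, lengths {lengths[0]}..{lengths[-1]}")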

The systems that are capable of further evolution are those that are not precisely replicating, but that are precisely computing.

The recognition of self and non-self by computing automata means that if you divide an automaton in two and wait, the two halves do not mix together well afterward, because they have acquired different characteristics. The result is that if you allow the two halves to touch, they will compete, and the better spreader will take over the computing volume.

This produces Darwinian competition long before any precise replication. The Darwinian competition allows for selection of traits favorable for spreading throughout the computing soup.


Emergence of life

A computing soup of molecules is likely sufficient for life, as the molecules will then compete locally to make better and better synthesis systems, and eventually they will make compartments to localize the molecules into cells, long after developing nucleic acids, ribosomes, and all the other machinery we see in modern cells. The stable replicating DNA molecule, in this view, is the last to form. It evolves when there is a need to store RNA in a more permanent fashion.

This idea is proteins and hydrocarbons first, RNA and genetic code second, DNA and cells last. It is hard to test the later stages, but the early stages can be tested using cellular automata, which is something I did about a decade ago. It was hard to interpret what was going on in the cellular automata, even when they looked like they were computing, because the patterns are not obvious a priori; but that was only because I did it half-heartedly, being more excited at the time about the computational patterns in modern cells.


Criticism of other ideas

The idea of RNA world assumes RNA can form. RNA has a sugar in its backbone, and it has different bases, and it is much too complicated to make abiotically. By contrast, proteins are dead simple to make: you can't avoid making amino acids from methane, ammonia, carbon dioxide, and water. So it is obvious chemically that proteins are earlier than RNA.

Further, RNA can't self-replicate. That's really good, because if it could, it would kill the computation like a cancer; but this is exactly what RNA world assumes--- some sort of self-replicating RNA.

The ideas of Dyson on cells-first suffer from the problem of no computation. If you don't start with a computing automaton, you have no computing automaton inside the cells--- they are too small. They are unlikely to have a diverse enough collection of species to make a computation, and even if they did, the potential for evolution is too small, because each cell is isolated and so has only a limited memory. These ideas are reasonable for the emergence of cells once the computing soup has evolved far enough to package the machinery into isolated compartments.

The ideas of Thomas Gold on the importance of petroleum and deep vents, archaea-first if you like, I think are fine, and in any case they are completely compatible with the view I am pushing here.