#### Can signal detection theory help us distinguish the signal from the noise in science?

This is known from experience, and it's just common sense: you can find a correlation in an existing data set and turn it into a prediction, but you verify that prediction with new data, taken specifically to test the hypothesis, not with the same data that suggested it.

The problem with large data sets, people finding spurious correlations by statistical accident, happens all the time in high-energy physics. These are the "3-sigma events" which show up every few years, where there is a bump in the cross section that superficially has about a 1 in a thousand chance of being a random fluctuation. The problem is that you are sifting through thousands of data points to find these exceptional events, so it isn't really 1 in a thousand, more like 1 in 1, and when the experiments are repeated to check whether there is something there, 9 times out of 10 there's nothing there.
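To put a rough number on "more like 1 in 1", here is a minimal sketch that assumes the search positions are statistically independent, which real spectra only approximately are. The function name `chance_of_any_fluke` is made up for this illustration.

```python
import math

def chance_of_any_fluke(p_single: float, n_looks: int) -> float:
    """Probability that at least one of n_looks independent positions
    fluctuates past a threshold whose per-position chance is p_single."""
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(n_looks * math.log1p(-p_single))

# A nominal "1 in a thousand" bump, searched for at more and more positions
for n in (1, 100, 1000, 5000):
    print(f"{n:5d} positions: {chance_of_any_fluke(1e-3, n):.3f}")
```

With a thousand positions the chance of at least one fake 3-sigma bump is already well over half, and with a few thousand it is close to certain.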

So when you are data mining, you can say "look what I found", but you shouldn't be so sure it is real until someone tests it on new data, unless the chance of a fluke is 1 in a million, or 1 in a billion if you are going through millions of points. If you can't realistically take new data, you should split your data set in two ahead of time, look for effects in one half, and then verify these effects on the second half. If you are automating the search on the first half, make sure you are not finding thousands of candidates in the first half and then "validating" a handful of them on the second half with 3-sigma confidence, since this is likely just a statistical fluke.
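The trap in that last sentence can be seen in a small simulation. This is only a sketch under stated assumptions: every position is pure Gaussian noise with no real effect anywhere, the two halves are independent, the automated search flags anything above 2 sigma in the first half, and "validation" means exceeding 3 sigma in the second half. The function name and thresholds are invented for the illustration.

```python
import random

def split_half_noise_search(n_positions, discover_z, confirm_z, seed=0):
    """Search pure noise: flag positions in the first half of the data,
    then count how many flagged positions also pass in the second half."""
    rng = random.Random(seed)
    candidates = 0
    confirmed = 0
    for _ in range(n_positions):
        z_first = rng.gauss(0.0, 1.0)   # significance in the exploration half
        z_second = rng.gauss(0.0, 1.0)  # significance in the held-out half
        if z_first > discover_z:
            candidates += 1
            if z_second > confirm_z:
                confirmed += 1
    return candidates, confirmed

candidates, confirmed = split_half_noise_search(200_000, 2.0, 3.0)
print(f"{candidates} candidates from the first half, "
      f"{confirmed} 'validated' at 3 sigma in the second half")
```

There is nothing to find in this data, yet a few candidates can be expected to survive the second half simply because thousands were tested there. The validation step is itself a multiple-trial search, so its threshold has to scale with the number of candidates carried over.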

Going to 5-sigma confidence, roughly one chance in a few million of a fluke, gets rid of these in most practical circumstances, when there are only thousands of search positions. This is why high-energy physicists wait until they have 5-sigma evidence before announcing, even though 1 in 1000 would normally be good enough to never get burned: 1 in 1000 is really 1 in 1 when you are sifting through thousands of positions.

The basic result is the common-sense dictum that the chance of a fluke you accept should be no more than 1 in (1000 times the number of trials you make). If it isn't that small, you aren't finding something significant.
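As a rough sketch of this rule, the snippet below divides the fluke probability you would accept for a single look by the number of trials and converts the result into a sigma threshold. It assumes independent trials and a one-sided Gaussian tail; `required_sigma` is a made-up name, and `statistics.NormalDist` needs Python 3.8 or later.

```python
from statistics import NormalDist

def required_sigma(p_single: float, n_trials: int) -> float:
    """Sigma threshold at which n_trials looks together carry roughly
    the fluke probability p_single that one look would have carried."""
    p_per_trial = p_single / n_trials          # 1 in (1000 x trials) when p_single = 1e-3
    return -NormalDist().inv_cdf(p_per_trial)  # one-sided Gaussian tail

for n in (1, 1000, 1_000_000):
    print(f"{n:9d} trials: need about {required_sigma(1e-3, n):.2f} sigma")
```

For a thousand trials this lands a little below 5 sigma, which is consistent with the 5-sigma convention leaving some margin when only thousands of positions are searched.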