
Robust fits and bad points

Already, we must decide how to deal with discordant data. As this is a subject of considerable importance, a short digression is in order.

Poincaré attributed to Lippmann the remark that ``Everyone believes in the normal law of errors: the mathematicians, because they think it is an experimental fact; and the experimenters, because they suppose it is a theorem of mathematics.'' However, it is neither an experimental fact nor a theorem of mathematics.

Experimentally, numerous investigations have shown that real errors are rarely if ever normally distributed. Nearly always, large errors are much more frequent than would be expected for a normal distribution (see [18], pp. 10 - 12, and [12], pp. 20 - 31). Menzies and Laing [17] show clear examples in photometric data.

Mathematically, the reason for this behavior is well understood: although the Central Limit Theorem promises a Gaussian distribution in the limit as the number of comparable error sources approaches infinity, the actual approach to this limit is agonizingly slow -- especially in the tails, where a small number of large individual contributors dominate. In fact, if there are $n$ independent and identically distributed contributors, the rate of convergence is no faster than $n^{-1/2}$ [11]. If we wanted to be sure that our distribution was Gaussian to an accuracy of 1%, we would need some $10^4$ elemental contributions -- clearly, an unrealistic requirement. In practice, a few large error sources dominate the sum.
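
The arithmetic behind the $10^4$ figure is a one-liner (a sketch, with the proportionality constant in the $n^{-1/2}$ bound taken to be of order unity):

\[
  \sup_x \bigl| F_n(x) - \Phi(x) \bigr| \;\sim\; \frac{1}{\sqrt{n}} \;\le\; 0.01
  \quad\Longrightarrow\quad n \;\ge\; 10^4 ,
\]

where $F_n$ is the distribution of the standardized sum of $n$ contributors and $\Phi$ is the Gaussian limit.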

Furthermore, the proportionality constant in the convergence formula changes rapidly with distance from the center of the distribution, so that convergence is very slow in the tails. This guarantees that the tails of real error distributions are always far from Gaussian.

In the last 30 years, the implications of these deviations from ``normality'' for practical data analysis have become widely appreciated by statisticians. Traditionally, the excess of large errors was handled by applying the method of least squares, after rejecting some subset of the data that appeared suspiciously discordant. There are several problems with this approach.

First, the decision whether to keep or reject a datum has an arbitrary character, and a great deal of experience is needed to obtain reliable results. Manual rejection may be impractical for large data sets, and automated rejection rules are known to perform poorly. Second, rejection criteria based on a fixed number of standard deviations produce no rejections at all when the number of degrees of freedom is small, because a single aberrant point greatly inflates the estimated standard deviation ([12], pp. 64 - 69). The common ``3-$\sigma$'' rule rejects nothing in samples of 10 or fewer, no matter how large the biggest residual is: the inflation of the estimated standard deviation by a single wild point always outruns that point's residual. But one rarely measures the same star 10 times; for the more typical sample sizes of 3 and 4, the largest possible residual is only 1.15 and 1.5 times the estimated standard deviation, respectively. Third, including or rejecting a single point typically introduces discontinuous changes in the estimated parameters that are comparable to their estimated errors, so that the estimates undergo relatively large jumps in response to small changes in the data. We would have more trust in estimators that are continuous functions of the data.
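
It is easy to check these numbers. The sketch below (an illustration, not part of the reduction program) uses Samuelson's inequality, which bounds the largest possible residual in a sample of $n$ points by $(n-1)/\sqrt{n}$ times the sample standard deviation:

  # Largest possible residual, in units of the sample standard deviation,
  # from Samuelson's inequality: |x_i - mean| <= (n - 1)/sqrt(n) * s.
  from math import sqrt

  for n in (3, 4, 5, 10, 11):
      bound = (n - 1) / sqrt(n)
      verdict = "can" if bound > 3 else "cannot"
      print(f"n = {n:2d}: largest residual = {bound:.2f} sigma; "
            f"a 3-sigma rule {verdict} reject it")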

Finally, the nature of most observed error distributions is not that data are clearly either ``good'' or ``bad'', but that the few obviously wrong points are accompanied by a much larger number of ``marginal'' cases. Thus the problem of rejection is usually not clear-cut, and the data analyst is left with doubts, no matter where the rejection threshold is set. The reason for this situation is also well understood: most data are affected by error sources that vary, so that the ``marginal'' cases represent data gathered when the dominant error source was larger than average. Such observations are not ``wrong'', though they clearly deserve smaller weights than those with smaller residuals.

In particular, we know that photometric data are afflicted with variable errors. For example, scintillation noise can vary by a factor of 2 on time scales of a few minutes; and by an additional factor of $\sec Z$ at a given air mass, depending on whether one observes along or at right angles to the upper-atmospheric wind vector. Menzies and Laing [17] discuss other possible sources of error. Therefore, we know we must deal with an error distribution that is longer-tailed than a Gaussian. Furthermore, both scintillation and photon noise are decidedly asymmetrical. As these are the main sources of random error in photometric observations, we can be sure that we never deal with truly Gaussian errors in photometry.

Unfortunately, the method of least squares, which is optimal for the Gaussian distribution, loses a great deal of its statistical efficiency for even slightly non-Gaussian errors. (Statistical efficiency simply refers to the number of observations you need to get a desired level of reliability. If one estimator is twice as efficient as another, it will give you the same information with half as many observations.) The classical example is Tukey's contaminated distribution. Suppose all but some fraction $\epsilon$ of the data are drawn from a normal distribution, and the remainder are drawn from another Gaussian that is three times as broad. Tukey [23] asked for the level of contamination $\epsilon$ that would make the mean of the absolute values of the residuals (the so-called average deviation, or A.D.) a more efficient estimator of the population width than the standard deviation, which is the least-squares estimator of width.

Although the mean absolute deviation has only 88% of the efficiency of the standard deviation for a pure Gaussian, Tukey found that less than 0.2% contamination was enough to make the A.D. more efficient. The reason is simply that least squares weights large errors according to the squares of their magnitudes, which gives them an unreasonably large influence on the results.
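
The effect is easy to reproduce by simulation. The following sketch is only an illustration (the sample size, number of repetitions, and contamination levels are arbitrary choices); it compares the two width estimators by their standardized variances, so that their different scale factors do not matter. Because the crossover lies near 0.2% contamination, Monte Carlo noise can blur that particular value, but the trend is unmistakable:

  # Tukey's contaminated-Gaussian example: a fraction eps of the data come
  # from a Gaussian three times as broad as the rest.  Compare the standard
  # deviation with the average (absolute) deviation as estimators of width.
  import numpy as np

  rng = np.random.default_rng(1)
  n, reps = 1000, 4000          # sample size and number of simulated samples

  def rel_var(est):
      # standardized variance, so estimators with different scale factors
      # can be compared on an equal footing
      return est.var() / est.mean() ** 2

  for eps in (0.0, 0.002, 0.01, 0.05):
      wide = rng.random((reps, n)) < eps
      x = rng.normal(size=(reps, n)) * np.where(wide, 3.0, 1.0)
      sd = x.std(axis=1, ddof=1)                                   # least-squares width
      ad = np.abs(x - x.mean(axis=1, keepdims=True)).mean(axis=1)  # average deviation
      print(f"eps = {eps:5.3f}: efficiency of the A.D. relative to the "
            f"s.d. = {rel_var(sd) / rel_var(ad):4.2f}")

At zero contamination this prints roughly 0.88, in agreement with the figure quoted above; at a few per cent contamination the A.D. is already markedly more efficient.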

Similar, though less spectacular, results exist for position estimators. For example, about 10% contamination is enough to make the median as efficient as the mean (the least-squares estimator); while several ``robust'' estimators are some 40% more efficient than the mean at this level of contamination. Real data seem to be somewhat longer tailed than this, so the mean (i.e., least squares) is typically even worse than this simple example suggests.
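
A similar simulation (again only a sketch, with arbitrary sample size and repetition count) shows the corresponding effect for location: for pure Gaussian data the median has only about two-thirds of the mean's efficiency, but at 10% contamination the two are nearly even, up to Monte Carlo noise:

  # Mean versus median under Tukey-style contamination: a fraction eps of
  # the points are drawn from a Gaussian three times as broad.
  import numpy as np

  rng = np.random.default_rng(2)
  n, reps = 100, 20000

  for eps in (0.0, 0.10):
      wide = rng.random((reps, n)) < eps
      x = rng.normal(size=(reps, n)) * np.where(wide, 3.0, 1.0)
      eff = x.mean(axis=1).var() / np.median(x, axis=1).var()
      print(f"eps = {eps:4.2f}: efficiency of the median relative to the mean = {eff:4.2f}")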

Because convergence of the central limit theorem is much faster near the center of the error distribution than in the tails, we can expect real error distributions to be nearly Gaussian in the middle, and this is in fact observed to be true. A practical approach to data analysis is then to treat the bulk of the data (in the middle of the distribution) as in least squares; but to reduce the contribution of the points with large residuals, which would be rare in a genuinely Gaussian distribution, in a smooth and continuous fashion.

There is now a large literature on ``robust'' estimation -- that is, on methods that are less critically dependent on detailed assumptions about the actual error distribution than is least squares. They can be regarded as re-weighted least squares, in which the weights of data with moderate to large residuals are decreased smoothly to zero. There are many ways to do this; all produce rather similar results. The really ``wild points'' are completely rejected; the marginal cases are allowed to participate in the solution, but with reduced weight. The result is only a few per cent less efficient than least squares for exactly Gaussian errors, and much better than least squares -- typically, by a factor of the order of two -- for realistic error distributions. These methods are also typically 10% or so more efficient than results obtained by experienced data analysts using careful rejection methods ([12], pp. 67 - 69).
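
Schematically (the notation here is illustrative, not taken from the references): instead of minimizing the sum of squared residuals $\sum_i r_i^2$, such an estimator minimizes

\[
  \sum_i \rho\!\left(\frac{r_i}{s}\right),
\]

where $s$ is a robust estimate of scale and $\rho$ grows more slowly than the square for large arguments. Setting the derivative to zero gives the ordinary weighted least-squares equations with weights

\[
  w_i \;=\; \frac{\psi(r_i/s)}{r_i/s}, \qquad \psi \equiv \rho' ,
\]

which is why these estimators can be computed as re-weighted least squares.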

The particular method used here for reducing photometric data is known to statisticians as ``Tukey's biweight''; it is easy to calculate, and produces results of uniformly high efficiency for a range of realistic distributions. To avoid convergence problems, the iteration does not start from the biweight itself: it begins with a very robust but inefficient estimator, such as the median or its offspring, Tukey's robust line [13]; switches to Huber's M-estimator for initial refinement until the scale is well established; and then iterates to the final values using the biweight. If you are unaware of the need to precede the biweight with an estimator having a non-redescending influence function, don't worry; this sequence is known to be numerically stable.
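
As a concrete illustration, here is a minimal sketch of that sequence for the simplest case, a robust mean of repeated measurements of a single quantity. The tuning constants (1.345 for Huber and 4.685 for the biweight, in units of a MAD-based scale), the fixed scale, and the iteration counts are conventional textbook choices, not the values used by the reduction program:

  import numpy as np

  def robust_mean(x, n_iter=10):
      x = np.asarray(x, dtype=float)
      loc = np.median(x)                            # very robust starting value
      scale = 1.4826 * np.median(np.abs(x - loc))   # MAD, scaled to match sigma
      if scale == 0.0:
          return loc                                # data essentially identical

      # Huber's M-estimator: monotone (non-redescending) influence function,
      # safe to iterate from a crude starting point
      for _ in range(n_iter):
          u = np.abs(x - loc) / scale
          w = np.ones_like(u)
          w[u > 1.345] = 1.345 / u[u > 1.345]
          loc = np.average(x, weights=w)

      # Tukey's biweight: redescending, so wild points get exactly zero weight
      # and marginal points a smoothly reduced weight
      for _ in range(n_iter):
          u = (x - loc) / (4.685 * scale)
          w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
          if w.sum() == 0.0:
              break
          loc = np.average(x, weights=w)
      return loc

  print(robust_mean([10.01, 10.03, 9.98, 10.02, 13.70]))  # the wild point is ignored

In this sketch the scale is frozen at its initial MAD value for simplicity; the actual procedure also refines the scale during the Huber stage, as described above.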

As robust methods depend on ``majority logic'' to decide which data to down-weight, they obviously require a certain amount of redundancy. One cannot find even a single bad point unless there are at least three to choose from (corresponding to the old rule about never going to sea with two chronometers). Therefore, it is better to obtain a large number of short integrations than a smaller number of longer ones, provided that the repetitions are separated far enough in time to be independent. The planning program will help you get the necessary data.

In summary, photometric data are known to have decidedly non-Gaussian error distributions; so we use methods designed to be nearly optimal for these distributions, rather than the less reliable method of least squares. These methods are closely related to least squares, but are much less sensitive to the bigger-than-Gaussian tails of real error distributions. From the point of view of the average user, the methods employed here are simply a more effective refinement of the old method of rejecting outliers.

The advantage of using these well-established, modern methods is a gain in efficiency of some tens of per cent -- exactly equivalent to increasing the amount of observing time by the same fraction. It is about like getting an extra night of observing time per week. This advantage is well worth having.

