# Introduction

An essential component of calculations is to calibrate new methods, and to use the results of calculations to predict or rationalize the outcome of experiments. Both of these types of investigation compare two types of data and the interest is in characterizing how well one set of data can represent or predict the other. Unfortunately, one or both sets of data usually contain "noise", and it may be difficult to decide whether a poor correlation is due to noisy data or to a fundamental lack of connection. Statistics is a tool for quantifying such relationships. We will start with some philosophical considerations and move into elementary statistical measures, before embarking on more advanced tools.

The connection between reality and the outcome of a calculation can be illustrated as shown in Figure 17.1.

| Model | Parameters | Computational implementation | Results | Reality |
|---|---|---|---|---|
| Hartree-Fock | Basis set | Various cutoffs | Total energies | Atomization energy |

### Figure 17.1 Relationship between Model and Reality

A specific example for "Reality" could be the (experimental) atomization energy of a molecule, defined as the energy required to separate a molecule into atoms, which is equivalent to the total binding energy. The atomization energy is closely related to the heat of formation, differing only by the zero point reference state and neglect of vibrational effects. For the atomization energy the zero point for the energy scale is the isolated atoms, while for the heat of formation it is the elements in their most stable form (e.g. H₂ and N₂). Since the dissociation energies for the reference molecules can also be measured, the atomization energy is an experimentally observable quantity.

It is important to realize that each element in Figure 17.1 contains errors, and these can be either systematic or random. A systematic error is one due either to an inherent bias or to a user-introduced error. A random error is, as the name implies, a non-biased deviation from the "true" result. A systematic error can be removed or reduced once the source of the error is identified. A random error, sometimes also called a statistical error, can be reduced by averaging the results of many measurements. Note that random errors can be truly random, for example due to thermal fluctuations or a cosmic ray affecting a detector, but may also be due to many small unrecognized systematic errors adding up to apparently random noise.
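The different behaviour of the two error types under averaging can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" value, the constant bias and the Gaussian noise width are invented numbers, and `simulate_measurements` is a hypothetical helper, not something taken from the text.

```python
import random
import statistics

def simulate_measurements(true_value, bias, noise_sd, n, seed=0):
    """Return n synthetic measurements: true value + constant bias + Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

TRUE_VALUE = 100.0  # hypothetical "true" result (arbitrary units)
BIAS = 2.0          # hypothetical systematic error
NOISE_SD = 5.0      # hypothetical width of the random error

for n in (10, 1000, 100000):
    data = simulate_measurements(TRUE_VALUE, BIAS, NOISE_SD, n)
    mean = statistics.fmean(data)
    # Standard error of the mean: shrinks roughly as 1/sqrt(n)
    sem = statistics.stdev(data) / n ** 0.5
    print(f"n={n:6d}  error of mean={mean - TRUE_VALUE:+.3f}  std. error={sem:.3f}")
```

As the number of measurements grows, the scatter of the average shrinks, but the average settles at the true value plus the bias rather than at the true value itself. Averaging alone therefore cannot reveal or remove the systematic part.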

Experimental measurements may contain both systematic and random errors. The latter can be quantified by repeating the experiment a number of times and taking the deviation between these results as a measure for the uncertainty of the (average) result. Systematic errors, however, are difficult to identify. One possibility for detecting these is to measure the same quantity by different methods, or using the same method in different laboratories. The literature is littered with independent investigations reporting conflicting results for a given quantity, each with error bars smaller than the deviation between the results. Such cases clearly indicate that at least one of the experiments contains unrecognized systematic errors.
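As a minimal sketch of how repeated measurements translate into an uncertainty, and how two independent results can be compared, consider the following. The laboratory data are invented for illustration, the function names are hypothetical, and combining the two uncertainties in quadrature assumes independent, roughly Gaussian random errors.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Mean of repeated measurements and the standard error of that mean."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

def discrepancy_in_sigma(mean_a, sem_a, mean_b, sem_b):
    """Difference between two results in units of their combined uncertainty."""
    return abs(mean_a - mean_b) / math.hypot(sem_a, sem_b)

# Hypothetical repeated measurements of the same quantity in two laboratories
lab_a = [10.1, 10.3, 9.9, 10.2, 10.0]
lab_b = [11.0, 11.2, 10.9, 11.1, 11.3]

mean_a, sem_a = mean_and_standard_error(lab_a)
mean_b, sem_b = mean_and_standard_error(lab_b)
z = discrepancy_in_sigma(mean_a, sem_a, mean_b, sem_b)

print(f"lab A: {mean_a:.2f} +/- {sem_a:.2f}")
print(f"lab B: {mean_b:.2f} +/- {sem_b:.2f}")
print(f"discrepancy: {z:.1f} combined standard errors")
```

When the discrepancy amounts to many combined standard errors, random scatter is an unlikely explanation, which points to an unrecognized systematic error in at least one of the two data sets.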

Theory almost always contains "errors", but these are called "approximations" in the community. The Hartree-Fock method, for example, systematically underestimates atomization energies since it neglects electron correlation, and the correlation energy is larger for molecules than for atoms. For other properties, the Hartree-Fock method has the same fundamental flaw, neglect of electron correlation, but this may not necessarily lead to systematic errors. For energy barriers for rotation around single bonds, which are differences between two energies for the same molecule with (slightly) different geometries, the contribution from the correlation energy is small, and Hartree-Fock calculations do not systematically over- or underestimate rotational barriers.

The use of a basis set also introduces a systematic error, but the direction depends on the specific basis set and the molecule at hand. For a system composed of first row elements (such as C, N, O), the isolated atoms can be completely described with s- and p-functions at the Hartree-Fock level, but molecules require the addition of higher angular momentum (polarization) functions. Using a basis set containing only s- and p-functions will systematically underestimate the atomization energy, while a basis set containing few s- and p-functions but many polarization functions may overestimate the atomization energy. In principle one should choose a balanced basis set, defined as one where the error for the molecule is almost the same as for the atoms, but since the number of basis functions of each kind necessarily is quantized (one cannot have a fractional number of basis functions), this is not rigorously possible, and will depend on the computational level in any case. A very large (complete) basis set will fulfil the balance criterion but is usually impossible in practice. An example of a (systematic) user error is the use of one basis set for the molecule and another for the atoms, as is sometimes done by inexperienced users of electronic structure methods.
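The direction of the basis set error can be made explicit with a short worked equation. This is a sketch assuming a variational method such as Hartree-Fock, where a finite basis set raises each energy above its complete basis set value; the symbols $\delta_M$ (basis set error for the molecule) and $\delta_A$ (basis set error for atom $A$) are introduced here for illustration only.

$$
\begin{aligned}
D^{\mathrm{calc}} &= \sum_A E_A^{\mathrm{calc}} - E_M^{\mathrm{calc}} \\
&= \sum_A \left(E_A + \delta_A\right) - \left(E_M + \delta_M\right) \\
&= D + \sum_A \delta_A - \delta_M , \qquad \delta_A,\ \delta_M \ge 0
\end{aligned}
$$

Here $D$ is the atomization energy in the complete basis set limit at the same level of theory. With only s- and p-functions, $\delta_A \approx 0$ for first row atoms at the Hartree-Fock level while $\delta_M > 0$, so $D^{\mathrm{calc}} < D$. With few s- and p-functions but many polarization functions, $\sum_A \delta_A$ can exceed $\delta_M$, giving $D^{\mathrm{calc}} > D$. A balanced basis set corresponds to $\sum_A \delta_A \approx \delta_M$, where the two errors largely cancel.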

The computational implementation of a Hartree-Fock calculation involves choosing a specific algorithm for calculating the integrals and solving the HF equations. In addition, various cutoff parameters are usually implemented for deciding whether to neglect certain integral contributions, and a tolerance is set for deciding when the iterative HF equations are considered to be converged. Since computers perform arithmetic with a finite precision, given by the number of bits chosen to represent a number, this introduces truncation errors, which are of a random nature (roughly as many negative as positive deviations for a large sample). These random errors can be reduced by increasing the number of bits per number and by using smaller cutoffs and tighter convergence criteria, but can never be completely eliminated. Usually these errors can be controlled, and reduced to a level where they are insignificant compared with the other approximations, such as neglect of electron correlation and the use of a finite basis set.
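The effect of finite precision arithmetic is easy to see even outside an electronic structure program. The sketch below is a generic illustration, not part of any Hartree-Fock code: it adds the value 0.1 ten times in standard double precision and compares the result with the error-compensated summation `math.fsum` from the Python standard library. The helper `naive_sum` is a hypothetical name used only here.

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation in double precision."""
    total = 0.0
    for v in values:
        total += v  # each addition is rounded to the nearest representable number
    return total

values = [0.1] * 10  # 0.1 has no exact binary representation

naive = naive_sum(values)
compensated = math.fsum(values)  # tracks the lost low-order bits

print("naive sum equals 1.0:      ", naive == 1.0)
print("compensated sum equals 1.0:", compensated == 1.0)
print("deviation of naive sum:    ", abs(naive - 1.0))
```

The deviation is tiny for ten numbers, but an electronic structure calculation accumulates a very large number of such operations, which is why the precision, cutoffs and convergence thresholds need to be chosen so that the accumulated error stays well below the other approximations.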