Empirical Bayes approach to climate model calibration

Climate data is highly correlated through the physics and dynamics of the atmosphere. Model evaluation often involves averages of various quantities over different regions and seasons making it difficult from a statistical perspective to quantify the significance of differences that arise between a model and observations. Here we present a strategy that makes use of a set of perfect modeling experiments to quantify the effects of these correlations on model evaluation metrics. This 5 information is incorporated into Bayesian inference through a precision parameter with informative priors. These concepts are illustrated through an example of fitting a line through data that includes either uncorrelated or correlated noise as well as to the calibration of CAM3.1. The concept of a precision parameter may be applied as a strategy to weight different climate model evaluation metrics within a multivariate normal framework. From the example with CAM3.1, the precision parame10 ter plays a central role in rescaling the estimated parametric uncertainties to better accommodate modeling structural errors.


Introduction
Within Bayesian inference, calibration refers to the problem of estimating model parameters given observations and uncertainties that affect model-data differences.The result of a calibration is a 15 joint probability distribution whose maximum density identifies the region of parameter space where model performance is optimal in the sense that a weighted sum of distance between the model and observations is minimized.Among the challenges that exist for Bayesian calibration of climate models is the question of how to assign weights to various observational targets, many of which are highly correlated through the physics and fluid dynamics of the atmosphere.Here we present and 20 discuss the use of a precision parameter which is a common device in Bayesian calibration to make use of information concerning how data is scattered about the model to better quantify uncertainties affecting model calibration.
One of the issues that needs to be considered in the application of a precision parameter to climate model calibration is that uncertainties affecting model evaluation are high dimensional.A scalar 25 will have limited scope to compensate for information that is missing within an evaluation metric 1 Geosci.Model Dev. Discuss., doi:10.5194/gmd-2016-20, 2016 Manuscript under review for journal Geosci.Model Dev.Published: 4 March 2016 c Author(s) 2016.CC-BY 3.0 License.concerning the many space and field dependencies that exist in the data.Thus one of our objectives here is to show both the usefulness and limitations of using a scalar parameter to assess climate model parametric uncertainties.
Ideally, dependency information is included within model evaluation metrics as it would provide 30 an excellent way to determine whether a climate model is capturing observed climate phenomena for the right reasons.However, it is still difficult to represent such dependency information mathematically (Mu et al., 2004).Here we explain how the use of a precision parameter can be enhanced by additional information that can be obtained from a set of perfect modeling experiments in which the only errors producing differences between a model and synthetic 'observations' come from internal 35 variability.This inherently empirical approach allows us to assess the significance of the distance between a model and data even though we don't have a perfect representation of dependency information.This circumstance is perhaps unique to climate models which are able to represent important aspects of the chaotic motions of the fluid atmosphere and ocean and processes affecting the radiative transport of energy between the top of the atmosphere and surface.Indeed, it would be difficult 40 to design a more sophisticated statistical model of climate modeling errors than climate models themselves.
Sections 2 and 3 provides background material concerning climate model calibration.While not new, this material may be unfamiliar to physical scientists and will be important to the understanding and application of ideas concerning use of idealized modeling experiments to help establish infor-45 mative priors to help deal with the sometimes arbitrary ways models are tested against observational data.
2 Climate model evaluation metrics

Multivariate normal metric
In order to explain the effects of dependences on parameter uncertainties, consider a multivariate nor-50 mal statistical model for assessing the significance of the distance between the output of a simulation with two observables d 1 and d 2 .Suppose x1 is an estimate of observations d 1 using climate model g(m) with parameters m.If errors are normally distributed, which is a good assumption for monthly mean or longer climate data, x1 = d 1 + 1 , where 1 is the model errors with 1 ∼ N (0, σ 2 1 ). 1 includes both the modeling error due to the grid resolution, the possibility of unparameterized physical 55 processes (i.e.missing physics), uncertainty in the initial and surface forcing conditions, and the instrumental error.Many of these types of errors are hard to estimate particularly if the process that generated them are correlated in space and time.
If it is assumed that 1 is distributed as a Gaussian with variance σ 2 1 , the likelihood of simulating data d 1 with the model g(m) using parameters m is given by In the case where the model simulates correlated observations d 1 and d 2 , the joint likelihood for simulating this data given the same model can be expressed as where ρ is the correlation coefficient of observables becomes the product of the two Gaussian distributions θ(d we would get if d 1 and d 2 were independent.Note that the probability density is not defined for the case where the observations are perfectly correlated, ρ = 1.Written in matrix notation, the joint 70 probability distribution for correlated observables d 1 and d 2 is where The definition for k observations is which only works mathematically if C −1 exists.Since many observations of climate are strongly correlated there is a good chance that the covariance matrix is rank deficient, its inverse is singular, and its determinant is 0. The typical solution to this problem is to apply singular value decomposition to C to identify a reduced number k e of orthogonal dimensions commonly referred to as empirical 80 orthogonal functions or 'eofs' in the atmospheric sciences (e.g., Mu et al., 2004).The argument of the exponent, when rotated into this orthogonal basis and truncated to include the first k e vectors associated with the largest eigenvalues, follows a χ 2 (ke) distribution with k e degrees of freedom If the model is unbiased, the χ 2 (ke) distribution in this case will have an expected mean of k e and a 85 variance of 2k e .Thus within a multivariate normal framework for testing a model against a set of observational targets, the average value of the argument of the exponent within equation (3), which tainties in parameters.One can think of it as representing uncertainties in the specification of C −1 .
Simply scaling C −1 can not affect the relative weighting between different observational targets.
However such a factor can affect the presumed strength of the observational evidence, similar to changing the effective degrees of freedom, such that solution probabilities are narrower or larger depending on the size of model-data mismatch.By introducing S as an uncertain parameter in ad-100 dition to climate model parameters m we need to consider a distribution for representing its 'prior' uncertainty.Moreover, because of the relationships between S and the covariance matrix, the value of S needs to scale with the errors that exist between a model and data.The point of this section is to describe the selection of a prior distribution for S and to provide guidance on how its co-dependency with model parameters m can be estimated through a hybrid of Metropolis and Gibbs sampling 105 (Gelman et al., 2013).
The Bayes expression for estimating a set of parameters m and parameter S conditional on obser- We start by assuming the priors for m and S are independent, i.e. p(m, S) = p(m)p(S).One choice 110 for prior p(S) is the gamma distribution, p(S) ∼ Ga(α, β) with shape parameter α > 0 and rate parameter β > 0, This choice of the functional form for S is convenient as it is conditionally conjugate to the multivariate normal formulation of the likelihood function, which facilitates the use of Gibbs sampling 115 for iteratively estimating co-dependencies between model parameters m and S within the likelihood.
The Ga(α, β) distribution appears as a skewed distribution with a mean and variance provided by where E(m) is the metric of climate model performance, which for a multivariate normal errors is the argument of the exponential given in equation ( 5) The term S ke 2 is the pre-exponential factor for a Gaussian distribution which due to uncertainties in S can no longer can be assumed to be constant.One may think of S ke 2 as a scalar factor affecting |C| 1 2 in equation ( 5) which includes S as the factor affecting the precision of each of the k e independent observations 130 such that An important implication for assuming independence between priors for m and S is that we now can use Gibbs sampling to iteratively estimate their co-dependency using two separate equations: and The last expression for equation ( 16) comes from the the product of equation ( 11) and equation ( 8 3. Repeat steps 1 and 2 several times until convergence is achieved.
According to this sampling strategy, the mean value for S conditional on m will be The mean value for S decreases with increasing errors between the model and data.As S decreases, 150 the likelihood function becomes less discriminating of alternate models and estimates of confidence intervals will increase in range.As will become clear in the example, the increase in range is not arbitrary.It is precisely what is needed to account for any errors in the description of uncertainty in the climate model metric as given by E(m), at least when the model is 'perfect'.

Empirical Bayes 155
The sampling of S within equation ( 16) depends on parameters k e , α, and β.While it is always necessary to provide information about k e , it is common to select non-informative values for α and β, such as α = 0 and β = 0, which would allow posterior estimates of S to be dominated by information about model-data misfit coming from the log-likelihood (equation 12).As climate model metrics often include multiple observational targets and quantities that are averaged over different 160 regions and seasons, we propose a process to make use of a set of 'perfect' modeling experiments to provide additional information about how all these different quantities could be weighted using S. A 'perfect' model is one where we replace model output for observational data.Repeated experiments with different initial conditions explores the effects of internal variability on the metric of model errors that is used within the likelihood function.Since we are using data to inform the prior, this is 165 generally referred to as empirical Bayes.Sections 5 and 6 will apply empirical Bayes to the example of fitting a line to data and discuss the application to climate model calibration.

Effective degrees of freedom k e
Here we determine the effective number of degrees of freedom for a set of experiments in which we use a climate model to simulate the effects of correlated noise on an evaluation metric.We start by 170 assuming that the evaluation metric will be proportional to a χ 2 (ke) distribution with k e degrees of freedom, where A is an unknown constant.E is related to degrees of freedom k e by < E >= Ake 2 and var(E) = A 2 ke 2 where the angle brackets denote an ensemble mean and var(E) is the ensemble 175 variance using N exp number of perfect modeling experiment samples.It follows by substitution that

Scale and rate parameters α and β
Using N exp perfect modeling experiments, one may estimate values for the gamma distribution parameters α and β.The goal is to select values for S that would allow N exp samples of SE(m) to generate the same mean and 95% credible interval as a 1 2 χ 2 (ke) distribution with k e degrees of freedom, where 1 2 k e is the variance of 1 2 χ 2 (ke) and σ E(m) is the standard deviation of E(m) using N exp samples.The variance in S is based on the uncertainty in estimating σ E(m) with only N exp samples, where samples of SNexp are generated according to 195 Here {σ χ 2 (ke ) } Nexp is an estimate of σ χ 2 (ke) using N exp samples.{σ χ 2 (ke ) } Nexp may be estimated any number of times.Note that < SNexp >∼ 1 because S represents the scaling parameter for a cost function in which the effective degrees of freedom is known.The expression for α ends up being independent of the perfect modeling experiments, 200 5 Example: fitting a line

Uncorrelated noise
Suppose one has a linear model of some data d obs taken at 100 points x.
In this first case the data is noisy and uncorrelated with i ∼ N (0, 25) and is generated with a = 2.5 Assuming all data are independent, i.e. k e = 100, Bayesian estimates of the optimal slope and intercept and their uncertainties are nearly equivalent to estimates based on least squares estimation (Figure 2; Table 1).The solutions are also nearly equivalent to estimates using non-informative values for α and β, i.e. α = 0 and β = 0.The two estimates are the same since the log-likelihood and the perfect modeling experiments assumed (correctly) that the data were uncorrelated data, resulting 215 in similar posterior estimates of S ∼ 1.

Correlated noise
We next consider the same problem but now where the data is affected by correlated noise.Estimates of the empirical Bayes parameters presumes that we have an adequate model of the uncertainties affecting the data.For this example we use a reddening process with parameter r to generate correlated 220 noise η from uncorrelated noise i ∼ N (0, 25),  where i is an index of consecutive data points.The model with correlated noise is Such a reddening process occurs naturally within climate data from the way the ocean integrates 225 weather noise from the atmosphere, producing power at longer time scales.Figure (3) shows the same model as before but with noise drawn from a red noise process with r = 0.8 (equation 25).
Correlations affect systematic changes in slope and intercept over a range of x values (Figure 3).
Thus we should expect that inferences of the slope and intercept are more uncertain.
Using N exp = 80 perfect modeling experiments with a red noise process model affecting 100 The results show that the largest uncertainties come from the lower estimates of the effective 240 degrees of freedom and uninformative priors for S. Including an informative prior for S can reduce this uncertainty and still (barely) accommodate the true values of the slope and intercept (Figure 3).However this is not true for providing informative priors for S without good information about the effective degrees of freedom.From this respect is it perhaps more important to provide good information about the effective degrees of freedom than informative priors for S. The following  < S q > for each quantity q.
Suppose one has created a set of N q model evaluation metrics E(m) q providing a normalized 255 measure of distance between a model and observations for different fields, regions, or seasons as specified by index q.Summing these cost components assumes these measures are independent, which is not likely true.With N exp perfect modeling experiments one can generate separate estimates of {k e } q and < S q > such that each component may be normalized separately and together such that the total cost function may be generated by where p(S tot ) ∼ Ga(α, β).As formulated here, this process does not give any flexibility to adapt individual component weightings during sampling as the gamma priors only apply to S tot .However this process does give opportunity to include correlation information, albeit that generated by the perfect model, into the total cost metric.calibration that is related to the calculation reported in Jackson et al. (2008) in which Multiple Very Fast Simulated Annealing (MVFSA) sampling (Jackson et al., 2004) is used to calibrate six parameters important to clouds, convection, and radiation within Community Atmosphere Model version 3.1 (CAM3.1)(Collins et al., 2006) with a resolution of 2.8 • longitude by 2.8 • latitude (T42) with 26 vertical levels.Important differences from the previous calculation include shorter 2-year rather than 11-year model integrations with specified climatological sea surface temperatures and sea ice, using only the last year for comparison to observations, and the use of ERA-40 (Uppala et al., 2005) for vertically averaged (mass weighted) air temperature and humidity, zonal winds at 300 mb, and sea level pressure, Willmott and Matsuura (2000) for 2-m air temperature over land, CERES2 (Loeb et al., 2010) for long and shortwave cloud forcing, OAFlux data (Yu et al., 2008) for latent heat fluxes over the ocean, and GPCP (Adler et al., 2009)  The metrics for each of these observational targets consists of taking the spatial mean of grid point differences divided by the spatial variance in the observational data.Global radiative balance was the exception insofar as we were seeking solutions that were within a 1 Wm −2 of the target radiative balance of 0.5 Wm −2 .A set of 20 idealized experiments were used to estimate the variance for each field and season which was used to weight each component of the cost according to equation 27.
The total cost calculation was given by the average of 41 cost components (ten quantities over four seasons and the component for global radiative balance).The calculation does not include prior information about degrees of freedom, but does include prior information for the gamma parameters MVFSA sampling is particularly efficient at providing qualitative estimates of the posterior probability of the solution uncertainties (Jackson et al., 2008;Villagran et al., 2008).The marginal distributions for the six CAM3.1 model parameters and precision parameter S as well as the distribution of cost values for this ensemble is shown in Figure 4.The marginals are fairly broad, with many 310 of the sixteen optimal parameter configurations covering a wide range of values.However that is not to say these parameters are unimportant.On the contrary, these parameters have a significant influence on simulated climates.The broad posterior distributions come about, in part, because of co-dependencies that arise during sampling model parameters to be consistent with observational data (Table 3).For instance the critical relative humidity for high clouds (rhminh) which perhaps 315 has the flattest distribution, is also the parameter that is the most strongly dependent on the choice of all the other parameters.The posterior mean value for < S >= 0.19 (Figure 4h) is quite a bit smaller from the prior mean value of < S >= α β = 367.Costs using observational data are quite a bit larger than those that occurred in the idealized experiments where model output was used as a surrogate for observational data.This has the effect of making the sampling algorithm more accepting of alternate model configurations and increasing the parametric uncertainties.Figure 5 illustrates which model configurations 325 were accepted or rejected by different choices of parameters rhminh and rhminl which control the critical relative humidity for cloud formation for high clouds and low clouds, respectively.Estimates of the response surface, given by changes in cost values as a function of these two parameters, show that the parameter settings that allow CAM3.1 to have the lowest total cost (Figure 5a) are not the same for particular cost components such as column average relative humidity (Figure 5b), global 330 radiative balance (Figure 5c), and shortwave cloud forcing (Figure 5d).The models that ended up being selected most frequently (and define the modes of the posterior distribution) represents a compromise.The sixteen model configurations representing separate optimizations (given by the red dots) are distributed near the center of these competing choices.All these configurations have a cost associated with their radiative balance of less than 50, which corresponds to models being within 335 4 Wm −2 of the target radiative balance.So while the cost function for radiative balance specified acceptability to be models that were within 1 Wm −2 of radiative balance, the high cost values and S allowed the sampling algorithm to accept models with comparable skill that happen to include models with a larger energy imbalance.

Summary 340
The strength of observational evidence is a key part in testing hypotheses concerning the physics of climate change.Since much of the data that is used to evaluate hypotheses are correlated, it becomes important to include the effects of these correlations on model evaluation metrics to appropriately account for observational evidence when testing different model versions.Here we proposed a way to make use of perfect modeling experiments as a surrogate for missing observational data to esti- While assumptions of multivariate normal errors are often appropriate for long-term mean climate data, no such assumption can be made for model biases.So while the use a precision parameter 355 is one way to expand uncertainties in proportion to model biases, prediction errors are not related in any obvious way to commonly proposed climate model evaluation metrics (Klocke et al., 2011;Masson and Knutti, 2013;Sanderson and Sanderson, 2013;Caldwell et al., 2014).While some have proposed including a statistical model that compensates biases in model predictions (Brynjarsdottir and O'Hagan, 2014), it is not clear how this can be accomplished for high-dimensional systems.One 360 of the advantages of using precision parameter S is that its interpretation is fairly straightforward mathematically and scientifically.This makes it a good starting point to think about how to address even more challenging aspects of quantifying uncertainties in climate projections using observational data.
9 Code availability 365 Matlab and R code for generating the results shown in figures ( 2) and (3) may be found at Jackson and Huerta (2015).
Dev. Discuss., doi:10.5194/gmd-2016-20,2016  Manuscript under  review for journal Geosci.Model Dev.Published: 4 March 2016 c Author(s) 2016.CC-BY 3.0 License.With the proposed gamma distribution for S, the distribution for the likelihood function, assuming errors are multivariate normal, is l(d obs |m, S) ∝ S ke 2 exp(−SE(m)) ), omitting constant factors.One can iteratively generate a value of m conditional on S and a value of 140 S conditional on m in the following way: 1.To simulate m conditional on S, apply a Markov Chain sampling algorithm for m (equation 15) but just one iteration.2. To simulate S conditional on m, sample from a gamma distribution (equation 16) which has scale parameter ke 2 +α and rate parameter E(m)+β.The values for k e , α, and β are specified 145 according to the empirical Bayes principles outlined in the following section.Geosci.Model Dev.Discuss., doi:10.5194/gmd-2016-20,2016 Manuscript under review for journal Geosci.Model Dev.Published: 4 March 2016 c Author(s) 2016.CC-BY 3.0 License.

Figure 1 .
Figure1.Uncertainty in the estimate of the effective degrees of freedom using equation(19) as a function of the number of perfect modeling experiment samples Nexp.The 2σ uncertainty is expressed as a percentage of ke.
205 and b = 1.Using N exp = 80 perfect modeling experiments (that is experiments where the model generates its own data), one can determine an informative prior for S using equations in Section 4. In our case, these estimates gave α = 140 and β = 161.Repeating this estimate with a different set of perfect modeling experiments will yield similar but not equivalent results.The priors for a and b are distributed uniformly, i.e. p(a) ∝ 1 and p(b) ∝ 1.210

Figure 2 .
Figure 2. Example of fitting a line through noisy uncorrelated data (top) using ke = 100, α = 140, and β = 161.Also shown are solution marginal distributions for slope (a) and intercept (b).Although the estimated slope and intercept (green dashed lines) are slightly offset from their true values (red dashed lines), they are within the uncertainty of the fit.Solutions are nearly identical to those using mostly non-informative choices for prior on S, i.e. p(S) ∼ Ga(α, β) with α = 0, and β = 0.

Table 1 .
Results of different solution strategies for estimating slope (a) and intercept (b) when fitting a line through 100 data points with uncorrelated noise (equation 24).Estimates of optimal values for slope and intercept were nearly identical with a = 2.54 and b = 1.58.Similarly, uncertainties in these estimates were nearly identical whether the Bayes estimates used informed or uninformed priors or estimates were based on a

230
data points, we estimate k e = 20, α = 111, and β = 488.Table2shows the uncertainties in the estimates for the slope and intercept using either informative or uninformative values for the effective degrees of freedom k e or values for α, and β.Estimates using k e = 20 using either informed or uninformed priors give comparable results, both indicating uncertainties are much larger when data are correlated.The least squares estimation solution, while giving identical estimates of the slope and 235 intercept, give much smaller estimates of the uncertainty comparable to estimates with the Metropolis algorithm that assumes data were uncorrelated (i.e.k e = 100).The estimates of uncertainty from the least squares estimation are not large enough to include the model's actual slope and intercept, thus are too narrow.

245Figure 3 .
Figure 3. Example of fitting a line through data with correlated noise using Metropolis sampling and informative priors for ke = 20, α = 111, and β = 488.Also shown are solution marginal distributions for slope (a) and intercept (b).In contrast to the example with uncorrelated data (Figure 2) the estimated optimal parameter values (green dashed lines) are far from the true values (red dashed lines) and are only barely within estimated uncertainties.

2657
Application of S to the calibration of CAM3.1Although a precision parameter S has been included in previous calibrations of CAM3.1(Jackson et al., 2008), it was not fully explained nor didJackson et al. (2008) evaluate the extent to which the outcome of those calculations were being affected by S.Here we present the results of a new Geosci.Model Dev.Discuss., doi:10.5194/gmd-2016-20,2016 Manuscript under review for journal Geosci.Model Dev.Published: 4 March 2016 c Author(s) 2016.CC-BY 3.0 License.
for precipitation.UnlikeJackson et al. (2008), we did not make use of any available data for low, medium, and high cloud amounts, estimates of short and longwave radiation at the top of the atmosphere, or surface latent and sensible heat fluxes other than what was mentioned above concerning OAFlux data.These changes were made to make the calibration mimic the process that occurs within the NCAR Atmosphere Model Working Group for evaluating the effects of different parameter choices as represented by the group's Diagnostics Package (https://www2.cesm.ucar.edu/working-groups/amwg/amwg-diagnostics-package)and "Top 10" Taylor metrics.Similar toJackson et al. (2008), we consider the same six parameters and same 30 • S to 30 • N domain and seasonal (DJF, MAM, JJA, SON) averages for making model comparisons to data as well as include a term in our cost function for global annual mean radiative balance.For the updated calculation, we completed 2261 experiments over 16 independent MVFSA chains.
using the 20 idealized experiments to get α = 3.45 and β = 0.0094.The mean for the prior distribution for S is 367.0.While it would have been better to include prior information about degrees of freedom, the dominant source of uncertainty likely comes from the inability of the model to match all of the data simultaneously.S captures this type of uncertainty and our purpose here is to highlight this aspect of the calibration.The result of the calibration is an ensemble representing uncertainties in specifying seven parameters (six model parameters and S) such that the model is consistent with observational data.While the number of ensemble members are relatively few (2261 samples), previous work has shown that Geosci.Model Dev.Discuss., doi:10.5194/gmd-2016-20,2016 Manuscript under review for journal Geosci.Model Dev.Published: March 2016 c Author(s) 2016.CC-BY 3.0 License.

Figure 4 .
Figure 4. Marginals for 6 parameters within CAM3.1 important to clouds, convection, and radiation as well as for precision parameter S. Marginals are generated from 1948 samples of 2261 total which excludes experiments containing a cost greater than 25.Red dots indicate place in parameter space where each of the 16 chains were minimized.Black dot indicates the default model configuration.The parameters are alfa for the initial cloud downdraft mass flux, tau for the consumption rate of CAPE, ke for the environmental air entrainment rate, rhminh for the high cloud critical relative humidity, rhminl for the low cloud critical relative humidity, and c0 for the precipitation efficiency.

Figure 5 .
Figure 5. Estimates of the response surface for total cost (a), column average relative humidity (b), radiative balance (c), and shortwave cloud forcing (d).Each panel color shows a weighted averaged of experiments whose cost is less than 25 (given by the white '+') Jackson (2009).Experiments whose cost values are greater than 25 are indicated by a black '+' symbol.The sixteen red dots correspond to experiments whose cost were minimum within their respective sampling chains.The black dot corresponds to the default value of CAM3.1.The results of 2261 experiments are shown.All cost values have no units.
and running the python scripts that managed the sampling of CAM3.1 parametric uncertainties on the Lones-tar4 HPC resource at the UT Texas Advanced Computing Center.The portion of this work related to estimating CAM3.1 parametric uncertainties used allocation award ATM100049 from the Extreme Science and Engineering Discovery Environment (XSEDE) program, which is supported by the National Science Foundation.Geosci.Model Dev.Discuss., doi:10.5194/gmd-2016-20,2016 Manuscript under review for journal Geosci.Model Dev.Published: 4 March 2016 c Author(s) 2016.CC-BY 3.0 License.