ClimateLearn : A machine-learning approach for climate prediction using network measures

We present the toolbox ClimateLearn to tackle problems in climate prediction using machine learning techniques and climate network analysis. The package allows basic operations of data mining, i.e. reading, merging, and cleaning data, and running machine learning algorithms such as multilayer artificial neural networks and symbolic regression with genetic programming. Because spatial-temporal information on climate variability can be efficiently represented by complex network measures, such data are considered here as input to 5 the machine-learning algorithms. As an example, the toolbox is applied to the prediction of the occurrence and the development of El Niño in the equatorial Pacific, first concentrating on the occurrence of El Niño events one year ahead and second on the evolution of sea surface temperature anomalies with a lead time of three months.


Introduction
Machine learning is a branch of computer science concerned with automated recognition of (spatio-temporal) 10 patterns from data (Mitchell, 1997).It has been increasingly employed in the study of "big data" with the aim to investigate data syntactically and semantically.In essence, this means an automated search for a best model, given a certain task and corresponding data.A large number of algorithms have been designed for different tasks in which the approach is borrowed from bio-inspired investigations on artificial intelligence (in older times a synonym of machine learning).Given a task, a human learns what to do and -hopefully-optimizes the working 15 schedule according to the given side conditions.This is how machines learn from data: a task is formulated and then a learning process starts, which consists in building statistical models (in terms of probability distributions) or functional models.Eventually, optimality criteria and discriminant functions are used to evaluate the performance of such a model given new data.
The algorithms are divided roughly into three different categories: supervised learning, unsupervised learning reliably forecast an El Niño event one year ahead.
When machine-learning techniques are applied to the prediction of climate variability using data from CNs, one typical task is to infer or 'learn' the dynamics of the climate system from past states and predict its future states.
In this paper, we present a machine-learning approach for climate forecasting using the measures of CNs.The originality and advantage of this approach is that the temporal information is already contained in the measures of 65 the CNs, so the machine-learning techniques will take those into account when making predictions of the future states of the system.This is a big advantage that is not that common in most of the applications where machine learning is used for prediction.In section 2, we start with an explanation on how the data for the machine-learning approach is obtained from complex network analysis.The machine-learning methodology itself is described in section 3 and subsequently applied in section 4 to the prediction of El Niño events.A summary and discussion are 70 given in section 5.

Climate Networks
Climate scientists have been long interested in studying the statistical correlations between observables for gaining a good understanding of the large-scale development of the climate system.By investigating the correlation structures of global or regional fields, such as surface air temperature and geopotential height, much insight is 75 gained into the patterns of climate variability.For example, through such analyses, the Southern Oscillation was discovered by Sir Gilbert Walker and also its relation with the equatorial Tropical sea surface variability, i.e.El Niño, was clarified (Katz, 2002).
Suppose that a certain climate system observable indicated by O below, such as sea surface temperature (SST) or surface atmospheric temperature (SAT), is available at fixed measurement stations, certain predefined re-80 gions, or at grid cells (e.g. from observations, proxy reconstructions, reanalysis, or model simulations).The corresponding data can be represented by an n × N matrix F , ordered in such a way that each column vector As mentioned above, one way to define the links in the climate network is to use the Pearson Correlation, defining a PCCN, or the Mutual Information, defining a MICN (Feng and Dijkstra, 2014).To reconstruct a PCCN, 85 first the linear Pearson correlation coefficient between the time series at two grid points i and j is determined.The elements R P ij of the correlation matrix R P are given by . (1) To reconstruct a MICN , the correlation between the time series of two grid points i and j is determined by the nonlinear mutual information coefficient, giving where p(x,y) is the joint probability density function of events x and y and p(x) and p(y) are the marginal probability density functions.
We consider that two nodes i and j have an unweighted link, if the absolute value of their correlation coefficient ) is larger than a certain threshold value τ .All links are then represented by an 95 N × N adjacency matrix A, which can be determined from the correlation matrix R X according to where H is the Heaviside function.The threshold τ is in most cases based on statistical significance (say above the 95% level) of the correlations between the time series (Donges et al., 2015).
Another way to define a link between nodes i and j was presented in Gozolchiani et al. (2011) and also used in 100 Ludescher et al. (2014).First, the cross-correlation function C ij (∆t) between the time series at locations i and j is calculated, where ∆t is a positive time lag and is maximal (or minimal) is determined.Finally, weights (W ± ij ) for positive and negative links are defined as: and where MAX and MIN are the maximum and minimum values and MEAN, STD are the mean and standard deviation, respectively.In this way, a weighted and directed link between nodes i and j is obtained (Gozolchiani et al., 2011;Wang et al., 2013).
There are many other ways to reconstruct climate CNs and an overview is given in Donges et al. (2015).By 110 reconstructing CNs, the correlations in time series of observables at different locations is represented with a graph, defined by its adjacency matrix A. Subsequently, many topological properties of this graph are analysed, such as the degree d i of each node i, given by which is the total number of links that a node possesses.Next step is to use the properties of such a CN as the 115 input of machine-learning techniques.Besides the statistical properties of the CN, such as those of the correlation matrices, also the topological properties of the graph can be used.

Machine learning approach
In ClimateLearn, supervised learning approaches are implemented for the prediction of climate variability.
Specifically we focus on multilayer artificial neural networks (ANN) and symbolic regression with genetic pro-120 gramming (GP), both explained in this subsection.The approaches follow the typical outline of machine learning: the algorithms are trained or 'learn' a certain behavior from the data This results in a model that is evaluated using test data, which are different from the ones used for training.

Artificial neural networks
Artificial neural networks are a class of statistical learning models inspired by the physiology of biological neural 125 networks.They consists of a network of computing units, the neurons, which process input information transforming it in an output signal whose form depends on the network internal state.Their importance has increased due to recent availability of software capable to efficiently train the network and allow to use these methods for a large variety of problems, from speech and image recognition up to the forecasting of time series and high-dimensional clustering.Many neural network topologies have been proposed in the literature, as specific problems require 130 specific topologies to be solved efficiently.Here, we concentrate on a specific configuration known as multilayer perceptron or multilayer neural network.In Fig. 1a the typical structure of a multilayer perceptron is shown: the inputs enter the network and are processed by one or more hidden layers and exit at an output layer.Therefore the computation can be seen as a mapping operation from a n-dimensional input vector to a m-dimensional output vector.In a multilayer perceptron, information travels from the input to the output layer because the neuron 135 connections are chosen to be unidirectional.When all neurons of one layer are connected to all the following we speak about a fully connected multilayer perceptron.
Each neuron performs a specific kind of computation and Fig. 1b shows the functionality of the Rosenblatt perceptron.First, a weighted sum of the input variables and the bias term b is built, with the result being then processed by an activation function f (t).Once the single neuron operation is specified, one can easily calculate 140 the network outputs given an input vector by evaluating the output of each layer by forward input propagation.
The result is a function of the network configuration, i.e. its topology and the value of the connection weights.It will be the job of the training phase to learn the weights in order to induce the desired computation; training and learning are used here as synonymous.
Since the advent of neural networks, the training phase has been considered a computationally demanding 145 problem mainly because of the absence of efficient algorithms relative to the available computing power.This has been overcome by the back-propagation algorithm (Bishop, 2006), nowadays widely applied in training multilayer perceptrons.Given a supervised training set {x i ,t i : i = 1...N } with x i input variables and t i target variables, we denote by y i the correspondent output computed by the network when x i is fed forward.In general we have A global error on the training set can be then defined as a quadratic function of the form 150 and can be seen as a function of the network weights w.Other error definitions are possible, for example by choosing a different norm.The idea behind back propagation is to minimize this error by updating the weights using the gradient descend method (with k as iteration index), i.e. x 1 x 1 the training is considered successful and the network can be employed for generalization.

Genetic programming
Genetic programming and genetic algorithms are a class of evolutionary based algorithms whose principles are based on Charles Darwin's Theory of Evolution (Darwin, 1959).In this masterpiece, Darwin explains that, given a Given a problem P to be solved, we imagine to have an ensemble of solutions S to this problem.According to its efficiency in solving P , each solution can be considered as an individual for which the degree of adaptation to its environment can be measured in terms of a fitness value.Genetic Programming (GP) is a particular case where the 180 evolved individuals are computer programs (Koza, 1993).The aim is to appropriately evolve computer programs by creating new generations, evaluating their fitness value and finally selecting the best program that solves the problem at hand.Here, we restrict such programs to functions f (x 1 ,...,x n ) of a given number of variables x i ,i = 1,...n and we aim at finding a given function that approximates the solution to our problem accurately enough.
Therefore our application is nothing more than a symbolic regression achieved through genetic programming 185 algorithms.The fitness values can be represented mathematically by a real valued functional mapping the space of possible solutions onto the real axis.In GP, the programs are typically represented as trees, where each tree represents an expression of a potential solution to a problem (cf.examples in Fig. 2).
To implement the variations in genetic programming, two operators are commonly used: mutation and crossover.Their behaviour is very similar to the biological mutation and crossover concepts.In a mutation 190 step, a random node in the tree that represents the individual is selected and the corresponding subtree is replaced randomly by another one (Fig. 2a).Mutation is very important to keep diversity inside the population, and diversity helps the algorithm to explore all the search space and preventing encounters of local maxima of the fitness functional.Crossover is based on the exchange of characteristics among two individuals.In GP, this is implemented by selecting randomly a node in two trees that represent two individuals and exchange the two subtrees 195 attached to two those nodes (Fig. 2b).In this way, the two individuals inherit characteristics of both parent trees.
In a so-called symbolic regression scheme, an initial random population of individuals is generated randomly using elementary functions taken from a function set (e.g.sinx,exp x,max[x 1 ,x 2 ]) and combined using a certain algebraical or functional operations (e.g.+,−, * ) in an operation set, whose choice depends on the problem at hand.The initial population is then evaluated according to a fitness function.The genetic operations mutation and 200 crossover are then applied on the population in order to create a second generation which is as well evaluated.This process continues until a prescribed termination condition, for example when the maximum number of generations has been reached, or an individual of high enough fitness is found (e.g. when the absolute value of the functional F is smaller than a certain tolerance).x 1 As an example of the application of ClimateLearn, we consider the forecasting of El Niño events using two different approaches.First, we focus on the forecasting of the occurrence of events, i.e. the presence (or not) of an El Niño in a given interval of time regardless of the intensity of the phenomenon.This problem can be considered as a classification problem, where a set of discrete classes is the output of the model (section 4.1).The second approach is the forecasting of the time evolution of a scalar characteristic of El Niño, where we aim at the 210 prediction of a real-valued time series by regression.The results in this case will give information on both the presence and intensity of the event (section 4.2).In section 4.3, we provide specific results for the occurrence and development of the El Niño conditions in the year 2014.(Kalnay et al., 1996).From this dataset, a directed, weighted network was reconstructed (Gozolchiani et al., 2011;Ludescher et al., 2014) using the methodology presented in section 2. Several measures x i ,i = 1,...,N of this network are used as the attributes in the machine-learning approach and for each quantity a time series (x 1 i ,...,x T i ) is available.We use a time interval of 10 days (which gives T = 2365) and choose eight measures (with time included, N = 9).These eight measures are the maximum 220 correlation MAX(C ij ), the minimum correlation MIN(C ij ), the maximum delay MAX(∆t * ), the minimum delay MIN(∆t * ), the maximum link weight MAX(W ij ), the minimum link weight MIN(W ij ), the standard deviation of the correlation STD(C ij ), and the mean correlation MEAN(C ij ) (see section 2).
The target variable is discrete valued and distinguishes the presence or absence of an event.Operationally (http://www.cpc.ncep.noaa.gov/),an El Niño event is said to occur when the sea surface temperature anomaly  2014), the machine-learning toolbox enables us to give a better prediction for the occurrence of El Niño 240 events when using more measures of the same CN.
One advantage of using supervised learning for prediction is that the predictor model is constructed automatically from the training set without subjective decisions like the choice of thresholds.However, because the available data for prediction as well as the amount of instances is limited, (for example, only a few El Niño events occurred between May 1949 and March 2014), the accuracy of the prediction will mostly depend on the length the test set is ≤ 20%.Of course, we should also maximize the length of the test set to incorporate more El Niño events for testing, and this motivated our choice of 20% (Fig. 3).

Results: NINO3.4 index development
Predictions for the development of the NINO3.4index are more difficult than those for the occurrence of El Niño 255 events.For example, consider the results of the CFS version 2 (CFSv2) model developed by the Environmental Modeling Center at National Centers for Environmental Prediction (NCEP).This is a fully coupled model representing the interaction between the Earth's atmosphere, oceans, land and sea ice (Saha et al., 2014).In August 2014 this model predicted that the NINO3.4index would go over +1.0 • C in October 2014 but the actual value in October 2014 was just around +0.5 • C. Hence even for short term predictions (up to few months) a good skill of 260 the NINO3.4index development is still hard to achieve by this model.Short-term development of the NINO3.4index is strongly related to the stability of the Pacific background state and the occurrence of westerly wind bursts (WWBs) near the dateline.In Feng and Dijkstra (2015), PCCNs were reconstructed using sea surface temperature data from the HadISST dataset Rayner (2003) using the methodology presented in section 2. As a measure of the coherence in the PCCN, they determine the number of links of each 265 node, i.e. the degree of the node.As a measure of the stability of the Pacific climate, they use the skewness S d of the degree distribution of the PCCN.In addition to S d , also the time series of the second principal component (PC2) of the wind stress residual (the signal due to SST variability is filtered out) is used as a measure the WWB strength (Feng and Dijkstra, 2015).index development by supervised learning regression.The attributes are therefore the background stability index x 1 = S d , the westerly wind burst measure x 2 = PC2 and the time x 3 = t from November 1961 to October 2014 with a time interval of one month (i.e., T = 636 and N = 3).Given the data set we again have to choose a training and a test set.In the case of regression we can randomly choose a given percentage of the instances to belong to a training set and the rest to a test set.Since we do not possess a large amount of data, it is however important that 275 these two dataset are as homogeneous as possible in order to avoid overfitting issues.
The training set chosen is from November 1961 to April 2004 (80% of T ) and the test set is from May 2004 to October 2014 (20% of T ).The quality of the predicted results in the test set is measured by the normalized root mean squared error (NRMSE) defined as where y k A is the actual time series of NINO3.4 index, the predicted is indicated by y k P ,n and n is the number of points in the test set.
We first employ an ANN with a 2 × 1 layer structure (2 neurons in the first layer and 1 neuron in the second one) to do the regression.Since we do not know the optimal prediction time τ which would give a reasonable prediction y(t + τ ) at time t + τ , we vary τ from 1 month up to 12 months.Fig. 4 shows the regression results

285
on the test set for the 2-4 months lead time NINO3.4forecasts.The best prediction, with the smallest value of NRMSE=0.18, is given at τ = 3 months (Fig. 4b).
To test the robustness of the regression result for the three months lead time NINO3.4index forecast (Fig. 4b), we perform a series of cross-validations by keeping specific percentage splits between training set and test set (70-30, 75-25, 80-20, and 85-15), but randomly choosing 200 initial times t test 1 of the test set from November 290 1961 to October 2014 for each percentage split.From Fig. 5, one can see that the peak values of the NRMSE remain near 0.17, independent of the choices of the percentage splits and t test 1 .Therefore the regression result in Fig. 4b is considered robust.
Due to the irregular behavior of the PC2 representing the WWBs (cf. Figure 3 3b, our forecast scheme tends to ignore the ENSO-neutral favored events or the weak El Niño events.

315
Hence, no El Niño event between January 2014 to March 2015 is predicted one year ahead (Fig. 7a).Second, we consider the development of the NINO3.4index from January 2014 till January 2015 using the same data used in section 4.2 till October 2014.The accuracy of the predicted NINO3.4 index over 2014 with a lead time of three months (Fig. 7b) is quite good (NRMSE=0.19)for example compared with the one given by CFSv2 model over that period (NRMSE=0.34).

5 Summary and discussion
In this paper, we have presented the machine-learning toolbox ClimateLearn for climate prediction problems, based on climate data obtained from complex network reconstruction and analysis.Besides handing multivariate data from these networks and other sources, another advantage of using this machine-learning toolbox for climate variability prediction is that the development of predictor models is dynamic and data-driven (Bishop, 2006).

325
Using machine-learning techniques with the measures from reconstructed Climate Networks (CNs), we have provided novel prediction schemes for the occurrence of an El Niño event (with a lead time of one year) and for the development of the NINO3.4index (with a lead time of three months).
By using measures of a directed and weighted CN (Ludescher et al., 2014)

155
The calculation of the partial derivatives is thus crucial for the algorithm.It is done by using directly the dependence of the error function on the training set instances.When all the instances have been used, one 'epoch' of training is completed.Usually many epochs of training are needed in order for the error function to converge to a local, or global minimum, resulting in longer training periods.Geosci.Model Dev.Discuss., doi:10.5194/gmd-2015-273,2016 Manuscript under review for journal Geosci.Model Dev.Published: 11 February 2016 c Author(s) 2016.CC-BY 3.0 License.

Fig. 1 :
Fig. 1: (a) Schematics of a fully connected multilayer perceptron with three input variables x 1 ,x 2 and x 3 and two output variables y 1 and y 2 , with two hidden layers.(b) The Rosenblatt perceptron, with three inputs and a bias unit.The weighted input sum is added to the bias term and then enters as argument of the activation function f which generates the neuron output.
Dev. Discuss., doi:10.5194/gmd-2015-273,2016 Manuscript under review for journal Geosci.Model Dev.Published: 11 February 2016 c Author(s) 2016.CC-BY 3.0 License.population of individuals living within an environment, only a subset of them are properly fitted and therefore have higher chances of survival and reproduction.New generations may inherit these favourable genetic characteristics and they will end up being dominant inside the population.Variations in the individual characteristics can be classified in three categories.In the first category are variations that are damageable for the individual.To the second belong the beneficial ones and in the third category, the variations have no effective influence.Natural selection 175 consists in the preservation of beneficial characteristics and their transmittance to the next generation, since those fitted individuals live longer and most of the time are better able to beat the competition for reproduction.

Fig. 2 :
Fig. 2: (a) Example of mutation operation: a branch of the tree on the left is changed (mutated) by substitution with another compatible branch determined randomly.(b) Example of crossover: two individuals are selected from the population and a new individual is created by mixing the highlighted branches.

4. 1
Results: Occurrence of El Ni ño events Just as in Ludescher et al. (2014), the data consist of atmospheric surface temperature anomalies over the May 1949 215 -March 2014 from the NCEP Reanalysis project

225
Fig.3ashows the classification results on the test set, where 1 stands for the occurrence of an El Niño event and 0 means absence.The result is then filtered by eliminating the isolated and transient events, and by batching the adjacent events together.Fig.3bthen shows that our forecasting scheme gives accurate alarms 12 months ahead for the El Niño events in2002, 2006 and 2009, and no alarm in 2004.Compared with the results inLudescher et al. (2014), the machine-learning toolbox enables us to give a better prediction for the occurrence of El Niño

245Fig. 3 :
Fig. 3: Prediction results on the test set from June 2001 to March 2014 (a) without filtering and (b) with filtering, using an artificial neural network (ANN) with a 3 × 3 layer structure (3 neurons per layer) for a 12 months lead time prediction for the occurrence of El Niño events.The red dashed lines are the actual nominal quantity of the NINO3.4index (1 stands for the occurrence of an El Niño event where NINO3.4 values are continuously above the threshold of +0.4 • C for five months, and 0 for the absence of such an event), and the blue solid lines indicate the predicted ones.

Fig. 4 :
Fig. 4: Regression results on the test set from May 2004 to October 2014 using an ANN with a 2×1 layer structure (2 neurons in the first layer and 1 neuron in the second one) for the prediction of the NINO3.4index with a lead time of (a) 2 months (NRMSE=0.23),(b) 3 months (NRMSE=0.18),and (c) 4 months (NRMSE=0.22).The red dashed lines are the actual values of NINO3.4 index, and the blue solid lines indicate the predicted ones.

Fig. 5 :
Fig. 5: Cross-validation results of NINO3.4 index forecast on the test set by keeping certain percentage splits between training set and test set (70-30, 75-25, 80-20, and 85-15), but randomly choosing 200 initial times of the test set t test i from November 1961 to October 2014 for each percentage split.The blue dashed curve is the NRMSE distribution of 70-30 split (70% of T as the training sets and 30% of T as test sets), the green solid line for a 75-25 split, the red solid curve for a 80-20 split and the cyan solid curve for a 85-15 split.

Fig. 6 :
Fig. 6: Results for a 3-month running mean regression on the test set from May 2004 to October 2014 using (a) an ANN with a 2 × 1 layer structure (2 neurons in the first layer and 1 neuron in the second one, NRMSE=0.14),(b) an ensemble of 49 ANNs with different binary layer structures and up to 7 neurons per layer (only the ensemble mean of the best 10 is showed, NRMSE=0.15) and (c) an ensemble of genetic programmings (only the ensemble mean of the best 10 is showed, NRMSE=0.17) for the three months ahead prediction for the development of the NINO3.4index.The red dashed curves are the actual values of NINO3.4 index, and the blue solid curves indicate the predicted ones.
Geosci.Model Dev.Discuss., doi:10.5194/gmd-2015-273,2016 Manuscript under review for journal Geosci.Model Dev.Published: 11 2016 c Author(s) 2016.CC-BY 3.0 License.dict the occurrence and development of El Niño events.It can also be directly applied to the prediction of other 350 climate variability phenomena.