Time series forecasting with neural networks: a comparative study using the airline data

Appl. Statist. (1998) 47, Part 2, pp. 231–250

Time series forecasting with neural networks: a comparative study using the airline data

Julian Faraway, University of Michigan, Ann Arbor, USA
and Chris Chatfield†, University of Bath, UK

[Received January; final revision April 1997]

Summary. This case-study fits a variety of neural network (NN) models to the well-known airline data and compares the resulting forecasts with those obtained from the Box–Jenkins and Holt–Winters methods. Many potential problems in fitting NN models were revealed, such as the possibility that the fitting routine may not converge or may converge to a local minimum. Moreover, it was found that an NN model which fits well may give poor out-of-sample forecasts. Thus we think it is unwise to apply NN models blindly in 'black box' mode as has sometimes been suggested. Rather, the wise analyst needs to use traditional modelling skills to select a good NN model, e.g. to select appropriate lagged variables as the 'inputs'. The Bayesian information criterion is preferred to Akaike's information criterion for comparing different models. Methods of examining the response surface implied by an NN model are examined and compared with the results of alternative nonparametric procedures using generalized additive models and projection pursuit regression. The latter imposes less structure on the model and is arguably easier to understand.

Keywords: Airline model; Akaike information criterion; Autoregressive integrated moving average model; Bayesian information criterion; Box–Jenkins forecasting; Generalized additive model; Holt–Winters forecasting; Projection pursuit regression

1. Introduction

Neural networks (NNs) have been vigorously promoted in the computer science literature for tackling a wide variety of scientific problems.
Recently, statisticians have started to investigate whether NNs are useful for tackling various statistical problems (Ripley, 1993; Cheng and Titterington, 1994) and there has been particular attention to pattern recognition (Bishop, 1995; Ripley, 1996). NNs also appear to have potential application in time series modelling and forecasting, but nearly all such work has been published outside the mainstream statistical literature. This work is reviewed in Section 7 and reveals that, contrary to some rather grandiose claims, the empirical evidence on NN forecasts indicates varying degrees of success. It is pertinent to ask whether the success of NN modelling depends on

(a) the type of data,
(b) the skill of the analyst in selecting a suitable NN model and/or
(c) the numerical methods used to fit the model and to compute predictions.

Experience regarding (a) can be built up with forecasting competitions, whereas case-studies are better suited to assess (b) and (c), as well as throwing some light on (a). This paper describes one such case-study, based on the well-known airline data (see Section 2). After an introduction to NNs in Section 3, we describe our experiences in fitting and using NN models in Sections 4 and 6. Section 5 discusses ways of trying to understand the response surface implied by an NN model. Section 7 reviews our findings in the context of other studies and urges more understanding between statisticians and computer scientists. We stress that we do not claim to propose any new methodology. Moreover, we deliberately chose to use public domain software to carry out our analyses, to replicate the likely circumstances of an applied statistician trying out NNs for the first time.

†Address for correspondence: Department of Mathematical Sciences, University of Bath, Bath, BA2 7AY, UK.
© 1998 Royal Statistical Society 0035–9254/98/47231
The 'novelty' of the paper lies in giving practical guidance on NN modelling and a comparison with alternative approaches from a statistical, as opposed to computing science, point of view.

2. Box–Jenkins analysis of the airline data

The main time series used in this paper is the so-called airline data, listed by Box et al. (1994), series G, and earlier by Brown (1962). Fig. 1 shows that the data have an upward trend together with seasonal variation whose size is roughly proportional to the local mean level (called multiplicative seasonality). The presence of multiplicative seasonality was one reason for choosing this data set. A common approach to dealing with this type of seasonality is to choose an appropriate transformation, usually logarithms, to make the seasonality additive. However, the discussion of Chatfield and Prothero (1973) demonstrated the difficult nature of such a choice, and if NN models could deal with the non-linearity that is inherent in multiplicative seasonality and allow the raw data to be analysed, this would obviate one awkward step in the usual approach to time series analysis. Moreover, the length of the series was typical of data found in forecasting situations, the data were widely available and non-confidential, and we wanted to see whether we could replicate the promising results obtained by some computer scientists (Tang et al., 1991).

Fig. 1. Airline data; monthly totals (in thousands) of international airline passengers from January 1949 to December 1960: (a) raw data; (b) natural logarithms

The standard Box–Jenkins analysis (e.g. Harvey (1993) and Box et al. (1994)) involves taking natural logarithms of the data followed by seasonal and non-seasonal differencing to make the series stationary. A special type of seasonal autoregressive integrated moving average (SARIMA) model, of order (0, 1, 1)(0, 1, 1)_12 in the usual notation (e.g. Box et al. (1994), p. 333), is then fitted.
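The preprocessing step just described (logs, then seasonal and non-seasonal differencing) can be sketched as follows. This is our illustration, not the paper's code; the function name and the synthetic series are ours.

```python
import math

def airline_difference(x, season=12):
    """Standard Box-Jenkins preprocessing for the airline data:
    natural logs, then seasonal differencing (lag 12), then a further
    non-seasonal difference (lag 1)."""
    logs = [math.log(v) for v in x]
    seas = [logs[t] - logs[t - season] for t in range(season, len(logs))]
    return [seas[t] - seas[t - 1] for t in range(1, len(seas))]

# 13 observations are 'lost': 12 by seasonal and 1 by non-seasonal
# differencing, so 132 training observations leave n = 119.
series = [100 + t + 10 * math.sin(2 * math.pi * t / 12) for t in range(132)]
w = airline_difference(series)
print(len(w))  # 119
```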
This model is often called the airline model and is used as the yardstick for future comparisons, though other SARIMA models could be found with a similar fit and forecast accuracy. Most of the results reported in this paper were computed by fitting a model to the first 11 years of data and then making forecasts of the last 12 monthly observations. The forecasts were obtained either from the base month 132 using only data available at that time, giving multistep forecasts, or by bringing in the recent observed data one at a time, giving one-step forecasts. The model parameters were not re-estimated at each step when computing one-step forecasts. For each model fitted, using data up to time T, we computed the following statistics:

(a) S, the sum of squared residuals up to time T (the residuals are the within-sample one-step-ahead forecast errors);
(b) $\hat{\sigma} = \sqrt{S/(n - p)}$, the estimate of residual standard deviation, where n denotes the number of effective observations used in fitting the model and p denotes the number of parameters fitted in the model; thus, when fitting the airline model to the airline data with T = 132, the value of n is 132 − 13 = 119, since 13 observations are 'lost' by differencing;
(c) the Akaike information criterion (AIC), $n \ln(S/n) + 2p$;
(d) the Bayesian information criterion (BIC), $n \ln(S/n) + p + p \ln n$;
(e) SS_MS, the sum of squares of multistep-ahead forecast errors made at time T of the observations from time T + 1 to the end of the series; these are the out-of-sample (genuine ex ante) forecasts;
(f) SS_1S, the sum of squares of one-step-ahead (out-of-sample) forecast errors of the observations from time T + 1 to the end of the series.

The residual sum of squares, S, can only become smaller, and the residual standard deviation $\hat{\sigma}$ will tend to become smaller, as a model is made 'larger'. Thus the minimization of a criterion such as the AIC or BIC is more satisfactory for choosing a 'best' model from candidate models having different numbers of parameters.
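The statistics (b)-(d) above are direct to compute. A minimal sketch, assuming p = 2 for the airline model (its two moving average parameters; this count is our assumption, not stated in the text):

```python
import math

def sigma_hat(S, n, p):
    # (b): residual standard deviation estimate, sqrt{S / (n - p)}
    return math.sqrt(S / (n - p))

def aic(S, n, p):
    # (c): the variable part of Akaike's criterion, n ln(S/n) + 2p
    return n * math.log(S / n) + 2 * p

def bic(S, n, p):
    # (d): n ln(S/n) + p + p ln n; penalizes parameters more than the AIC
    return n * math.log(S / n) + p + p * math.log(n)

# With T = 132 the airline model gives n = 119 and S = 10789; assuming
# p = 2 this reproduces the AIC and BIC reported in Section 2.
print(round(aic(10789, 119, 2), 2), round(bic(10789, 119, 2), 2))  # 540.35 547.91
```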
Strictly speaking, (c) and (d) above are approximations to the variable parts of the AIC and BIC respectively. In both cases the first term is a measure of (lack of) fit and the remainder is a penalty term to prevent over-fitting. The BIC penalizes extra parameters more severely than the AIC does, leading to 'smaller' models. Several similar criteria have been proposed, including alternative closely related Bayesian criteria which depend on different priors on model size. In particular, Schwarz's Bayesian criterion (SBC) has the penalty term $p \ln n$ rather than $p + p \ln n$ (Priestley (1981), pages 375–376). The SBC gives results that are qualitatively similar to the BIC used in this paper and its use here for the airline data would not change the models that we select. The reader should note that the SBC is sometimes (confusingly) abbreviated as BIC.

For the airline model fitted to the airline data with T = 132, the MINITAB package (release 9.1) gave the following values (after back-transforming all forecasts from the model for the logged data into the original units):

(i) S = 10789;
(ii) $\hat{\sigma}$ = 9.522;
(iii) AIC = 540.35;
(iv) BIC = 547.91;
(v) SS_MS = 3910;
(vi) SS_1S = …

The one-step forecasts have slightly worse accuracy than the multistep forecasts, as will happen occasionally. If the airline model is fitted to the raw data, rather than to the logarithms, then the fit is about 20% worse (S = …), while the accuracy of forecasts suffers even more (e.g. SS_MS = 5230 is 34% worse). An alternative way to compute forecasts for the airline data, without taking logarithms, is to use the multiplicative version of Holt–Winters exponential smoothing (e.g. Chatfield and Yar (1988)), and this gave forecasts with comparable accuracy to Box–Jenkins forecasts. Although applied to the raw data, the multiplicative Holt–Winters method is inherently non-linear in that the formula for a point forecast is a non-linear function of past observations.
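The multiplicative Holt–Winters method mentioned above can be sketched with the standard recursions; this is a generic textbook formulation, not necessarily the exact variant the authors ran, and the smoothing constants and initial values in the example are illustrative choices of ours.

```python
def holt_winters_mult(y, m, alpha, beta, gamma, level0, trend0, seasonals0):
    """One pass of multiplicative Holt-Winters smoothing.
    Returns the one-step-ahead point forecast made before seeing each y[t].
    The point forecast (level + trend) * seasonal is a non-linear function
    of past observations, which is how the method tracks multiplicative
    seasonality in the raw data."""
    level, trend = level0, trend0
    seas = list(seasonals0)            # seasonal indices, one full cycle
    forecasts = []
    for t, obs in enumerate(y):
        s = seas[t % m]
        forecasts.append((level + trend) * s)
        new_level = alpha * obs / s + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        seas[t % m] = gamma * obs / new_level + (1 - gamma) * s
        level = new_level
    return forecasts

# On an exactly multiplicative series, with exact initial values,
# the one-step forecasts reproduce the observations:
s_idx = [0.8, 1.2, 0.9, 1.1]
y = [(100 + 2 * t) * s_idx[t % 4] for t in range(24)]
f = holt_winters_mult(y, 4, 0.3, 0.1, 0.2, level0=98, trend0=2, seasonals0=s_idx)
print(max(abs(a - b) for a, b in zip(f, y)))  # effectively zero
```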
3. Neural networks

The following brief account of NNs is intended to make this paper as self-contained as possible. However, the reader may also find it helpful to read Ripley (1993), Sarle (1994), Stern (1996), Warner and Misra (1996) and/or Chatfield (1996a), section …. Alternatively, an introduction from a computer science perspective such as Hertz et al. (1991) and Gershenfeld and Weigend (1994) may be helpful, while econometric and financial perspectives are provided by Kuan and White (1994) and Azoff (1994) respectively.

This paper considers one popular form of (artificial) NN called the feed-forward NN with one hidden layer. In time series forecasting, we wish to predict future observations by using some function of past observations. One key point about NNs is that this function need not be linear, so that an NN can be thought of as a sort of non-linear (auto)regression model. Fig. 2 depicts a typical architecture as applied to time series forecasting with monthly data. The value at time t is to be forecast using the values at lags 1 and 12. The latter are regarded as inputs, whereas the forecast is the output. The illustrated example includes one hidden layer of two neurons (often called nodes or processing units or just units). In addition there is a constant input term which for convenience may be taken as 1.

Fig. 2. Architecture of a typical NN for time series forecasting with one hidden layer of two neurons: the output (the forecast) depends on the lagged values at times t − 1 and t − 12

Each input is connected to both the (hidden) neurons, and both neurons are connected to the output. There is also a direct connection from the constant input to the output. The 'strength' of each connection is measured by a quantity called a weight. A numerical value is calculated for each neuron as follows. First a linear function of the inputs is found, say $\sum_i w_{ij} y_i$, where $w_{ij}$ denotes the weight of the connection between input $y_i$ and the jth neuron.
The values of the inputs in our example are $y_1 = 1$, $y_2 = x_{t-1}$ and $y_3 = x_{t-12}$. The linear sum, say $\nu_j$, is then transformed by applying a function called an activation function, which is typically non-linear. A commonly used function is the logistic function, $z_j = 1/\{1 + \exp(-\nu_j)\}$, which gives values in the range (0, 1). In our example this gives values $z_1$ and $z_2$ for the two neurons. A similar operation can then be applied to the values of $z_1$, $z_2$ and the constant input to obtain the predicted output. However, the logistic function should not be used at the output stage in time series forecasting unless the data are suitably scaled to lie in the interval (0, 1). Instead a linear function of the neuron values may be used, which implies the identity activation function at the output stage. The introduction of a constant input unit, connected to every neuron in the hidden layer and also to the output, avoids the necessity of separately introducing what computer scientists call a bias, and what statisticians would call an intercept term, for each unit. Essentially the biases just become part of the set of weights (the model parameters).

For an NN model with one hidden layer, the general prediction equation for computing a forecast of $x_t$ (the output) using selected past observations, $x_{t-j_1}, \ldots, x_{t-j_k}$, as the inputs, may be written (rather messily) in the form

$$\hat{x}_t = \phi_o \Big( w_{co} + \sum_h w_{ho} \, \phi_h \Big( w_{ch} + \sum_i w_{ih} x_{t-j_i} \Big) \Big) \qquad (1)$$

where $\{w_{ch}\}$ denote the weights for the connections between the constant input and the hidden neurons and $w_{co}$ denotes the weight of the direct connection between the constant input and the output. The weights $\{w_{ih}\}$ and $\{w_{ho}\}$ denote the weights for the other connections between the inputs and the hidden neurons and between the neurons and the output respectively. The two functions $\phi_h$ and $\phi_o$ denote the activation functions used at the hidden layer and at the output respectively.
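The prediction equation above can be written out directly for an NN(1, 12; 2) architecture. A minimal sketch, with argument names mirroring the $w_{ih}$, $w_{ch}$, $w_{ho}$, $w_{co}$ notation; the weight values in the example are illustrative, not fitted values from the paper.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def nn_forecast(lagged, w_ih, w_ch, w_ho, w_co):
    """Equation (1): one-step-ahead forecast from a feed-forward NN with one
    hidden layer.  `lagged` holds the inputs x_{t-j_1}, ..., x_{t-j_k};
    w_ih[h][i] is the input-to-hidden weight, w_ch[h] the constant-to-hidden
    weight (bias), w_ho[h] the hidden-to-output weight and w_co the direct
    constant-to-output weight.  Logistic activation is used at the hidden
    layer and the identity at the output, as described in the text."""
    z = [logistic(w_ch[h] + sum(w * x for w, x in zip(w_ih[h], lagged)))
         for h in range(len(w_ho))]
    return w_co + sum(w * zh for w, zh in zip(w_ho, z))

# An NN(1, 12; 2) model: inputs at lags 1 and 12, two hidden neurons.
print(nn_forecast([112.0, 118.0],
                  w_ih=[[0.01, 0.02], [-0.03, 0.01]],
                  w_ch=[0.1, -0.2],
                  w_ho=[40.0, 25.0],
                  w_co=80.0))
```

With logistic hidden units and identity output, zero input-side weights give the easy check that the output is $w_{co} + 0.5 \sum_h w_{ho}$, since logistic(0) = 0.5.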
One minor point is that the labels on the hidden neurons can be permuted without changing the model. We use the notation NN($j_1, \ldots, j_k$; h) to denote the NN with inputs at lags $j_1, \ldots, j_k$ and with h neurons (or units) in the one hidden layer. Thus Fig. 2 represents an NN(1, 12; 2) model.

The weights to be used in the NN model are estimated from the data by minimizing the sum of squares of the within-sample one-step-ahead forecast errors, namely $S = \sum_t (\hat{x}_t - x_t)^2$, over the first part of the time series, called the training set in NN jargon. This is not an easy task as the number of weights may be large and the objective function may have local minima. Various algorithms have been proposed, but even the better procedures may take several hundred iterations to converge, and yet may still converge to a local minimum. The NN literature tends to describe the iterative estimation procedure as being a 'training' algorithm which 'learns by trial and error'. Our software used a popular algorithm called back-propagation for computing the first derivatives of the objective function. There are many ways to use these derivatives for optimization and our fitting method relied on the Broyden–Fletcher–Goldfarb–Shanno algorithm (Fletcher, 1987), which is a quasi-Newton method. The starting values chosen for the weights can be crucial and it is advisable to try several different sets of starting values to see whether consistent results are obtained. Other optimization methods are still being investigated and different packages may use different fitting procedures. For example, a technique called simulated annealing (e.g. van Laarhoven and Aarts (1987)) can be used to try to avoid local minima, but this requires the analyst to set numerical parameters with names like 'the cooling rate', and even then there is no guarantee that convergence to a global minimum will occur.
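The shape of the fitting problem, minimizing S over the weights from several starting values, can be illustrated with a deliberately crude sketch. The paper's software computed derivatives by back-propagation and optimized with BFGS; the finite-difference gradient descent below is only a toy stand-in that shows why multiple random starts are advisable, and the predictor used in the example is a simple linear one of our own choosing.

```python
import math
import random

def sse(w, data, lags, predict):
    """Within-sample one-step-ahead error sum of squares S = sum (x̂_t - x_t)²."""
    start = max(lags)
    return sum((predict(w, [data[t - j] for j in lags]) - data[t]) ** 2
               for t in range(start, len(data)))

def fit_restarts(data, lags, predict, n_weights,
                 n_starts=5, steps=200, lr=1e-3, eps=1e-4):
    """Finite-difference gradient descent from several random starting
    points; different starts may reach different local minima, so the best
    solution found is kept.  (A stand-in for back-propagation plus BFGS.)"""
    best_w, best_s = None, float("inf")
    for _ in range(n_starts):
        w = [random.uniform(-0.5, 0.5) for _ in range(n_weights)]
        for _ in range(steps):
            base = sse(w, data, lags, predict)
            grad = []
            for i in range(n_weights):
                w[i] += eps
                grad.append((sse(w, data, lags, predict) - base) / eps)
                w[i] -= eps
            w = [wi - lr * g for wi, g in zip(w, grad)]
        s = sse(w, data, lags, predict)
        if s < best_s:
            best_w, best_s = w, s
    return best_w, best_s

def linear(w, lagged):
    # toy predictor on lags 1 and 2 (an NN objective would be used in practice)
    return w[0] + w[1] * lagged[0] + w[2] * lagged[1]

data = [math.sin(0.3 * t) for t in range(60)]
random.seed(0)
w_best, s_best = fit_restarts(data, [1, 2], linear, 3)
print(s_best)  # S at the best start found
```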
The last part of the time series, called the test set, is kept in reserve so that genuine out-of-sample (ex ante) forecasts can be made and compared with the actual observations. Equation (1) effectively gives a one-step-ahead forecast as it uses the actual observed values of all lagged variables as inputs. If multistep-ahead forecasts are required, then it is possible to proceed in one of two ways. Firstly, we could construct a new architecture with several outputs, giving $\hat{x}_t, \hat{x}_{t+1}, \hat{x}_{t+2}, \ldots$, where each output would have separate weights for each connection to the neurons. Secondly, we could 'feed back' the one-step-ahead forecast to replace the lag 1 value as one of the input variables, and the same architecture could then be used to construct the two-step-ahead forecast, and so on. We adopted the latter iterative approach because of its numerical simplicity and because it requires fewer weights to be estimated. Some analysts fit NN models to obtain the best forecasts of the test set data, rather than the best fit to the training data. Then a third section of data needs to be kept in reserve so that genuine out-of-sample forecasts can be assessed.

The number of parameters in an NN model is typically much larger than in traditional time series models and for a single-layer NN model is given by $p = (n_i + 2) n_u + 1$, where $n_i$ denotes the number of input variables (excluding the constant) and $n_u$ denotes the number of hidden neurons (or units). For example, the architecture in Fig. 2 (where $n_i$ and $n_u$ are both 2) contains nine connections and hence has nine parameters (weights). Because of this large number, there is a real danger that the algorithm may 'overtrain' the data and produce a spuriously good fit which does not lead to better forecasts. This motivates the use of model comparison criteria, such as the BIC, which penalize the addition of extra parameters. It also motivates the use of an alternative fitting technique called regularization (e.g.
Bishop (1995), section 9.2), wherein the 'error function' is modified to include a penalty term which prefers 'small' parameter values (analogous to the use of a 'roughness' penalty term in nonparametric regression with splines). We did not pursue this approach.

NN modelling is nonparametric in character and it has been suggested that the whole process can be completely automated on a computer 'so that people with little knowledge of either forecasting or neural nets can prepare reasonable forecasts in a short space of time' (Hoptroff, 1993). This black box character can be seen as an advantage but we think it potentially dangerous. Certainly black boxes can sometimes give silly results and NN models obtained like this are no exception. Thus Gershenfeld and Weigend (1994), p. 7, found that 'there was a general failure of simplistic "black-box" approaches – in all successful entries (in the Santa Fe competition), exploratory data analysis preceded the algorithm application'. Our case-study demonstrated that a good NN model for time series data must be selected by combining traditional modelling skills with knowledge of time series analysis and of the particular problems involved in fitting NN models. Problem formulation is, as always, critical and it is 'unlikely that applied statistics will be reduced to an automatic process in the foreseeable future' (Sarle, 1994).

4. Fitting neural network models to the airline data

Many commercial packages are available for fitting NN models. James (1994) reviewed 12 such items. We deliberately eschewed commercial software, partly for financial reasons (some are very expensive), and partly because they are typically written for the business user. For example, whereas James (1994) would 'strongly recommend' a package called 4Thought, Harvey and Toulson (1994) described the claims made in its publicity material as being 'totally unsubstantiated'.
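Two bookkeeping details from the previous section, the parameter count $p = (n_i + 2) n_u + 1$ and the iterative feedback scheme for multistep forecasting, can be sketched as follows. The function names and the toy one-step forecaster are ours, for illustration only.

```python
def n_params(n_i, n_u):
    """p = (n_i + 2) * n_u + 1 for a single-hidden-layer NN:
    (n_i + 1) weights into each hidden unit (inputs plus constant),
    n_u hidden-to-output weights and 1 constant-to-output weight."""
    return (n_i + 2) * n_u + 1

def multistep(history, lags, one_step, horizon):
    """Iterative multistep forecasting: feed each one-step-ahead forecast
    back in place of the actual observation, then forecast again.
    `one_step` maps the inputs [x_{t-j} for j in lags] to x̂_t."""
    extended = list(history)
    out = []
    for _ in range(horizon):
        xhat = one_step([extended[-j] for j in lags])
        out.append(xhat)
        extended.append(xhat)          # feedback of the forecast
    return out

print(n_params(2, 2))  # 9 weights for the NN(1, 12; 2) of Fig. 2

# Toy one-step forecaster (mean of the two lagged inputs) on a flat series:
hist = [5.0] * 13
print(multistep(hist, [1, 12], lambda lv: 0.5 * (lv[0] + lv[1]), 3))
```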
Appendix A gives information on the software that we developed based on public domain S-PLUS functions, and we think that this sort of software will appeal more to the average statistician. We realize that different software packages use different optimization methods for fitting NN models but, although the implementations differ, we think that the problems encountered below are representative of those that are likely to occur with all packages, given the difficulty in choosing an appropriate architecture, the large number of parameters which must be estimated, the non-linearity of the model and the ex