Forecasting multiple heterogeneous time series is one of the most challenging problems in time series forecasting. So much so that Kaggle hosted a competition on forecasting web traffic for Wikipedia pages, and another on forecasting store sales for the German drugstore chain Rossmann.

Until recently, most forecasting problems involved a single time series, or a couple of dozen at most. For example, let's say you want to forecast how many MacBook Pros the Apple Store in Palo Alto will sell on a given day. You would collect all the past daily sales for that store and fit a single model. If you also wanted forecasts for two more Apple Stores, you could build two more separate models, and so on.

But what if you want to forecast the sales of every MacBook model for every Apple Store in the world? Then you would be dealing with thousands of time series. And for the task of forecasting web traffic for Wikipedia pages, we could be talking about millions. Building a separate model for each of them clearly becomes impractical, in terms of both resources and time.

So is the solution to build one single model for them all? Kind of. The problem is that a single model assumes all the series come from the same data-generating process. In other words, it treats each time series as a realization of the same underlying probability distribution, so the series are assumed to be more or less homogeneous.

In real-world applications, this is not necessarily the case. The underlying mechanisms driving sales of MacBook Airs in Palo Alto can be very different from those in, say, Shanghai. From the point of view of a single model, this heterogeneity looks like noise in the data and makes it harder for the algorithm to learn.

So how is the community tackling this kind of problem? From the research I did, I found two interesting approaches. Both use LSTMs, a type of recurrent neural network that works well on sequence data. LSTMs are used in many language applications, like machine translation and speech recognition. If you're interested in digging deeper I recommend this excellent blog post.

The first approach involves clustering similar time series together and then training one LSTM model for each cluster. This is a compromise between one model per series and one huge model for all of them. The assumption is that training each model only on similar sequences reduces the noise it has to deal with.
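To make the clustering step concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the store names and sales figures are made up, the series are z-scored so that clusters reflect shape rather than volume, and plain k-means stands in for whatever clustering method you would actually choose. Training an LSTM per cluster is not shown; the final loop only marks where that would happen.

```python
import math


def zscore(series):
    """Rescale to zero mean and unit variance so clusters reflect shape, not volume."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    return [(x - mean) / std if std > 0 else 0.0 for x in series]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Plain k-means with a deterministic farthest-first initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every series to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c])) for p in points]
        # Move each centroid to the mean of the series assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels


# Hypothetical daily sales: three growing stores and three declining ones,
# at very different volumes.
sales = {
    "store_a": [1, 2, 3, 4, 5, 6],
    "store_b": [10, 22, 29, 41, 50, 61],
    "store_c": [100, 210, 290, 405, 500, 590],
    "store_d": [6, 5, 4, 3, 2, 1],
    "store_e": [60, 52, 39, 31, 20, 9],
    "store_f": [600, 490, 410, 300, 190, 110],
}

names = list(sales)
labels = kmeans([zscore(sales[name]) for name in names], k=2)

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)

for label, members in sorted(clusters.items()):
    # One model would be trained here on the series in `members` only.
    print(label, members)
```

How many clusters to use and what to cluster on (raw shape, seasonality features, metadata such as region) are judgment calls that this sketch does not try to settle.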

The other approach is a bit more complicated but also a bit more elegant. It involves two steps. First, train an LSTM autoencoder to learn an embedding of all the time series. In other words, try to find a lower-dimensional representation of the data that retains enough information to reconstruct the original series. Then, concatenate that embedding with any other external features (like temperature) and feed the result into a final predictor, like a multilayer perceptron.
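The two steps can be sketched roughly as follows, here using PyTorch. This is my own simplified illustration, not the architecture from any particular article: the class names, layer sizes, window length, number of training steps, and the random tensors standing in for real data are all assumptions. The encoder's last hidden state serves as the embedding, the decoder tries to rebuild the input window from it, and a small MLP then forecasts from the embedding plus one external feature.

```python
import torch
from torch import nn


class LSTMAutoencoder(nn.Module):
    """Hypothetical autoencoder: compress a window into a vector, then rebuild it."""

    def __init__(self, n_features=1, embedding_dim=8):
        super().__init__()
        self.encoder = nn.LSTM(n_features, embedding_dim, batch_first=True)
        self.decoder = nn.LSTM(embedding_dim, embedding_dim, batch_first=True)
        self.output = nn.Linear(embedding_dim, n_features)

    def encode(self, x):
        # x: (batch, time, n_features). The last hidden state is the embedding.
        _, (h_n, _) = self.encoder(x)
        return h_n[-1]  # (batch, embedding_dim)

    def forward(self, x):
        z = self.encode(x)
        # Feed the same embedding to the decoder at every time step.
        repeated = z.unsqueeze(1).repeat(1, x.size(1), 1)
        decoded, _ = self.decoder(repeated)
        return self.output(decoded)  # (batch, time, n_features)


class Predictor(nn.Module):
    """Hypothetical MLP that forecasts from the embedding plus external features."""

    def __init__(self, embedding_dim=8, n_external=1, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim + n_external, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, embedding, external):
        return self.mlp(torch.cat([embedding, external], dim=1))


torch.manual_seed(0)
windows = torch.randn(32, 14, 1)  # stand-in data: 32 windows of 14 days each
external = torch.randn(32, 1)     # stand-in external feature, e.g. temperature
targets = torch.randn(32, 1)      # stand-in next-day values to forecast
loss_fn = nn.MSELoss()

# Step 1: train the autoencoder to reconstruct its own input.
autoencoder = LSTMAutoencoder()
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-2)
for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(windows), windows)
    loss.backward()
    optimizer.step()

# Step 2: compute embeddings with the encoder held fixed, then train the predictor.
with torch.no_grad():
    embeddings = autoencoder.encode(windows)

predictor = Predictor()
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-2)
for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(predictor(embeddings, external), targets)
    loss.backward()
    optimizer.step()

forecast = predictor(embeddings, external)  # (32, 1)
```

Keeping the encoder fixed during the second step is one possible design choice; fine-tuning it together with the predictor is another. Five optimization steps on random data obviously learn nothing useful; the point is only to show how the pieces connect.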

That said, simply applying what is proposed in those articles is not an easy task. Both approaches involve training deep neural networks with many parameters, which is sometimes more of a craft than a science.

I plan to explore both of these approaches in future posts. If you know of another approach that could be interesting, please let me know!