timeseriesdb - Manage Official Statistics’ Time Series Data with R and PostgreSQL

timeseriesdb maps R time series objects to PostgreSQL relations for permanent storage. Instead of writing time series to spreadsheet files or .RData on disk, timeseriesdb uses a PostgreSQL schema which allows to store data alongside extensive, context aware, multi-lingual meta information. timeseriesdb aims at time series from official statistics which are typically published on a monthly, quarterly or yearly basis. Unlike many time series database which excel at appending single observations at record speed and on-the-fly aggregations on database level, timeseriesdb is focused on large upsert operations. That is, check whether a series is available and perform and update if that is the case and an insert if not. This is the case for example when an entire time series is revised such as in the case of GDP revisions or seasonal adjustment.

Getting Started

This quick start guide assumes you’ve PostgreSQL installed or have access to a remote PostgreSQL database including the necessary access rights to create new schemas and tables. Also, let’s assume you installed R.

If that’s not case please refer to:

Before you install the R Package

timeseriesdb depends on RPostgreSQL to connect to the PostgreSQL database, the user needs to make sure that the PostgreSQL’s own library and header files are present and can be found by RPostgreSQL. For Windows, the library called libpq is attached to the RPostgreSQL package and will thus be installed with the R package. Hence Windows users are unlikely to experience further troubles.

For OS X and Linux the installation is a bit different when libpq is not present. For some Linux distributions the corresponding library can be obtained with the postgresql-devel package. Similarly, on OS X, the user needs to make sure that libpq is present and can be found by RpostgreSQL. I recommend to use the homebrew package manager and run brew install postgresql. OS X and Linux users should note that previously installed versions may not contain the libraries provided by postgresql-devel package.

Install the R package (stable version)

Yay! timeseriesdb is on CRAN now. Installing the stable version (right choice for most users) is a matter of running on your R console

install.packages("timeseriesdb")

or using R (Studio’s) Graphical User Interface.

Install the R packages (developer version)

The latest version of the timeseriesdb is available from github account. The devtools package is an easy way to directly install packages directly from github.

library(devtools)
install_github('mbannert/timeseriesdb')

Set up PostgreSQL

The basic idea behind timeseriesdb’s storage concept is to use the PostgreSQL extension hstore to store time series in a key-value-pair. Thus you’ll have to install the hstore extension before you can create the relations needed to operate timeseriesdb. Just run:

CREATE EXTENSION hstore;

Plus you will need the btree_gist extension if you want to use the vintages table, that is, store different version of the same time series. The extension is used to make sure different versions do not overlap in their validity. Installing is as simple as:

CREATE EXTENSION btree_gist;

If set up timeseriesdb from scratch, create a new postgres schema:

CREATE SCHEMA timeseries;

Note that ‘timeseries’ is the default schema name. That is, the defaults of the R functions use that name. Of course, you can have multiple schemas and manipulate those, by explicitly passing their name as a parameter. In the standard case setting up all relations on localhost sandbox database is just a matter of:

library(timeseriesdb)
con <- createConObj(dbhost = "localhost",
                    dbname = "sandbox",
                    passwd = "")

runCreateTables(con)

Update PostgreSQL (if you used older version of timeseriesdb)

In case you’ve used older version of timeseriesdb before, you can run the following in order to update your tables to add the new release date feature.

library(timeseriesdb)
con <- createConObj(dbhost = "localhost",
                    dbname = "sandbox",
                    passwd = "")

runUpgradeTables(con)

Basic Usage

As stated before, timeseriesdb maps R time series objects into their PostgreSQL counterparts for permanent storage. The core functionality comes from the readTimeSeries and storeTimeSeries functions.

Simple CREATE, UPDATE, DELETE (CRUD)

Let’s create a few random time series and store them and read them into R again. Note that both functions can handle vectors of time series identifiers. The result of read operation will always be a list – even if you only read a single series. That is because we want to return the same type always.

library(timeseriesdb)
con <- createConObj(dbhost = "localhost",
                    dbname = "sandbox",
                    passwd = "")


tsl <- list(
      ts1 = ts(rnorm(100),start = c(1990,1),freq = 4),
      ts2 = ts(rnorm(100),start = c(1985,1),freq = 12),
      ts3 = ts(rnorm(50),start = c(2000,1),freq = 4))

storeTimeSeries(names(tsl), con, tsl)

read_it <- readTimeSeries(c("ts1","ts2"),con)

Deleting one or more series is just a matter of:

deleteTimeSeries("ts3",con)

I suggest to prevent users from deleting series using PostgreSQL’s sophisticated rights management. Thanks to PostgreSQL’s row low level security (RLS) you could even set rights on a per record (time series) basis.

Storing Time Series Vintages (Real Time Reproducibility)

Sometime it is important to store different versions of the same time series. Revision of official statistics are common example of why you want to store so-called vintages. Assume you use GDP figures published by your national statistical office in your research. At the time you compiled your computations, the official bureau did their best to publish the GDP, but had to revise their estimations later on. As a researcher who cares about reproducibility you may not want to rely on a source being online forever or to store different versions of their own publications for you. Note that vintages are not allowed to overlap and need to succeed each other without gaps. This will be enforced by timeseriesdb.


tsl <- list(ts1 = ts(rnorm(100),start = c(1990,1),freq = 4))

storeTimeSeries(names(tsl),con, tsl,   valid_from = '2018-07-01')

Adding another version of the same series with a validity before the former series is not allowed. If you add a version with a validity after the existing version, the validity of the current version is set to end when the new validity starts.


storeTimeSeries(names(tsl),con, tsl, valid_from = '2018-08-01')

Note how series with the same time series key can co-exist because of a composite primary key which consists of the identifier and validity.

The release date feature - sharing data

One big advantage of a database over file based storage is the leverage when sharing data. Instead of storing data in your_personal_folder/in_a_file_with_a_wierd_name/some_where_on_your_hard_disk/ a SQL database is an industry standard which can be accessed through a ton of programming languages including the ones that are popular in web development. So by storing data with timeseriesdb you facilitate sharing and publishing it.

Yet you may not want to share the full information with everyone immediately. Maybe a supervisor needs to do her work before you publish the data or there’s a press conference and everybody should get news at the same time? Therefore we’ve added a release date to timeseriesdb. If you don’t set it, its default will be 1900. Unless you pull of a Marty McFly that’s way in the past. But if you set future release date, a web developer can easily use it to cut off the last observation before sharing it through an API.

storeTimeSeries(names(tsl),con, tsl, release_date = '2018-07-01')

Localized and unlocalized meta information

timeseriesdb was developen in Switzerland – a country that speaks four official languages (German, French, Italian, Rumantsch). The latter is only spoken by a small minority and does not really play a role in describing official statistics (though it would be a dream come true to also have data description in Rumantsch). However, the point is, timeseriesdb allows you to store meta information in several languages and associate it with the same series!

  m_e <- new.env()
  m_en <- list("wording" = "let's have a word.",
               "check" = "it's english man.!! SELECTION DELETE123")
  addMetaInformation("ts1",m_en,m_e)
  
  # DE
  m_d <- new.env()
  m_de <- list("wording" = "Wir müssen uns mal unterhalten......",
               "check" = "Das ist deutsch. wirklich")
  addMetaInformation("ts1",m_de,m_d)
  
  updateMetaInformation(m_e,
                      con,
                      locale = "en",
                      tbl = "meta_data_localized")

  updateMetaInformation(m_d,
              con,
              locale = "de",
              tbl = "meta_data_localized")

In addition the localized meta information, timeseriesdb stores a few things that do not need to be translated such as username or last updated etc.

Packages that work well with timeseriesdb

timeseriesdb was the first of several time series related packages built at KOF. In the meantime there a few more to address some other concerns. The package are designed to work well with each other and form a little suite of helpers to make every day life with time series in official statistics a little easier.

Mostly plotting and export of time series tstools
An API wrapper to get time series data through our timeseriesdb backed API kofdata
@christopsax’s Best in class wrapper for X13-ARIMA-SEATS seasonal adjustment seasonal

timeseriesdb

Archive time series persistently with R and PostgreSQL

Matthias Bannert

2018-08-06