Data cleaning of high-frequency data is a necessary step in all finance and financial econometrics applications. The reason is that most data providers like tick data offer raw instead of preprocessed data. The upside is that you can follow the cleaning process yourself instead of relying on another party. In this vignette, we will address a common workflow that makes use of both trades and quotes data on a microsecond resolution.
We start by loading the highfrequency package and printing the head of one example data set that contains trade data for one stock on two days.
library(highfrequency)
head(sample_tdataraw_microseconds)
#> DT EX SYMBOL SYM_SUFFIX TR_SCOND SIZE PRICE TR_CORR
#> 1: 2018-01-02 05:01:21 P XXX NA FTI 2 157.80 0
#> 2: 2018-01-02 05:23:50 P XXX NA FTI 3 157.80 0
#> 3: 2018-01-02 05:23:50 P XXX NA FTI 1 157.80 0
#> 4: 2018-01-02 07:11:54 P XXX NA T 130 158.00 0
#> 5: 2018-01-02 07:23:12 P XXX NA TI 30 158.10 0
#> 6: 2018-01-02 07:28:44 P XXX NA TI 30 158.28 0
#> TR_SEQNUM TR_SOURCE TR_RF
#> 1: 48101 C
#> 2: 48701 C
#> 3: 48801 C
#> 4: 67901 C
#> 5: 72101 C
#> 6: 73301 C
A quick look via summary into the data let’s us see that there are trades listed in the data set out of the bounds of the regular trading hours that are 9:30 to 16:00 on regular work days.
summary(sample_tdataraw_microseconds[, c("DT", "SIZE", "PRICE")])
#> DT SIZE PRICE
#> Min. :2018-01-02 05:01:21 Min. : 1.0 Min. :155.4
#> 1st Qu.:2018-01-02 12:50:07 1st Qu.: 29.0 1st Qu.:156.4
#> Median :2018-01-02 15:59:06 Median : 100.0 Median :156.7
#> Mean :2018-01-03 00:33:21 Mean : 132.7 Mean :156.8
#> 3rd Qu.:2018-01-03 12:21:06 3rd Qu.: 100.0 3rd Qu.:157.1
#> Max. :2018-01-03 19:55:37 Max. :443901.0 Max. :159.4
Next, we will use one of three main cleaning functions called tradesCleanup. It implements the steps laid out in the seminal paper of Barndorff-Nielsen et al. (2009). It processes the given data by
When using tradesCleanup, the return value is a list of two. It contains a report about the number of observations after each cleaning step and the cleaned data itself.
tdata_cleaned$report
#> initial_number no_zero_trades select_exchange
#> 77263 77263 11191
#> remove_sales_condition merge_same_timestamp
#> 3741 3030
summary(tdata_cleaned$tdata[, c("DT", "SIZE", "PRICE")])
#> DT SIZE PRICE
#> Min. :2018-01-02 09:30:34 Min. : 100.0 Min. :155.4
#> 1st Qu.:2018-01-02 11:55:10 1st Qu.: 100.0 1st Qu.:156.4
#> Median :2018-01-02 15:59:28 Median : 100.0 Median :156.7
#> Mean :2018-01-03 00:20:45 Mean : 144.7 Mean :156.9
#> 3rd Qu.:2018-01-03 12:12:11 3rd Qu.: 167.0 3rd Qu.:157.2
#> Max. :2018-01-03 15:59:59 Max. :3048.5 Max. :159.4
As just one example, one can see that all trades that had time stamps outside the openinig hours of the NYSE. The report of the cleaning procedure shows us that most of the observations are removed either due to originating from another exchange or having irregular sales conditions.
As a last step for cleaning tick data on trades it is advised to check whether compatible quotes have been made before the recorded transaction. For doing so, we first need to implement a similar cleaning procedure for quotes data.
The highfrequency function also brings a similarly-named function for cleaning quotes data, quotesCleanup. Here, the cleaning steps are
quotesCleanup also provides a list with two elements, a report and the cleaned data.
qdata_cleaned$report
#> initial_number no_zero_quotes select_exchange
#> 464221 463191 94422
#> remove_negative_spread remove_large_spread merge_same_timestamp
#> 94422 94422 46566
#> remove_outliers
#> 46566
summary(qdata_cleaned$qdata[, c("DT", "OFR", "OFRSIZ", "BID", "BIDSIZ", "MIDQUOTE")])
#> DT OFR OFRSIZ
#> Min. :2018-01-02 09:30:00 Min. :155.4 Min. : 1.000
#> 1st Qu.:2018-01-02 12:21:55 1st Qu.:156.5 1st Qu.: 1.000
#> Median :2018-01-02 15:54:41 Median :156.7 Median : 2.000
#> Mean :2018-01-03 00:03:51 Mean :156.9 Mean : 4.211
#> 3rd Qu.:2018-01-03 12:12:14 3rd Qu.:157.2 3rd Qu.: 4.000
#> Max. :2018-01-03 16:05:18 Max. :159.4 Max. :417.000
#> BID BIDSIZ MIDQUOTE
#> Min. :155.4 Min. : 1.000 Min. :155.4
#> 1st Qu.:156.4 1st Qu.: 1.000 1st Qu.:156.4
#> Median :156.7 Median : 2.000 Median :156.7
#> Mean :156.9 Mean : 4.304 Mean :156.9
#> 3rd Qu.:157.1 3rd Qu.: 4.000 3rd Qu.:157.1
#> Max. :159.4 Max. :426.000 Max. :159.4
Given our freshly cleaned trades and quotes data, we now want to filter for trades that are proceeded by a compatible set of quotes. Currently, the tradesCleanupUsingQuotes function only supports submitting trades and quotes data for one day only, Hence, we filter our data set to one of the days.
tqdata_cleaned <- tradesCleanupUsingQuotes(tdata = tdata_cleaned$tdata[as.Date(DT) == "2018-01-02"],
qdata = qdata_cleaned$qdata[as.Date(DT) == "2018-01-02"])
tqdata_cleaned
#> DT SYMBOL BID OFR BIDSIZ OFRSIZ MIDQUOTE PRICE
#> 1: 2018-01-02 09:30:34 XXX 158.46 158.58 1 1 158.520 158.525
#> 2: 2018-01-02 09:30:37 XXX 158.46 158.58 3 1 158.520 158.520
#> 3: 2018-01-02 09:30:40 XXX 158.49 158.62 5 1 158.555 158.490
#> 4: 2018-01-02 09:30:49 XXX 158.48 158.55 4 1 158.515 158.480
#> 5: 2018-01-02 09:30:51 XXX 158.48 158.55 4 1 158.515 158.460
#> ---
#> 1435: 2018-01-02 15:59:56 XXX 157.04 157.08 1 4 157.060 157.050
#> 1436: 2018-01-02 15:59:57 XXX 157.04 157.06 5 10 157.050 157.050
#> 1437: 2018-01-02 15:59:57 XXX 157.04 157.06 5 10 157.050 157.050
#> 1438: 2018-01-02 15:59:59 XXX 157.04 157.06 8 25 157.050 157.020
#> 1439: 2018-01-02 15:59:59 XXX 157.04 157.06 1 3 157.050 157.020
Here, we see that once more half of the trades are filtered out due to non-plausability.
Barndorff-Nielsen, O. E., P. R. Hansen, A. Lunde, and N. Shephard (2009). Realized kernels in practice: Trades and quotes. Econometrics Journal 12, C1-C32.