Data cleaning

Data cleaning of high-frequency data is a necessary step in all finance and financial econometrics applications. The reason is that most data providers like tick data offer raw instead of preprocessed data. The upside is that you can follow the cleaning process yourself instead of relying on another party. In this vignette, we will address a common workflow that makes use of both trades and quotes data on a microsecond resolution.

Cleaning tick data on trades

We start by loading the highfrequency package and printing the head of one example data set that contains trade data for one stock on two days.

library(highfrequency)
head(sample_tdataraw_microseconds)
#>                     DT EX SYMBOL SYM_SUFFIX TR_SCOND SIZE  PRICE TR_CORR
#> 1: 2018-01-02 05:01:21  P    XXX         NA      FTI    2 157.80       0
#> 2: 2018-01-02 05:23:50  P    XXX         NA      FTI    3 157.80       0
#> 3: 2018-01-02 05:23:50  P    XXX         NA      FTI    1 157.80       0
#> 4: 2018-01-02 07:11:54  P    XXX         NA        T  130 158.00       0
#> 5: 2018-01-02 07:23:12  P    XXX         NA       TI   30 158.10       0
#> 6: 2018-01-02 07:28:44  P    XXX         NA       TI   30 158.28       0
#>    TR_SEQNUM TR_SOURCE TR_RF
#> 1:     48101         C      
#> 2:     48701         C      
#> 3:     48801         C      
#> 4:     67901         C      
#> 5:     72101         C      
#> 6:     73301         C

A quick look via summary into the data let’s us see that there are trades listed in the data set out of the bounds of the regular trading hours that are 9:30 to 16:00 on regular work days.

summary(sample_tdataraw_microseconds[, c("DT", "SIZE", "PRICE")])
#>        DT                           SIZE              PRICE      
#>  Min.   :2018-01-02 05:01:21   Min.   :     1.0   Min.   :155.4  
#>  1st Qu.:2018-01-02 12:50:07   1st Qu.:    29.0   1st Qu.:156.4  
#>  Median :2018-01-02 15:59:06   Median :   100.0   Median :156.7  
#>  Mean   :2018-01-03 00:33:21   Mean   :   132.7   Mean   :156.8  
#>  3rd Qu.:2018-01-03 12:21:06   3rd Qu.:   100.0   3rd Qu.:157.1  
#>  Max.   :2018-01-03 19:55:37   Max.   :443901.0   Max.   :159.4

Next, we will use one of three main cleaning functions called tradesCleanup. It implements the steps laid out in the seminal paper of Barndorff-Nielsen et al. (2009). It processes the given data by

tdata_cleaned <- tradesCleanup(tdataraw = sample_tdataraw_microseconds, exchange = "N")

When using tradesCleanup, the return value is a list of two. It contains a report about the number of observations after each cleaning step and the cleaned data itself.

tdata_cleaned$report
#>         initial_number         no_zero_trades        select_exchange 
#>                  77263                  77263                  11191 
#> remove_sales_condition   merge_same_timestamp 
#>                   3741                   3030

summary(tdata_cleaned$tdata[, c("DT", "SIZE", "PRICE")])
#>        DT                           SIZE            PRICE      
#>  Min.   :2018-01-02 09:30:34   Min.   : 100.0   Min.   :155.4  
#>  1st Qu.:2018-01-02 11:55:10   1st Qu.: 100.0   1st Qu.:156.4  
#>  Median :2018-01-02 15:59:28   Median : 100.0   Median :156.7  
#>  Mean   :2018-01-03 00:20:45   Mean   : 144.7   Mean   :156.9  
#>  3rd Qu.:2018-01-03 12:12:11   3rd Qu.: 167.0   3rd Qu.:157.2  
#>  Max.   :2018-01-03 15:59:59   Max.   :3048.5   Max.   :159.4

As just one example, one can see that all trades that had time stamps outside the openinig hours of the NYSE. The report of the cleaning procedure shows us that most of the observations are removed either due to originating from another exchange or having irregular sales conditions.

As a last step for cleaning tick data on trades it is advised to check whether compatible quotes have been made before the recorded transaction. For doing so, we first need to implement a similar cleaning procedure for quotes data.

Cleaning tick data on quotes

The highfrequency function also brings a similarly-named function for cleaning quotes data, quotesCleanup. Here, the cleaning steps are

qdata_cleaned <- quotesCleanup(qdataraw = sample_qdataraw_microseconds, exchange = "N")

quotesCleanup also provides a list with two elements, a report and the cleaned data.

qdata_cleaned$report
#>         initial_number         no_zero_quotes        select_exchange 
#>                 464221                 463191                  94422 
#> remove_negative_spread    remove_large_spread   merge_same_timestamp 
#>                  94422                  94422                  46566 
#>        remove_outliers 
#>                  46566

summary(qdata_cleaned$qdata[, c("DT", "OFR", "OFRSIZ", "BID", "BIDSIZ", "MIDQUOTE")])
#>        DT                           OFR            OFRSIZ       
#>  Min.   :2018-01-02 09:30:00   Min.   :155.4   Min.   :  1.000  
#>  1st Qu.:2018-01-02 12:21:55   1st Qu.:156.5   1st Qu.:  1.000  
#>  Median :2018-01-02 15:54:41   Median :156.7   Median :  2.000  
#>  Mean   :2018-01-03 00:03:51   Mean   :156.9   Mean   :  4.211  
#>  3rd Qu.:2018-01-03 12:12:14   3rd Qu.:157.2   3rd Qu.:  4.000  
#>  Max.   :2018-01-03 16:05:18   Max.   :159.4   Max.   :417.000  
#>       BID            BIDSIZ           MIDQUOTE    
#>  Min.   :155.4   Min.   :  1.000   Min.   :155.4  
#>  1st Qu.:156.4   1st Qu.:  1.000   1st Qu.:156.4  
#>  Median :156.7   Median :  2.000   Median :156.7  
#>  Mean   :156.9   Mean   :  4.304   Mean   :156.9  
#>  3rd Qu.:157.1   3rd Qu.:  4.000   3rd Qu.:157.1  
#>  Max.   :159.4   Max.   :426.000   Max.   :159.4

Cleaning transaction data using cleaned quotes

Given our freshly cleaned trades and quotes data, we now want to filter for trades that are proceeded by a compatible set of quotes. Currently, the tradesCleanupUsingQuotes function only supports submitting trades and quotes data for one day only, Hence, we filter our data set to one of the days.

tqdata_cleaned <- tradesCleanupUsingQuotes(tdata = tdata_cleaned$tdata[as.Date(DT) == "2018-01-02"], 
                                           qdata = qdata_cleaned$qdata[as.Date(DT) == "2018-01-02"])
tqdata_cleaned
#>                        DT SYMBOL    BID    OFR BIDSIZ OFRSIZ MIDQUOTE   PRICE
#>    1: 2018-01-02 09:30:34    XXX 158.46 158.58      1      1  158.520 158.525
#>    2: 2018-01-02 09:30:37    XXX 158.46 158.58      3      1  158.520 158.520
#>    3: 2018-01-02 09:30:40    XXX 158.49 158.62      5      1  158.555 158.490
#>    4: 2018-01-02 09:30:49    XXX 158.48 158.55      4      1  158.515 158.480
#>    5: 2018-01-02 09:30:51    XXX 158.48 158.55      4      1  158.515 158.460
#>   ---                                                                        
#> 1435: 2018-01-02 15:59:56    XXX 157.04 157.08      1      4  157.060 157.050
#> 1436: 2018-01-02 15:59:57    XXX 157.04 157.06      5     10  157.050 157.050
#> 1437: 2018-01-02 15:59:57    XXX 157.04 157.06      5     10  157.050 157.050
#> 1438: 2018-01-02 15:59:59    XXX 157.04 157.06      8     25  157.050 157.020
#> 1439: 2018-01-02 15:59:59    XXX 157.04 157.06      1      3  157.050 157.020

Here, we see that once more half of the trades are filtered out due to non-plausability.

References

Barndorff-Nielsen, O. E., P. R. Hansen, A. Lunde, and N. Shephard (2009). Realized kernels in practice: Trades and quotes. Econometrics Journal 12, C1-C32.