Supporting additional objects

Introduction

The skim() function summarizes data types contained within data frames. It comes with a set of default summary functions for a wide variety of data types, but this is not comprehensive. Package authors can add support for skimming their specific data types in their packages, and they can provide different defaults in their own summary functions.

This example will illustrate this by creating support for the sf object produced by the "sf: Simple Features for R" package. For any object this involves two required elements and one optional element.

experiment with interactive changes
create methods to get_skimmers for different objects within this package
if needed, define any custom statistics

If you are adding skim support to a package, you will also need to add skimr to the list of imports. Note that running the code in this vignette requires the sf package. Rather than installing sf just for that, we suggest substituting whatever package you are working with.

library(skimr)
library(sf)

## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0

nc <- st_read(system.file("shape/nc.shp", package = "sf"))

## Reading layer `nc' from data source `/Library/Frameworks/R.framework/Versions/4.0/Resources/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## CRS:            4267

class(nc)

## [1] "sf"         "data.frame"

class(nc$geometry)

## [1] "sfc_MULTIPOLYGON" "sfc"

Unlike the example of having a new type of data in a column of a simple data frame in the "Using skimr" vignette, this is a different type of object with special attributes.

In this object there is also a column of a class that does not have default skimmers. By default, skimr falls back to using the sfl for character variables.

skim(nc$geometry)

## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.

Data summary
Name	nc$geometry
Number of rows	100
Number of columns	1
_______________________
Column type frequency:
character	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
geometry	0	1	232	1965	0	100	0

Experiment interactively

skimr has an opinionated list of functions for each class (e.g. numeric, factor) of data. The core package supports many commonly used classes, but there are many others. You can investigate these defaults by calling get_default_skimmer_names().
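As a quick way to explore the defaults, the sketch below asks for the default names of a single class. It assumes the optional skim_type argument of get_default_skimmer_names() accepts a class name such as "numeric"; no sf objects are needed.

```r
library(skimr)

# Names of the default summary functions for one class only.
# Passing "numeric" as skim_type is an assumption about the installed version.
numeric_defaults <- get_default_skimmer_names("numeric")
print(numeric_defaults)
```

Calling the function with no arguments lists the defaults for every supported class.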

What if your data type isn't covered by defaults? skimr usually falls back to treating the type as a character, which isn't necessarily helpful. In this case, you're best off adding your data type with skim_with().

Before we begin, we'll be using the following custom summary statistic throughout. The function gets the geometry's coordinate reference system (CRS) and combines its fields into a single string.

get_crs <- function(column) {
  crs <- sf::st_crs(column)

  paste0("epsg: ", crs[["epsg"]], " proj4string: '", crs[["proj4string"]], "'")
}

This function, like all summary functions used by skimr, has two notable features.

It accepts a vector as its single argument
It returns a scalar

There are a lot of functions that fulfill these criteria:

existing functions from base, stats, or other packages,
lambdas created using the Tidyverse-style syntax,
custom functions that have been defined in the skimr package
custom functions that you have defined.
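The four kinds of functions listed above can be mixed in a single sfl. The following is a minimal sketch using the built-in mtcars data rather than sf; the helper spread() and the statistic names are hypothetical and chosen only for illustration.

```r
library(skimr)

# A user-defined statistic (hypothetical): vector in, scalar out.
spread <- function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE)

skim_numeric <- skim_with(
  numeric = sfl(
    med = stats::median,     # existing function from stats
    cv = ~ sd(.) / mean(.),  # Tidyverse-style lambda
    n_unique = n_unique,     # helper exported by skimr
    spread = spread          # custom function defined above
  ),
  append = FALSE             # replace the numeric defaults instead of adding to them
)

result <- skim_numeric(mtcars)
names(result)
```

Each statistic appears in the result as a column prefixed with the type name, for example numeric.med.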

Not fulfilling the two criteria can lead to some very confusing behavior within skimr. Beware! The base quantile() function is an example of this issue: by default it returns a vector of several values rather than a scalar, which is why skimr's default percentiles are computed by calling quantile() five times, once for each percentile.
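The quantile() issue can be seen with base R alone. This sketch shows the default vector result and one way to wrap quantile() so that it returns a scalar; the wrapper name p25 is only illustrative.

```r
x <- c(1, 2, 3, 4, 10)

# By default quantile() returns five values, so it is not a valid summary function.
default_result <- quantile(x)
length(default_result)

# Fixing probs to a single value makes the result a scalar.
p25 <- function(x) quantile(x, probs = 0.25, na.rm = TRUE, names = FALSE)
p25(x)
```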

Next, we create a custom skimming function. To do this, we need to think about the many specific classes of data in the sf package. From above, you can see the geometry column has two classes: first, the specific geometry type (e.g. sfc_MULTIPOLYGON, sfc_LINESTRING, sfc_POLYGON, sfc_MULTIPOINT), and second, the general sfc class. skimr tries to find an sfl() for the classes in the order they appear in class(.) (see the S3 chapter of Advanced R for more detail). The following example builds support for sfc, which covers all of the geometry column types: sfc_MULTIPOLYGON, sfc_LINESTRING, sfc_POLYGON, sfc_MULTIPOINT and so on. If we wanted different summaries for a specific geometry type, we could supply an additional sfl() under that more specific class name.
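The class-order lookup follows ordinary S3 dispatch, which can be illustrated with base R alone. The generic describe() and the mock object below are hypothetical stand-ins; they only mimic the class vector of a geometry column and do not use sf or skimr.

```r
# A mock object carrying the same class vector as nc$geometry.
geom <- structure(list(), class = c("sfc_MULTIPOLYGON", "sfc"))

describe <- function(x) UseMethod("describe")
describe.default <- function(x) "fallback summary"
describe.sfc <- function(x) "general sfc summary"

# No method exists for the first class yet, so dispatch moves on to "sfc".
general <- describe(geom)

# Once a more specific method exists, it wins because its class comes first.
describe.sfc_MULTIPOLYGON <- function(x) "multipolygon-specific summary"
specific <- describe(geom)
```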

skim_sf <- skim_with(
  sfc = sfl(
    n_unique = n_unique,
    valid = ~ sum(sf::st_is_valid(.)),
    crs = get_crs
  )
)

## Creating new skimming functions for the following classes: sfc.
## They did not have recognized defaults. Call get_default_skimmers() for more information.

The example above creates a new function, and you can call that function on a specific column with sfc data to get the appropriate summary statistics. The skim_with() factory also uses the default skimmers for things like factors, characters, and numerics. Therefore our skim_sf() is like the regular skim() function with the added ability to summarize sfc columns.

skim_sf(nc$geometry)

Data summary
Name	nc$geometry
Number of rows	100
Number of columns	1
_______________________
Column type frequency:
sfc	1
________________________
Group variables	None

Variable type: sfc

skim_variable	n_missing	complete_rate	n_unique	valid	crs
geometry	0	1	100	100	epsg: proj4string: ''
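The way skim_with() keeps the defaults can be checked without sf. This sketch, using the built-in iris data and a hypothetical extra statistic named iqr, adds one numeric summary and leaves everything else at its default.

```r
library(skimr)

# Add one statistic for numeric columns; the default append = TRUE keeps the rest.
skim_iqr <- skim_with(numeric = sfl(iqr = stats::IQR))

out <- skim_iqr(iris)

# Default numeric and factor summaries are still present next to the new one.
names(out)
```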

While this works for any data type and you can also include it within any package (assuming your users load skimr), there is an even better approach in this case. To take full advantage of skimr, we'll dig a bit into its API.

Adding new methods

skimr has a lookup mechanism, based on the function get_skimmers(), to find default summary functions for each class. This is based on the S3 class system. You can learn more about it in Advanced R.

This requires that you add skimr to your list of dependencies.

To export a new set of defaults for a data type, create a method for the generic function get_skimmers(). Each of those methods returns an sfl, a skimr function list. This is the same list-like data structure used in the skim_with() example above. But note! There is one key difference. When adding a method, we also need to identify the skim_type in the sfl. You will probably want to use skimr::get_skimmers.sfc() but that will not work in a vignette.

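#' @importFrom skimr get_skimmers sfl n_unique
#' @export
get_skimmers.sfc <- function(column) {
  sfl(
    skim_type = "sfc",                  # label used for this column type in the output
    n_unique = n_unique,                # number of distinct geometries
    valid = ~ sum(sf::st_is_valid(.)),  # count of valid geometries
    crs = get_crs                       # custom statistic defined earlier
  )
}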

The same strategy follows for other data types.

Create a method
return an sfl
make sure that the skim_type is there
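The three steps can be tried interactively without sf. The sketch below uses a hypothetical class, my_type, built on an integer vector; the class name, the constructor new_my_type() and the statistic total are assumptions made for illustration. The method is defined at the top level of the session, which is where the lookup finds it when you are not inside a package.

```r
library(skimr)

# Hypothetical class: an integer vector with an extra class label.
new_my_type <- function(x) {
  structure(as.integer(x), class = c("my_type", "integer"))
}

# 1. Create a method, 2. return an sfl, 3. include the skim_type.
get_skimmers.my_type <- function(column) {
  sfl(
    skim_type = "my_type",
    total = ~ sum(., na.rm = TRUE)
  )
}

scores <- new_my_type(c(3, 5, 8))

# The method is found through ordinary S3 dispatch on the column's class.
skimmers <- get_skimmers(scores)
skimmers$skim_type

# skim() now picks up the new defaults for columns of this class.
df <- data.frame(id = 1:3)
df$score <- scores
out <- skim(df)
names(out)
```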

Users of your package should load skimr to get the skim() function (although you could import and reexport it). Once loaded, a call to get_default_skimmer_names() will return defaults for your data types as well!

get_default_skimmer_names()

## $AsIs
## [1] "n_unique"   "min_length" "max_length"
## 
## $Date
## [1] "min"      "max"      "median"   "n_unique"
## 
## $POSIXct
## [1] "min"      "max"      "median"   "n_unique"
## 
## $Timespan
## [1] "min"      "max"      "median"   "n_unique"
## 
## $character
## [1] "min"        "max"        "empty"      "n_unique"   "whitespace"
## 
## $complex
## [1] "mean"
## 
## $difftime
## [1] "min"      "max"      "median"   "n_unique"
## 
## $factor
## [1] "ordered"    "n_unique"   "top_counts"
## 
## $list
## [1] "n_unique"   "min_length" "max_length"
## 
## $logical
## [1] "mean"  "count"
## 
## $numeric
## [1] "mean" "sd"   "p0"   "p25"  "p50"  "p75"  "p100" "hist"
## 
## $sfc
## [1] "n_unique" "valid"    "crs"     
## 
## $ts
##  [1] "start"      "end"        "frequency"  "deltat"     "mean"      
##  [6] "sd"         "min"        "max"        "median"     "line_graph"

They will then be able to use skim() directly.

skim(nc)

Data summary
Name	nc
Number of rows	100
Number of columns	15
_______________________
Column type frequency:
character	2
numeric	12
sfc	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
NAME	0	1	3	12	0	100	0
FIPS	0	1	5	5	0	100	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
AREA	1	0.13	0.05	0.04	0.09	0.12	0.15	0.24	▆▇▆▃▂
PERIMETER	1	1.67	0.48	1.00	1.32	1.61	1.86	3.64	▇▇▂▁▁
CNTY_	1	1985.96	106.52	1825.00	1902.25	1982.00	2067.25	2241.00	▇▆▆▅▁
CNTY_ID	1	1985.96	106.52	1825.00	1902.25	1982.00	2067.25	2241.00	▇▆▆▅▁
FIPSNO	1	37100.00	58.02	37001.00	37050.50	37100.00	37149.50	37199.00	▇▇▇▇▇
CRESS_ID	1	50.50	29.01	1.00	25.75	50.50	75.25	100.00	▇▇▇▇▇
BIR74	1	3299.62	3848.17	248.00	1077.00	2180.50	3936.00	21588.00	▇▁▁▁▁
SID74	1	6.67	7.78	0.00	2.00	4.00	8.25	44.00	▇▂▁▁▁
NWBIR74	1	1050.81	1432.91	1.00	190.00	697.50	1168.50	8027.00	▇▁▁▁▁
BIR79	1	4223.92	5179.46	319.00	1336.25	2636.00	4889.00	30757.00	▇▁▁▁▁
SID79	1	8.36	9.43	0.00	2.00	5.00	10.25	57.00	▇▂▁▁▁
NWBIR79	1	1352.81	1976.00	3.00	250.50	874.50	1406.75	11631.00	▇▁▁▁▁

Variable type: sfc

skim_variable	n_missing	complete_rate	n_unique	valid	crs
geometry	0	1	100	100	epsg: proj4string: ''

2020-07-05
