Preparing data for finalfit

This vignette shows you how to upload and prepare any dataset for use with finalfit. The demonstration will use the boot::melanoma. Use ?boot::melanoma to see the help page with data description. I will use library(tidyverse) methods. First I’ll write_csv() the data just to demonstrate reading it.

Read data

Note the various options in read_csv(), including providing column names, variable type, missing data identifier etc.

library(readr)

# Save example
write_csv(boot::melanoma, "boot.csv")

# Read data
melanoma = read_csv("boot.csv")
#> Parsed with column specification:
#> cols(
#>   time = col_double(),
#>   status = col_double(),
#>   sex = col_double(),
#>   age = col_double(),
#>   year = col_double(),
#>   thickness = col_double(),
#>   ulcer = col_double()
#> )

Column types

Note the output shows how the columns/variables have been parsed. For full details see ?readr::cols().

Continuous data

Integer (whole numbers) - col_integer()
Double or numeric (real numbers; the name comes from “double-precision floating point”) - col_double()

Categorical data

Factor (a fixed set of names/strings or numbers) - col_factor()
Character (sequences letters, numbers, and symbols) - col_character()
Logical (containing only TRUE or FALSE) - col_logical()

Dates and times

Date - col_date()
Time - col_time()
Date-time - col_datetime()

Check data

ff_glimpse() provides a convenient overview of all data in a tibble or data frame. It is particularly important that factors are correctly specified. Hence, ff_glimpse() separates variables into continuous and categorcial. As expected, no factors are yet specified in the melanoma dataset.

library(finalfit)
ff_glimpse(melanoma)
#> $Continuous
#>               label var_type   n missing_n missing_percent   mean     sd
#> time           time    <dbl> 205         0             0.0 2152.8 1122.1
#> status       status    <dbl> 205         0             0.0    1.8    0.6
#> sex             sex    <dbl> 205         0             0.0    0.4    0.5
#> age             age    <dbl> 205         0             0.0   52.5   16.7
#> year           year    <dbl> 205         0             0.0 1969.9    2.6
#> thickness thickness    <dbl> 205         0             0.0    2.9    3.0
#> ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5
#>              min quartile_25 median quartile_75    max
#> time        10.0      1525.0 2005.0      3042.0 5565.0
#> status       1.0         1.0    2.0         2.0    3.0
#> sex          0.0         0.0    0.0         1.0    1.0
#> age          4.0        42.0   54.0        65.0   95.0
#> year      1962.0      1968.0 1970.0      1972.0 1977.0
#> thickness    0.1         1.0    1.9         3.6   17.4
#> ulcer        0.0         0.0    0.0         1.0    1.0
#> 
#> $Categorical
#> # A tibble: 205 x 0

If you wish to see the variables in the order in which they appear in the data frame or tibble, missing_glimpse() or tibble::glimpse() are useful.

missing_glimpse(melanoma)
#>               label var_type   n missing_n missing_percent
#> time           time    <dbl> 205         0             0.0
#> status       status    <dbl> 205         0             0.0
#> sex             sex    <dbl> 205         0             0.0
#> age             age    <dbl> 205         0             0.0
#> year           year    <dbl> 205         0             0.0
#> thickness thickness    <dbl> 205         0             0.0
#> ulcer         ulcer    <dbl> 205         0             0.0

Specify factors

Use an original description of the data (often called a data dictionary) to correctly assign and label any factor variables. This can be done in a single pipe.

library(dplyr)
melanoma %>% 
  mutate(
    status.factor = factor(status, levels = c(1, 2, 3), 
      labels = c("Died from melanoma", "Alive", "Died from other causes")) %>% 
    ff_label("Status"),
    sex.factor = factor(sex, levels = c(1, 0),
      labels = c("Male", "Female")) %>% 
    ff_label("Sex"),
    ulcer.factor = factor(ulcer, levels = c(1, 0),
      labels = c("Present", "Absent")) %>% 
    ff_label("Ulcer")
  ) -> melanoma

ff_glimpse(melanoma)
#> $Continuous
#>               label var_type   n missing_n missing_percent   mean     sd
#> time           time    <dbl> 205         0             0.0 2152.8 1122.1
#> status       status    <dbl> 205         0             0.0    1.8    0.6
#> sex             sex    <dbl> 205         0             0.0    0.4    0.5
#> age             age    <dbl> 205         0             0.0   52.5   16.7
#> year           year    <dbl> 205         0             0.0 1969.9    2.6
#> thickness thickness    <dbl> 205         0             0.0    2.9    3.0
#> ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5
#>              min quartile_25 median quartile_75    max
#> time        10.0      1525.0 2005.0      3042.0 5565.0
#> status       1.0         1.0    2.0         2.0    3.0
#> sex          0.0         0.0    0.0         1.0    1.0
#> age          4.0        42.0   54.0        65.0   95.0
#> year      1962.0      1968.0 1970.0      1972.0 1977.0
#> thickness    0.1         1.0    1.9         3.6   17.4
#> ulcer        0.0         0.0    0.0         1.0    1.0
#> 
#> $Categorical
#>                label var_type   n missing_n missing_percent levels_n
#> status.factor Status    <fct> 205         0             0.0        3
#> sex.factor       Sex    <fct> 205         0             0.0        2
#> ulcer.factor   Ulcer    <fct> 205         0             0.0        2
#>                                                                levels
#> status.factor "Died from melanoma", "Alive", "Died from other causes"
#> sex.factor                                           "Male", "Female"
#> ulcer.factor                                      "Present", "Absent"
#>               levels_count   levels_percent
#> status.factor  57, 134, 14 27.8, 65.4,  6.8
#> sex.factor         79, 126           39, 61
#> ulcer.factor       90, 115           44, 56

Everything looks good and you are ready to start analysis.