This vignette shows you how to load and prepare any dataset for use with finalfit. The demonstration uses the boot::melanoma dataset. Use ?boot::melanoma to see the help page with the data description. I will use library(tidyverse) methods. First I'll write_csv() the data, just to demonstrate reading it back in.
Note the various options in read_csv(), including providing column names, variable types, and the missing data identifier.
library(readr)
# Save example
write_csv(boot::melanoma, "boot.csv")
# Read data
melanoma = read_csv("boot.csv")
#> Parsed with column specification:
#> cols(
#> time = col_double(),
#> status = col_double(),
#> sex = col_double(),
#> age = col_double(),
#> year = col_double(),
#> thickness = col_double(),
#> ulcer = col_double()
#> )
Note that the output shows how the columns/variables have been parsed. For full details see ?readr::cols(). The available column types include:
col_integer()
col_double()
col_factor()
col_character()
col_logical()
col_date()
col_time()
col_datetime()
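The parsers listed above can be passed to read_csv() explicitly rather than relying on guessing. The following is a minimal sketch, assuming that every column except thickness holds whole numbers (which the parsed output above suggests); the temporary file path and the name melanoma_typed are used only for illustration.

```r
library(readr)

# Write the example data to a temporary file (path is illustrative only)
tmp <- tempfile(fileext = ".csv")
write_csv(boot::melanoma, tmp)

# Declare the column types up front instead of letting readr guess.
# Assumption: all columns except thickness contain whole numbers.
melanoma_typed <- read_csv(
  tmp,
  col_types = cols(
    time      = col_integer(),
    status    = col_integer(),
    sex       = col_integer(),
    age       = col_integer(),
    year      = col_integer(),
    thickness = col_double(),
    ulcer     = col_integer()
  ),
  na = c("", "NA")  # strings to be treated as missing
)
```

Declaring types this way means that an unexpected value in the file shows up as a parsing problem instead of silently changing the column type.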
ff_glimpse() provides a convenient overview of all data in a tibble or data frame. It is particularly important that factors are correctly specified, so ff_glimpse() separates variables into continuous and categorical. As expected, no factors are yet specified in the melanoma dataset.
library(finalfit)
ff_glimpse(melanoma)
#> $Continuous
#> label var_type n missing_n missing_percent mean sd
#> time time <dbl> 205 0 0.0 2152.8 1122.1
#> status status <dbl> 205 0 0.0 1.8 0.6
#> sex sex <dbl> 205 0 0.0 0.4 0.5
#> age age <dbl> 205 0 0.0 52.5 16.7
#> year year <dbl> 205 0 0.0 1969.9 2.6
#> thickness thickness <dbl> 205 0 0.0 2.9 3.0
#> ulcer ulcer <dbl> 205 0 0.0 0.4 0.5
#> min quartile_25 median quartile_75 max
#> time 10.0 1525.0 2005.0 3042.0 5565.0
#> status 1.0 1.0 2.0 2.0 3.0
#> sex 0.0 0.0 0.0 1.0 1.0
#> age 4.0 42.0 54.0 65.0 95.0
#> year 1962.0 1968.0 1970.0 1972.0 1977.0
#> thickness 0.1 1.0 1.9 3.6 17.4
#> ulcer 0.0 0.0 0.0 1.0 1.0
#>
#> $Categorical
#> # A tibble: 205 x 0
If you wish to see the variables in the order in which they appear in the data frame or tibble, missing_glimpse() or tibble::glimpse() are useful.
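As a brief sketch of both calls, assuming the data are taken directly from boot::melanoma rather than from the CSV file (the name mg is used only for illustration):

```r
library(finalfit)

melanoma <- boot::melanoma

# One row per variable, in data frame order, with missing-data counts
mg <- missing_glimpse(melanoma)
mg

# Compact column-wise view showing types and the first few values
tibble::glimpse(melanoma)
```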
Use an original description of the data (often called a data dictionary) to correctly assign and label any factor variables. This can be done in a single pipe.
library(dplyr)
melanoma %>%
  mutate(
    status.factor = factor(status, levels = c(1, 2, 3),
      labels = c("Died from melanoma", "Alive", "Died from other causes")) %>%
      ff_label("Status"),
    sex.factor = factor(sex, levels = c(1, 0),
      labels = c("Male", "Female")) %>%
      ff_label("Sex"),
    ulcer.factor = factor(ulcer, levels = c(1, 0),
      labels = c("Present", "Absent")) %>%
      ff_label("Ulcer")
  ) -> melanoma
ff_glimpse(melanoma)
#> $Continuous
#> label var_type n missing_n missing_percent mean sd
#> time time <dbl> 205 0 0.0 2152.8 1122.1
#> status status <dbl> 205 0 0.0 1.8 0.6
#> sex sex <dbl> 205 0 0.0 0.4 0.5
#> age age <dbl> 205 0 0.0 52.5 16.7
#> year year <dbl> 205 0 0.0 1969.9 2.6
#> thickness thickness <dbl> 205 0 0.0 2.9 3.0
#> ulcer ulcer <dbl> 205 0 0.0 0.4 0.5
#> min quartile_25 median quartile_75 max
#> time 10.0 1525.0 2005.0 3042.0 5565.0
#> status 1.0 1.0 2.0 2.0 3.0
#> sex 0.0 0.0 0.0 1.0 1.0
#> age 4.0 42.0 54.0 65.0 95.0
#> year 1962.0 1968.0 1970.0 1972.0 1977.0
#> thickness 0.1 1.0 1.9 3.6 17.4
#> ulcer 0.0 0.0 0.0 1.0 1.0
#>
#> $Categorical
#> label var_type n missing_n missing_percent levels_n
#> status.factor Status <fct> 205 0 0.0 3
#> sex.factor Sex <fct> 205 0 0.0 2
#> ulcer.factor Ulcer <fct> 205 0 0.0 2
#> levels
#> status.factor "Died from melanoma", "Alive", "Died from other causes"
#> sex.factor "Male", "Female"
#> ulcer.factor "Present", "Absent"
#> levels_count levels_percent
#> status.factor 57, 134, 14 27.8, 65.4, 6.8
#> sex.factor 79, 126 39, 61
#> ulcer.factor 90, 115 44, 56
Everything looks good and you are ready to start analysis.
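One optional way to double-check a recoding like the one above is to cross-tabulate each original code against its new factor. This is a minimal sketch rather than part of the finalfit workflow: it rebuilds two of the factors from boot::melanoma without labels, and the name melanoma_check is hypothetical.

```r
library(dplyr)

# Rebuild two of the factors on a fresh copy of the data
melanoma_check <- boot::melanoma %>%
  mutate(
    status.factor = factor(status, levels = c(1, 2, 3),
      labels = c("Died from melanoma", "Alive", "Died from other causes")),
    sex.factor = factor(sex, levels = c(1, 0),
      labels = c("Male", "Female"))
  )

# Each original code should pair with exactly one label
status_check <- melanoma_check %>% count(status, status.factor)
sex_check    <- melanoma_check %>% count(sex, sex.factor)
status_check
sex_check

# Codes missing from `levels` become NA, so this should be zero
n_unmatched <- sum(is.na(melanoma_check$status.factor))
n_unmatched
```

If a code in the data is absent from the levels argument, factor() converts it to NA without a warning, so a non-zero count here indicates a mismatch with the data dictionary.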