The wordbankr package allows you to access data in the Wordbank database from R. This vignette shows some examples of how to use the data loading functions and what the resulting data look like.
There are three different data views that you can pull out of Wordbank: by-administration, by-item, and administration-by-item. Additionally, you can get metadata about the sources and instruments underlying the data. Advanced functionality lets you get estimates of words’ age of acquisition and word mappings across languages.
The get_administration_data() function gives by-administration information, either for a specific language and/or form or for all instruments.
get_administration_data(language = "English (American)", form = "WS")
## # A tibble: 5,520 x 15
## data_id age comprehension production language form
## <dbl> <int> <int> <int> <chr> <chr>
## 1 129242 27 497 497 English (American) WS
## 2 129243 21 369 369 English (American) WS
## 3 129244 26 190 190 English (American) WS
## 4 129245 27 264 264 English (American) WS
## 5 129246 19 159 159 English (American) WS
## 6 129247 30 513 513 English (American) WS
## 7 129248 25 444 444 English (American) WS
## 8 129249 24 582 582 English (American) WS
## 9 129250 28 558 558 English (American) WS
## 10 129251 18 7 7 English (American) WS
## # ... with 5,510 more rows, and 9 more variables: birth_order <fctr>,
## # ethnicity <fctr>, sex <fctr>, zygosity <chr>, norming <lgl>,
## # mom_ed <fctr>, longitudinal <lgl>, source_name <chr>, license <chr>
get_administration_data()
## # A tibble: 73,230 x 15
## data_id age comprehension production language form birth_order
## <dbl> <int> <int> <int> <chr> <chr> <fctr>
## 1 29821 13 293 88 Croatian WG <NA>
## 2 29822 16 122 12 Croatian WG <NA>
## 3 29823 9 3 0 Croatian WG <NA>
## 4 29824 12 0 0 Croatian WG <NA>
## 5 29825 12 44 0 Croatian WG <NA>
## 6 29826 8 14 5 Croatian WG <NA>
## 7 29827 9 2 1 Croatian WG <NA>
## 8 29828 10 44 1 Croatian WG <NA>
## 9 29829 13 172 51 Croatian WG <NA>
## 10 29830 16 241 68 Croatian WG <NA>
## # ... with 73,220 more rows, and 8 more variables: ethnicity <fctr>,
## # sex <fctr>, zygosity <chr>, norming <lgl>, mom_ed <fctr>,
## # longitudinal <lgl>, source_name <chr>, license <chr>
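As a rough sketch of what can be done with by-administration data alone, the example below computes the median productive vocabulary size at each age. To keep it runnable without a database connection, it uses a small made-up data frame in place of the output of get_administration_data(); only the column names (data_id, age, production) follow the printed output above, and the values are hypothetical.

```r
# Toy stand-in for the result of get_administration_data().
# The values are invented; only the column names mirror the real output.
admins <- data.frame(
  data_id    = 1:6,
  age        = c(18, 18, 18, 24, 24, 24),
  production = c(7, 40, 55, 190, 444, 582)
)

# Median productive vocabulary size at each age (in months).
vocab_by_age <- aggregate(production ~ age, data = admins, FUN = median)
names(vocab_by_age)[2] <- "median_production"

print(vocab_by_age)
```

With real data, the same aggregation could be applied directly to the data frame returned by get_administration_data().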
The get_item_data() function gives by-item information, either for a specific language and/or form or for all instruments.
get_item_data(language = "Italian", form = "WG")
## # A tibble: 505 x 11
## item_id definition language form type
## <chr> <chr> <chr> <chr> <chr>
## 1 item_1 Risponde quando è chiamato per nome Italian WG first_signs
## 2 item_2 Risponde ad un No Italian WG first_signs
## 3 item_3 Reagisce ad un C'è la mamma/il papà Italian WG first_signs
## 4 item_4 Vuoi la pappa Italian WG phrases
## 5 item_5 Hai sonno? Sei stanco Italian WG phrases
## 6 item_6 Vuoi bere? Italian WG phrases
## 7 item_7 Stai attento Italian WG phrases
## 8 item_8 Stai buono Italian WG phrases
## 9 item_9 Batti le manine Italian WG phrases
## 10 item_10 Cambiamo il pannolino Italian WG phrases
## # ... with 495 more rows, and 6 more variables: category <chr>,
## # lexical_category <chr>, lexical_class <chr>, uni_lemma <chr>,
## # complexity_category <chr>, num_item_id <dbl>
get_item_data()
## # A tibble: 28,898 x 11
## item_id definition language form type category lexical_category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 item_81 gristi Croatian WG word action_words predicates
## 2 item_264 puhati Croatian WG word action_words predicates
## 3 item_269 razbiti Croatian WG word action_words predicates
## 4 item_64 donijeti Croatian WG word action_words predicates
## 5 item_153 kupiti Croatian WG word action_words predicates
## 6 item_36 čistiti Croatian WG word action_words predicates
## 7 item_384 zatvoriti Croatian WG word action_words predicates
## 8 item_243 plakati Croatian WG word action_words predicates
## 9 item_246 plesati Croatian WG word action_words predicates
## 10 item_42 crtati Croatian WG word action_words predicates
## # ... with 28,888 more rows, and 4 more variables: lexical_class <chr>,
## # uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>
If you are only looking at total vocabulary size, the by-administration data from get_administration_data() is all you need, since it includes both productive and receptive vocabulary sizes. If you are looking at specific items or subsets of items, you need to load instrument data using the get_instrument_data() function. Pass it an instrument language and form, along with a vector of the items you want to extract (by item_id).
get_instrument_data(
  language = "English (American)",
  form = "WS",
  items = c("item_26", "item_46")
)
## # A tibble: 11,692 x 3
## data_id value num_item_id
## <dbl> <chr> <dbl>
## 1 129242 produces 26
## 2 129243 produces 26
## 3 129244 produces 26
## 4 129245 produces 26
## 5 129246 26
## 6 129247 produces 26
## 7 129248 produces 26
## 8 129249 produces 26
## 9 129250 produces 26
## 10 129251 26
## # ... with 11,682 more rows
By default, get_instrument_data() returns a data frame with columns for the administration’s data_id, the item’s num_item_id (the numerical item_id), and the corresponding value. To include administration information, you can set the administrations argument to TRUE, or pass the result of get_administration_data() as administrations (which prevents the administration data from being loaded multiple times). Similarly, you can set the iteminfo argument to TRUE, or pass it the result of get_item_data().
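Conceptually, adding administration and item information amounts to joining the instrument data to the other two views on their shared keys. The sketch below illustrates this with base R's merge() on small made-up data frames. It assumes that data_id and num_item_id are the join keys, which is inferred from the column names shown above rather than from the package's own code, and the definition values are placeholders.

```r
# Toy stand-ins for the three data views (all values are hypothetical).
instrument_data <- data.frame(
  data_id     = c(1, 2, 3, 1, 2, 3),
  value       = c("produces", "", "produces", "produces", "produces", ""),
  num_item_id = c(26, 26, 26, 46, 46, 46)
)
admins <- data.frame(
  data_id = c(1, 2, 3),
  age     = c(27, 19, 30)
)
items <- data.frame(
  num_item_id = c(26, 46),
  definition  = c("placeholder_a", "placeholder_b")
)

# Attach administration columns (one match per data_id) ...
with_admins <- merge(instrument_data, admins, by = "data_id")
# ... and then item columns (one match per num_item_id).
with_both <- merge(with_admins, items, by = "num_item_id")

print(with_both)
```

The result still has one row per administration x item combination, now with the age and definition columns alongside value.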
Loading the data is fast if you need only a handful of items, but the time scales roughly linearly with the number of items and can get quite slow if you need many or all of them. It is therefore a good idea to filter down to only the items you need before calling get_instrument_data().
As an example, let’s say we want to look at the production of animal words on English Words & Sentences over age. First we get the items we want:
library(dplyr)  # provides %>% and filter()
animals <- get_item_data(language = "English (American)", form = "WS") %>%
  filter(category == "animals")
Then we get the instrument data for those items:
animal_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = animals$item_id,
                                   administrations = TRUE)
Finally, we calculate how many animal words each child produces and the median number of animal words produced at each age:
animal_summary <- animal_data %>%
  mutate(produces = value == "produces") %>%
  group_by(age, data_id) %>%
  summarise(num_animals = sum(produces, na.rm = TRUE)) %>%
  group_by(age) %>%
  summarise(median_num_animals = median(num_animals, na.rm = TRUE))
ggplot(animal_summary, aes(x = age, y = median_num_animals)) +
  geom_point() +
  labs(x = "Age (months)", y = "Median animal words produced")
The get_instruments() function gives information on all the CDI instruments in Wordbank.
get_instruments()
## # A tibble: 50 x 6
## instrument_id language form age_min age_max has_grammar
## <int> <chr> <chr> <int> <int> <int>
## 1 1 British Sign Language WG 8 36 0
## 2 2 Cantonese WS 16 30 0
## 3 3 Croatian WG 8 16 0
## 4 4 Croatian WS 16 30 0
## 5 5 Danish WS 16 36 1
## 6 6 English (American) WG 8 18 0
## 7 7 English (American) WS 16 30 1
## 8 8 German WS 18 30 0
## 9 9 Hebrew WG 11 25 0
## 10 10 Hebrew WS 18 24 0
## # ... with 40 more rows
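One possible use of this table is to find the instruments whose age range covers a particular age. The sketch below does this in base R on a hand-copied subset of the rows printed above, so that it runs without a database connection. The target_age variable is introduced here purely for illustration.

```r
# A few rows copied from the get_instruments() output shown above.
instruments <- data.frame(
  language = c("Croatian", "Croatian", "English (American)", "English (American)"),
  form     = c("WG", "WS", "WG", "WS"),
  age_min  = c(8, 16, 8, 16),
  age_max  = c(16, 30, 18, 30)
)

# Hypothetical age of interest, in months.
target_age <- 17

# Keep instruments whose [age_min, age_max] range includes the target age.
covering <- subset(instruments, age_min <= target_age & age_max >= target_age)

print(covering)
```

With real data, the same subset() call could be applied to the full result of get_instruments().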
The get_sources() function gives information on all the data sources in Wordbank, either for a specific language and/or form or for all instruments. If the admin_data argument is set to TRUE, the results will also include the number of administrations in the database from that source and the minimum and maximum ages of those administrations.
get_sources(form = "WG")
## # A tibble: 26 x 9
## source_id name dataset instrument_language
## <int> <chr> <chr> <chr>
## 1 9 Marchman Norming English (American)
## 2 10 Byers English (American)
## 3 11 Thal 13 English (American)
## 4 12 Thal 16 English (American)
## 5 14 Marchman Norming Spanish (Mexican)
## 6 18 Kristoffersen Norwegian
## 7 19 Kristoffersen longitudinal Norwegian
## 8 20 CLEX Croatian
## 9 24 CLEX Russian
## 10 26 CLEX Swedish
## # ... with 16 more rows, and 5 more variables: instrument_form <fctr>,
## # contributor <chr>, citation <chr>, longitudinal <lgl>, license <fctr>
get_sources(language = "Spanish (Mexican)", admin_data = TRUE) %>%
  select(source_id, name, dataset, instrument_form, n_admins, age_min, age_max)
## # A tibble: 4 x 7
## source_id name dataset instrument_form n_admins age_min age_max
## <int> <chr> <chr> <fctr> <int> <int> <int>
## 1 13 Marchman Norming Words & Sentences 1094 16 30
## 2 14 Marchman Norming Words & Gestures 778 8 19
## 3 65 Fernald Outreach Words & Gestures 55 16 22
## 4 66 Fernald Outreach Words & Sentences 80 18 38
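The extra columns added by admin_data = TRUE are per-source summaries of the administration data. As an illustration of what they represent, the sketch below derives the same kind of summary (n_admins, age_min, age_max) from a small made-up data frame of administrations. This is not the package's implementation, and the source names and ages are hypothetical.

```r
# Toy administration data; source names and ages are invented.
admins <- data.frame(
  data_id     = 1:7,
  source_name = c("Source A", "Source A", "Source A",
                  "Source B", "Source B", "Source B", "Source B"),
  age         = c(16, 20, 30, 8, 12, 19, 10)
)

# Summarise one source: number of administrations and age range.
summarise_source <- function(d) {
  data.frame(
    source_name = d$source_name[1],
    n_admins    = nrow(d),
    age_min     = min(d$age),
    age_max     = max(d$age)
  )
}

# Apply the summary to each source and stack the results.
source_summary <- do.call(
  rbind,
  lapply(split(admins, admins$source_name), summarise_source)
)
rownames(source_summary) <- NULL

print(source_summary)
```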
The fit_aoa() function computes estimates of items’ age of acquisition (AoA). It needs to be provided with a data frame returned by get_instrument_data(), with one row per administration x item combination and, minimally, the columns age and num_item_id. It returns a data frame with one row per item and an aoa column with the estimate, preserving any item-level columns in the input data. The AoA is estimated by computing the proportion of administrations for which the child understands/produces (measure) each word, smoothing the proportion using method, and taking the age at which the smoothed value is greater than proportion.
eng_ws_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = c("item_1", "item_42"),
                                   administrations = TRUE,
                                   iteminfo = TRUE)
fit_aoa(eng_ws_data)
## # A tibble: 2 x 10
## num_item_id aoa item_id definition type category lexical_category
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 16 item_1 baa baa word sounds other
## 2 42 24 item_42 owl word animals nouns
## # ... with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## # complexity_category <chr>
fit_aoa(eng_ws_data, measure = "understands", method = "glmrob", proportion = 0.7)
## # A tibble: 2 x 10
## num_item_id aoa item_id definition type category lexical_category
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 21 item_1 baa baa word sounds other
## 2 42 27 item_42 owl word animals nouns
## # ... with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## # complexity_category <chr>
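The three estimation steps described above (proportion, smoothing, threshold) can be sketched in base R for a single item. This is an illustration under stated assumptions and not the code fit_aoa() actually runs: the counts are made up, ordinary logistic regression via glm() stands in for whatever smoother method selects, and threshold plays the role of the proportion argument.

```r
# Toy counts for one item: at each age, how many of 20 children produce it.
# All numbers are invented for illustration.
toy <- data.frame(
  age        = 16:30,
  n_children = 20,
  n_produces = c(1, 2, 2, 4, 5, 7, 9, 10, 12, 14, 15, 17, 18, 19, 19)
)

# Step 1: raw proportion of administrations in which the word is produced.
toy$prop <- toy$n_produces / toy$n_children

# Step 2: smooth the proportions over age with a logistic regression
# (an assumed stand-in for the package's smoothing methods).
fit <- glm(cbind(n_produces, n_children - n_produces) ~ age,
           family = binomial, data = toy)
toy$smoothed <- predict(fit, newdata = toy, type = "response")

# Step 3: take the first age at which the smoothed value exceeds the threshold.
threshold <- 0.5
aoa <- min(toy$age[toy$smoothed > threshold])

print(aoa)
```

Raising threshold moves the estimate to a later age, which is the same direction of change seen in the second fit_aoa() call above.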
One of the item-level fields is uni_lemma (“universal lemma”), which is intended to be an approximate semantic mapping between words across the languages in Wordbank. The function get_crossling_items() simply gives all the available uni_lemma values.
get_crossling_items()
## # A tibble: 1,379 x 1
## uni_lemma
## <chr>
## 1 a
## 2 a little
## 3 a lot
## 4 able
## 5 about
## 6 above
## 7 after
## 8 afternoon
## 9 again
## 10 air conditioner
## # ... with 1,369 more rows
The function get_crossling_data() takes a vector of uni_lemmas and returns a data frame of summary statistics for each item mapped to that uni_lemma in any language (on WG forms). Each row is a combination of item and age, and the columns indicate the number of children (n_children), means (comprehension, production), standard deviations (comprehension_sd, production_sd), and item-level fields.
get_crossling_data(uni_lemmas = c("hat", "nose")) %>%
  ungroup() %>%
  select(language, uni_lemma, definition, age, n_children, comprehension,
         production, comprehension_sd, production_sd) %>%
  arrange(uni_lemma)
## # A tibble: 365 x 9
## language uni_lemma definition age n_children
## <chr> <chr> <chr> <int> <int>
## 1 British Sign Language hat hat 8 4
## 2 British Sign Language hat hat 9 4
## 3 British Sign Language hat hat 10 4
## 4 British Sign Language hat hat 11 6
## 5 British Sign Language hat hat 12 6
## 6 British Sign Language hat hat 13 6
## 7 British Sign Language hat hat 14 7
## 8 British Sign Language hat hat 15 6
## 9 British Sign Language hat hat 16 7
## 10 British Sign Language hat hat 17 7
## # ... with 355 more rows, and 4 more variables: comprehension <dbl>,
## # production <dbl>, comprehension_sd <dbl>, production_sd <dbl>
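To make the summary columns concrete, the sketch below computes the same kind of statistics for one item at two ages from made-up child-level data. It assumes that comprehension and production are means of 0/1 indicators (so they can be read as proportions of children). That reading, the understands and produces column names, and all of the values are assumptions made for illustration.

```r
# Toy child-level data for one item: 1 = understands/produces, 0 = does not.
toy <- data.frame(
  language    = "Toy language",
  uni_lemma   = "hat",
  age         = c(8, 8, 8, 8, 9, 9, 9, 9),
  understands = c(0, 1, 0, 0, 1, 1, 0, 1),
  produces    = c(0, 0, 0, 0, 0, 1, 0, 0)
)

# Summarise one age group: number of children, means, standard deviations.
summarise_group <- function(d) {
  data.frame(
    language         = d$language[1],
    uni_lemma        = d$uni_lemma[1],
    age              = d$age[1],
    n_children       = nrow(d),
    comprehension    = mean(d$understands),
    production       = mean(d$produces),
    comprehension_sd = sd(d$understands),
    production_sd    = sd(d$produces)
  )
}

# One summary row per age, stacked into a single data frame.
crossling_summary <- do.call(rbind, lapply(split(toy, toy$age), summarise_group))
rownames(crossling_summary) <- NULL

print(crossling_summary)
```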