Accessing the Wordbank database

Mika Braginsky

2018-03-14

The wordbankr package allows you to access data in the Wordbank database from R. This vignette shows some examples of how to use the data loading functions and what the resulting data look like.

There are three different data views that you can pull out of Wordbank: by-administration, by-item, and administration-by-item. Additionally, you can get metadata about the sources and instruments underlying the data. Advanced functionality lets you get estimates of words’ age of acquisition and word mappings across languages.

Administrations

The get_administration_data() function gives by-administration information, either for a specific language and/or form or for all instruments.

library(wordbankr)
get_administration_data(language = "English (American)", form = "WS")
## # A tibble: 5,520 x 15
##    data_id   age comprehension production           language  form
##      <dbl> <int>         <int>      <int>              <chr> <chr>
##  1  129242    27           497        497 English (American)    WS
##  2  129243    21           369        369 English (American)    WS
##  3  129244    26           190        190 English (American)    WS
##  4  129245    27           264        264 English (American)    WS
##  5  129246    19           159        159 English (American)    WS
##  6  129247    30           513        513 English (American)    WS
##  7  129248    25           444        444 English (American)    WS
##  8  129249    24           582        582 English (American)    WS
##  9  129250    28           558        558 English (American)    WS
## 10  129251    18             7          7 English (American)    WS
## # ... with 5,510 more rows, and 9 more variables: birth_order <fctr>,
## #   ethnicity <fctr>, sex <fctr>, zygosity <chr>, norming <lgl>,
## #   mom_ed <fctr>, longitudinal <lgl>, source_name <chr>, license <chr>
get_administration_data()
## # A tibble: 73,230 x 15
##    data_id   age comprehension production language  form birth_order
##      <dbl> <int>         <int>      <int>    <chr> <chr>      <fctr>
##  1   29821    13           293         88 Croatian    WG        <NA>
##  2   29822    16           122         12 Croatian    WG        <NA>
##  3   29823     9             3          0 Croatian    WG        <NA>
##  4   29824    12             0          0 Croatian    WG        <NA>
##  5   29825    12            44          0 Croatian    WG        <NA>
##  6   29826     8            14          5 Croatian    WG        <NA>
##  7   29827     9             2          1 Croatian    WG        <NA>
##  8   29828    10            44          1 Croatian    WG        <NA>
##  9   29829    13           172         51 Croatian    WG        <NA>
## 10   29830    16           241         68 Croatian    WG        <NA>
## # ... with 73,220 more rows, and 8 more variables: ethnicity <fctr>,
## #   sex <fctr>, zygosity <chr>, norming <lgl>, mom_ed <fctr>,
## #   longitudinal <lgl>, source_name <chr>, license <chr>
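
Once loaded, by-administration data can be summarised directly. The following is a minimal sketch that uses a small hand-made data frame (the first ten rows of the English (American) WS output above) instead of a live database call, and base R only so that it runs without extra packages. The object names admins and median_by_age are illustrative.

```r
# Toy stand-in for the first ten rows of the English (American) WS output above;
# a real analysis would start from get_administration_data() instead.
admins <- data.frame(
  data_id    = 129242:129251,
  age        = c(27, 21, 26, 27, 19, 30, 25, 24, 28, 18),
  production = c(497, 369, 190, 264, 159, 513, 444, 582, 558, 7)
)

# Median productive vocabulary size at each age (in months)
median_by_age <- aggregate(production ~ age, data = admins, FUN = median)
median_by_age
```

With only ten rows most ages have a single child, so the medians here are not meaningful estimates; the point is the shape of the computation.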

Items

The get_item_data() function gives by-item information, either for a specific language and/or form or for all instruments.

get_item_data(language = "Italian", form = "WG")
## # A tibble: 505 x 11
##    item_id                          definition language  form        type
##      <chr>                               <chr>    <chr> <chr>       <chr>
##  1  item_1 Risponde quando è chiamato per nome  Italian    WG first_signs
##  2  item_2                   Risponde ad un No  Italian    WG first_signs
##  3  item_3 Reagisce ad un C'è la mamma/il papà  Italian    WG first_signs
##  4  item_4                       Vuoi la pappa  Italian    WG     phrases
##  5  item_5               Hai sonno? Sei stanco  Italian    WG     phrases
##  6  item_6                          Vuoi bere?  Italian    WG     phrases
##  7  item_7                        Stai attento  Italian    WG     phrases
##  8  item_8                          Stai buono  Italian    WG     phrases
##  9  item_9                     Batti le manine  Italian    WG     phrases
## 10 item_10               Cambiamo il pannolino  Italian    WG     phrases
## # ... with 495 more rows, and 6 more variables: category <chr>,
## #   lexical_category <chr>, lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>, num_item_id <dbl>
get_item_data()
## # A tibble: 28,898 x 11
##     item_id definition language  form  type     category lexical_category
##       <chr>      <chr>    <chr> <chr> <chr>        <chr>            <chr>
##  1  item_81     gristi Croatian    WG  word action_words       predicates
##  2 item_264     puhati Croatian    WG  word action_words       predicates
##  3 item_269    razbiti Croatian    WG  word action_words       predicates
##  4  item_64   donijeti Croatian    WG  word action_words       predicates
##  5 item_153     kupiti Croatian    WG  word action_words       predicates
##  6  item_36    čistiti Croatian    WG  word action_words       predicates
##  7 item_384  zatvoriti Croatian    WG  word action_words       predicates
##  8 item_243    plakati Croatian    WG  word action_words       predicates
##  9 item_246    plesati Croatian    WG  word action_words       predicates
## 10  item_42     crtati Croatian    WG  word action_words       predicates
## # ... with 28,888 more rows, and 4 more variables: lexical_class <chr>,
## #   uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>
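
A common next step is to pick out the item_id values of a subset of items. The sketch below does this on a toy copy of the first ten rows of the Italian WG output above (only the item_id and type columns), using base R; with real data the same subsetting would be applied to the result of get_item_data().

```r
# Toy stand-in for the first ten rows of the Italian WG output above
items <- data.frame(
  item_id = paste0("item_", 1:10),
  type = c(rep("first_signs", 3), rep("phrases", 7)),
  stringsAsFactors = FALSE
)

# How many items of each type are there?
table(items$type)

# item_id values of the "phrases" items only
phrase_ids <- items$item_id[items$type == "phrases"]
phrase_ids
```

A character vector of item_id values like phrase_ids is the form in which items are identified elsewhere in the package.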

Administrations x Items

If you are only looking at total vocabulary size, the administration data are all you need, since they include both productive and receptive vocabulary sizes. If you are looking at specific items or subsets of items, you need to load instrument data, using the get_instrument_data() function. Pass it an instrument language and form, along with a list of items you want to extract (by item_id).

get_instrument_data(
  language = "English (American)",
  form = "WS",
  items = c("item_26", "item_46")
)
## # A tibble: 11,692 x 3
##    data_id    value num_item_id
##      <dbl>    <chr>       <dbl>
##  1  129242 produces          26
##  2  129243 produces          26
##  3  129244 produces          26
##  4  129245 produces          26
##  5  129246                   26
##  6  129247 produces          26
##  7  129248 produces          26
##  8  129249 produces          26
##  9  129250 produces          26
## 10  129251                   26
## # ... with 11,682 more rows

By default get_instrument_data() returns a data frame with columns of the administration’s data_id, the item’s num_item_id (numerical item_id), and the corresponding value. To include administration information, you can set the administrations argument to TRUE, or pass the result of get_administration_data() as administrations (that way you can prevent the administration data from being loaded multiple times). Similarly, you can set the iteminfo argument to TRUE, or pass it the result of get_item_data().
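
Conceptually, including administration information amounts to joining the instrument data to the administration data on data_id. The sketch below illustrates that join on toy rows copied from the outputs above, using base R merge(). It is an illustration of the resulting shape only, and is not meant to describe how get_instrument_data() works internally.

```r
# Toy instrument data: three rows copied from the output above
instrument_data <- data.frame(
  data_id = c(129242, 129243, 129246),
  value = c("produces", "produces", ""),
  num_item_id = 26,
  stringsAsFactors = FALSE
)

# Toy administration data: ages for the same data_ids,
# copied from the Administrations section
admins <- data.frame(
  data_id = c(129242, 129243, 129246),
  age = c(27, 21, 19)
)

# Attach administration columns by joining on data_id
joined <- merge(instrument_data, admins, by = "data_id")
joined
```

Each row of joined now carries both the item response (value) and the child's age, which is what analyses of items over age need.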

Loading the data is fast if you need only a handful of items, but the time scales about linearly with the number of items, and can get quite slow if you need many or all of them. So, it’s a good idea to filter down to only the items you need before calling get_instrument_data().

As an example, let’s say we want to look at the production of animal words on English Words & Sentences over age. First we get the items we want:

library(dplyr)  # provides %>% and filter()
animals <- get_item_data(language = "English (American)", form = "WS") %>%
  filter(category == "animals")

Then we get the instrument data for those items:

animal_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = animals$item_id,
                                   administrations = TRUE)

Finally, we calculate how many animal words each child produces and the median number of animal words for each age bin:

animal_summary <- animal_data %>%
  mutate(produces = value == "produces") %>%
  group_by(age, data_id) %>%
  summarise(num_animals = sum(produces, na.rm = TRUE)) %>%
  group_by(age) %>%
  summarise(median_num_animals = median(num_animals, na.rm = TRUE))

library(ggplot2)
ggplot(animal_summary, aes(x = age, y = median_num_animals)) +
  geom_point() +
  labs(x = "Age (months)", y = "Median number of animal words produced")

Metadata

Instruments

The get_instruments() function gives information on all the CDI instruments in Wordbank.

get_instruments()
## # A tibble: 50 x 6
##    instrument_id              language  form age_min age_max has_grammar
##            <int>                 <chr> <chr>   <int>   <int>       <int>
##  1             1 British Sign Language    WG       8      36           0
##  2             2             Cantonese    WS      16      30           0
##  3             3              Croatian    WG       8      16           0
##  4             4              Croatian    WS      16      30           0
##  5             5                Danish    WS      16      36           1
##  6             6    English (American)    WG       8      18           0
##  7             7    English (American)    WS      16      30           1
##  8             8                German    WS      18      30           0
##  9             9                Hebrew    WG      11      25           0
## 10            10                Hebrew    WS      18      24           0
## # ... with 40 more rows
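
The age_min and age_max columns make it easy to find which instruments cover a given age. Here is a minimal sketch on a toy subset of the rows shown above, in base R; target_age is a hypothetical value chosen for the example, and with real data the same subset() call would be applied to the result of get_instruments().

```r
# Toy subset of the instruments table shown above
instruments <- data.frame(
  language = c("Croatian", "Croatian", "English (American)",
               "English (American)", "German"),
  form = c("WG", "WS", "WG", "WS", "WS"),
  age_min = c(8, 16, 8, 16, 18),
  age_max = c(16, 30, 18, 30, 30),
  stringsAsFactors = FALSE
)

target_age <- 17  # hypothetical age in months

# Keep instruments whose age range includes target_age
covering <- subset(instruments, age_min <= target_age & age_max >= target_age)
covering[, c("language", "form")]
```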

Sources

The get_sources() function gives information on all the data sources in Wordbank, either for a specific language and/or form or for all instruments. If the admin_data argument is set to TRUE, the results will also include the number of administrations in the database from that source and the minimum and maximum ages of those administrations.

get_sources(form = "WG")
## # A tibble: 26 x 9
##    source_id          name      dataset instrument_language
##        <int>         <chr>        <chr>               <chr>
##  1         9      Marchman      Norming  English (American)
##  2        10         Byers               English (American)
##  3        11          Thal           13  English (American)
##  4        12          Thal           16  English (American)
##  5        14      Marchman      Norming   Spanish (Mexican)
##  6        18 Kristoffersen                        Norwegian
##  7        19 Kristoffersen longitudinal           Norwegian
##  8        20          CLEX                         Croatian
##  9        24          CLEX                          Russian
## 10        26          CLEX                          Swedish
## # ... with 16 more rows, and 5 more variables: instrument_form <fctr>,
## #   contributor <chr>, citation <chr>, longitudinal <lgl>, license <fctr>
get_sources(language = "Spanish (Mexican)", admin_data = TRUE) %>%
  select(source_id, name, dataset, instrument_form, n_admins, age_min, age_max)
## # A tibble: 4 x 7
##   source_id     name  dataset   instrument_form n_admins age_min age_max
##       <int>    <chr>    <chr>            <fctr>    <int>   <int>   <int>
## 1        13 Marchman  Norming Words & Sentences     1094      16      30
## 2        14 Marchman  Norming  Words & Gestures      778       8      19
## 3        65  Fernald Outreach  Words & Gestures       55      16      22
## 4        66  Fernald Outreach Words & Sentences       80      18      38

Advanced functionality: Age of acquisition

The fit_aoa() function computes estimates of items’ age of acquisition (AoA). It needs to be provided with a data frame returned by get_instrument_data(), with one row per administration x item combination and, minimally, the columns age and num_item_id. It returns a data frame with one row per item and an aoa column with the estimate, preserving any item-level columns in the input data. The AoA is estimated by computing the proportion of administrations for which the child understands/produces (measure) each word, smoothing the proportion using method, and taking the age at which the smoothed value is greater than proportion.

eng_ws_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = c("item_1", "item_42"),
                                   administrations = TRUE,
                                   iteminfo = TRUE)
fit_aoa(eng_ws_data)
## # A tibble: 2 x 10
##   num_item_id   aoa item_id definition  type category lexical_category
##         <dbl> <dbl>   <chr>      <chr> <chr>    <chr>            <chr>
## 1           1    16  item_1    baa baa  word   sounds            other
## 2          42    24 item_42        owl  word  animals            nouns
## # ... with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>
fit_aoa(eng_ws_data, measure = "understands", method = "glmrob", proportion = 0.7)
## # A tibble: 2 x 10
##   num_item_id   aoa item_id definition  type category lexical_category
##         <dbl> <dbl>   <chr>      <chr> <chr>    <chr>            <chr>
## 1           1    21  item_1    baa baa  word   sounds            other
## 2          42    27 item_42        owl  word  animals            nouns
## # ... with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>
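
To make the estimation procedure concrete, the sketch below walks through the same three steps (proportion by age, smoothing, thresholding) for a single item. It rests on several assumptions that are not taken from the package: the counts are invented toy data, the smoother is a logistic regression fitted with stats::glm(), and the threshold is 0.5. It is not a reimplementation of fit_aoa() and may differ from its defaults and internals.

```r
# HYPOTHETICAL toy data for one item: at each age, n children were
# administered the form and n_produces of them produce the word
toy <- data.frame(
  age = 16:30,
  n = rep(20, 15),
  n_produces = c(1, 2, 3, 4, 6, 8, 10, 12, 14, 15, 17, 18, 19, 19, 20)
)

# Step 1: raw proportion of children producing the word at each age
toy$prop <- toy$n_produces / toy$n

# Step 2: smooth the proportions (assumed smoother: logistic regression on age)
fit <- glm(cbind(n_produces, n - n_produces) ~ age,
           family = binomial, data = toy)
grid <- data.frame(age = seq(min(toy$age), max(toy$age), by = 1))
grid$fitted <- predict(fit, newdata = grid, type = "response")

# Step 3: earliest age at which the smoothed value exceeds the threshold
threshold <- 0.5  # assumed value for this sketch
aoa <- min(grid$age[grid$fitted > threshold])
aoa
```

Raising threshold moves the estimate later, which matches the pattern in the two fit_aoa() calls above, where proportion = 0.7 gives later AoA values.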

Advanced functionality: Cross-linguistic data

One of the item-level fields is uni_lemma (“universal lemma”), which is intended to be an approximate semantic mapping between words across the languages in Wordbank. The function get_crossling_items() simply gives all the available uni_lemma values.

get_crossling_items()
## # A tibble: 1,379 x 1
##          uni_lemma
##              <chr>
##  1               a
##  2        a little
##  3           a lot
##  4            able
##  5           about
##  6           above
##  7           after
##  8       afternoon
##  9           again
## 10 air conditioner
## # ... with 1,369 more rows
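
Before requesting cross-linguistic data, it can be useful to check which of the words you are interested in actually exist as uni_lemma values. A minimal sketch, using a toy copy of the ten values printed above and base R set operations; the wanted vector is hypothetical, and with real data lemmas would be the result of get_crossling_items().

```r
# Toy copy of the first ten uni_lemma values shown above
# (the real table has many more rows)
lemmas <- data.frame(
  uni_lemma = c("a", "a little", "a lot", "able", "about", "above",
                "after", "afternoon", "again", "air conditioner"),
  stringsAsFactors = FALSE
)

wanted <- c("about", "again", "not a lemma")  # hypothetical query

# Which requested words are available, and which are not?
available <- intersect(wanted, lemmas$uni_lemma)
unavailable <- setdiff(wanted, lemmas$uni_lemma)
available
unavailable
```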

The function get_crossling_data() takes a vector of uni_lemmas and returns a data frame of summary statistics for each item mapped to that uni_lemma in any language (on WG forms). Each row is a combination of item and age, and the columns indicate the number of children (n_children), means (comprehension, production), standard deviations (comprehension_sd, production_sd), and item-level fields.

get_crossling_data(uni_lemmas = c("hat", "nose")) %>%
  ungroup() %>%
  select(language, uni_lemma, definition, age, n_children, comprehension,
         production, comprehension_sd, production_sd) %>%
  arrange(uni_lemma)
## # A tibble: 365 x 9
##                 language uni_lemma definition   age n_children
##                    <chr>     <chr>      <chr> <int>      <int>
##  1 British Sign Language       hat        hat     8          4
##  2 British Sign Language       hat        hat     9          4
##  3 British Sign Language       hat        hat    10          4
##  4 British Sign Language       hat        hat    11          6
##  5 British Sign Language       hat        hat    12          6
##  6 British Sign Language       hat        hat    13          6
##  7 British Sign Language       hat        hat    14          7
##  8 British Sign Language       hat        hat    15          6
##  9 British Sign Language       hat        hat    16          7
## 10 British Sign Language       hat        hat    17          7
## # ... with 355 more rows, and 4 more variables: comprehension <dbl>,
## #   production <dbl>, comprehension_sd <dbl>, production_sd <dbl>
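
Because each row is a per-age summary based on a different number of children, pooling across ages should weight by n_children. The sketch below shows this for one item in base R. The age and n_children values are copied from the first four rows above, but the production values are HYPOTHETICAL, since that column is not visible in the printed output; production is assumed to be a per-age mean as described above.

```r
# First four rows for "hat" in British Sign Language (ages and n_children
# from the output above); production values are HYPOTHETICAL placeholders
hat_bsl <- data.frame(
  age = 8:11,
  n_children = c(4, 4, 4, 6),
  production = c(0, 0, 0.25, 0.5)
)

# Pooled mean production across ages, weighted by the number of children
pooled <- weighted.mean(hat_bsl$production, w = hat_bsl$n_children)
pooled
```

An unweighted mean() would treat the age bins equally, which overstates the influence of bins with few children.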