Creating a Corpus Data Package: Wrapping Europarl

Andreas Blätte (andreas.blaette@uni-due.de)

2020-07-21

Creating the ‘europarl’

The European Parliament Proceedings Parallel Corpus 1996-2011, known as Europarl is a classical resource in the CWB community for demonstration purposes. It is a resource valuable in itself. So we will be happy in the following example to use the CWB version of Europarl as an example how to wrap a CWB indexed corpus into a R data package.

Preparations

The example keeps the number of dependencies to a minimum, so we only need to load the cwbtools package.

library(cwbtools)
library(devtools)
## Loading required package: usethis

The procedure will be as follow:

We then conclude the example by running a simple test whether the corpora can be used.

For the steps 1 through 5, we use a temporary directory. One precaution in advance: The files that will be downloaded are large (~ 2 GB). So it will be important to remove the temporary data at the end.

tmp_dir <- tempdir()

And because the whole operation is too much for an ordinary process to build R packages, we set an argument whether all the code is executed to FALSE.

exec <- FALSE

Downloading the CWB version of Europarl

The OPUS project offers a tarball with the Europarl corpus in English, French, German, Dutch, Spanish and Italian. The download may take a while. Take care that you have a good internet connection!

europarl_url <- "http://corpora.linguistik.uni-erlangen.de/demos/download/Europarl3-CWB-2010-02-28.tar.gz"
europarl_tarball <- file.path(tmp_dir, basename(europarl_url))
download.file(url = europarl_url, destfile = europarl_tarball)

Extract the compressed file

The tarball is large, so please do not be surprised that extracting it may take a while.

untar(tarfile = europarl_tarball, exdir = tmp_dir)
unlink(europarl_tarball)

Adjust paths

The directory with the extracted files includes a directory registry and data. To be able to move the corpora later on, the paths in the registry files need to point to the correct directories in the data directory.

europarl_registry_dir <- file.path(tmp_dir, "Europarl3-CWB", "registry")
europarl_data_dir <- file.path(tmp_dir, "Europarl3-CWB", "data")
corpora <- list.files(europarl_registry_dir)
corpora

So these are the corpora. We iterate through the registry files and adjust the path to the data directories.

for (corpus in corpora){
  registry <- registry_file_parse(corpus = corpus, registry_dir = europarl_registry_dir)
  registry[["home"]] <- file.path(europarl_data_dir, gsub("^europarl-(.*)$", "\\1", corpus))
  registry_file_write(data = registry, corpus = corpus, registry_dir = europarl_registry_dir)
}

You now already have working, albeit temporary installation of Europarl on your system. But we want to move things into a data package.

Create data package

We create a directory for the package and generate the directory structure required to put CWB indexed corpus in there.

europarl_pkg_dir <- file.path(tmp_dir, "europarl")
if (!file.exists(europarl_pkg_dir)) dir.create(europarl_pkg_dir)
pkg_create_cwb_dirs(pkg = europarl_pkg_dir)

To acknowledge authorship for the Europarl corpus, we use some information in the readme.txt that is included in the Europarl tarball.

europarl_desc <- paste0(
  readLines(file.path(tmp_dir, "Europarl3-CWB", "readme.txt")),
  collapse = " "
  )
europarl_desc

The package might include much more, so we choose a version for the package indicating infancy.

pkg_version <- "0.0.2"

So we generate the DESCRIPTION file …

pkg_add_description(
  pkg = europarl_pkg_dir, package = "europarl", version = pkg_version,
  date = Sys.Date(),
  author = "cwbtools",
  maintainer = "Andreas Blaette <andreas.blaette@uni-due.de>",
  description = europarl_desc
  )
pkg_add_configure_scripts(pkg = europarl_pkg_dir)

Moving corpora

for (corpus in corpora){
  pkg_add_corpus(pkg = europarl_pkg_dir, corpus = corpus, registry = europarl_registry_dir)
}

The untarred corpus will not be needed at this stage. We remove it, so that we do not waste disk space.

unlink(file.path(tmp_dir, "Europarl3-CWB"), recursive = TRUE)

Create and install package tarball

europarl_tarball <- build(pkg = europarl_pkg_dir, path = tmp_dir, vignettes = TRUE)

There are large binary data in the package, so this may take a while. The return value of the build-function is the path to the tarball of the package that has been generated. The install.packages-function can be used to install from a local file if we set the argument param to NULL.

install.packages(pkgs = europarl_tarball, repos = NULL)

Does it work?

The more convenient way to work with the corpora in the package is to use the polmineR-package. You would load the package (library(polmineR)), activate the corpora in the europarl-package (use("europarl")), check which corpora are present (corpus()), and perform some simple query (kwic("EUROPARL-EN", query = "Europe")).

To keep installation requirements to a minimum, the cwbtools-package does not make polmineR a dependency, so we use the low-level functionality of RcppCWB to check whether everything works.

The first question is: Which corpora do we have in the europarl package?

library(RcppCWB)
europarl_pkg_registry <- system.file(package = "europarl", "extdata", "cwb", "registry")
Sys.setenv(CORPUS_REGISTRY = europarl_pkg_registry)
cqp_initialize()
cqp_list_corpora()

Second, we want to have a look at the concordances of ‘Europe’ in the English version of Europarl.

query <- "Europe"
id <- cl_str2id(
  corpus = "europarl-en", registry = europarl_pkg_registry,
  str = query, p_attribute = "word"
  )
cpos <- cl_id2cpos(
  corpus = "europarl-en", registry = europarl_pkg_registry,
  id = id, p_attribute = "word"
)
tab <- data.frame(
  i = unlist(lapply(1:length(cpos), function(x) rep(x, times = 11))),
  cpos = unlist(lapply(cpos, function(x) (x - 5):(x + 5)))
  )
tab[["id"]] <- cl_cpos2id(
  corpus = "europarl-en", registry = europarl_pkg_registry,
  cpos = tab[["cpos"]], p_attribute = "word"
  )
tab[["str"]] <- cl_id2str(
  corpus = "europarl-en", registry = europarl_pkg_registry,
  p_attribute = "word", id = tab[["id"]]
  )
concordances_list <- split(tab[["str"]], as.factor(tab[["i"]]))
concordances <- unname(sapply(concordances_list, function(x) paste(x, collapse = " ")))
head(concordances)

As mentioned before: Using polmineR is what we recommend for working with CWB corpora from R.

Finally, time to clean up. There are some large files in the temporary directory that has been used. It should be deleted to avoid wasting space on your disk.

unlink(tmp_dir, recursive = TRUE)
unlink(file.path(tmp_dir, "europarl"))

Enjoy!