NEW FEATURES
- The (weak) dependency on the polmineR package (it was in the ‘Suggests:’ section of the DESCRIPTION file) has been removed. Changes are purely internal (higher-level polmineR functions have been replaced by lower-level RcppCWB functions, some tests were re-written). Dropping the dependency has the advantage that there is a much clearer structure of dependencies now (RcppCWB -> cwbtools -> polmineR).
MINOR IMPROVEMENTS
- A remaining CLI formatting issue has been removed from the user dialogue for modifying the .Renviron file.
- Unit tests used a test download of the United Nations General Assembly (UNGA) corpus from Zenodo. To reduce the time required for testing the package, a test download of the (much smaller) GermaParlSample corpus is now performed instead.
BUG FIXES
- The corpus_install() function tried to ask for user feedback when not in an interactive session. The function now checks whether it is possible to ask for user feedback.
- Part of the output of the cwbtools::create_cwb_directories() function was shown even if verbose was FALSE. Fixed.
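The check for an interactive session described for corpus_install() can be sketched as follows. This is a minimal illustration of the general pattern, not the package's actual code; the helper name ask_user() and its default argument are hypothetical.

```r
# Hypothetical helper: only prompt when a user can actually answer.
ask_user <- function(question, default = TRUE) {
  # In non-interactive sessions (e.g. Rscript, R CMD check) there is
  # nobody to answer the prompt, so fall back to the supplied default.
  if (!interactive()) return(default)
  answer <- readline(prompt = paste0(question, " (y/n): "))
  tolower(trimws(answer)) %in% c("y", "yes")
}
```

Returning a caller-supplied default keeps scripted installs deterministic while still allowing a dialogue at the console.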
NEW FEATURES
- The corpus_install() function gives much clearer reports on the steps performed during corpus downloads. User dialogues have been reworked thoroughly to provide better user guidance.
- The use_corpus_registry_envvar() function is called by corpus_install() and will amend the .Renviron file as appropriate if the user so desires.
- To resolve a DOI, the ‘zen4R’ package is used to extract information on the whereabouts of a corpus tarball efficiently from the Zenodo API.
- A corpus_testload() function has been implemented to check whether a (newly installed) corpus is accessible.
MINOR IMPROVEMENTS
- Extracting the version number from the corpus tarball is somewhat more forgiving if the version number does not start with “v”.
- The registry file for a newly downloaded corpus is refreshed only if a temporary registry directory is used.
- To remedy the fairly common error that the path to the info file is not stated correctly in the registry file, a fallback mechanism will look up potential alternatives to an info file stated wrongly.
BUG FIXES
- The JSON string returned from Zenodo may include newline characters that are escaped in a way that cannot be processed by jsonlite::fromJSON(). The auxiliary function to get and process information from Zenodo now ensures that newline characters are escaped such that they can be processed.
- The corpus_copy() function did not set the path to the info file to the new data directory. Corrected.
- The corpus_install() function failed when registry_dir got a NULL value from the default call to cwbtools::cwb_registry_dir(). But if the directories are created, the registry directory is there. Fixed.
- Removed a bug (faulty assignment) that prevented the path of a registry file from being handled correctly (i.e. wrapped in quotation marks) by registry_file_compose() when the path includes any whitespace characters.
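The quoting rule mentioned for registry file paths can be illustrated with a small sketch. The helper quote_path() is hypothetical and is not part of cwbtools; it only shows the idea of wrapping a path in quotation marks when it contains whitespace.

```r
# Hypothetical helper: wrap a path in double quotes if it contains
# any whitespace character, otherwise return it unchanged.
quote_path <- function(path) {
  if (grepl("\\s", path)) sprintf("\"%s\"", path) else path
}

quote_path("/home/user/cwb/indexed_corpora")
quote_path("/home/user/My Corpora/indexed_corpora")
```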
DOCUMENTATION FIXES
- The README.md file now includes an extended explanation of how to install the development version of cwbtools using remotes::install_github(). This addresses a problem with updating the curl dependency of cwbtools that may arise when devtools::install_github() is used (#21).
NEW FEATURES
- The install_corpus() function has been reworked thoroughly. Using system directories for the registry and the corpus directory is now supported. This is a prerequisite for installing corpora outside of R packages. Installing corpora within packages is not allowed by CRAN.
- A set of new auxiliary functions (cwb_directories(), cwb_registry_dir(), cwb_corpus_dir()) will get the whereabouts of the registry directory and the corpus directory. In particular, they consider that the polmineR package may have generated a temporary corpus registry, resetting the CORPUS_REGISTRY environment variable.
- The install_corpus() function accepts an argument doi to provide a Document Object Identifier (DOI). At this stage, the DOI is assumed to be awarded by Zenodo. Information available at the Zenodo site will be resolved to get the URL of a corpus tarball that can be downloaded. Upon installing a corpus from Zenodo, the DOI and the version number will be written as corpus properties into the registry file.
- To avoid removing corpora accidentally, the corpus_install() function will ask the user for feedback if a corpus would be installed that is already present and that would be deleted or overwritten.
- New auxiliary functions create_cwb_directories() and use_corpus_registry_envvar() will assist users in creating the required directory structure for CWB indexed corpora.
MINOR IMPROVEMENTS
- The default value of the argument “repo” that defines the repository for packaged corpora is now the drat repository of the PolMine GitHub account (“https://PolMine.github.io/drat/”).
DOCUMENTATION FIXES
- New R6 Roxygen documentation is used for documenting the CorpusData class.
- A (preliminary) vignette has been added that explains how a sentence annotation can be added to an existing indexed corpus.
BUG FIXES
- Trying to remove the entire temporary session directory at the end of the package vignettes caused problems when building the package documentation. A more limited approach to cleaning up temporary files after building the vignettes avoids this problem.
MINOR IMPROVEMENTS
- The pkg_add_corpus() function will now create the cwb directories (registry and data directory) if necessary. Previously, these directories were required to exist before moving a corpus into a package, making it necessary to put dummy files into packages to keep R CMD build from issuing warnings and git from dropping these directories. Creating the directories on demand is a precondition for a CRAN release of data packages (#11).
BUG FIXES
- In the upcoming R version 4.0, the matrix class will inherit from class array. The new package version now takes into account that length(class(matrix(1:4,2,2))) will return the value 2.
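The practical consequence of the matrix class change can be sketched as follows. This illustrates the general issue and two checks that work under both the old and the new behaviour; it is not taken from the package code.

```r
m <- matrix(1:4, 2, 2)

# Fragile: from R 4.0.0 onwards class(m) is c("matrix", "array"), so a
# comparison such as class(m) == "matrix" yields a logical vector of
# length 2 rather than a single TRUE/FALSE value.

# Robust alternatives that return a single logical value:
is.matrix(m)
inherits(m, "matrix")
```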
DOCUMENTATION FIXES
- The NEWS file now follows the styleguide such that pkgdown::build_site() will generate a proper changelog page.
- updated vignette so that annex explains installation of CoreNLP v3.9.2 (2018-10-05)
- New functions s_attribute_get_regions() and s_attribute_get_values().
- In corpus_install(), using download.file() replaces curl::curl_download() for Windows because curl apparently is not able to process target filenames that include special characters.
- For Windows machines, there is a check for non-ASCII characters in the file path. If any are found, a path generated by a call to shortPathName() is used.
- In the vignette, the registry is reset after creating the new corpora, to make the new corpus available.
- A (preliminary) decode()-method will turn a partition into an Annotation object from the NLP package.
- A new conll_get_regions()-function will turn a CoNLL-style annotated token stream into a table with regions that can be encoded using s_attribute_encode().
- A new function s_attribute_merge() will merge two data.table objects defining s-attributes, checking for overlaps.
- New functions p_attribute_recode(), s_attribute_recode(), and the supplementary functions s_attribute_files() and corpus_recode().
- Any call to tempdir() is now wrapped as normalizePath(tempdir(), winslash = "/") to avoid problems on Windows, where different file separators may be used.
- When calling file.path(), the argument fsep is “/” to prevent confusion of file separators.
- A new function corpus_copy() is available to create a copy of a corpus.
- Working example for s_attribute_encode().
- A call to cl_delete_corpus() from RcppCWB is added to s_attribute_encode(), so that newly added s-attributes can be used without restarting the R session.
- The corpus_copy() function was defined (and documented) twice in a confusing manner. This is cleaned up.
- Calls to installed.packages() were replaced to follow advice from the CRAN team in the submission process.
- Missing documentation written for fields of class CorpusData.
- New fields ‘sentences’ and ‘named_entities’ added to class CorpusData, as a basis for encoding annotation of sentences and named entities.
- issue with parsing path correctly in registry_file_path when path is in inverted commas solved (adjusted regex)
- issue with ALTREP vector for corpus positions resolved
- layout of progress bars consistently using pbapply package
- sanity checks for s_attribute_encode, ensure that region_matrix is integer matrix
- s_attribute_encode when called with method = “R” will now add s_attribute to registry
- s_attribute_encode will add structural attribute to registry when using R implementation, too
- corpus_as_tarball-function added
- install_corpus able to install from tarball
- progress option for CorpusData$import_xml()-method
- Minimal rework of progress bar in CorpusData$add_corpus_positions() (helper function .fn)
- Three dots (…) are passed into download.file() by install_corpus(), if argument tarball is specified. This is a precondition for passing arguments to download password-protected corpora.
- major bug removed when writing regions to disk (s_attribute_encode) with R
- when creating/removing files in p_attribute_encode, only basenames of filenames are outputted
- for CorpusData$encode(), an already existing corpus will be removed
- bug removed in function pkg_create_cwb_dirs causing error when a directory already exists
- new vignette ‘europarl’: sample workflow for putting indexed corpus into package
- for $tokenize()-method of CorpusData: stricter requirement that chunkdata is data.table
- progress bar for $tokenize()-method, when tokenizers package is used
- tilde expansion for paths that are passed into p_attribute_encode
- stri_detect_regex replacing grepl to speed things up in p_attribute_encode
- awful workaround for coping with latin1 removed in p_attribute_encode
- strip_punct = FALSE for $tokenize() method of CorpusData
- purging the data for the CWB has been moved away from p_attribute_encode to a $purge()-method of CorpusData (to be performed on chunkdata) as a matter of efficiency.
- continuous removal of objects and garbage collection in p_attribute_encode to be as parsimonious with memory as possible
- checking of encoding in p_attribute_encode has been moved to $check_encoding() method in CorpusData-class to keep necessity to copy around vectors (potentially exceeding memory) to a minimum.
- additional parameters passed into tokenizers::tokenize_words by …
- writing hex for content of s_attributes to cope with encoding issues
- values coerced to character
- DataPackage class turned into pkg_*-functions
- first version that passes all tests
- askYesNo function has been replaced by readline(), to ensure compatibility with R versions < 3.5