soundcorrs: Semi-Automatic Analysis of Sound Correspondences

Kamil Stachowski

2020-04-24

‘soundcorrs’ is a small package whose purpose in life is to help linguists analyse sound correspondences between languages. It does not attempt to draw any conclusions on its own; this responsibility is placed entirely on the user. ‘soundcorrs’ merely automates and facilitates certain tasks, such as preparing the material part of a paper or looking for examples of specific correspondences, and, by making various functions available, it suggests possible paths of analysis which may not be immediately obvious to the more traditional linguist.

This vignette assumes that the reader not only is a linguist and has at least a general idea about what kind of outputs he or she might want from ‘soundcorrs’, but also has at least a passing familiarity with R and a basic understanding of statistics. Most problems can probably be read up on as they appear in the text, but it is nevertheless recommended to start by very briefly acquainting oneself with R, for example by reading the first page of Quick-R, R Tutorial, or another R primer. In particular, it is assumed that the reader will know how to access and understand the built-in documentation, as not all arguments are discussed here.

A less technical introduction to ‘soundcorrs’ is also available in Stachowski K. [forthcoming]. soundcorrs: Tools for Semi-Automatic Analysis of Sound Correspondences. If you use ‘soundcorrs’ in your research, please cite this paper.

The first section of this vignette briefly discusses how to prepare data for ‘soundcorrs’. The second section is an overview of all the analytic functions exported by ‘soundcorrs’, organized by their output, and of the helper functions, in alphabetical order.

As of version 0.1.1, most ‘soundcorrs’ functions operate on pairs of words, coming from two different languages. (Technically, “language” can of course be anything, so long as it clearly determines which word is the first in the pair, and which one the second.) The discussion below will use ‘L1’ to refer to the first language, and ‘L2’ to refer to the second.

Naturally, all the examples given below assume that ‘soundcorrs’ is installed and loaded:

# install.packages ("soundcorrs")
library (soundcorrs)
#> 
#> Attaching package: 'soundcorrs'
#> The following object is masked from 'package:base':
#> 
#>     table

Data preparation

‘soundcorrs’ requires two kinds of data: transcription and word pairs/triples/…. Both are stored in tsv files, i.e. as tab-separated tables in text files.

Under BSD, Linux, and macOS, the recommended encoding is UTF-8. Unfortunately, it has been found to cause problems under Windows, so Windows users are advised not to use characters outside of the ASCII standard. Some issues can be fixed by converting from UTF-8 to UTF-8 (sic!) with ‘iconv()’, but others resist this and other treatments. Future versions of ‘soundcorrs’ hope to include a solution for this problem.

Transcription

Transcription is not strictly necessary for the functioning of ‘soundcorrs’, but without it linguistic regular expressions (“wildcards”) could not be defined, and involving phonetics in the analysis would be more difficult. Transcription is stored in tsv files with two or three columns: ‘GRAPHEME’, the character or characters used in the transcription; ‘VALUE’, the phonetic value or values of the grapheme; and, optionally, ‘META’, the regular expression a metacharacter translates to (when this column is missing, it is generated automatically; cf. the warning in the “ie” example below).

‘soundcorrs’ contains two sample transcription files: ‘trans-common.tsv’ and ‘trans-ipa.tsv’. Both only cover the basics and are intended more as an illustration than anything else. To load one of them:
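
# establish the paths of the two sample transcriptions
path.trans.com <- system.file ("extdata", "trans-common.tsv", package="soundcorrs")
path.trans.ipa <- system.file ("extdata", "trans-ipa.tsv", package="soundcorrs")

# load one of them; both objects are used in the examples below
trans.com <- read.transcription (path.trans.com)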

Data

Like the transcription, the data are also stored in tsv files. Two formats are theoretically possible: the “long format” in which every word is given its own row, and the “wide format” in which one row holds a pair/triple/… of words (see below).

Words ought to be segmented, and all words in a pair/triple/… must have the same number of segments. The default segment separator is ‘|’. If the words are not segmented, the function ‘addSeparators()’ can be used to facilitate the process of manual segmentation and alignment (see below). Tools for automatic alignment also exist (e.g. alineR, LingPy, PyAline), but it is recommended that their results be thoroughly checked by a human. Apart from the segmented and aligned form, each word must be assigned a language.

Hence, the two obligatory columns in the “long format” are ‘LANGUAGE’, which specifies which language a word belongs to, and ‘ALIGNED’, which holds its segmented and aligned form.

In the “wide format”, similarly, a minimum of two columns is necessary, each holding words from a different language. The information about which column holds which language can then be encoded simply as their names (e.g. ‘LATIN’), or in the form of a suffix attached to the names of columns (e.g. ‘ALIGNED.Latin’).

Regarding the two formats, see also ‘long2wide()’ and ‘wide2long()’ below.

It is possible, though not necessarily recommended, to store data from each language in a separate file; it is also possible to use a different transcription for each language. This flexibility can easily lead to a somewhat cumbersome string of arguments for the reader function, so instead a helper ‘scOne’ class is used to read each language individually before merging them into a ‘soundcorrs’ object. It only accepts data in the “wide format”.

‘soundcorrs’ has three sample datasets:

1. the entirely made-up ‘data-abc.tsv’;
2. ‘data-capitals.tsv’, which contains the names of EU capitals in German, Polish, and Spanish – from the linguistic point of view, this of course makes no sense; it is merely an example that will hopefully not be seen as too exotic regardless of which language or languages the user specializes in (my gratitude is due to José Andrés Alonso de la Fuente, PhD (Cracow, Poland) for help with the Spanish data); and
3. ‘data-ie.tsv’, with a dozen examples of Grimm’s and Verner’s laws (adapted from Campbell L. 2013. Historical Linguistics. An Introduction. Edinburgh University Press. Pp. 136f).

The ‘abc’ dataset is in the “long format”; the ‘capitals’ and ‘ie’ datasets are in the “wide format”. All three are also available as the preloaded datasets ‘sampleSoundCorrsData.abc’, ‘sampleSoundCorrsData.capitals’, and ‘sampleSoundCorrsData.ie’.

# establish the paths of the three datasets
path.abc <- system.file ("extdata", "data-abc.tsv", package="soundcorrs")
path.cap <- system.file ("extdata", "data-capitals.tsv", package="soundcorrs")
path.ie <- system.file ("extdata", "data-ie.tsv", package="soundcorrs")

# read “capitals”
d.cap.ger <- read.scOne (path.cap, "German", "ALIGNED.German", path.trans.com)
#> Warning in scOne(data, name, col.aligned, read.transcription(transcription), :
#> The following segments are not covered by the transcription: jus, ŋk.
d.cap.pol <- read.scOne (path.cap, "Polish", "ALIGNED.Polish", path.trans.com)
#> Warning in scOne(data, name, col.aligned, read.transcription(transcription), :
#> The following segments are not covered by the transcription: ń, ẃ.
d.cap <- soundcorrs (d.cap.ger, d.cap.pol)

# read “ie”
d.ie.lat <- read.scOne (path.ie, "Lat", "LATIN", path.trans.com)
d.ie.eng <- read.scOne (path.ie, "Eng", "ENGLISH", path.trans.ipa)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Missing the
#> metacharacters column. The "META" column was generated.
#> Warning in scOne(data, name, col.aligned, read.transcription(transcription), :
#> The following segments are not covered by the transcription: eɪ, ɪ, aʊ, uː, ɑː,
#> ʊ, iː.
d.ie <- soundcorrs (d.ie.lat, d.ie.eng)

# read “abc”
tmp <- long2wide (read.table(path.abc,header=T), skip=c("ID"))
d.abc.l1 <- scOne (tmp, "L1", "ALIGNED.L1", trans.com)
d.abc.l2 <- scOne (tmp, "L2", "ALIGNED.L2", trans.com)
d.abc <- soundcorrs (d.abc.l1, d.abc.l2)

# individual languages are objects of class ‘scOne’
class (d.abc.l1)
#> [1] "scOne"

# some basic summary
d.abc.l1
#> A "scOne" object.
#>   Language: L1.
#>   Entries: 6.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.
#>   Transcription: /tmp/Rtmp8ER37j/Rinst16af2cb0d881/soundcorrs/extdata/trans-common.tsv.

# ‘cols’ are names of the important columns
# ‘data’ is the original data frame
# ‘name’ is the name of the language
# ‘segms’ are words exploded into segments; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
# ‘segpos’ is a lookup list to check which character belongs to which segment; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
# ‘separator’ is the string used as segment separator
# ‘trans’ is a ‘transcription’ object
# ‘words’ are words obtained by removing separators from the ‘col.aligned’ column; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
str (d.abc.l1, max.level=1)
#> List of 8
#>  $ cols     :List of 1
#>  $ data     :'data.frame':   6 obs. of  7 variables:
#>  $ name     : chr "L1"
#>  $ segms    :List of 2
#>  $ segpos   :List of 2
#>  $ separator: chr "\\|"
#>  $ trans    :List of 3
#>   ..- attr(*, "class")= chr "transcription"
#>   ..- attr(*, "file")= chr "/tmp/Rtmp8ER37j/Rinst16af2cb0d881/soundcorrs/extdata/trans-common.tsv"
#>  $ words    :List of 2
#>  - attr(*, "class")= chr "scOne"

# datasets are objects of class ‘soundcorrs’
class (d.abc)
#> [1] "soundcorrs"

# some basic summary
d.abc
#> A "soundcorrs" object.
#>   Languages: (2): L1, L2.
#>   Entries: 6.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.

# ‘data’ is the original data frame
# ‘cols’ are the same as with ‘scOne’ above, wrapped in a list
# ‘names’ are the names of the languages
# ‘segms’ are the same as with ‘scOne’ above, wrapped in a list
# ‘segpos’ are likewise
# ‘separators’ are likewise, only a vector instead of a list
# ‘trans’ are the individual transcriptions wrapped in a list
# ‘words’ are the same as with ‘scOne’ above, wrapped in a list
str (d.abc, max.level=1)
#> List of 8
#>  $ cols      :List of 2
#>  $ data      :'data.frame':  6 obs. of  7 variables:
#>  $ names     : chr [1:2] "L1" "L2"
#>  $ segms     :List of 2
#>  $ segpos    :List of 2
#>  $ separators: chr [1:2] "\\|" "\\|"
#>  $ trans     :List of 2
#>  $ words     :List of 2
#>  - attr(*, "class")= chr "soundcorrs"

Functions

‘soundcorrs’ exports several functions intended for linguistic analysis. For easier orientation, they are organized below by the kind of output they produce, rather than by their names. ‘soundcorrs’ also exports several functions whose use for linguistic analysis, in and of themselves, is rather limited. Those are grouped in one subsection at the end, and discussed in alphabetical order.

Contingency tables

There are three different functions in ‘soundcorrs’ that produce contingency tables. This may seem like poor design, but there is a logic behind it: ‘summary()’ is only meant to give a general overview of the dataset; ‘table()’ is the essential contingency table function; and ‘allTables()’ produces an output that is meant to be printed rather than read from the screen.

summary()

‘summary()’ produces a segment-to-segment contingency table. The values may represent how many times the two segments co-occur (‘unit=“o”’) or in how many words they co-occur (‘unit=“w”’); this distinction exists because a segment may well appear more than once in a single word. The argument ‘unit’ accepts nine different values: ‘“o(cc(ur(ence(s))))”’ and ‘“w(or(d(s)))”’. By default, L1 segments are in rows and L2 segments in columns; this corresponds to the argument ‘direction’ being set to ‘1’, and can be swapped by setting it to ‘2’. The last argument that can be given to ‘summary()’ is ‘count’, which determines whether the values are given as absolute or as relative numbers. It accepts six values: ‘“a(bs(olute))”’ and ‘“r(el(ative))”’.
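
For example, to see in how many words each L1 segment corresponds to each L2 segment, as relative values (a minimal illustration; the output is omitted here):

# words rather than occurrences, relative rather than absolute values
summary (d.abc, unit="w", count="r")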

table()

When loading ‘soundcorrs’ into R, it warns about ‘table()’ being masked from ‘package:base’. This was necessary to allow ‘table()’ to produce tables from ‘soundcorrs’ objects. The functioning of ‘table()’ for all the other objects should not be affected.

In ‘soundcorrs’, ‘table()’ has two modes: internal and external comparison. The former, invoked when ‘column=NULL’ (the default), cross-tabulates correspondences with themselves. The latter cross-tabulates correspondences with metadata taken from a column in the dataset, whose name is given as the argument ‘column’. Like ‘summary()’ above, ‘table()’ has the arguments ‘unit’ and ‘direction’, which have the same meaning, as well as the argument ‘count’, which may appear to work a little differently; in actuality, its use with ‘summary()’ was a special case. The general idea is that the entire table is divided into blocks such that all rows represent correspondences of the same segment and, in the internal mode, so do all the columns.

# a general look in the internal mode
table (d.abc)
#>      L1→L2
#> L1→L2 -_ə a_a a_o a_u b_b b_w c_c
#>   -_ə   2   2   0   0   2   0   2
#>   a_a   2   4   0   0   4   0   4
#>   a_o   0   0   1   0   1   0   1
#>   a_u   0   0   0   1   0   1   1
#>   b_b   2   4   1   0   5   0   5
#>   b_w   0   0   0   1   0   1   1
#>   c_c   2   4   1   1   5   1   6

# … and in the other direction
table (d.abc, direction=2)
#>      L2←L1
#> L2←L1 a_a b_b c_c o_a u_a w_b ə_-
#>   a_a   4   4   4   0   0   0   2
#>   b_b   4   5   5   1   0   0   2
#>   c_c   4   5   6   1   1   1   2
#>   o_a   0   1   1   1   0   0   0
#>   u_a   0   0   1   0   1   1   0
#>   w_b   0   0   1   0   1   1   0
#>   ə_-   2   2   2   0   0   0   2

# now with metadata
table (d.abc, "DIALECT.L2")
#>      DIALECT.L2
#> L1→L2 north south std
#>   -_ə     0     2   0
#>   a_a     0     2   2
#>   a_o     1     0   0
#>   a_u     1     0   0
#>   b_b     1     2   2
#>   b_w     1     0   0
#>   c_c     2     2   2

# in the internal mode,
#    the relative values are with regard to segment-to-segment blocks
tab <- table (d.abc, count="r")
rows.a <- which (rownames(tab) %hasPrefix% "a")
cols.b <- which (colnames(tab) %hasPrefix% "b")
sum (tab [rows.a, cols.b])
#> [1] 1

# there are four different segments in L1, so the table consists
#    of 4 × 4 blocks, each summing up to 1
sum (tab)
#> [1] 16

# if two correspondences never co-occur, the relative value is 0/0
#    which R represents as ‘NaN’, and prints as empty space
table (d.abc, direction=2, count="r")
#>      L2←L1
#> L2←L1 a_a b_b c_c o_a u_a w_b ə_-
#>   a_a   1   1   1               1
#>   b_b   1   1   1   1           1
#>   c_c   1   1   1   1   1   1   1
#>   o_a       1   1   1            
#>   u_a           1       1   1    
#>   w_b           1       1   1    
#>   ə_-   1   1   1               1

# in the external mode,
#    the relative values are with regard to blocks of rows, and all columns
tab <- table (d.abc, "DIALECT.L2", count="r")
rows.a <- which (rownames(tab) %hasPrefix% "a")
sum (tab [rows.a, ])
#> [1] 1

allTables()

‘allTables()’ splits a table produced by ‘table()’ into blocks, each containing the correspondences of one segment. Its primary purpose is to facilitate the application of tests of independence, for which see ‘lapplyTest()’ below.

‘allTables()’ takes all the same arguments as ‘table()’: ‘column’, ‘unit’, ‘count’, and ‘direction’. In addition, it takes the argument ‘bin’ which determines whether the table should be just cut up, or whether all the resulting slices should also be binned.

The return value of ‘allTables()’ is a list which holds all the resulting tables, under names composed from the correspondences, connected with underscores. If ‘column = NULL’ and ‘bin = F’, the names are simply ‘a’, ‘b’, &c.; with ‘bin = T’, they take the form ‘a_b_c_d’, meaning L1 ‘a’ : L2 ‘b’ cross-tabulated with L1 ‘c’ : L2 ‘d’ (or the inverse, if ‘direction = 2’). If ‘column’ is not ‘NULL’, the names take the form ‘a_b_northern’, meaning L1 ‘a’ : L2 ‘b’ tabulated against the ‘northern’ dialect, and so forth.

# for a small dataset, the result is going to be small
str (allTables(d.abc), max.level=0)
#> List of 34

# but it can grow quite quickly with a larger dataset
str (allTables(d.cap), max.level=0)
#> List of 2882

# the naming scheme
names (allTables(d.abc))
#>  [1] "-_ə_a_a" "-_ə_a_o" "-_ə_a_u" "-_ə_b_b" "-_ə_b_w" "-_ə_c_c" "a_a_-_ə"
#>  [8] "a_a_b_b" "a_a_b_w" "a_a_c_c" "a_o_-_ə" "a_o_b_b" "a_o_b_w" "a_o_c_c"
#> [15] "a_u_-_ə" "a_u_b_b" "a_u_b_w" "a_u_c_c" "b_b_-_ə" "b_b_a_a" "b_b_a_o"
#> [22] "b_b_a_u" "b_b_c_c" "b_w_-_ə" "b_w_a_a" "b_w_a_o" "b_w_a_u" "b_w_c_c"
#> [29] "c_c_-_ə" "c_c_a_a" "c_c_a_o" "c_c_a_u" "c_c_b_b" "c_c_b_w"

# and with ‘column’ not ‘NULL’
names (allTables(d.abc,column="DIALECT.L2"))
#>  [1] "-_ə_north" "-_ə_south" "-_ə_std"   "a_a_north" "a_a_south" "a_a_std"  
#>  [7] "a_o_north" "a_o_south" "a_o_std"   "a_u_north" "a_u_south" "a_u_std"  
#> [13] "b_b_north" "b_b_south" "b_b_std"   "b_w_north" "b_w_south" "b_w_std"  
#> [19] "c_c_north" "c_c_south" "c_c_std"

Fits

Two ‘soundcorrs’ functions help automate fitting models to data: the simpler ‘multiFit()’ and the slightly more complex ‘fitTable()’.

multiFit()

‘multiFit()’ fits multiple models to a single dataset. It takes as arguments the dataset, as well as a list of models, in which each element is a list that contains two named fields: ‘formula’ and ‘start’. The latter is a list of lists of starting estimates for the parameters of the model, tried one after another in case the previous ones fail to produce a fit. The user can specify the fitting function, as well as pass additional arguments to it.

The return value of ‘multiFit()’ is a list containing the outputs of the fitting function. Warnings and errors, which are suppressed by ‘multiFit()’, are attached to the individual elements of the output as attributes. Technically, the result is of class ‘list.multiFit’, so that it can be passed to ‘summary()’ to produce a table for easier comparison of the fits. The available metrics are ‘aic’, ‘bic’, ‘rss’ (the default), and ‘sigma’. In addition, the output of ‘multiFit()’ has an attribute ‘depth’; it is intended for ‘summary()’, and should not be changed by the user.
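
A minimal sketch of its use, with made-up data; the model names, formulae, and starting estimates are all arbitrary here:

# made-up data, with a little noise so that ‘nls()’ does not
#    complain about a zero-residual fit
dataset <- data.frame (X=1:10, Y=(1:10)^2+runif(10))

# two competing models
models <- list (
	"linear" = list (formula=Y~a*X+b, start=list(list(a=1,b=0))),
	"power" = list (formula=Y~a*X^b, start=list(list(a=1,b=2))))

# fit both models and compare the results
fits <- multiFit (models, dataset)
summary (fits)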

fitTable()

‘fitTable()’ applies ‘multiFit()’ over a table, such as the ones produced by ‘table()’ or ‘summary()’. The arguments are: the models, the table, the margin (as in ‘apply()’: 1 for rows, 2 for columns), the converter function, and additional arguments passed to ‘multiFit()’ (including the fitting function). The converter is a function that turns individual rows or columns of the table into data frames to which models can be fitted. ‘soundcorrs’ provides three simple converters: ‘vec2df.id()’ (the default), ‘vec2df.hist()’, and ‘vec2df.rank()’. The first one merely attaches a sequence of ‘X’ values, the second one extracts the midpoints and counts from a histogram, and the third one ranks the data. Any function can be used, so long as it takes a numeric vector as its only argument and returns a data frame. The data frames returned by these three converters have columns named ‘X’ and ‘Y’, something to be borne in mind when defining the formulae of the models.

As with ‘multiFit()’, the return value of ‘fitTable()’ is a list of the outputs of the fitting function, only in the case of ‘fitTable()’ it is nested. It, too, can be passed to ‘summary()’ to produce a convenient table.
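
A sketch, reusing the ‘models’ defined in the previous example and fitting them to every row of a contingency table, converted to ranks (output omitted):

# fit the models to each row of the table of correspondences
tab <- summary (d.abc)
fits.tab <- fitTable (models, tab, 1, vec2df.rank)
summary (fits.tab)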

N-grams

Only one function in ‘soundcorrs’ is dedicated to n-grams. Its name is quite simply ‘ngrams()’, and it produces a table with absolute counts.

ngrams()

Unlike most functions discussed here, ‘ngrams()’ operates on ‘scOne’ objects rather than on ‘soundcorrs’ ones. The other three arguments are ‘n’, the length of the n-grams to extract (defaults to ‘1’); ‘zeros’, which determines whether to include linguistic zeros (defaults to ‘TRUE’); and ‘as.table’, which makes ‘ngrams()’ return the result either as a table (the default) or as a list. The list format is useful for cross-tabulating n-grams from two different languages; it just needs to be remembered that for this, ‘zeros’ needs to be set to ‘TRUE’ to ensure that both datasets have matching numbers of segments.
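
A brief illustration (output omitted):

# unigrams of L1, as a table of counts
ngrams (d.abc.l1)

# bigrams of L1, as a list
ngrams (d.abc.l1, n=2, as.table=FALSE)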

Pairs

‘soundcorrs’ has two functions to look for specific pairs. ‘findPairs()’ searches for pairs which exhibit the given correspondence, and ‘allPairs()’ produces an almost print-ready summary of the dataset, complete with tables and all the examples.

findPairs()

‘findPairs()’ searches a dataset of exactly two languages for pairs which exhibit a specific sound correspondence. It has five arguments: ‘data’, the dataset; ‘x’, the string to look for in the first word in each pair; ‘y’, the string to look for in the corresponding place in the second word; ‘exact’, which invokes one of the two sifting modes (below); and ‘cols’, which controls the output.

Both ‘x’ and ‘y’ can be regular expressions, and this includes custom metacharacters defined in the transcription. They can also be empty strings, which ‘findPairs()’ understands as a permission to accept anything.

The two sifting modes mentioned above are the exact mode, and the inexact mode. In the exact mode, a pair is only considered a match if ‘x’ and ‘y’ are found in the same segment (for example, both in the second segment of their respective words), and if both are the entire segments (a segment may span multiple characters, and if ‘x’ or ‘y’ are only e.g. the last of those characters, such a pair will be ignored). The inexact mode allows for an offset of one segment between the matches, and does not require that either ‘x’ or ‘y’ be entire segments. In addition, the inexact mode entirely ignores linguistic zeros which the exact mode treats like any other character.

# the difference between the two sifting modes

#    “ab” spans segments 1–2, while “a” only occupies segment 1
findPairs (d.abc, "ab", "a", exact=T)
#> No matches found.
findPairs (d.abc, "ab", "a", exact=F)
#>   ALIGNED.L1 ALIGNED.L2
#> 1      a|b|c      a|b|c
#> 2    a|b|a|c    a|b|a|c
#> 5    a|b|c|-    a|b|c|ə
#> 6  a|b|a|c|-  a|b|a|c|ə

#    the inexact mode ignores linguistic zeros,
#    which the exact mode treats like any other character
findPairs (d.abc, "-", "", exact=T)
#>   ALIGNED.L1 ALIGNED.L2
#> 5    a|b|c|-    a|b|c|ə
#> 6  a|b|a|c|-  a|b|a|c|ə
findPairs (d.abc, "-", "", exact=F)
#> No matches found.

# ‘findPairs()’ accepts the usual and the custom regular expressions
findPairs (d.abc, "a", "o|u")
#>   ALIGNED.L1 ALIGNED.L2
#> 3      a|b|c      o|b|c
#> 4    a|b|a|c    u|w|u|c
findPairs (d.abc, "a", "O")
#>   ALIGNED.L1 ALIGNED.L2
#> 3      a|b|c      o|b|c
#> 4    a|b|a|c    u|w|u|c

# the output is actually a list
str (findPairs(d.abc,"a","a"), max.level=1)
#> List of 3
#>  $ data :'data.frame':   4 obs. of  2 variables:
#>  $ found:'data.frame':   6 obs. of  9 variables:
#>  $ which: logi [1:6] TRUE TRUE FALSE FALSE TRUE TRUE
#>  - attr(*, "class")= chr "df.findPairs"

# ‘data’ is what is displayed on the screen
# ‘found’ is a data.frame with the exact positions
# ‘which’ is useful for subsetting
subset (d.abc, findPairs(d.abc,"a","O")$which)
#> A "soundcorrs" object.
#>   Languages: (2): L1, L2.
#>   Entries: 2.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.

# the ‘cols’ argument can be used to alter the printed output
findPairs (d.abc, "a", "O", cols=c("ORTHOGRAPHY.L1","ORTHOGRAPHY.L2"))
#>   ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> 3            abc            åbc
#> 4           abac           uwuc

allPairs()

‘allPairs()’ does not have great analytic value in itself, but it can be useful when writing a paper, e.g. on the phonetic adaptation of loanwords, as a way of preparing its material part.

The output of ‘allPairs()’ consists of sections devoted to each segment, filled with a general contingency table of its various renderings, and followed by subsections which list all pairs exhibiting the given correspondence. ‘soundcorrs’ provides functions to format such output in HTML or in LaTeX, or not at all. Custom formatters are also not very difficult to write.

The correspondences can be shown in one of two directions (the argument ‘direction’), and tables can show the number of occurrences or the number of words in which the given correspondence manifests itself (‘unit’), in absolute or in relative terms (‘count’; all three with values as with ‘summary()’). Which columns are printed can be modified with ‘cols’, and whether to write to a file or to the screen, with ‘file’ (‘NULL’ meaning the screen). Lastly, the formatting is controlled by a special function, of which ‘soundcorrs’ provides three: ‘formatter.none()’, ‘formatter.html()’, and ‘formatter.latex()’. A custom formatter can also take additional arguments, which will be passed to it from the call to ‘allPairs()’.

As was mentioned, the “capitals” dataset is linguistically absurd, and so it should not matter that all the Polish names of European capitals are listed as borrowed from German. If, however, one wished to fix this problem, and to do it not by copying the output to a word processor and replacing “>” with “:” there, but rather inside ‘soundcorrs’, this wish can be fulfilled easily enough. First, the existing ‘formatter.html()’ function needs to be written to a file to serve as the base for the new formatter: ‘dput(formatter.html, “~/Desktop/myFormatter.R”)’. Then, the beginning of the first line of this file needs to be changed to something like ‘myFormatter <- function’…, and finally, the “>” and “<” signs (written in HTML as ‘&gt;’ and ‘&lt;’, respectively) need to be replaced with a colon. All that is then left is to load the new function into R and use it to format the output of ‘allPairs()’:
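
A sketch of this final step, assuming the path used above, and that the formatting function is passed via the ‘formatter’ argument:

# load the edited formatter…
source ("~/Desktop/myFormatter.R")

# … and use it on the “capitals” dataset
allPairs (d.cap, file="~/Desktop/capitals.html", formatter=myFormatter)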

Segments

This subsection covers only one function: ‘findSegments()’ which, as the name implies, finds specific segments – in relation to segments exhibiting a specific sound correspondence.

findSegments()

‘findSegments()’ begins its operation by running ‘findPairs()’ to find which pairs realize the given sound correspondence. Then, it extracts from them the segment which lies at the specified distance from the segment which realizes this correspondence. For example, if we looked for the correspondence L1 a : L2 e in the pair L1 bac : L2 bec, the segment realizing the correspondence would be the second one, and ‘findSegments()’ could be used to extract the b’s or the c’s.

Like ‘findPairs()’, it takes the arguments ‘data’, ‘x’, and ‘y’ – and, in addition, the argument ‘segment’ which, in the little example above, would define whether to extract the b’s (‘segment = -1’) or the c’s (‘segment = +1’).

The result is a list of two vectors, one for each of the two languages represented in the dataset. Both vectors have as many elements as the dataset has pairs, which makes them easy to attach to it. Places occupied by pairs which do not realize the given correspondence are filled with ‘NA’s, as are places occupied by words which do not have the desired segment.

# in the ‘d.abc’ dataset, only one word exhibits L1 a : L2 o
ao <- findPairs (d.abc, "a", "o")

# it is the third one
ao$which
#> [1] FALSE FALSE  TRUE FALSE FALSE FALSE

# and it has three segments, of which the first is the one we are looking for
ao
#>   ALIGNED.L1 ALIGNED.L2
#> 3      a|b|c      o|b|c

# hence
findSegments (d.abc, "a", "o", segment=0)
#> $L1
#> [1] NA  NA  "a" NA  NA  NA 
#> 
#> $L2
#> [1] NA  NA  "o" NA  NA  NA

# and
findSegments (d.abc, "a", "o", segment=2)
#> $L1
#> [1] NA  NA  "c" NA  NA  NA 
#> 
#> $L2
#> [1] NA  NA  "c" NA  NA  NA

# but
findSegments (d.abc, "a", "o", segment=-1)
#> $L1
#> [1] NA NA NA NA NA NA
#> 
#> $L2
#> [1] NA NA NA NA NA NA

# the output of ‘findSegments()’ can be turned into phonetic values
segms <- findSegments (d.abc, "b", "b", segment=1)
phon <- char2value (d.abc, "L1", segms$L1)
phon
#> [1] "cons,affr,apic,vl"       "vow,low,back,nrnd,short"
#> [3] "cons,affr,apic,vl"       NA                       
#> [5] "cons,affr,apic,vl"       "vow,low,back,nrnd,short"

# a table for manual inspection
mapply (function(l,s) char2value(d.abc,l,s), d.abc$names, segms)
#>      L1                        L2                       
#> [1,] "cons,affr,apic,vl"       "cons,affr,apic,vl"      
#> [2,] "vow,low,back,nrnd,short" "vow,low,back,nrnd,short"
#> [3,] "cons,affr,apic,vl"       "cons,affr,apic,vl"      
#> [4,] NA                        NA                       
#> [5,] "cons,affr,apic,vl"       "cons,affr,apic,vl"      
#> [6,] "vow,low,back,nrnd,short" "vow,low,back,nrnd,short"

# this result can then be further processed…
phon <- unlist (lapply (phon, function(i) grepl("cons",i)))

# … attached to a dataset
d.abc.new <- cbind (d.abc, BEFORE.CONSONANT=phon)

# … and analysed
table (d.abc.new, "BEFORE.CONSONANT")
#>      BEFORE.CONSONANT
#> L1→L2 FALSE TRUE
#>   -_ə     1    1
#>   a_a     2    2
#>   a_o     0    1
#>   a_u     1    0
#>   b_b     2    3
#>   b_w     1    0
#>   c_c     3    3

# sadly, the procedure becomes more complicated if a correspondence
#    occurs more than once in a single word
findSegments (d.abc, "a", "a", segment=1)
#> $L1
#> [1] "b"   "c,b" NA    NA    "b"   "b,c"
#> 
#> $L2
#> [1] "b"   "c,b" NA    NA    "b"   "b,c"

Helper functions

In addition to the analytic functions, ‘soundcorrs’ also exports several helpers. Let us now briefly discuss those, this time simply in alphabetical order.

addSeparators()

As was mentioned above, automatic segmentation and alignment require careful supervision, and it may in the end prove easier to do them by hand. ‘addSeparators()’ can facilitate the first half of this task by interspersing a vector of character strings with a separator.
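
A brief illustration:

# intersperse each string with “|”; alignment still needs to be done by hand
addSeparators (c("abc","abac"), "|")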

binTable()

It may sometimes happen that the data are insufficient for a test of independence, or that the contingency table is too diversified to draw concrete conclusions from it. ‘binTable()’ takes as arguments one or more rows and one or more columns, and leaves those rows and columns unchanged while summing up all the others.
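
A sketch, assuming the arguments are, in order: the table, the rows to keep, and the columns to keep:

# sum up everything except the first row and the first two columns
tab <- table (d.abc)
binTable (tab, 1, c(1,2))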

expandMeta()

Metacharacters defined in the transcription (“wildcards”) can be used inside a ‘findPairs()’ query, but to use them with ‘grep()’ or any other function, they first need to be translated into regular expressions that vanilla R can understand. ‘expandMeta()’ is a little function that does just that.
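
A sketch, assuming the argument order (transcription, string); “O” is a metacharacter defined in the sample transcription:

# translate “O” into a plain regular expression…
rex <- expandMeta (d.abc$trans[[1]], "O")

# … and use it with a base R function
grep (rex, c("bon","ban"))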

lapplyTest()

‘lapplyTest()’ is a variant of ‘base::lapply()’ specifically adjusted for the application of tests of independence. The main difference lies in the handling of warnings and errors.

This function takes a list of contingency tables, such as the ones generated by ‘allTables()’ above, and applies to each of its elements the function given in ‘fun’. By default, this is ‘chisq.test()’, but any other test can be used, so long as its output contains an element named ‘p.value’. The result is a list of the outputs of ‘fun’, with a warning or an error attached to each element as an attribute, if any were produced. Additional arguments to ‘fun’ can also be passed in a call to ‘lapplyTest()’.

Technically, the output is of class ‘list.lapplyTest’. It can be passed to ‘summary()’ to sift through the results and only print the ones with the p-value below the specified threshold (the default is 0.05). Those tests which produced a warning are prefixed with an exclamation mark.
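
A brief illustration (output omitted):

# apply ‘chisq.test()’ to all the tables, and print only the significant results
tests <- lapplyTest (allTables(d.abc))
summary (tests)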

long2wide()

‘long2wide()’ and ‘wide2long()’ are used to convert data frames between the “long format” and the “wide format” (see above). Of the two, ‘long2wide()’ is particularly useful because the “long format” tends to be easier for humans to segment and align in, and is therefore preferable for storing data, while the “wide format” is used internally and required by ‘soundcorrs’.

During the conversion, the number of columns is almost doubled (while the number of rows is halved), but because it is unwise to have duplicate column names, the names are given suffixes, which are taken from the values in the column ‘LANGUAGE’. The name of the column used for this purpose can be changed with the ‘col.lang’ argument.

Some of the attributes pertain to only one word in a pair, or to the pair as a whole. In the “long format” these have to be repeated, but in the “wide format” this is not necessary; ‘long2wide()’ therefore allows certain columns to be excluded from the conversion, using the ‘skip’ argument.

wide2long()

‘wide2long()’ is simply the inverse of ‘long2wide()’. The conversion may not be perfect, as the order of the columns may change.

In ‘long2wide()’, suffixes were taken from the values in the ‘LANGUAGE’ column; this time, they must be specified explicitly. They will be stored in a column defined by the argument ‘col.lang’, which defaults to ‘LANGUAGE’. However, the string that separates column names from suffixes will not be removed by default; to strip it, the argument ‘strip’ needs to be set to the length of the separator.
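
A sketch, reversing the conversion of the “abc” dataset performed above (the suffixes are those produced by ‘long2wide()’ there; ‘strip=1’ removes the dot that separates them from the column names):

# back from the “wide format” to the “long format”
wide2long (tmp, c(".L1",".L2"), strip=1)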

Contact, citation

If you found a bug, have a remark to make about ‘soundcorrs’, or wishes for its future releases, please write to .

If you use ‘soundcorrs’ in your research, please cite it as Stachowski K. [forthcoming]. soundcorrs: Tools for Semi-Automatic Analysis of Sound Correspondences.