soundcorrs: Semi-Automatic Analysis of Sound Correspondences

Kamil Stachowski

2020-04-24

‘soundcorrs’ is a small package whose purpose in life is to help linguists analyse sound correspondences between languages. It does not attempt to draw any conclusions on its own; this responsibility is placed entirely on the user. ‘soundcorrs’ merely automates and facilitates certain tasks, such as preparing the material part of a paper or looking for examples of specific correspondences, and, by making various functions available, it suggests possible paths of analysis which may not be immediately obvious to the more traditional linguist.

This vignette assumes that the reader not only is a linguist and has at least a general idea about what kind of outputs he or she might want from ‘soundcorrs’, but also has at least a passing familiarity with R and a basic understanding of statistics. Most problems can probably be read up on as they appear in the text, but it is nevertheless recommended to start by very briefly acquainting oneself with R, for example by reading the first page of Quick-R, R Tutorial, or another R primer. In particular, it is assumed that the reader will know how to access and understand the built-in documentation, as not all arguments are discussed here.

A less technical introduction to ‘soundcorrs’ is also available in Stachowski K. [forthcoming]. soundcorrs: Tools for Semi-Automatic Analysis of Sound Correspondences. If you use ‘soundcorrs’ in your research, please cite this paper.

The first section of this vignette briefly discusses how to prepare data for ‘soundcorrs’. The second section is an overview of all the analytic functions exported by ‘soundcorrs’, organized by their output, and of the helper functions, in alphabetical order.

As of version 0.1.1, most ‘soundcorrs’ functions operate on pairs of words, coming from two different languages. (Technically, “language” can of course be anything, so long as it clearly determines which word is the first in the pair, and which one the second.) The discussion below will use ‘L1’ to refer to the first language, and ‘L2’ to refer to the second.

Naturally, all the examples given below assume that ‘soundcorrs’ is installed and loaded:

# install.packages ("soundcorrs")
library (soundcorrs)
#> 
#> Attaching package: 'soundcorrs'
#> The following object is masked from 'package:base':
#> 
#>     table

Data preparation

‘soundcorrs’ requires two kinds of data: transcription and word pairs/triples/…. Both are stored in tsv files, i.e. as tab-separated tables in text files.

Under BSD, Linux, and macOS, the recommended encoding is UTF-8. Unfortunately, it has been found to cause problems under Windows, so Windows users are advised not to use characters outside of the ASCII standard. Some issues can be fixed by converting from UTF-8 to UTF-8 (sic!) with ‘iconv()’, but others resist this and other treatments. Future versions of ‘soundcorrs’ hope to include a solution for this problem.

Transcription

Transcription is not strictly necessary for the functioning of ‘soundcorrs’, but without it linguistic regular expressions (“wildcards”) could not be defined, and involving phonetics in the analysis would be more difficult. Transcription is stored in tsv files with two or three columns: ‘GRAPHEME’, the character or characters used in the transcription; ‘VALUE’, the phonetic value or values of the grapheme; and, optionally, ‘META’, the regular expression a metacharacter translates to (when this column is missing, it is generated automatically; cf. the warning in the “ie” example below).

‘soundcorrs’ contains two sample transcription files: ‘trans-common.tsv’ and ‘trans-ipa.tsv’. Both only cover the basics and are intended more as an illustration than anything else. To load one of them:
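
# establish the paths of the two sample transcriptions
path.trans.com <- system.file ("extdata", "trans-common.tsv", package="soundcorrs")
path.trans.ipa <- system.file ("extdata", "trans-ipa.tsv", package="soundcorrs")

# load one of them; both objects are used in the examples below
trans.com <- read.transcription (path.trans.com)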

Data

Like the transcription, the data are also stored in tsv files. Two formats are theoretically possible: the “long format” in which every word is given its own row, and the “wide format” in which one row holds a pair/triple/… of words (see below).

Words ought to be segmented, and all words in a pair/triple/… must have the same number of segments. The default segment separator is ‘|’. If the words are not segmented, the function ‘addSeparators()’ can be used to facilitate the process of manual segmentation and alignment (see below). Tools for automatic alignment also exist (e.g. alineR, LingPy, PyAline), but it is recommended that their results be thoroughly checked by a human. Apart from the segmented and aligned form, each word must be assigned a language.

Hence, the two obligatory columns in the “long format” are ‘LANGUAGE’, which specifies which language a word belongs to, and ‘ALIGNED’, which holds its segmented and aligned form.

In the “wide format”, similarly, a minimum of two columns is necessary, each holding words from a different language. The information about which column holds which language can then be encoded simply as their names (e.g. ‘LATIN’), or in the form of a suffix attached to the names of columns (e.g. ‘ALIGNED.Latin’).

Regarding the two formats, see also ‘long2wide()’ and ‘wide2long()’ below.

It is possible, though not necessarily recommended, to store data from each language in a separate file; it is also possible to use a different transcription for each language. This flexibility can easily lead to a somewhat cumbersome string of arguments for the reader function, so instead a helper ‘scOne’ class is used to read each language individually before merging them into a ‘soundcorrs’ object. It only accepts data in the “wide format”.

‘soundcorrs’ has three sample datasets:

1. the entirely made-up ‘data-abc.tsv’;
2. ‘data-capitals.tsv’, which contains the names of EU capitals in German, Polish, and Spanish – from the linguistic point of view, this of course makes no sense; it is merely an example that will hopefully not be seen as too exotic regardless of which language or languages the user specializes in (my gratitude is due to José Andrés Alonso de la Fuente, PhD (Cracow, Poland) for help with the Spanish data); and
3. ‘data-ie.tsv’, with a dozen examples of Grimm’s and Verner’s laws (adapted from Campbell L. 2013. Historical Linguistics. An Introduction. Edinburgh University Press. Pp. 136f).

The ‘abc’ dataset is in the “long format”; the ‘capitals’ and ‘ie’ datasets are in the “wide format”. All three are also available as the preloaded datasets ‘sampleSoundCorrsData.abc’, ‘sampleSoundCorrsData.capitals’, and ‘sampleSoundCorrsData.ie’.

# establish the paths of the three datasets
path.abc <- system.file ("extdata", "data-abc.tsv", package="soundcorrs")
path.cap <- system.file ("extdata", "data-capitals.tsv", package="soundcorrs")
path.ie <- system.file ("extdata", "data-ie.tsv", package="soundcorrs")

# read “capitals”
d.cap.ger <- read.scOne (path.cap, "German", "ALIGNED.German", path.trans.com)
#> Warning in scOne(data, name, col.aligned, read.transcription(transcription), :
#> The following segments are not covered by the transcription: jus, ŋk.
d.cap.pol <- read.scOne (path.cap, "Polish", "ALIGNED.Polish", path.trans.com)
#> Warning in scOne(data, name, col.aligned, read.transcription(transcription), :
#> The following segments are not covered by the transcription: ń, ẃ.
d.cap <- soundcorrs (d.cap.ger, d.cap.pol)

# read “ie”
d.ie.lat <- read.scOne (path.ie, "Lat", "LATIN", path.trans.com)
d.ie.eng <- read.scOne (path.ie, "Eng", "ENGLISH", path.trans.ipa)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Missing the
#> metacharacters column. The "META" column was generated.
#> Warning in scOne(data, name, col.aligned, read.transcription(transcription), :
#> The following segments are not covered by the transcription: eɪ, ɪ, aʊ, uː, ɑː,
#> ʊ, iː.
d.ie <- soundcorrs (d.ie.lat, d.ie.eng)

# read “abc”
tmp <- long2wide (read.table(path.abc,header=T), skip=c("ID"))
d.abc.l1 <- scOne (tmp, "L1", "ALIGNED.L1", trans.com)
d.abc.l2 <- scOne (tmp, "L2", "ALIGNED.L2", trans.com)
d.abc <- soundcorrs (d.abc.l1, d.abc.l2)

# individual languages are objects of class ‘scOne’
class (d.abc.l1)
#> [1] "scOne"

# some basic summary
d.abc.l1
#> A "scOne" object.
#>   Language: L1.
#>   Entries: 6.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.
#>   Transcription: /tmp/Rtmp8ER37j/Rinst16af2cb0d881/soundcorrs/extdata/trans-common.tsv.

# ‘cols’ are names of the important columns
# ‘data’ is the original data frame
# ‘name’ is the name of the language
# ‘segms’ are words exploded into segments; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
# ‘segpos’ is a lookup list to check which character belongs to which segment; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
# ‘separator’ is the string used as segment separator
# ‘trans’ is a ‘transcription’ object
# ‘words’ are words obtained by removing separators from the ‘col.aligned’ column; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
str (d.abc.l1, max.level=1)
#> List of 8
#>  $ cols     :List of 1
#>  $ data     :'data.frame':   6 obs. of  7 variables:
#>  $ name     : chr "L1"
#>  $ segms    :List of 2
#>  $ segpos   :List of 2
#>  $ separator: chr "\\|"
#>  $ trans    :List of 3
#>   ..- attr(*, "class")= chr "transcription"
#>   ..- attr(*, "file")= chr "/tmp/Rtmp8ER37j/Rinst16af2cb0d881/soundcorrs/extdata/trans-common.tsv"
#>  $ words    :List of 2
#>  - attr(*, "class")= chr "scOne"

# datasets are objects of class ‘soundcorrs’
class (d.abc)
#> [1] "soundcorrs"

# some basic summary
d.abc
#> A "soundcorrs" object.
#>   Languages: (2): L1, L2.
#>   Entries: 6.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.

# ‘data’ is the original data frame
# ‘cols’ are the same as with ‘scOne’ above, wrapped in a list
# ‘names’ are the names of the languages
# ‘segms’ are the same as with ‘scOne’ above, wrapped in a list
# ‘segpos’ are likewise
# ‘separators’ are likewise, only a vector instead of a list
# ‘trans’ are the individual transcriptions wrapped in a list
# ‘words’ are the same as with ‘scOne’ above, wrapped in a list
str (d.abc, max.level=1)
#> List of 8
#>  $ cols      :List of 2
#>  $ data      :'data.frame':  6 obs. of  7 variables:
#>  $ names     : chr [1:2] "L1" "L2"
#>  $ segms     :List of 2
#>  $ segpos    :List of 2
#>  $ separators: chr [1:2] "\\|" "\\|"
#>  $ trans     :List of 2
#>  $ words     :List of 2
#>  - attr(*, "class")= chr "soundcorrs"

Functions

‘soundcorrs’ exports several functions intended for linguistic analysis. For easier orientation, they are organized below by the kind of output they produce, rather than by their names. ‘soundcorrs’ also exports several functions whose use for linguistic analysis, in and of themselves, is rather limited. Those are grouped in one subsection at the end, and discussed in alphabetical order.

Contingency tables

There are three different functions in ‘soundcorrs’ that produce contingency tables. This may seem like poor design, but there is a logic behind it: ‘summary()’ is only meant to give a general overview of the dataset; ‘table()’ is the essential contingency table function; and ‘allTables()’ produces an output that is meant to be printed rather than read from the screen.

summary()

‘summary()’ produces a segment-to-segment contingency table. The values may represent how many times the two segments co-occur (‘unit=“o”’) or in how many words they co-occur (‘unit=“w”’); this distinction exists because a segment may well appear more than once in a single word. The argument ‘unit’ accepts nine different values: ‘“o(cc(ur(ence(s))))”’ and ‘“w(or(d(s)))”’. By default, L1 segments are in rows and L2 segments in columns; this corresponds to the argument ‘direction’ being set to ‘1’, and can be swapped by setting it to ‘2’. The last argument that can be given to ‘summary()’ is ‘count’, which determines whether the values are given as absolute or as relative numbers. It accepts six values: ‘“a(bs(olute))”’ and ‘“r(el(ative))”’.
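
For example, to see in how many words each L1 segment corresponds to each L2 segment, as relative values (a minimal illustration; the output is omitted here):

# words rather than occurrences, relative rather than absolute values
summary (d.abc, unit="w", count="r")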

table()

When loading ‘soundcorrs’ into R, it warns about ‘table()’ being masked from ‘package:base’. This was necessary to allow ‘table()’ to produce tables from ‘soundcorrs’ objects. The functioning of ‘table()’ for all the other objects should not be affected.

In ‘soundcorrs’, ‘table()’ has two modes: internal and external comparison. The former, invoked when ‘column=NULL’ (the default), cross-tabulates correspondences with themselves. The latter cross-tabulates correspondences with metadata taken from a column in the dataset, whose name is given as the argument ‘column’. Like ‘summary()’ above, ‘table()’ has the arguments ‘unit’ and ‘direction’, which have the same meaning, as well as the argument ‘count’, which may appear to work a little differently; in actuality, its use with ‘summary()’ was a special case. The general idea is that the entire table is divided into blocks such that all rows represent correspondences of the same segment and, in the internal mode, so do all the columns.

# a general look in the internal mode
table (d.abc)
#>      L1→L2
#> L1→L2 -_ə a_a a_o a_u b_b b_w c_c
#>   -_ə   2   2   0   0   2   0   2
#>   a_a   2   4   0   0   4   0   4
#>   a_o   0   0   1   0   1   0   1
#>   a_u   0   0   0   1   0   1   1
#>   b_b   2   4   1   0   5   0   5
#>   b_w   0   0   0   1   0   1   1
#>   c_c   2   4   1   1   5   1   6

# … and in the other direction
table (d.abc, direction=2)
#>      L2←L1
#> L2←L1 a_a b_b c_c o_a u_a w_b ə_-
#>   a_a   4   4   4   0   0   0   2
#>   b_b   4   5   5   1   0   0   2
#>   c_c   4   5   6   1   1   1   2
#>   o_a   0   1   1   1   0   0   0
#>   u_a   0   0   1   0   1   1   0
#>   w_b   0   0   1   0   1   1   0
#>   ə_-   2   2   2   0   0   0   2

# now with metadata
table (d.abc, "DIALECT.L2")
#>      DIALECT.L2
#> L1→L2 north south std
#>   -_ə     0     2   0
#>   a_a     0     2   2
#>   a_o     1     0   0
#>   a_u     1     0   0
#>   b_b     1     2   2
#>   b_w     1     0   0
#>   c_c     2     2   2

# in the internal mode,
#    the relative values are with regard to segment-to-segment blocks
tab <- table (d.abc, count="r")
rows.a <- which (rownames(tab) %hasPrefix% "a")
cols.b <- which (colnames(tab) %hasPrefix% "b")
sum (tab [rows.a, cols.b])
#> [1] 1

# there are four different segments in L1, so the table consists
#    of 4 × 4 blocks, each summing up to 1
sum (tab)
#> [1] 16

# if two correspondences never co-occur, the relative value is 0/0
#    which R represents as ‘NaN’, and prints as empty space
table (d.abc, direction=2, count="r")
#>      L2←L1
#> L2←L1 a_a b_b c_c o_a u_a w_b ə_-
#>   a_a   1   1   1               1
#>   b_b   1   1   1   1           1
#>   c_c   1   1   1   1   1   1   1
#>   o_a       1   1   1            
#>   u_a           1       1   1    
#>   w_b           1       1   1    
#>   ə_-   1   1   1               1

# in the external mode,
#    the relative values are with regard to blocks of rows, and all columns
tab <- table (d.abc, "DIALECT.L2", count="r")
rows.a <- which (rownames(tab) %hasPrefix% "a")
sum (tab [rows.a, ])
#> [1] 1

allTables()

‘allTables()’ splits a table produced by ‘table()’ into blocks, each containing the correspondences of one segment. Its primary purpose is to facilitate the application of tests of independence, for which see ‘lapplyTest()’ below.

‘allTables()’ takes all the same arguments as ‘table()’: ‘column’, ‘unit’, ‘count’, and ‘direction’. In addition, it takes the argument ‘bin’ which determines whether the table should be just cut up, or whether all the resulting slices should also be binned.

The return value of ‘allTables()’ is a list which holds all the resulting tables, under names composed from the correspondences, connected with underscores. If ‘column = NULL’ and ‘bin = F’, the names are simply ‘a’, ‘b’, &c.; with ‘bin = T’, they take the form ‘a_b_c_d’, meaning L1 ‘a’ : L2 ‘b’ cross-tabulated with L1 ‘c’ : L2 ‘d’ (or the inverse, if ‘direction = 2’). If ‘column’ is not ‘NULL’, the names take the form ‘a_b_northern’, meaning L1 ‘a’ : L2 ‘b’ tabulated against the ‘northern’ dialect, and so forth.

# for a small dataset, the result is going to be small
str (allTables(d.abc), max.level=0)
#> List of 34

# but it can grow quite quickly with a larger dataset
str (allTables(d.cap), max.level=0)
#> List of 2882

# the naming scheme
names (allTables(d.abc))
#>  [1] "-_ə_a_a" "-_ə_a_o" "-_ə_a_u" "-_ə_b_b" "-_ə_b_w" "-_ə_c_c" "a_a_-_ə"
#>  [8] "a_a_b_b" "a_a_b_w" "a_a_c_c" "a_o_-_ə" "a_o_b_b" "a_o_b_w" "a_o_c_c"
#> [15] "a_u_-_ə" "a_u_b_b" "a_u_b_w" "a_u_c_c" "b_b_-_ə" "b_b_a_a" "b_b_a_o"
#> [22] "b_b_a_u" "b_b_c_c" "b_w_-_ə" "b_w_a_a" "b_w_a_o" "b_w_a_u" "b_w_c_c"
#> [29] "c_c_-_ə" "c_c_a_a" "c_c_a_o" "c_c_a_u" "c_c_b_b" "c_c_b_w"

# and with ‘column’ not ‘NULL’
names (allTables(d.abc,column="DIALECT.L2"))
#>  [1] "-_ə_north" "-_ə_south" "-_ə_std"   "a_a_north" "a_a_south" "a_a_std"  
#>  [7] "a_o_north" "a_o_south" "a_o_std"   "a_u_north" "a_u_south" "a_u_std"  
#> [13] "b_b_north" "b_b_south" "b_b_std"   "b_w_north" "b_w_south" "b_w_std"  
#> [19] "c_c_north" "c_c_south" "c_c_std"

Fits

Two ‘soundcorrs’ functions help automate fitting models to data: the simpler ‘multiFit()’ and the slightly more complex ‘fitTable()’.

multiFit()

‘multiFit()’ fits multiple models to a single dataset. It takes as arguments the dataset, as well as a list of models, in which each element is a list that contains two named fields: ‘formula’ and ‘start’. The latter is a list of lists of starting estimates for the parameters of the model, tried one after another in case the previous ones fail to produce a fit. The user can specify the fitting function, as well as pass additional arguments to it.

The return value of ‘multiFit()’ is a list containing the outputs of the fitting function. Warnings and errors, which are suppressed by ‘multiFit()’, are attached to the individual elements of the output as attributes. Technically, the result is of class ‘list.multiFit’, so that it can be passed to ‘summary()’ to produce a table for easier comparison of the fits. The available metrics are ‘aic’, ‘bic’, ‘rss’ (the default), and ‘sigma’. In addition, the output of ‘multiFit()’ has an attribute ‘depth’; it is intended for ‘summary()’, and should not be changed by the user.
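
A minimal sketch of its use, with made-up data; the model names, formulae, and starting estimates are all arbitrary here:

# made-up data, with a little noise so that ‘nls()’ does not
#    complain about a zero-residual fit
dataset <- data.frame (X=1:10, Y=(1:10)^2+runif(10))

# two competing models
models <- list (
	"linear" = list (formula=Y~a*X+b, start=list(list(a=1,b=0))),
	"power" = list (formula=Y~a*X^b, start=list(list(a=1,b=2))))

# fit both models and compare the results
fits <- multiFit (models, dataset)
summary (fits)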

fitTable()

‘fitTable()’ applies ‘multiFit()’ over a table, such as the ones produced by ‘table()’ or ‘summary()’. The arguments are: the models, the table, the margin (as in ‘apply()’: 1 for rows, 2 for columns), the converter function, and additional arguments passed to ‘multiFit()’ (including the fitting function). The converter is a function that turns individual rows or columns of the table into data frames to which models can be fitted. ‘soundcorrs’ provides three simple converters: ‘vec2df.id()’ (the default), ‘vec2df.hist()’, and ‘vec2df.rank()’. The first one merely attaches a sequence of ‘X’ values, the second one extracts the midpoints and counts from a histogram, and the third one ranks the data. Any function can be used, so long as it takes a numeric vector as its only argument and returns a data frame. The data frames returned by these three converters have columns named ‘X’ and ‘Y’, something to be borne in mind when defining the formulae of the models.

As with ‘multiFit()’, the return value of ‘fitTable()’ is a list of the outputs of the fitting function, only in the case of ‘fitTable()’ it is nested. It, too, can be passed to ‘summary()’ to produce a convenient table.
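
A sketch, reusing the ‘models’ defined in the previous example and fitting them to every row of a contingency table, converted to ranks (output omitted):

# fit the models to each row of the table of correspondences
tab <- summary (d.abc)
fits.tab <- fitTable (models, tab, 1, vec2df.rank)
summary (fits.tab)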

N-grams

Only one function in ‘soundcorrs’ is dedicated to n-grams. Its name is quite simply ‘ngrams()’, and it produces a table with absolute counts.

ngrams()

Unlike most functions discussed here, ‘ngrams()’ operates on ‘scOne’ objects rather than on ‘soundcorrs’ ones. The other three arguments are ‘n’, the length of the n-grams to extract (defaults to ‘1’); ‘zeros’, which determines whether to include linguistic zeros (defaults to ‘TRUE’); and ‘as.table’, which makes ‘ngrams()’ return the result either as a table (the default) or as a list. The list format is useful for cross-tabulating n-grams from two different languages; it just needs to be remembered that for this, ‘zeros’ needs to be set to ‘TRUE’ to ensure that both datasets have matching numbers of segments.
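
A brief illustration (output omitted):

# unigrams of L1, as a table of counts
ngrams (d.abc.l1)

# bigrams of L1, as a list
ngrams (d.abc.l1, n=2, as.table=FALSE)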

Pairs

‘soundcorrs’ has two functions to look for specific pairs. ‘findPairs()’ searches for pairs which exhibit the given correspondence, and ‘allPairs()’ produces an almost print-ready summary of the dataset, complete with tables and all the examples.

findPairs()

‘findPairs()’ searches a dataset of exactly two languages for pairs which exhibit a specific sound correspondence. It has five arguments: ‘data’, the dataset; ‘x’, the string to look for in the first word in each pair; ‘y’, the string to look for in the corresponding place in the second word; ‘exact’, which invokes one of the two sifting modes (below); and ‘cols’, which controls the output.

Both ‘x’ and ‘y’ can be regular expressions, and this includes custom metacharacters defined in the transcription. They can also be empty strings, which ‘findPairs()’ understands as a permission to accept anything.

The two sifting modes mentioned above are the exact mode, and the inexact mode. In the exact mode, a pair is only considered a match if ‘x’ and ‘y’ are found in the same segment (for example, both in the second segment of their respective words), and if both are the entire segments (a segment may span multiple characters, and if ‘x’ or ‘y’ are only e.g. the last of those characters, such a pair will be ignored). The inexact mode allows for an offset of one segment between the matches, and does not require that either ‘x’ or ‘y’ be entire segments. In addition, the inexact mode entirely ignores linguistic zeros which the exact mode treats like any other character.

# the difference between the two sifting modes

#    “ab” spans segments 1–2, while “a” only occupies segment 1
findPairs (d.abc, "ab", "a", exact=T)
#> No matches found.
findPairs (d.abc, "ab", "a", exact=F)
#>   ALIGNED.L1 ALIGNED.L2
#> 1      a|b|c      a|b|c
#> 2    a|b|a|c    a|b|a|c
#> 5    a|b|c|-    a|b|c|ə
#> 6  a|b|a|c|-  a|b|a|c|ə

#    the inexact mode ignores linguistic zeros,
#    which the exact mode treats like any other character
findPairs (d.abc, "-", "", exact=T)
#>   ALIGNED.L1 ALIGNED.L2
#> 5    a|b|c|-    a|b|c|ə
#> 6  a|b|a|c|-  a|b|a|c|ə
findPairs (d.abc, "-", "", exact=F)
#> No matches found.

# ‘findPairs()’ accepts the usual and the custom regular expressions
findPairs (d.abc, "a", "o|u")
#>   ALIGNED.L1 ALIGNED.L2
#> 3      a|b|c      o|b|c
#> 4    a|b|a|c    u|w|u|c
findPairs (d.abc, "a", "O")
#>   ALIGNED.L1 ALIGNED.L2
#> 3      a|b|c      o|b|c
#> 4    a|b|a|c    u|w|u|c

# the output is actually a list
str (findPairs(d.abc,"a","a"), max.level=1)
#> List of 3
#>  $ data :'data.frame':   4 obs. of  2 variables:
#>  $ found:'data.frame':   6 obs. of  9 variables:
#>  $ which: logi [1:6] TRUE TRUE FALSE FALSE TRUE TRUE
#>  - attr(*, "class")= chr "df.findPairs"

# ‘data’ is what is displayed on the screen
# ‘found’ is a data.frame with the exact positions
# ‘which’ is useful for subsetting
subset (d.abc, findPairs(d.abc,"a","O")$which)
#> A "soundcorrs" object.
#>   Languages: (2): L1, L2.
#>   Entries: 2.
#>   Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.

# the ‘cols’ argument can be used to alter the printed output
findPairs (d.abc, "a", "O", cols=c("ORTHOGRAPHY.L1","ORTHOGRAPHY.L2"))
#>   ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> 3            abc            åbc
#> 4           abac           uwuc

allPairs()

‘allPairs()’ does not have great analytic value in itself, but it can be useful when writing a paper, e.g. on the phonetic adaptation of loanwords, as a way of preparing its material part.

The output of ‘allPairs()’ consists of sections devoted to each segment, filled with a general contingency table of its various renderings, and followed by subsections which list all pairs exhibiting the given correspondence. ‘soundcorrs’ provides functions to format such output in HTML or in LaTeX, or not at all. Custom formatters are also not very difficult to write.

The correspondences can be shown in one of two directions (the argument ‘direction’), and tables can show the number of occurrences or the number of words in which the given correspondence manifests itself (‘unit’), in absolute or in relative terms (‘count’; all three with values as with ‘summary()’). Which columns are printed can be modified with ‘cols’, and whether to write to a file or to the screen, with ‘file’ (‘NULL’ meaning the screen). Lastly, the formatting is controlled by a special function, of which ‘soundcorrs’ provides three: ‘formatter.none()’, ‘formatter.html()’, and ‘formatter.latex()’. A custom formatter can also take additional arguments, which will be passed to it from the call to ‘allPairs()’.

As was mentioned, the “capitals” dataset is linguistically absurd, and so it should not matter that all the Polish names of European capitals are listed as borrowed from German. If, however, one wished to fix this problem, and to do it not by copying the output to a word processor and replacing “>” with “:” there, but rather inside ‘soundcorrs’, this wish can be fulfilled easily enough. First, the existing ‘formatter.html()’ function needs to be written to a file to serve as the base for the new formatter: ‘dput(formatter.html, “~/Desktop/myFormatter.R”)’. Then, the beginning of the first line of this file needs to be changed to something like ‘myFormatter <- function’…, and finally, the “>” and “<” signs (written in HTML as ‘&gt;’ and ‘&lt;’, respectively) need to be replaced with a colon. All that is then left is to load the new function into R and use it to format the output of ‘allPairs()’:
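
A sketch of this final step, assuming the path used above, and that the formatting function is passed via the ‘formatter’ argument:

# load the edited formatter…
source ("~/Desktop/myFormatter.R")

# … and use it on the “capitals” dataset
allPairs (d.cap, file="~/Desktop/capitals.html", formatter=myFormatter)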

Segments

This subsection covers only one function: ‘findSegments()’ which, as the name implies, finds specific segments – in relation to segments exhibiting a specific sound correspondence.

findSegments()

‘findSegments()’ begins its operation by running ‘findPairs()’ to find which pairs realize the given sound correspondence. Then, it extracts from them the segment which lies at the specified distance from the segment which realizes this correspondence. For example, if we looked for the correspondence L1 a : L2 e in the pair L1 bac : L2 bec, the segment realizing the correspondence would be the second one, and ‘findSegments()’ could be used to extract the b’s or the c’s.

Like ‘findPairs()’, it takes the arguments ‘data’, ‘x’, and ‘y’ – and, in addition, the argument ‘segment’ which, in the little example above, would define whether to extract the b’s (‘segment = -1’) or the c’s (‘segment = +1’).

The result is a list of two vectors, one for each of the two languages represented in the dataset. Both vectors have as many elements as the dataset has pairs, which makes them easy to attach to it. Places occupied by pairs which do not realize the given correspondence are filled with ‘NA’s, as are places occupied by words which do not have the desired segment.

# in the ‘d.abc’ dataset, only one word exhibits L1 a : L2 o
ao <- findPairs (d.abc, "a", "o")

# it is the third one
ao$which
#> [1] FALSE FALSE  TRUE FALSE FALSE FALSE

# and it has three segments, of which the first is the one we are looking for
ao
#>   ALIGNED.L1 ALIGNED.L2
#> 3      a|b|c      o|b|c

# hence
findSegments (d.abc, "a", "o", segment=0)
#> $L1
#> [1] NA  NA  "a" NA  NA  NA 
#> 
#> $L2
#> [1] NA  NA  "o" NA  NA  NA

# and
findSegments (d.abc, "a", "o", segment=2)
#> $L1
#> [1] NA  NA  "c" NA  NA  NA 
#> 
#> $L2
#> [1] NA  NA  "c" NA  NA  NA

# but
findSegments (d.abc, "a", "o", segment=-1)
#> $L1
#> [1] NA NA NA NA NA NA
#> 
#> $L2
#> [1] NA NA NA NA NA NA

# the output of ‘findSegments()’ can be turned into phonetic values
segms <- findSegments (d.abc, "b", "b", segment=1)
phon <- char2value (d.abc, "L1", segms$L1)
phon
#> [1] "cons,affr,apic,vl"       "vow,low,back,nrnd,short"
#> [3] "cons,affr,apic,vl"       NA                       
#> [5] "cons,affr,apic,vl"       "vow,low,back,nrnd,short"

# a table for manual inspection
mapply (function(l,s) char2value(d.abc,l,s), d.abc$names, segms)
#>      L1                        L2                       
#> [1,] "cons,affr,apic,vl"       "cons,affr,apic,vl"      
#> [2,] "vow,low,back,nrnd,short" "vow,low,back,nrnd,short"
#> [3,] "cons,affr,apic,vl"       "cons,affr,apic,vl"      
#> [4,] NA                        NA                       
#> [5,] "cons,affr,apic,vl"       "cons,affr,apic,vl"      
#> [6,] "vow,low,back,nrnd,short" "vow,low,back,nrnd,short"

# this result can then be further processed…
phon <- unlist (lapply (phon, function(i) grepl("cons",i)))

# … attached to a dataset
d.abc.new <- cbind (d.abc, BEFORE.CONSONANT=phon)

# … and analysed
table (d.abc.new, "BEFORE.CONSONANT")
#>      BEFORE.CONSONANT
#> L1→L2 FALSE TRUE
#>   -_ə     1    1
#>   a_a     2    2
#>   a_o     0    1
#>   a_u     1    0
#>   b_b     2    3
#>   b_w     1    0
#>   c_c     3    3

# sadly, the procedure becomes more complicated if a correspondence
#    occurs more than once in a single word
findSegments (d.abc, "a", "a", segment=1)
#> $L1
#> [1] "b"   "c,b" NA    NA    "b"   "b,c"
#> 
#> $L2
#> [1] "b"   "c,b" NA    NA    "b"   "b,c"

Helper functions

In addition to the analytic functions, ‘soundcorrs’ also exports several helpers. Let us now briefly discuss those, this time simply in alphabetical order.

addSeparators()

As was mentioned above, automatic segmentation and alignment require careful supervision, and it may in the end prove easier to do them by hand. ‘addSeparators()’ can facilitate the first half of this task by interspersing a vector of character strings with a separator.
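
A brief illustration:

# intersperse each string with “|”; alignment still needs to be done by hand
addSeparators (c("abc","abac"), "|")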

binTable()

It may sometimes happen that the data are insufficient for a test of independence, or that the contingency table is too diversified to draw concrete conclusions from it. ‘binTable()’ takes as arguments one or more rows and one or more columns, and leaves those rows and columns unchanged while summing up all the others.
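
A sketch, assuming the arguments are, in order: the table, the rows to keep, and the columns to keep:

# sum up everything except the first row and the first two columns
tab <- table (d.abc)
binTable (tab, 1, c(1,2))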

expandMeta()

Metacharacters defined in the transcription (“wildcards”) can be used inside a ‘findPairs()’ query, but to use them with ‘grep()’ or any other function, they first need to be translated into regular expressions that vanilla R can understand. ‘expandMeta()’ is a little function that does just that.
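
A sketch, assuming the argument order (transcription, string); “O” is a metacharacter defined in the sample transcription:

# translate “O” into a plain regular expression…
rex <- expandMeta (d.abc$trans[[1]], "O")

# … and use it with a base R function
grep (rex, c("bon","ban"))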

lapplyTest()

‘lapplyTest()’ is a variant of ‘base::lapply()’ specifically adjusted for the application of tests of independence. The main difference lies in the handling of warnings and errors.

This function takes a list of contingency tables, such as the ones generated by ‘allTables()’ above, and applies to each of its elements the function given in ‘fun’. By default, this is ‘chisq.test()’, but any other test can be used, so long as its output contains an element named ‘p.value’. The result is a list of the outputs of ‘fun’, with a warning or an error attached to each element as an attribute, if any were produced. Additional arguments to ‘fun’ can also be passed in a call to ‘lapplyTest()’.

Technically, the output is of class ‘list.lapplyTest’. It can be passed to ‘summary()’ to sift through the results and only print the ones with the p-value below the specified threshold (the default is 0.05). Those tests which produced a warning are prefixed with an exclamation mark.
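
A brief illustration (output omitted):

# apply ‘chisq.test()’ to all the tables, and print only the significant results
tests <- lapplyTest (allTables(d.abc))
summary (tests)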

long2wide()

‘long2wide()’ and ‘wide2long()’ are used to convert data frames between the “long format” and the “wide format” (see above). Of the two, ‘long2wide()’ is particularly useful because the “long format” tends to be easier for humans to segment and align in, and is therefore preferable for storing data, while the “wide format” is used internally and required by ‘soundcorrs’.

During the conversion, the number of columns is almost doubled (while the number of rows is halved), but because it is unwise to have duplicate column names, the names are given suffixes, which are taken from the values in the column ‘LANGUAGE’. The name of the column used for this purpose can be changed with the ‘col.lang’ argument.

Some of the attributes pertain to only one word in a pair, or to the pair as a whole. In the “long format” these have to be repeated, but in the “wide format” this is not necessary; ‘long2wide()’ therefore allows certain columns to be excluded from the conversion, using the ‘skip’ argument.

wide2long()

‘wide2long()’ is simply the inverse of ‘long2wide()’. The conversion may not be perfect, as the order of the columns may change.

In ‘long2wide()’, suffixes were taken from the values in the ‘LANGUAGE’ column; this time, they must be specified explicitly. They will be stored in a column defined by the argument ‘col.lang’, which defaults to ‘LANGUAGE’. However, the string that separates column names from suffixes will not be removed by default; to strip it, the argument ‘strip’ needs to be set to the length of the separator.
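
A sketch, reversing the conversion of the “abc” dataset performed above (the suffixes are those produced by ‘long2wide()’ there; ‘strip=1’ removes the dot that separates them from the column names):

# back from the “wide format” to the “long format”
wide2long (tmp, c(".L1",".L2"), strip=1)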

Contact, citation

If you found a bug, have a remark to make about ‘soundcorrs’, or wishes for its future releases, please write to .

If you use ‘soundcorrs’ in your research, please cite it as Stachowski K. [forthcoming]. soundcorrs: Tools for Semi-Automatic Analysis of Sound Correspondences.