Package ‘colr’ supplies functions that deal with column names. It puts the power of ‘regex’ at the disposal of the user. ‘Regex’ is an extremely powerful domain specific language (dsl) to deal with the selection and substitution of strings and is therefore very suited to deal with column names. ‘R’ has a full distribution of the ‘Perl’ ‘regex’ engine in the base package, and that is therefore at the disposal of any user. This vignette will explain the use of both the package and basic ‘regex’. ‘Regex’ is a rich and complex dsl and the this vignette is not the right place to go into great detail, I will point you to other sources where appropriate.
Use the package with or use the functions with colr::funcname
(for example colr::cgrep
).
Package ‘colr’ provides several functions all based on the ‘R’ functions associated with ‘regex’. The functions are cgrep
and csub
. cgrep
selects all columns where the string matches the column names, csub
changes the column names according to the replacement.
cgrep is a function to select columns or rows from a dataframe, list or matrix with named columns. For lists the selection is by names of the list. cgrep will act only on the top level of the list. The return is a dataframe with the selected columns, or if the input was a list or matrix and only a single column was selected a flattened list or vector is returned.
cgrep(x, pattern, dim = "c")
The input argument are x
, pattern
and, optionally, dim. x
can be a data frame, list or matrix, pattern
is a string, the string must be “quoted” and dim is a character of either c (for columns) or r (for rows).
A Dataframe, list or matrix or NULL
if no column is matched.
Select columns by ‘Perl’ regular expression. See {base} for ‘regex’ documenation. ‘Regex’ is a very powerful grammar to match strings
cgrep(x, "^.+$")
# matches all columns that have non-empty column names and thus drops all columns with empty names
cgrep(iris, "^Petal\\\\.")
# matches all columns that have names starting with the string “Petal.”
cgrep(iris, "\\\\.")
# columns with names that contain a dot
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
head(colr::cgrep(iris,"^Petal\\.")) # matches all columns that have names starting with the string "Petal."
## Petal.Length Petal.Width
## 1 1.4 0.2
## 2 1.4 0.2
## 3 1.3 0.2
## 4 1.5 0.2
## 5 1.4 0.2
## 6 1.7 0.4
csub is a function to change column- or row names in a dataframe, list or matrix with named columns. The return is the same dataframe, list or matrix where the all column names with a match are changed to the pattern, or if no match was found the returned data frame, list or matrix is exactly as the input x
. For lists ‘csub’ acts on the names in top level of the list.
csub(x, pattern, replacement, dim = "c", gl=TRUE)
The input argument are x
, pattern
, replacement
and, optionally, dim
and gl
. x
can be a data frame, list or matrix, pattern
and replacement
are strings, the strings must be “quoted”. dim
must be either “r” for substitution in row names or “c” for substitution in column names, which is the default. gl
is used to control global or non global substitution i.e. all occurences in all names or just the first occurence in every name.
A Dataframe, list or matrix.
Substitues strings in column or row names by ‘Perl’ regular expression. See {base} for ‘regex’ documenation. ‘Regex’ is a very powerful grammar to match strings and replace strings.
csub(iris, "\\.", "-")
# will change all dots in column names in “-”
csub(iris, "[pP]etal", "Beetle")
This is not a tutorial on ‘regex’, but an introduction to basic use. ‘Regex’ is a grammar for pattern matching and replacement in strings. A pattern consist of a quoted string of characters and meta characters. The quote in ‘R’ ‘regex’ is "
. The meta characters have special meaning in the grammar. An example of a meta character is the .
(the dot) that stands for any character except a newline in the string. So to match in the string The quick brown fox jumps over the hedge
either ow
or ox
or ov
you can use the pattern “o.
”.
‘Regex’ works best if you phrase your question as a positive. Try to phrase what you are looking for not what should be missing. So a string without numbers will have you looking for a string that has letters, spaces and punctation only.
If you want to use a meta characater literally you need to escape it. Escaping in ‘R’ is done with double backslashes \\
. So if you search the character [
the string to use is \\[
.
Apart from the .
, that stands for any character, there are several other classes. Some of them can be defined by the user on the fly. For example [0-9]
stands for any digit and is therefore the same as the class \d.
char | meaning |
---|---|
. | any character except newline |
\w \d \s | word, digit, whitespace |
\W \D \S | not word, digit, whitespace |
[abc] | any of a, b, or c |
[^abc] | not a, b, or c |
[a-g] | character between a & g |
Anchors are important meta characters in ‘regex’, they provide a way to specify where you want to find the string. The meta characters for anchors are in the table below.
char | meaning |
---|---|
^abc$ | start / end of the string |
\b \B | word, not-word boundary |
The start and end of the string mark the position where the string starts (or ends) i.e. just before the first character. A word boundary is anything between a space character and any other character. To take the example a little further if you only want to match the ov
in over
you could use the pattern “\bo.
”.
Quantification gives the possibility to define exactly how many times a character or meta character may occur and alternation. There are four: *
, +
, ?
and { }
. The ?
means: may occur once. The +
means: may occur once or more and the *
means: may occur once many times or not all. The { }
sytax is for an exact number of matches. A word of warning is in place here. Be very careful with +
and *
. A common mistake with (beginning) users of ‘regex’ is to use the string .*
wich means any character however many occurences. As stated here it will match anything to the end of the string. ‘Regex’ (the ‘Perl’ variant at least) is greedy. That means it will try to match as much as it can before moving on to the next character in the pattern. If you try to match the string jumps over the
in the sentence The quick brown fox jumps over the hedge
with a pattern of j.+e
you are actually matching jumps over the hedge
. Why? because e.+
will match as much as it can without failing the match.
Alternation is a simple concept the pattern ab|de
simply means: match ab
or de
whichever occurs. Alternation is lazy, i.e. if the first possibility the other is not evaluated.
char | meaning |
---|---|
a* a+ a? | 0 or more, 1 or more, 0 or 1 |
a{5} a{2,} | exactly five, two or more |
a{1,3} | between one & three |
a+? a{2,}? | match as few as possible |
ab|cd | match ab or cd |
‘Regex’ provides grouping with parentheses but these also capture the matched string for reuse in a replacement. There are two (actually there are more but they are outside the scope of this vignette) grouping structures. ()
for group and capture. (?: )
for grouping only. Grouping is usefull in many instances but often used in alternation. If you want to match either abc
or d
you can group abc
as such: (?:abc)|d, meaning match abc
or d
.
Capturing is useful in replacement. The ‘regex’ engine stores any string that is matched and that is grouped in a capturing parentheses in special variables. In the replacement string these special variables can be used to insert what was matched. There are nine of these special variables and they can be used with \\1
through \\9
. The first one stores what was matched first and so on.
char | meaning |
---|---|
(abc) | capture group |
backreference to capture group #1 | |
(?:abc) | non-capturing group |
To test the validity of the ‘regex’ pattern you can use this webstite: regexr. It is not related to this package.
A more elaborate example will be useful to understand the power of ‘regex’ substitution of column names. A data set is provided in the package under the name colrdata
. colrdata
has column names with messy notation of dates.
colr::colrdata
## 01-31-1955 02-15-1980 06/02/1999 02 14 01 03 03-2016
## 1 7 15 5 11 1
## 2 2 13 9 14 10
## 3 4 12 3 6 8
Say you want to clean this up and at the same time move from American custom (put months first) to European (days first). One call to ‘csub’ can do this for you as such csub(colrdata, "^([01]?\\d)[/ \\-]([123]?\\d)[/ \\-]((?:(?:19)|(?:20))?\\d{2})$","\\2-\\1-\\3")
. The result looks like this, I will shortly explain why the pattern looks like that.
colr::csub(colr::colrdata, "^([01]?\\d)[/ \\-]([0123]?\\d)[/ \\-](?:(?:19)|(?:20))?(\\d{2})$","\\2-\\1-\\3")
## 31-01-55 15-02-80 02-06-99 14-02-01 03-03-16
## 1 7 15 5 11 1
## 2 2 13 9 14 10
## 3 4 12 3 6 8
So how does that work? It states match from the beginning of the string ^
a zero or one [01]
, that may occur ?
, followed by one digit \\d
and if found capture it in special variable 1. Next there must be on of /-
followed by either a zero, a one, two or three [0123]?
and a digit, again followed by one of /-
, and capture the digits. Lastly look for either 19
or 20
but do not capture these alone, next there must be two digits and capture all the digits.