Modifying existing strings via substitution is a common practice in programming, and functions like gsub exist for exactly this purpose. Below is an example where “hey” is replaced with “ho”, transforming a line from the Ramones into Santa Claus leaving on Christmas Eve.
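A minimal sketch of that substitution in base R, assuming the input string is the same s used in the mgsub example further down:

```r
# Assumed input: the Ramones line used throughout this document.
s = "hey ho, let's go!"

# gsub(pattern, replacement, x) replaces every match of the pattern in x.
santa = gsub("hey", "ho", s)
print(santa)
```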
## [1] "ho ho, let's go!"
gsub accepts only one pattern and one replacement string. This means that while you can match on multiple conditions, you can only provide one replacement. Below we construct a regular expression which matches on “hey” or “ho” and replaces any such match with “yo”.
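A sketch of such a call, assuming the same input string as before; the alternation operator | lets one pattern cover both words:

```r
# Assumed input: the same line as in the previous example.
s = "hey ho, let's go!"

# "hey|ho" matches either word; both are replaced by the single string "yo".
yo = gsub("hey|ho", "yo", s)
print(yo)
```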
## [1] "yo yo, let's go!"
If you wanted to replace “hey” with “get” and “ho” with “ready”, you would need two steps, one gsub call per replacement.
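One way to write the two steps, again assuming the same input string; the intermediate variable name is only for illustration:

```r
# Assumed input: the same line as in the previous examples.
s = "hey ho, let's go!"

# Step 1: replace "hey"; step 2: replace "ho" in the result of step 1.
step1 = gsub("hey", "get", s)
ready = gsub("ho", "ready", step1)
print(ready)
```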
## [1] "get ready, let's go!"
This sequential process, however, can produce undesired changes. If we want to swap “hey” and “ho”, the process breaks down: because each change happens in order, “hey” becomes “ho” and then every “ho” becomes “hey”, undoing the first step.
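The failure can be reproduced with two chained gsub calls. This is a sketch under the same assumed input; printing the intermediate result shows where the information is lost:

```r
# Assumed input: the same line as in the previous examples.
s = "hey ho, let's go!"

# After step 1 both words read "ho", so step 2 cannot tell them apart.
step1 = gsub("hey", "ho", s)
print(step1)

# Step 2 turns every "ho" into "hey", including the one created by step 1.
swapped = gsub("ho", "hey", step1)
print(swapped)
```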
## [1] "hey hey, let's go!"
This is where mgsub comes in. mgsub is a safe, simultaneous string substitution function. We pass in the patterns to match along with their replacements, and the replacements are applied simultaneously.
library(mgsub)
s = "hey ho, let's go!"
mgsub::mgsub(string = s, pattern = c("hey", "ho"), replacement = c("ho", "hey"))
## [1] "ho hey, let's go!"
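To make the idea of simultaneous replacement concrete, here is a minimal base R sketch. The helper simultaneous_sub is hypothetical and is not how mgsub is implemented; it also assumes the patterns are plain words with no regular expression metacharacters. It finds every match in a single pass and looks up the replacement for each one, so text that has already been replaced is never scanned again.

```r
# Hypothetical helper for illustration only; not part of the mgsub package.
# Assumes each pattern is a literal word (no regex metacharacters).
simultaneous_sub = function(string, pattern, replacement) {
  # Put longer patterns first so a long match is preferred over its substring.
  ord = order(nchar(pattern), decreasing = TRUE)
  lookup = setNames(replacement[ord], pattern[ord])

  # Locate all matches of any pattern in one pass over the original string.
  m = gregexpr(paste(names(lookup), collapse = "|"), string)

  # Swap each matched piece for its own replacement.
  regmatches(string, m) = lapply(regmatches(string, m),
                                 function(hit) unname(lookup[hit]))
  string
}

swapped = simultaneous_sub("hey ho, let's go!", c("hey", "ho"), c("ho", "hey"))
print(swapped)
```

Because the lookup happens on the original matches, the swap that failed with chained gsub calls works here.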
mgsub fully supports regular expressions as matching criteria as well as backreferences in the replacement. Note how the matching criteria ignore “dopachloride” but match both “Dopazamine” and “dopastriamine” (all fake chemicals, despite what the replacement string claims!).
s = "Dopazamine is not the same as dopachloride or dopastriamine, yet is still fake."
pattern = c("[Dd]opa([^ ]*?mine)", "fake")
replacement = c("Meta\\1", "real")
mgsub::mgsub(s, pattern, replacement)
## [1] "Metazamine is not the same as dopachloride or Metastriamine, yet is still real."
Furthermore, you can pass through any options from the gsub family. The example below uses fixed string matching, so characters such as “$” and “.” are matched literally rather than as regular expression syntax.
s = "All my life I chased $money$ and .power. - not love!"
pattern = c("$money$", ".power.", "love")
replacement = c("balloons", "dolphins", "success")
mgsub::mgsub(s, pattern, replacement, fixed = TRUE)
## [1] "All my life I chased balloons and dolphins - not success!"
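The effect of fixed matching is easiest to see with base gsub on a tiny input. This is a sketch with a made-up string, not taken from the example above:

```r
# Made-up input for illustration.
dotted = "a.b.c"

# As a regular expression, "." matches any character, so everything is replaced.
as_regex = gsub(".", "-", dotted)

# With fixed = TRUE, "." is a literal dot, so only the dots are replaced.
as_fixed = gsub(".", "-", dotted, fixed = TRUE)

print(as_regex)
print(as_fixed)
```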
Safe substitution is actually the most compelling feature of mgsub. Several packages implement a similar function (also named mgsub) that does not employ safe substitution: qdap, bazar and textclean. A detailed analysis of safety can be found on my blog. Here is a quick overview of what is meant by safety: longer matches are substituted before shorter ones, and no placeholders are used, so replacements cannot collide with one another.
First, a demonstration of the first form of safety. Note how we are searching for ‘they’ and ‘the’, where ‘the’ is a substring of ‘they’. If ‘the’ were matched before ‘they’, every ‘they’ would first turn into ‘ay’ and we would expect to see “ay don’t understand a value of what ay seek.” Instead, in both cases shown below (mgsub first, then qdap), the replacements occur correctly.
s = "they don't understand the value of what they seek."
pattern = c("the", "they")
replacement = c("a", "we")
mgsub::mgsub(s, pattern, replacement)
## [1] "we don't understand a value of what we seek."
## [1] "we don't understand a value of what we seek."
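For contrast, the unsafe ordering can be reproduced with chained base gsub calls. This sketch only illustrates the failure mode being avoided and makes no claim about how any of the packages work internally:

```r
s = "they don't understand the value of what they seek."

# Applying the shorter pattern first destroys every "they" before it is tried.
step1 = gsub("the", "a", s)
naive = gsub("they", "we", step1)  # finds nothing left to replace
print(naive)
```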
We can continue to test this by using variable-length regular expression matches. Note that we provide two different matching criteria: one is a regular expression of length 6 that matches a string of length 10, and the other is a literal match of length 9. However, qdap only prioritizes based on the length of the regular expression, not on the actual length of the match, which produces the second output below. While this is an edge case, it is an example of the safety provided by mgsub.
s = "Dopazamine is a fake chemical"
pattern = c("dopazamin", "do.*ne")
replacement = c("freakout", "metazamine")
mgsub::mgsub(s, pattern, replacement, ignore.case = TRUE)
## [1] "metazamine is a fake chemical"
## [1] "freakoute is a fake chemical"
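The gap between pattern length and match length can be checked directly in base R. This is a small sketch using the same string and patterns; the variable names are only for illustration:

```r
s = "Dopazamine is a fake chemical"
pattern = c("dopazamin", "do.*ne")

# Length of each pattern as written.
pattern_length = nchar(pattern)

# Length of the text each pattern actually matches in s.
matched = sapply(pattern, function(p) {
  regmatches(s, regexpr(p, s, ignore.case = TRUE))
})
match_length = unname(nchar(matched))

print(pattern_length)
print(match_length)
```

Ordering by pattern length puts “dopazamin” first, while ordering by match length puts “do.*ne” first.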
In the second case, mgsub does not use placeholders and therefore guarantees no string collisions when replacing. Consider a simple example of shifting each word in the following string one spot to the left. mgsub correctly shifts each word (first output), while qdap produces two different wrong results depending on the other arguments you provide (second and third outputs).
s = "hey, how are you?"
pattern = c("hey", "how", "are", "you")
replacement = c("how", "are", "you", "hey")
mgsub::mgsub(s, pattern, replacement)
## [1] "how, are you hey?"
## [1] "how, are you how?"
## [1] "hey, hey hey hey?"
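A plain chain of base gsub calls shows how such a collision arises: each replacement becomes a target for the next pattern. This sketch uses Reduce to apply the pairs in order and is not meant to describe any package's internals:

```r
s = "hey, how are you?"
pattern = c("hey", "how", "are", "you")
replacement = c("how", "are", "you", "hey")

# Apply each pattern/replacement pair in turn to the running result.
sequential = Reduce(function(text, i) gsub(pattern[i], replacement[i], text),
                    seq_along(pattern), s)
print(sequential)
```

Every word ends up as “hey” because each step rewrites the output of the step before it.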
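mgsub pays for its safety in performance. In the benchmarks below, with a single input string mgsub's median run time is roughly three times that of qdap (about 164 versus 55 microseconds). With two input strings mgsub's time roughly doubles while qdap's grows only slightly, so qdap scales far better than mgsub as more strings are passed in.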
library(microbenchmark)
s = c("Dopazamine is not the same as Dopachloride and is still fake.",
      "dopazamine is undergoing a review by the fake news arm of the Dopazamine government")
pattern = c("[Dd]opa(.*?mine)", "fake")
replacement = c("Meta\\1", "real")
microbenchmark(
  mgsub = mgsub::mgsub(s[1], pattern, replacement),
  qdap = qdap::mgsub(pattern, replacement, s[1], fixed = FALSE)
)
## Unit: microseconds
##   expr     min       lq      mean   median       uq     max neval
##  mgsub 154.401 157.5085 166.09614 163.7445 169.2980 291.607   100
##   qdap  47.920  50.9605  55.57665  55.2865  58.8185  99.124   100
microbenchmark(
  mgsub = mgsub::mgsub(s, pattern, replacement),
  qdap = qdap::mgsub(pattern, replacement, s, fixed = FALSE)
)
## Unit: microseconds
##   expr     min       lq      mean   median       uq     max neval
##  mgsub 319.401 326.4465 335.79102 333.4840 337.4685 434.950   100
##   qdap  54.882  60.0235  65.00009  65.9265  67.0450 120.468   100