Match Regular Expressions with a Nicer ‘API’
A small wrapper on regular expression matching functions regexpr and gregexpr to return the results in tidy data frames.
Note that rematch2 is not compatible with the original rematch package. There are at least three major changes:
* The order of the arguments for the functions is different. In rematch2 the text vector is first, and pattern is second.
* In the result, .match is the last column instead of the first.
* rematch2 returns tibble data frames. See https://github.com/hadley/tibble.
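To make the "wrapper on regexpr" idea concrete, here is a base R sketch of the bookkeeping such a wrapper has to do. It is an illustration only, not rematch2's actual implementation; the variable names (res, groups, no_match) are made up for this example.

```r
# Base R only: turn regexpr() capture attributes into a data frame.
text <- c("2016-04-20", "not a date", "2015-01-21 19:58")
pattern <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"

# perl = TRUE makes regexpr() report capture group positions as attributes.
m <- regexpr(pattern, text, perl = TRUE)
no_match <- as.vector(m) == -1

# capture.start and capture.length are matrices:
# one row per input string, one column per capture group.
starts <- attr(m, "capture.start")
lens <- attr(m, "capture.length")

groups <- matrix(
  substring(text, starts, starts + lens - 1),
  nrow = length(text),
  dimnames = list(NULL, attr(m, "capture.names"))
)
groups[no_match, ] <- NA_character_

# Capture groups first, then the input text, then the full match.
res <- data.frame(groups, stringsAsFactors = FALSE)
res$.text <- text
res$.match <- ifelse(
  no_match,
  NA_character_,
  substring(text, m, m + attr(m, "match.length") - 1)
)
res
```

The column order mirrors the layout described above: groups first, .match last.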
With capture groups:
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
"76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
#> # A tibble: 7 x 5
#> `` `` `` .text .match
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2016 04 20 2016-04-20 2016-04-20
#> 2 1977 08 08 1977-08-08 1977-08-08
#> 3 <NA> <NA> <NA> not a date <NA>
#> 4 <NA> <NA> <NA> 2016 <NA>
#> 5 <NA> <NA> <NA> 76-03-02 <NA>
#> 6 2012 06 30 2012-06-30 2012-06-30
#> 7 2015 01 21 2015-01-21 19:58 2015-01-21
Named capture groups:
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
#> # A tibble: 7 x 5
#> year month day .text .match
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2016 04 20 2016-04-20 2016-04-20
#> 2 1977 08 08 1977-08-08 1977-08-08
#> 3 <NA> <NA> <NA> not a date <NA>
#> 4 <NA> <NA> <NA> 2016 <NA>
#> 5 <NA> <NA> <NA> 76-03-02 <NA>
#> 6 2012 06 30 2012-06-30 2012-06-30
#> 7 2015 01 21 2015-01-21 19:58 2015-01-21
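Because the result is an ordinary data frame, the usual subsetting and conversion tools apply. The following sketch drops the rows that did not match and converts two columns; the names res and matched are chosen for this example only.

```r
library(rematch2)

dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
           "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"

res <- re_match(text = dates, pattern = isodaten)

# Rows without a match carry NA in .match, so they are easy to drop.
matched <- res[!is.na(res$.match), ]

# Capture groups come back as character columns; convert them as needed.
matched$year <- as.integer(matched$year)
matched$date <- as.Date(matched$.match)
matched
```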
A slightly more complex example:
github_repos <- c(
"metacran/crandb",
"jeroenooms/curl@v0.9.3",
"jimhester/covr#47",
"hadley/dplyr@*release",
"r-lib/remotes@550a3c7d3f9e1493a2ba",
"/$&@R64&3"
)
owner_rx <- "(?:(?<owner>[^/]+)/)?"
repo_rx <- "(?<repo>[^/@#]+)"
subdir_rx <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx <- "(?:@(?<ref>[^*].*))"
pull_rx <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"
subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx <- sprintf(
"^(?:%s%s%s%s|(?<catchall>.*))$",
owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
#> # A tibble: 6 x 9
#>        owner    repo subdir                  ref  pull  release  catchall
#>        <chr>   <chr>  <chr>                <chr> <chr>    <chr>     <chr>
#> 1   metacran  crandb
#> 2 jeroenooms    curl                      v0.9.3
#> 3  jimhester    covr                                47
#> 4     hadley   dplyr                                   *release
#> 5      r-lib remotes        550a3c7d3f9e1493a2ba
#> 6                                                               /$&@R64&3
#> # ... with 2 more variables: .text <chr>, .match <chr>
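With many optional groups, it is often useful to ask which groups actually took part in a match. The printed output above shows blank cells for groups that did not participate; depending on the R version these may come back as NA instead, so the sketch below treats both as absent. The helper present() is hypothetical and not part of rematch2.

```r
library(rematch2)

github_repos <- c(
  "metacran/crandb",
  "jeroenooms/curl@v0.9.3",
  "jimhester/covr#47",
  "hadley/dplyr@*release",
  "r-lib/remotes@550a3c7d3f9e1493a2ba",
  "/$&@R64&3"
)
owner_rx <- "(?:(?<owner>[^/]+)/)?"
repo_rx <- "(?<repo>[^/@#]+)"
subdir_rx <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx <- "(?:@(?<ref>[^*].*))"
pull_rx <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"
subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx <- sprintf(
  "^(?:%s%s%s%s|(?<catchall>.*))$",
  owner_rx, repo_rx, subdir_rx, subtype_rx
)
res <- re_match(text = github_repos, pattern = github_rx)

# Hypothetical helper: TRUE where a group captured something non-empty.
present <- function(x) !is.na(x) & nzchar(x)

# Inputs that only the catchall alternative could match.
unparsed <- res$.text[present(res$catchall)]
unparsed

# Repositories that were given an explicit ref.
res[present(res$ref), c("owner", "repo", "ref")]
```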
Extract all names, and also first names and last names:
name_rex <- paste0(
"(?<first>[[:upper:]][[:lower:]]+) ",
"(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
"  Ben Franklin and Jefferson Davis",
"\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
#> # A tibble: 2 x 4
#> first last .text .match
#> <list> <list> <chr> <list>
#> 1 <chr [2]> <chr [2]> Ben Franklin and Jefferson Davis <chr [2]>
#> 2 <chr [1]> <chr [1]> "\tMillard Fillmore" <chr [1]>
not$first
#> [[1]]
#> [1] "Ben" "Jefferson"
#>
#> [[2]]
#> [1] "Millard"
not$last
#> [[1]]
#> [1] "Franklin" "Davis"
#>
#> [[2]]
#> [1] "Fillmore"
not$.match
#> [[1]]
#> [1] "Ben Franklin" "Jefferson Davis"
#>
#> [[2]]
#> [1] "Millard Fillmore"
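The list columns returned by re_match_all can be flattened into one row per match with base R. This is a sketch of one way to do it; the names found, long and source are invented for the example.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
found <- re_match_all(notables, name_rex)

# Number of matches found in each input string.
n <- lengths(found$.match)

# One row per match; `source` is the index of the originating string.
long <- data.frame(
  source = rep(seq_along(notables), n),
  first = unlist(found$first),
  last = unlist(found$last),
  stringsAsFactors = FALSE
)
long
```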
re_exec and re_exec_all are similar to re_match and re_match_all, but they also return match positions. These functions return match records. A match record has three components: match, start, end, and each component can be a vector. It is similar to a data frame in this respect.
pos <- re_exec(notables, name_rex)
pos
#> # A tibble: 2 x 4
#> first last .text .match
#> * <list> <list> <chr> <list>
#> 1 <list [3]> <list [3]> Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]> "\tMillard Fillmore" <list [3]>
Unfortunately R does not allow hierarchical data frames (i.e. a column of a data frame cannot be another data frame), but rematch2 defines some special classes and an $ operator, to make it easier to extract parts of re_exec and re_exec_all matches. You simply query the match, start or end part of a column:
pos$first$match
#> [1] "Ben" "Millard"
pos$first$start
#> [1] 3 2
pos$first$end
#> [1] 5 8
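The positions are 1-based and inclusive, so they can be passed straight to substring to recover the matched text. A short sketch, with recovered and width as names chosen for this example:

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
pos <- re_exec(notables, name_rex)

# Cutting the input at start..end reproduces the captured group.
recovered <- substring(notables, pos$first$start, pos$first$end)
recovered

# The same components are available for the whole match via .match.
width <- pos$.match$end - pos$.match$start + 1
width
```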
re_exec_all is very similar, but these queries return lists, with an arbitrary number of matches:
allpos <- re_exec_all(notables, name_rex)
allpos
#> # A tibble: 2 x 4
#> first last .text .match
#> <list> <list> <chr> <list>
#> 1 <list [3]> <list [3]> Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]> "\tMillard Fillmore" <list [3]>
allpos$first$match
#> [[1]]
#> [1] "Ben" "Jefferson"
#>
#> [[2]]
#> [1] "Millard"
allpos$first$start
#> [[1]]
#> [1] 3 20
#>
#> [[2]]
#> [1] 2
allpos$first$end
#> [[1]]
#> [1] 5 28
#>
#> [[2]]
#> [1] 8
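Since the re_exec_all queries return lists, position-based processing has to iterate over the input strings. The sketch below uses mapply to cut every last name out of its source string; pieces is a name chosen for this example.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
allpos <- re_exec_all(notables, name_rex)

# For each input string, extract all last names using their positions.
pieces <- mapply(
  function(txt, s, e) substring(txt, s, e),
  notables, allpos$last$start, allpos$last$end,
  SIMPLIFY = FALSE, USE.NAMES = FALSE
)
pieces
```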
MIT © Mango Solutions, Gábor Csárdi