Match Regular Expressions with a Nicer ‘API’
A small wrapper on regular expression matching functions regexpr and gregexpr to return the results in tidy data frames.
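For context, here is a minimal base-R sketch of the bookkeeping such a wrapper takes care of. It uses only regexpr with perl = TRUE and its documented capture.start and capture.length attributes; the variable names (text, pattern, starts, lens, groups) are made up for illustration, and this is not rematch2's actual implementation.

```r
# Base R only: pull named capture groups out of a regexpr() result by hand.
text <- c("2016-04-20", "not a date")
pattern <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"

m <- regexpr(pattern, text, perl = TRUE)

# With perl = TRUE the capture-group positions are stored as matrix
# attributes: one row per input string, one column per group.
starts <- attr(m, "capture.start")
lens <- attr(m, "capture.length")

# Cut each group out of the text and set non-matching strings to NA.
groups <- vapply(
  colnames(starts),
  function(g) {
    out <- substring(text, starts[, g], starts[, g] + lens[, g] - 1)
    out[m == -1] <- NA_character_
    out
  },
  character(length(text))
)
groups
```

The result is a character matrix with one column per named group, which is roughly the information re_match returns as data frame columns.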
Note that rematch2 is not compatible with the original rematch package. There are at least three major changes:
* The order of the arguments for the functions is different. In rematch2 the text vector is first, and pattern is second.
* In the result, .match is the last column instead of the first.
* rematch2 returns tibble data frames. See https://github.com/hadley/tibble.
With capture groups:
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
"76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
#> # A tibble: 7 x 5
#> `` `` `` .text .match
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2016 04 20 2016-04-20 2016-04-20
#> 2 1977 08 08 1977-08-08 1977-08-08
#> 3 <NA> <NA> <NA> not a date <NA>
#> 4 <NA> <NA> <NA> 2016 <NA>
#> 5 <NA> <NA> <NA> 76-03-02 <NA>
#> 6 2012 06 30 2012-06-30 2012-06-30
#> 7 2015 01 21 2015-01-21 19:58 2015-01-21
Named capture groups:
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
#> # A tibble: 7 x 5
#> year month day .text .match
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2016 04 20 2016-04-20 2016-04-20
#> 2 1977 08 08 1977-08-08 1977-08-08
#> 3 <NA> <NA> <NA> not a date <NA>
#> 4 <NA> <NA> <NA> 2016 <NA>
#> 5 <NA> <NA> <NA> 76-03-02 <NA>
#> 6 2012 06 30 2012-06-30 2012-06-30
#> 7 2015 01 21 2015-01-21 19:58 2015-01-21
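Because the result is an ordinary data frame, it can be filtered and converted with standard tools. The following is a small sketch under that assumption; the names m, matched and parsed are illustrative only and do not come from the package.

```r
library(rematch2)

dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
           "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"

m <- re_match(text = dates, pattern = isodaten)

# Rows whose .match is NA did not match the pattern at all; drop them.
matched <- m[!is.na(m$.match), ]

# Captured groups come back as character, so convert where needed.
matched$year <- as.integer(matched$year)
parsed <- as.Date(matched$.match)
```

Filtering on .match rather than on an individual group keeps the check independent of how many groups the pattern defines.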
A slightly more complex example:
github_repos <- c(
"metacran/crandb",
"jeroenooms/curl@v0.9.3",
"jimhester/covr#47",
"hadley/dplyr@*release",
"r-lib/remotes@550a3c7d3f9e1493a2ba",
"/$&@R64&3"
)
owner_rx <- "(?:(?<owner>[^/]+)/)?"
repo_rx <- "(?<repo>[^/@#]+)"
subdir_rx <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx <- "(?:@(?<ref>[^*].*))"
pull_rx <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"
subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx <- sprintf(
"^(?:%s%s%s%s|(?<catchall>.*))$",
owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
#> # A tibble: 6 x 9
#>   owner      repo    subdir ref                  pull  release  catchall
#>   <chr>      <chr>   <chr>  <chr>                <chr> <chr>    <chr>
#> 1 metacran   crandb  ""     ""                   ""    ""       ""
#> 2 jeroenooms curl    ""     v0.9.3               ""    ""       ""
#> 3 jimhester  covr    ""     ""                   47    ""       ""
#> 4 hadley     dplyr   ""     ""                   ""    *release ""
#> 5 r-lib      remotes ""     550a3c7d3f9e1493a2ba ""    ""       ""
#> 6 ""         ""      ""     ""                   ""    ""       /$&@R64&3
#> # ... with 2 more variables: .text <chr>, .match <chr>
Extract all names, and also first names and last names:
name_rex <- paste0(
"(?<first>[[:upper:]][[:lower:]]+) ",
"(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
"  Ben Franklin and Jefferson Davis",
"\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
#> # A tibble: 2 x 4
#> first last .text .match
#> <list> <list> <chr> <list>
#> 1 <chr [2]> <chr [2]> Ben Franklin and Jefferson Davis <chr [2]>
#> 2 <chr [1]> <chr [1]> "\tMillard Fillmore" <chr [1]>
not$first
#> [[1]]
#> [1] "Ben" "Jefferson"
#>
#> [[2]]
#> [1] "Millard"
not$last
#> [[1]]
#> [1] "Franklin" "Davis"
#>
#> [[2]]
#> [1] "Fillmore"
not$.match
#> [[1]]
#> [1] "Ben Franklin" "Jefferson Davis"
#>
#> [[2]]
#> [1] "Millard Fillmore"
re_exec and re_exec_all are similar to re_match and re_match_all, but they also return match positions. These functions return match records. A match record has three components: match, start, end, and each component can be a vector. It is similar to a data frame in this respect.
pos <- re_exec(notables, name_rex)
pos
#> # A tibble: 2 x 4
#> first last .text .match
#> * <list> <list> <chr> <list>
#> 1 <list [3]> <list [3]> Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]> "\tMillard Fillmore" <list [3]>
Unfortunately R does not allow hierarchical data frames (i.e. a column of a data frame cannot be another data frame), but rematch2 defines some special classes and an $ operator, to make it easier to extract parts of re_exec and re_exec_all matches. You simply query the match, start or end part of a column:
pos$first$match
#> [1] "Ben" "Millard"
pos$first$start
#> [1] 3 2
pos$first$end
#> [1] 5 8
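The positions shown above are 1-based and inclusive, which can be checked by cutting the matches back out of the input with substring. This is only a sketch: the names pos and recovered are illustrative, and the first input string is assumed to begin with two spaces so that "Ben" starts at position 3.

```r
library(rematch2)

# Sample input; the first string starts with two spaces, the second with a tab.
notables <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)

pos <- re_exec(notables, name_rex)

# If start and end are inclusive character positions, substring() should
# reproduce the text captured by the "first" group.
recovered <- substring(notables, pos$first$start, pos$first$end)
all(recovered == pos$first$match)
```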
re_exec_all is very similar, but these queries return lists, with arbitrary number of matches:
allpos <- re_exec_all(notables, name_rex)
allpos
#> # A tibble: 2 x 4
#> first last .text .match
#> <list> <list> <chr> <list>
#> 1 <list [3]> <list [3]> Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]> "\tMillard Fillmore" <list [3]>
allpos$first$match
#> [[1]]
#> [1] "Ben" "Jefferson"
#>
#> [[2]]
#> [1] "Millard"
allpos$first$start
#> [[1]]
#> [1] 3 20
#>
#> [[2]]
#> [1] 2
allpos$first$end
#> [[1]]
#> [1] 5 28
#>
#> [[2]]
#> [1] 8
MIT © Mango Solutions, Gábor Csárdi