Tidy Output from Regular Expression Matching

Wrappers on 'regexpr' and 'gregexpr' to return the match results in tidy data frames.



A small wrapper around the regular expression matching functions regexpr and gregexpr that returns the results in tidy data frames.


Installation

remotes::install_github("r-lib/rematch2")

Rematch vs rematch2

Note that rematch2 is not compatible with the original rematch package. There are at least three major changes:

  • The order of the arguments for the functions is different. In rematch2 the text vector is first, and pattern is second.
  • In the result, .match is the last column instead of the first.
  • rematch2 returns tibble data frames. See https://github.com/hadley/tibble.
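
The differences listed above can be checked with a short call. This is a minimal sketch; the input strings and the group names word and num are made up for illustration.

```r
library(rematch2)

# Text comes first and the pattern second (the reverse of the original rematch).
res <- re_match(c("apple 1", "pear"), "(?<word>[a-z]+) (?<num>[0-9]+)")

# Capture groups come first; .text and .match are the last two columns.
names(res)

# The result is a tibble, which is also a regular data frame.
inherits(res, "tbl_df")
```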

Usage

First match

library(rematch2)

With capture groups:

dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
  "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
#> # A tibble: 7 x 5
#>      ``    ``    ``            .text     .match
#>   <chr> <chr> <chr>            <chr>      <chr>
#> 1  2016    04    20       2016-04-20 2016-04-20
#> 2  1977    08    08       1977-08-08 1977-08-08
#> 3  <NA>  <NA>  <NA>       not a date       <NA>
#> 4  <NA>  <NA>  <NA>             2016       <NA>
#> 5  <NA>  <NA>  <NA>         76-03-02       <NA>
#> 6  2012    06    30       2012-06-30 2012-06-30
#> 7  2015    01    21 2015-01-21 19:58 2015-01-21

Named capture groups:

isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
#> # A tibble: 7 x 5
#>    year month   day            .text     .match
#>   <chr> <chr> <chr>            <chr>      <chr>
#> 1  2016    04    20       2016-04-20 2016-04-20
#> 2  1977    08    08       1977-08-08 1977-08-08
#> 3  <NA>  <NA>  <NA>       not a date       <NA>
#> 4  <NA>  <NA>  <NA>             2016       <NA>
#> 5  <NA>  <NA>  <NA>         76-03-02       <NA>
#> 6  2012    06    30       2012-06-30 2012-06-30
#> 7  2015    01    21 2015-01-21 19:58 2015-01-21
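
Because the result is an ordinary tibble, non-matching rows can be dropped and the captured strings converted with base R. The following is a small sketch on a shortened input vector; the variable names m and matched are illustrative only.

```r
library(rematch2)

dates <- c("2016-04-20", "not a date", "2015-01-21 19:58")
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
m <- re_match(dates, isodaten)

# Rows without a match have NA in .match (and in every capture group).
matched <- m[!is.na(m$.match), ]

# Capture groups are returned as character vectors; convert as needed.
years <- as.integer(matched$year)
years
```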

A slightly more complex example:

github_repos <- c(
    "metacran/crandb",
    "jeroenooms/curl@v0.9.3",
    "jimhester/covr#47",
    "hadley/dplyr@*release",
    "r-lib/remotes@550a3c7d3f9e1493a2ba",
    "/$&@R64&3"
)
owner_rx   <- "(?:(?<owner>[^/]+)/)?"
repo_rx    <- "(?<repo>[^/@#]+)"
subdir_rx  <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx     <- "(?:@(?<ref>[^*].*))"
pull_rx    <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"
 
subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx  <- sprintf(
    "^(?:%s%s%s%s|(?<catchall>.*))$",
    owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
#> # A tibble: 6 x 9
#>        owner    repo subdir                  ref  pull  release  catchall
#>        <chr>   <chr>  <chr>                <chr> <chr>    <chr>     <chr>
#> 1   metacran  crandb                                                     
#> 2 jeroenooms    curl                      v0.9.3                         
#> 3  jimhester    covr                                47                   
#> 4     hadley   dplyr                                   *release          
#> 5      r-lib remotes        550a3c7d3f9e1493a2ba                         
#> 6                                                               /$&@R64&3
#> # ... with 2 more variables: .text <chr>, .match <chr>
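
In the output above, optional groups that did not take part in a match are printed blank rather than as <NA>. The sketch below uses a deliberately simplified pattern (not the full github_rx) and tests for both an empty string and NA, since the exact representation is an assumption that may depend on the R version.

```r
library(rematch2)

refs <- c("covr", "covr#47")
# Hypothetical reduced pattern: a repo name with an optional pull request number.
pull_opt_rx <- "^(?<repo>[^#]+)(?:#(?<pull>[0-9]+))?$"
m <- re_match(refs, pull_opt_rx)

# Treat both "" and NA as "group not present".
has_pull <- !is.na(m$pull) & nzchar(m$pull)
has_pull
```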

All matches

Extract all names, and also first names and last names:

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
#> # A tibble: 2 x 4
#>       first      last                              .text    .match
#>      <list>    <list>                              <chr>    <list>
#> 1 <chr [2]> <chr [2]>   Ben Franklin and Jefferson Davis <chr [2]>
#> 2 <chr [1]> <chr [1]>               "\tMillard Fillmore" <chr [1]>
not$first
#> [[1]]
#> [1] "Ben"       "Jefferson"
#> 
#> [[2]]
#> [1] "Millard"
not$last
#> [[1]]
#> [1] "Franklin" "Davis"   
#> 
#> [[2]]
#> [1] "Fillmore"
not$.match
#> [[1]]
#> [1] "Ben Franklin"    "Jefferson Davis"
#> 
#> [[2]]
#> [1] "Millard Fillmore"
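
The list columns returned by re_match_all can be flattened into one row per match with base R. This is one possible approach, not part of rematch2 itself; the names all_names, n_matches and flat are illustrative.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
all_names <- re_match_all(notables, name_rex)

# Number of matches found in each input string.
n_matches <- lengths(all_names$.match)

# One row per match, keeping the index of the input string it came from.
flat <- data.frame(
  input = rep(seq_along(notables), n_matches),
  first = unlist(all_names$first),
  last  = unlist(all_names$last),
  stringsAsFactors = FALSE
)
flat
```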

Match positions

re_exec and re_exec_all are similar to re_match and re_match_all, but they also return match positions. These functions return match records. A match record has three components: match, start, end, and each component can be a vector. It is similar to a data frame in this respect.

pos <- re_exec(notables, name_rex)
pos
#> # A tibble: 2 x 4
#>        first       last                              .text     .match
#> *     <list>     <list>                              <chr>     <list>
#> 1 <list [3]> <list [3]>   Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]>               "\tMillard Fillmore" <list [3]>

Unfortunately R does not allow hierarchical data frames (i.e. a column of a data frame cannot be another data frame), but rematch2 defines some special classes and a $ operator to make it easier to extract parts of re_exec and re_exec_all matches. You simply query the match, start or end part of a column:

pos$first$match
#> [1] "Ben"     "Millard"
pos$first$start
#> [1] 3 2
pos$first$end
#> [1] 5 8
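
As the values above suggest, start and end are 1-based, inclusive positions, so they can be passed straight to substring(). A quick consistency check, written as a sketch:

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
pos <- re_exec(notables, name_rex)

# Cutting the input at the reported positions should give back the match.
recovered <- substring(notables, pos$first$start, pos$first$end)
all(recovered == pos$first$match)
```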

re_exec_all is very similar, but these queries return lists, with an arbitrary number of matches per input string:

allpos <- re_exec_all(notables, name_rex)
allpos
#> # A tibble: 2 x 4
#>        first       last                              .text     .match
#>       <list>     <list>                              <chr>     <list>
#> 1 <list [3]> <list [3]>   Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]>               "\tMillard Fillmore" <list [3]>
allpos$first$match
#> [[1]]
#> [1] "Ben"       "Jefferson"
#> 
#> [[2]]
#> [1] "Millard"
allpos$first$start
#> [[1]]
#> [1]  3 20
#> 
#> [[2]]
#> [1] 2
allpos$first$end
#> [[1]]
#> [1]  5 28
#> 
#> [[2]]
#> [1] 8
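
Since re_exec_all returns one vector of positions per input string, the same substring() check has to be applied element by element. A minimal sketch using Map() from base R, here on the last group:

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
allpos <- re_exec_all(notables, name_rex)

# For each input string, cut out every match of the "last" group.
recovered_all <- Map(substring, notables, allpos$last$start, allpos$last$end)
last_names <- unlist(recovered_all, use.names = FALSE)
last_names
```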

License

MIT © Mango Solutions, Gábor Csárdi

News

2.0.0.9000

  • Add perl argument to re_match and re_match_all for compatibility with functions that may pass that argument as part of ...

2.0.0

  • Add re_match_all to extract all matches.

  • Removed the perl options; we always use Perl-compatible regular expressions.

1.0.1

  • Make R CMD check work when testthat is not available.

  • Fixed a bug with group capture when text is a scalar.

1.0.0

First public release.


install.packages("rematch2")

2.0.1 by Gábor Csárdi, 2 years ago


https://github.com/r-lib/rematch2#readme


Report a bug at https://github.com/r-lib/rematch2/issues


Browse source code at https://github.com/cran/rematch2


Authors: Gábor Csárdi




MIT + file LICENSE license


Imports tibble

Suggests covr, testthat


Imported by pak, parsedate, pkgcache, pkgdown, remedy, sinew, styler, texPreview.

