Regex

Misc

  • In alternation (|), order matters. Backtracking engines such as the one behind {stringr} try the alternatives left to right and stop at the first one that matches. Therefore, more complicated patterns should precede simpler patterns.
    • Example:
      • (\\w+|\\w+\\s\\w+) tries the single-word alternative first. Since every compound word starts with a single word, \\w+ succeeds immediately, so only the first half of a compound word is extracted and the whole compound word is missed.
      • ((\\w+\\s\\w+)|\\w+) tries the compound alternative first and falls back to a single word only when no compound match is found.
  • Lazy (non-greedy) match of any characters, as few as possible: (.*?)
  • Match empty lines: "^$"
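  • Match punctuation and special characters: [^\\w\\s*] (any character that is not a word character, whitespace, or a literal *; note that _ counts as a word character, so it is not matched)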
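
The patterns above can be checked with a short sketch. It assumes base R with perl = TRUE (PCRE), which tries alternatives left to right; the helper names first_match and all_matches and the sample strings are made up for illustration.

```r
# Hypothetical helpers: first match and all matches, using PCRE (perl = TRUE)
first_match <- function(pattern, x) regmatches(x, regexpr(pattern, x, perl = TRUE))
all_matches <- function(pattern, x) regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Alternation order: the first alternative that matches wins
x <- "machine learning"
simple_first   <- first_match("(\\w+|\\w+\\s\\w+)", x)
compound_first <- first_match("((\\w+\\s\\w+)|\\w+)", x)
simple_first
#> [1] "machine"
compound_first
#> [1] "machine learning"

# Greedy vs lazy quantifier
y <- "<a><b>"
greedy <- first_match("<.*>", y)
lazy   <- first_match("<.*?>", y)
greedy
#> [1] "<a><b>"
lazy
#> [1] "<a>"

# Empty lines
blank <- grepl("^$", c("first", "", "second"))
blank
#> [1] FALSE  TRUE FALSE

# Punctuation and special characters (the underscore and the asterisk are skipped)
z <- "semi-colon; under_score * star"
punct <- all_matches("[^\\w\\s*]", z)
punct
#> [1] "-" ";"
```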

Tools

  • autoregex - English to Regex
  • RegExplain - RStudio Addin. Interactively build your regexp, check the output of common string matching functions, consult the interactive help pages, or use the included resources to learn regular expressions

Patterns

  • Extracting Words and Compound Words at the Beginning

    library(stringr)

    qmd_txt
    #> [1] "-   200 Status - An API serving an ML model returns a HTTP 200 OK success status response code indicates that the request has succeeded." 
    #> [2] "-   AMI - amazon machine image. Thing that has R and the main packages you need to load onto the cloud server"
    #> [3] "-   Anti-patterns - certain patterns in software development that are considered bad programming practices."
    qmd_txt |> 
      str_extract(pattern = "^\\-   ((\\w+\\s\\w+)|(\\w+[^\\w\\s*]+\\w+)|\\w+)")
    
    #> [1] "-   200 Status"                 
    #> [2] "-   AMI"          
    #> [3] "-   Anti-patterns"
    • (\\w+\\s\\w+) matches patterns of “word + space + word”
    • (\\w+[^\\w\\s*]+\\w+) matches patterns of “word + (not word and not space) + word”
      • [^\\w\\s*] will match punctuation and special characters (e.g. hyphens separating words)
    • \\w+ matches a single word
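
If only the term is wanted, without the leading "-   ", one option is to read the first capture group instead of the whole match. A minimal sketch, assuming {stringr} is installed; the input strings are shortened stand-ins for qmd_txt, and the name terms is made up for illustration.

```r
library(stringr)

# Shortened stand-ins for the glossary lines shown above
qmd_txt <- c(
  "-   200 Status - An API response code",
  "-   AMI - amazon machine image",
  "-   Anti-patterns - bad programming practices"
)

pattern <- "^\\-   ((\\w+\\s\\w+)|(\\w+[^\\w\\s*]+\\w+)|\\w+)"

# str_match() returns a matrix: column 1 is the full match,
# column 2 is the first (outer) capture group
terms <- str_match(qmd_txt, pattern)[, 2]
terms
#> [1] "200 Status"    "AMI"           "Anti-patterns"
```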
  • Extract words between brackets and parentheses.

    • Example 1: Base R (source)

      # brackets
      text <- "Extract this [text] from the string."
      
      result <- sub(".*\\[(.*?)\\].*", "\\1", text)
      
      print(result)
      #> [1] "text"
      
      # parentheses
      text2 <- "This is a sample (extract this part) string."
      
      # Extract string between parentheses using base R
      extracted_base <- gsub(".*\\((.*)\\).*", "\\1", text2)
      print(extracted_base)
      #> [1] "extract this part"
      • .* matches any character (except for line terminators) zero or more times.
      • \\[ matches the literal [
      • (.*?) is a non-greedy (lazy) match for any character (.) zero or more times; (.*) is the greedy version, which matches as many characters as possible.
      • \\1 in the replacement string refers to the first capture group, i.e., the text between [ ] and ( ).
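
The sub() approach returns one value per string. When a string may contain several bracketed pieces, a possible base R alternative is gregexpr() with regmatches(). A minimal sketch, assuming perl = TRUE so that look-arounds are available; text3 and bracketed are made-up names.

```r
# Hypothetical input with two bracketed pieces
text3 <- "Keep [first] and [second], drop the rest."

# gregexpr() finds every match; regmatches() pulls the matched text out
bracketed <- regmatches(
  text3,
  gregexpr("(?<=\\[).*?(?=\\])", text3, perl = TRUE)
)[[1]]
bracketed
#> [1] "first"  "second"
```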
    • Example 2: {stringr} (source)

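      library(stringr)

      # brackets: the look-arounds keep [ and ] out of the match
      result_str_extract <- str_extract(text, "(?<=\\[).*?(?=\\])")
      
      # parentheses: match including ( and ), then trim one character from each end
      extracted_str <- str_extract(text2, "\\(.*?\\)")
      extracted_str <- str_sub(extracted_str, 2, -2)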
      • The str_extract function extracts the first substring matching a regex pattern.
      • The look-behind (?<=\\[) and look-ahead (?=\\]) assertions require a [ immediately before the match and a ] immediately after it, so only the text between them is returned.
      • str_sub is then used to remove the enclosing parentheses.
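
As a possible alternative to trimming with str_sub, a capture group can return the inner text directly, and str_extract_all can collect every occurrence. A minimal sketch, assuming {stringr} is installed; text3, inner, and all_inner are made-up names.

```r
library(stringr)

text2 <- "This is a sample (extract this part) string."

# Column 2 of str_match() holds the first capture group, i.e. the text inside ( )
inner <- str_match(text2, "\\((.*?)\\)")[, 2]
inner
#> [1] "extract this part"

# Several parenthesized pieces in one string
text3 <- "Keep (one) and (two)."
all_inner <- str_extract_all(text3, "(?<=\\().*?(?=\\))")[[1]]
all_inner
#> [1] "one" "two"
```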
    • Example 3: {stringi} (source)

      # brackets
      result_stri_extract <- stringi::stri_extract_first_regex(text, "(?<=\\[).*?(?=\\])")
      
      # parentheses
      extracted_stri <- stringi::stri_extract_first_regex(text2, "\\(.*?\\)")
      extracted_stri <- stringi::stri_sub(extracted_stri, 2, -2)
      • Similar to Example 2 {stringr}
  • Extract text after a special character

    • Example: After a hyphen (source)

      library(stringr)
      
      # Example string
      string <- "apple-pie"
      
      # Extract substring after the hyphen
      result <- str_extract(string, "(?<=-).*")
      # result <- stringi::stri_extract_first_regex(string, "(?<=-).*") # {stringi}
      print(result) 
      #> [1] "pie"
      • (?<=-) is a look-behind assertion ensuring the match starts immediately after a hyphen, and .* matches the rest of the string. With several hyphens, the match starts after the first one.
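
When a string contains more than one hyphen, it matters whether the text after the first or the last hyphen is wanted. A minimal base R sketch of both cases; the input string and variable names are made up for illustration.

```r
x <- "sugar-free-apple-pie"

# After the first hyphen: remove everything up to and including the first "-"
after_first <- sub("^[^-]*-", "", x)
after_first
#> [1] "free-apple-pie"

# After the last hyphen: the greedy .* runs up to the last "-"
after_last <- sub(".*-", "", x)
after_last
#> [1] "pie"
```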
  • Extract text before a space

    • Example: Before the first space (source)

      # Sample data
      text <- c("John Doe", "Jane Smith", "Alice Johnson")
      
      # Extract strings before the first space
      sub("\\s.*", "", text)
      
      #> [1] "John"  "Jane"  "Alice"
      
      # {stringr}
      stringr::str_extract(text, "^[^\\s]+")
      # {stringi}
      stringi::stri_extract_first_regex(text, "^[^\\s]+")
      • ^ anchors the match at the beginning of the string. Then, [^\\s]+ matches one or more characters that aren’t whitespace.
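
Two edge cases are worth checking with the sub() approach: strings without any space, and strings that start with whitespace. A minimal base R sketch; the sample strings and variable names are made up for illustration.

```r
text <- c("John Doe", "Cher", " padded name")

# No space: nothing is replaced, so the string comes back unchanged.
# Leading space: the match starts at position 1, so the result is empty.
raw_first <- sub("\\s.*", "", text)
raw_first
#> [1] "John" "Cher" ""

# Trimming surrounding whitespace first avoids the empty result
trimmed_first <- sub("\\s.*", "", trimws(text))
trimmed_first
#> [1] "John"   "Cher"   "padded"
```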