Regex

Misc

Resources
- {stringr} online cheatsheet
For logical expressions, order matters. Therefore, more complicated patterns should precede simplier patterns.
- Example:
  - (\\w+|\\w+\\s\\w+) says to extract single words then extract compound words or expressions, but since compound words are made up of single words, it will only extract the first halves and miss the whole compound word
  - ((\\w+\\s\\w+)|\\w+) will extract the compound word and then the single words.
Match anything once: (.*?)
Match empty lines: "^$"
Match punctuation and special characters: [^\\w\\s*]

Tools

autoregex - English to Regex
RegExplain - RStudio Addin. Interactively build your regexp, check the output of common string matching functions, consult the interactive help pages, or use the included resources to learn regular expressions

Constructs

Lookarounds
- For matching before or after a pattern
- Computationally expensive
- Lookaheads: (?=...) (positive), (?!...) (negative).
  - Positive says match this pattern if followed by the pattern specified in place of the ... in (?=...)
    - e.g. \\d+(?= dollars) matches “100” in “100 dollars” but not in “100 euros”
  - Negative says match this pattern if not followed by the pattern specified in place of the ... in (?!...)
    - e.g. \\d+(?! dollars)matches “100” in “100 euros” but not in “100 dollars”
- Lookbehinds: (?<=...) (positive), (?<!...) (negative).
  - Positive says match this pattern if preceded by the pattern specified in place of the ... in (?<=...)
    - e.g. (?<=\\$)\\d+ matches “100” in “$100” but not in “₤100”
    - See Patterns >> :
      - Extract numbers after text
  - Negative says match this pattern if not preceded by the pattern specified in place of the ... in (?<!...)
    - e.g. (?<!\\$)\\d+ matches “100” in “₤100” but not in “$100”
Non-Capturing Groups
- Helps store pattern optimally (memory-wise) when using alternations (i.e. (word1|word2)) and repetitions (e.g. (\\d{3})*)
  - Has other benefits but the main one I see is the memory efficiency aspect
- Especially useful for big data
- Syntaxes
  - Typical
```
"(?:apple|banana|cherry) pie"
```
    - Matches “apple pie”, “banana pie”, or “cherry pie”
    - With (apple|banana|cherry), the regex engine would somehow unnecessarily store “cherry” and “banana” when matching “apple pie”.
      - The typically get stored for “later use,” but I don’t know what the situations are.
  - Optional
```
"(?:apple|banana|cherry)? pie"
```
    - The extra ? at the end of the grouping makes the fruit part optional. So if the string just has “pie” and no “apple”, “banana” or “cherry” preceding it, it still gets matched.
  - Nested
```
"(?:foo|bar(?:123|456))"
```
    - Matches “foo”, “bar123”, or “bar456”
  - Repetition
```
"\\d{3}(?:,\\d{3})*"
```
    - Matches numbers like 123,456 and 123,456,789. With the non-capture, none of the commas + digits get stored in memory.
Conditionals
- Syntax: (?(conditon)true|false)
- e.g. (?(?<=foo)bar|baz)
  - Uses a positive lookbehind for the condition.
  - Matches “bar” if preceded by “foo”, else “baz” is matched.
Atomic Groups
- Syntax: (?>...)
- Performance optimization (avoid catastrophic backtracking).
- Enforcing strict matches (e.g., “this must fully match or fail”)
- I don’t really get what these are for, but I think they’re rarely needed (or seen).

Patterns

Extract numberes after text
```
elem_text() |> 
  stringr::str_extract("(?<=Gross Tax Amount\n\\$)(?:\\d{1,3}(?:,\\d{3})*|\\d+)\\.\\d{2}") |> 
  stringr::str_remove_all(",")
```
- (?<=Gross Tax Amount\n\\$): Positive lookbehind (?<=) to ensure the amount follows “Gross Tax Amount” followed by a newline and a dollar sign.
- (?:\\d{1,3}(?:,\\d{3})*|\\d+): Matches the dollar part, which can be:
  - (?:\\d{1,3}(?:,\\d{3})*: Matches numbers with commas as thousand separators (e.g., 1,234).
    - \\d{1,3} matches 1 to 3 digits
    - (?:,\\d{3})* matches a comma followed by 3 digits and repeats as necessary
      - ?: specifies a non-capturing group
  - |\\d+: Numbers without commas (e.g., 1234).
- \\.\\d{2}: Requires a decimal point followed by exactly two digits for cents.

URL regex (source)

url_regex <- "https?://(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&//=]*)"

Extracting dates from the beginning

library(stringi)
days <- 
  stri_match_first_regex(fils, 
                         "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})")[,2]

Extracting Words and Compound Words at the Beginning

qmd_txt
#> [1] "-   200 Status - An API serving an ML model returns a HTTP 200 OK success status response code indicates that the request has succeeded." 
#> [2] "-   AMI - amazon machine image. Thing that has R and the main packages you need to load onto the cloud server"
#> [3] "-   Anti-patterns - certain patterns in software development that are considered bad programming practices."

qmd_txt |> 
  str_extract(pattern = "^\\-   ((\\w+\\s\\w+)|(\\w+[^\\w\\s*]+\\w+)|\\w+)")

#> [1] "-   200 Status"                 
#> [2] "-   AMI"          
#> [3] "-   Anti-patterns"

(\\w+\\s\\w+) matches patterns of “word + space + word”
(\\w+[^\\w\\s*]+\\w+) matches patterns of “word + (not word and not space) + word”
- [^\\w\\s*] will match punctuation and special characters (e.g. hypens separating words)
\\w+ matches word

Extract text after a character until the end of the string
```
text <- "path/to/some/file.txt"
(result <- str_extract(text, "[^/]+$"))
#> "file.txt"
```
- [^/]: Matches any character that is not a forward slash (/).
- +: Matches one or more of the preceding pattern.
- $: Anchors the match to the end of the string.

Extract words between brackets and parentheses.

Example 1: Base R (source)

# brackets
text <- "Extract this [text] from the string."

result <- sub(".*\\[(.*?)\\].*", "\\1", text)

print(result)
#> [1] "text"

# parentheses
text2 <- "This is a sample (extract this part) string."

# Extract string between parentheses using base R
extracted_base <- gsub(".*\\((.*)\\).*", "\\1", text2)
print(extracted_base)
#> [1] "extract this part"

.* matches any character (except for line terminators) zero or more times.
\\[ matches the literal [
(.*?) and (.*) are non-greedy matches for any character (.) zero or more times.
\\1 in the replacement string refers to the first capture group, i.e., the text between [ ] and ( ).

Example 2: {stringr} (source)
```
# brackets
result_str_extract <- str_extract(text, "(?<=\\[).*?(?=\\])")

# parentheses
extracted_str <- str_extract(text2, "\$.*?\$")
extracted_str <- str_sub(extracted_str, 2, -2)
```
- The str_extract function extracts the first substring matching a regex pattern.
- Look-Behind (?<=\\[) and Look-Ahead (?=\\]) assertions match text between [ and ]
- str_sub is then used to remove the enclosing parentheses.

Example 3: {stringi} (source)

# brackets
result_stri_extract <- stri_extract(text, regex = "(?<=\\[).*?(?=\\])")

# parentheses
extracted_stri <- stringi::stri_extract_first_regex(text2, "\\(.*?\\)")
extracted_stri <- stringi::stri_sub(extracted_stri, 2, -2)

Similar to Example 2 {stringr}

Extract text after a special character

Example: After a hyphen (source)

library(stringr)

# Example string
string <- "apple-pie"

# Extract substring after the hyphen
result <- str_extract(string, "(?<=-).*")
# result <- stri_extract(string, regex = "(?<=-).*") # stringi
print(result) 
#> [1] "pie"

(?<=-) is a look-behind assertion ensuring the match occurs after a hyphen, and .* matches any character zero or more times.

Extract text before a space

Example: Before the first space (source)

# Sample data
text <- c("John Doe", "Jane Smith", "Alice Johnson")

# Extract strings before the first space
sub("\\s.*", "", text)

#> [1] "John"  "Jane"  "Alice"

# {stringr}
str_extract(text, "^[^\\s]+")
# {stringi}
stri_extract_first_regex(text, "^[^\\s]+")

^ says start at the beginning of the string. Then, [^\\s ]+ matches one or more characters that isn’t a space.