Regex
Misc
- Resources
- {stringr} online cheatsheet
- In alternations (|), order matters: the regex engine tries the alternatives left to right and keeps the first one that matches. Therefore, more complicated patterns should precede simpler patterns.
- Example:
(\\w+|\\w+\\s\\w+)
says to extract single words and then compound words or expressions, but since a compound word starts with a single word, it will only extract the first half and miss the whole compound word.
((\\w+\\s\\w+)|\\w+)
will extract the compound word if there is one and otherwise fall back to the single word.
- Match any characters lazily, i.e. as few as possible (as a capture group):
(.*?)
- Match empty lines:
"^$"
- Match punctuation and special characters (anything that is not a word character, whitespace, or a literal *):
[^\\w\\s*]
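The ordering rule and the small patterns above can be checked with a short base R sketch. The helper name first_match and the sample strings are made up for illustration, and perl = TRUE is assumed so that \\w and \\s behave the same way inside and outside character classes.

```r
# Hypothetical helper: first match of `pattern` in each element of `x`,
# NA where there is no match (base R, PCRE engine via perl = TRUE).
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

x <- "ice cream"  # made-up sample string

# Simple alternative first: the single word wins and the compound is missed
simple_first <- first_match("(\\w+|\\w+\\s\\w+)", x)
# Compound alternative first: the whole compound is extracted
compound_first <- first_match("((\\w+\\s\\w+)|\\w+)", x)
print(simple_first)    #> [1] "ice"
print(compound_first)  #> [1] "ice cream"

# "^$" flags empty strings (e.g. empty lines read with readLines())
empty_lines <- grepl("^$", c("first", "", "third"))
print(empty_lines)     #> [1] FALSE  TRUE FALSE

# [^\\w\\s*] strips punctuation but keeps word characters, whitespace and "*"
no_punct <- gsub("[^\\w\\s*]", "", "Hello, world! 2*3", perl = TRUE)
print(no_punct)        #> [1] "Hello world 2*3"
```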
Tools
- autoregex - English to Regex
- RegExplain - RStudio Addin. Interactively build your regexp, check the output of common string matching functions, consult the interactive help pages, or use the included resources to learn regular expressions
Constructs
- Lookarounds
- For matching before or after a pattern
- Computationally expensive
- Lookaheads: (?=...) (positive), (?!...) (negative).
  - Positive says match this pattern only if it is followed by the pattern specified in place of the ... in (?=...)
    - e.g. \\d+(?= dollars) matches “100” in “100 dollars” but not in “100 euros”
  - Negative says match this pattern only if it is not followed by the pattern specified in place of the ... in (?!...)
    - e.g. \\d+(?! dollars) matches “100” in “100 euros” but not in “100 dollars”
    - Caveat: in “100 dollars” the engine backtracks and still matches “10”, since “10” is followed by “0” and not by “ dollars”. Adding a word boundary, \\d+\\b(?! dollars), prevents the partial match.
- Lookbehinds: (?<=...) (positive), (?<!...) (negative).
  - Positive says match this pattern only if it is preceded by the pattern specified in place of the ... in (?<=...)
    - e.g. (?<=\\$)\\d+ matches “100” in “$100” but not in “₤100”
    - See Patterns >> Extract numbers after text
  - Negative says match this pattern only if it is not preceded by the pattern specified in place of the ... in (?<!...)
    - e.g. (?<!\\$)\\d+ matches “100” in “₤100” but not in “$100”
    - Caveat: in “$100” the engine still matches “00”, since the first “0” is preceded by “1” and not by “$”. Excluding a preceding digit as well, (?<![$\\d])\\d+, prevents the partial match.
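A minimal base R sketch of the four lookarounds, including the partial matches that the negative versions produce when the digits are allowed to backtrack. The helper first_match is hypothetical, perl = TRUE is assumed (base R's default engine does not support lookarounds), and “₤” is written as the escape \u20a4 to sidestep source-encoding issues.

```r
# Hypothetical helper: first match per element, NA where nothing matches.
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

amounts <- c("100 dollars", "100 euros")
pos_ahead        <- first_match("\\d+(?= dollars)", amounts)     # "100", NA
neg_ahead        <- first_match("\\d+(?! dollars)", amounts)     # "10", "100"
neg_ahead_strict <- first_match("\\d+\\b(?! dollars)", amounts)  # NA, "100"

prices <- c("$100", "\u20a4100")
pos_behind        <- first_match("(?<=\\$)\\d+", prices)      # "100", NA
neg_behind        <- first_match("(?<!\\$)\\d+", prices)      # "00", "100"
neg_behind_strict <- first_match("(?<![$\\d])\\d+", prices)   # NA, "100"

print(list(pos_ahead, neg_ahead, neg_ahead_strict))
print(list(pos_behind, neg_behind, neg_behind_strict))
```

The "strict" variants are one possible fix; an atomic group around \\d+ would be another.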
- Non-Capturing Groups
- Group alternations (i.e. (word1|word2)) and repetitions (e.g. (\\d{3})*) without storing the matched text as a capture group, which saves memory and bookkeeping
  - Has other benefits (e.g. the numbering of the remaining capture groups isn't shifted), but the main one I see is the memory efficiency aspect
  - Especially useful for big data
- Syntaxes
Typical
"(?:apple|banana|cherry) pie"
- Matches “apple pie”, “banana pie”, or “cherry pie”
- With (apple|banana|cherry), the regex engine would additionally store the fruit that matched (e.g. “apple” when matching “apple pie”) as capture group 1.
  - Captures get stored for later use: backreferences (e.g. \\1 in the pattern or in a replacement string) and functions that return the groups (e.g. stringr::str_match).
Optional
"(?:apple|banana|cherry)? pie"
- The extra ? at the end of the grouping makes the fruit part optional. So if the string has no “apple”, “banana” or “cherry” preceding “ pie”, it still gets matched. The space sits outside the group, so it is still required (e.g. “a pie” matches, a bare “pie” doesn't).
Nested
"(?:foo|bar(?:123|456))"
- Matches “foo”, “bar123”, or “bar456”
Repetition
"\\d{3}(?:,\\d{3})*"
- Matches numbers like 123,456 and 123,456,789. With the non-capture, none of the commas + digits get stored in memory.
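To make the difference between capturing and non-capturing groups visible, here is a small base R sketch. It uses regexec(), which returns the full match followed by any capture groups; the sample strings are made up, and the ^...$ anchors in the repetition check are an addition for illustration.

```r
x <- "apple pie"

# Capturing group: full match plus the captured fruit
capturing <- regmatches(
  x, regexec("(apple|banana|cherry) pie", x, perl = TRUE)
)[[1]]
# Non-capturing group: full match only, nothing extra is stored
non_capturing <- regmatches(
  x, regexec("(?:apple|banana|cherry) pie", x, perl = TRUE)
)[[1]]
print(capturing)      #> [1] "apple pie" "apple"
print(non_capturing)  #> [1] "apple pie"

# "Later use" of a capture: a backreference in the replacement string
tart <- sub("(apple|banana|cherry) pie", "\\1 tart", x, perl = TRUE)
print(tart)           #> [1] "apple tart"

# Optional group: the space before "pie" is still required
optional <- grepl("(?:apple|banana|cherry)? pie", c("a pie", "pie"), perl = TRUE)
print(optional)       #> [1]  TRUE FALSE

# Repetition, anchored so the whole string has to fit the pattern
grouped <- grepl("^\\d{3}(?:,\\d{3})*$",
                 c("123,456", "123,456,789", "12,34"), perl = TRUE)
print(grouped)        #> [1]  TRUE  TRUE FALSE
```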
- Conditionals
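- Syntax: (?(condition)true|false)
  - e.g. (?(?<=foo)bar|baz)
    - Uses a positive lookbehind for the condition.
    - Matches “bar” if preceded by “foo”, else “baz” is matched.
- Works with PCRE (base R functions with perl = TRUE); as far as I know, the ICU engine behind {stringr} and {stringi} doesn't support conditionals.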
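A quick way to see the conditional at work is to test it against a few made-up strings. This sketch assumes the PCRE engine, so it uses base R's grepl() with perl = TRUE.

```r
# (?(?<=foo)bar|baz): if preceded by "foo", require "bar"; otherwise require "baz"
pattern <- "(?(?<=foo)bar|baz)"
x <- c("foobar", "foobaz", "xbaz", "xbar")  # made-up test strings

cond_hits <- grepl(pattern, x, perl = TRUE)
print(setNames(cond_hits, x))
# "foobar": "bar" is preceded by "foo"            -> match
# "foobaz": after "foo" only "bar" is accepted    -> no match
# "xbaz"  : no "foo" before it, so "baz" is tried -> match
# "xbar"  : no "foo" before it, "bar" is not baz  -> no match
```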
- Atomic Groups
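- Syntax: (?>...)
- Once the group has matched, the engine throws away its backtracking positions and won't go back into the group to try a shorter match or a different alternative.
  - e.g. a(?>bc|b)c matches “abcc” but not “abc” (the group takes “bc” and can't give the “c” back), whereas a(?:bc|b)c matches both
- Performance optimization (avoid catastrophic backtracking).
- Enforcing strict matches (e.g., “this must fully match or fail”)
- I think they’re rarely needed (or seen).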
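The "no backtracking into the group" behaviour is easiest to see side by side with an ordinary non-capturing group. A small base R sketch with made-up strings, assuming perl = TRUE:

```r
x <- c("abcc", "abc")

# Atomic: the group commits to "bc", so "abc" has no "c" left to match
atomic <- grepl("a(?>bc|b)c", x, perl = TRUE)
# Non-atomic: the engine can go back and try the "b" alternative
non_atomic <- grepl("a(?:bc|b)c", x, perl = TRUE)
print(atomic)      #> [1]  TRUE FALSE
print(non_atomic)  #> [1]  TRUE  TRUE

# Strict matching: stop \\d+ from giving digits back to satisfy a lookahead
y <- "100 dollars"
plain_digits  <- regmatches(y, regexpr("\\d+(?! dollars)", y, perl = TRUE))
atomic_digits <- grepl("(?>\\d+)(?! dollars)", y, perl = TRUE)
print(plain_digits)   #> [1] "10"
print(atomic_digits)  #> [1] FALSE
```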
Patterns
Extract numbers after text
elem_text() |>
  stringr::str_extract("(?<=Gross Tax Amount\n\\$)(?:\\d{1,3}(?:,\\d{3})*|\\d+)\\.\\d{2}") |>
  stringr::str_remove_all(",")
- (?<=Gross Tax Amount\n\\$): Positive lookbehind (?<=...) to ensure the amount follows “Gross Tax Amount” followed by a newline and a dollar sign.
- (?:\\d{1,3}(?:,\\d{3})*|\\d+): Matches the dollar part, which can be:
  - \\d{1,3}(?:,\\d{3})*: Matches numbers with commas as thousand separators (e.g., 1,234).
    - \\d{1,3} matches 1 to 3 digits
    - (?:,\\d{3})* matches a comma followed by 3 digits and repeats as necessary
    - ?: specifies a non-capturing group
  - |\\d+: Numbers without commas (e.g., 1234).
- \\.\\d{2}: Requires a decimal point followed by exactly two digits for cents.
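The same pattern can be tried out without a scraped page. This is a base R sketch on made-up input strings (the original pipes in text from elem_text()); the helper first_match and the conversion to numeric are additions for illustration.

```r
# Hypothetical helper: first match per element, NA where nothing matches.
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# Made-up stand-ins for the scraped text
txt <- c(
  "Gross Tax Amount\n$1,234.56",
  "Gross Tax Amount\n$1234.56",
  "Net Tax Amount\n$99.00"
)

pattern <- "(?<=Gross Tax Amount\n\\$)(?:\\d{1,3}(?:,\\d{3})*|\\d+)\\.\\d{2}"

amount_chr <- first_match(pattern, txt)
print(amount_chr)  # "1,234.56", "1234.56", NA

# Drop the thousands separators before converting to a number
amount_num <- as.numeric(gsub(",", "", amount_chr, fixed = TRUE))
print(amount_num)  # 1234.56, 1234.56, NA
```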
URL regex (source)
url_regex <- "https?://(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&//=]*)"
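A short usage sketch for the URL pattern, pulling every URL out of a made-up sentence with base R. The example URLs are placeholders, and gregexpr() with perl = TRUE is just one way to apply the pattern.

```r
url_regex <- "https?://(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&//=]*)"

# Made-up text containing two placeholder URLs
txt <- "Docs: https://www.example.com/path?x=1 and http://example.org/a_b are both linked"

# All matches in the string
urls <- regmatches(txt, gregexpr(url_regex, txt, perl = TRUE))[[1]]
print(urls)

# Simple yes/no check
has_url <- grepl(url_regex, c(txt, "no link here"), perl = TRUE)
print(has_url)  #> [1]  TRUE FALSE
```

Because the trailing character class includes a period, a URL that ends a sentence will be extracted together with the final period.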
Extracting dates from the beginning
library(stringi)
days <- stri_match_first_regex(fils, "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})")[,2]
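For a version that runs without {stringi}, here is a base R sketch. It assumes fils is a character vector of file names that start with an ISO date; the file names are invented, and the ^ anchor is added to pin the match to the beginning of the string.

```r
# Assumption: `fils` holds file names that begin with a YYYY-MM-DD date
fils <- c("2021-03-15-report.csv", "2022-11-02-summary.csv", "notes.txt")

date_pattern <- "^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}"

m <- regexpr(date_pattern, fils)
days <- rep(NA_character_, length(fils))
days[m != -1] <- regmatches(fils, m)   # NA for names without a leading date
print(days)

# Convert to Date for sorting or filtering; NA stays NA
dates <- as.Date(days)
print(dates)
```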
Extracting Words and Compound Words at the Beginning
qmd_txt
#> [1] "- 200 Status - An API serving an ML model returns a HTTP 200 OK success status response code indicates that the request has succeeded."
#> [2] "- AMI - amazon machine image. Thing that has R and the main packages you need to load onto the cloud server"
#> [3] "- Anti-patterns - certain patterns in software development that are considered bad programming practices."
qmd_txt |>
  str_extract(pattern = "^\\- ((\\w+\\s\\w+)|(\\w+[^\\w\\s*]+\\w+)|\\w+)")
#> [1] "- 200 Status"
#> [2] "- AMI"
#> [3] "- Anti-patterns"
- (\\w+\\s\\w+) matches patterns of “word + space + word”
- (\\w+[^\\w\\s*]+\\w+) matches patterns of “word + (not word and not space) + word”
  - [^\\w\\s*] will match punctuation and special characters (e.g. hyphens separating words)
- \\w+ matches a single word
Extract text after a character until the end of the string
text <- "path/to/some/file.txt"
(result <- str_extract(text, "[^/]+$"))
#> [1] "file.txt"
- [^/]: Matches any character that is not a forward slash (/).
- +: Matches one or more of the preceding pattern.
- $: Anchors the match to the end of the string.
Extract words between brackets and parentheses.
Example 1: Base R (source)
# brackets
text <- "Extract this [text] from the string."
result <- sub(".*\\[(.*?)\\].*", "\\1", text)
print(result)
#> [1] "text"
# parentheses
text2 <- "This is a sample (extract this part) string."
# Extract string between parentheses using base R
extracted_base <- gsub(".*\\((.*)\\).*", "\\1", text2)
print(extracted_base)
#> [1] "extract this part"
- .* matches any character (except for line terminators) zero or more times.
- \\[ matches the literal [
- (.*?) is a non-greedy (lazy) match for any character (.) zero or more times; (.*) is the greedy version. With a single pair of brackets or parentheses, both give the same result.
- \\1 in the replacement string refers to the first capture group, i.e., the text between [ ] and ( ).
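Greedy and lazy quantifiers only differ once there is more than one closing bracket to choose from. A small base R sketch on a made-up string with two bracketed parts, assuming perl = TRUE:

```r
text3 <- "Take [first] and [second]."  # made-up string with two bracket pairs

# Greedy: .* runs to the LAST closing bracket
greedy <- regmatches(text3, regexpr("\\[.*\\]", text3, perl = TRUE))
# Lazy: .*? stops at the FIRST closing bracket
lazy <- regmatches(text3, regexpr("\\[.*?\\]", text3, perl = TRUE))
print(greedy)  #> [1] "[first] and [second]"
print(lazy)    #> [1] "[first]"

# All bracketed pieces, without the brackets, via lookarounds
all_inside <- regmatches(
  text3, gregexpr("(?<=\\[).*?(?=\\])", text3, perl = TRUE)
)[[1]]
print(all_inside)  #> [1] "first"  "second"
```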
Example 2: {stringr} (source)
# brackets
result_str_extract <- str_extract(text, "(?<=\\[).*?(?=\\])")
# parentheses
extracted_str <- str_extract(text2, "\\(.*?\\)")
extracted_str <- str_sub(extracted_str, 2, -2)
- The str_extract function extracts the first substring matching a regex pattern.
- Look-Behind (?<=\\[) and Look-Ahead (?=\\]) assertions match text between [ and ]
- str_sub is then used to remove the enclosing parentheses.
Example 3: {stringi} (source)
# brackets
result_stri_extract <- stri_extract(text, regex = "(?<=\\[).*?(?=\\])")
# parentheses
extracted_stri <- stringi::stri_extract_first_regex(text2, "\\(.*?\\)")
extracted_stri <- stringi::stri_sub(extracted_stri, 2, -2)
- Similar to Example 2 {stringr}
Extract text after a special character
Example: After a hyphen (source)
library(stringr)
# Example string
string <- "apple-pie"
# Extract substring after the hyphen
result <- str_extract(string, "(?<=-).*")
# result <- stri_extract(string, regex = "(?<=-).*") # stringi
print(result)
#> [1] "pie"
- (?<=-) is a look-behind assertion ensuring the match occurs after a hyphen, and .* matches any character zero or more times.
Extract text before a space
Example: Before the first space (source)
# Sample data
text <- c("John Doe", "Jane Smith", "Alice Johnson")
# Extract strings before the first space
sub("\\s.*", "", text)
#> [1] "John"  "Jane"  "Alice"
# {stringr}
str_extract(text, "^[^\\s]+")
# {stringi}
stri_extract_first_regex(text, "^[^\\s]+")
- ^ says start at the beginning of the string. Then, [^\\s]+ matches one or more characters that aren’t whitespace.