Regex
Misc
- For alternation (the | operator), order matters: the engine tries the alternatives left to right and keeps the first one that matches. Therefore, more complicated patterns should precede simpler patterns.
- Example:
(\\w+|\\w+\\s\\w+)
says to extract single words, then compound words or expressions. But since compound words are made up of single words, it will only extract the first half and miss the whole compound word.
((\\w+\\s\\w+)|\\w+)
will extract the compound word where there is one and fall back to the single word otherwise.
- Match any characters, as few as possible (lazy match):
(.*?)
- Match empty lines:
"^$"
- Match punctuation and special characters (anything that is not a word character, whitespace, or a literal *):
[^\\w\\s*]
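The alternation-order note at the top of this list can be checked with a small base R sketch. The input string and the helper name first_match are illustrative, not from the original notes, and perl = TRUE is an assumption made here so that base R uses a backtracking engine whose alternation behaves like the one behind {stringr}.

```r
# Hypothetical helper: return the first match of `pattern` in `x`
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

x <- "ice cream"

# Simple alternative listed first: the single word wins and the
# compound alternative is never tried
simple_first <- first_match("(\\w+|\\w+\\s\\w+)", x)

# Compound alternative listed first: the whole expression is matched
compound_first <- first_match("((\\w+\\s\\w+)|\\w+)", x)

print(simple_first)
#> [1] "ice"
print(compound_first)
#> [1] "ice cream"
```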
Tools
- autoregex - English to Regex
- RegExplain - RStudio Addin. Interactively build your regexp, check the output of common string matching functions, consult the interactive help pages, or use the included resources to learn regular expressions
Patterns
URL regex (source)
url_regex <- "https?://(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&//=]*)"
Extracting dates from the beginning
library(stringi)
days <- stri_match_first_regex(fils, "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})")[,2]
Extracting Words and Compound Words at the Beginning
qmd_txt
#> [1] "- 200 Status - An API serving an ML model returns a HTTP 200 OK success status response code indicates that the request has succeeded."
#> [2] "- AMI - amazon machine image. Thing that has R and the main packages you need to load onto the cloud server"
#> [3] "- Anti-patterns - certain patterns in software development that are considered bad programming practices."
qmd_txt |>
  str_extract(pattern = "^\\- ((\\w+\\s\\w+)|(\\w+[^\\w\\s*]+\\w+)|\\w+)")
#> [1] "- 200 Status"
#> [2] "- AMI"
#> [3] "- Anti-patterns"
- (\\w+\\s\\w+) matches patterns of “word + space + word”
- (\\w+[^\\w\\s*]+\\w+) matches patterns of “word + (not word and not space) + word”
- [^\\w\\s*] will match punctuation and special characters (e.g. hyphens separating words)
- \\w+ matches a single word
Extract text after a character until the end of the string
text <- "path/to/some/file.txt"
(result <- str_extract(text, "[^/]+$"))
#> [1] "file.txt"
- [^/] : Matches any character that is not a forward slash (/).
- + : Matches one or more of the preceding pattern.
- $ : Anchors the match to the end of the string.
Extract words between brackets and parentheses.
Example 1: Base R (source)
# brackets
text <- "Extract this [text] from the string."
result <- sub(".*\\[(.*?)\\].*", "\\1", text)
print(result)
#> [1] "text"
# parentheses
text2 <- "This is a sample (extract this part) string."
# Extract string between parentheses using base R
extracted_base <- gsub(".*\\((.*)\\).*", "\\1", text2)
print(extracted_base)
#> [1] "extract this part"
- .* matches any character (except for line terminators) zero or more times.
- \\[ matches the literal [
- (.*?) is a non-greedy (lazy) capture of any character (.) zero or more times; (.*) is the greedy version, which captures as much as possible.
- \\1 in the replacement string refers to the first capture group, i.e., the text between [ ] and ( ).
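The difference between the greedy and lazy quantifiers described above only shows up when the closing delimiter occurs more than once. A minimal base R sketch follows; the sample string is made up for illustration, and perl = TRUE is an assumption chosen so the quantifiers follow Perl-style semantics.

```r
# Illustrative input with two bracketed pieces
x <- "keep [a] and [b] here"

# Greedy: .* runs to the last closing bracket
greedy <- regmatches(x, regexpr("\\[.*\\]", x, perl = TRUE))

# Lazy: .*? stops at the first closing bracket
lazy <- regmatches(x, regexpr("\\[.*?\\]", x, perl = TRUE))

# With a greedy leading .* (as in the sub() pattern above), the
# capture group ends up on the LAST bracketed piece, not the first
last_bracket <- sub(".*\\[(.*?)\\].*", "\\1", x, perl = TRUE)

print(greedy)
#> [1] "[a] and [b]"
print(lazy)
#> [1] "[a]"
print(last_bracket)
#> [1] "b"
```

When a string may contain several bracketed pieces, the look-around form in the {stringr} example is one way to get the first piece.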
Example 2: {stringr} (source)
# brackets
result_str_extract <- str_extract(text, "(?<=\\[).*?(?=\\])")
# parentheses
extracted_str <- str_extract(text2, "\\(.*?\\)")
extracted_str <- str_sub(extracted_str, 2, -2)
- The str_extract function extracts the first substring matching a regex pattern.
- Look-Behind (?<=\\[) and Look-Ahead (?=\\]) assertions match text between [ and ].
- str_sub is then used to remove the enclosing parentheses.
Example 3: {stringi} (source)
# brackets
result_stri_extract <- stri_extract(text, regex = "(?<=\\[).*?(?=\\])")
# parentheses
extracted_stri <- stringi::stri_extract_first_regex(text2, "\\(.*?\\)")
extracted_stri <- stringi::stri_sub(extracted_stri, 2, -2)
- Similar to Example 2 {stringr}
Extract text after a special character
Example: After a hyphen (source)
library(stringr)
# Example string
string <- "apple-pie"
# Extract substring after the hyphen
result <- str_extract(string, "(?<=-).*")
# result <- stri_extract(string, regex = "(?<=-).*") # stringi
print(result)
#> [1] "pie"
- (?<=-) is a look-behind assertion ensuring the match occurs after a hyphen, and .* matches any character zero or more times.
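When the string contains more than one hyphen, the look-behind pattern returns everything after the first one. The sketch below uses base R on a made-up string to contrast that with taking only the text after the last hyphen. Look-behind is requested with perl = TRUE here; the second pattern reuses the negated-class idea from the file-path example earlier in these notes.

```r
# Illustrative input with two hyphens
string <- "apple-pie-crust"

# Look-behind: match starts right after the first hyphen, and .* is greedy
after_first <- regmatches(string, regexpr("(?<=-).*", string, perl = TRUE))

# Negated class anchored at the end: only the text after the last hyphen
after_last <- regmatches(string, regexpr("[^-]+$", string))

print(after_first)
#> [1] "pie-crust"
print(after_last)
#> [1] "crust"
```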
Extract text before a space
Example: Before the first space (source)
# Sample data
text <- c("John Doe", "Jane Smith", "Alice Johnson")
# Extract strings before the first space
sub("\\s.*", "", text)
#> [1] "John" "Jane" "Alice"
# {stringr}
str_extract(text, "^[^\\s]+")
# {stringi}
stri_extract_first_regex(text, "^[^\\s]+")
- ^ says start at the beginning of the string. Then, [^\\s]+ matches one or more characters that are not whitespace.