Scraping
Misc
Packages
- {rvest}
- {RSelenium}
- {selenium}
- {selenider} - Wrapper functions around {chromote} and {selenium} functions that utilize lazy element finding and automatic waiting to make scraping code more reliable
- {shadowr} - For shadow DOMs
- {rJavaEnv} - Quickly install Java Development Kit (JDK) without administrative privileges and set environment variables in current R session or project to solve common issues with ‘Java’ environment management in ‘R’.
Resources
In loops, use Sys.sleep (probably) after EVERY selenium function. Sys.sleep(1) might be all that’s required. ({selenider} fixes this problem)
- See Projects > foe > gb-level-1_9-thread > scrape-gb-levels.R
- Might not always be needed, but absolutely needed if you’re filling out a form and submitting it.
- Might even need one at the top of the loop
- If a Selenium function stops working, adding Sys.sleep calls is worth a try.
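A minimal sketch of the pause-after-every-call idea, assuming a hypothetical helper named with_pause (not part of RSelenium). The fake_click function stands in for a real Selenium call so the sketch runs without a browser.

```r
# Hypothetical helper: run one step, then pause so the page can catch up
with_pause <- function(step, seconds = 1) {
  result <- step()     # run the Selenium call (or any function)
  Sys.sleep(seconds)   # give the browser time to respond
  result
}

# Stand-in for a Selenium call such as webElem$clickElement()
fake_click <- function() "clicked"

results <- character(0)
for (i in 1:3) {
  results[i] <- with_pause(fake_click, seconds = 0.1)
}
results
```

With a real driver, the step would be something like function() webElem$clickElement(), and the pause length would need tuning per site.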
Sometimes clickElement( ) stops working for no apparent reason. When this happens, use sendKeysToElement(list("laptops", key = "enter"))
In batch scripts (.bat), a major Windows update can sometimes cause the Java that Selenium uses to trigger Windows Defender (WD), which makes the scraping script fail (if you have it scheduled). Run the .bat script manually and, when the WD box rears its ugly head, click ignore. WD should remember the choice after that and leave the script alone.
RSelenium
findElement(using = "") options
“class name” : Returns an element whose class name contains the search value; compound class names are not permitted.
“css selector” : Returns an element matching a CSS selector.
“id” : Returns an element whose ID attribute matches the search value.
“name” : Returns an element whose NAME attribute matches the search value.
“link text” : Returns an anchor element whose visible text matches the search value.
“partial link text” : Returns an anchor element whose visible text partially matches the search value.
“tag name” : Returns an element whose tag name matches the search value.
“xpath” : Returns an element matching an XPath expression.
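To get a feel for how these locator strategies relate to one another without starting a Selenium server, the sketch below finds the same element in several ways with {rvest}. The HTML snippet and its attribute values are made up for illustration, and the CSS selectors are only rough equivalents of the "id", "name", "class name", and "tag name" options.

```r
library(rvest)

# Made-up page with one input that can be located several ways
page <- minimal_html('
  <form>
    <input id="search-box" name="q" class="search wide">
    <a href="/about">About us</a>
  </form>
')

by_id    <- html_element(page, "#search-box")                  # like using = "id"
by_name  <- html_element(page, '[name="q"]')                   # like using = "name"
by_class <- html_element(page, ".search")                      # like using = "class name"
by_tag   <- html_element(page, "input")                        # like using = "tag name"
by_xpath <- html_element(page, xpath = '//input[@name="q"]')   # like using = "xpath"

# Every strategy lands on the same element
found_ids <- vapply(
  list(by_id, by_name, by_class, by_tag, by_xpath),
  html_attr, character(1), name = "id"
)
found_ids
```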
Terms
Static Web Page: A web page (HTML page) that contains the same information for all users. Although it may be periodically updated, it does not change with each user retrieval.
Dynamic Web Page: A web page that provides custom content for the user based on the results of a search or some other request. Also known as “dynamic HTML” or “dynamic content”, the “dynamic” term is used when referring to interactive Web pages created for each user.
rvest
Misc
- Notes from: Pluralsight.Advanced.Web.Scraping.Tactics.R.Playbook
Uses css selectors or xpath to find html nodes
library(rvest)
page <- read_html("<url>")
node <- html_element(page, xpath = "<xpath>")
- Find css selectors
- selector gadget
- click selector gadget app icon in Chrome in upper right assuming you’ve installed it already
- click item on webpage you want to scrape
- it will highlight other items as well
- click each item you DON’T want to deselect it
- copy the selector name in box at the bottom of webpage
- Use html_text to pull text or html_attr to pull a link or something
- inspect
- right-click item on webpage
- click inspect
- html element should be highlighted in the elements tab of the right-side pane
- right-click element –> copy –> copy selector or copy xpath
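As a small illustration of pulling text versus pulling an attribute once you have a selector, here is a sketch that runs against a made-up HTML snippet rather than a live site. The ".product" selector and the links are assumptions for the example.

```r
library(rvest)

# Tiny made-up page standing in for a real site
page <- minimal_html('
  <ul>
    <li><a class="product" href="/laptops/1">Laptop A</a></li>
    <li><a class="product" href="/laptops/2">Laptop B</a></li>
  </ul>
')

# ".product" is the kind of selector SelectorGadget or "copy selector" gives you
nodes <- html_elements(page, ".product")

product_names <- html_text(nodes)           # visible text of each node
product_links <- html_attr(nodes, "href")   # value of the href attribute
```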
Example: Access data that needs authentication (also see RSelenium version)
navigate to login page
session <- session("<login page url>")
Find “forms” for username and password
form <- html_form(session)[[1]]
form
- Evidently there are multiple forms on a webpage. He didn’t give a good explanation for why he chose the first one
- “session_key” and “session_password” are the ones needed
Fill out the necessary parts of the form and send it
filled_form <- html_form_set(form, session_key = "<username>", session_password = "<password>")
filled_form # shows the values that were entered next to the form fields
log_in <- session_submit(session, filled_form)
Confirm that you’re logged in
log_in # prints url, status = 200, type = text/html, size = 757813 (size of the response in bytes)
browseURL(log_in$url) # opens the url in your default browser
Example: Filter a football stats table by selecting values from a dropdown menu on a webpage (also see RSelenium version)
After set-up and navigating to url, get the forms from the webpage
forms <- html_form(session)
forms # prints all the forms
- The fourth has all the filtering menu categories (team, week, position, year), so that one is chosen
Fill out the form to enter the values you want to use to filter the table and submit that form to filter the table
filled_form <- html_form_set(forms[[4]], "team" = "DAL", "week" = "all", "position" = "QB", "year" = "2017")
submitted_session <- session_submit(session = session, form = filled_form)
Look for the newly filtered table
tables <- html_elements(submitted_session, "table")
tables
- Using inspect, you can see the 2nd one has <table class="sortable stats-table…etc
Select the second table and convert it to a dataframe
football_df <- html_table(tables[[2]], header = TRUE)
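The table-selection step can be tried offline. The sketch below uses a made-up page with two tables (the player names and numbers are invented) and shows picking the stats table by position, as above, and by class as an alternative that may hold up better if the page layout changes.

```r
library(rvest)

# Made-up page: a navigation table followed by the stats table
page <- minimal_html('
  <table class="nav"><tr><td>Home</td></tr></table>
  <table class="sortable stats-table">
    <tr><th>Player</th><th>Yards</th></tr>
    <tr><td>QB One</td><td>250</td></tr>
    <tr><td>QB Two</td><td>310</td></tr>
  </table>
')

tables <- html_elements(page, "table")
n_tables <- length(tables)

# Pick the table by position, as in the notes ...
stats_by_index <- html_table(tables[[2]], header = TRUE)

# ... or by class, which does not depend on how many tables come before it
stats_by_class <- html_table(html_element(page, "table.stats-table"), header = TRUE)
```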
Retrieve Sidebar Content (source)
chrome_session <- ChromoteSession$new()

# Retrieve the sidebar content
node <- chrome_session$DOM$querySelector(
  nodeId = chrome_session$DOM$getDocument()$root$nodeId,
  selector = ".sidebar"
)

# Get the outerHTML of the node
html_content <- chrome_session$DOM$getOuterHTML(
  nodeId = node$nodeId
)

## Parse the sidebar content with `rvest` ----
html_content$outerHTML |>    # Pull the node's HTML response
  rvest::minimal_html() |>   # Convert to XML document
  rvest::html_elements("a") |>   # Obtain all anchor (i.e. links) tags
  rvest::html_text()   # Extract the text from the anchor tags
Pretty much every element in the sidebar has a link, so the text can be extracted from the anchor tags.
The CSS selector was much longer but he shortened it to “.sidebar”
RSelenium
Along with installing the package, you have to know which version of the browser driver matches the browser you’re going to use
https://chromedriver.chromium.org/downloads
Find Chrome browser version
Through console
system2(command = "wmic", args = 'datafile where name="C:\\\\Program Files (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value')
List available Chrome drivers
binman::list_versions(appname = "chromedriver")
- If no exact driver version matches your browser version, use a driver whose major, minor, and build numbers match
- Each version of the Chrome driver supports Chrome with matching major, minor, and build version numbers.
- Example: Chrome driver 73.0.3683.20 supports all Chrome versions that start with 73.0.3683
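The matching rule can be sketched in base R. The helper names version_prefix and match_driver are hypothetical, the version strings are made up, and taking the highest patch number among the matching drivers is an assumption rather than something the Chrome driver documentation requires.

```r
# Hypothetical helper: keep major.minor.build and drop the patch number
version_prefix <- function(version) {
  sub("^([0-9]+\\.[0-9]+\\.[0-9]+)\\..*$", "\\1", version)
}

# Hypothetical helper: pick a driver whose major.minor.build matches the browser
match_driver <- function(browser_version, driver_versions) {
  candidates <- driver_versions[
    version_prefix(driver_versions) == version_prefix(browser_version)
  ]
  if (length(candidates) == 0) return(NA_character_)
  # Assumption: prefer the highest patch among the matching drivers
  as.character(max(numeric_version(candidates)))
}

# Made-up version strings for illustration
drivers <- c("72.0.3626.69", "73.0.3683.20", "73.0.3683.68", "74.0.3729.6")
chosen <- match_driver("73.0.3683.103", drivers)
chosen
```

The result could then be passed as the chromever argument of rsDriver.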
Start server and create remote driver
- a browser will pop up and say “Chrome is being controlled by automated test software”
library(RSelenium)
driver <- rsDriver(browser = c("chrome"), chromever = "<driver version>", port = 4571L) # assume the port number is specified by chrome driver ppl.
remDr <- driver[['client']] # can also use $client
Navigate to a webpage
remDr$navigate("<url>")
remDr$maxWindowSize(): Set the size of the browser window to maximum.
- By default, the browser window size is small, and some elements of the website you navigate to might not be available right away
Grab the url of the webpage you’re on
remDr$getCurrentUrl()
Go back and forth between urls
remDr$goBack()
remDr$goForward()
Find html element (name, id, class name, etc.)
webpage_element <- remDr$findElement(using = "name", value = "q")
- See Misc section for selector options
- Where “name” is the locator strategy (here, the element’s name attribute) and “q” is the value, e.g. name=“q” if you used the inspect method in chrome
- Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
Highlight element in pop-up browser to make sure you have the right thing
webpage_element$highlightElement()
Example: you picked a search bar for your html element and now you want to use the search bar from inside R
Enter text into search bar
webpage_element$sendKeysToElement(list("Scraping the web with R"))
Hit enter to execute search
webpage_element$sendKeysToElement(list(key = "enter"))
- You are now on the page with the results of the google search
Scrape all the links and titles on that page
# findElements (plural) returns a list with every matching element
webelm_linkTitles <- remDr$findElements(using = "css selector", ".r")
Inspect showed the elements carry the class “r”. Notice he used “.r”, which picks up all elements with “r” as the class.
Get titles
# first title
webelm_linkTitles[[1]]$getElementText()
# put them all into a character vector (getElementText returns a list, so take its first item)
titles <- purrr::map_chr(webelm_linkTitles, ~ .x$getElementText()[[1]])
# base R alternative
titles <- unlist(lapply(webelm_linkTitles, function(x) {x$getElementText()}))
Example: Access data that needs user authentication (also see rvest version)
After set-up and navigating to webpage, find elements where you type in your username and password
webelm_username <- remDr$findElement(using = "id", "Username")
webelm_pass <- remDr$findElement(using = "id", "Password")
Enter username and password
webelm_username$sendKeysToElement(list("<username>"))
webelm_pass$sendKeysToElement(list("<password>"))
Find the sign-in button and click it
webelm_sbutt <- remDr$findElement(using = "class name", "psds-button")
webelm_sbutt$clickElement()
Example: Filter a football stats table by selecting values from a dropdown menu on a webpage (also see rvest version)
This is tedious — use rvest to scrape this if possible (have to use rvest at the end anyways). html forms are the stuff.
After set-up and navigating to the url, find the “team” drop-down menu’s element locator using inspect in the browser and use findElement
webelem_team <- remDr$findElement(using = "name", value = "team") # conveniently has name="team" in the html
- Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
click team dropdown
webelem_team$clickElement()
Go back to inspect in the browser, you should be able to expand the team menu element. Left click value that you want to filter team by to highlight it. Then right click the element and select “copy” –> “copy selector”. Paste selector into value arg
# the leading "#" (as produced by "copy selector") marks an id selector
webelem_DAL <- remDr$findElement(using = "css selector", value = "#edit-filters-0-team > option:nth-child(22)")
webelem_DAL$clickElement()
- Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
- Repeat process for week, position, and year drop down menu filters
After you’ve selected all the values in the dropdown, click the submit button to filter the table
# the leading "#" (as produced by "copy selector") marks an id selector
webelem_submit <- remDr$findElement(using = "css selector", value = "#edit-filters-0-actions-submit")
webelem_submit$clickElement()
- Finds element by using inspect on the submit button and copying the selector
Get the HTML source of the page with the filtered table and read it into R with rvest.
page_source <- remDr$getPageSource()[[1]]
html_page <- rvest::read_html(page_source)
- If you want the header, use getPageSource(header = TRUE)
Use rvest to scrape the table. Find the table with the stats
all_tables <- rvest::html_elements(html_page, "table")
all_tables
- Used html_elements (plural) instead of html_element so that every table is returned
- Third one has <table class="sortable stats-table full-width blah blah"
Save the table to a dataframe
football_df <- rvest::html_table(all_tables[[3]], header = TRUE)
Other Stuff
Clicking a semi-infinite scroll button (e.g. “See more”)
Example: For-Loop
# Find Page Element for Body
webElem <- remDr$findElement("css", "body")

# Page to the End
for (i in 1:50) {
  message(paste("Iteration", i))
  webElem$sendKeysToElement(list(key = "end"))
  # Check for the Show More Button
  element <- try(unlist(
    remDr$findElement(
      "class name", "RveJvd")$getElementAttribute('class')),
    silent = TRUE)
  # If Button Is There Then Click It
  Sys.sleep(2)
  if (stringr::str_detect(element, "RveJvd") == TRUE) {
    buttonElem <- remDr$findElement("class name", "RveJvd")
    buttonElem$clickElement()
  }
  # Sleep to Let Things Load
  Sys.sleep(3)
}
- article
- After scrolling to the “end” of the page, there’s a “show me more button” that loads more data on the page
Example: Recursive
load_more <- function(rd) {
  # scroll to end of page
  rd$executeScript("window.scrollTo(0, document.body.scrollHeight);", args = list())
  # Find the "Load more" button by its CSS selector and ...
  load_more_button <- rd$findElement(using = "css selector", "button.btn-load.more")
  # ... click it
  load_more_button$clickElement()
  # give the website a moment to respond
  Sys.sleep(5)
}

load_page_completely <- function(rd) {
  # load more content even if it throws an error
  tryCatch({
    # call load_more()
    load_more(rd)
    # if no error is thrown, call the load_page_completely() function again
    Recall(rd)
  }, error = function(e) {
    # if an error is thrown return nothing / NULL
  })
}

load_page_completely(remote_driver)
Example: While-Loop with scroll height (source)
progressive_scroll <- function(browser, scroll_step = 100) {
  # Get initial scroll height of the page
  current_height <- browser$executeScript("return document.body.scrollHeight")
  # Set a variable for the scrolling position
  scroll_position <- 0
  # Continue scrolling until the end of the page
  while (scroll_position < current_height) {
    # Scroll down by 'scroll_step' pixels
    browser$executeScript(paste0("window.scrollBy(0,", scroll_step, ");"))
    # Wait for the content to load (adjust this if the page is slower to load)
    Sys.sleep(runif(1, max = 0.2))
    # Update the scroll position
    scroll_position <- scroll_position + scroll_step
    # Get the updated scroll height after scrolling (in case more content is loaded)
    current_height <- browser$executeScript("return document.body.scrollHeight")
  }
}

# Scroll the ECB page to ensure all dynamic content is visible
progressive_scroll(browser, scroll_step = 1000)
Shadow DOM elements
#shadow-root and shadow dom button elements
Misc
- Two options: {shadowr} or JS script
Example: Use {shadowr}
My stackoverflow post
Set-up
pacman::p_load(RSelenium, shadowr)
driver <- rsDriver(browser = c("chrome"), chromever = chrome_driver_version)
# chrome browser
chrome <- driver$client
shadow_rd <- shadow(chrome)
Find web element
- Search for element using html tag
wisc_dl_panel_button4 <- shadowr::find_elements(shadow_rd, 'calcite-button')
wisc_dl_panel_button4[[1]]$clickElement()
- Shows web element located in #shadow-root
- Since there might be more than one element with the “calcite-button” html tag, we use the plural, find_elements, instead of find_element
- There’s only 1 element returned, so we use the [[1]] index to subset the list before clicking it
Search for web element by html tag and attribute
wisc_dl_panel_button3 <- find_elements(shadow_rd, 'button[aria-describedby*="tooltip"]')
wisc_dl_panel_button3[[3]]$clickElement()
- “button” is the html tag, which is narrowed down by the condition in the brackets, and “aria-describedby” is the attribute
- Only part of the attribute’s value is used, “tooltip,” which is why “*=” is used instead of just “=”. In CSS, “*=” matches any attribute value that contains the given substring, while “=” requires an exact match.
- Since there might be more than one element with this html tag + attribute combo, we use the plural, find_elements, instead of find_element
- There are 3 elements returned, so we use the [[3]] index to subset the list to the element we want before clicking it
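The difference between “=” and “*=” can be checked offline with {rvest} on a made-up snippet. The button labels and attribute values below are invented for the example.

```r
library(rvest)

# Made-up buttons with different aria-describedby values
page <- minimal_html('
  <button aria-describedby="calcite-tooltip-1">Download</button>
  <button aria-describedby="help-text">Help</button>
  <button aria-describedby="tooltip">Share</button>
')

# "=" needs the whole attribute value to match
exact_labels <- html_text(html_elements(page, 'button[aria-describedby="tooltip"]'))

# "*=" matches any value that contains the substring
partial_labels <- html_text(html_elements(page, 'button[aria-describedby*="tooltip"]'))
```

Here the exact selector should return only "Share", while the partial one should also pick up "Download".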
Example: Use a JS script and some webelement hacks to get a clickable element
- Misc
- “.class_name”
- fill in spaces with periods
- “.btn btn-default hidden-xs” becomes “.btn.btn-default.hidden-xs”
- You can find the element path to use in your JS script by going step by step with JS commands in the Chrome console (bottom window)
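The spaces-to-periods rule for class names can be wrapped in a small base R helper. The function name class_to_selector is hypothetical, and the sketch assumes the input is the raw class attribute without a leading period.

```r
# Hypothetical helper: turn a class attribute into a compound CSS class selector
class_to_selector <- function(class_attr) {
  classes <- strsplit(trimws(class_attr), "[[:space:]]+")[[1]]
  paste0(".", classes, collapse = "")
}

selector <- class_to_selector("btn btn-default hidden-xs")
selector
```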
- Steps
Write JS script to get clickable element’s elementId
Start with the element right above the first shadow-root element and use querySelector
Move to the next element inside the next shadow-root element using shadowRoot.querySelector
Continue to the desired clickable element
- If there isn’t another shadow-root that you have to open, then the next element can be selected using querySelector
- If you do have to click on another shadow-root element to open another branch, then use shadowRoot.querySelector
- Example
  - “hub-download-card” is just above shadow-root so it needs querySelector
  - “calcite-card” is an element that’s one step removed from shadow-root, so it needs shadowRoot.querySelector
  - “calcite-dropdown” (type = “click”) is not directly (see div) next to shadow-root, so it can be selected using querySelector
Write and execute JS script
wisc_dlopts_elt_id <- chrome$executeScript("return document.querySelector('hub-download-card').shadowRoot.querySelector('calcite-card').querySelector('calcite-dropdown');")
Make a clickable element or just click the damn thing
clickable element (sometimes this doesn’t work; needs to be a button or type=click)
- Use findElement to find a generic element class object that you can manipulate
- Use “@” ninja-magic to force elementId into the generic webElement to coerce it into your button element
- Use clickElement to click the button
# think this is a generic element that can always be used
moose <- chrome$findElement("css", "html")
moose@.xData$elementId <- as.character(wisc_dlopts_elt_id)
moose$clickElement()
Click the button
chrome$executeScript("document.querySelector('hub-download-card').shadowRoot.querySelector('calcite-card').querySelector('calcite-dropdown').querySelector('calcite-dropdown-group').querySelector('calcite-dropdown-item:nth-child(2)').click()")
Get data from a hidden input
HTML Element
<input type="hidden" id="overview-about-text" value="%3Cp%3E100%25%20Plant-Derived%20Squalane%20hydrates%20your%20skin%20while%20supporting%20its%20natural%20moisture%20barrier.%20Squalane%20is%20an%20exceptional%20hydrator%20found%20naturally%20in%20the%20skin,%20and%20this%20formula%20uses%20100%25%20plant-derived%20squalane%20derived%20from%20sugar%20cane%20for%20a%20non-comedogenic%20solution%20that%20enhances%20surface-level%20hydration.%3Cbr%3E%3Cbr%3EOur%20100%25%20Plant-Derived%20Squalane%20formula%20can%20also%20be%20used%20in%20hair%20to%20increase%20heat%20protection,%20add%20shine,%20and%20reduce%20breakage.%3C/p%3E">
Extract value and decode the text
overview_text <- webpage |>
  html_element("#overview-about-text") |>
  html_attr("value") |>
  URLdecode() |>
  read_html() |>
  html_text()
overview_text
#> [1] "100% Plant-Derived Squalane hydrates your skin while supporting its natural moisture barrier.
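The same decode chain can be stepped through on a tiny made-up input to see what each stage produces. The id "about-text" and the encoded value are invented for the example.

```r
library(rvest)

# Made-up hidden input holding percent-encoded HTML
webpage <- minimal_html('
  <input type="hidden" id="about-text"
         value="%3Cp%3EHello%20%3Cb%3Eworld%3C/b%3E%3C/p%3E">
')

# Step 1: pull the raw, still-encoded attribute value
encoded <- webpage |>
  html_element("#about-text") |>
  html_attr("value")

# Step 2: percent-decode it back into an HTML string
decoded_html <- URLdecode(encoded)

# Step 3: parse that HTML and strip the tags
plain_text <- decoded_html |>
  read_html() |>
  html_text()
```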
JS
Highlight an element on the page (source)
chrome_session <- ChromoteSession$new()
# Launch chrome to view actions taken in the browser
chrome_session$view()
# Get the browser's version
chrome_session$Browser$getVersion()
# Open a new tab and navigate to a URL
chrome_session$Page$navigate("https://www.r-project.org/")
chrome_session$Runtime$evaluate(
  expression = "
  // Find the element
  element = document.querySelector('.sidebar');
  // Highlight it
  element.style.backgroundColor = 'yellow';
  element.style.border = '2px solid red';
  "
)
# Wait for the action to complete
Sys.sleep(0.5)
# Take a screenshot of the highlighted element
chrome_session$screenshot("r-project-sidebar.png", selector = ".sidebar")
# View the screenshot
browseURL("r-project-sidebar.png")
- The css selector was some long string, but he shortened it to “.sidebar”
Search and extract table (source)
# Start a new browser tab session
chrome_session_windy <- chrome_session$new_session()
# Open a new tab in the current browser
chrome_session_windy$view()

## Navigate to windy.com ----
# Navigate to windy.com
chrome_session_windy$Page$navigate("https://www.windy.com")
# Wait for the page to load
Sys.sleep(0.5)
# First focus the input field
chrome_session_windy$Runtime$evaluate('
  document.querySelector("#q").focus();
')
# Brief pause to ensure focus is complete
Sys.sleep(0.5)
# Enter search term and trigger search
search_query <- 'Stanford University Museum of Art'
chrome_session_windy$Runtime$evaluate(
  expression = sprintf('{
    // Get the search input
    const searchInput = document.getElementById("q");
    searchInput.value = "%s";
    // Focus the input
    searchInput.focus();
    // Trigger input event
    const inputEvent = new Event("input", { bubbles: true });
    searchInput.dispatchEvent(inputEvent);
    // Trigger change event
    const changeEvent = new Event("change", { bubbles: true });
    searchInput.dispatchEvent(changeEvent);
    // Force the search to update - this triggers the site\'s search logic
    const keyupEvent = new KeyboardEvent("keyup", {
      key: "a",
      code: "KeyA",
      keyCode: 65,
      bubbles: true
    });
    searchInput.dispatchEvent(keyupEvent);
  }', search_query)
)
# Wait for and, then, click the first search result
Sys.sleep(0.5)
chrome_session_windy$Runtime$evaluate('
  document.querySelector(".results-data a").click();
')

## Extract weather data ----
# Wait for and, then, extract the weather data table
Sys.sleep(0.5)
html <- chrome_session_windy$Runtime$evaluate('
  document.querySelector("table#detail-data-table").outerHTML
')$result$value

## Parse the table using `rvest` ----
raw_weather_table <- html |>
  read_html() |>
  html_node('table') |>   # Select the table to extract it without getting a node set
  html_table() |>   # Convert the table to a data frame
  as.data.frame()
raw_weather_table
%s is replaced by search_query
#detail-data-table is the CSS selector for the element whose id is “detail-data-table”. He added table in front of it, and table#detail-data-table means “a table element with that id”. Since ids are supposed to be unique on a page, the tag prefix is redundant but harmless.
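A quick offline check of the tag-plus-id selector with {rvest}, using a made-up snippet that reuses the same id. It shows that "#detail-data-table" and "table#detail-data-table" find the same node, while a selector with the wrong tag finds nothing.

```r
library(rvest)

# Made-up page containing a table with the id from the notes
page <- minimal_html('
  <div id="summary">Not a table</div>
  <table id="detail-data-table">
    <tr><th>Hour</th></tr>
    <tr><td>9</td></tr>
  </table>
')

by_id         <- html_elements(page, "#detail-data-table")
by_tag_and_id <- html_elements(page, "table#detail-data-table")
wrong_tag     <- html_elements(page, "div#detail-data-table")

match_counts <- c(length(by_id), length(by_tag_and_id), length(wrong_tag))
match_counts
```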
Feel like this might’ve been solved by RSelenium and the JS wasn’t necessary