Scraping

Misc

  • Packages

    • {rvest}
    • {rselenium}
    • {selenium}
    • {selenider} - Wrapper functions around {chromote} and {selenium} functions that utilize lazy element finding and automatic waiting to make scraping code more reliable
    • {shadowr} - For shadow DOMs
    • {rJavaEnv} - Quickly install Java Development Kit (JDK) without administrative privileges and set environment variables in current R session or project to solve common issues with ‘Java’ environment management in ‘R’.
  • Resources

  • In loops, use Sys.sleep (probably) after EVERY selenium function. Sys.sleep(1) might be all that’s required. ({selenider} fixes this problem)

    • See Projects > foe > gb-level-1_9-thread > scrape-gb-levels.R
    • Might not always be needed, but absolutely need if you’re filling out a form and submitting it.
    • Might even need one at the top of the loop
    • If a Selenium function stops working, adding Sys.sleeps are worth a try.
  • Sometimes clickElement( ) stops working for no apparent reason. When this happens used sendKeysToElement(list("laptops",key="enter"))

  • In batch scripts (.bat), sometimes after a major windows update, the Java that selenium uses will trigger Windows Defender (WD) and cause the scraping script to fail (if you have it scheduled). If you run the .bat script manually and then when the WD box rears its ugly head, just click ignore. WD should remember after that and not to mess with it.

  • RSelenium findElement(using = "") options “class name” : Returns an element whose class name contains the search value; compound class names are not permitted.

    • “css selector” : Returns an element matching a CSS selector.

    • “id” : Returns an element whose ID attribute matches the search value.

    • “name” : Returns an element whose NAME attribute matches the search value.

    • “link text” : Returns an anchor element whose visible text matches the search value.

    • “partial link text” : Returns an anchor element whose visible text partially matches the search value.

    • “tag name” : Returns an element whose tag name matches the search value.

    • “xpath” : Returns an element matching an XPath expression.

Terms

  • Static Web Page: A web page (HTML page) that contains the same information for all users. Although it may be periodically updated, it does not change with each user retrieval.

  • Dynamic Web Page: A web page that provides custom content for the user based on the results of a search or some other request. Also known as “dynamic HTML” or “dynamic content”, the “dynamic” term is used when referring to interactive Web pages created for each user.

rvest

  • Misc

    • Notes from: Pluralsight.Advanced.Web.Scraping.Tactics.R.Playbook
  • Uses css selectors or xpath to find html nodes

    library(rvest)
    page <- read_html("<url>")
    node <- html_element(page, xpath = "<xpath>"
    • Find css selectors
      • selector gadget
        1. click selector gadget app icon in Chrome in upper right assuming you’ve installed it already
        2. click item on webpage you want to scrape
          • it will highlight other items as well
        3. click each item you DON’T want to deselect it
        4. copy the selector name in box at the bottom of webpage
        5. Use html_text to pull text or html_attr to pull a link or something
      • inspect
        1. right-click item on webpage
        2. click inspect
        3. html element should be highlighted in elements tab of right side pan
        4. right-click element –> copy –> copy selector or copy xpath
  • Example: Access data that needs authentication (also see RSelenium version)

    • navigate to login page

      session <- session("<login page url>")
    • Find “forms” for username and password

      form <- html_form(session)[[1]]
      form
      • Evidently there are multiple forms on a webpage. He didn’t give a good explanation for why he chose the first one
      • “session_key” and “session_password” are the ones needed
    • Fill out the necessary parts of the form and send it

      filled_form <- html_form_set(form, session_key = "<username>", session_password = "<password>")
      filled_form # shows values that inputed next the form sections
      log_in <- session_submit(session, filled_form)
    • Confirm that your logged in

      log_in # prints url status = 200, type = text/html, size = 757813 (number of lines of html on page?)
      browseURL(log_in$url) # think this maybe opens browser
  • Example: Filter a football stats table by selecting values from a dropdown menu on a webpage (also see RSelenium version)

    • After set-up and navigating to url, get the forms from the webpage

      forms <- html_form(session)
      forms # prints all the forms
      • The fourth has all the filtering menu categories (team, week, position, year), so that one is chosen
    • Fill out the form to enter the values you want to use to filter the table and submit that form to filter the table

      filled_form <- html_form_set(forms[[4]], "team" = "DAL", "week" = "all", "position" = "QB", "year" = "2017")
      submitted_session <- session_submit(session = session, form = filled_form)
    • Look for the newly filtered table

      tables <- html_elements(submitted_session, "table")
      tables
      • Using inspect, you can see the 2nd one has <table class = “sortable stats-table…etc
    • Select the second table and convert it to a dataframe

      football_df <- html_table(tables[[2]], header = TRUE)
  • Retrieve Sidebar Content (source)

    chrome_session <- ChromoteSession$new()
    
    # Retrieve the sidebar content
    node <- chrome_session$DOM$querySelector(
      nodeId = chrome_session$DOM$getDocument()$root$nodeId,
      selector = ".sidebar"
    )
    
    # Get the outerHTML of the node
    html_content <- chrome_session$DOM$getOuterHTML(
      nodeId = node$nodeId
    )
    
    ## Parse the sidebar content with `rvest` ----
    
    # Pull the node's HTML response
    html_content$outerHTML |>      # Extract the HTML content
      rvest::minimal_html() |>     # Convert to XML document
      rvest::html_elements("a") |> # Obtain all anchor (i.e. links) tags
      rvest::html_text()           # Extract the text from the anchor tags
    • Every element in the sidebar pretty much has a link, so the text can extracted from them.

    • The CSS selector was much longer but he shortened it to “.sidebar”

RSelenium

  • Along with installing package you have to know the version of the browser driver of the browser you’re going to use

    • https://chromedriver.chromium.org/downloads

    • Find Chrome browser version

      • Through console

        system2(command = "wmic",
                args = 'datafile where name="C:\\\\Program Files         (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value')
    • List available Chrome drivers

      binman::list_versions(appname = "chromedriver")
      • If no exact driver version matches your browser version,
        • Each version of the Chrome driver supports Chrome with matching major, minor, and build version numbers.
          • Example: Chrome driver 73.0.3683.20  supports all Chrome versions that start with 73.0.3683
  • Start server and create remote driver

    • a browser will pop up and say “Chrome is being controlled by automated test software”
    library(RSelenium)
    driver <- rsDriver(browser = c("chrome"), chromever = "<driver version>", port = 4571L) # assume the port number is specified by chrome driver ppl.
    remDr <- driver[['client']] # can also use $client
  • Navigate to a webpage

    remDr$navigate("<url>")
  • remDR$maxWindowSize(): Set the size of the browser window to maximum.

    • By default, the browser window size is small, and some elements of the website you navigate to might not be available right away
  • Grab the url of the webpage you’re on

    remDr$getCurrentUrl()
  • Go back and forth between urls

    remDr$goBack()
    remDr$goForward()
  • Find html element (name, id, class name, etc.)

    webpage_element <- remDr$findElement(using = "name", value = "q") 
    • See Misc section for selector options
    • Where “name” is the element class and “q” is the value e.g. name=“q” if you used the inspect method in chrome
    • Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
  • Highlight element in pop-up browser to make sure you have the right thing

    webpage_element$highlightElement()
  • Example: you picked a search bar for your html element and now you want to use the search bar from inside R

    • Enter text into search bar

      webpage_element$sendKeysToElement(list("Scraping the web with R"))
    • Hit enter to execute search

      webpage_element$sendKeysToElement(list(key = "enter"))
      • You are now on the page with the results of the google search
    • Scrape all the links and titles on that page

      webelm_linkTitles <- remDr$findElement(using = "css selector", ".r") 
      • Inspect showed ”

        . Notice he used “.r”. Says it will pick-up all elements with “r” as the class.

    • Get titles

      # first title
      webelm_linkTitles[[1]]$getElementText()
      
      # put them all into a list
      titles <- purrr::map_chr(webelm_linkTitles, ~.x$getElementText())
      titles <- unlist(lapply(
          webelm_linkTitles, 
          function(x) {x$getElementText()}
  • Example: Access data that needs user authentication (also see rvest version)

    • After set-up and navigating to webpage, find elements where you type in your username and password

      webelm_username <- remDr$findElement(using = "id", "Username")
      webelm_pass <- remDr$findElement(using = "id, "Password")
    • Enter username and password

      webpage_username$sendKeysToElement(list("<username>"))
      webpage_pass$sendKeysToElement(list("<password>"))
    • Click sign-in button and click it

      webelm_sbutt <- remDr$findElement(using = "class", "psds-button")
      webelm_sbutt$clickElement()
  • Example: Filter a football stats table by selecting values from a dropdown menu on a webpage (also see rvest version)

    • This is tedious — use rvest to scrape this if possible (have to use rvest at the end anyways). html forms are the stuff.

    • After set-up and navigated to url, find drop down “team” menu element locator using inspect in the browser and use findElement

      webelem_team <- remDr$findElement(using = "name", value = "team") # conveniently has name="team" in the html
      • Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
    • click team dropdown

      webelem_team$clickElement()
    • Go back to inspect in the browser, you should be able to expand the team menu element. Left click value that you want to filter team by to highlight it. Then right click the element and select “copy” –> “copy selector”. Paste selector into value arg

      webelem_DAL <- remDr$findElement(using = "css", value = "edit-filters-0-team > option:nth-child(22)")
      webelem_DAL$clickElement()
      • Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
      • Repeat process for week, position, and year drop down menu filters
    • After you’ve selected all the values in the dropdown, click the submit button to filter the table

      webelem_submit <- remDr$findElement(using = "css", value =     "edit-filters-0-actions-submit") 
      webelem_submit$clickElement()
      • Finds element by using inspect on the submit button and copying the selector
    • Get the url of the html code of the page with the filtered table. Read html code into R with rvest.

      url <- remDr$getPageSource()[[1]]
      html_page <- rvest::read_html(url)
      • If you want the header, getPageSource(header = TRUE)
    • Use rvest to scrape the table. Find the table with the stats

      all_tables <- rvest::html_elements(html_page, "table")
      all_tables
      • Used the “html_elements” version instead of “element”
      • Third one has “<table class =”sortable stats-table full-width blah blah”
    • Save to table to dataframe

      football_df <- rvest::html_table(all_tables[[3]], header = TRUE)

Other Stuff

  • Clicking a semi-infinite scroll button (e.g. “See more”)

    • Example: For-Loop

      # Find Page Element for Body
      webElem <- remDr$findElement("css", "body")
      
      # Page to the End
      for (i in 1:50) {
        message(paste("Iteration",i))
        webElem$sendKeysToElement(list(key = "end"))
      
        # Check for the Show More Button
        element<- try(unlist(
            remDr$findElement(
              "class name",
              "RveJvd")$getElementAttribute('class')), silent = TRUE)
      
        #If Button Is There Then Click It
        Sys.sleep(2)
        if(str_detect(element, "RveJvd") == TRUE){
          buttonElem <- remDr$findElement("class name", "RveJvd")
          buttonElem$clickElement()
        }
      
        # Sleep to Let Things Load
        Sys.sleep(3)
      }
      • article
      • After scrolling to the “end” of the page, there’s a “show me more button” that loads more data on the page
    • Example: Recursive

      load_more <- function(rd) {
        # scroll to end of page
        rd$executeScript("window.scrollTo(0, document.body.scrollHeight);", args = list())
      
        # Find the "Load more" button by its CSS selector and ...
        load_more_button <- rd$findElement(using = "css selector", "button.btn-load.more")
      
        # ... click it
        load_more_button$clickElement()
      
        # give the website a moment to respond
        Sys.sleep(5)
      }
      
      load_page_completely <- function(rd) {
        # load more content even if it throws an error
        tryCatch({
          # call load_more()
          load_more(rd)
          # if no error is thrown, call the load_page_completely() function again
          Recall(rd)
        }, error = function(e) {
          # if an error is thrown return nothing / NULL
        })
      }
      
      load_page_completely(remote_driver)
      • article
      • Recall is a base R function that calls the same function it’s in.
    • Example: While-Loop with scroll height (source)

      progressive_scroll <- function(browser, scroll_step = 100) {
        # Get initial scroll height of the page
        current_height <- browser$executeScript("return document.body.scrollHeight")
      
        # Set a variable for the scrolling position
        scroll_position <- 0
      
        # Continue scrolling until the end of the page
        while (scroll_position < current_height) {
          # Scroll down by 'scroll_step' pixels
          browser$executeScript(paste0("window.scrollBy(0,", scroll_step, ");"))
      
          Sys.sleep(runif(1, max = 0.2)) # Wait for the content to load (adjust this if the page is slower to load)
          scroll_position <- scroll_position + scroll_step # Update the scroll position
          current_height <- browser$executeScript("return document.body.scrollHeight") # Get the updated scroll height after scrolling (in case more content is loaded)
        }
      }
      
      # Scroll the ECB page to ensure all dynamic content is visible
      progressive_scroll(browser, scroll_step = 1000)
  • Shadow DOM elements

    • #shadow-root and shadow dom button elements

    • Misc

      • Two options: {shadowr} or JS script
    • Example: Use {shadowr}

      • My stackoverflow post

      • Set-up

        pacman::p_load(RSelenium, shadowr)
        driver <- rsDriver(browser = c("chrome"), chromever = chrome_driver_version)
        # chrome browser
        chrome <- driver$client
        shadow_rd <- shadow(chrome)
      • Find web element

        • Search for element using html tag
        wisc_dl_panel_button4 <- shadowr::find_elements(shadow_rd, 'calcite-button')
        wisc_dl_panel_button4[[1]]$clickElement()
        • Shows web element located in #shadow-root
        • Since there might be more than one element with the “calcite-button” html tag, we use the plural, find_elements, instead of find_element
        • There’s only 1 element returned, so we use [[1]] index to subset the list before clicking it
      • Search for web element by html tag and attribute

        wisc_dl_panel_button3 <- find_elements(shadow_rd, 'button[aria-describedby*="tooltip"]')
        wisc_dl_panel_button3[[3]]$clickElement()
        • “button” is the html tag which is subsetted by the brackets, and “aria-describedby” is the attribute
        • Only part of the attribute’s value is used, “tooltip,” so I think that’s why “*=” instead of just “=” is used. I believe the “*” may indicate partial-matching.
        • Since there might be more than one element with this  html tag + attribute combo, we use the plural, find_elements, instead of find_element
        • There are 3 elements returned, so we use [[3]] index to subset the list to element we want before clicking it
    • Example: Use a JS script and some webelement hacks to get a clickable element

      • Misc
        • “.class_name”
          • fill in spaces with periods
            • “.btn btn-default hidden-xs” becomes “.btn.btn-default.hidden-xs”
      • You can find the element path to use in your JS script by going step by step with JS commands in the Chrome console (bottom window)
      • Steps
        • Write JS script to get clickable element’s elementId

          1. Start with element right above first shadow-root element and use querySelector

          2. Move to the next element inside the next shadow-root element using shadowRoot.querySelector

          3. Continue to desired clickable element

            • If there’s isn’t another shadow-root that you have to open, then the next element can be selected usingquerySelector
            • If you do have to click on another shadow-root element to open another branch, then used shadowRoot.querySelector
            • Example
              • “hub-download-card” is just above shadow-root so it needs querySelector
              • “calcite-card” is an element that’s one-step removed from shadow-root, so it needs shadowRoot.querySelector
              • “calcite-dropdown” (type = “click”) is not directly (see div) next to shadow-root , so it can selected using querySelector
          4. Write and execute JS script

            wisc_dlopts_elt_id <- chrome$executeScript("return document.querySelector('hub-download-card').shadowRoot.querySelector('calcite-card').querySelector('calcite-dropdown');")
        • Make a clickable element or just click the damn thing

          • clickable element (sometimes this doesn’t work; needs to be a button or type=click)

            1. Use findElement to find a generic element class object that you can manipulate
            2. Use “@” ninja-magic to force elementId into the generic webElement to coerce it into your button element
            3. Use clickElement to click the button
            # think this is a generic element that can always be used
            moose <- chrome$findElement("css", "html")
            moose@.xData$elementId <- as.character(wisc_dlopts_elt_id)
            moose$clickElement()
        • Click the button

          chrome$executeScript("document.querySelector('hub-download-card').shadowRoot.querySelector('calcite-card').querySelector('calcite-dropdown').querySelector('calcite-dropdown-group').querySelector('calcite-dropdown-item:nth-child(2)').click()")
  • Get data from a hidden input

    • article

    • HTML Element

      <input type="hidden" id="overview-about-text" value="%3Cp%3E100%25%20Plant-Derived%20Squalane%20hydrates%20your%20skin%20while%20supporting%20its%20natural%20moisture%20barrier.%20Squalane%20is%20an%20exceptional%20hydrator%20found%20naturally%20in%20the%20skin,%20and%20this%20formula%20uses%20100%25%20plant-derived%20squalane%20derived%20from%20sugar%20cane%20for%20a%20non-comedogenic%20solution%20that%20enhances%20surface-level%20hydration.%3Cbr%3E%3Cbr%3EOur%20100%25%20Plant-Derived%20Squalane%20formula%20can%20also%20be%20used%20in%20hair%20to%20increase%20heat%20protection,%20add%20shine,%20and%20reduce%20breakage.%3C/p%3E">
    • Extract value and decode the text

      overview_text <- webpage |>
        html_element("#overview-about-text") |>
        html_attr("value") |>
        URLdecode() |>
        read_html() |>
        html_text()
      
      overview_text
      #> [1] "100% Plant-Derived Squalane hydrates your skin while supporting its natural moisture barrier.

JS

  • Highlight an element on the page (source)

    chrome_session <- ChromoteSession$new()
    
    # Launch chrome to view actions taken in the browser
    chrome_session$view()
    
    # Get the browser's version
    chrome_session$Browser$getVersion()
    
    # Open a new tab and navigate to a URL
    chrome_session$Page$navigate("https://www.r-project.org/")
    
    chrome_session$Runtime$evaluate(
      expression = "
      // Find the element
      element = document.querySelector('.sidebar');
    
      // Highlight it
      element.style.backgroundColor = 'yellow';
      element.style.border = '2px solid red';
      "
    )
    
    # Wait for the action to complete
    Sys.sleep(0.5)
    
    # Take a screenshot of the highlighted element
    chrome_session$screenshot("r-project-sidebar.png", selector = ".sidebar")
    
    # View the screenshot
    browseURL("r-project-sidebar.png")
    • The css selector was some long string, but he shortened it to “.sidebar”
  • Search and extract table (source)

    # Start a new browser tab session
    chrome_session_windy <- chrome_session$new_session()
    
    # Open a new tab in the current browser
    chrome_session_windy$view()
    
    ## Navigate to windy.com ----
    
    # Navigate to windy.com
    chrome_session_windy$Page$navigate("https://www.windy.com")
    
    # Wait for the page to load
    Sys.sleep(0.5) 
    # First focus the input field
    chrome_session_windy$Runtime$evaluate('
      document.querySelector("#q").focus();
    ')
    
    # Brief pause to ensure focus is complete
    Sys.sleep(0.5) 
    
    # Enter search term and trigger search
    search_query <- 'Stanford University Museum of Art'
    chrome_session_windy$Runtime$evaluate(
      expression = sprintf('{
        // Get the search input
        const searchInput = document.getElementById("q");
        searchInput.value = "%s";
    
        // Focus the input
        searchInput.focus();
    
        // Trigger input event
        const inputEvent = new Event("input", { bubbles: true });
        searchInput.dispatchEvent(inputEvent);
    
        // Trigger change event
        const changeEvent = new Event("change", { bubbles: true });
        searchInput.dispatchEvent(changeEvent);
    
        // Force the search to update - this triggers the site\'s search logic
        const keyupEvent = new KeyboardEvent("keyup", {
          key: "a",
          code: "KeyA",
          keyCode: 65,
          bubbles: true
        });
        searchInput.dispatchEvent(keyupEvent);
      }', search_query)
    )
    # Wait for and, then, click the first search result
    Sys.sleep(0.5) 
    chrome_session_windy$Runtime$evaluate('
        document.querySelector(".results-data a").click();
      ')
    
    ## Extract weather data ----
    
    # Wait for and, then, extract the weather data table
    Sys.sleep(0.5) 
    html <- chrome_session_windy$Runtime$evaluate('
        document.querySelector("table#detail-data-table").outerHTML
      ')$result$value
    
    ## Parse the table using `rvest` ----
    raw_weather_table <- html |>
      read_html() |> 
      html_node('table') |> # Select the table to extract it without getting a node set
      html_table() |> # Convert the table to a data frame
      as.data.frame()
    
    raw_weather_table
    • %s is replaced by search_query

    • #detail-data-table is the CSS selector but he added table in front of that — not sure why. Without the “#”, this is also the table id in the HTML, so maybe because it’s a table id (?).

    • Feel like this might’ve been solved by RSelenium and the JS wasn’t necessary