Scraping

Misc

  • Packages

    • {rvest}
    • {RSelenium}
    • {selenium}
    • {selenider} - Wrapper functions around {chromote} and {selenium} functions that utilize lazy element finding and automatic waiting to make scraping code more reliable
    • {shadowr} - For shadow DOMs
  • Resources

  • In loops, use Sys.sleep (probably) after EVERY selenium function. Sys.sleep(1) might be all that’s required. ({selenider} fixes this problem)

    • See Projects > foe > gb-level-1_9-thread > scrape-gb-levels.R
    • Might not always be needed, but it’s absolutely needed if you’re filling out a form and submitting it.
    • Might even need one at the top of the loop
    • If a Selenium function stops working, adding Sys.sleep calls is worth a try (see the sketch below).
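    • A minimal sketch of the pattern, assuming an existing remote driver remDr and a hypothetical search box (name=“q”):

      for (term in c("laptops", "tablets")) {
        Sys.sleep(1) # pause at the top of the loop
        search_box <- remDr$findElement(using = "name", value = "q")
        Sys.sleep(1)
        search_box$sendKeysToElement(list(term, key = "enter"))
        Sys.sleep(1) # let the page respond before the next iteration
      }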
  • Sometimes clickElement() stops working for no apparent reason. When this happens, use sendKeysToElement(list("laptops", key = "enter")) instead (see the sketch below).
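    • A minimal sketch of the fallback, assuming the element is a search box (the locator is hypothetical):

      search_box <- remDr$findElement(using = "name", value = "q")
      search_box$sendKeysToElement(list("laptops", key = "enter"))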

  • In batch scripts (.bat), sometimes after a major Windows update, the Java that Selenium uses will trigger Windows Defender (WD) and cause the scraping script to fail (if you have it scheduled). If you run the .bat script manually, then when the WD box rears its ugly head, just click ignore. WD should remember after that and not mess with it.

  • RSelenium findElement(using = "") options

    • “class name” : Returns an element whose class name contains the search value; compound class names are not permitted.

    • “css selector” : Returns an element matching a CSS selector.

    • “id” : Returns an element whose ID attribute matches the search value.

    • “name” : Returns an element whose NAME attribute matches the search value.

    • “link text” : Returns an anchor element whose visible text matches the search value.

    • “partial link text” : Returns an anchor element whose visible text partially matches the search value.

    • “tag name” : Returns an element whose tag name matches the search value.

    • “xpath” : Returns an element matching an XPath expression.
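
    • A few examples of these strategies (the values are hypothetical):

      remDr$findElement(using = "id", value = "main-content")
      remDr$findElement(using = "partial link text", value = "Download")
      remDr$findElement(using = "xpath", value = "//table[@class='stats-table']")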

Terms

  • Static Web Page: A web page (HTML page) that contains the same information for all users. Although it may be periodically updated, it does not change with each user retrieval.

  • Dynamic Web Page: A web page that provides custom content for the user based on the results of a search or some other request. Also known as “dynamic HTML” or “dynamic content”; the term refers to interactive web pages whose content is created for each user.
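
  • The distinction drives tool choice. A minimal sketch (urls are placeholders; remDr is an existing RSelenium remote driver):

    library(rvest)
    # static page: the downloaded HTML already contains the data
    static_tbl <- read_html("<url>") |>
      html_element("table") |>
      html_table()

    # dynamic page: render it in a real browser first, then parse the source
    remDr$navigate("<url>")
    dynamic_tbl <- remDr$getPageSource()[[1]] |>
      read_html() |>
      html_element("table") |>
      html_table()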

rvest

  • Misc

    • Notes from: Pluralsight.Advanced.Web.Scraping.Tactics.R.Playbook
  • Uses css selectors or xpath to find html nodes

    library(rvest)
    page <- read_html("<url>")
    node <- html_element(page, xpath = "<xpath>")
    • Find css selectors
      • selector gadget
        1. click selector gadget app icon in Chrome in upper right assuming you’ve installed it already
        2. click item on webpage you want to scrape
          • it will highlight other items as well
        3. click each item you DON’T want, to deselect it
        4. copy the selector name in box at the bottom of webpage
        5. Use html_text() to pull text or html_attr() to pull an attribute value such as a link’s href (see the sketch after these steps)
      • inspect
        1. right-click item on webpage
        2. click inspect
        3. html element should be highlighted in elements tab of right side pane
        4. right-click element –> copy –> copy selector or copy xpath
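    • A minimal sketch using a copied selector (the url and selector are hypothetical):

      library(rvest)
      page <- read_html("<url>")
      html_elements(page, ".headline") |> html_text2()        # text
      html_elements(page, ".headline a") |> html_attr("href") # links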
  • Example: Access data that needs authentication (also see RSelenium version)

    • navigate to login page

      session <- session("<login page url>")
    • Find “forms” for username and password

      form <- html_form(session)[[1]]
      form
      • Evidently there are multiple forms on a webpage. He didn’t give a good explanation for why he chose the first one
      • “session_key” and “session_password” are the ones needed
    • Fill out the necessary parts of the form and send it

      filled_form <- html_form_set(form, session_key = "<username>", session_password = "<password>")
      filled_form # shows the values you input next to the form fields
      log_in <- session_submit(session, filled_form)
    • Confirm that you’re logged in

      log_in # prints url, status = 200, type = text/html, size = 757813 (response size in bytes)
      browseURL(log_in$url) # opens the logged-in page in your default browser
  • Example: Filter a football stats table by selecting values from a dropdown menu on a webpage (also see RSelenium version)

    • After set-up and navigating to url, get the forms from the webpage

      forms <- html_form(session)
      forms # prints all the forms
      • The fourth has all the filtering menu categories (team, week, position, year), so that one is chosen
    • Fill out the form to enter the values you want to use to filter the table and submit that form to filter the table

      filled_form <- html_form_set(forms[[4]], "team" = "DAL", "week" = "all", "position" = "QB", "year" = "2017")
      submitted_session <- session_submit(session = session, form = filled_form)
    • Look for the newly filtered table

      tables <- html_elements(submitted_session, "table")
      tables
      • Using inspect, you can see the 2nd one has <table class="sortable stats-table …", etc.
    • Select the second table and convert it to a dataframe

      football_df <- html_table(tables[[2]], header = TRUE)

RSelenium

  • Along with installing the package, you have to know the driver version for the browser you’re going to use

    • https://chromedriver.chromium.org/downloads

    • Find Chrome browser version

      • Through console

        system2(command = "wmic",
                args = 'datafile where name="C:\\\\Program Files         (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value')
    • List available Chrome drivers

      binman::list_versions(appname = "chromedriver")
      • If no exact driver version matches your browser version,
        • Each version of the Chrome driver supports Chrome with matching major, minor, and build version numbers.
          • Example: Chrome driver 73.0.3683.20  supports all Chrome versions that start with 73.0.3683
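        • A hedged sketch of matching by prefix (the version numbers are hypothetical; on Windows the versions are listed under $win32):

          available <- binman::list_versions(appname = "chromedriver")$win32
          grep("^73\\.0\\.3683", available, value = TRUE)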
  • Start server and create remote driver

    • a browser will pop up and say “Chrome is being controlled by automated test software”
    library(RSelenium)
    driver <- rsDriver(browser = c("chrome"), chromever = "<driver version>", port = 4571L) # the port just needs to be a free local port; 4571L is arbitrary
    remDr <- driver[['client']] # can also use $client
  • Navigate to a webpage

    remDr$navigate("<url>")
  • remDr$maxWindowSize(): Set the size of the browser window to maximum.

    • By default, the browser window size is small, and some elements of the website you navigate to might not be available right away
  • Grab the url of the webpage you’re on

    remDr$getCurrentUrl()
  • Go back and forth between urls

    remDr$goBack()
    remDr$goForward()
  • Find html element (name, id, class name, etc.)

    webpage_element <- remDr$findElement(using = "name", value = "q") 
    • See Misc section for selector options
    • Where “name” is the locator strategy (see the Misc section) and “q” is the value of that attribute, e.g. name=“q” if you used the inspect method in chrome
    • Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
  • Highlight element in pop-up browser to make sure you have the right thing

    webpage_element$highlightElement()
  • Example: you picked a search bar for your html element and now you want to use the search bar from inside R

    • Enter text into search bar

      webpage_element$sendKeysToElement(list("Scraping the web with R"))
    • Hit enter to execute search

      webpage_element$sendKeysToElement(list(key = "enter"))
      • You are now on the page with the results of the Google search
    • Scrape all the links and titles on that page

      webelm_linkTitles <- remDr$findElements(using = "css selector", ".r")
      • Inspect showed the title elements have class “r”. Notice he used “.r”: a leading period is the CSS selector for all elements with that class.
      • findElements (plural) is used so every matching element is returned as a list.

    • Get titles

      # first title
      webelm_linkTitles[[1]]$getElementText()
      
      # put them all into a character vector
      titles <- purrr::map_chr(webelm_linkTitles, ~ unlist(.x$getElementText()))
      # base R equivalent
      titles <- unlist(lapply(
          webelm_linkTitles,
          function(x) x$getElementText()))
  • Example: Access data that needs user authentication (also see rvest version)

    • After set-up and navigating to webpage, find elements where you type in your username and password

      webelm_username <- remDr$findElement(using = "id", "Username")
      webelm_pass <- remDr$findElement(using = "id, "Password")
    • Enter username and password

      webelm_username$sendKeysToElement(list("<username>"))
      webelm_pass$sendKeysToElement(list("<password>"))
    • Find the sign-in button and click it

      webelm_sbutt <- remDr$findElement(using = "class name", "psds-button")
      webelm_sbutt$clickElement()
  • Example: Filter a football stats table by selecting values from a dropdown menu on a webpage (also see rvest version)

    • This is tedious; use rvest to scrape this if possible (you have to use rvest at the end anyways). HTML forms are the way to go.

    • After set-up and navigating to the url, find the dropdown “team” menu element locator using inspect in the browser and use findElement

      webelem_team <- remDr$findElement(using = "name", value = "team") # conveniently has name="team" in the html
      • Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
    • click team dropdown

      webelem_team$clickElement()
    • Go back to inspect in the browser, you should be able to expand the team menu element. Left click value that you want to filter team by to highlight it. Then right click the element and select “copy” –> “copy selector”. Paste selector into value arg

      webelem_DAL <- remDr$findElement(using = "css selector", value = "#edit-filters-0-team > option:nth-child(22)")
      webelem_DAL$clickElement()
      • Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
      • Repeat process for week, position, and year drop down menu filters
    • After you’ve selected all the values in the dropdown, click the submit button to filter the table

      webelem_submit <- remDr$findElement(using = "css selector", value = "#edit-filters-0-actions-submit")
      webelem_submit$clickElement()
      • Finds element by using inspect on the submit button and copying the selector
    • Get the html source of the page with the filtered table. Read it into R with rvest.

      page_source <- remDr$getPageSource()[[1]]
      html_page <- rvest::read_html(page_source)
      • If you want the header, getPageSource(header = TRUE)
    • Use rvest to scrape the table. Find the table with the stats

      all_tables <- rvest::html_elements(html_page, "table")
      all_tables
      • Used the “html_elements” version instead of “element”
      • The third one has <table class="sortable stats-table full-width …"
    • Save the table to a dataframe

      football_df <- rvest::html_table(all_tables[[3]], header = TRUE)

Other Stuff

  • Clicking a semi-infinite scroll button (e.g. “See more”)

    • Example: For-Loop

      library(stringr) # for str_detect
      
      # Find Page Element for Body
      webElem <- remDr$findElement("css selector", "body")
      
      # Page to the End
      for (i in 1:50) {
        message(paste("Iteration",i))
        webElem$sendKeysToElement(list(key = "end"))
      
        # Check for the Show More Button
        element <- try(unlist(
            remDr$findElement(
              "class name",
              "RveJvd")$getElementAttribute('class')), silent = TRUE)
      
        # If Button Is There Then Click It
        Sys.sleep(2)
        if (str_detect(element, "RveJvd")) {
          buttonElem <- remDr$findElement("class name", "RveJvd")
          buttonElem$clickElement()
        }
      
        # Sleep to Let Things Load
        Sys.sleep(3)
      }
      • article
      • After scrolling to the “end” of the page, there’s a “show me more button” that loads more data on the page
    • Example: Recursive

      load_more <- function(rd) {
        # scroll to end of page
        rd$executeScript("window.scrollTo(0, document.body.scrollHeight);", args = list())
      
        # Find the "Load more" button by its CSS selector and ...
        load_more_button <- rd$findElement(using = "css selector", "button.btn-load.more")
      
        # ... click it
        load_more_button$clickElement()
      
        # give the website a moment to respond
        Sys.sleep(5)
      }
      
      load_page_completely <- function(rd) {
        # load more content even if it throws an error
        tryCatch({
          # call load_more()
          load_more(rd)
          # if no error is thrown, call the load_page_completely() function again
          Recall(rd)
        }, error = function(e) {
          # if an error is thrown return nothing / NULL
        })
      }
      
      load_page_completely(remote_driver)
      • article
      • Recall is a base R function that calls the same function it’s in.
  • Shadow DOM elements

    • #shadow-root and shadow dom button elements

    • Misc

      • Two options: {shadowr} or JS script
    • Example: Use {shadowr}

      • My stackoverflow post

      • Set-up

        pacman::p_load(RSelenium, shadowr)
        driver <- rsDriver(browser = c("chrome"), chromever = chrome_driver_version)
        # chrome browser
        chrome <- driver$client
        shadow_rd <- shadow(chrome)
      • Find web element

        • Search for element using html tag

        wisc_dl_panel_button4 <- shadowr::find_elements(shadow_rd, 'calcite-button')
        wisc_dl_panel_button4[[1]]$clickElement()
        • Shows web element located in #shadow-root
        • Since there might be more than one element with the “calcite-button” html tag, we use the plural, find_elements, instead of find_element
        • There’s only 1 element returned, so we use [[1]] index to subset the list before clicking it
      • Search for web element by html tag and attribute

        wisc_dl_panel_button3 <- find_elements(shadow_rd, 'button[aria-describedby*="tooltip"]')
        wisc_dl_panel_button3[[3]]$clickElement()
        • “button” is the html tag which is subsetted by the brackets, and “aria-describedby” is the attribute
        • Only part of the attribute’s value, “tooltip,” is used, so “*=” (the CSS substring-match operator) is used instead of “=”: [aria-describedby*="tooltip"] matches any element whose attribute value contains “tooltip”.
        • Since there might be more than one element with this html tag + attribute combo, we use the plural, find_elements, instead of find_element
        • There are 3 elements returned, so we use [[3]] index to subset the list to element we want before clicking it
    • Example: Use a JS script and some webelement hacks to get a clickable element

      • Misc
        • “.class_name”
          • replace the spaces in a compound class attribute with periods to form a CSS selector
            • “.btn btn-default hidden-xs” becomes “.btn.btn-default.hidden-xs”
      • You can find the element path to use in your JS script by going step by step with JS commands in the Chrome console (bottom window)
      • Steps
        • Write JS script to get clickable element’s elementId

          1. Start with element right above first shadow-root element and use querySelector

          2. Move to the next element inside the next shadow-root element using shadowRoot.querySelector

          3. Continue to desired clickable element

            • If there isn’t another shadow-root that you have to open, then the next element can be selected using querySelector
            • If you do have to click on another shadow-root element to open another branch, then use shadowRoot.querySelector
            • Example
              • “hub-download-card” is just above shadow-root so it needs querySelector
              • “calcite-card” is an element that’s one-step removed from shadow-root, so it needs shadowRoot.querySelector
              • “calcite-dropdown” (type = “click”) is not directly next to a shadow-root (there’s a div in between), so it can be selected using querySelector
          4. Write and execute JS script

            wisc_dlopts_elt_id <- chrome$executeScript("return document.querySelector('hub-download-card').shadowRoot.querySelector('calcite-card').querySelector('calcite-dropdown');")
        • Make a clickable element or just click the damn thing

          • clickable element (sometimes this doesn’t work; needs to be a button or type=click)

            1. Use findElement to find a generic element class object that you can manipulate
            2. Use “@” ninja-magic to force elementId into the generic webElement to coerce it into your button element
            3. Use clickElement to click the button
            # think this is a generic element that can always be used
            moose <- chrome$findElement("css selector", "html")
            moose@.xData$elementId <- as.character(wisc_dlopts_elt_id)
            moose$clickElement()
        • Click the button

          chrome$executeScript("document.querySelector('hub-download-card').shadowRoot.querySelector('calcite-card').querySelector('calcite-dropdown').querySelector('calcite-dropdown-group').querySelector('calcite-dropdown-item:nth-child(2)').click()")
  • Get data from a hidden input

    • article

    • HTML Element

      <input type="hidden" id="overview-about-text" value="%3Cp%3E100%25%20Plant-Derived%20Squalane%20hydrates%20your%20skin%20while%20supporting%20its%20natural%20moisture%20barrier.%20Squalane%20is%20an%20exceptional%20hydrator%20found%20naturally%20in%20the%20skin,%20and%20this%20formula%20uses%20100%25%20plant-derived%20squalane%20derived%20from%20sugar%20cane%20for%20a%20non-comedogenic%20solution%20that%20enhances%20surface-level%20hydration.%3Cbr%3E%3Cbr%3EOur%20100%25%20Plant-Derived%20Squalane%20formula%20can%20also%20be%20used%20in%20hair%20to%20increase%20heat%20protection,%20add%20shine,%20and%20reduce%20breakage.%3C/p%3E">
    • Extract value and decode the text

      overview_text <- webpage |>
        html_element("#overview-about-text") |>
        html_attr("value") |>
        URLdecode() |>
        read_html() |>
        html_text()
      
      overview_text
      #> [1] "100% Plant-Derived Squalane hydrates your skin while supporting its natural moisture barrier.