Scraping
Misc
- {rvest}
- {rselenium}
- {selenium}
- {selenider} - Wrapper functions around {chromote} and {selenium} functions that utilize lazy element finding and automatic waiting to make scraping code more reliable
- {shadowr} - For shadow DOMs
- {rJavaEnv} - Quickly install Java Development Kit (JDK) without administrative privileges and set environment variables in current R session or project to solve common issues with ‘Java’ environment management in ‘R’.
- {robotstxt} - Provides functions to download and parse ‘robots.txt’ files.
Unofficial Guidelines
- Finding no robots.txt file at the server (e.g. HTTP status code 404) implies that everything is allowed
- Subdomains should have their own robots.txt file; if one is not found, it is assumed that everything is allowed
- Redirects involving protocol changes - e.g. upgrading from http to https - are followed and considered no domain or subdomain change - so whatever is found at the end of the redirect is considered to be the robots.txt file for the original domain
- Redirects from the www subdomain to the domain are considered no domain change - so whatever is found at the end of the redirect is considered to be the robots.txt file for the subdomain originally requested
Example
```r
robotstxt::get_robotstxt("https://www.cimls.com")
#> [robots.txt]
#> --------------------------------------
#> 
#> Sitemap: https://www.cimls.com/sitemaps/sitemap-index.xml
#> 
#> User-agent: *
#> Disallow: /view-map.php
#> Disallow: /data_provider/broker_website.php
#> Disallow: /admin/*
#> Disallow: /login/*
#> Allow: /

robotstxt::paths_allowed(
  paths = "external-data/datafiniti-api.php",
  domain = "www.cimls.com",
  bot = "*"
)
#>  www.cimls.com
#> 
#> [1] TRUE
```
- The robots.txt seems to be okay with my calling its api
- TRUE means bots have permission to access the page
- {polite} (article) - Uses three principles of polite webscraping: seeking permission, taking slowly and never asking twice.
Specifically, it manages the http session, declares the user agent string and checks the site policies, and uses rate-limiting and response caching to minimize the impact on the webserver.
Creates a session with `bow`, requests a page with `nod`, and pulls the contents of the page with `scrape`.
Example
```r
library(polite)
library(rvest)

session <- bow("https://www.cheese.com/by_type", force = TRUE)

result <- scrape(session, query = list(t = "semi-soft", per_page = 100)) |>
  html_node("#main-body") |>
  html_nodes("h3") |>
  html_text()

head(result)
#> [1] "3-Cheese Italian Blend"  "Abbaye de Citeaux"      
#> [3] "Abbaye du Mont des Cats" "Adelost"                
#> [5] "ADL Brick Cheese"        "Ailsa Craig"
```
- Resources
Waiting for stuff to load
{selenider} fixes this problem
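A minimal sketch (not from these notes) of {selenider}'s automatic waiting; the site and selector are placeholders:

```r
library(selenider)

session <- selenider_session("chromote")  # could also be "selenium"
open_url("https://www.r-project.org/")

# s() finds the element lazily, and elem_click() waits for it to exist
# and be clickable before acting, so manual Sys.sleep() calls are
# usually unnecessary
s(".sidebar a") |>
  elem_click()
```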
In loops, use `Sys.sleep` (probably) after EVERY selenium function. `Sys.sleep(1)` might be all that's required (see the sketch after this list).
- See Projects > foe > gb-level-1_9-thread > scrape-gb-levels.R
- Might not always be needed, but absolutely need if you’re filling out a form and submitting it.
- Might even need one at the top of the loop
- If a Selenium function stops working, adding Sys.sleeps are worth a try.
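A hypothetical sketch of that pattern; urls and the ".submit" selector are placeholders:

```r
for (i in seq_along(urls)) {
  remDr$navigate(urls[[i]])
  Sys.sleep(1)  # let the page load

  webelem_submit <- remDr$findElement(using = "css selector", value = ".submit")
  Sys.sleep(1)  # pause before interacting

  webelem_submit$clickElement()
  Sys.sleep(1)  # pause again after submitting the form
}
```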
Using a while-loop in order to account for uncertain loading times
Example: (source)
```r
for (page_index in 1:2348) {
  # Try to find the "Ver Mais" ("See more") buttons
  all_buttons_loaded <- FALSE
  iterations <- 0
  while (!all_buttons_loaded & iterations < 20) {
    tryCatch(
      {
        test <- remote_driver$findElements(using = 'id', value = "link_ver_detalhe")
        if (inherits(test, "list") && length(test) > 0) {
          all_buttons_loaded <<- TRUE
        }
      },
      error = function(e) {
        iterations <<- iterations + 1
        Sys.sleep(0.5)
      }
    )
  }
  if (!all_buttons_loaded & iterations == 20) {
    next
  }
  # ... more stuff
}
```
Keeping track of progress can help to find where the error occurred (source)
Add messages that indicate stages within a scraping script:
- Show which page is being scraped;
- Show which modal of this page is being scraped;
- Show the status of this scraping (success/failure).
Example:
```r
library(logger)

# save calls to message() in an external file
log_appender(appender_file("data/modals/00_logfile"))
log_messages()

for (page_index in 1:2348) {
  message(paste("Start scraping of page", page_index))
  for (modal_index in buttons) {
    # open modal
    # get HTML and save it in an external file
    # leave modal
    message(paste("  Scraped modal", modal_index))
  }
  # Once all modals of a page have been scraped, go to the next page (except
  # if we're on the last page)
  message(paste("Finished scraping of page", page_index))
}
```
Best practice to scrape the html page and clean it in separate scripts
Keeping the raw html files helps with reproducibility
Example:
```r
buttons[[modal_index]]$clickElement()
Sys.sleep(1.5)

# Get the HTML and save it
tmp <- remote_driver$getPageSource()[[1]]
write(tmp, file = paste0("data/modals/page-", page_index,
                         "-modal-", modal_index, ".html"))

# Leave the modal
body <- remote_driver$findElement(using = "xpath", value = "/html/body")
body$sendKeysToElement(list(key = "escape"))
```
Sometimes `clickElement()` stops working for no apparent reason. When this happens, use `sendKeysToElement(list("laptops", key = "enter"))` instead.
In batch scripts (.bat), sometimes after a major Windows update, the Java that selenium uses will trigger Windows Defender (WD) and cause the scraping script to fail (if you have it scheduled). If you run the .bat script manually and then, when the WD box rears its ugly head, just click ignore, WD should remember after that and not mess with it.
RSelenium
`findElement(using = "")` options:
- "class name": Returns an element whose class name contains the search value; compound class names are not permitted.
- "css selector": Returns an element matching a CSS selector.
- "id": Returns an element whose ID attribute matches the search value.
- "name": Returns an element whose NAME attribute matches the search value.
- "link text": Returns an anchor element whose visible text matches the search value.
- "partial link text": Returns an anchor element whose visible text partially matches the search value.
- "tag name": Returns an element whose tag name matches the search value.
- "xpath": Returns an element matching an XPath expression.
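A few illustrative calls (all the values are made up):

```r
remDr$findElement(using = "css selector", value = "#main .result")
remDr$findElement(using = "id", value = "search")
remDr$findElement(using = "name", value = "q")
remDr$findElement(using = "link text", value = "Next page")
remDr$findElement(using = "xpath", value = "//table[@id='stats']//tr[1]")
```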
Terms
Static Web Page: A web page (HTML page) that contains the same information for all users. Although it may be periodically updated, it does not change with each user retrieval.
Dynamic Web Page: A web page that provides custom content for the user based on the results of a search or some other request. Also known as “dynamic HTML” or “dynamic content”, the “dynamic” term is used when referring to interactive Web pages created for each user.
rvest
Misc
- Notes from: Pluralsight.Advanced.Web.Scraping.Tactics.R.Playbook
Uses css selectors or xpath to find html nodes
```r
library(rvest)

page <- read_html("<url>")
node <- html_element(page, xpath = "<xpath>")
```
- Find css selectors
  - selector gadget
    - click the selector gadget app icon in Chrome in the upper right (assuming you've installed it already)
    - click the item on the webpage you want to scrape
      - it will highlight other items as well
    - click each item you DON'T want, to deselect it
    - copy the selector name in the box at the bottom of the webpage
    - Use html_text to pull text or html_attr to pull a link or something (see the sketch after this list)
  - inspect
    - right-click the item on the webpage
    - click inspect
    - the html element should be highlighted in the Elements tab of the right side pane
    - right-click the element –> copy –> copy selector or copy xpath
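A small sketch of pulling text vs. an attribute once you have a selector (".result-title a" is a made-up selector):

```r
library(rvest)

page  <- read_html("<url>")
links <- html_elements(page, ".result-title a")

html_text(links)          # the visible link text
html_attr(links, "href")  # the underlying urls
```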
Example: Access data that needs authentication (also see RSelenium version)
navigate to login page
```r
session <- session("<login page url>")
```
Find “forms” for username and password
```r
form <- html_form(session)[[1]]
form
```
- Evidently there are multiple forms on a webpage. He didn’t give a good explanation for why he chose the first one
- “session_key” and “session_password” are the ones needed
Fill out the necessary parts of the form and send it
```r
filled_form <- html_form_set(form,
                             session_key = "<username>",
                             session_password = "<password>")
# shows the values that were inputted next to the form sections
filled_form

log_in <- session_submit(session, filled_form)
```
Confirm that you're logged in
```r
# prints url, status = 200, type = text/html, size = 757813 (number of lines of html on page?)
log_in

browseURL(log_in$url)  # think this maybe opens a browser
```
Example: Filter a football stats table by selecting values from a dropdown menu on a webpage (also see RSelenium version)
After set-up and navigating to the url, get the forms from the webpage
```r
forms <- html_form(session)
forms  # prints all the forms
```
- The fourth has all the filtering menu categories (team, week, position, year), so that one is chosen
Fill out the form to enter the values you want to use to filter the table and submit that form to filter the table
```r
filled_form <- html_form_set(forms[[4]],
                             "team" = "DAL",
                             "week" = "all",
                             "position" = "QB",
                             "year" = "2017")
submitted_session <- session_submit(session = session, form = filled_form)
```
Look for the newly filtered table
```r
tables <- html_elements(submitted_session, "table")
tables
```
- Using inspect, you can see the 2nd one has <table class = “sortable stats-table…etc
Select the second table and convert it to a dataframe
```r
football_df <- html_table(tables[[2]], header = TRUE)
```
Retrieve Sidebar Content (source)
```r
library(chromote)

chrome_session <- ChromoteSession$new()

# Retrieve the sidebar content
node <- chrome_session$DOM$querySelector(
  nodeId = chrome_session$DOM$getDocument()$root$nodeId,
  selector = ".sidebar"
)

# Get the outerHTML of the node
html_content <- chrome_session$DOM$getOuterHTML(
  nodeId = node$nodeId
)

## Parse the sidebar content with `rvest` ----
html_content$outerHTML |>       # Pull the node's HTML response
  rvest::minimal_html() |>      # Convert to an XML document
  rvest::html_elements("a") |>  # Obtain all anchor (i.e. link) tags
  rvest::html_text()            # Extract the text from the anchor tags
```
Every element in the sidebar pretty much has a link, so the text can be extracted from them.
The CSS selector was much longer but he shortened it to “.sidebar”
RSelenium
Use Selenium if:
- The HTML you want is not directly accessible, i.e. it needs some interaction (clicking a button, connecting to a website, …),
- The URL doesn’t change with the inputs,
- You can't access the data directly in the "Network" tab of the console and you can't reproduce the `POST` request.
Along with installing the package, you have to know the version of the driver for the browser you're going to use
https://chromedriver.chromium.org/downloads
Find Chrome browser version
Through console
system2(command = "wmic", args = 'datafile where name="C:\\\\Program Files (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value')
List available Chrome drivers
```r
binman::list_versions(appname = "chromedriver")
```
- If no exact driver version matches your browser version:
  - Each version of the Chrome driver supports Chrome versions with matching major, minor, and build version numbers.
  - Example: Chrome driver 73.0.3683.20 supports all Chrome versions that start with 73.0.3683
Start server and create remote driver
- a browser will pop up and say “Chrome is being controlled by automated test software”
```r
library(RSelenium)

# assume the port number is specified by the chrome driver people
driver <- rsDriver(browser = c("chrome"),
                   chromever = "<driver version>",
                   port = 4571L)
remDr <- driver[['client']]  # can also use $client
```
Navigate to a webpage
```r
remDr$navigate("<url>")
```
`remDr$maxWindowSize()`: Set the size of the browser window to maximum.
- By default, the browser window size is small, and some elements of the website you navigate to might not be available right away
Grab the url of the webpage you’re on
```r
remDr$getCurrentUrl()
```
Go back and forth between urls
```r
remDr$goBack()
remDr$goForward()
```
Find html element (name, id, class name, etc.)
```r
webpage_element <- remDr$findElement(using = "name", value = "q")
```
- See Misc section for selector options
- Where “name” is the element class and “q” is the value e.g. name=“q” if you used the inspect method in chrome
- Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
Highlight element in pop-up browser to make sure you have the right thing
```r
webpage_element$highlightElement()
```
Example: you picked a search bar for your html element and now you want to use the search bar from inside R
Enter text into search bar
$sendKeysToElement(list("Scraping the web with R")) webpage_element
Hit enter to execute search
```r
webpage_element$sendKeysToElement(list(key = "enter"))
```
- You are now on the page with the results of the google search
Scrape all the links and titles on that page
<- remDr$findElement(using = "css selector", ".r") webelm_linkTitles
Inspect showed an element with class "r". Notice he used ".r"; he says it will pick up all elements with "r" as the class.
Get titles
```r
# first title
webelm_linkTitles[[1]]$getElementText()

# put them all into a list
titles <- purrr::map_chr(webelm_linkTitles, ~ .x$getElementText())
# or, the base R equivalent
titles <- unlist(lapply(webelm_linkTitles, function(x) {x$getElementText()}))
```
Example: Access data that needs user authentication (also see rvest version)
After set-up and navigating to webpage, find elements where you type in your username and password
<- remDr$findElement(using = "id", "Username") webelm_username <- remDr$findElement(using = "id, "Password") webelm_pass
Enter username and password
$sendKeysToElement(list("<username>")) webpage_username$sendKeysToElement(list("<password>")) webpage_pass
Find the sign-in button and click it
<- remDr$findElement(using = "class", "psds-button") webelm_sbutt $clickElement() webelm_sbutt
Example: Filter a football stats table by selecting values from a dropdown menu on a webpage (also see rvest version)
This is tedious — use rvest to scrape this if possible (have to use rvest at the end anyways). html forms are the stuff.
After set-up and navigating to the url, find the dropdown "team" menu element locator using inspect in the browser and use findElement
<- remDr$findElement(using = "name", value = "team") # conveniently has name="team" in the html webelem_team
- Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
click team dropdown
```r
webelem_team$clickElement()
```
Go back to inspect in the browser, you should be able to expand the team menu element. Left click value that you want to filter team by to highlight it. Then right click the element and select “copy” –> “copy selector”. Paste selector into value arg
<- remDr$findElement(using = "css", value = "edit-filters-0-team > option:nth-child(22)") webelem_DAL $clickElement() webelem_DAL
- Also see Other Stuff >> Shadow DOM elements >> Use {shadowr} for alternate syntax to search for web elements
- Repeat process for week, position, and year drop down menu filters
After you’ve selected all the values in the dropdown, click the submit button to filter the table
<- remDr$findElement(using = "css", value = "edit-filters-0-actions-submit") webelem_submit $clickElement() webelem_submit
- Finds element by using inspect on the submit button and copying the selector
Get the html source of the page with the filtered table and read it into R with rvest.
```r
url <- remDr$getPageSource()[[1]]
html_page <- rvest::read_html(url)
```
- If you want the header, use `getPageSource(header = TRUE)`
Use rvest to scrape the table. Find the table with the stats
```r
all_tables <- rvest::html_elements(html_page, "table")
all_tables
```
- Used the “html_elements” version instead of “element”
- Third one has “<table class =”sortable stats-table full-width blah blah”
Save the table to a dataframe
```r
football_df <- rvest::html_table(all_tables[[3]], header = TRUE)
```
Other Stuff
Clicking a semi-infinite scroll button (e.g. “See more”)
Example: For-Loop
```r
library(stringr)  # for str_detect

# Find Page Element for Body
webElem <- remDr$findElement("css", "body")

# Page to the End
for (i in 1:50) {
  message(paste("Iteration", i))
  webElem$sendKeysToElement(list(key = "end"))

  # Check for the Show More Button
  element <- try(unlist(
    remDr$findElement("class name", "RveJvd")$getElementAttribute('class')),
    silent = TRUE)

  # If Button Is There Then Click It
  Sys.sleep(2)
  if (str_detect(element, "RveJvd") == TRUE) {
    buttonElem <- remDr$findElement("class name", "RveJvd")
    buttonElem$clickElement()
  }

  # Sleep to Let Things Load
  Sys.sleep(3)
}
```
- article
- After scrolling to the “end” of the page, there’s a “show me more button” that loads more data on the page
Example: Recursive
```r
load_more <- function(rd) {
  # scroll to end of page
  rd$executeScript("window.scrollTo(0, document.body.scrollHeight);", args = list())
  # Find the "Load more" button by its CSS selector and ...
  load_more_button <- rd$findElement(using = "css selector", "button.btn-load.more")
  # ... click it
  load_more_button$clickElement()
  # give the website a moment to respond
  Sys.sleep(5)
}

load_page_completely <- function(rd) {
  # load more content even if it throws an error
  tryCatch({
    # call load_more()
    load_more(rd)
    # if no error is thrown, call the load_page_completely() function again
    Recall(rd)
  }, error = function(e) {
    # if an error is thrown, return nothing / NULL
  })
}

load_page_completely(remote_driver)
```
Example: While-Loop with scroll height (source)
```r
progressive_scroll <- function(browser, scroll_step = 100) {
  # Get initial scroll height of the page
  current_height <- browser$executeScript("return document.body.scrollHeight")
  # Set a variable for the scrolling position
  scroll_position <- 0
  # Continue scrolling until the end of the page
  while (scroll_position < current_height) {
    # Scroll down by 'scroll_step' pixels
    browser$executeScript(paste0("window.scrollBy(0,", scroll_step, ");"))
    # Wait for the content to load (adjust this if the page is slower to load)
    Sys.sleep(runif(1, max = 0.2))
    # Update the scroll position
    scroll_position <- scroll_position + scroll_step
    # Get the updated scroll height after scrolling (in case more content is loaded)
    current_height <- browser$executeScript("return document.body.scrollHeight")
  }
}

# Scroll the ECB page to ensure all dynamic content is visible
progressive_scroll(browser, scroll_step = 1000)
```
Shadow DOM elements
#shadow-root and shadow dom button elements
Misc
- Two options: {shadowr} or JS script
Example: Use {shadowr}
My stackoverflow post
Set-up
```r
pacman::p_load(RSelenium, shadowr)

driver <- rsDriver(browser = c("chrome"), chromever = chrome_driver_version)
# chrome browser
chrome <- driver$client
shadow_rd <- shadow(chrome)
```
Find web element
- Search for element using html tag
```r
wisc_dl_panel_button4 <- shadowr::find_elements(shadow_rd, 'calcite-button')
wisc_dl_panel_button4[[1]]$clickElement()
```
- Shows web element located in #shadow-root
- Since there might be more than one element with the "calcite-button" html tag, we use the plural, `find_elements`, instead of `find_element`
- There's only 1 element returned, so we use the `[[1]]` index to subset the list before clicking it
Search for web element by html tag and attribute
```r
wisc_dl_panel_button3 <- find_elements(shadow_rd, 'button[aria-describedby*="tooltip"]')
wisc_dl_panel_button3[[3]]$clickElement()
```
- “button” is the html tag which is subsetted by the brackets, and “aria-describedby” is the attribute
- Only part of the attribute's value, "tooltip", is used, which is why "*=" (the CSS substring-match operator) is used instead of just "=".
- Since there might be more than one element with this html tag + attribute combo, we use the plural, `find_elements`, instead of `find_element`
- There are 3 elements returned, so we use the `[[3]]` index to subset the list to the element we want before clicking it
Example: Use a JS script and some webelement hacks to get a clickable element
- Misc
- “.class_name”
- fill in spaces with periods
- “.btn btn-default hidden-xs” becomes “.btn.btn-default.hidden-xs”
- You can find the element path to use in your JS script by going step by step with JS commands in the Chrome console (bottom window)
- Steps
Write JS script to get clickable element’s elementId
Start with the element right above the first shadow-root element and use `querySelector`
Move to the next element inside the next shadow-root element using `shadowRoot.querySelector`
Continue to desired clickable element
- If there isn't another shadow-root that you have to open, then the next element can be selected using `querySelector`
- If you do have to click on another shadow-root element to open another branch, then use `shadowRoot.querySelector`
- Example
  - "hub-download-card" is just above shadow-root, so it needs `querySelector`
  - "calcite-card" is an element that's one step removed from shadow-root, so it needs `shadowRoot.querySelector`
  - "calcite-dropdown" (type = "click") is not directly (see div) next to shadow-root, so it can be selected using `querySelector`
Write and execute JS script
```r
wisc_dlopts_elt_id <- chrome$executeScript("return document.querySelector('hub-download-card').shadowRoot.querySelector('calcite-card').querySelector('calcite-dropdown');")
```
Make a clickable element or just click the damn thing
clickable element (sometimes this doesn’t work; needs to be a button or type=click)
- Use `findElement` to find a generic element class object that you can manipulate
- Use "@" ninja-magic to force the elementId into the generic webElement to coerce it into your button element
- Use `clickElement` to click the button
```r
# think this is a generic element that can always be used
moose <- chrome$findElement("css", "html")
moose@.xData$elementId <- as.character(wisc_dlopts_elt_id)
moose$clickElement()
```
Click the button
$executeScript("document.querySelector('hub-download-card').shadowRoot.querySelector('calcite-card').querySelector('calcite-dropdown').querySelector('calcite-dropdown-group').querySelector('calcite-dropdown-item:nth-child(2)').click()") chrome
- Misc
Get data from a hidden input
HTML Element
<input type="hidden" id="overview-about-text" value="%3Cp%3E100%25%20Plant-Derived%20Squalane%20hydrates%20your%20skin%20while%20supporting%20its%20natural%20moisture%20barrier.%20Squalane%20is%20an%20exceptional%20hydrator%20found%20naturally%20in%20the%20skin,%20and%20this%20formula%20uses%20100%25%20plant-derived%20squalane%20derived%20from%20sugar%20cane%20for%20a%20non-comedogenic%20solution%20that%20enhances%20surface-level%20hydration.%3Cbr%3E%3Cbr%3EOur%20100%25%20Plant-Derived%20Squalane%20formula%20can%20also%20be%20used%20in%20hair%20to%20increase%20heat%20protection,%20add%20shine,%20and%20reduce%20breakage.%3C/p%3E">
Extract value and decode the text
```r
overview_text <- webpage |>
  html_element("#overview-about-text") |>
  html_attr("value") |>
  URLdecode() |>
  read_html() |>
  html_text()

overview_text
#> [1] "100% Plant-Derived Squalane hydrates your skin while supporting its natural moisture barrier.
```
JS
Highlight an element on the page (source)
```r
chrome_session <- ChromoteSession$new()

# Launch chrome to view actions taken in the browser
chrome_session$view()

# Get the browser's version
chrome_session$Browser$getVersion()

# Open a new tab and navigate to a URL
chrome_session$Page$navigate("https://www.r-project.org/")

chrome_session$Runtime$evaluate(
  expression = "
    // Find the element
    element = document.querySelector('.sidebar');
    // Highlight it
    element.style.backgroundColor = 'yellow';
    element.style.border = '2px solid red';
  "
)

# Wait for the action to complete
Sys.sleep(0.5)

# Take a screenshot of the highlighted element
chrome_session$screenshot("r-project-sidebar.png", selector = ".sidebar")

# View the screenshot
browseURL("r-project-sidebar.png")
```
- The css selector was some long string, but he shortened it to “.sidebar”
Search and extract table (source)
```r
# Start a new browser tab session
chrome_session_windy <- chrome_session$new_session()

# Open a new tab in the current browser
chrome_session_windy$view()

## Navigate to windy.com ----
chrome_session_windy$Page$navigate("https://www.windy.com")

# Wait for the page to load
Sys.sleep(0.5)

# First focus the input field
chrome_session_windy$Runtime$evaluate('
  document.querySelector("#q").focus();
')

# Brief pause to ensure focus is complete
Sys.sleep(0.5)

# Enter search term and trigger search
search_query <- 'Stanford University Museum of Art'
chrome_session_windy$Runtime$evaluate(
  expression = sprintf('{
    // Get the search input
    const searchInput = document.getElementById("q");
    searchInput.value = "%s";

    // Focus the input
    searchInput.focus();

    // Trigger input event
    const inputEvent = new Event("input", { bubbles: true });
    searchInput.dispatchEvent(inputEvent);

    // Trigger change event
    const changeEvent = new Event("change", { bubbles: true });
    searchInput.dispatchEvent(changeEvent);

    // Force the search to update - this triggers the site\'s search logic
    const keyupEvent = new KeyboardEvent("keyup", {
      key: "a",
      code: "KeyA",
      keyCode: 65,
      bubbles: true
    });
    searchInput.dispatchEvent(keyupEvent);
  }', search_query)
)

# Wait for and, then, click the first search result
Sys.sleep(0.5)
chrome_session_windy$Runtime$evaluate('
  document.querySelector(".results-data a").click();
')

## Extract weather data ----
# Wait for and, then, extract the weather data table
Sys.sleep(0.5)
html <- chrome_session_windy$Runtime$evaluate('
  document.querySelector("table#detail-data-table").outerHTML
')$result$value

## Parse the table using `rvest` ----
raw_weather_table <- html |>
  read_html() |>
  html_node('table') |>  # Select the table to extract it without getting a node set
  html_table() |>        # Convert the table to a data frame
  as.data.frame()

raw_weather_table
```
%s is replaced by search_query
"#detail-data-table" is the id CSS selector; without the "#", it's also the table's id in the HTML. Prefixing it with table (i.e. "table#detail-data-table") restricts the match to a <table> element with that id.
Feel like this might’ve been solved by RSelenium and the JS wasn’t necessary
API
POST
- Also see API >> Request Methods >> POST
- Sometimes dynamically served html tables can be scraped via a simple POST request avoiding Selenium procedures
- Example: Dynamic Table (source)
Get API Parameters From Network Tab
- The user sets some inputs like city and date range, and clicks the hourglass button to submit. Then this table pops up below.
- What we want is `search_lib.php`. It has a POST request method in the Headers tab, and the Preview tab shows the table we want.
- The request URL is also under the Headers tab at the top.
- The json file may look interesting since data often comes in json, but it's just some table format settings, I think.
- Under the Payload tab with view parsed selected, there’s a nice clean list of parameters that are the inputs we set in the first step
- Image appears to show view source selected, but that’s just because it changes to the other one once selected
- With view source selected, we see the query string that’s used to encapsulate the inputs.
fetch("http://historico.oepm.es/logica/search_lib.php", { "headers": { "accept": "*/*", "accept-language": "en-US,en;q=0.9", "content-type": "application/x-www-form-urlencoded; charset=UTF-8", "x-requested-with": "XMLHttpRequest" }, "referrer": "http://historico.oepm.es/buscador.php", "referrerPolicy": "strict-origin-when-cross-origin", "body": "cadena=Madrid&tb=SPH_MATCH_ALL&rangoa=1826%2C1966&indexes%5B%5D=privilegios&indexes%5B%5D=patentes&indexes%5B%5D=patentes_upm&indexes%5B%5D=marcas×tamp=Thu Dec 26 2024 08:49:42 GMT-0500 (Eastern Standard Time)", "method": "POST", "mode": "cors", "credentials": "include" });
- Right-clicking `search_lib.php` gives us some options. Select "Copy as fetch."
- Under "headers", we see the necessary header parameters that are required: "accept", "accept-language", "content-type", and "x-requested-with".
- No idea what these mean or why these are the particular ones required
- There is a list of other parameters that could probably be added. You can find them in the Headers tab underneath Request Headers or by selecting Copy Request Headers as seen in the image.
- By using “Copy as fetch,” we only get the absolutely necessary ones (I think).
Call API and Get the Data
```r
library(httr)   # make POST requests
library(polite) # be polite when we scrape
library(rvest)  # extract HTML tables

city  <- "madrid"
year1 <- 1850
year2 <- 1870

query <- paste0(
  "cadena=", city,
  "&tb=SPH_MATCH_ALL&rangoa=", year1, "%2C", year2,
  "&indexes%5B%5D=privilegios&indexes%5B%5D=patentes&indexes%5B%5D",
  "=patentes_upm&indexes%5B%5D=marcas"
)
```
- We choose some input values that we desire and place them in appropriate spots of our query string.
```r
# wrap httr::POST so that requests are made politely
polite_POST <- politely(POST, verbose = TRUE)

POST_response <- polite_POST(
  "http://historico.oepm.es/logica/search_lib.php",
  add_headers(
    "accept" = "*/*",
    "accept-language" = "en-GB,en-US;q=0.9,en;q=0.8",
    "content-type" = "application/x-www-form-urlencoded; charset=UTF-8",
    "x-requested-with" = "XMLHttpRequest"
  ),
  body = query
)
```
- `politely` tells the website who is performing the requests and adds a delay between requests (here we only do one)
- In `POST`, we set the request URL, the header parameters, and the finalized query string.
```r
content(POST_response, "parsed") |>
  html_table() |>
  head(n = 5)
#> [[1]]
#> # A tibble: 12 × 7
#>    ``    TIPO  SUBTIPO       EXPEDIENTE FECHA DENOMINACION_TITULO
#>    <lgl> <chr> <chr>              <int> <chr> <chr>              
#>  1 NA    Marca ""                   103 1870… La Deliciosa       
#>  2 NA    Marca "Marca de Fá…         54 1867… Fuente de los Cana…
#>  3 NA    Marca "Marca de Fá…         50 1868… Campanadas para in…
#>  4 NA    Marca "Marca de Fá…         66 1868… Compañía Española  
#>  5 NA    Marca "Marca de Fá…         76 1869… Tinta Universal    
#> # ℹ 1 more variable: SOLICITANTE <chr>
```
- Clicking the “+” in the first row of the table results in a pop-up table with more information (See first image)
- Click + and monitor Network tab
- The name that pops up is "ficha.php?id=1030&db=maruam"
- Get the request URL
- You can right-click that name >> Copy >> Copy URL or Left-click the name and the Request URL will be in the Headers tab
- “http://historico.oepm.es/logica/ficha.php?id=1030&db=maruam”
- Notice that this was a GET request method
- Issue: The id and db parameter values cannot be obtained from the primary table, nor can they be guessed.
Obtain values from the POST response
```r
# 1
list_attrs <- content(POST_response, "parsed") |>
  html_nodes("td > a") |>
  html_attrs()

# 2
info <- lapply(list_attrs, function(x) {
  out <- x[names(x) %in% c("data-id", "data-db")]
  if (length(out) == 0) return(NULL)
  data.frame(id = out[1], db = out[2])
})

# 3
info <- Filter(Negate(is.null), info)

# 4
out <- data.table::rbindlist(info)

head(out)
#>     id     db
#> 1:   6 maruam
#> 2: 130 maruam
#> 3: 461 maruam
#> 4: 523 maruam
#> 5: 560 maruam
#> 6: 581 maruam
```
- 1: Get all the attributes for all "+" buttons
- 2: For each "+" button, extract only the id and db attributes
- 3: Remove cases where there are no attributes
- 4: Transform the list into a clean dataframe
Loop values through url string
```r
read_html(
  paste0("http://historico.oepm.es/logica/ficha.php?id=", 6, "&db=", "maruam")
) |>
  html_table() |>
  head(n = 5)
#> [[1]]
#> # A tibble: 14 × 2
#>    X1                             X2            
#>    <chr>                          <chr>         
#>  1 Número de Marca                "103"         
#>  2 Denominación Breve             "La Deliciosa"
#>  3 Fecha Solicitud                "27-10-1870"  
#>  4 Fecha Concesión                "24-03-1871"  
#>  5 Fecha de publicación Concesión ""            
```
- Values can now be looped through the GET request url string
- The data is in long format and needs to be cleaned. See the repo for the complete code.
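A sketch of that loop (not the repo's code), assuming the out dataframe from above with its id and db columns:

```r
library(purrr)
library(rvest)

detail_tables <- pmap(out, function(id, db) {
  Sys.sleep(1)  # be polite between requests
  read_html(paste0("http://historico.oepm.es/logica/ficha.php?id=", id, "&db=", db)) |>
    html_table()
})
```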
GET
- Example: Real Estate Addresses
The website, https://www.cimls.com/, is a free listing for commercial real estate
Get API Parameters From Network Tab
- After navigating to the "For Sale" page from the navbar, you fill out an HTML form with whichever query parameters you're interested in.
- For this example, I chose city, state, search range (miles), and property type.
- To find the GET (or POST) request, look for an object with Size in kb — which there are two here.
- The ones I’ve seen also have .php extensions (popular API programming language).
- Looking for xhr in the Type column can also be an indicator. If there's a lot of stuff that gets loaded, there's an xhr filter button in the toolbar
- XMLHttpRequest (xhr) is an API in the form of a JavaScript object whose methods transmit HTTP requests from a web browser to a web server
- For each suspected object, click it. Then, click the Preview tab to see if that object is loading the data you want.
- Copy the request url from the Headers tab and the query parameters that were used from the Payload tab
fetch("https://www.cimls.com/external-data/datafiniti-api.php?city=Louisville&state=Kentucky&prop-type=Retail&range=25", { "headers": { "accept": "*/*", "accept-language": "en-US,en;q=0.9", "priority": "u=1, i", "sec-ch-ua": "\"Chromium\";v=\"134\", \"Not:A-Brand\";v=\"24\", \"Google Chrome\";v=\"134\"", "sec-ch-ua-mobile": "?0", "sec-ch-ua-platform": "\"Windows\"", "sec-fetch-dest": "empty", "sec-fetch-mode": "cors", "sec-fetch-site": "same-origin" }, "referrer": "https://www.cimls.com/external-data/external-search.php?city=Louisville&state=Kentucky&prop-type=Retail&range=25", "referrerPolicy": "strict-origin-when-cross-origin", "body": null, "method": "GET", "mode": "cors", "credentials": "include" });
- Right-clicking `datafiniti-api.php?city=Louisville&state=Kentucky&prop-type=Retail&range=25` gives us some options. Select "Copy as fetch."
- Under "headers", we see the header parameters that are required (the ones shown in the fetch above).
- No idea what these mean or why these are the particular ones required
- There is a list of other parameters that could probably be added. You can find them in the Headers tab underneath Request Headers or by selecting Copy Request Headers as seen in the image.
- By using “Copy as fetch,” we only get the absolutely necessary ones (I think)
- Unlike the POST example, you’ll also need the Cookie string and the Referer url in the Headers tab.
- The cookie string could be valid for months or a year. From the Cookies tab, check the dates for the various values in the Expires field. After that, you'll have to come back to the site and renew it.
- There are some cookie functions in {httr2}, so maybe there’s a way to programmatically do this.
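For example, `req_cookie_preserve` stores cookies from one request and replays them on later ones. An untested sketch (it may or may not satisfy this site's cookie requirements):

```r
library(httr2)

cookie_path <- tempfile(fileext = ".txt")  # hypothetical storage location

# An initial request to the search page stores whatever cookies the server sets
request("https://www.cimls.com/external-data/external-search.php") |>
  req_url_query("city" = "Louisville", "state" = "Kentucky",
                "prop-type" = "Retail", "range" = "25") |>
  req_cookie_preserve(path = cookie_path) |>
  req_perform()

# Later requests that point at the same cookie file send those cookies back
req_prop <- request("https://www.cimls.com/external-data/datafiniti-api.php") |>
  req_cookie_preserve(path = cookie_path)
```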
Click for More Pages
- Only 4 properties are listed. So, if we want more, we have to click the More Results button.
- Monitoring the network activity shows that the request url has changed slightly.
Get New Parameters
- Viewing the Payload tab shows the new query parameters that we’ll need to get the data from the rest of the pages.
Call API and Get the Data
```r
pacman::p_load(
  dplyr,
  stringr,
  httr2,
  rvest
)

req_prop <- 
  request("https://www.cimls.com/external-data/datafiniti-api.php") |>
  req_throttle(capacity = 1, fill_time_s = 10) |>
  req_url_query(
    "city" = "Louisville",
    "state" = "Kentucky",
    "page" = "1",
    "cend" = "3",
    "prop-type" = "Retail",
    "range" = "25"
  ) |>
  req_headers(
    "accept" = "*/*",
    "accept-language" = "en-US,en;q=0.9",
    "priority" = "u=1, i",
    "sec-ch-ua" = "\"Chromium\";v=\"134\", \"Not:A-Brand\";v=\"24\", \"Google Chrome\";v=\"134\"",
    "sec-ch-ua-mobile" = "?0",
    "sec-ch-ua-platform" = "\"Windows\"",
    "sec-fetch-dest" = "empty",
    "sec-fetch-mode" = "cors",
    "sec-fetch-site" = "same-origin",
    "referer" = "https://www.cimls.com/external-data/external-search.php?city=Louisville&state=Kentucky&prop-type=Retail&range=25",
    "cookie" = cookies
  )

resps_prop <- 
  req_perform_iterative(
    req = req_prop,
    next_req = iterate_with_offset(param_name = "page"),
    max_reqs = 4
  )

length(resps_prop)
#> [1] 4
```
- Also see API >> {httr2} >> Paginated Request for another example
- `request` takes the request url without the query parameters
- `req_throttle` says only call the API once every 10 seconds
  - If capacity = 5, then at most five API calls can be made consecutively every 10 seconds
- `req_url_query` fills out the request url with the parameters and values
  - Even though "page 1" didn't include the page and cend parameters, the query fortunately still works. This makes iterating calls to the API simpler.
  - If the parameters couldn't be included in the first iteration (page 1), I think {httr2} provides a function to alter the request url in the iteration loop.
- `req_headers` takes the headers we got from the "copy as fetch" along with the cookie string (not shown) and the referer url.
- `req_perform_iterative` performs the call loop.
  - param_name = "page" specifies which query parameter to use in the iteration (e.g. page = "1", page = "2", page = "3", etc.)
  - max_reqs = 4 specifies that I only want 4 pages. Although if 1 call failed, I'd only get 3.
  - You can also provide a user-defined function that returns a logical to determine when the iteration stops (see the sketch after this list).
  - The combined response is a list where each element is the response from that iteration's API call (4 elements in this case).
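A hedged sketch of such a stopping rule via `iterate_with_offset`'s resp_complete argument; the emptiness test below is made up:

```r
resps_prop <- req_perform_iterative(
  req = req_prop,
  next_req = iterate_with_offset(
    param_name = "page",
    resp_complete = function(resp) {
      # stop once a page no longer contains any listings
      length(html_elements(resp_body_html(resp), ".feature-text")) == 0
    }
  ),
  max_reqs = Inf
)
```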
```r
pull_addrs <- function(rsp) {
  addrs <- rsp |>
    resp_body_html() |>
    html_elements(".feature-text") |>
    html_text() |>
    str_remove_all("\\$.+")

  tib_addr <- 
    tibble(addrs_full = addrs) |>
    mutate(addrs_full = str_replace_all(addrs_full,
                                        pattern = "Vly",
                                        replacement = "Valley"))
}

tib_addrs <- 
  resps_data(resps = resps_prop, resp_data = pull_addrs) |>
  distinct()

tib_addrs
#> # A tibble: 13 × 1
#>    addrs_full                           
#>    <chr>                                
#>  1 10405 Southpointe Blvd, Louisville KY
#>  2 3621 Fern Valley Rd, Louisville KY   
#>  3 3600 Bardstown Rd, Louisville KY     
#>  4 7724 Bardstown Rd, Louisville KY     
#>  5 11601 Plantside Dr, Louisville KY    
#>  6 4926 Cane Run Rd, Louisville KY      
#>  7 5000 Maple Spg Dr, Louisville KY     
#>  8 552 S 4th St, Louisville KY          
#>  9 10435 Southpointe Blvd, Louisville KY
#> 10 10415 Southpointe Blvd, Louisville KY
#> 11 7701 Preston Hwy, Louisville KY      
#> 12 1905 Bardstown Rd, Louisville KY     
#> 13 714 E 10th St, Jeffersonville IN     
```
- `pull_addrs` is a wrangling function that takes a response and pulls the address text from the .feature-text class. Then it's cleaned.
- `resps_data` takes the response list and loops each element through `pull_addrs`
  - Acts like a {purrr} map function
Note that there are only 13 addresses. That's because there were only 13 unique properties fitting my criteria. Once I got to the end of the available properties, it looped around to the beginning and gave me duplicate properties. (Hence, the `distinct` at the end.)
To grab the property characteristics, you can adjust the `pull_addrs` function
- Each characteristic is a html list element. It would take some additional wrangling to get it tidy. I didn’t look at this result too closely, but it shouldn’t be too much trouble.