Misc

  • Packages
    • {requests-ratelimiter} - A simple wrapper around pyrate-limiter v2 that adds convenient integration with the requests library
    • {tenacity} - A general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just about anything (See {requests} >> Rate Limiting >> Example)
      • Without backoff: 1000 failed requests retry simultaneously → rate limit → all fail again
      • With exponential backoff: Requests spread out over time → fewer rate limit hits → higher success rate
      • Benchmark: reported to reduce API failures from ~15% to <1% in production pipelines
  • Exponential Backoff
    • Retries: waits 1s, 2s, 4s, 8s between retries, up to a threshold (e.g., 60s). A minimal sketch is shown after the lists below.
    • When to use:
      • API calls (rate limits, transient errors)
      • Database connections (connection pool exhaustion)
      • Network I/O (timeouts, temporary network issues)
      • External service calls (third-party APIs that throttle)
    • When not to use it:
      • Logic errors (retrying won’t help)
      • Authentication failures (will always fail)
      • CPU-bound operations (backoff doesn’t help computation)
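    • A minimal R sketch using purrr's insistently() with rate_backoff(); the endpoint URL is hypothetical:

      library(purrr)
      library(httr)
      
      # Retry up to 5 times, pausing ~1s, 2s, 4s, 8s ... (jittered), capped at 60s
      insistent_get <- insistently(
        function(url) {
          res <- GET(url)
          stop_for_status(res)  # raise an error on 4xx/5xx so insistently() retries
          res
        },
        rate = rate_backoff(pause_base = 1, pause_cap = 60, max_times = 5, jitter = TRUE)
      )
      
      res <- insistent_get("https://example.com/api/data")  # hypothetical endpoint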

Terms

  • Body - information that is sent to the server. (Not used with GET requests.)
  • Endpoint - a part of the URL you visit. For example, the endpoint of the URL https://example.com/predict is /predict
  • Headers - used for providing information (think authentication credentials, for example). They are provided as key-value pairs
  • Method - the type of request you’re sending: one of GET, POST, PUT, PATCH, or DELETE. These perform one of the four actions: Create, Read, Update, Delete (CRUD)
  • Pooled Requests - A technique where multiple individual requests are combined or “pooled” into a single API call. May require more complex error handling, since you need to manage partial successes or failures within the pooled request (see the sketch after this list)
    • Methods
      • Batch endpoints: Some APIs offer specific endpoints designed to handle multiple operations in a single call.
      • Request bundling: Clients can aggregate multiple requests into a single payload before sending it to the API.
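    • A minimal request-bundling sketch in R with {httr2}; the batch endpoint and payload shape are hypothetical, since real batch APIs define their own:

      library(httr2)
      
      # Several operations bundled into one JSON payload
      ops <- list(
        list(method = "update", id = 1, fields = list(status = "active")),
        list(method = "update", id = 2, fields = list(status = "inactive"))
      )
      
      resp <- request("https://api.example.com/batch") |>  # hypothetical URL
        req_body_json(list(operations = ops)) |>
        req_perform()
      
      # Operations can succeed or fail independently, so check per-operation results
      results <- resp_body_json(resp)$results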

Request Methods

  • Misc

    • If you’re writing a function or script, check that the status code is in the 200s before any additional code runs (see the sketch below).
    • HTTP 429 - Too Many Requests
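    • A minimal {httr} sketch of the status check; the URL is hypothetical:

      library(httr)
      
      res <- GET("https://example.com/api/data")
      
      # Proceed only when the status code is in the 200s
      if (status_code(res) %/% 100 == 2) {
        dat <- content(res, as = "parsed")
      } else if (status_code(res) == 429) {
        message("Too Many Requests - back off and retry")
      } else {
        stop_for_status(res)  # raises an R error describing the HTTP status
      }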
  • GET

    • GET is a request for data; the parameters for that request are appended to the URL, usually after a ?.

    • Examples

      # example 1
      args <- list(key = "<key>", id = "<id>", format = "json", output = "full", count = "2")
      api_json <- GET(url = URL, query = args)
      
      # example 2 (with headers)
      res = GET("https://api.helium.io/v1/dc_burns/sum",
                query = list(min_time = "2020-07-27T00:00:00Z"
                            , max_time = "2021-07-27T00:00:00Z"),
                add_headers(`Accept`='application/json'
                , `Connection`='keep-alive'))
      
      # example 3
      get_book <- function(this_title, this_author = NA){
        httr::GET(
          url = url,  # books API endpoint, defined elsewhere
          query = list(
            apikey = Sys.getenv("BOOKS_API_KEY"),  # key name is illustrative
            q = ifelse(
              is.na(this_author),
              glue::glue('intitle:{this_title}'),
              glue::glue('intitle:{this_title}+inauthor:{this_author}')
              )))
      }
    • Example: Pull parsed json from raw format

      my_url <- paste0("http://dataservice.accuweather.com/forecasts/",
                        "v1/daily/1day/571_pc?apikey=", 
                       Sys.getenv("ACCUWEATHER_KEY"))
      my_raw_result <- httr::GET(my_url)
      
      my_content <- httr::content(my_raw_result, as = 'text')
      
      dplyr::glimpse(my_content) #get a sense of the structure
      dat <- jsonlite::fromJSON(my_content)
      • content() has 3 options for extracting and converting the content of the GET output:
      • “raw” returns the output as-is
      • “text” is often easiest to work with for nested json
      • “parsed” returns an R list
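      • For reference, a short sketch of the three as= modes applied to the response above:

        raw_body    <- httr::content(my_raw_result, as = "raw")     # raw bytes, as-is
        text_body   <- httr::content(my_raw_result, as = "text")    # one string; easiest for nested json
        parsed_body <- httr::content(my_raw_result, as = "parsed")  # parsed into an R list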
  • POST

    • Also see Scraping >> POST

    • POST is also a request for data, but the parameters are typically sent in the body as JSON. So it’s closer to sending data and receiving data than a GET request is.

    • When you fill out an HTML form or search input on a website and click a submit button, a POST request is sent to the webserver in the background.

    • Example

      # base_url from get_url above
      base_url <- "https://tableau.bi.iu.edu/"
      vizql <- dashsite_json$vizql_root
      session_id <- dashsite_json$sessionid
      sheet_id <- dashsite_json$sheetId
      
      post_url <- glue("{base_url}{vizql}/bootstrapSession/sessions/{session_id}")
      
      dash_api_output <- 
        POST(post_url,
             body = list(sheet_id = sheet_id),
             encode = "form",
             timeout(300))
    • Example: json body

      • From thread
      • “use auto_unbox = TRUE; otherwise there are some defaults that mess with your API format”
      • “url” is the API endpoint (obtain from the API docs)
      • headers carry things like the auth token and content type; a minimal sketch follows
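      • A minimal sketch assembling those pieces with {httr} and {jsonlite}; the endpoint, body fields, and env var name are hypothetical:

        library(httr)
        library(jsonlite)
        
        url <- "https://api.example.com/v1/predict"  # the API endpoint, from the docs
        
        body_json <- toJSON(
          list(text = "hello", max_results = 5),
          auto_unbox = TRUE  # without this, scalars are encoded as length-1 arrays
        )
        
        res <- POST(
          url,
          body = body_json,
          add_headers(Authorization = paste("Bearer", Sys.getenv("API_KEY"))),
          content_type_json()
        )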

{httr2}

  • POST

    • Contacts the Home Assistant API and turns off a light; a sketch is below.
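    • A minimal sketch, assuming a local instance and a long-lived access token; the host, token variable, and entity id are hypothetical:

      library(httr2)
      
      # Home Assistant exposes POST /api/services/<domain>/<service>;
      # setting a JSON body makes httr2 send a POST
      request("http://homeassistant.local:8123/api/services/light/turn_off") |>
        req_headers(Authorization = paste("Bearer", Sys.getenv("HASS_TOKEN"))) |>
        req_body_json(list(entity_id = "light.living_room")) |>
        req_perform()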
  • Paginated Requests

    • Also see Scraping >> API >> GET >> Example: Real Estate Addresses
    • Example: (source)
      • A helper, request_complete, checks each response to see whether another request is needed
      • req_perform_iterative is added to the request, giving it a canned iterator that takes your helper and bumps a query parameter (page for this API) each time another request is needed; a sketch of the pattern follows
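    • A minimal sketch of the pattern; the endpoint and completion check are hypothetical:

      library(httr2)
      
      req <- request("https://api.example.com/items")
      
      # Stand-in for the source's request_complete(): stop when a page comes back empty
      is_complete <- function(resp) length(resp_body_json(resp)$items) == 0
      
      resps <- req_perform_iterative(
        req,
        next_req = iterate_with_offset("page", resp_complete = is_complete),
        max_reqs = 20
      )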
  • GET Requests in Parallel

    • Example

      pacman::p_load(
        dplyr,
        httr2
      )
      
      # tib_comp: a tibble with latitude/longitude columns, created earlier
      reqs_dat_comp <- 
        tib_comp |> 
        mutate(latitude = as.character(latitude),
               longitude = as.character(longitude))
      
      reqs_comp <- 
        purrr::map2(reqs_dat_comp$longitude, 
                    reqs_dat_comp$latitude, 
                    \(x, y) {
                      request("https://geocoding.geo.census.gov/geocoder/geographies/coordinates") |>
                        req_url_query(
                          "benchmark" = "Public_AR_Current",
                          "vintage" = "Current_Current",
                          "format" = "json",
                          "x" = x,
                          "y" = y
                        )
                    })
      
      resps_comp <- req_perform_parallel(reqs_comp, max_active = 5)
      length(resps_failures(resps_comp))
      
      pull_geoid <- function(resp) {
        resp_json <- resp_body_json(resp)
      
        cb_name <- 
          stringr::str_extract(names(resp_json$result$geographies), 
                               pattern = "\\d{4} Census Blocks$") |> 
          na.omit()
      
        loc_geoid <- 
          purrr::pluck(resp_json, 1, 1, cb_name, 1)$GEOID |> 
          stringr::str_sub(end = 12)
      
        return(loc_geoid)
      }
      
      tib_geoids_comp <- 
        resps_data(resps = resps_comp, resp_data = pull_geoid) |> 
        tibble(geoid = _)
      • reqs_comp is a list of requests — each with different x and y values (longitude, latitude)
      • req_perform_parallel calls the API with 5 requests at a time
      • pull_geoid wrangles the response
      • resps_data takes a list of responses and applies pull_geoid to each element
  • Don’t parse a JSON response from an API into a string

    • Responses are binary. It’s more performant to read the binary directly than to parse the response into a string and then read the string

    • Example: {yyjsonr} (source)

      library(httr2)
      
      # format request
      req <- request("https://jsonplaceholder.typicode.com/users")
      # send request and get response
      resp <- req_perform(req)
      
      # translate binary to json
      your_json <- yyjsonr::read_json_raw(resp_body_raw(resp))
      • Faster than the httr2/jsonlite default, resp_body_json

{requests}

  • Basic Call (source)

    import io
    import polars as pl
    import requests
    
    url = "https://pub.demo.posit.team/public/lead-quality-metrics-api/lead_quality_metrics"
    
    response = requests.get(url)
    content = response.text
    
    lead_quality_data = pl.read_json(io.StringIO(content))
    • The response body is JSON text; io.StringIO wraps it as a file-like object that pl.read_json reads into a Polars data frame
  • Basic Call Using a Function (source)

    import requests
    import os
    import pandas as pd
    import matplotlib.pyplot as plt
    
    def grab_ONS_time_series_data(dataset_id,timeseries_id):
        """
        This function grabs specified time series from the ONS API.
        """
        api_endpoint = "https://api.ons.gov.uk/"
        api_params = {
            'dataset': dataset_id,
            'timeseries': timeseries_id}
        # Builds e.g. https://api.ons.gov.uk/timeseries/CHMS/dataset/MM23/data
        url = (api_endpoint
               + '/'.join([x + '/' + y for x, y in zip(api_params.keys(), api_params.values())][::-1])
               + '/data')
        return requests.get(url).json()
    
    # Grab the data (put your time series codes here)
    data = grab_ONS_time_series_data('MM23','CHMS')
    
    # Check we have the right time series
    title_text = data['description']['title']
    print("Code output:\n")
    print(title_text)

    # Put the data into a dataframe and convert types.
    # Note that you'll need to change 'months' if you're
    # using data at a different frequency
    df = pd.DataFrame(pd.json_normalize(data['months']))
    
    # Put the data in a standard datetime format
    df['date'] = pd.to_datetime(df['date'])
    df['value'] = df['value'].astype(float)
    df = df.set_index('date')
    
    # Check the data look sensible
    print(df.head())
    
    # Plot the data
    df['value'].plot(title=title_text,ylim=(0,df['value'].max()*1.2),lw=5.)
    plt.show()
  • Use Session to make a pooled request to the same host (Video, Docs)

    • Example

      import pathlib
      import requests
      
      links_file = pathlib.Path.cwd() / "links.txt"
      links = links_file.read_text().splitlines()[:10]
      headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"}
      
      # W/o Session (takes about 16sec)
      for link in links:
        response = requests.get(link, headers=headers)
        print(f"{link} - {response.status_code}")
      
      # W/Session (takes about 6sec)
      with requests.Session() as session:
        for link in links:
          response = session.get(link, headers=headers)
          print(f"{link} - {response.status_code}")
      • The first way synchronously makes a GET request to each URL
        • Each request opens a new connection to the same host
      • The second way reuses the underlying TCP connection, which can result in a significant performance increase.
  • Retrieve Paged Results One at a Time

    • Generator

      from typing import Iterator, Dict, Any
      from urllib.parse import urlencode
      import requests
      
      
      def iter_beers_from_api(page_size: int = 5) -> Iterator[Dict[str, Any]]:
          session = requests.Session()
          page = 1
          while True:
              response = session.get('https://api.punkapi.com/v2/beers?' + urlencode({
                  'page': page,
                  'per_page': page_size
              }))
              response.raise_for_status()
      
              data = response.json()
              if not data:
                  break
      
              yield from data
      
              page += 1
    • Iterate through each page of results

      beers = iter_beers_from_api()
      next(beers)
      #> {'id': 1,
      #>  'name': 'Buzz',
      #>  'tagline': 'A Real Bitter Experience.',
      #>  'first_brewed': '09/2007',
      #>  'description': 'A light, crisp and bitter IPA brewed...',
      #>  'image_url': 'https://images.punkapi.com/v2/keg.png',
      #>  'abv': 4.5,
      #>  'ibu': 60,
      #>  'target_fg': 1010,
      #> ...
      #> }
      next(beers)
      #> {'id': 2,
      #>  'name': 'Trashy Blonde',
      #>  'tagline': "You Know You Shouldn't",
      #>  'first_brewed': '04/2008',
      #>  'description': 'A titillating, ...',
      #>  'image_url': 'https://images.punkapi.com/v2/2.png',
      #>  'abv': 4.1,
      #>  'ibu': 41.5,
      #>  ...
      #> }
  • Use Concurrency

    • Use threads to keep multiple requests in flight at once. For I/O-bound work like HTTP calls this behaves like parallelism, even though Python threads don’t execute CPU-bound code in parallel.

    • Example (source)

      import requests
      from concurrent.futures import ThreadPoolExecutor, as_completed
      from requests_ratelimiter import LimiterSession
      
      # Limit to max 2 calls per second
      request_session = LimiterSession(per_second=2)
      
      
      def get_post(post_id: int) -> dict:
          if post_id > 100:
              raise ValueError("Parameter `post_id` must be less than or equal to 100")
      
          url = f"https://jsonplaceholder.typicode.com/posts/{post_id}"
      
          # Use the request_session now
          r = request_session.get(url)
          r.raise_for_status()
          result = r.json()
          # Remove the longest key-value pair for formatting reasons
          del result["body"]
          return result
      
      
      def future_callback_fn(future):
          # Runs automatically when a future completes; receives the Future object
          print(f"Done: {future}")
      
      
      if __name__ == "__main__":
          print("Starting to fetch posts...\n")
      
          # Run post fetching concurrently
          with ThreadPoolExecutor() as tpe:
              # Submit tasks, registering the callback on each future
              futures = [tpe.submit(get_post, post_id) for post_id in range(1, 16)]
              for future in futures:
                  future.add_done_callback(future_callback_fn)
              for future in as_completed(futures):
                  # Your typical try/except block
                  try:
                      result = future.result()
                      print(result)
                  except Exception as e:
                      print(f"Exception raised: {str(e)}")
      • ThreadPoolExecutor class manages a pool of worker threads for you
        • The default number of workers is min(32, CPU count + 4); e.g., 12 CPU cores means 16 ThreadPoolExecutor workers
      • Uses a standard try/except to handle errors. Errors don’t stop code from completing the other requests
      • future.add_done_callback registers your custom Python function, which is called with the Future object when it completes
  • Rate Limiting

    • Example: Exponential backoff with jitter

      from tenacity import (
          retry,
          stop_after_attempt,
          wait_random_exponential,
          retry_if_exception_type,
      )
      import time
      import requests
      
      # Rate limiting: Don't exceed 100 requests/second
      RATE_LIMIT = 100
      MIN_INTERVAL = 1.0 / RATE_LIMIT  # 0.01 seconds between requests
      
      @retry(
          stop=stop_after_attempt(5),
          wait=wait_random_exponential(multiplier=1, max=60),
          retry=retry_if_exception_type((requests.HTTPError, ConnectionError)),
          reraise=True
      )
      def fetch_with_backoff(url, last_request_time):
          """Fetch with rate limiting and exponential backoff."""
          # Rate limiting: ensure minimum interval between requests
          elapsed = time.time() - last_request_time
          if elapsed < MIN_INTERVAL:
              time.sleep(MIN_INTERVAL - elapsed)
      
          response = requests.get(url, timeout=10)
          response.raise_for_status()
          return response.json(), time.time()
      
      # Process requests with rate limiting
      # (api_urls and process() are placeholders for your own URL list and handler)
      last_request_time = 0
      for url in api_urls:
          data, last_request_time = fetch_with_backoff(url, last_request_time)
          process(data)
      • Backs off exponentially on failures and adds randomness (jitter) to prevent thundering-herd problems
      • Exponential backoff: the retry wait grows exponentially, capped at 60s
      • Jitter: wait_random_exponential draws a random wait up to that exponential bound, preventing synchronized retries
      • Conditional retries: Only retries on specific exceptions (not all errors)
      • Max attempts: Stops after 5 attempts to avoid infinite loops
  • Using API keys

    • Example (source)

      import requests
      from requests.auth import HTTPBasicAuth
      import json
      
      username = "ivelasq@gmail.com"
      api_key = r.api_key  # in the source post, the key comes from an R session via reticulate
      
      social_url = "https://ivelasq.atlassian.net/rest/api/3/search?jql=project%20=%20KAN%20AND%20text%20~%20%22\%22social\%22%22"
      blog_url = "https://ivelasq.atlassian.net/rest/api/3/search?jql=project%20=%20KAN%20AND%20text%20~%20%22\%22blog\%22%22"
      
      def get_response_from_url(url, username, api_key):
          auth = HTTPBasicAuth(username, api_key)
      
          headers = {
              "Accept": "application/json"
          }
      
          response = requests.request("GET", url, headers=headers, auth=auth)
      
          if response.status_code == 200:
              results = json.dumps(json.loads(response.text), sort_keys=True, indent=4, separators=(",", ": "))
              return results
          else:
              return None
      
      social_results = get_response_from_url(social_url, username, api_key)
      blog_results = get_response_from_url(blog_url, username, api_key)

{http.client}

  • Docs

  • The Requests package is recommended for a higher-level HTTP client interface.

  • Example 1: Basic GET

    import http.client
    import json
    
    conn = http.client.HTTPSConnection("api.example.com")
    conn.request("GET", "/data")
    response = conn.getresponse()
    data = json.loads(response.read().decode())
    conn.close()
  • Example 2:

    • GET

      import http.client
      
      url = '/fdsnws/event/1/query'
      query_params = {
          'format': 'geojson',
          'starttime': "2020-01-01",
          'limit': '10000',
          'minmagnitude': 3,
          'maxlatitude': '47.009499',
          'minlatitude': '32.5295236',
          'maxlongitude': '-114.1307816',
          'minlongitude': '-124.482003',
      }
      # The request target should be the path + query string, not the absolute URL
      full_url = f'{url}?{"&".join(f"{key}={value}" for key, value in query_params.items())}'
      
      print('defined params...')
      
      conn = http.client.HTTPSConnection('earthquake.usgs.gov')
      conn.request('GET', full_url)
      response = conn.getresponse()
    • JSON response

      import datetime
      import pandas as pd
      import json
      
      if response.status == 200:
          print('Got a response.')
          data = response.read()
          print('made the GET request...')
          data = data.decode('utf-8')
          json_data = json.loads(data)
          features = json_data['features']
          df = pd.json_normalize(features)
      
          if df.empty:
              print('No earthquakes recorded.')
          else:
              df[['Longitude', 'Latitude', 'Depth']] = df['geometry.coordinates'].apply(lambda x: pd.Series(x))
              df['datetime'] = df['properties.time'].apply(lambda x : datetime.datetime.fromtimestamp(x / 1000))
              df['datetime'] = df['datetime'].astype(str)
              df.sort_values(by=['datetime'], inplace=True)
      else:
          print(f"Error: {response.status}")