JSON

Misc

  • Packages

    • {yyjsonr} - A fast JSON parser/serializer, which converts R data to/from JSON and NDJSON. It is around 2x to 10x faster than jsonlite at both reading and writing JSON.
    • {RcppSimdJson} - Comparable to {yyjsonr} in performance.
  • Also see

  • Read from a URL (source)

    url <- "https://projects.fivethirtyeight.com/polls/president-general/2024/national/polls.json"
    polls_raw <- RcppSimdJson::fload(url)
    dplyr::glimpse(polls_raw)
  • Tools

    • {listviewer}: Allows you to interactively explore and edit json files through the Viewer in the IDE. Docs show how it can be embedded into a Shiny app as well.

      • Example

        library(listviewer)
        moose <- jsonlite::read_json("path/to/file.json")
        jsonedit(moose)
        reactjson(moose)
        • I’ve also used this a .config file which looked like a json file when I opened in a text editor, so this seems to work on anything json-like.
        • reactjson has a copy button which is nice so that you can paste your edited version into a file.
        • jsonedit seems like it has more features, but I didn’t see a copy button. But there’s a view in which you can manually select everything a copy it via keyboard shortcut.
    • Python json.tool module

      python -m json.tool /home/trey/Downloads/download.json
      • Pretty prints a json file from CLI

{jsonlite}

  • Read

    flights <- 
      jsonlite::read_json(
        "https://tidyverse.r-universe.dev/nycflights13/data/flights/json",
        simplifyVector = TRUE
      )
  • Write

    jsonlite::stream_out(
      tibble(
        day = days,
        title = titles,
        sections = sections,
        md = markdown,
        urls = urls
      ),
      gzfile("/tmp/drops.json.gz")
    )

Python

  • Example: Parse Nested JSON into a dataframe (article)
    • Raw JSON

      • “entry” has the data we want
      • “…” at the end indicates there are multiple objectss inside the element, “entry”
        • Probably other root elements other than “feed” as well
    • Read a json file from a URL using {{requests}} and convert to list

      import requests
      
      url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
      
      r = requests.get(url)
      
      data = r.json()
      entries = data["feed"]["entry"]
      • It looks like the list conversion also ordered the elements alphabetically
      • The output list is subsetted by the root element “feed” and the child element “entry”
    • Get a feel for the final structure you want by hardcoding elements into a df

      parsed_data = defaultdict(list)
      
      for entry in entries:
          parsed_data["author_uri"].append(entry["author"]["uri"]["label"])
          parsed_data["author_name"].append(entry["author"]["name"]["label"])
          parsed_data["author_label"].append(entry["author"]["label"])
          parsed_data["content_label"].append(entry["content"]["label"])
          parsed_data["content_attributes_type"].append(entry["content"]["attributes"]["type"])
          ... 
    • Generalize extracting the properties of each object in “entry” with a nested loop

      parsed_data = defaultdict(list)
      
      for entry in entries:
          for key, val in entry.items():
              for subkey, subval in val.items():
                  if not isinstance(subval, dict):
                      parsed_data[f"{key}_{subkey}"].append(subval)
                  else:
                      for att_key, att_val in subval.items():
                          parsed_data[f"{key}_{subkey}_{att_key}"].append(att_val)
      • defaultdict creates a key from a list element (e.g. “author”) and groups the properties into a list of values where the value may also be a dict.
      • For each item in “entry”, it looks at the first key-value pair knowing that value is always a dictionary (object in JSON)
      • Then handles two different cases
        • First Case: The value dictionary is flat and does not contain another dictionary, only key-value pairs.
          • Combine the outer key with the inner key to a column name and take the value as column value for each pair.
        • Second Case: Dictionary contains a key-value pair where the value is again a dictionary.
          • Assumes at most two levels of nested dictionaries
          • Iterates over the key-value pairs of the inner dictionary and again combines the outer key and the most inner key to a column name and take the inner value as column value.
    • Recursive function that handles json elements with deeper structures

      def recursive_parser(entry: dict, data_dict: dict, col_name: str = "") -> dict:
          """Recursive parser for a list of nested JSON objects
      
          Args:
              entry (dict): A dictionary representing a single entry (row) of the final data frame.
              data_dict (dict): Accumulator holding the current parsed data.
              col_name (str): Accumulator holding the current column name. Defaults to empty string.
          """
          for key, val in entry.items():
              extended_col_name = f"{col_name}_{key}" if col_name else key
              if isinstance(val, dict):
                  recursive_parser(entry[key], data_dict, extended_col_name)
              else:
                  data_dict[extended_col_name].append(val)
      
      parsed_data = defaultdict(list)
      
      for entry in entries:
          recursive_parser(entry, parsed_data, "")
      
      df = pd.DataFrame(parsed_data)
      • Notice the check for a deeper structure with isinstance. If there is one, then the function is called again.
      • Function outputs a dict which is coerced into dataframe
      • To get rid of “label” in column names: df.columns = [col if not "label" in col else "_".join(col.split("_")[:-1]) for col in df.columns]
      • object types can be cast into more efficient types: df["im:rating"] = df["im:rating"].astype(int)

DuckDB

  • Misc
    • Notes from
    • -json flag says output as json instead of a sql query output
      • The output kind of looks like a scrunched up dictionary. So, you can pipe that into a CLI tool like jq (if you have it installed) (e.g. <query with -json> | jq) to get pretty printing
    • Read JSON from a URL in a query: from read_json('https://api.github.com/orgs/golang/repos')
  • CLI
    • Example
      • Data: Types of open source licenses used in golang repo

        [
          {
            "id": 1914329,
            "name": "gddo",
            "license": {
              "key": "bsd-3-clause",
              "name": "BSD 3-Clause \"New\" or \"Revised\" License",
              ...
            },
            ...
          },
          {
            "id": 11440704,
            "name": "glog",
            "license": {
              "key": "apache-2.0",
              "name": "Apache License 2.0",
              ...
            },
            ...
          },
          ...
        ]
      • Count most common license types used

        duckdb -c \
        "select license->>'key' as license, count(*) as count \
        from 'repos.json' \
        group by 1 \
        order by count desc"
        
        ┌──────────────┬───────┐
           license    │ count │
           varchar    │ int64 │
        ├──────────────┼───────┤
         bsd-3-clause │    23 │
         apache-2.0   │     5 │
                      │     2 │
        └──────────────┴───────┘
        • The bottom license with count = 2 is null (i.e. no licence)
        • ->> is used to drill down into a nested json field. (e.g. license to key)