JSON
Misc
Packages
- {yyjsonr} - A fast JSON parser/serializer, which converts R data to/from JSON and NDJSON. It is around 2x to 10x faster than jsonlite at both reading and writing JSON.
- {RcppSimdJson} - Comparable to {yyjsonr} in performance.
Also see
- Big Data >> Larger than Memory
- SQL >> Processing Expressions >> Nested Data
- Databases >> DuckDB >> Misc
- hrbmstr recommends trying duckdb before using the cli tools in “Big Data”
Read from a URL (source)
<- "https://projects.fivethirtyeight.com/polls/president-general/2024/national/polls.json" url <- RcppSimdJson::fload(url) polls_raw ::glimpse(polls_raw) dplyr
Tools
{listviewer}: Allows you to interactively explore and edit json files through the Viewer in the IDE. Docs show how it can be embedded into a Shiny app as well.
Example
library(listviewer) <- jsonlite::read_json("path/to/file.json") moose jsonedit(moose) reactjson(moose)
- I’ve also used this a .config file which looked like a json file when I opened in a text editor, so this seems to work on anything json-like.
reactjson
has a copy button which is nice so that you can paste your edited version into a file.jsonedit
seems like it has more features, but I didn’t see a copy button. But there’s a view in which you can manually select everything a copy it via keyboard shortcut.
Python
json.tool
modulepython -m json.tool /home/trey/Downloads/download.json
- Pretty prints a json file from CLI
{jsonlite}
Read
<- flights ::read_json( jsonlite"https://tidyverse.r-universe.dev/nycflights13/data/flights/json", simplifyVector = TRUE )
Write
::stream_out( jsonlitetibble( day = days, title = titles, sections = sections, md = markdown, urls = urls ),gzfile("/tmp/drops.json.gz") )
Python
- Example: Parse Nested JSON into a dataframe (article)
Raw JSON
- “entry” has the data we want
- “…” at the end indicates there are multiple objectss inside the element, “entry”
- Probably other root elements other than “feed” as well
Read a json file from a URL using {{requests}} and convert to list
import requests = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json" url = requests.get(url) r = r.json() data = data["feed"]["entry"] entries
- It looks like the list conversion also ordered the elements alphabetically
- The output list is subsetted by the root element “feed” and the child element “entry”
Get a feel for the final structure you want by hardcoding elements into a df
= defaultdict(list) parsed_data for entry in entries: "author_uri"].append(entry["author"]["uri"]["label"]) parsed_data["author_name"].append(entry["author"]["name"]["label"]) parsed_data["author_label"].append(entry["author"]["label"]) parsed_data["content_label"].append(entry["content"]["label"]) parsed_data["content_attributes_type"].append(entry["content"]["attributes"]["type"]) parsed_data[ ...
Generalize extracting the properties of each object in “entry” with a nested loop
= defaultdict(list) parsed_data for entry in entries: for key, val in entry.items(): for subkey, subval in val.items(): if not isinstance(subval, dict): f"{key}_{subkey}"].append(subval) parsed_data[else: for att_key, att_val in subval.items(): f"{key}_{subkey}_{att_key}"].append(att_val) parsed_data[
defaultdict
creates a key from a list element (e.g. “author”) and groups the properties into a list of values where the value may also be a dict.- See Python, General >> Types >> Dictionaries
- For each item in “entry”, it looks at the first key-value pair knowing that value is always a dictionary (object in JSON)
- Then handles two different cases
- First Case: The value dictionary is flat and does not contain another dictionary, only key-value pairs.
- Combine the outer key with the inner key to a column name and take the value as column value for each pair.
- Second Case: Dictionary contains a key-value pair where the value is again a dictionary.
- Assumes at most two levels of nested dictionaries
- Iterates over the key-value pairs of the inner dictionary and again combines the outer key and the most inner key to a column name and take the inner value as column value.
- First Case: The value dictionary is flat and does not contain another dictionary, only key-value pairs.
Recursive function that handles json elements with deeper structures
def recursive_parser(entry: dict, data_dict: dict, col_name: str = "") -> dict: """Recursive parser for a list of nested JSON objects Args: entry (dict): A dictionary representing a single entry (row) of the final data frame. data_dict (dict): Accumulator holding the current parsed data. col_name (str): Accumulator holding the current column name. Defaults to empty string. """ for key, val in entry.items(): = f"{col_name}_{key}" if col_name else key extended_col_name if isinstance(val, dict): recursive_parser(entry[key], data_dict, extended_col_name)else: data_dict[extended_col_name].append(val) = defaultdict(list) parsed_data for entry in entries: "") recursive_parser(entry, parsed_data, = pd.DataFrame(parsed_data) df
- Notice the check for a deeper structure with
isinstance
. If there is one, then the function is called again. - Function outputs a dict which is coerced into dataframe
- To get rid of “label” in column names:
df.columns = [col if not "label" in col else "_".join(col.split("_")[:-1]) for col in df.columns]
object
types can be cast into more efficient types:df["im:rating"] = df["im:rating"].astype(int)
- Notice the check for a deeper structure with
DuckDB
- Misc
- Notes from
- -json flag says output as json instead of a sql query output
- The output kind of looks like a scrunched up dictionary. So, you can pipe that into a CLI tool like jq (if you have it installed) (e.g.
<query with -json> | jq
) to get pretty printing
- The output kind of looks like a scrunched up dictionary. So, you can pipe that into a CLI tool like jq (if you have it installed) (e.g.
- Read JSON from a URL in a query:
from read_json('https://api.github.com/orgs/golang/repos')
- CLI
- Example
Data: Types of open source licenses used in golang repo
[ { "id": 1914329, "name": "gddo", "license": { "key": "bsd-3-clause", "name": "BSD 3-Clause \"New\" or \"Revised\" License", ... }, ... }, { "id": 11440704, "name": "glog", "license": { "key": "apache-2.0", "name": "Apache License 2.0", ... }, ... }, ... ]
Count most common license types used
duckdb -c \ "select license->>'key' as license, count(*) as count \ from 'repos.json' \ group by 1 \ order by count desc" ┌──────────────┬───────┐ │ license │ count │ │ varchar │ int64 │ ├──────────────┼───────┤ │ bsd-3-clause │ 23 │ │ apache-2.0 │ 5 │ │ │ 2 │ └──────────────┴───────┘
- The bottom license with count = 2 is
null
(i.e. no licence) ->>
is used to drill down into a nested json field. (e.g. license to key)
- The bottom license with count = 2 is
- Example