General
- Simple API Wrappers
- {tidychatmodels} - Communicates with different chatbot vendors (e.g., OpenAI, Mistral AI) through the same interface.
- {gemini.R} - Wrapper around Google Gemini API
- {rollama} - Wrapper around the Ollama API
- {chatAI4R} - Wrapper around OpenAI API
- {TheOpenAIR} - Wrapper around OpenAI models
- Code Assistants
- {chattr} - Code assistant for RStudio
- {gander} - A higher-performance and lower-friction chat experience for data scientists in RStudio and Positron–sort of like completions with Copilot, but it knows how to talk to the objects in your R environment.
- Brings {ellmer} chats into your project sessions, automatically incorporating relevant context and streaming their responses directly into your documents.
- {btw} - Helps you describe your computational environment to LLMs
- Assembles context on your R environment, package documentation, and working directory, copying the results to your clipboard for easy pasting into chat interfaces.
- Wraps methods that can be easily incorporated into ellmer tool calls for describing various kinds of objects in R
- Support for {mcptools}
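Example: a minimal sketch based on the bullets above; it assumes {btw}'s main entry point is btw() and the arguments shown are illustrative.

```r
# Sketch (assumed from the package description above): btw() assembles
# context on the environment/objects you pass it and copies the result
# to the clipboard for pasting into a chat interface.
library(btw)

btw()        # describe the current R environment and working directory
btw(mtcars)  # describe a specific object so an LLM can reason about it
```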
- {ellmer} - Supports a wide variety of LLM providers and implements a rich set of features including streaming outputs, tool/function calling, structured data extraction, and more.
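Example: a minimal {ellmer} sketch. Assumes an OpenAI key is available via OPENAI_API_KEY; the model name is illustrative.

```r
# Minimal ellmer usage: create a chat object, then send prompts to it.
# Responses stream to the console as they arrive.
library(ellmer)

chat <- chat_openai(
  system_prompt = "You are a terse R assistant.",
  model = "gpt-4o-mini"  # illustrative; use any model your provider offers
)

chat$chat("What's the difference between a list and a vector in R?")
```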
- {ollamar} - R version of the official ollama Python and JavaScript libraries
- Makes it easy to work with data structures (e.g., conversational/chat histories) that are standard for different LLMs (such as those provided by OpenAI and Anthropic).
- Lets you specify different output formats (e.g., dataframes, text/vector, lists) that best suit your need, allowing easy integration with other libraries/tools and parallelization via the {httr2} library.
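Example: a short {ollamar} sketch. Assumes a local Ollama server is running and the model has already been pulled; the model name is illustrative.

```r
# Generate a one-off completion, then make a chat-style call that returns a
# data frame (one of the output formats mentioned above).
library(ollamar)

test_connection()  # check that the local Ollama server is reachable

generate("llama3.1", "Tell me a one-line joke.", output = "text")

messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user", content = "Name three R packages for plotting.")
)
chat("llama3.1", messages, output = "df")
```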
- {mall} (Intro) - Text analysis over the rows of a dataframe using a pre-determined, one-shot prompt (which depends on the function). The prompt plus the row's text gets sent to an Ollama LLM for the prediction; see the sketch after the feature list.
- Also available in Python
- Features
- Sentiment analysis
- Text summarizing
- Classify text
- Extract one or several specific pieces of information from the text
- Translate text
- Verify that something is true about the text (binary)
- Custom prompt
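Example: a {mall} sketch covering two of the features above. Assumes Ollama is running locally; the model name and reviews are illustrative.

```r
# Each llm_*() function sends its one-shot prompt plus the row's text to the
# LLM and appends the prediction as a new column.
library(mall)

reviews <- data.frame(
  review = c(
    "This has been the best TV I've ever used. Great screen.",
    "I regret buying this laptop. It is too slow."
  )
)

llm_use("ollama", "llama3.2", seed = 100)

llm_sentiment(reviews, review)                 # adds a sentiment column
llm_summarize(reviews, review, max_words = 5)  # adds a short summary column
```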
- {batchLLM} - Process prompts through multiple LLMs at the same time.
- Uses a data frame column as LLM input and writes the text completions to a new column as the output.
- Supports OpenAI, Claude & Gemini.
- {llmR} - Interface to OpenAI’s GPT models, Azure’s language models, Google’s Gemini models, or custom local servers
- Unified API: Setup and easily switch between different LLM providers and models using a consistent set of functions.
- Prompt Processing: Convert chat messages into a standard format suitable for LLMs.
- Output Processing: Can request JSON output from the LLMs and tries to sanitize the response if the parsing fails.
- Error Handling: Automatically handle errors and retry requests when rate limits are exceeded. If a response is cut due to token limits, the package will ask the LLM to complete the response.
- Custom Providers: Interrogate custom endpoints (local and online) and allow implementation of ad-hoc LLM connection functions.
- Mock Calls: Allows simulation of LLM interactions for testing purposes.
- Logging: Option to log the LLM response details for performance and cost monitoring
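The retry-on-rate-limit behavior described above can also be wired up by hand; a generic sketch with {httr2} (this is not llmR's internal code, and the endpoint/model are illustrative):

```r
# Retry a chat-completion request when the provider returns HTTP 429
# (rate limited), with backoff handled by httr2.
library(httr2)

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_auth_bearer_token(Sys.getenv("OPENAI_API_KEY")) |>
  req_body_json(list(
    model = "gpt-4o-mini",
    messages = list(list(role = "user", content = "Say hi"))
  )) |>
  req_retry(
    max_tries = 5,
    is_transient = \(resp) resp_status(resp) == 429
  )

resp <- req_perform(req)
resp_body_json(resp)$choices[[1]]$message$content
```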
- {aigenflow} - Enables you to create intelligent agents and orchestrate workflows with just a few lines of code, making advanced AI capabilities accessible to developers, data scientists, and researchers across diverse fields.
- {samesies} - A reliability tool for comparing the similarity of texts, factors, or numbers across two or more lists. The motivating use case is to evaluate the reliability of Large Language Model (LLM) responses across models, providers, or prompts
- {hellmer} - Enables sequential or parallel batch processing for chat models from ellmer.
- {querychat} (also in Python) - A drop-in component for Shiny that allows users to query a data frame using natural language. The results are available as a reactive data frame, so they can be easily used from Shiny outputs, reactive expressions, downloads, etc.
- {shinychat} - Shiny UI component for LLM apps
Example: Basic (source)
```r
library(shiny)
library(shinychat)

ui <- bslib::page_fluid(
  chat_ui("chat")
)

server <- function(input, output, session) {
  chat <- ellmer::chat_ollama(system_prompt = "You are a helpful assistant", model = "phi4")

  observeEvent(input$chat_user_input, {
    stream <- chat$stream_async(input$chat_user_input)
    chat_append("chat", stream)
  })
}

shinyApp(ui, server)
```
- {vitals} - A framework for large language model evaluation in R. It’s specifically aimed at ellmer users who want to measure the effectiveness of their LLM products
- Measure whether changes in your prompts or additions of new tools improve performance in your LLM product
- Compare how different models affect performance, cost, and/or latency of your LLM product
- Surface problematic behaviors in your LLM product
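Example: a sketch following the pattern in the vitals README; the dataset, model, and scorer choices are illustrative.

```r
# A Task bundles a dataset (input/target columns), a solver (the chat model
# under test), and a scorer (here, an LLM-graded QA check).
library(vitals)
library(ellmer)
library(tibble)

simple_qa <- tibble(
  input  = c("What's 2 + 2?", "Which base-R function fits a linear model?"),
  target = c("4", "lm()")
)

tsk <- Task$new(
  dataset = simple_qa,
  solver  = generate(chat_openai(model = "gpt-4o-mini")),
  scorer  = model_graded_qa()
)

tsk$eval()
```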
Misc
Resources
- LLM Leaderboard Collection: A useful collection of LLM leaderboards for finding the right LLM for your use case
- Awesome Generative AI Data Scientist - Curated list of LLM learning resources
- Awesome LLM Resources - Tracks tools, platforms, etc.
Benchmarks
Diagnostics
Notes from
Eval Frameworks (Up-to-date list)
- All offer the ability to run evals on your own dataset, work for multi-turn, RAG, and agents in some way or another, support LLM-as-a-judge, allow setting up custom metrics, and are CI-friendly.
- Frameworks use different names for essentially the same thing
- e.g. faithfulness in one may mean the same as groundedness in another. Answer relevancy may be the same as response relevance, and so on.
- Custom metrics are what developers set up the most, so don’t get stuck on who offers what metric. Your use case will be unique, and so will how you evaluate it.
- RAGAS
- Primarily built as a metric library for evaluating RAG applications (although they offer other metrics as well).
- Integrates with LangChain
- DeepEval
- Possibly the most comprehensive evaluation library with over 40 metrics available and the most complete, out-of-the-box suite of tools.
- Offers a framework called G-Eval, which helps you set up custom metrics quickly, making it the fastest way from idea to a runnable metric.
- Provides red teaming via their DeepTeam framework, which automates adversarial testing of LLM systems. There are other frameworks out there that do this too, although perhaps not as extensively.
- OpenAI
- A very lightweight solution that expects you to set up your own metrics, although they provide an example library to help you get started.
- Better suited when you want bespoke logic, not when you just need a quick judge.
- MLFlow’s Evals
- Primarily built to evaluate traditional ML pipelines, so the number of metrics they offer is lower for LLM-based apps.
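Whichever framework you pick, the custom metrics mentioned above usually reduce to an LLM-as-a-judge prompt plus a parser. A minimal sketch with {ellmer} (the rubric, scale, and model are illustrative):

```r
# Score an answer 0-1 for groundedness against the retrieved context by
# asking a judge model to return a bare number.
library(ellmer)

judge_groundedness <- function(question, answer, context) {
  judge <- chat_openai(model = "gpt-4o-mini")
  prompt <- paste0(
    "Rate from 0 to 1 how well the ANSWER is supported by the CONTEXT. ",
    "Reply with only the number.\n\n",
    "QUESTION: ", question, "\n",
    "CONTEXT: ", context, "\n",
    "ANSWER: ", answer
  )
  as.numeric(judge$chat(prompt))
}

judge_groundedness(
  question = "When was the package released?",
  answer   = "It was released in 2021.",
  context  = "The package first appeared on CRAN in March 2021."
)
```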
Application-specific
- Generic metrics are probably not enough on their own; you'll need a few custom ones for your use case, so the evals differ by application.
- For a coding copilot, you could track what percent of completions a developer accepts (acceptance rate) and whether the full chat reached the goal (completeness).
- For commerce agents, you might measure whether the agent picked the right products and whether answers are grounded in the store’s data.
- Security- and safety-related metrics, such as bias, toxicity, and how easy it is to break the system (jailbreaks, data leaks).
- Retrieval:
- Metrics
- Reference-free methods that use an LLM as a judge
- These use an LLM to judge how many of the chunks truly relevant to the query made it into the top-k list.
- RAGAS and DeepEval introduce metrics like Context Recall and Context Precision.
- Companies sample-audit with humans every few weeks to stay real.
- Precision@k measures the proportion of the top-k results that are relevant
- Recall@k measures how many of the relevant documents (per the gold reference answers) made it into the top-k results
- Hit@k measures whether at least one relevant document made it into the results
- Low scores \(\rightarrow\) tune the system by setting up better chunking strategies, changing the embedding model, adding techniques such as hybrid search and re-ranking, filtering with metadata, and similar approaches.
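A quick worked example of the reference-based metrics above in base R (the document IDs are made up):

```r
retrieved <- c("d7", "d2", "d9", "d4", "d1")  # top-k results, in rank order
relevant  <- c("d2", "d4", "d8")              # gold-standard relevant set
k <- 5
top_k <- head(retrieved, k)

precision_at_k <- sum(top_k %in% relevant) / k                 # 2/5 = 0.40
recall_at_k    <- sum(relevant %in% top_k) / length(relevant)  # 2/3 ~ 0.67
hit_at_k       <- as.integer(any(top_k %in% relevant))         # 1
```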
- Evaluation Datasets
- Mine questions from real logs and then use a human to curate them.
- Use dataset generators with the help of an LLM, which exist in most frameworks or as standalone tools like YourBench.
- Agents
- Tool Correctness - Does it pick the correct tool at the correct time?
- Use some kind of gold script with ground truth built in to test each run. Author that once and then use it each time you make changes.
- Task Completion - Does it move through the process and finish the goal?
- Read the entire trace and the goal, and return a number between 0 and 1 with a rationale. This should measure how effective the agent is at accomplishing the task.
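A minimal sketch of the gold-script idea in base R (the trace format and tool names are made up):

```r
# Compare the tools the agent actually called against a hand-authored
# gold script for the same task.
gold_tools   <- c("lookup_order", "check_inventory", "issue_refund")
actual_tools <- c("lookup_order", "issue_refund")

exact_match <- identical(actual_tools, gold_tools)  # strict: FALSE here
tool_recall <- mean(gold_tools %in% actual_tools)   # lenient: 2/3 of gold steps hit
```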
Use Cases
- For public-facing apps, LLMs should only be used to translate user input into a selection of the function(s) to be executed, and then to translate the output of those function(s) into language the user can understand. They should not be used to execute the logic themselves. (Apr 2025 article) A sketch of the pattern is below.
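A sketch of that pattern with {ellmer} and {jsonlite}: the LLM only maps the user's text onto a whitelisted function and its arguments, R executes the logic, and the LLM phrases the result. Function names, prompt wording, and the JSON contract are all illustrative.

```r
library(ellmer)
library(jsonlite)

# Business logic lives in plain R, not in the LLM.
order_status <- function(order_id) {
  list(order_id = order_id, status = "shipped")  # placeholder for a real lookup
}

chat <- chat_openai(model = "gpt-4o-mini")
user_input <- "Where is my order 1234?"

# 1. Translate user input into a function selection (JSON only).
routing <- chat$chat(paste0(
  'Return ONLY JSON like {"fn": "order_status", "args": {"order_id": "..."}}. ',
  "Allowed fn values: order_status. User message: ", user_input
))
call_spec <- fromJSON(routing)

# 2. Execute the selected function in R.
result <- switch(
  call_spec$fn,
  order_status = do.call(order_status, as.list(call_spec$args)),
  stop("Unknown function requested")
)

# 3. Translate the structured result back into plain language.
chat$chat(paste(
  "Explain this result to the user in one sentence:",
  toJSON(result, auto_unbox = TRUE)
))
```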
- Understanding code (can reduce cognitive load) (article)
- During code reviews or onboarding new programmers
- under-commented code
- Generating the code scaffold for a problem where you aren’t sure where or how to start solving it.
- LLMs don’t require removing stopwords during preprocessing of documents
Generate “Impossibility” List (source)
- “I suggest that people and organizations keep an ‘impossibility list’ - things that their experiments have shown that AI can definitely not do today but which it can almost do. . . . When AI models are updated, test them on your impossibility list to see if they can now do these impossible tasks.” - Ethan Mollick, “Gradually, then Suddenly: Upon the Threshold”
Description
What ChatGPT is:
“What would a response to this question sound like” machine
Researchers build (train) large language models like GPT-3 and GPT-4 by using a process called “unsupervised learning,” which means the data they use to train the model isn’t specially annotated or labeled. During this process, the model is fed a large body of text (millions of books, websites, articles, poems, transcripts, and other sources) and repeatedly tries to predict the next word in every sequence of words. If the model’s prediction is close to the actual next word, the neural network updates its parameters to reinforce the patterns that led to that prediction.
Conversely, if the prediction is incorrect, the model adjusts its parameters to improve its performance and tries again. This process of trial and error, through a technique called “backpropagation,” allows the model to learn from its mistakes and gradually improve its predictions during the training process. As a result, GPT learns statistical associations between words and related concepts in the data set.
In the current wave of GPT models, this core training (now often called “pre-training”) happens only once. After that, people can use the trained neural network in “inference mode,” which lets users feed an input into the trained network and get a result. During inference, the input sequence for the GPT model is always provided by a human, and it’s called a “prompt.” The prompt determines the model’s output, and altering the prompt even slightly can dramatically change what the model produces. Iterative prompting is limited by the size of the model’s “context window,” since each prompt is appended onto the previous prompt.
ChatGPT is different from vanilla GPT-3 because it has also been trained on transcripts of conversations written by humans. “We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant,”
ChatGPT has also been tuned more heavily than GPT-3 using a technique called “reinforcement learning from human feedback,” or RLHF, where human raters ranked ChatGPT’s responses in order of preference, then fed that information back into the model. This has allowed ChatGPT to produce coherent responses with fewer confabulations than the base model. Its relative factual accuracy can be attributed to the prevalence of accurate content in the data set, recognition of factual information in the results by humans, or reinforcement-learning guidance from humans that emphasizes certain factual responses.
There are two major types of falsehoods that LLMs like ChatGPT might produce. The first comes from inaccurate source material in the training data set, such as common misconceptions (e.g., “eating turkey makes you drowsy”). The second arises from making inferences about specific situations that are absent from its training material (data set); this falls under the aforementioned “hallucination” label.
Whether the GPT model makes a wild guess or not is based on a property that AI researchers call “temperature,” which is often characterized as a “creativity” setting. If the creativity is set high, the model will guess wildly; if it’s set low, it will spit out data deterministically based on its data set. If creativity is set low, “[It] answers ‘I don’t know’ all the time or only reads what is there in the Search results (also sometimes incorrect). What is missing is the tone of voice: it shouldn’t sound so confident in those situations.”
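A small sketch of turning that knob from R, assuming {ellmer}'s api_args pass-through to the provider (the parameter name, temperature, is OpenAI's; the model is illustrative):

```r
library(ellmer)

# Low temperature: near-deterministic, sticks close to the training data.
deterministic <- chat_openai(
  model = "gpt-4o-mini",
  api_args = list(temperature = 0)
)

# High temperature: wilder, more "creative" guesses.
creative <- chat_openai(
  model = "gpt-4o-mini",
  api_args = list(temperature = 1.2)
)

deterministic$chat("Complete the phrase: The capital of France is")
creative$chat("Complete the phrase: The capital of France is")
```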
In some ways, ChatGPT is a mirror: It gives you back what you feed it. If you feed it falsehoods, it will tend to agree with you and “think” along those lines. That’s why it’s important to start fresh with a new prompt when changing subjects or experiencing unwanted responses.
“One of the most actively researched approaches for increasing factuality in LLMs is retrieval augmentation—providing external documents to the model to use as sources and supporting context,” said Goodside. With that technique, he explained, researchers hope to teach models to use external search engines like Google, “citing reliable sources in their answers as a human researcher might, and rely less on the unreliable factual knowledge learned during model training.” Bing Chat and Google Bard do this already by roping in searches from the web, and soon, a browser-enabled version of ChatGPT will as well. Additionally, ChatGPT plugins aim to supplement GPT-4’s training data with information it retrieves from external sources, such as the web and purpose-built databases.
Other things that might help with hallucination include, “a more sophisticated data curation and the linking of the training data with ‘trust’ scores, using a method not unlike PageRank… It would also be possible to fine-tune the model to hedge when it is less confident in the response.” (arstechnica article)