General

Misc

  • Also see
  • Packages
    • {textpress} - A lightweight, versatile NLP package for R, focused on search-centric workflows with minimal dependencies and easy data-frame integration. Uses Huggingface API.
      • Web Search: Perform search engine queries to retrieve relevant URLs.
      • Web Scraping: Extract URL content, including some relevant metadata.
      • Text Processing & Chunking: Segment text into meaningful units, e.g., sentences, paragraphs, and larger chunks. Designed to support tasks related to retrieval-augmented generation (RAG).
      • Corpus Search: Perform keyword, phrase, and pattern-based searches across processed corpora, supporting both traditional in-context search techniques (e.g., KWIC, regex matching) and advanced semantic searches using embeddings.
      • Embedding Generation: Generate embeddings using the HuggingFace API for enhanced semantic search.
    • {mall} - Text analysis using the rows of a dataframe along with a pre-determined, one-shot prompt (which depends on the function). The prompt + row gets sent to an Ollama LLM for the prediction
      • Also available in Python
      • Features
        • Sentiment analysis
        • Text summarizing
        • Classify text
        • Extract one or several specific pieces of information from the text
        • Translate text
        • Verify that something is true about the text (binary)
        • Custom prompt
  • Use cases for discovering hyper/hyponym relationships
    • Taxonomy prediction: identifying broader categories for the terms, building taxonomy relations (like WikiData GraphAPI)
    • Information extraction (IE): automated retrieval of specific information from text relies heavily on the relations to the searched entities.
    • Dataset creation: advanced models need labeled examples to learn to identify the relationships between entities.
  • Baseline Models
    • Regularized Logistic Regression + Bag-of-Words (BoW) (recommended by Raschka)
      • Example: py, kaggle notebook (sentiment analysis)
        • Uses a Compressed Sparse Row (CSR) sparse matrix
        • Uses time-series CV folds and scores by precision
          • Precision because “Amazon would be more concerned about the products with negative reviews rather than positive reviews”
        • Says random search > grid search with regularized regression
        • Also fits a model with Tf-idf instead of BoW
      • Preprocess
        • Tokenize
        • Remove stopwords and punctuation
        • Stem
        • Unigrams and bigrams
        • Sparse token count matrix
        • Normalize the matrix
      • Model with regularized logistic regression (a minimal scikit-learn sketch follows at the end of this section)
  • GPT-4
    • Accepts prompts of 25,000 words (GPT-3 accepted 1500-2000 words)
    • Allegedly around 1T parameters (GPT-3 had 175B parameters)
    • Some use cases: translation, q/a, text summaries, writing/getting news, creative writing
    • Multi-modal training data (i.e. text and audio, pictures, etc.)
    • Still hallucinates
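
Returning to the baseline model above: the recipe (unigram/bigram BoW counts in a sparse matrix, regularized logistic regression, random search over time-series CV folds scored by precision) can be sketched with scikit-learn roughly as below. This is a minimal illustration, not the Kaggle notebook's code; the toy reviews and hyperparameter range are made up, and the stemming step is omitted for brevity.

```python
# Minimal sketch of the BoW + regularized logistic regression baseline (scikit-learn).
# Toy data and hyperparameter ranges are illustrative; stemming is omitted for brevity.
from scipy.stats import loguniform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline

docs = [
    "great product, works well", "arrived broken, very disappointed",
    "love it, highly recommend", "terrible quality, do not buy",
    "exactly as described", "stopped working after a week",
    "fast shipping and solid build", "waste of money",
    "my favorite purchase this year", "cheap plastic, felt flimsy",
    "does the job perfectly", "customer service was useless",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

pipe = Pipeline([
    # Tokenize, drop English stopwords, build unigram + bigram counts;
    # the result is a CSR sparse matrix (swap in TfidfVectorizer for the TF-IDF variant)
    ("bow", CountVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("clf", LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000)),
])

# Random search (rather than grid search) over the regularization strength C,
# using time-series CV folds and precision as the scoring metric
search = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="precision",
    random_state=0,
)
search.fit(docs, labels)
print(search.best_params_, round(search.best_score_, 3))
```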

Terms

  • Flood Words - Words that are too common in the domain (i.e. noise)
  • Hypernym - A word with a broad meaning constituting a category into which words with more specific meanings fall
    • Example: A device can use multiple storage units such as a hard drive or CD
      • hyponyms of storage units: hard drive, CD
      • hypernym of hard drive/cd: storage units
  • Hyponym - Opposite of Hypernym; a word of more specific meaning than a general term applicable to it.
  • Named Entity Recognition (NER) - A subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
    • Example: An automated NER system will identify the type of incoming customer request (e.g. installation, maintenance, complaint, or troubleshooting of a particular product) and send it to the respective support desk
    • aka Named Entity Identification, Entity Chunking, and Entity Extraction
    • Other use cases: filtering resumés, diagnosing patients based on symptoms in healthcare data (see the spaCy sketch at the end of this section)
  • Sequence to Sequence (aka String Transduction) - problems where both the input and output are text
    • e.g. Text summarization, Text simplification, Question answering, Chatbots, Machine translation
  • Spam Words - Words that don’t belong in the domain (i.e. noise)
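
As a concrete illustration of the NER term above, a minimal spaCy example (spaCy is also referenced in the Hearst Patterns notes below). It assumes the small English model has been downloaded; the sentence and the expected labels are illustrative.

```python
# Minimal NER example with spaCy; assumes the model has been installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp opened a new support desk in Berlin on Tuesday to handle installation requests.")

# Each detected entity carries its text span and a pre-defined category label
for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected output (exact labels depend on the model version):
#   Acme Corp ORG
#   Berlin GPE
#   Tuesday DATE
```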

Hearst Patterns

  • A set of lexico-syntactic text patterns that can be employed to extract Hypernyms and Hyponyms from text.
  • In the table, X is the hypernym and Y is the hyponym
  • “rhyper” stands for reverse-hyper
  • Usually, you don’t want to extract all possible hyponym relations, but only those involving entities in the specific domain
  • Packages: spaCy - example (a minimal regex sketch also follows below)
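
A minimal regex sketch of one classic Hearst pattern, "X such as Y (or Z)", where X is the hypernym and Y/Z are the hyponyms. A real implementation (such as the spaCy example linked above) matches on noun phrases rather than raw word spans; the pattern and sentence here are only illustrative.

```python
# Minimal Hearst-pattern sketch: extract "X such as Y (or/and Z)" pairs with a regex.
# X = hypernym, Y/Z = hyponyms. Illustrative only; a proper implementation would
# operate on noun phrases (e.g. via spaCy) instead of raw word spans.
import re

SUCH_AS = re.compile(
    r"(?P<hyper>\w+(?: \w+)?)\s+such as\s+(?P<hypos>[^.;]+)",
    re.IGNORECASE,
)

text = "A device can use multiple storage units such as a hard drive or CD."

for m in SUCH_AS.finditer(text):
    hypernym = m.group("hyper")
    hyponyms = [
        re.sub(r"^(?:a|an|the)\s+", "", h.strip())          # drop leading articles
        for h in re.split(r",| and | or ", m.group("hypos"))
        if h.strip()
    ]
    print(hypernym, "->", hyponyms)
# storage units -> ['hard drive', 'CD']
```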

Use Cases

  • Creating labels for text that can later be used for supervised learning tasks like classification
  • Create metadata for columns in datasets (i.e. data dictionary)
    • Predicting Metadata for Humanitarian Datasets Using GPT-3
      • Prompt: Column name; sample of data from that column.
      • Completion: metadata, i.e. a tag that includes column attributes (see the hypothetical prompt sketch at the end of this section).
      • Potential Improvements
        • Trying other models (‘ada’ used here) to see if this improves performance (though it will cost more)
        • Model hyperparameter tuning. The log probability cutoff will likely be very important
        • More prompt engineering; perhaps including the table’s full column list might provide better context, as well as overlaying columns on two-row-header tables.
        • More preprocessing. Not much was done for this article, blindly taking tables extracted from CSV files, so the data can be a bit messy
    • Using GPT-3.5-Turbo and GPT-4 for Predicting Humanitarian Data Categories
      • GPT-4 resulted in 96% accuracy when predicting category and 89% accuracy when predicting both category and sub-category.
        • GPT-3.5-turbo was tried with the same prompts and achieved only 66% accuracy for category (versus GPT-4’s 96%).
      • Limitations exist due to the maximum number of tokens allowed in prompts affecting the amount of data that can be included in data excerpts, as well as performance and cost challenges — especially if you’re a small non-profit! — at this early stage of commercial generative AI.
      • Likely because it was an early preview, GPT-4 performance was very slow, taking 20 seconds per prompt to complete
  • Generalizations from OSINT (article)
    • Note: most of these use a combination of the others to improve their overall performance; e.g., a document clustering model might use topic extraction and NER to improve the quality of its clusters.
    • Recommendation Engines: Bespoke recommendation engines can be fine-tuned to recommend related documentation that may not be within the user’s immediate sphere of interest, but still relevant.
      • Enables the discovery of “unknown unknowns”.
    • Topic extraction and Document clustering: generate topics from multiple texts and detect similarities between documents published by dozens, sometimes hundreds, of information feeds.
      • You don’t have time to read every single document to get a high-level view of the main issues evolving within your multiple information feeds
    • Named (and unnamed) Entity Extraction and Disambiguation (NER / NED) (see Terms): identifying and categorizing named entities
      • The extraction part involves locating and tagging entities, while the disambiguation part involves determining the correct identity or meaning of an entity, especially when it can have multiple interpretations or references in a text.
      • Allows you to build entire NLP pipelines that keep track of meaningful facts about an entity, ordered by timeliness and relevance. This will allow you to start building bespoke, expert-curated profiles.
    • Relationship Extraction: identify the nature and type of relationships between different entities, such as individuals, organizations, and locations, and to represent them in a structured format that can be easily analyzed and interpreted.
      • Generating accurate connections across thousands of documents can build expert-driven, queryable knowledge graphs in a matter of days
    • Multi-document abstractive summarization: automatically generate a concise and coherent summary of multiple documents on a given topic by creating new sentences that capture the most important information from the original texts.
      • Enables users to obtain a concise and coherent summary of the most important information from a large amount of text data.
  • Search Engine for Docs
    • How I Turned My Company’s Docs into a Searchable Database with OpenAI
    • Steps (a minimal embedding + vector search sketch follows at the end of this section)
      • Converted all of the docs to a unified format
        • Converts all html files to markdown
      • Split docs into blocks and added some automated cleanup
        • Links to code for parsing markdown files to get text and code blocks
      • Computed embeddings for each block
      • Generated a vector index from these embeddings
        • Used Qdrant for the vector embeddings storage
      • Defined the index query
        • Added the ability for the user to search only text or only code, and included other metadata in the output
      • Wrapped it all in a user-friendly command line interface and Python API
    • Potential extensions of the project
      • Hybrid search: combine vector search with traditional keyword search
      • Go global: Use Qdrant Cloud to store and query the collection in the cloud
      • Incorporate web data: use requests to download HTML directly from the web
      • Automate updates: use GitHub Actions to trigger recomputation of embeddings whenever the underlying docs change
      • Embed: wrap this in a JavaScript element and drop it in as a replacement for a traditional search bar
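
A minimal sketch of the core of the doc-search workflow above: embed each block, index the embeddings in Qdrant, then answer a query by similarity search. The article uses OpenAI embeddings; a local sentence-transformers model is substituted here so the sketch is self-contained, and the blocks and payload fields are made up.

```python
# Minimal sketch of the doc-search pipeline: embed text blocks, index them in Qdrant,
# then query by embedding similarity. The article uses OpenAI embeddings; a local
# sentence-transformers model is substituted here to keep the sketch self-contained.
# The blocks and payload fields are illustrative, not the article's actual data.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

blocks = [
    {"text": "Install the package with pip.", "type": "text"},
    {"text": "The Dataset class wraps samples and labels.", "type": "text"},
    {"text": "dataset.export(output_dir) writes results to disk.", "type": "code"},
]

client = QdrantClient(":memory:")  # in-memory collection; Qdrant Cloud would replace this
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=model.encode(b["text"]).tolist(), payload=b)
        for i, b in enumerate(blocks)
    ],
)

# Query: embed the question and return the most similar blocks plus their metadata;
# the payload's "type" field is what would support text-only vs. code-only filtering
hits = client.search(
    collection_name="docs",
    query_vector=model.encode("how do I save results?").tolist(),
    limit=2,
)
for hit in hits:
    print(round(hit.score, 3), hit.payload["text"])
```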
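
For the metadata-prediction use case earlier in this section, the prompt/completion structure might look roughly like this hypothetical sketch using the OpenAI Python client. The prompt wording, model name, column, and tag vocabulary are all illustrative and not taken from the articles.

```python
# Hypothetical sketch of the prompt/completion setup: send a column name plus sample
# values and ask the model for a metadata tag. The prompt wording, tag set, and model
# name are illustrative, not the articles' actual setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

column_name = "adm1_name"                      # illustrative column
sample_values = ["Kabul", "Herat", "Balkh"]    # illustrative sample of rows

prompt = (
    "You label dataset columns with a short metadata tag describing what they contain "
    "(e.g. location, date, population, organisation).\n"
    f"Column name: {column_name}\n"
    f"Sample values: {', '.join(sample_values)}\n"
    "Tag:"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content.strip())
```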