NLP

Misc

Coherence Measures

  • Their goal is to evaluate topic quality with respect to interpretability

  • Misc

    • Paper uses Twitter data to show the poor performance of these measures (paper)
  • Normalized Pointwise Mutual Information (NPMI)

    \[ \mbox{NPMI}(x, y) = \frac{\log\left(p(x)\,p(y)\right)}{\log p(x,y)} - 1 = \frac{\log \frac{p(x,y)}{p(x)\,p(y)}}{-\log p(x,y)} \]

    • From the paper “Normalized Pointwise Mutual Information in Collocation Extraction” (Bouma, 2009)
    • Estimates how much more likely the co-occurrence of two words, x and y, is than we would expect by chance
    • Range: -1 to 1
    • NPMI = 0 means independence between the occurrences of x and y
      • This makes it sound like a correlation measure
    • If this is a correlation-type measure, then I’m guessing you want something close to 1 for each topic, as that would imply all the words are likely to appear together (see the computation sketch below)
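    • A minimal computation sketch of NPMI from co-occurrence counts (the counts and the npmi() helper here are hypothetical, purely to illustrate the formula):

        # NPMI = PMI / -log(p(x,y)), which equals log(p(x)p(y)) / log(p(x,y)) - 1
        npmi <- function(n_x, n_y, n_xy, n_total) {
            p_x  <- n_x  / n_total  # marginal probability of x
            p_y  <- n_y  / n_total  # marginal probability of y
            p_xy <- n_xy / n_total  # joint (co-occurrence) probability
            log(p_xy / (p_x * p_y)) / (-log(p_xy))
        }
        npmi(n_x = 50, n_y = 40, n_xy = 30, n_total = 1000) # ~0.77: strongly associated pair
        npmi(n_x = 50, n_y = 40, n_xy = 2,  n_total = 1000) # 0: independent pair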

Explore Predictions

  • Scenario: Feature importance from a text model indicates that a word or phrase from a text predictor variable is highly predictive of the outcome variable
    • Explore which values of the outcome variable are associated with high tf-idf values for that word or phrase
      • Example

        • The token “course” from the text variable “improvements” scores high in feature importance when predicting the course satisfaction rating
        • All text variables were tokenized, n-gram engineered, and the resulting features assigned tf-idf scores
        # Apply the trained text recipe to the test set, then compare the
        # token's tf-idf scores across levels of the outcome variable
        bake(prep(text_recipe), new_data = testing(splits)) %>%
            select(tfidf_improvements_course) %>% # tfidf naming format is tfidf_textColumn_token
            bind_cols(
                testing(splits) %>% select(satisfaction_rating)
            ) %>%
            group_by(satisfaction_rating) %>%
            summarize(mean_tfidf_course = mean(tfidf_improvements_course)) %>%
            ungroup()
        • Interpretation
          • Customers that give a satisfaction rating (the outcome variable) of 6 use the word “course” a lot (i.e. a higher mean tf-idf score)
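        • To eyeball the relationship, a quick plotting sketch (assumes the summarized tibble above was saved as tfidf_by_rating; that name is hypothetical):

            library(ggplot2)

            # Bar chart of the mean tf-idf of "course" at each satisfaction rating
            ggplot(tfidf_by_rating, aes(x = factor(satisfaction_rating), y = mean_tfidf_course)) +
                geom_col() +
                labs(x = "Satisfaction Rating (outcome)", y = "Mean tf-idf of 'course'")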

Behavioral Tests

  • Robustness Criteria (see the test sketch after this list)
    • Sex/Ethnicity Bias - does your model discriminate against males/females or a specific nationality?
    • Equivalent Words/Synonyms - If a candidate replaces “Good python knowledge” with “good python 3 knowledge” how does your model react?
    • Skill Grading - Does your model assign a higher score for “very good knowledge” vs. “good knowledge” vs. “basic knowledge”? Are adjectives adequately understood? Candidates with “exceptional skill” should not be rated below ones with “basic skill”.
    • Sentence Ordering - If we reverse the order of job experience, is the model prediction consistent?
    • Typos - I’ve seen a lot of models where a typo in a completely unimportant word changed the model prediction completely. We may argue that job applications should not contain typos, but we can all agree that, in general, this is an issue in NLP.
    • Negations - I know these are difficult, but if your task requires understanding them, do you measure it? (For example, “I have no criminal records” vs. “I have criminal records”, or “I finished” vs. “I did not finish”.) How about double negations?
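  • A minimal sketch of an invariance test and a directional test (assumes a fitted tidymodels workflow, fit_wf, that predicts a numeric score; the object names and the 0.05 tolerance are hypothetical):

    library(dplyr)

    # Invariance: an equivalent phrasing or a typo in an unimportant word
    # should barely move the predicted score
    original  <- tibble(text = "Good python knowledge")
    perturbed <- tibble(text = c("good python 3 knowledge", # equivalent wording
                                 "Good pyton knowledge"))   # typo
    score_orig <- predict(fit_wf, original)$.pred
    score_pert <- predict(fit_wf, perturbed)$.pred
    abs(score_pert - score_orig) < 0.05 # TRUE = passes the invariance check

    # Directional expectation (Skill Grading): stronger adjectives should score higher
    grades <- tibble(text = c("basic knowledge", "good knowledge", "very good knowledge"))
    scores <- predict(fit_wf, grades)$.pred
    all(diff(scores) > 0) # TRUE = score increases with claimed skill level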