NLP

Misc

Coherence Measures

  • Their goal is to evaluate topic quality with respect to interpretability

  • Misc

    • Paper uses Twitter data to show the poor performance of these measures (paper)
  • Normalized Pointwise Mutual Information (NPMI)

    \[ \mbox{NPMI}(x, y) = \frac{\log\left(p(x)\,p(y)\right)}{\log p(x,y)} - 1 = \frac{\log \frac{p(x,y)}{p(x)\,p(y)}}{-\log p(x,y)} \]

    • From the paper “Normalized Pointwise Mutual Information in Collocation Extraction” (Bouma, 2009)
    • Estimates how much more likely the co-occurrence of two words, x and y, is than we would expect by chance
    • Range: -1 to 1
    • NPMI = 0 means independence between the occurrences of x and y
      • This makes it sound like a correlation measure
    • If this is a correlation-type measure, then I’m guessing you want something close to 1 for each topic, as that would imply all the words are likely to appear together (see the computation sketch below)
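    • A minimal computation sketch of NPMI from co-occurrence counts (the counts and the npmi() helper here are hypothetical, purely to illustrate the formula):

        # NPMI = PMI / -log(p(x,y)), which equals log(p(x)p(y)) / log(p(x,y)) - 1
        npmi <- function(n_x, n_y, n_xy, n_total) {
            p_x  <- n_x  / n_total  # marginal probability of x
            p_y  <- n_y  / n_total  # marginal probability of y
            p_xy <- n_xy / n_total  # joint (co-occurrence) probability
            log(p_xy / (p_x * p_y)) / (-log(p_xy))
        }
        npmi(n_x = 50, n_y = 40, n_xy = 30, n_total = 1000) # ~0.77: strongly associated pair
        npmi(n_x = 50, n_y = 40, n_xy = 2,  n_total = 1000) # 0: independent pair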

Explore Predictions

  • Scenario: Feature importance from a text model indicates that a word or phrase from a text predictor variable is highly predictive of the outcome variable
    • Explore which values of the outcome variable are associated with high tf-idf values for that word or phrase
      • Example

        • The token “course” from the text variable “improvements” scores high in feature importance when predicting the course satisfaction rating
        • All text variables were tokenized, n-gram engineered, and the resulting features assigned tf-idf scores
        # Apply the trained text recipe to the test set, then compare the
        # token's tf-idf scores across levels of the outcome variable
        bake(prep(text_recipe), new_data = testing(splits)) %>%
            select(tfidf_improvements_course) %>% # tfidf naming format is tfidf_textColumn_token
            bind_cols(
                testing(splits) %>% select(satisfaction_rating)
            ) %>%
            group_by(satisfaction_rating) %>%
            summarize(mean_tfidf_course = mean(tfidf_improvements_course)) %>%
            ungroup()
        • Interpretation
          • Customers that give a satisfaction rating (the outcome variable) of 6 use the word “course” a lot (i.e. a higher mean tf-idf score)
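        • To eyeball the relationship, a quick plotting sketch (assumes the summarized tibble above was saved as tfidf_by_rating; that name is hypothetical):

            library(ggplot2)

            # Bar chart of the mean tf-idf of "course" at each satisfaction rating
            ggplot(tfidf_by_rating, aes(x = factor(satisfaction_rating), y = mean_tfidf_course)) +
                geom_col() +
                labs(x = "Satisfaction Rating (outcome)", y = "Mean tf-idf of 'course'")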

Behavioral Tests

  • Robustness Criteria (see the test sketch after this list)
    • Sex/Ethnicity Bias - does your model discriminate against males/females or a specific nationality?
    • Equivalent Words/Synonyms - If a candidate replaces “Good python knowledge” with “good python 3 knowledge” how does your model react?
    • Skill Grading - Does your model assign a higher score for “very good knowledge” vs. “good knowledge” vs. “basic knowledge”? Are adjectives adequately understood? Candidates with “exceptional skill” should not be rated below ones with “basic skill”.
    • Sentence Ordering - If we reverse the order of job experience, is the model prediction consistent?
    • Typos - I’ve seen a lot of models where a typo in a completely unimportant word changed the model prediction completely. We may argue that job applications should not contain typos, but we can all agree that, in general, this is an issue in NLP.
    • Negations - I know these are difficult, but if your task requires understanding them, do you measure it? (For example, “I have no criminal records” vs. “I have criminal records”, or “I finished” vs. “I did not finish”.) How about double negations?
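  • A minimal sketch of an invariance test and a directional test (assumes a fitted tidymodels workflow, fit_wf, that predicts a numeric score; the object names and the 0.05 tolerance are hypothetical):

    library(dplyr)

    # Invariance: an equivalent phrasing or a typo in an unimportant word
    # should barely move the predicted score
    original  <- tibble(text = "Good python knowledge")
    perturbed <- tibble(text = c("good python 3 knowledge", # equivalent wording
                                 "Good pyton knowledge"))   # typo
    score_orig <- predict(fit_wf, original)$.pred
    score_pert <- predict(fit_wf, perturbed)$.pred
    abs(score_pert - score_orig) < 0.05 # TRUE = passes the invariance check

    # Directional expectation (Skill Grading): stronger adjectives should score higher
    grades <- tibble(text = c("basic knowledge", "good knowledge", "very good knowledge"))
    scores <- predict(fit_wf, grades)$.pred
    all(diff(scores) > 0) # TRUE = score increases with claimed skill level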