AI governance and threat modelling
AI/ML applications bring with them a new class of problems and solutions. These are my notes from learning about this topic.
last update: 2024-11-19
How are ML systems measured?
During training, this is measured by the loss and the accuracy of the model's predictions.
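A minimal sketch of those two training-time metrics, assuming a PyTorch-style classifier; the logits and labels below are toy placeholders, not part of any real system:

```python
import torch
import torch.nn.functional as F

def loss_and_accuracy(logits: torch.Tensor, labels: torch.Tensor):
    """logits: (batch, num_classes), labels: (batch,) with class indices."""
    loss = F.cross_entropy(logits, labels)       # what the optimiser minimises, averaged over the batch
    preds = logits.argmax(dim=-1)                # most likely class per example
    accuracy = (preds == labels).float().mean()  # fraction of correct predictions
    return loss, accuracy

# toy batch: 4 samples, 3 classes
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
loss, acc = loss_and_accuracy(logits, labels)
print(f"loss={loss.item():.3f} accuracy={acc.item():.2f}")
```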
For LLMs there are benchmarks like the following (the multiple-choice ones are typically scored as in the sketch after this list):
ARC (AI2 Reasoning Challenge)
- A knowledge and reasoning benchmark, as opposed to Q&A and inference datasets like SQuAD (Stanford Question Answering Dataset) and SNLI (Stanford Natural Language Inference).
- Evidence is distributed across sentences: the facts needed to answer are spread over several sentences rather than contained in a single passage.
API-Bank, for evaluating how well an LLM uses external tools (APIs).
HellaSwag, which evaluates the common sense of LLMs via sentence completion with adversarially filtered (“adversarial”) endings.
GLUE (General Language Understanding Evaluation) and SuperGLUE, and MMLU (Massive Multitask Language Understanding).
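A rough sketch of how multiple-choice benchmarks such as ARC or HellaSwag are commonly scored: the model assigns a log-likelihood to each candidate ending, and its answer is the highest-scoring one. The model name, question, and endings below are illustrative assumptions, not taken from any benchmark harness:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def ending_log_likelihood(context: str, ending: str) -> float:
    """Sum of log-probabilities the model assigns to the ending tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-prob of each token given the tokens before it
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the tokens belonging to the ending (approximate split at the
    # context length; real harnesses align the tokenisation more carefully)
    return token_ll[:, ctx_ids.shape[1] - 1:].sum().item()

question = "A ball is dropped from a tall building. As it falls, its speed"
endings = ["increases.", "decreases.", "stays the same.", "becomes zero."]
scores = [ending_log_likelihood(question + " ", e) for e in endings]
print("model picks:", endings[scores.index(max(scores))])
```

Accuracy on such a benchmark is then just the fraction of questions where the top-scoring ending matches the labelled answer.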