Journal Club: JudgeLM: Fine-tuned Large Language Models are Scalable Judges
To automate the evaluation of natural language generation and avoid the cost of repeatedly prompting large closed-source LLMs such as GPT-4, Zhu et al. fine-tune open-source LLMs (at 7B, 13B, and 33B parameters) to serve as dedicated judges. The key components of the fine-tuning recipe are data construction (a judging dataset of seed tasks, LLM-generated answers, and GPT-4 judgments as labels) and bias mitigation (swap augmentation, reference support, and reference drop to counter position, knowledge, and format biases in the judge). JudgeLM's verdicts are then assessed for agreement with GPT-4's judgments and with human annotations.
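The headline metric in this evaluation is agreement: the fraction of answer pairs on which the fine-tuned judge's win/tie/lose verdict matches the reference judge's (GPT-4 or a human). Below is a minimal sketch of that computation; the score pairs, the verdict helper, and the tie convention are illustrative assumptions, not the paper's code.

    # Sketch: agreement between a fine-tuned judge and a reference judge
    # on pairwise verdicts. Scores and the tie convention are illustrative
    # assumptions; the paper reports agreement as the fraction of samples
    # with matching win/tie/lose verdicts.

    from typing import List, Tuple

    def verdict(scores: Tuple[float, float]) -> str:
        """Map an (answer 1, answer 2) score pair to a win/tie/lose verdict."""
        s1, s2 = scores
        if s1 > s2:
            return "win"   # answer 1 preferred
        if s1 < s2:
            return "lose"  # answer 2 preferred
        return "tie"

    def agreement(judge: List[Tuple[float, float]],
                  reference: List[Tuple[float, float]]) -> float:
        """Fraction of samples where both judges reach the same verdict."""
        matches = sum(verdict(a) == verdict(b) for a, b in zip(judge, reference))
        return matches / len(judge)

    # Toy example: three evaluated answer pairs (hypothetical scores).
    judgelm_scores = [(8.0, 6.0), (5.0, 5.0), (4.0, 9.0)]
    gpt4_scores    = [(9.0, 7.0), (6.0, 4.0), (3.0, 8.0)]
    print(f"Agreement: {agreement(judgelm_scores, gpt4_scores):.0%}")  # 67%

Note that comparing verdicts rather than raw scores makes the metric robust to judges that use different scoring scales, which is one reason pairwise agreement is a natural way to compare a fine-tuned judge against GPT-4.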
Slides from the journal club presentation: Technical Paper Deep Dive: JudgeLM

Zhu et al. 2023: https://arxiv.org/abs/2310.17631