Evaluating LLMs for my personal use case
August 23, 2025
software ai evals
My life is not a math Olympiad