# AI-Assisted Generation of Difficult Math Questions
[@shahAIAssistedGenerationDifficult2025]
## Abstract
Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage the metacognitive skills [Didolkar et al., 2024] of a strong LLM to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. Requiring two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multi-turn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline to skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH$^2$, a dataset of higher-quality math questions, as evidenced by: (a) lower performance of all models on MATH$^2$ than on MATH, and (b) higher performance on MATH when using MATH$^2$ questions as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight. Also of interest is a striking relationship observed between models' performance on the new dataset: the success rate on MATH$^2$ is approximately the square of the success rate on MATH, suggesting that successfully solving a question in MATH$^2$ requires a nontrivial combination of two distinct math skills.
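A back-of-the-envelope reading of the final observation, under an independence assumption that is mine rather than the paper's:

```latex
% Let p be a model's success rate on MATH and q its success rate on MATH^2.
% If each MATH^2 question requires two distinct skills, each applied correctly
% and (roughly) independently with probability p, then
q \;\approx\; p \cdot p \;=\; p^{2}
```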
## Notes
- Challenges: difficulty and diversity.
- Most generated questions are intended for fine-tuning the model; only a few are used for actual evaluation.
- Asking the model to first make a *defeatist approach* attempt at solving the problem can improve the quality of generation.
- Using generated question/answer pairs as in-context few-shot exemplars and assessing the resulting performance improvement of models on an existing dataset is a good way of gauging generation quality (see the sketch after this list).
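A minimal sketch of that exemplar-based quality check, assuming a generic `llm_answer(prompt)` callable, a test set of (question, reference) pairs, and naive exact-match grading; all names are hypothetical, not from the paper:

```python
# Minimal sketch: measure whether generated Q/A pairs help as few-shot exemplars.

def build_prompt(exemplars, question):
    """Format k generated (question, answer) pairs as in-context examples."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in exemplars)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

def accuracy(llm_answer, exemplars, test_set):
    """Fraction of test questions answered correctly given these exemplars."""
    correct = 0
    for question, reference in test_set:
        prediction = llm_answer(build_prompt(exemplars, question))
        correct += prediction.strip() == reference.strip()  # naive exact match
    return correct / len(test_set)

# Quality signal: exemplars drawn from the generated set should raise accuracy
# on MATH relative to exemplars drawn from MATH itself (or a no-exemplar baseline):
# gain = accuracy(llm_answer, generated_pairs, math_test_set) \
#        - accuracy(llm_answer, baseline_pairs, math_test_set)
```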
## Annotations
> However, recent research (Arora & Goyal, 2023; Didolkar et al., 2024) has demonstrated that top LLMs possess a robust understanding of mathematical skills, including the capability to identify the skills required to solve given questions (Reid et al., 2024; Achiam et al., 2023).
> While leading models could produce creative math questions when provided with a list of skills, the majority of these questions exhibited one or more of the following shortcomings: too similar to existing questions in datasets; have errors or nonsensical elements; are too tedious or mechanical to be engaging for human annotators.
> Starting with a list of mathematical skills extracted from the MATH dataset using recently discovered methods (Didolkar et al., 2024), we focused on creating questions that involve one skill from pre-algebra and algebra portions of the MATH dataset and one other skill randomly sampled from different sections of MATH.
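A rough sketch of the skill-pairing step described in this quote; the section names and skill lists below are illustrative placeholders, not taken from the paper:

```python
# Sketch of pairing one pre-algebra/algebra skill with a skill from another section.
import random

skills_by_section = {
    "prealgebra_algebra": ["ratio_and_proportion", "linear_equations"],
    "number_theory": ["modular_arithmetic"],
    "geometry": ["circle_properties"],
    "counting_probability": ["inclusion_exclusion"],
}

def sample_skill_pair(rng=random):
    """Pick one pre-algebra/algebra skill and one skill from a different section."""
    skill_a = rng.choice(skills_by_section["prealgebra_algebra"])
    other_sections = [s for s in skills_by_section if s != "prealgebra_algebra"]
    skill_b = rng.choice(skills_by_section[rng.choice(other_sections)])
    return skill_a, skill_b

def generation_prompt(skill_a, skill_b):
    """Prompt asking the LLM for one question that genuinely needs both skills."""
    return (
        f"Write a challenging math question that requires combining the skills "
        f"'{skill_a}' and '{skill_b}'. Provide a complete solution and final answer."
    )
```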
> In our AI-assisted process, human experts played a crucial role. Using the (question, answer) pairs generated by LLMs and leveraging API access to leading models, experts identified promising questions—often those incorrectly answered by the LLMs but containing many correct ideas. These experts were graduate students pursuing computer science programs at leading universities.
> A possible test for the quality of a Q&A pair on similar topics as MATH dataset is whether performance on MATH improves when using these as in-context exemplars.
> Notably, the inclusion of the attempted solution and question validation steps significantly enhanced the pipeline’s effectiveness.
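A hedged sketch of how the attempted-solution ("defeatist") and question-validation steps might slot into the generation loop, assuming a generic `llm(prompt)` callable; the prompts and function names are assumptions, not the paper's exact pipeline:

```python
# Rough sketch of the multi-turn loop: generate -> attempt a solution ->
# validate -> keep for human review or retry.

def generate_candidate(llm, skill_a, skill_b, max_rounds=3):
    for _ in range(max_rounds):
        question = llm(
            f"Pose a difficult math question that requires both '{skill_a}' "
            f"and '{skill_b}'. State only the question."
        )
        # Attempted-solution ("defeatist") step: have the model try to solve its
        # own question; failed or partial attempts are informative signals.
        attempt = llm(f"Attempt to solve, showing all steps:\n{question}")
        # Validation step: check the question is well-posed and truly uses both skills.
        verdict = llm(
            "Does the following question have a unique, well-defined answer and "
            f"genuinely require both '{skill_a}' and '{skill_b}'? Answer YES or NO.\n"
            f"{question}"
        )
        if verdict.strip().upper().startswith("YES"):
            return {"question": question, "attempt": attempt}  # pass to human annotators
    return None  # discard after repeated failures
```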