The Challenge of Human-Like Abstraction in Contemporary AI
Abstract
Contemporary AI models have matched or exceeded human performance on many benchmarks meant to assess general human-like reasoning abilities, including the prominent Abstraction and Reasoning Corpus (ARC). However, it is often unclear whether these models achieve high accuracy by reasoning with the abstractions these benchmarks were designed to evaluate, or through other non-human-like strategies that focus on surface-level patterns. Here, we articulate cognitive-science-inspired evaluation principles to investigate the abstraction abilities of AI models and human participants. As a case study, we use ConceptARC, a benchmark in the ARC domain that assesses abstract reasoning using isolated “core-knowledge” concepts. In addition to measuring accuracy, we evaluate the natural-language rules that models and humans generate to explain their solutions, allowing us to distinguish between solutions using intended abstractions and those relying on surface-level patterns. While some models exceed human accuracy on textual versions of the tasks, their rules are substantially less likely than human-generated rules to capture intended abstractions. When given visual inputs, the accuracy of these models decreases dramatically; in numerous cases they are able to abstract a correct rule but fail to apply it to form a correct output. These findings illustrate that evaluations based on accuracy alone are not reliable indicators of a model's general capabilities, and that humans still exhibit a greater propensity for abstract reasoning than AI models. The evaluation principles we articulate can provide a more rigorous assessment of AI models' capabilities than measures based solely on accuracy.
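The abstract's point that accuracy should be read alongside the quality of the stated rule can be sketched as a small cross-tabulation. This is only an illustration, not the authors' analysis code: the records are invented toy values, and the field names `correct` and `rule_label`, as well as the label values `intended` and `surface`, are hypothetical.

```python
from collections import Counter

# Toy records invented purely for illustration (NOT data from the study).
# Field names and label values are hypothetical.
records = [
    {"correct": True, "rule_label": "intended"},
    {"correct": True, "rule_label": "surface"},
    {"correct": False, "rule_label": "intended"},
    {"correct": True, "rule_label": "intended"},
]


def summarize(records):
    """Return overall accuracy and counts of (correct, rule_label) pairs."""
    if not records:
        raise ValueError("no records to summarize")
    accuracy = sum(r["correct"] for r in records) / len(records)
    table = Counter((r["correct"], r["rule_label"]) for r in records)
    return accuracy, table


accuracy, table = summarize(records)

# Among correct outputs, what share came with a rule using the intended abstraction?
correct_total = sum(n for (ok, _), n in table.items() if ok)
intended_share = table[(True, "intended")] / correct_total

# A correct rule paired with an incorrect output (rule found, application failed).
rule_ok_output_wrong = table[(False, "intended")]
```

Reporting `intended_share` and `rule_ok_output_wrong` next to `accuracy` separates the two cases the abstract highlights: correct outputs reached through surface-level patterns, and correct rules that were not applied correctly.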
Dataset viewer
Use this viewer to investigate model and human responses on the ConceptARC tasks.
Full data can be downloaded from ClaasBeger/HumanLikeARCAbstraction on Hugging Face.
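One possible way to fetch the files programmatically is with the `huggingface_hub` library. The sketch below assumes the repository is hosted as a dataset repository (hence `repo_type="dataset"`) and makes no assumption about the file layout inside it; `download_dataset` and `list_files` are hypothetical helper names introduced here.

```python
from pathlib import Path

REPO_ID = "ClaasBeger/HumanLikeARCAbstraction"


def download_dataset(repo_id=REPO_ID, repo_type="dataset"):
    """Download a snapshot of the repository and return its local path.

    Assumption: the repository is a Hugging Face *dataset* repository.
    Requires `pip install huggingface_hub` and network access.
    """
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id, repo_type=repo_type))


def list_files(root):
    """Return the relative paths of all files under `root`, sorted."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
```

Calling `list_files(download_dataset())` should then show which files the repository contains, which is a reasonable first step before deciding how to parse them.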