Learning Reasoning Reward Models from Expert Demonstration via Inverse Reinforcement Learning

Authors: Claudio Fanconi, Nicolas Astorga, Mihaela van der Schaar

Affiliations: University of Cambridge

Venue: ICLR 2026 Workshop on LLM Reasoning


Abstract

Reasoning in large language models is typically trained either via supervised fine-tuning on expert traces or via reinforcement learning with outcome-based rewards. Supervised imitation does not directly optimise sequential decision quality, while outcome-based RL requires explicit reward design. This paper proposes an inverse reinforcement learning framework that learns dense, token-level reasoning rewards directly from expert demonstrations. The learned reward serves both as a dense training signal and as an inference-time reranking signal. Compared with supervised baselines, the approach improves accuracy on GSM8K (79% vs. 56%) and MedReason (74% vs. 65%), and reward-guided reranking yields up to 12 percentage-point gains on Llama3 architectures. The learned dense rewards also provide interpretable step-wise diagnostics for localising reasoning errors.
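
To make the inference-time use of the learned reward concrete, the sketch below shows a generic best-of-N reranking loop: each candidate reasoning trace is scored by summing per-token rewards from a learned reward model, and the highest-scoring trace is selected. This is a minimal illustration under assumed interfaces, not the paper's implementation; the `TokenRewardModel` class and `rerank` function are hypothetical stand-ins.

```python
import torch

# Hypothetical stand-in for a learned token-level reward model; the paper's
# actual architecture and training objective are not reproduced here.
class TokenRewardModel(torch.nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # One scalar reward per token: shape (seq_len,)
        return self.head(self.embed(token_ids)).squeeze(-1)


def rerank(candidates: list[torch.Tensor], reward_model: TokenRewardModel) -> int:
    """Best-of-N reranking: return the index of the candidate trace whose
    summed token-level reward is highest."""
    with torch.no_grad():
        scores = [reward_model(ids).sum().item() for ids in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = TokenRewardModel()
    # Three toy candidate traces of different lengths (random token ids).
    traces = [torch.randint(0, 1000, (n,)) for n in (12, 20, 16)]
    print(f"selected candidate {rerank(traces, model)}")
```

The same per-token scores could, in principle, also serve as the dense training signal and the step-wise diagnostics mentioned in the abstract, since each token (or reasoning step) receives its own reward rather than a single outcome-level score.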