
RLVR — Training Models to Be Right, Not Just Sound Right

There's a fundamental problem with how we've been training language models to be helpful. RLHF — Reinforcement Learning from Human Feedback — works by having humans rate model outputs. The model learns to produce responses that humans prefer.

But here's the catch: humans prefer responses that sound confident, well-structured, and authoritative. We're biased toward fluency over accuracy. A wrong answer delivered with conviction beats a correct answer delivered with uncertainty, at least in human preference rankings.

This is why your model can write a beautifully formatted, completely wrong analysis of a dataset. It learned to optimize for sounding right, not for being right.

RLVR is the fix.

What RLVR Actually Is

Reinforcement Learning from Verifiable Rewards replaces human judgment with objective verification. Instead of asking "does this response seem good?", it asks "is this response correct?"

RLHF vs RLVR
RLHF:
  Model generates response
    → Humans rank it against alternative responses
    → A reward model is trained on those preferences
    → Model optimizes for the reward model's score
    → Problem: humans reward fluency over accuracy

RLVR:
  Model generates response
    → Verifier checks if answer is correct
    → Model optimizes for correctness
    → Result: actual reliability improvement

The "verifiable" part is key. RLVR works on tasks where you can objectively check the output: math problems have definite answers. Code either compiles and passes tests or it doesn't. Structured data extraction either matches the schema or it doesn't. Logical reasoning either follows valid rules or it doesn't.
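For the easy domains, the verifier can be almost trivial. A minimal sketch in Python, where the function name and tolerance are my own choices, not from any particular RLVR codebase:

```python
# Minimal sketch of a binary verifiable reward for math answers.
# Name and tolerance are illustrative, not from any library.
def math_reward(model_answer: str, ground_truth: float) -> float:
    """Return 1.0 if the model's final answer matches the ground truth, else 0.0."""
    try:
        return 1.0 if abs(float(model_answer) - ground_truth) < 1e-9 else 0.0
    except ValueError:
        return 0.0  # an unparseable answer earns no reward

print(math_reward("42", 42.0))         # 1.0
print(math_reward("forty-two", 42.0))  # 0.0
```

The reward is deliberately binary: there is no partial credit for a confident-sounding wrong answer, which is exactly the property that makes the training signal hard to game.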

Why It Works So Well for Reasoning

The impact on reasoning models has been dramatic. DeepSeek's R1 model — which made waves in early 2025 — used RLVR extensively during training. The model learned to write out its intermediate reasoning steps because that process leads to verifiably correct answers, not because a human rater preferred seeing the work.

This produces a qualitatively different kind of chain-of-thought reasoning. Instead of the model performing confidence theater ("Let me think through this step by step..."), it develops genuine problem-solving strategies because those strategies are what actually get rewarded.

RLHF-trained reasoning:
"Let me think about this carefully.
 First, I'll consider the key factors...
 Based on my analysis, the answer is 42."

(Sounds great. Answer is wrong.)

RLVR-trained reasoning:
"x² + 5x + 6 = 0
 Factor: (x+2)(x+3) = 0
 x = -2 or x = -3

 Verify: (-2)² + 5(-2) + 6 = 4 - 10 + 6 = 0 ✓
 Verify: (-3)² + 5(-3) + 6 = 9 - 15 + 6 = 0 ✓"

(Shows real work. Answer is verifiably correct.)

Beyond Math and Code

The frontier in 2026 is expanding RLVR beyond its original domains. Math and code are easy to verify — you can check answers computationally. But what about biology? Chemistry? Legal reasoning?

The approach being explored uses a second LLM as a verifier for domains where automated checking is harder. A domain-expert model evaluates whether the primary model's reasoning follows valid scientific or legal principles. This introduces its own biases, but it's a meaningful step beyond "does this sound good to a crowdworker?"
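As a sketch of what that looks like, suppose `ask_expert` wraps a yes/no call to a domain-expert model. Both the function and the rubric below are hypothetical, chosen to illustrate the shape of the signal rather than any specific system:

```python
# Hypothetical sketch: score reasoning against a rubric via an expert model.
# `ask_expert` stands in for any call that returns True/False for a question.
def rubric_reward(reasoning: str, ask_expert) -> float:
    criteria = [
        "Every cited statute or policy clause actually exists",
        "Each step follows logically from the previous one",
        "The conclusion is consistent with the cited rules",
    ]
    passed = sum(
        ask_expert(f"Does the reasoning below satisfy: {c}?\n\n{reasoning}")
        for c in criteria
    )
    return passed / len(criteria)  # partial credit: fraction of criteria met
```

The reward is only as trustworthy as the expert model answering the rubric questions, which is the bias the paragraph above acknowledges.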

For ML engineers, this opens a practical question: can you build a verifier for your specific domain? If you can, RLVR-style training becomes accessible.

Domain-specific verification
Medical coverage checking:
  Model says "Procedure X is covered under policy Y"
  Verifier: parse policy Y → check coverage rules → confirm/deny
  → Reward signal based on correctness

Data extraction:
  Model extracts {name, date, amount} from invoice
  Verifier: compare against ground truth labels
  → Reward signal based on field-level accuracy

SQL generation:
  Model writes SELECT query from natural language
  Verifier: execute query → compare results to expected output
  → Reward signal based on query correctness
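The SQL case is concrete enough to sketch end to end. The table, data, and function name here are made up for illustration; a real pipeline would compare against the gold query's results across several databases rather than one hand-built fixture:

```python
import sqlite3

# Sketch of execution-based SQL verification (table and rows are made up).
def sql_reward(generated_sql: str, expected_rows: list) -> float:
    """Run the generated query on a toy database; reward 1.0 iff results match."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO invoices VALUES (?, ?)",
                     [("acme", 100.0), ("globex", 250.0)])
    try:
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # queries that fail to execute earn no reward
    finally:
        conn.close()
    return 1.0 if rows == expected_rows else 0.0
```

Note that executing results, rather than string-matching the SQL text, rewards any query that produces the right answer, not just the one phrasing the labeler wrote down.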

The Production Implications

For anyone building production ML systems, RLVR changes how you think about model selection and evaluation:

Evaluation metrics shift. Instead of measuring perplexity or human preference scores, you measure task-specific correctness rates. For a medical coverage system, the metric is: "what percentage of coverage decisions are correct?" For a code assistant, it's: "what percentage of generated code passes the test suite?"
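The metric itself is just a pass rate over a labeled test set. A sketch, with illustrative names:

```python
# Illustrative eval harness: task-specific correctness rate.
def correctness_rate(test_set, verifier) -> float:
    """Fraction of (output, ground_truth) pairs the verifier accepts."""
    passed = sum(verifier(output, truth) for output, truth in test_set)
    return passed / len(test_set)

# e.g. exact-match verification of coverage decisions
decisions = [("covered", "covered"), ("denied", "covered"), ("covered", "covered")]
rate = correctness_rate(decisions, lambda out, truth: out == truth)  # 2 of 3 correct
```

The verifier argument is the point: the harness stays the same whether correctness means exact match, passing a test suite, or matching executed query results.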

Fine-tuning becomes more rigorous. If you're fine-tuning a model for a specific domain, RLVR suggests you need a verification pipeline — not just training data. You need a way to automatically check whether the model's outputs are correct, and feed that signal back into training.
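For the extraction case from earlier, one piece of that pipeline might be a field-level verifier like this sketch (the schema and names are assumptions, not a prescribed format):

```python
# Sketch of a field-level verifier for structured extraction.
def field_accuracy(extracted: dict, truth: dict) -> float:
    """Fraction of ground-truth fields the model reproduced exactly."""
    correct = sum(extracted.get(k) == v for k, v in truth.items())
    return correct / len(truth)

# Two of three fields match here: the amount is wrong.
score = field_accuracy(
    {"name": "Acme Corp", "date": "2026-01-05", "amount": 1250.0},
    {"name": "Acme Corp", "date": "2026-01-05", "amount": 1275.0},
)
```

A graded signal like this, rather than all-or-nothing, gives the training loop something to climb even when the model gets only part of a record right.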

Inference-time scaling matters more. Models trained with RLVR benefit disproportionately from thinking longer. Giving the model more tokens to reason through a problem — inference-time scaling — produces compounding accuracy improvements. This is a direct tradeoff: more latency and cost per query, but dramatically better accuracy on hard problems.
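One standard way to spend extra compute at generation time when a verifier is available is best-of-n sampling: draw several candidates and keep one the verifier accepts. A sketch, with `generate` and `verify` as stand-ins for a model call and a task verifier:

```python
# Best-of-n sampling: trade latency and cost for a verified answer.
# `generate` and `verify` are stand-ins, not a specific API.
def best_of_n(generate, verify, n: int = 8):
    candidates = [generate() for _ in range(n)]
    for candidate in candidates:
        if verify(candidate):
            return candidate   # first candidate the verifier accepts
    return candidates[0]       # nothing verified: fall back (or abstain)
```

The n parameter is the cost/accuracy dial the paragraph above describes: each increment buys another chance at a verifiably correct answer on the queries that matter.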

What This Means for 2026

Sebastian Raschka made a prediction I agree with: the biggest improvements in LLM performance this year won't come from bigger models or new architectures. They'll come from better training signals (RLVR) and smarter inference (spending more compute at generation time where it matters).

The model that sounds the most confident is not the model that is most correct. RLVR is teaching the field to stop conflating the two. For ML engineers building systems that people rely on — medical, financial, legal, safety-critical — that distinction is everything.

The question "can we verify this?" is becoming the most important architectural decision in model training. If the answer is yes, you can train for actual reliability. If the answer is no, you're still in the world of "sounds about right."

I know which world I want my systems to live in.
