"Code RLHF" is a category, not a task. Buyers asking vendors for "code RLHF data" are usually asking for one of seven different things, each producing different training signal, each costing a different amount per item, and each suited to a different point in a coding-model training pipeline.
This essay walks through all seven, with a practical view of what each one teaches a model, what it costs to produce well, and when to spend on it. It's meant for the person at an AI lab or AI-data platform who needs to decide, this quarter, where their training budget goes. If you're scoping a code RLHF and annotation service, this is the framework we'd walk you through on a scoping call anyway.
Why "code RLHF" is a misleading bucket
The label "code RLHF" inherited its shape from the original InstructGPT RLHF pipeline: a reward model is trained on human rankings of model outputs, then the model is RL-tuned against that reward signal. In practice, most modern code training pipelines use a mix of supervised fine-tuning, direct preference optimization, and rejection sampling - all of which get loosely called "RLHF" even when the RL part is gone.
The result is that "we need code RLHF data" can mean any of: training data for supervised fine-tuning on code, preference pairs for DPO, code review labels for reward model training, full agent trajectories for multi-step training, or evaluation gold sets used in post-training analysis. The data shape is wildly different across these. The cost-per-item is different by an order of magnitude. The quality bar is different.
Treating "code RLHF" as one line item is the most expensive mistake a buyer can make. Here is the breakdown.
Task type 1 - Code generation with reference solutions
The most common code RLHF task: a prompt is provided, and a senior engineer writes a high-quality reference solution. Used for supervised fine-tuning, rejection sampling, and sometimes as the "chosen" side of a preference pair.
The signal this teaches is straightforward: given this prompt, produce code that looks like this. The model learns style, structure, common idioms, and the kind of code a senior engineer would write rather than the kind that merely compiles and works on the happy path.
The trap: prompts that are too short produce solutions that look like HumanEval-style single-function code, which doesn't reflect how engineers actually work. The high-value version of this task uses multi-file prompts with real-world context - existing code in the repo, conventions to match, edge cases to handle. The cost to produce one of these well is substantially higher than the typical LeetCode-style reference task, but the training signal is also substantially better.
When to spend on it: early in training, when you need broad coverage of the model's coding distribution. Less valuable after that - at later stages, preference comparisons and trajectories produce more marginal signal per dollar.
Task type 2 - Pairwise preference comparisons (for DPO)
Two implementations of the same prompt are produced (either both by humans, both by the model, or one of each) and a senior engineer picks the better one with annotated rationale. Used for DPO and reward model training.
The signal this teaches is preference: "given these two responses, prefer this one, and here is why." It's particularly powerful because it captures the gradient between "correct" and "good" - the difference between code that works and code that you'd actually merge into your repo. Most of what makes a senior engineer's code valuable lives in that gradient, and it's nearly impossible to capture in single-output reference tasks.
The trap: shallow preference pairs where one option is obviously broken. The model learns nothing useful from those. The high-value version uses pairs that are both plausibly correct, where the rationale captures subtle but real differences - error handling, performance characteristics, idiomatic clarity, future-maintainability.
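To make the "both plausibly correct" bar concrete, here is a minimal sketch of a preference-pair record and a filter that drops shallow pairs. The field names, the pass/fail flags, and the prompt/chosen/rejected output layout are our assumptions for illustration, not a required schema.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Hypothetical record shape; every field name here is an assumption.
    prompt: str
    chosen: str
    rejected: str
    rationale: str
    chosen_passes_tests: bool
    rejected_passes_tests: bool


def is_informative(pair: PreferencePair) -> bool:
    """Keep pairs where both sides work and the annotator explained the choice."""
    both_plausible = pair.chosen_passes_tests and pair.rejected_passes_tests
    has_rationale = bool(pair.rationale.strip())
    return both_plausible and has_rationale


def to_training_record(pair: PreferencePair) -> dict:
    """Flatten to a prompt/chosen/rejected layout (an assumed target format)."""
    return {"prompt": pair.prompt, "chosen": pair.chosen, "rejected": pair.rejected}


pairs = [
    PreferencePair(
        prompt="Parse a TCP port number from a string.",
        chosen="def parse_port(s): ...  # validates 1-65535, raises ValueError",
        rejected="def parse_port(s): ...  # returns int(s) with no range check",
        rationale="Both work on valid input; chosen rejects out-of-range ports.",
        chosen_passes_tests=True,
        rejected_passes_tests=True,
    ),
    PreferencePair(
        prompt="Parse a TCP port number from a string.",
        chosen="def parse_port(s): ...",
        rejected="def parse_port(s): return s",  # obviously broken
        rationale="",
        chosen_passes_tests=True,
        rejected_passes_tests=False,
    ),
]

kept = [to_training_record(p) for p in pairs if is_informative(p)]
```

Passing tests is only a proxy for "plausibly correct"; in practice the rationale review does most of the work, and the filter just removes the pairs nobody needs to look at.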
Pairwise preference is also where vendor quality matters most. A junior annotator picks based on what looks correct on the surface. A senior engineer picks based on what they'd want to inherit. The two produce different models.
When to spend on it: throughout training, particularly in the alignment phase. DPO and related preference-optimization methods have displaced reward-model RLHF in many post-training pipelines, so preference pairs are among the highest-leverage signals you can produce right now for most coding models.
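For readers who want the mechanics behind that claim: the per-pair DPO objective rewards the policy for raising the chosen response's log-probability, relative to a frozen reference model, by more than it raises the rejected one's. A minimal sketch, assuming the four sequence log-probabilities have already been computed elsewhere; the function name and the default beta are illustrative choices, not recommendations.

```python
import math


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, not a tuned value
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    # How much more the policy favors "chosen" over "rejected"
    # than the reference model already does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# With no margin the loss is log(2); it falls as the policy learns the preference.
no_margin = dpo_pair_loss(-10.0, -10.0, -10.0, -10.0)
learned = dpo_pair_loss(-8.0, -12.0, -10.0, -10.0)
```

This is why shallow pairs are wasted spend: if the rejected side is obviously broken, the reference model already separates the two, and the pair contributes little the model did not know.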
Task type 3 - Code review annotations against a rubric
A piece of code (either AI-generated or human-written) is presented, and the annotator marks defects, style issues, severity ratings, and improvement suggestions against a provided rubric. Used for reward modeling, safety training, and code-review-capability training.
The signal is structured: this code has these problems, here is their severity, here is what would fix them. It's particularly valuable for training models that need to do code review themselves - agent-assistant models, autonomous engineering systems, and CI-integrated code analysis tools.
The trap: rubric ambiguity. If two senior engineers can plausibly disagree on whether something is a "style preference" or a "defect," the rubric needs to specify which. Most published rubrics fail this test. Building a usable rubric takes a calibration session with the client, then iterative refinement across the first 100 to 200 annotations. Vendors who skip this step ship data that scores high on inter-annotator agreement and low on actual model utility.
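During calibration, a chance-corrected agreement score per rubric category is a quick way to find the categories two engineers read differently. The sketch below computes Cohen's kappa for two annotators; the labels and the sample data are made up for illustration, and, as noted above, a high score alone does not prove the data is useful.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Fraction of items where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical calibration sample: is each finding a "defect" or a "style" issue?
annotator_a = ["defect", "style", "defect", "style"]
annotator_b = ["defect", "style", "style", "style"]
kappa = cohens_kappa(annotator_a, annotator_b)  # 0.5
```

A low kappa on one category usually means the rubric needs a sharper definition there, not that one annotator is wrong.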
When to spend on it: when training a model that needs to be a reviewer, not just a writer. Less valuable for pure generation training.
Task type 4 - Bug-fix demonstrations with root-cause
The annotator is given a piece of buggy code (often a real or realistic regression) and produces: a root-cause analysis, a corrected version, a regression test, and a verbalization of how they identified the bug. Used for supervised fine-tuning and as the demonstration side of preference data.
This is the task type closest to real engineering work. The signal it teaches is debug methodology: see symptom, hypothesize cause, verify hypothesis, fix root cause, prevent regression. Models trained on this data behave noticeably differently in debugging contexts - they hypothesize more carefully, they fix root causes rather than papering over symptoms, and they reach for regression tests reflexively.
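One check worth automating on this task type: the regression test should fail on the buggy version and pass on the fix, otherwise it does not actually pin the bug. A minimal sketch, with a hypothetical record shape and a toy bug; none of these names come from a specific tool.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Impl = Callable[[Sequence[float]], float]


@dataclass
class BugFixDemo:
    # Hypothetical record shape mirroring the four deliverables described above.
    root_cause: str
    verbalization: str
    buggy: Impl
    fixed: Impl
    regression_check: Callable[[Impl], None]  # raises on failure


def passes(check: Callable[[Impl], None], impl: Impl) -> bool:
    """Run a check against an implementation; any exception counts as a failure."""
    try:
        check(impl)
    except Exception:
        return False
    return True


def regression_check_is_meaningful(demo: BugFixDemo) -> bool:
    """The check must fail before the fix and pass after it."""
    return not passes(demo.regression_check, demo.buggy) and passes(
        demo.regression_check, demo.fixed
    )


def buggy_average(values: Sequence[float]) -> float:
    return sum(values) / len(values)  # crashes on an empty input


def fixed_average(values: Sequence[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def average_regression_check(impl: Impl) -> None:
    assert impl([]) == 0.0
    assert impl([2.0, 4.0]) == 3.0


demo = BugFixDemo(
    root_cause="Division by len(values) with no guard for empty input.",
    verbalization="The stack trace points at the division, so I checked the empty case first.",
    buggy=buggy_average,
    fixed=fixed_average,
    regression_check=average_regression_check,
)
```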
The trap: synthetic bugs. A lot of bug-fix data in the open is based on injected bugs (mutate the code until it breaks). Models trained on this perform well on injected-bug benchmarks and poorly on real bugs, because real bugs look different. The high-value version uses bugs harvested from real codebases or carefully crafted to mimic real failure modes. SWE-bench is built on this principle and has rapidly become the benchmark frontier labs calibrate against.
When to spend on it: throughout training, especially for any model that will be used in IDE integration or agent contexts. This is the highest-priority task type for code-fix and agent training right now.
Task type 5 - Refactoring trajectories
The annotator is given a piece of working but suboptimal code and produces a step-by-step refactor: each intermediate state, the rationale for each step, the tests they run between steps, and the final result. Used for multi-step reasoning training and agentic refactor tooling.
The signal here is not the final state but the path. Refactoring is the canonical "looks easy in retrospect, hard to do safely" engineering task: an engineer can describe what the refactor should look like, but the steps to get there safely (without breaking tests, without breaking external consumers, without breaking the build along the way) are where the actual judgment lives.
The trap: collapsing the trajectory to before-and-after. Most published refactoring datasets do this, and they teach the model nothing about how to refactor. The high-value version records every commit, every test run, every intermediate state, with rationale per step.
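A sketch of what "recording the path" can look like as data, with a validator that flags collapsed or unsafe trajectories. The schema is hypothetical, and requiring green tests after every step is our assumption about what a safe refactor means here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RefactorStep:
    # Hypothetical per-step record: one entry per commit in the refactor.
    commit_message: str
    rationale: str
    code_after: str
    tests_passed: bool


def trajectory_problems(steps: List[RefactorStep]) -> List[str]:
    """Return human-readable problems; an empty list means the trajectory is usable."""
    problems = []
    if len(steps) < 2:
        problems.append("collapsed to before-and-after: no intermediate states")
    for index, step in enumerate(steps, start=1):
        if not step.rationale.strip():
            problems.append(f"step {index}: missing rationale")
        if not step.tests_passed:
            problems.append(f"step {index}: tests not green after this step")
    return problems


good = [
    RefactorStep("Extract parse_header()", "Isolate parsing before changing it.", "...", True),
    RefactorStep("Replace regex with split()", "Simpler and covered by existing tests.", "...", True),
]
collapsed = [RefactorStep("Refactor parser", "", "...", True)]

good_report = trajectory_problems(good)            # []
collapsed_report = trajectory_problems(collapsed)  # two problems
```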
When to spend on it: when training agentic models that will do multi-step engineering work. Significantly more expensive per item than single-step tasks, but produces signal you can't get any other way.
Task type 6 - Reasoning verbalizations
First-person reasoning logs captured during any of the above tasks: "I'm looking at this code first because...", "I noticed the test is failing on this edge case, which suggests...", "I'm going to try X first because if it works it's faster, and if it doesn't I'll fall back to Y." Used for chain-of-thought training and reasoning-capability training.
This is the task type most often treated as an afterthought and most often the difference between a useful dataset and a useless one. The model learns reasoning patterns from verbalizations, not from the underlying actions. An annotator who produces correct code with no verbalization is producing supervised fine-tuning data; an annotator who produces correct code with full verbalization is producing data that meaningfully improves reasoning capability.
The trap: scripted-sounding verbalizations. If the annotator writes the verbalization after completing the task, it tends to be a rationalization rather than a reasoning trace - they explain what they did, not what they were thinking. The high-value version captures verbalization live, ideally with audio that's transcribed afterward. The friction is higher; the signal is dramatically better.
When to spend on it: anywhere you want reasoning capability in the model. Pair with task types 4 and 5 for highest leverage.
Task type 7 - Multi-file agent tasks
The annotator is given a task spec that requires modifying multiple files, running tests, possibly using a browser to look up documentation, and arriving at a working state. Every action is captured - terminal commands, file diffs, browser activity, test outputs, IDE state - alongside reasoning verbalization. Used for agent training, particularly for autonomous coding agents.
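As a rough picture of what one captured trace looks like, here is a minimal event-log sketch covering the action types listed above, serialized as JSON Lines. The event kinds, field names, and the "every action step has reasoning" check are illustrative assumptions, not a description of any particular capture tool.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# Assumed event vocabulary, taken from the action types named in the text.
EVENT_KINDS = ("terminal", "file_diff", "browser", "test_output", "ide_state", "reasoning")


@dataclass
class TraceEvent:
    step: int
    kind: str
    payload: str

    def __post_init__(self) -> None:
        if self.kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {self.kind!r}")


def to_jsonl(events: List[TraceEvent]) -> str:
    """Serialize a trace as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


def steps_without_reasoning(events: List[TraceEvent]) -> List[int]:
    """Step numbers that contain an action but no reasoning verbalization."""
    reasoned = {e.step for e in events if e.kind == "reasoning"}
    acted = {e.step for e in events if e.kind != "reasoning"}
    return sorted(acted - reasoned)


trace = [
    TraceEvent(1, "reasoning", "Run the failing test first to see the actual error."),
    TraceEvent(1, "terminal", "pytest tests/ -k parser"),
    TraceEvent(2, "file_diff", "--- a/parser.py\n+++ b/parser.py\n..."),
]

missing = steps_without_reasoning(trace)  # step 2 has a diff but no reasoning
serialized = to_jsonl(trace)
```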
This is the most expensive task type and the one with the highest signal per item. Trajectories are dense - a single trace can contain hundreds of discrete decisions, each one a training signal. Models trained on this data behave qualitatively differently in agentic contexts: they handle multi-step tasks with more patience, they recover from errors more gracefully, and they use tools more deliberately.
This is what we mean when we talk about agent trajectory data as a distinct service line. The capture overhead is real, the QA process is heavier, and the per-item rate reflects that. For labs and product teams training coding agents, it's also the data type most likely to differentiate a model in 2026.
When to spend on it: when training any agent or multi-step model. The frontier of coding-model capability is moving toward agent benchmarks, and trajectory data is the input that moves the needle.
A practical decision tree
The seven task types are not equally valuable for every training objective. If you're scoping a budget, here is how we'd think about it.
Training a code-completion model from scratch? Lead with Task 1 (reference solutions) for breadth, then add Task 2 (preference pairs) once the model has basic coverage. Add Tasks 4 and 6 in smaller volume for capability lift.
Improving an existing code model that already has broad coverage? Tasks 2 (preference pairs) and 4 (bug-fix demonstrations) produce the highest marginal signal per dollar. Skip Task 1.
Training a code reviewer model? Task 3 (code review annotations) is the bread and butter. Pair with Task 2 for preference signal.
Training a coding agent? Tasks 5 (refactoring trajectories), 6 (reasoning verbalizations), and 7 (multi-file agent tasks) are the priority. Tasks 1 and 2 are still useful but lower leverage.
Building eval sets, not training data? All seven types can be repurposed as evaluations with appropriate gold sets, but the QA bar shifts - eval data needs to be definitively correct, whereas training data can tolerate some noise.
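The same decision tree, written as a lookup table so it can sit next to a budget spreadsheet. The objective keys and role names are our shorthand; the task numbers and groupings are the ones given above.

```python
from typing import Dict, List

TASK_TYPES: Dict[int, str] = {
    1: "reference solutions",
    2: "preference pairs",
    3: "code review annotations",
    4: "bug-fix demonstrations",
    5: "refactoring trajectories",
    6: "reasoning verbalizations",
    7: "multi-file agent tasks",
}

# Objective -> role -> task numbers. Keys are shorthand invented for this sketch.
TASK_MIX: Dict[str, Dict[str, List[int]]] = {
    "completion_from_scratch": {"lead": [1], "add_later": [2], "smaller_volume": [4, 6]},
    "improve_existing_model": {"lead": [2, 4], "skip": [1]},
    "code_reviewer": {"lead": [3], "pair_with": [2]},
    "coding_agent": {"lead": [5, 6, 7], "lower_leverage": [1, 2]},
    "eval_sets": {"lead": [1, 2, 3, 4, 5, 6, 7]},
}


def describe_mix(objective: str) -> Dict[str, List[str]]:
    """Translate task numbers into names for one training objective."""
    mix = TASK_MIX[objective]
    return {role: [TASK_TYPES[n] for n in numbers] for role, numbers in mix.items()}


agent_mix = describe_mix("coding_agent")
```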
How we structure batches across task types
A typical engagement spans three to five task types in parallel, weighted toward whatever the client's training objective requires. Our delivery process treats each type as a separate stream with its own rubric, its own calibrator, and its own gold-set sampling - because the failure modes are different and you can't share QA infrastructure across them.
What we keep consistent across types: the engineer pool (every task is produced by a senior engineer who has been through our vetting process), the verbalization quality bar (every task includes a reasoning trace where applicable), and the sign-off process (every batch is signed off by a senior calibrator before delivery).
What we vary: the rubric, the gold-set composition, the inter-reviewer protocol, and the per-task time budget. Task 7 batches take 10x to 20x the time-per-item of Task 1 batches, and we structure pricing accordingly. See our pricing and pod structures page for the indicative tiers.
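To see what the 10x to 20x multiplier does to a budget, here is a back-of-the-envelope sketch. The 30-minute Task 1 item and the hourly rate are made-up inputs, and the calculation assumes the same hourly rate across task types and ignores QA and capture overhead.

```python
from typing import Tuple


def per_item_cost(minutes_per_item: float, hourly_rate: float) -> float:
    """Labor cost of one item at a given hourly rate."""
    return minutes_per_item / 60 * hourly_rate


def task7_cost_range(task1_minutes: float, hourly_rate: float) -> Tuple[float, float]:
    """Task 7 per-item cost range, using the 10x to 20x time multiplier."""
    low = per_item_cost(task1_minutes * 10, hourly_rate)
    high = per_item_cost(task1_minutes * 20, hourly_rate)
    return low, high


# Hypothetical inputs: a 30-minute Task 1 item at a rate of 100 per hour.
task1_cost = per_item_cost(30, 100)       # 50.0
task7_low, task7_high = task7_cost_range(30, 100)  # (500.0, 1000.0)
```

At those illustrative numbers, a flat per-task rate set anywhere near the Task 1 cost makes Task 7 impossible to deliver well, which is the pricing failure described in the next section.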
The mistake we see most often
The most common mistake we see buyers make is treating all task types as roughly equivalent and pricing them on a flat per-task or per-token rate. This collapses the cost-versus-signal trade-off and pushes vendors toward the cheapest task types regardless of training value. The result is large datasets that improve benchmark numbers on cheap evals and barely move the needle on the capabilities the buyer actually cares about.
The fix is to scope each task type separately, price each separately, and make the QA bar match the task. A pilot pod can run all seven types in parallel for six to eight weeks; the resulting data shows which types produce the biggest training lift for the specific objective. That's how we'd recommend scoping a first engagement, and it's how most of our long-term client relationships have started.
If you're scoping a code training data engagement and want a second opinion on the task-type mix, we'd be happy to walk through your task spec and tell you where we'd spend your budget.