How We Vet a Senior Engineer for AI Training Work in 90 Minutes

Discover how Beyond Labs vets senior engineers for RLHF and AI training work through a rigorous 5-stage process, resulting in a 2–4% acceptance rate.

Sachin Rathor

19 Jun 2026

The quality of your training data is the quality of the people producing it. Most AI data vendors hire on volume and weed people out later - through batch rejection rates, quality scoring, or eventual off-boarding. By the time those signals fire, model errors are already baked into the dataset.

We do the opposite. We run a deep vetting process up front and keep a small bench of engineers we trust to ship. About two to four percent of applicants make it through. This post documents the full 5-stage vetting process, end-to-end, so you can decide for yourself whether it's the kind of pipeline you want producing your RLHF data.

Why "we hire senior engineers" is meaningless without process

Every AI data vendor on a sales call says some version of "our annotators are senior engineers." The phrase has been emptied out by overuse. A vendor can say it whether their median contractor has eight years of Stripe experience or two years writing Excel macros, and the buyer has no way to tell the difference until the first batch ships.

What separates real vetting from marketing claims is the process - specifically, the parts of the process that are easy to skip when growth pressure hits. Background checks slow hiring. Live calibration calls don't scale. Probationary batches eat into margin. Vendors who skip these steps still call their people "senior engineers." Vendors who don't skip them produce different data.

Here is what we don't skip.

Stage 1 - Application screen (5 minutes)

Every application goes through a structured first-pass review by a member of the AI Data Practice. We look at four things: public profile (GitHub, personal site, or portfolio), prior production work and which companies it shipped at, language and stack proficiency relative to what we're currently hiring for, and basic availability and geographic fit for the engagement tier.

About 80 percent of applicants do not reach Stage 2. The most common reasons are stack mismatch (we're hiring for Rust, the applicant is a senior PHP engineer), insufficient production tenure (fewer than six years of shipping real software), or a public footprint that doesn't substantiate the resume claims.

Notably, we do not gate on degrees, prior employer prestige, or specific named credentials. We gate on demonstrable production work and the ability to articulate what they built and why.

Stage 2 - Take-home task (60 minutes)

Applicants who clear Stage 1 receive a representative annotation or trajectory task scoped to the role we're hiring for. The task takes about an hour. It's graded against an internal rubric we calibrate quarterly against client gold sets.

For code RLHF roles, the task typically asks the applicant to do something like: review four pieces of code, two AI-generated and two human-written, against a provided rubric; identify defects and rate severity; write a reasoning verbalization explaining their thought process at each step. For agent trajectory roles, the task asks them to complete a short multi-step coding workflow in a sandboxed environment with their reasoning recorded.

What we're grading: factual correctness, calibration to the rubric, the quality of their verbalization, and the speed/quality trade-off they make. About 60 percent of Stage 1 passes do not reach Stage 3. The most common failure mode is rubric misalignment - the applicant has the technical skill but doesn't internalize the grading framework, and applying that framework consistently is the actual job.

This is also where we filter out one specific failure mode that's worth naming: the applicant who tries to use a coding model to do the task for them. Our task design makes this detectable; we flag it and remove the applicant.

Stage 3 - Live calibration call (45 minutes)

A senior calibrator runs a live video session with the candidate. The structure is fixed: three sample tasks worked through together, with the candidate verbalizing their reasoning at each step. The calibrator probes specific decisions ("why did you mark that as a defect rather than a style preference?") and watches how the candidate handles disagreement.

We are testing for two things, and only two things: judgment and verbalizability. Raw technical skill was already established in Stage 2; the question now is whether the candidate's decision-making maps onto the kind of reasoning we want surfaced in training data. The job of an RLHF annotator isn't to produce output - it's to produce legible output, where the rationale is as valuable as the label. This is part of why the field treats reward modeling and preference labeling as a distinct discipline from simple classification - in methods like DPO, the policy is optimized directly on human preference pairs, so the quality of those judgments flows straight into the model.

About 50 percent of Stage 2 passes do not reach Stage 4. The most common failure: a technically competent engineer who can do the work but can't articulate why they made each decision. That gap is fatal for code RLHF and annotation services, where the verbalization is half the deliverable.

This stage is where most vendors stop, because it doesn't scale. We can't run more than about fifteen calibration calls per calibrator per week without quality degrading. That headcount constraint is one reason our bench is small and stays small.

Stage 4 - Reference and background checks

The candidates who clear Stage 3 enter a more traditional verification stage. We request two professional references from prior engineering roles and run a background check appropriate to the security tier of the engagement.

For engagements with elevated security requirements (defense-adjacent work, regulated industries, frontier-lab sensitive data), we run enhanced screening: criminal background, employment verification, identity confirmation, and an extended reference check protocol. This work is contracted out to a specialized provider; we don't shortcut it.

Stage 4 takes roughly five to seven business days to clear. We do not start candidates on production work until it's complete.

Stage 5 - Probationary first batch

The first production batch from any new engineer is double-reviewed by a senior calibrator. Every output is checked, not just sampled. We're watching for three things: consistency with the calibration call (does their judgment match what we saw?), drift from the rubric over time within the batch, and how they handle ambiguous cases that weren't covered in calibration.

For coding-focused roles, we also check whether the engineer's sense of "what counts as a real defect" lines up with how the field actually evaluates code models. A benchmark like SWE-bench is useful for model evaluation because it is grounded in real GitHub issues and pull requests rather than abstract style preferences, and that is the same instinct we're listening for in a human reviewer.

About 15 percent of Stage 4 passes do not graduate to standard production. The most common reason is rubric drift under fatigue: the engineer's first ten tasks are excellent, and the next forty show slow degradation in reasoning quality. That pattern is invisible to sample-based QA but obvious under full review, and it predicts long-term batch quality with high accuracy.
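To make the fatigue pattern concrete, here is a minimal sketch of how a full-review pass could surface it. It is illustrative only and not a description of our production tooling: the 1-to-5 score scale, the `fatigue_drop` and `shows_fatigue_drift` names, and the 0.5 alert threshold are all assumptions made for the example. The ten-task warm-up window mirrors the "first ten tasks" pattern described above.

```python
from statistics import mean


def fatigue_drop(scores, warmup=10):
    """Mean score on the first `warmup` tasks minus the mean on the rest.

    scores: per-task review scores, in the order the engineer completed them.
    A positive result means quality fell after the warm-up window.
    """
    if len(scores) <= warmup:
        raise ValueError("need more tasks than the warm-up window")
    return mean(scores[:warmup]) - mean(scores[warmup:])


def shows_fatigue_drift(scores, warmup=10, max_drop=0.5):
    """Flag a batch whose post-warm-up scores fall by more than `max_drop`.

    `max_drop` is a hypothetical threshold; a real one would be tuned
    against the rubric's scale and historical batches.
    """
    return fatigue_drop(scores, warmup) > max_drop


if __name__ == "__main__":
    # Hypothetical 50-task batch: ten strong tasks, then weaker ones.
    batch_scores = [5.0] * 10 + [4.0] * 40
    print(f"drop after warm-up: {fatigue_drop(batch_scores):.2f}")
    print(f"flag for review: {shows_fatigue_drift(batch_scores)}")
```

The comparison only works because every task in the batch has a score. A QA pass that reviews a handful of items has too few points on either side of the window to separate drift from noise.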

The engineers who graduate from Stage 5 move to standard production with weekly batch sampling rather than full review. We've also written about our agent trajectory data work for engagements where Stage 5 review extends beyond single-turn code judgments into multi-step agent workflows.

End-to-end accept rate: 2–4%

Add it up: 20 percent pass Stage 1, 40 percent of those pass Stage 2, 50 percent of those pass Stage 3, nearly 100 percent pass Stage 4, and 85 percent of those graduate Stage 5. Multiplied through (0.20 × 0.40 × 0.50 × 1.00 × 0.85), that comes to about 3.4 percent, and our application-to-production accept rate sits between 2 and 4 percent depending on the role.
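The funnel arithmetic can be checked in a few lines. This is a sketch using the stage pass rates quoted above; the dictionary keys and function names are made up for the example, and Stage 4 is entered as 1.00 even though the real figure is "near" 100 percent, so the result is a slight upper bound.

```python
from math import prod

# Stage pass rates as quoted in this post. Stage 4 is "near 100 percent",
# so 1.00 is an upper-bound assumption.
STAGE_PASS_RATES = {
    "stage_1_application_screen": 0.20,
    "stage_2_take_home": 0.40,
    "stage_3_calibration_call": 0.50,
    "stage_4_background_checks": 1.00,
    "stage_5_probationary_batch": 0.85,
}


def end_to_end_accept_rate(pass_rates):
    """Multiply per-stage pass rates into one application-to-production rate."""
    return prod(pass_rates.values())


def cumulative_survival(pass_rates):
    """Fraction of all applicants still in the funnel after each stage."""
    remaining = 1.0
    survival = {}
    for stage, rate in pass_rates.items():
        remaining *= rate
        survival[stage] = remaining
    return survival


if __name__ == "__main__":
    for stage, share in cumulative_survival(STAGE_PASS_RATES).items():
        print(f"{stage}: {share:.1%} of applicants remain")
    rate = end_to_end_accept_rate(STAGE_PASS_RATES)
    print(f"end-to-end accept rate: {rate:.1%}")
    print(f"applicants per production engineer: {1 / rate:.0f}")
```

With these inputs the product is 0.034, or roughly one production engineer for every 29 applications. Role-by-role variation in the individual stage rates is what moves the figure around within the 2 to 4 percent band.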

We publish that number because it's the metric most predictive of downstream data quality, and most vendors won't share it. A vendor reporting a 30 percent accept rate is hiring on a different curve. That is not necessarily wrong - it depends on the work - but it produces a different dataset, and buyers should know which curve their vendor is on.

Calibration after onboarding (the part that breaks at most vendors)

Vetting at hire is half the system. The other half is what happens after. Engineers who clear Stage 5 enter ongoing calibration with four mechanisms running in parallel.

We sample roughly five to ten percent of every batch against client-defined gold sets. The sample is blind - the engineer doesn't know which items are gold-set checks. Scores are tracked per engineer and per batch.
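As an illustration of blind gold-set sampling, the sketch below picks a hidden subset of a batch and scores it against gold labels. It is a simplified example under stated assumptions, not our actual pipeline: the function names are hypothetical, labels are treated as exact-match categories, and it assumes gold answers exist for at least some items in the batch (for instance because gold items were mixed in beforehand).

```python
import random


def pick_blind_gold_checks(batch_item_ids, gold_labels, sample_rate=0.05, seed=None):
    """Choose which batch items will be scored against the gold set.

    Only items that have a gold label are eligible. The returned set stays
    on the QA side; the engineer sees an ordinary batch.
    """
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    candidates = [item for item in batch_item_ids if item in gold_labels]
    target = max(1, round(len(batch_item_ids) * sample_rate))
    k = min(len(candidates), target)
    rng = random.Random(seed)
    return set(rng.sample(candidates, k))


def gold_set_score(engineer_labels, gold_labels, checked_ids):
    """Share of blind-checked items where the engineer matches the gold label.

    Returns None when nothing was checked, so callers can tell "no data"
    apart from a score of zero.
    """
    checked = [item for item in checked_ids if item in engineer_labels]
    if not checked:
        return None
    hits = sum(engineer_labels[item] == gold_labels[item] for item in checked)
    return hits / len(checked)


if __name__ == "__main__":
    # Hypothetical batch of 200 items, half of which have client gold labels.
    batch = list(range(200))
    gold = {item: "defect" for item in range(0, 200, 2)}
    submitted = {item: "defect" for item in batch}
    checks = pick_blind_gold_checks(batch, gold, sample_rate=0.05, seed=7)
    print(f"{len(checks)} blind checks, score {gold_set_score(submitted, gold, checks):.0%}")
```

Scores computed this way would then be stored per engineer and per batch, which is what makes trends visible over time rather than only within a single delivery.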

Quarterly, we run recalibration sessions with the client lead. The client picks two or three recent batches and walks through their disagreements with our output. Rubrics get updated, gold sets evolve, and edge cases get codified. This is where most vendor relationships go wrong: rubrics drift because nobody refreshes them, and quality slowly degrades while everyone reports green metrics.

We monitor drift on inter-reviewer agreement metrics within batches, with automatic alerts when any reviewer's agreement with the cohort drops below a threshold. We lean on this with some caution: the classic survey on inter-coder agreement methods is clear that agreement coefficients can mask exactly the kind of context-dependent variation that subjective code review tends to produce, so we treat IAA as one signal among several rather than the sole gate. When alerts fire, we pause the engineer's work and re-run a short calibration before they continue.
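A rough sketch of what a reviewer-versus-cohort alert could look like is below. It is an assumption-laden example rather than our monitoring stack: it uses raw percent agreement with the leave-one-out majority instead of a chance-corrected coefficient such as Cohen's kappa or Krippendorff's alpha, and the function names and the 0.8 threshold are hypothetical.

```python
from collections import Counter, defaultdict


def cohort_agreement(labels):
    """labels: iterable of (reviewer, item_id, label) tuples.

    Returns {reviewer: share of their items where their label matches the
    most common label among the *other* reviewers of that item}. Items with
    no other reviewer, or a tie among the others, are skipped.
    """
    by_item = defaultdict(dict)
    for reviewer, item_id, label in labels:
        by_item[item_id][reviewer] = label

    matches = Counter()
    totals = Counter()
    for votes in by_item.values():
        for reviewer, label in votes.items():
            others = Counter(l for r, l in votes.items() if r != reviewer)
            top = others.most_common(2)
            if not top or (len(top) == 2 and top[0][1] == top[1][1]):
                continue  # nothing to compare against, or no clear majority
            totals[reviewer] += 1
            matches[reviewer] += int(label == top[0][0])
    return {reviewer: matches[reviewer] / totals[reviewer] for reviewer in totals}


def reviewers_to_pause(labels, threshold=0.8):
    """Reviewers whose agreement with the cohort is below `threshold`."""
    agreement = cohort_agreement(labels)
    return sorted(r for r, share in agreement.items() if share < threshold)


if __name__ == "__main__":
    # Hypothetical overlap set: three reviewers label the same five items.
    rows = []
    for item in range(5):
        rows.append(("a", item, "defect"))
        rows.append(("b", item, "defect"))
        rows.append(("c", item, "style" if item < 3 else "defect"))
    print(cohort_agreement(rows))
    print(reviewers_to_pause(rows))
```

Percent agreement is easy to read but says nothing about why a reviewer diverged, which is one reason an alert like this would trigger a calibration conversation rather than an automatic verdict.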

Quarterly, every engineer has a performance review. Some are retained, some are coached, and a small number are off-boarded. The off-boarding rate hovers around 10 percent annually, which is high by traditional staffing standards and low by what's needed for sustained data quality.

What we look for, in plain language

Pulling back from the process, here is what the system is actually selecting for:

Six or more years of production engineering experience in their primary stack, with shipped work we can verify.
The ability to verbalize reasoning, not just produce output. The job is half label, half rationale.
Willingness to operate under NDA and full IP assignment. Some senior engineers won't, and that's a hard filter at Stage 1.
A track record of code or design that has shipped to real users. Internal-only work counts, but a purely academic background usually doesn't.

What we don't optimize for: speed, low rates, willingness to do nights and weekends. Engineers selected on those axes produce different data. That difference is usually invisible to a sample-based QA process and visible to the model in training. We'd rather charge more and ship cleaner data than the inverse.

Why rubric design matters as much as who you hire

None of this works without a rubric worth calibrating against. Labs structure their own feedback pipelines around the same tension - OpenAI's account of training InstructGPT describes the core loop as collecting human demonstrations and comparisons, then training a reward model to reflect labeler preference, which only works if those labelers are applying a consistent standard in the first place. Anthropic's Constitutional AI work makes a related point from a different angle: when the standard itself is written down as an explicit set of principles, it becomes possible to audit why a judgment was made, not just what the judgment was. Our rubrics borrow that instinct - every defect category traces back to a written principle a candidate can be tested against, which is what Stage 2 and Stage 3 are actually probing for.

What this costs us - and what it saves the client

The vetting funnel is expensive to run. Each application costs roughly 90 minutes of calibrator time across Stages 1 through 3, plus the cost of the background-check provider in Stage 4 and double-reviewer time in Stage 5. Spread across a 2–4 percent accept rate, the per-hire cost is meaningful.

We invest in that funnel because the alternative - hiring on volume and filtering through batch rejection - produces measurably worse data. Calibration sessions get longer, gold-set acceptance trends down, drift fires more often, and client confidence erodes. Vendor relationships in this category live and die on data quality consistency. Cheap hiring is the most expensive thing a data vendor can do.

For our clients, the upside is that their gold-set acceptance rates land where we said they would, calibration cadence is predictable, and rubric updates ship cleanly across the cohort. None of that is glamorous. All of it shows up in the model.

See the process in action

If you're evaluating vendors for an RLHF, agent-trace, or evaluation engagement, the best thing you can do is run a calibrated pilot. Pick a task spec, define a gold set, and watch how each candidate vendor handles the first 100 items. The gap between marketing claims and actual output becomes obvious within a week.

When you're ready, we'd be glad to scope a pilot - or share our vendor security packet if you're earlier in the procurement cycle. Either way, the conversation starts the same way: tell us about your task spec, and we'll show you exactly how we'd structure the pod.