When he was 19 years old, Brendan Foody started Mercor with two of his high school friends as a way for his other friends, who also had startups, to hire software engineers overseas. It launched in 2023 as essentially a staffing agency, albeit a highly automated one. Language models reviewed resumes and did the interviewing. Within months, Mercor was bringing in $1 million in annualized revenue and turning a modest profit.
Then, in early 2024, the company Scale AI approached Mercor with a big request: They needed 1,200 software engineers. At the time, Scale was one of the only well-known names in the historically back-of-house business of producing AI training data. It had grown to a valuation of nearly $14 billion by orchestrating hundreds of thousands of people around the world to label data for self-driving cars, e-commerce algorithms, and language-model-powered chatbots. Now that OpenAI, Anthropic, and other companies were trying to teach their chatbots to code, Scale needed software engineers to produce the training data.
This, Foody sensed, could herald a larger change in the AI industry. He’d heard about growing demand for specialized data work, and now here was Scale asking for more than a thousand coders. When the engineers he recruited started complaining about missed pay (Scale has a reputation among data workers for chaotic platform management and is being sued in California over alleged wage theft, among other claims), Foody decided to cut out the middleman.
In September, Foody announced that Mercor had reached $500 million in annualized revenue, making it “the fastest growing company of all time.” The previous titleholder was Anysphere, which makes the AI coding tool Cursor. In a sign of the times, Cursor recently noted that its users produce the exact sort of training data labs are paying for, and The Information reported that OpenAI and xAI are interested in buying it.
Mercor’s most recent fundraising round valued the company at $10 billion. Foody and his two cofounders are 22 years old, making them the youngest self-made billionaires. At least one of their early employees has already left to start an AI data company of her own.
While discussions of AI infrastructure typically focus on the gargantuan buildout of data centers, an analogous race is happening with training data. Labs have already exhausted all the easily accessible data, adding to questions about whether the early rapid progress driven by sheer increases in scale will continue. Meanwhile, most recent improvements have come through new training techniques that make use of smaller datasets tailor-made by experts in particular fields, like programming and finance, and AI companies will pay premium prices for them.
There are no good statistics on how much labs are spending, but rough estimates from investors and industry insiders place the figure at over $10 billion this year and growing, the vast majority coming from five or so companies. These companies have yet to find a way to make money from AI, but the people selling them training data have. For now, they are some of the only AI companies turning a profit.
The data industry has long been the most undervalued and unglamorous aspect of AI development, according to a 2021 study by Google researchers, seen as regrettably necessary janitorial work to be done as quickly and cheaply as possible. Yet modern machine learning could not exist without its ecosystem of data suppliers, and the two spheres move in tandem.
The enormous datasets that proved the viability of machine learning in the early 2010s were made possible by the emergence several years before of Amazon Mechanical Turk, an early crowdsourcing platform where thousands of people could be paid pennies to label images of dogs and cats. The push to develop autonomous vehicles fed the growth of a new batch of companies, among them Scale AI, which refined the crowdsourcing approach through a dedicated work platform called Remotasks where workers used semi-automated annotation software to draw boxes around stop signs and traffic cones.
The turn to language model chatbots after the launch of ChatGPT initiated another transformation of the industry. ChatGPT got its humanlike fluency from a training approach called reinforcement learning from human feedback, or RLHF, which involved paying contractors to rate the quality of chatbot responses. A second model was trained on these ratings, then used to reward ChatGPT whenever it did something the second model predicted humans would like. Providing these ratings was a more nuanced affair than past iterations of crowdsourced data work, particularly as the chatbots got more advanced; it takes someone with medical training to judge whether medical advice is good.
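Stripped to its essentials, the mechanism looks something like the toy sketch below, which assumes chatbot responses have already been reduced to numerical embeddings. It is an illustration of the idea, not any lab’s actual code; real reward models are large transformers trained in far more elaborate pipelines.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: maps a response embedding to a scalar
# "how much will humans like this?" score.
class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, response_embedding):
        return self.score(response_embedding)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each training example pairs the response raters preferred ("chosen")
# with one they rejected. The loss pushes chosen scores above rejected ones.
chosen = torch.randn(8, 64)    # stand-in embeddings of preferred responses
rejected = torch.randn(8, 64)  # stand-in embeddings of rejected responses
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()

# During reinforcement learning, the chatbot is then rewarded with
# reward_model(embedding_of_new_response): a higher score whenever the
# second model predicts humans would like what it wrote.
```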
Scale supplied much of this human-ratings work, but a new company, Surge AI, self-funded by a data scientist named Edwin Chen, quietly grew to become the industry’s other major provider. In Chen’s past jobs at Google, Twitter, and Facebook, he had been dismayed at the poor quality of the data he received from vendors, full of mislabelings done for minimal pay by people who lacked relevant backgrounds. The vendors, Chen said, were just “body shops,” throwing people at the problem and trying to substitute quantity for quality.
Where Scale had its Remotasks platform, Surge has Data Annotation Tech: smaller, more targeted in its recruiting, and with tighter quality controls. It also pays better, around $30 an hour, though, like Scale, Surge is being sued in California for misclassification and unpaid wages. Demand from OpenAI and the labs trying to catch up was immense. The company has been profitable since it launched, and last year, it reportedly took in more than $1 billion in revenue, surpassing Scale’s reported $870 million. Earlier this year, Reuters reported that Surge is considering taking funding for the first time, looking for a $1 billion investment at a $15 billion valuation. According to Forbes, Chen still owns approximately 75 percent of it.
Data about which chatbot responses people prefer is a crude signal, however. Models are prone to learning simple hacks like “tell the user they’ve made an excellent point” instead of something as complex as “check for factual consistency with reliable sources.” Even when domain experts are doing the judging, the results often just sound more expert but are still too unreliable to actually be useful. Models ace bar exams but invent case law, pass CPA tests but pick the wrong cells in a spreadsheet. In July, researchers at MIT released a study finding that 95 percent of the businesses that have adopted generative AI have seen zero return.
AI companies hope that reinforcement learning with more granular criteria will change this. Recent improvements in math and coding are a proof of concept. OpenAI’s o1 and DeepSeek’s R1 showed that given a bunch of math and coding problems and a few step-by-step examples of how humans thought their way to solutions, models can become quite adept at these domains. As they trial-and-error their way to correct solutions, models weigh possible approaches, backtrack, and display other problem-solving techniques developers have called “reasoning.”
The problem is that math and coding problems are idealized, self-contained tasks compared to what a software engineer might encounter in the real world, so scores on benchmarks don’t reflect actual performance. To make models useful, AI companies need more data that is reflective of real tasks an engineer might do — hence the rush to hire software engineers.
The other problem is that math and coding might be the easiest possible domains for AI to conquer. For reinforcement learning to work, models need a clear signal of success to optimize for. This is why the method works so well for games like Go: Winning is a clear, unambiguous outcome, so models can try a million ways to achieve it. Similarly, code either runs or it doesn’t. The analogy isn’t perfect (ugly, inefficient code can still run), but execution provides something verifiable to optimize for.
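In practice, that verifiable signal can be as simple as running the model’s code against unit tests, as in this illustrative sketch. The `solve` function convention here is invented for the example, not taken from any lab’s pipeline.

```python
# Toy verifier: reward is 1 only if the model's code runs and passes
# every test. This is the kind of unambiguous signal reinforcement
# learning can optimize for.
def code_reward(candidate_source: str, tests: list) -> int:
    namespace = {}
    try:
        exec(candidate_source, namespace)  # does the code even run?
        for test_input, expected in tests:
            if namespace["solve"](test_input) != expected:
                return 0  # runs, but produces a wrong answer
        return 1  # unambiguous success
    except Exception:
        return 0  # crashes and missing functions count as failure

# A model proposing "double the input" earns full reward:
print(code_reward("def solve(x): return x * 2", [(2, 4), (5, 10)]))  # 1
# Ugly, inefficient code that still passes earns the same reward:
print(code_reward("def solve(x): return sum([x, x])", [(2, 4)]))     # 1
```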
Few other things in life are like this. There is no universal test for determining whether a legal brief or consulting analysis is “good.” Success depends on the context, goals, audience, and countless other variables.
“There seems to be a belief in the community that there’s a single reward function, that if we can just specify what we want these AI systems to do, then we can train them to [do it],” said Joelle Pineau, chief AI officer at Cohere, an enterprise-focused AI lab. But, she said, the reality is more varied and nuanced.
“[Reinforcement learning] wants one reward function. It’s not very good about finding solutions when you have multiple conflicting values that need to coexist, so we may need a very different paradigm than that.”
In lieu of a new paradigm, AI companies are attempting to brute force the problem by paying — via companies like Mercor and Surge — thousands of lawyers, consultants, and other professionals to write out in painstaking detail the criteria for what counts as a job well done in every conceivable context. The hope is that these lists, often called grading rubrics, will allow models to reinforcement-learn their way to competence in the same way they have begun doing with software engineering.
Rubrics are extremely labor-intensive to produce. People who work on them said that it is not unusual to spend 10 hours or more refining a single one, which might include more than a dozen different criteria. Companies guard the details of their training methods closely, but an example OpenAI released for its recent medical benchmark offers a good indication of what they’re like. Asked a question about an unresponsive neighbor, the model gets rewarded if its response includes advice to check for a pulse, locate a defibrillator, and perform CPR, along with 16 other criteria. There are nearly 50,000 such criteria in the benchmark, with different ones applying to different prompts. Labs are ordering tens to hundreds of thousands of rubrics with millions of criteria between them per training run, according to people in the data industry.
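In data terms, a rubric of this kind works roughly like the toy sketch below. The point values and keyword checks are stand-ins invented for illustration; in practice, a separate grader model typically judges whether each criterion is met.

```python
# Toy rubric reward, loosely modeled on the unresponsive-neighbor example.
# A real rubric for such a prompt has roughly 20 criteria, and the
# benchmark holds nearly 50,000 in total.
RUBRIC = [
    ("advises checking for a pulse", 5),
    ("advises locating a defibrillator", 5),
    ("advises performing CPR", 5),
]

def criterion_met(criterion: str, response: str) -> bool:
    # Stand-in check via the criterion's last word. Real systems ask a
    # grader model: "Does this response satisfy this criterion?"
    return criterion.split()[-1].lower() in response.lower()

def rubric_reward(response: str, rubric) -> float:
    earned = sum(pts for c, pts in rubric if criterion_met(c, response))
    total = sum(pts for _, pts in rubric)
    return earned / total  # normalized score the model learns to maximize

print(rubric_reward("Check for a pulse, then start CPR.", RUBRIC))  # ~0.67
```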
These rubrics need to be “super granular,” according to Mercor’s Foody. Producing consulting rubrics, Foody said, would start by creating a taxonomy of all the industries a consulting company operates in, then all the types of consulting it does in each of those industries, then all the types of reports and analyses a consultant might produce in each of those categories.
Performing these tasks typically requires doing things on computers, and each of those things needs a rubric, too. Sending an email requires a lot of steps — opening a browser, beginning a new message, typing it out, and so on. If the only verifier of success were whether the email got sent or received, a model could earn the reward without reliably learning the steps in between; it’s important to check for more actions than just one, according to Aakash Sabharwal, Scale’s VP of engineering.
Models learn to perform these tasks in simplified versions of software called reinforcement learning environments, often described as AI “gyms,” where models can stumble around until they figure out how to do the clicking and dragging required to score well on the grading rubric. The market for these environments is booming, too.
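A toy version of such an environment, built around the email example above, might look like this sketch. It is purely illustrative, since commercial environments simulate entire applications, but the principle of rewarding each correct intermediate step is the same.

```python
import random

# Toy "gym": the agent must discover the click sequence that sends an email.
class EmailEnv:
    STEPS = ["open_browser", "new_message", "type_body", "click_send"]

    def reset(self) -> int:
        self.progress = 0
        return self.progress  # observation: how far along the task is

    def step(self, action: str):
        if action == self.STEPS[self.progress]:
            self.progress += 1  # each correct intermediate step earns reward,
            reward = 1.0        # not just whether the email was finally sent
        else:
            reward = 0.0
        done = self.progress == len(self.STEPS)
        return self.progress, reward, done

# An agent "stumbles around," trying actions until it scores well.
env = EmailEnv()
obs, done = env.reset(), False
while not done:
    obs, reward, done = env.step(random.choice(EmailEnv.STEPS))
print("email sent after random exploration")
```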
As with rubrics, each one needs to be tailored to its use. “Sometimes it’s a DoorDash or a Salesforce clone, but a lot of times it’s just an enterprise-specific environment,” said Alex Ratner, cofounder and CEO of Snorkel AI. Snorkel makes annotation software but recently launched a human data service of its own.
Ratner cites a recurring irony in AI development known as Moravec’s paradox, named for a researcher working on computer vision in the 1980s who observed that the things that come easiest to humans are often the most difficult for machines. At the time, conventional wisdom was that machine vision would be solved before chess; after all, only a select few humans have the talent and training to be grandmasters, whereas even children can see. Now models can solve complex one-off coding challenges, but they flounder on more basic real-world engineering tasks without close human supervision, misusing tools and making obvious errors.
“That kind of real work, with ambiguous, intermediate metrics of success that seem way more mundane than a coding competition, that is where models struggle,” Ratner said. “That’s the counterintuitive frontier, and that’s where people are trying to lean in, ourselves included, with building more complex environments, more nuanced rubrics.”
According to vendors, the most in-demand fields are the ones that sit at the sweet spot of verifiability and economic value. Software engineering continues to be the largest, followed by finance and consulting. Law is popular, though so far it is proving less verifiable, and thus less amenable to reinforcement learning. Physics, chemistry, and math are all in demand. Really, it’s nearly anything you can imagine. There are ads for nuclear engineers and animal trainers.
“It’s everything from clinical hospital settings to legal deep research to — we got a request for woodworking the other day,” Ratner said. “It’s every nook and cranny of human expertise.”
Encoding all of humanity’s skill and know-how into checklists is an enormous, possibly quixotic undertaking, but the frontier labs have billions to spend, and the sheer scale of their demand is reconfiguring the data industry. New entrants seem to appear by the day, and everyone is touting successively more pedigreed experts getting paid ever higher rates.
Surge touts its Fields Medalist mathematicians, Supreme Court litigators, and Harvard historians. Mercor advertises its Goldman analysts and McKinsey consultants. Handshake AI, another fast-growing expert provider, boasts of its physicists from Berkeley and Stanford and the ability to draw alumni from more than 1,000 universities.
Garrett Lord, the CEO and cofounder of Handshake, started picking up signals about the changing data market last year, when incumbent data providers came around asking for experts. Handshake had experts. Lord founded the company in 2014 as a sort of LinkedIn-meets-Glassdoor for college students and recent grads looking for internships and first jobs. More than a thousand college career centers pay for access, as do companies looking to recruit from Handshake’s 20 million alumni, grad students, master’s holders, and PhDs. Early this year, Lord entered the AI data market himself, launching essentially a second company within his existing one, called Handshake AI.
Then, in June, Meta hired away Scale’s CEO and took a 49 percent stake in the company. Rival labs fled, wary that Scale would no longer be a neutral provider — could they trust the data now that it was being provided by a quasi-Meta subsidiary? It was like breaking a billion-dollar piñata over all the data startups. Handshake saw demand triple overnight.
In November, Handshake surpassed a $150 million run rate, exceeding that of the original decade-old business. There is more demand than the company can meet, Lord said. “We’ve gone from three to 150 people in five months,” he said. “We’ve had 18 people start on a Monday. We’re running out of desks.”
The ravenous demand of AI model-builders is pulling any company that might have data to offer into its gravitational field. Turing, which began as a staffing agency but pivoted to training data after OpenAI approached the company in 2022, also saw demand spike following the Scale deal. As did Labelbox, which makes annotation software but last year launched its own expert-annotator service, called Alignerr, where buyers can search for experts, called “Alignerrs,” who’ve been vetted by Labelbox’s AI interviewer, named Zara.
Staffing agencies, content moderation subcontractors, and other adjacent businesses are also reorienting around the labs. Invisible Technologies started 10 years ago as a personal assistant bot that directed tasks to workers overseas, but it started posting twentyfold revenue increases as AI labs hired those workers to produce data. This year, it brought on an ex-McKinsey executive as CEO, took on venture funding, and is positioning itself as an AI training company. The company Pareto followed the same trajectory, launching in 2020 by offering executive assistants based in the Philippines and now selling AI training data services.
The company Micro1 began in 2022 as a staffing agency offering software engineers vetted by AI, but now it’s a data labeling company, too. In July, Reuters reported that the company had seen annualized revenue go from $10 million to $100 million this year and was finalizing a Series A funding round valuing it at $500 million.
Even Uber is angling to get a piece of the action. In October, it bought a Belgian data labeling startup and is in the process of rolling out an annotation platform to US workers, so drivers can annotate when they aren’t driving.
Then there is a long list of smaller, niche players. The company Sapien is paying data labelers in crypto. Rowan Stone, CEO of Sapien, told The Verge in July that the data labeling company — which specializes in vertical models focused on just one thing and has Scale cofounder Lucy Guo on its advisory board — is “absorbing the collective knowledge of humanity.” It isn’t even the only human data startup paying in crypto tokens.
Stellar, Aligned, FlexiBench, Revelo, Deccan AI — everyone is touting their talent networks, their experts in the loop, their data enrichment pipelines. The company Mechanize rose above the scrum on a wave of viral outrage by announcing in April that its goal was “the full automation of all work.” How will it accomplish this provocative goal? By selling training data and environments, like everyone else.
Like Nvidia, the dominant designer of AI chips, these companies sell the picks and shovels for the AI gold rush, capturing the billions in debt-financed spending flowing out of the frontier labs as they race to achieve superintelligence. It’s a safer business than prospecting, and it is much easier to start selling data than to design new chips, so startups are proliferating.
“It’s like everyone and their mother realized, ‘Hey, I’m doing a human data startup,’” said Adam J. Gramling, a former Scale employee who said he received approximately 300 recruiting messages on LinkedIn when he announced his departure in one of Scale’s recent rounds of layoffs. “This Cambrian explosion happened, and now let’s see who survives.”
The data industry may be growing quickly, but it is a historically tumultuous business, littered with former giants felled by a sudden change in training techniques or the departure of a customer. In August 2020, the Australian data annotation company Appen’s market cap surpassed the equivalent of $4.3 billion; now, it’s less than $130 million, a 97 percent decline. Eighty percent of Appen’s revenue came from just five clients — Microsoft, Apple, Meta, Google, and Amazon — which made even a single client departure an existential event.
Today’s market is also highly concentrated. On a recent podcast, Foody compared Mercor’s customer concentration to Nvidia’s, where four customers represent 61 percent of revenue. If investors tire of giving money to model-builders, or the labs take a different approach to training, the effects could be devastating. All of the AI developers already use multiple data suppliers, and as the exodus from Scale showed, they are quick to take their money elsewhere.
All this lends itself to a fiercely competitive atmosphere. On podcasts and in interviews, the CEOs take swipes at the business models of their rivals. Chen still thinks most of his competitors are “body shops.” Foody refers to Surge and Scale as legacy crowdsourcers in an era of highly paid experts. Handshake’s Lord says his rivals are spending thousands on recruiters spamming physicists on TikTok, but they’re all already on his platform. All three say Scale had quality problems even before it was tainted by Meta’s investment. Every time one of these barbs is reported, a Scale spokesperson snipes back, accusing Foody of seeking publicity or mocking Chen for his lengthy fundraising round. Scale is also currently suing Mercor, claiming it poached an employee who stole clients on their way out the door.
For now, there is more than enough money flowing from the labs for everyone. They want rubrics, environments, experts of every conceivable type, but they’re still buying the old types of data too. “It’s always increasing,” says Surge’s Chen. “These ever-increasing new forms of training, they’re almost complementary to each other.”
Even Scale is growing after its post-Meta setback, and major customers have come back, at least in some capacity. Interim CEO Jason Droege said in an onstage interview in September that the company is still working with Google, Microsoft, OpenAI, and xAI. To better compete in the enterprise AI space, Scale has also started a program called the “Human Frontier Collective” for white-collar professionals in STEM fields like computer science, engineering, mathematics, and cognitive science.
Scale told The Verge that both its data and applications businesses are each generating nine figures of revenue, with its data business growing each month since the Meta investment and its application business doubling from the first half to the second half of 2025. It also said that the third quarter of 2025 was its public sector business’s best quarter since 2020, partly due to government contracts. Scale also reportedly expects revenue for this year to more than double, to $2 billion. (The company declined to comment on the figure on the record.)
It has diversified into selling evaluations, the tests that AI developers use to see where their models are weak and need more training data, according to Bing Liu, Scale’s head of research. The business strategy: Companies will ideally use the evaluations to see where their own models are lacking in data — and then, ideally, buy those types of data from Scale.
The 11-digit valuations of just-launched data companies could be seen as signs of an AI bubble, but they could also represent a bet on a certain trajectory of AI development. (Both can also be true.) The goal held out by the AI labs when justifying their enormous expenditures is an imminent breakthrough to artificial general intelligence, something, to use the definition in OpenAI’s charter, that is “highly autonomous” and can “outperform humans at most economically valuable work.”
The term is amorphous and disputed, but one thing artificial general intelligence should be able to do is, well, generalize. If you train it to do math and accounting, it should be able to do your taxes without further rounds of reinforcement learning on tax law, state-specific tax rules, the most recent edition of TurboTax, and so on. A generally capable agent should not need massive amounts of new data to handle each variety of task in every domain.
“The future where the AI labs are right is one where as performance goes up, the need for human data goes down, until you can take the human out of the loop entirely,” said Daniel Kang, assistant professor of computing and data science at the University of Illinois Urbana-Champaign, who has written about the demand for training data. Instead, the opposite seems to be happening. Labs are spending more on data than ever before, and improvements are coming from bespoke datasets tailored to increasingly specific applications. Given current training trends, Kang predicts that getting high-quality human data in each discrete domain will be the primary bottleneck for future AI progress.
In this scenario, AI looks more like a “normal technology,” Kang said. Normal technology here being something like steam engines or the internet — potentially transformative, but also not computer god. (This is also, he hypothesized, why companies are less keen to trumpet their spending on data than they are on data centers: It cuts against their fundraising narrative.) In the AI-as-normal future, companies will need to buy new data whenever they want to automate a particular task, and keep buying data as workflows change.
The data companies are betting on that too. “The labs very much want to say that we’re going to have superintelligence that generalizes as soon as possible,” said Foody. “The way it’s playing out in practice is that reinforcement learning has a limited generalization radius, so they need to build evals across all the things that they want to optimize for, and their investments in that are exploding very quickly.”
Other companies, predicting that the frontier models will not “just hit this point of generalization where it’s just magic and you can do everything,” in the words of Ryan Wexler, who manages AI infrastructure investments at SignalFire, are positioning themselves to cater to the many companies that will need to tune models to suit their purposes.
SignalFire invested in Centaur AI, a medical and scientific data company. Rather than the frontier labs, most of Centaur’s customers are medical institutions like Memorial Sloan Kettering or Medtronic with highly specific applications and low margins for error. Last year, the smart mattress company Eight Sleep wanted to add “snore detection” to its bed’s suite of capabilities. Existing models struggled, so the company hired Centaur to enlist more than 50,000 people to label snores.
“The attempts to make the God model, I don’t know what will happen there, but I’m very confident that demand will keep growing among everyone else,” said Centaur’s founder and CEO, Erik Duhaime. “Everyone was sold some dream that this will be easy, plug and play,” Duhaime said. “Now they’re realizing, ‘Oh, we need to customize this thing for our use case.’”
Matt Fitzpatrick, the CEO of Invisible, is also focusing on enterprise services. If you look at “spend curves over time,” he said, the enterprise is “where a lot of this will move.” Since January, the company has overhauled its business to focus more on attracting enterprise clients; about 30 percent of its data annotation pool now holds PhDs or master’s degrees. Fitzpatrick describes the company as a “digital assembly line” where experts “anywhere on Earth” can be called in to generate data. These days, Invisible is most often asked to provide environments for software development and contact centers, he said.
If AGI is to be achieved one order of contact-center training rubrics at a time, the future looks bright for data vendors, which is perhaps why a new grandeur has entered the language of the CEOs. Turing’s CEO predicts that AI data annotator will become the most common job on the planet in the coming years, with billions of people evaluating and training models. Handshake’s Lord sees the nascent formation of a new category of work, comparing it to Uber drivers a decade ago.
“We’re going to need a huge build-out of data and evals across every industry in the economy,” Foody said. At Mercor, he says, the customer support team responds to tickets the AI agent can’t manage, but also updates its rubrics so it can field those questions next time. “If you zoom out,” he said, “it feels like the entire economy will become a reinforcement learning environment.”
If investors don’t find this vision as enticing as a country of geniuses in a data center, as Anthropic’s Dario Amodei described the impending transformation, they can take consolation in the fact that someone, at least, has found a way to make money off AI.


