For years, tech CEOs have touted visions of AI agents that can autonomously use software applications to complete tasks for people. But take today's consumer AI agents out for a spin, whether OpenAI's ChatGPT Agent or Perplexity's Comet, and you'll quickly realize how limited the technology still is. Making AI agents more robust may take a new set of techniques that the industry is still discovering.
One of those techniques is carefully simulating the workspaces in which agents can be trained on multi-step tasks, known as reinforcement learning (RL) environments. Much as labeled datasets powered the last wave of AI, RL environments are starting to look like a critical element in the development of agents.
AI researchers, founders, and investors tell TechCrunch that leading AI labs are now demanding more RL environments, and there is no shortage of startups hoping to supply them.
“All the big AI labs are building RL environments in-house,” said Jennifer Li, general partner at Andreessen Horowitz, in an interview with TechCrunch. “But as you can imagine, creating these datasets is very complex, so AI labs are also looking at third-party vendors that can create high-quality environments and evaluations. Everyone is looking at this space.”
The push for RL environments has minted a new class of well-funded startups, such as Mechanize and Prime Intellect, that hope to lead the space. Meanwhile, large data labeling companies such as Mercor and Surge say they are investing more in RL environments to keep up with the industry's shift from static datasets to interactive simulations. The major labs are also reportedly considering heavy investments of their own, on the order of $1 billion on RL environments over the next year.
The hope for investors and founders is that one of these startups emerges as the “Scale AI for environments,” referring to the $29 billion data-labeling powerhouse that fueled the chatbot era.
The question is whether RL environments will actually push the frontier of AI progress.
What is an environment?
At their core, RL environments are training grounds that simulate what an AI agent would be doing in a real software application. One founder, in a recent interview, described building them as “like creating a very boring video game.”
For example, an environment could simulate a Chrome browser and task an AI agent with purchasing a pair of socks on Amazon. The agent is graded on its performance and sent a reward signal when it succeeds (in this case, buying a suitable pair of socks).
While such a task sounds relatively simple, there are a lot of places where an AI agent could stumble. It might get lost navigating the web page's drop-down menus, or buy far too many socks. And because developers can't predict exactly which wrong turn an agent will take, the environment itself has to be robust enough to capture unexpected behavior and still deliver useful feedback. That makes building environments far more complex than building a static dataset.
Some environments are quite elaborate, allowing AI agents to use tools, access the internet, or use various software applications to complete a given task. Others are narrower, aimed at helping an agent learn specific tasks in enterprise software applications.
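To make the idea concrete, here is a minimal sketch of what such a training ground can look like in code. It follows the Gymnasium-style reset/step convention common in RL work, but every name, action, and reward value below is illustrative rather than taken from any lab's actual environment.

```python
# A toy "buy socks" environment following the Gymnasium-style reset/step
# convention. The actions, states, and reward values are illustrative only.

class SockShoppingEnv:
    """Simulates a minimal shopping flow and emits a reward signal."""

    MAX_STEPS = 10  # give up if the agent wanders for too long

    def reset(self):
        self.state = {
            "on_product_page": False,
            "size_selected": False,
            "cart": [],
            "purchased": False,
        }
        self.steps = 0
        return self._observation(), {}

    def step(self, action: str):
        self.steps += 1
        s = self.state

        if action == "open_product_page":
            s["on_product_page"] = True
        elif action == "select_size" and s["on_product_page"]:
            s["size_selected"] = True
        elif action == "add_to_cart" and s["size_selected"]:
            s["cart"].append("socks")
        elif action == "add_random_item":
            s["cart"].append("junk")  # one of many plausible wrong turns
        elif action == "checkout" and s["cart"]:
            s["purchased"] = True

        # Reward only the intended outcome: exactly one pair of socks bought.
        success = s["purchased"] and s["cart"] == ["socks"]
        reward = 1.0 if success else 0.0
        terminated = s["purchased"]
        truncated = self.steps >= self.MAX_STEPS
        return self._observation(), reward, terminated, truncated, {}

    def _observation(self):
        # A real environment would return a screenshot or DOM tree; here the
        # observation is just a copy of the raw state.
        return dict(self.state)


if __name__ == "__main__":
    env = SockShoppingEnv()
    obs, _ = env.reset()
    for action in ["open_product_page", "select_size", "add_to_cart", "checkout"]:
        obs, reward, terminated, truncated, _ = env.step(action)
    print("reward:", reward)  # 1.0 -> the agent earned the reward signal
```

A production environment would render real browser state instead of a dictionary and would have to handle far more failure modes, but the reward-signal plumbing takes roughly this shape.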
While RL environments are the hot thing in Silicon Valley right now, there is plenty of precedent for the technique. One of OpenAI's first projects, back in 2016, was building “RL Gyms,” which were quite similar to the modern conception of environments. That same year, Google DeepMind's AlphaGo system beat a world champion at the board game Go; it, too, used RL techniques within a simulated environment.
What's unique about today's environments is that researchers are trying to build computer-using AI agents on top of large transformer models. Unlike AlphaGo, which was a specialized AI system working in a closed environment, today's AI agents are trained to have more general capabilities. AI researchers today have a stronger starting point, but also a more complicated goal, where more can go wrong.
A crowded field
Established AI data labeling companies such as Surge and Mercor are trying to meet the moment and build RL environments. These companies have more resources than many startups in the space, as well as deep relationships with the AI labs.
Surge CEO Edwin Chen tells TechCrunch that he has recently seen a significant rise in demand for RL environments from AI labs. Surge, which reportedly generated $1.2 billion in revenue last year from working with labs such as OpenAI, Google, Anthropic, and Meta, recently spun up a new internal organization specifically charged with building RL environments, he said.
Close behind Surge is Mercor, a startup valued at $10 billion, which has also worked with OpenAI, Meta, and Anthropic. Mercor is pitching investors on its business building RL environments for domain-specific tasks such as coding, healthcare, and law, according to marketing materials seen by TechCrunch.
Mercor CEO Brendan Foody told TechCrunch in an interview that “few understand how large the opportunity around RL environments really is.”
Scale AI used to dominate the data labeling space, but it has lost ground since Meta invested $14 billion in it and hired away its CEO. Since then, Google and OpenAI have dropped Scale AI as a data provider, and the startup even faces competition for data labeling work inside Meta. Still, Scale is trying to meet the moment and build environments.
“This is just the nature of the business [Scale AI] is in,” said Chetan Rane, Scale AI's head of product for agents and RL environments. “Scale has demonstrated its ability to adapt quickly. We did this in the early days of autonomous vehicles, our first business unit. When ChatGPT came out, Scale AI adapted to that. And now, once again, we're adapting to new frontier spaces like agents and environments.”
Some newer players are focusing exclusively on environments from the outset. Among them is Mechanize, a startup founded roughly six months ago with the audacious goal of “automating all jobs.” However, co-founder Matthew Barnett tells TechCrunch that his company is starting with RL environments for AI coding agents.
Mechanize aims to supply AI labs with a small number of robust RL environments, Barnett says, rather than acting like the larger data firms that create a wide range of simpler ones. To that end, the startup is offering software engineers salaries of $500,000 to build RL environments, far more than an hourly contractor could earn working at Scale AI or Surge.
Mechanize has already been working with Anthropic on RL environments, two sources familiar with the matter told TechCrunch. Mechanize and Anthropic declined to comment on the partnership.
Other startups are betting that RL environments will be influential beyond the big AI labs. Prime Intellect, a startup backed by AI researcher Andrej Karpathy, Founders Fund, and Menlo Ventures, is targeting smaller developers with its RL environments.
Last month, Prime Intellect launched an RL environments hub that aims to be a “Hugging Face for RL environments.” The idea is to give open-source developers access to the same resources as large AI labs, and to sell those developers access to computational resources in the process.
Training generally capable agents in RL environments can be more computationally expensive than previous AI training techniques, according to Prime Intellect researcher Will Brown. Beyond the startups building RL environments, there is another opportunity here for GPU providers that can power the process.
“RL environments are going to be too large for any single company to dominate,” Brown said in an interview. “Part of what we're doing is just trying to build good open-source infrastructure around them. The service we sell is compute, so it's a convenient onramp to using GPUs, but we're thinking about this for the longer term.”
Will it scale?
The open question around RL environments is whether the technique will scale the way previous AI training methods did.
Reinforcement learning has powered some of the biggest leaps in AI over the past year, including models such as OpenAI's o1 and Anthropic's Claude Opus 4. Those are particularly important breakthroughs because the methods previously used to improve AI models are now showing diminishing returns.
Environments are part of AI labs' larger bet on RL, which many believe will continue to drive progress as labs add more data and computational resources to the process. Some of the OpenAI researchers behind o1 have previously told TechCrunch that the company originally invested in AI reasoning models, which were created through investments in RL and test-time compute, because it believed the approach would scale well.
The best way to scale RL remains unclear, but environments seem like a promising contender. Instead of simply rewarding chatbots for text responses, they let agents operate in simulations with tools and computers at their disposal. That is far more resource intensive, but potentially more rewarding.
Some are skeptical that all these RL environments will pan out. Ross Taylor, a former AI research lead at Meta who co-founded General Reasoning, tells TechCrunch that RL environments are prone to reward hacking: a process in which AI models cheat in order to collect the reward without actually doing the task.
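As an illustration of what reward hacking can look like, here is a small, hypothetical sketch (not drawn from any real lab's environment; the state keys and item names are invented) of a loosely specified reward check that an agent can game, next to a stricter one:

```python
# Illustrative only: a loosely specified reward check versus a stricter one.
# Neither is taken from a real environment; the state keys are made up.

def naive_reward(final_state: dict) -> float:
    # Rewards reaching the order-confirmation page, no matter what was bought.
    return 1.0 if final_state.get("order_confirmed") else 0.0

def stricter_reward(final_state: dict) -> float:
    # Also checks that the cart contained exactly the requested item.
    confirmed = final_state.get("order_confirmed", False)
    cart_ok = final_state.get("cart") == ["socks"]
    return 1.0 if (confirmed and cart_ok) else 0.0

# An agent that buys any cheap item just to reach the confirmation page
# scores full marks under the naive check but nothing under the stricter one.
hacked_state = {"order_confirmed": True, "cart": ["bookmark"]}
print(naive_reward(hacked_state))     # 1.0 -> the hack pays off
print(stricter_reward(hacked_state))  # 0.0 -> the hack is caught
```

Hardening checks like this across every task an agent might attempt is part of what makes robust environments expensive to build.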
“I think people are underestimating how difficult it is to scale environments,” said Taylor. “Even the best publicly available [RL environments] typically don't work without serious modification.”
Sherwin Wu, OpenAI's head of engineering for its API business, said in a recent podcast that he was “short” on RL environment startups. Wu noted that the space is extremely competitive, but also that AI research is evolving so quickly that it is hard to serve AI labs well.
Karpathy, a Prime Intellect investor who has called RL environments a potential breakthrough, has also voiced caution about the RL space more broadly. In a post on X, he raised concerns about how much more AI progress can be squeezed out of RL.
“I am bullish on environments and agentic interactions, but I am bearish on reinforcement learning specifically,” said Karpathy.
Update: A previous version of this article referred to Mechanize as Mechanize Work. It has been updated to reflect the company's official name.