April 25 (Reuters) – As the summer of 2022 came to a close, Meta CEO Mark Zuckerberg gathered his top lieutenants for a five-hour dissection of the company’s computing capacity, focused on its ability to do cutting-edge artificial intelligence work, according to a company memo dated Sept. 20 reviewed by Reuters.
They had a thorny problem: despite high-profile investments in AI research, the social media giant had been slow to adopt expensive AI-friendly hardware and software systems for its main business, hobbling its ability to keep pace with innovation at scale even as it increasingly relied on AI to support its growth, according to the memo, company statements and interviews with 12 people familiar with the changes, who spoke on condition of anonymity to discuss internal company matters.
“We have a significant gap in our tooling, workflows and processes when it comes to developing for AI. We need to invest heavily here,” said the memo, written by new head of infrastructure Santosh Janardhan, which was posted on Meta’s internal message board in September and is being reported now for the first time.
Supporting AI work would require Meta (META.O) to “fundamentally shift our physical infrastructure design, our software systems, and our approach to providing a stable platform,” it added.
For more than a year, Meta has been engaged in a massive project to whip its AI infrastructure into shape. While the company has publicly acknowledged “playing a little bit of catch-up” on AI hardware trends, details of the overhaul – including capacity crunches, leadership changes and a scrapped AI chip project – have not been reported previously.
Asked about the memo and the restructuring, Meta spokesperson Jon Carvill said the company “has a proven track record in creating and deploying state-of-the-art infrastructure at scale combined with deep expertise in AI research and engineering.”
“We’re confident in our ability to continue expanding our infrastructure’s capabilities to meet our near-term and long-term needs as we bring new AI-powered experiences to our family of apps and consumer products,” said Carvill. He declined to comment on whether Meta abandoned its AI chip.
Janardhan and other executives did not grant requests for interviews made via the company.
The overhaul raised Meta’s capital expenditures by about $4 billion a quarter, according to company disclosures – nearly double its spend as of 2021 – and led it to pause or cancel previously planned data center builds in four locations.
Those investments have coincided with a period of severe financial squeeze for Meta, which has been laying off employees since November at a scale not seen since the dotcom bust.
Meanwhile, Microsoft-backed OpenAI’s ChatGPT surged to become the fastest-growing consumer application in history after its Nov. 30 debut, triggering an arms race among tech giants to release products using so-called generative AI, which, beyond recognizing patterns in data like other AI, creates human-like written and visual content in response to prompts.
Generative AI gobbles up reams of computing power, amplifying the urgency of Meta’s capacity scramble, said five of the sources.
FALLING BEHIND
A key source of the trouble, those five sources said, can be traced back to Meta’s belated embrace of the graphics processing unit, or GPU, for AI work.
GPU chips are uniquely well-suited to artificial intelligence processing because they can perform large numbers of tasks simultaneously, reducing the time needed to churn through billions of pieces of data.
However, GPUs are also more expensive than other chips, with chipmaker Nvidia Corp (NVDA.O) controlling 80% of the market and maintaining a commanding lead on accompanying software, the sources said.
Nvidia did not respond to a request for comment for this story.
Instead, until last year, Meta largely ran AI workloads using the company’s fleet of commodity central processing units (CPUs), the workhorse chip of the computing world, which has filled data centers for decades but performs AI work poorly.
According to two of those sources, the company also started using its own custom chip it had designed in-house for inference, an AI process in which algorithms trained on huge amounts of data make judgments and generate responses to prompts.
By 2021, that two-pronged approach proved slower and less efficient than one built around GPUs, which were also more flexible in running different types of models than Meta’s chip, the two people said.
Meta declined to comment on its AI chip’s performance.
As Zuckerberg pivoted the company toward the metaverse – a set of digital worlds enabled by augmented and virtual reality – its capacity crunch was slowing its ability to deploy AI to respond to threats, like the rise of social media rival TikTok and Apple-led ad privacy changes, said four of the sources.
The stumbles caught the attention of former Meta board member Peter Thiel, who resigned in early 2022, without explanation.
At a board meeting before he left, Thiel told Zuckerberg and his executives they were complacent about Meta’s core social media business while focusing too much on the metaverse, which he said left the company vulnerable to the challenge from TikTok, according to two sources familiar with the exchange.
Meta declined to comment on the conversation.
CATCH-UP
After pulling the plug on a large-scale rollout of Meta’s own custom inference chip, which was planned for 2022, executives reversed course and placed orders that year for billions of dollars’ worth of Nvidia GPUs, one source said.
Meta declined to comment on the order.
By then, Meta was already several steps behind peers like Google, which had begun deploying its own custom-built version of GPUs, called the TPU, in 2015.
That spring, executives also set about reorganizing Meta’s AI units, naming two new heads of engineering in the process, including Janardhan, the author of the September memo.
More than a dozen executives left Meta during the months-long upheaval, according to their LinkedIn profiles and a source familiar with the departures, a near-wholesale change of AI infrastructure leadership.
Meta next started retooling its data centers to accommodate the incoming GPUs, which draw more power and produce more heat than CPUs, and which must be clustered closely together with specialized networking between them.
The facilities needed 24 to 32 times the networking capacity and new liquid cooling systems to manage the clusters’ heat, requiring them to be “entirely redesigned,” according to Janardhan’s memo and four sources familiar with the project, details of which have not previously been disclosed.
As the work got underway, Meta made internal plans to start developing a new and more ambitious in-house chip, which, like a GPU, would be capable of both training AI models and performing inference. The project, which has not been reported previously, is set to finish around 2025, two sources said.
Carvill, the Meta spokesperson, said data center construction that was paused while transitioning to the new designs would resume later this year. He declined to comment on the chip project.
TRADE-OFFS
While scaling up its GPU capacity, Meta, for now, has had little to show as competitors like Microsoft and Google promote public launches of commercial generative AI products.
Chief Financial Officer Susan Li acknowledged in February that Meta was not devoting much of its current compute to generative work, saying “basically all of our AI capacity is going towards ads, feeds and Reels,” its TikTok-like short video format that is popular with younger users.
According to four of the sources, Meta did not prioritize building generative AI products until after the launch of ChatGPT in November. Even though its research lab FAIR, or Facebook AI Research, has been publishing prototypes of the technology since late 2021, the company was not focused on converting its well-regarded research into products, they said.
As investor interest soars, that is changing. Zuckerberg announced a new top-level generative AI team in February that he said would “turbocharge” the company’s work in the area.
Chief Technology Officer Andrew Bosworth likewise said this month that generative AI was the area where he and Zuckerberg were spending the most time, forecasting Meta would release a product this year.
Two people familiar with the new team said its work was in the early stages and focused on building a foundation model, a core program that can later be fine-tuned and adapted for different products.
Carvill, the Meta spokesperson, said the company has been building generative AI products on different teams for more than a year. He confirmed that the work has accelerated in the months since ChatGPT’s arrival.
Reporting by Katie Paul, Krystal Hu, Stephen Nellis and Anna Tong; additional reporting by Jeffrey Dastin; editing by Kenneth Li and Claudia Parsons