Facebook’s data center in Prineville, Oregon, February 16, 2018. Large data centers have experienced outages which may be partly the result of chip errors. (Leah Nash/The New York Times)
Imagine for a moment that the millions of computer chips inside the servers that power the world’s largest data centers had rare, almost undetectable defects. And the only way to find the flaws was to throw those chips at giant computer problems that would have been unthinkable just a decade ago.
As the tiny switches of computer chips have shrunk to the width of a few atoms, chip reliability has become another concern for people running the world’s largest networks. Companies like Amazon, Facebook, Twitter and many other sites have experienced surprising outages over the past year.
Outages have several causes, such as programming errors and network congestion. But there are growing concerns that as cloud computing networks have become larger and more complex, they still depend, at the most basic level, on computer chips that are now less reliable and, in some cases, less predictable.
Over the past year, researchers from Facebook and Google have published studies describing computer hardware failures whose causes were not easily identified. The problem, they said, was not in the software – it was somewhere in the computer hardware made by various companies. Google declined to comment on its study, while Facebook did not return requests for comment on its study.
“They see these silent errors, basically coming from the underlying hardware,” said Subhasish Mitra, a Stanford University electrical engineer who specializes in computer hardware testing. Increasingly, Mitra said, people believe that manufacturing defects are related to these so-called silent errors that cannot be easily detected.
Researchers worry that they are finding rare faults because they are trying to solve ever-larger computing problems, which strain their systems in unexpected ways.
There is growing evidence that the problem is getting worse with each new generation of chips. A 2020 report by chipmaker Advanced Micro Devices found that the most advanced computer memory chips at the time were about 5.5 times less reliable than the previous generation. AMD did not respond to requests for comment on the report.
Until now, computer designers have tried to deal with hardware faults by adding special circuits to the chips that correct the errors. The circuitry automatically detects and corrects bad data. Such errors were once considered extremely rare. But several years ago, Google’s production teams started reporting errors that were extremely difficult to diagnose. The miscalculations occurred intermittently and were difficult to reproduce, according to their report.
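The error-correcting circuitry described above can be illustrated with a classic textbook scheme, the Hamming(7,4) code, which stores three check bits alongside every four data bits so that any single flipped bit can be located and repaired. The sketch below is only an illustration of the principle; it is not the design used in any particular chip, and the function names `encode` and `decode` are chosen here for the example.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three syndrome bits, read as a binary number, give the
    # 1-based position of a single flipped bit.
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1  # repair the flipped bit
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate one bit flipping in memory
    print(decode(word))
```

A code like this protects data while it is stored or moved. It does nothing for a processor core that computes a wrong answer and then stores that answer correctly, which is why the miscalculations Google reported were so hard to catch.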
A team of researchers tried to track down the problem, and last year they published their findings. They concluded that the company’s vast data centers, composed of computer systems based on millions of processor “cores,” were experiencing new errors that were likely a combination of two factors: smaller transistors that were approaching physical limits and inadequate testing.
In their paper, “Cores That Don’t Count,” Google researchers noted that the problem was difficult enough that they had already spent the equivalent of decades of engineering time on it.
Modern CPU chips are made up of dozens of cores, computational engines that allow tasks to be broken down and solved in parallel. The researchers found that a small subset of cores produced inaccurate results rarely and only under certain conditions. They described the behavior as sporadic. In some cases, the cores produced errors only when the calculation speed or the temperature was changed.
According to Google, the increasing complexity of processor design was a major cause of failure. But engineers also said smaller transistors, three-dimensional chips and new designs that only create errors in certain cases have all contributed to the problem.
In a similar paper published last year, a group of Facebook researchers noted that some processors would pass manufacturers’ tests but then begin exhibiting failures once they were in the field.
Intel executives said they are familiar with research papers from Google and Facebook and are working with the two companies to develop new methods for detecting and correcting hardware errors.
Bryan Jorgensen, vice president of Intel’s Data Platforms Group, said the researchers’ claims were correct and that “the challenge they are issuing to the industry is the right place to go.”
He said Intel had recently started a project to help create standard, open-source software for data center operators. The software would allow them to find and correct hardware errors that were not detected by the error-checking circuits built into the chips.
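One simple way software can look for errors that a chip's own circuits miss is a known-answer test: run a fixed computation repeatedly and compare each result with a value recorded on trusted hardware. The sketch below illustrates that general idea only. It is not Intel's software or any vendor's product, and every name in it (`reference_workload`, `screen`, `make_flaky`) is hypothetical; the simulated fault stands in for a real defective core.

```python
import hashlib


def reference_workload(seed: int) -> str:
    """A deterministic computation whose correct answer can be recorded in advance."""
    data = seed.to_bytes(8, "big")
    for _ in range(1000):
        data = hashlib.sha256(data).digest()
    return data.hex()


def screen(compute, seed, expected, repeats=5):
    """Run `compute` several times; return the indices of runs that gave a wrong answer."""
    return [run for run in range(repeats) if compute(seed) != expected]


def make_flaky(compute, bad_call):
    """Simulate an intermittent fault: corrupt the result of one chosen call (0-based)."""
    state = {"calls": 0}

    def flaky(seed):
        result = compute(seed)
        call_number = state["calls"]
        state["calls"] += 1
        return "corrupt" if call_number == bad_call else result

    return flaky


if __name__ == "__main__":
    golden = reference_workload(1)  # assumed to come from known-good hardware
    print(screen(reference_workload, 1, golden))                 # healthy: no mismatches
    print(screen(make_flaky(reference_workload, 2), 1, golden))  # one bad run flagged
```

Because the faults described by the researchers appear rarely and only under certain conditions, a real screening tool would need to repeat such checks many times, across different workloads and operating conditions, and pin each run to a specific core.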
Computer engineers are divided on how to meet the challenge. A popular response is the demand for new types of software that proactively monitor hardware errors and allow system operators to remove hardware when it begins to degrade. This has created an opportunity for new startups offering software that monitors the health of underlying chips in data centers.
One such operation is TidalScale, a Los Gatos, Calif., company that makes specialized software for businesses trying to minimize hardware failures. Its managing director, Gary Smerdon, suggested that TidalScale and others faced a daunting challenge.
“It will be a bit like changing an engine while a plane is still flying,” he said.
©2019 New York Times News Service