When Does Science Self-Correct? Lessons from a Replication Crisis in Early 20th Century Chemistry
Guest post by Paul von Hippel (Univ. of Texas)
The following is a guest post by Paul von Hippel at the University of Texas at Austin:
******
Science is self-correcting—or so we are told. But in truth it can be very hard to expunge errors from the scientific record. In 2015, a massive effort showed that 60 percent of the findings published in top psychology journals could not be replicated. This was distressing news, but it led to several healthy reforms in experimental psychology, where a growing number of journals now insist that investigators state their hypotheses in advance, ensure that their sample size is adequate, and publish their data and code.
You might also imagine that the credibility of the non-replicable findings took a hit. But it didn’t, not really. In 2022, seven years after the poor replicability of many findings was revealed, the findings that had failed to replicate were still getting cited just as much as the findings that had replicated successfully. And when they were cited, the fact that they had failed to replicate was rarely mentioned.
This pattern has been demonstrated several times in psychology and economics. Once a finding gains influence, it continues to have influence even after it fails to replicate. The original finding keeps getting cited, and the replication study is all but ignored. The finding that replications don’t correct the record is unfortunately a finding that replicates pretty well.
Social scientists facing systemic problems in their home fields sometimes look longingly at their neighbors in the natural sciences. Natural scientists make mistakes too, and have even been known to falsify results, but if the mistakes and falsifications are consequential, they are often exposed and corrected in fairly short order.
One of the best-known examples is Fleischmann and Pons’ claim to have achieved “cold fusion” in 1989. Although the cold fusion episode is often viewed as an embarrassment in physics, it is actually a great example of a community correcting the scientific record quickly. Within 6 months of Fleischmann and Pons’ announcement, suspicious inconsistencies had been pointed out in their findings, and several other labs had reported failures to replicate. Finally, the Department of Energy announced that it would not support further work on cold fusion, and except for a handful of true believers the field turned its attention elsewhere.
I recently came across a similar episode from more than 100 years ago. The characters are of course different, but the first chapters are very much like those of the cold fusion episode. The later chapters, though, go in an unexpected direction, and show a Nobel laureate spending years pursuing nonreplicable results.
I’ll draw some lessons after I tell the story, which relies heavily on several books and articles, especially Dan Charles’ 2005 book Master Mind and Thomas Hager’s 2008 book The Alchemy of Air.
The Alchemy of Air
In an 1898 presidential address to the British Association for the Advancement of Science, the physicist William Crookes announced that humanity was approaching the brink of starvation. By 1930, he estimated, the world’s population would exceed the food supply that could be extracted from all the world’s arable land. Famine would inevitably follow. This being turn-of-the-century England, Crookes also suggested that famine would threaten the dominance of “the great Caucasian race,” since for some reason he believed that the supply of wheat on which “civilized mankind” depended would be more impacted than the supply of rice, corn, or millet that were more important to “other races, vastly superior to us in numbers, but differing widely in material and intellectual progress.”
Chemistry, Crookes argued, suggested a way to avoid that grim fate. Famine was only inevitable if the soil continued to produce crops at the rates that were typical at the end of the 19th century. But the soil could produce many more crops, and produce them more sustainably, if it was enriched with massive amounts of fixed nitrogen. Natural nitrogen fertilizers, such as guano or saltpeter, were dwindling and might be gone by the 1930s. But 78 percent of air was nitrogen, even if it was not in a form that crops could use.
Crookes charged chemists with figuring out a way to “fix” the nitrogen in air—to react the N2 in the atmosphere with hydrogen from water vapor and produce the ammonia NH3 which was already known to be the key component of natural fertilizers. Solving the problem of making synthetic fertilizer would be a scientific triumph and a boon to humanity. The chemist who solved it would write their name in the history books. They would have shown how to make “bread from air.”
It did not escape notice that whatever chemist succeeded in making bread from air would become fabulously wealthy. And it did not escape notice that, in addition to making fertilizer, ammonia could be used to make explosives. The military applications of synthetic ammonia seemed especially important in Germany, which correctly anticipated that in the event of a war with Great Britain, the British navy could impose a blockade that would cut off Germany’s imports of both food and fertilizer.
Although the basic chemical reaction seemed straightforward, making it actually happen was not easy. It would require pressures and temperatures that had never been achieved in a laboratory. And it would require just the right catalyst.
In 1900, just two years after Crookes’ address, the German chemist Wilhelm Ostwald announced that he had synthesized ammonia. Ostwald was just the kind of person that other chemists imagined could make bread from air. He was well established and widely regarded as one of the “fathers” of physical chemistry. He was already considered a strong candidate for a new prize funded by the estate of the recently deceased chemist Alfred Nobel. Synthesizing ammonia would make Ostwald a shoo-in. He applied for a patent and offered to sell his process for a million marks to the German company BASF.
There was only one problem. BASF couldn’t replicate Ostwald’s results. They put a junior chemist named Carl Bosch on the problem, but when he tried Ostwald’s process Bosch couldn’t produce a meaningful amount of ammonia. At best, the process would produce a couple of drops of ammonia and then stop. Which made no sense because the supply of nitrogen in the air was practically unlimited.
Bosch’s managers sent him back to the bench, and eventually he figured out what was going wrong. Ostwald had used iron as a catalyst, and the iron that Ostwald used, like the iron that Bosch used, was sometimes contaminated with a small amount of iron nitride. The nitrogen in the ammonia was coming from the iron nitride—not from the nitrogen in air. And since there was very little iron nitride, the process would never produce a meaningful amount of ammonia.
BASF wrote to Ostwald that they could not license his process after all. He wasn’t exactly a gentleman about it. “When you entrust a task to a newly hired, inexperienced, know-nothing chemist,” Ostwald wrote to BASF managers, “then naturally nothing would come of it.”
But Bosch was right. No one at BASF could make Ostwald’s process work, and when they brought Ostwald in he couldn’t make it work either. He withdrew his patent application and the race to make bread from air continued.
Nine years later, the problem of synthesizing ammonia was actually solved, but it was solved by a less obvious person. The person to show how to make bread from air was Fritz Haber, a 40-year-old chemist who worked at a respectable but not terribly prestigious university. Haber was Jewish and on nobody’s short list for the Nobel Prize.
Bosch was able to replicate Haber’s experiment, and spent the next few years scaling the process up. Bosch led a team of chemists and engineers who built large reaction chambers that could tolerate the required temperatures and pressures.
Meanwhile Ostwald sued BASF in collaboration with a rival company called Hoechst. Their claim was not that Haber’s process was invalid, but that it was not novel and its patents were invalid. If Ostwald and Hoechst had won, they would have been able to get into the ammonia business, too. But BASF won the suit, in part by bribing one of the key witnesses, an erstwhile rival of Haber’s named Walther Nernst. In 1913 BASF started running the world’s first ammonia synthesis factory, which produced a promising amount of ammonia in 1913 and 1914.
But in 1914 Germany entered World War One, Britain blockaded German ports, and all BASF’s ammonia was diverted from fertilizer to explosives, so that Germany could sustain its war effort even as its population starved. The repurposing of his life-giving work to deadly purposes upset Bosch so much that he went on a drinking binge.
Nevertheless, after the War, the spread of the Haber-Bosch process and related developments led to massive increases in food production, which in the developing world are called the “Green Revolution.” It has been estimated that more than half the people now on earth simply would not be here if it weren’t for synthetic nitrogen fertilizers. Many of us owe our lives to them.
Why Did Chemistry Correct Chemists’ Errors?
The story so far is very much to the credit of turn-of-the-century chemistry. A senior chemist overreached and made a mistake, as everyone eventually does, but a junior chemist corrected him, and the path was cleared for one of chemistry’s greatest contributions to human thriving (and human misery).
But why did science self-correct so efficiently in this case? None of the characters in this story was a saint. Crookes was a racist (as were many English gentlemen at the time), and Ostwald could be a self-interested jerk. Haber and Bosch came off well in this story, but their later actions showed their dark side. Haber led Germany’s poison-gas program during and after World War One. Bosch eventually led IG Farben, the German chemical conglomerate that produced the Zyklon B gas used to kill Jews (including some of Haber’s relatives) in Nazi concentration camps. Bosch’s personal feelings were anti-Nazi and anti-war, but that didn’t matter because he never took a stand.
So if early twentieth-century chemists weren’t better human beings than modern psychologists and economists, why did they do a better job at rooting out nonreplicable results?
One reason, I think, was that chemistry had important practical uses. For modern social scientists, eminence is largely a social construction. It is measured by grants, publications, prizes, TED talks, and appointments at prestigious universities. That was also true in early twentieth-century chemistry, but if it was done correctly, a breakthrough in chemistry would reach beyond one’s fellow chemists. It could lead to real, practical triumphs, like making bread from air.
And that’s why mistakes had to be corrected. BASF fully recognized that Ostwald would be annoyed by criticism of his work. But they couldn’t tiptoe around it, because they were trying to make ammonia from water and air. If Ostwald’s work couldn’t help them do that, then they couldn’t get into the fertilizer and explosives business. They couldn’t make bread from air. And they couldn’t pay Ostwald royalties. If the work wasn’t right, it was useless to everyone, including Ostwald.
That’s also the reason why the cold-fusion findings got corrected so quickly. If Fleischmann and Pons’ results were right, they could have led to an important new energy source. While it might have had unforeseen hazards, cold fusion would have made all other energy sources obsolete, solved global warming, and ended the need to buy oil from dangerous authoritarian countries. That’s why Fleischmann and Pons called a press conference and made the cover of the New York Times, and it’s why other scientists pounced on their result.
My friends in the natural sciences tell me that their fields do have some non-replicable results, but they’re typically in backwaters where only one lab has the necessary equipment and other labs aren’t particularly interested. If a finding looks to have practical applications, then nonreplicable results get exposed much more quickly.
Usually. But not always.
Fritz Haber’s Quest for Fool’s Gold
Two years after World War One, Haber won the Nobel Prize (Bosch would share one in 1931) and the Entente powers decided not to try Haber as a war criminal for his work on poison gas.
Now Fritz Haber began to look for other ways to serve his country. Germany’s new shortage was money. Under the Treaty of Versailles, Germany was charged with paying 132 billion gold marks in reparations, money it simply did not have. The burden brought hyperinflation and made it impossible for Germany to develop its economy and stabilize its experiment in democracy.
Haber started working on a potential solution: a chemical process that would extract gold from the sea. Ocean water contained trace quantities of gold, and Haber thought he could come up with an economical way to precipitate it out. It was an audacious plan, and if anyone else had suggested it, they might have been laughed at. But Haber was the scientist who had made bread from air. Other scientists joined Haber’s effort.
Amazingly, Haber’s lab did come up with several ways to draw gold from sea water—and one of those methods would have been cost-effective if, as an 1878 article suggested, there were 65 micrograms of gold per liter of sea water. On the basis of that estimate, Haber’s financial backers built him a lab in an ocean liner, but once Haber got out to sea, in 1923, he found that the concentration of gold was more than 10,000 times lower than he expected. (Modern estimates suggest that the concentration is lower still.) The result that Haber had been relying on was non-replicable.
There are two lessons here. First, in early 20th century chemistry, as in 21st century psychology and economics, you could not trust everything you read in the scientific literature. Haber had believed published estimates of the gold in sea water, and he had paid dearly. Instead of spending years developing chemical processes that would only work if the published estimates were accurate, he should have started by confirming those estimates. Careful scientists take nothing for granted; they check everything themselves.
Second, no scientist is so eminent that he cannot make mistakes. Ostwald was the most eminent chemist in Germany when he let contamination fool him into thinking he had synthesized ammonia. Haber had become the most eminent chemist in Germany when he failed to check basic measurements of the gold in sea water.
Not only does eminence not guarantee accuracy, it can actually breed the hubris that leads to error. There are many examples of scientists who once did amazing work becoming sloppy, jumping to conclusions, or having so many people work for them that they can’t keep track of what’s going on in their own labs.
What Will It Take for the Social Sciences to Self-Correct?
What would it take for fields like economics and social psychology to self-correct as chemistry and physics (sometimes) do?
A crucial feature that leads a field to become self-correcting is the potential for important technological applications. As long as scientists are only writing articles, allegations of error may be difficult to resolve. But once a field has advanced to the point where it can claim to make bread from air, or draw gold from the sea, it more quickly becomes clear whether the claim can hold water.
Now, work in psychology and economics sometimes aspires to offer practical benefits as well. Economists make recommendations regarding fiscal, labor, and educational policy, and the economists on the Federal Reserve Board control certain interest rates. Psychologists, for their part, offer advice and test interventions designed to change attitudes, improve mental health and academic achievement, and effect social change.
Of course, these economic and psychological interventions often fail to achieve their stated aims. But it’s always possible to explain those failures after the fact, often citing unpredictable contextual factors that worked against the intervention’s success.
And that highlights the second reason why the natural sciences are more self-correcting. Experiments in the natural sciences are closely controlled. A chemist cannot say that the success of a reaction depends on dozens of unspecified contextual factors, many of which cannot be foreseen. Instead, a chemist is expected to specify the exact temperature, pressure, and catalyst required to make a reaction occur. If another chemist cannot replicate the reaction at the same temperature and pressure, with the same catalyst, then somebody has made a mistake. It might be the original researcher; it might be the researcher conducting the replication. But there is a mistake somewhere, and somebody ought to figure it out. In a mature science, things don’t just happen for no particular reason.
In the social sciences, by contrast, scholars can be remarkably accepting of contradictory findings. In education research, for example, there are findings suggesting that achievement gaps between rich and poor children have been growing for 50 years—and other findings suggesting that they have barely changed. There are findings suggesting that achievement gaps grow faster during summer than during school—and other findings suggesting they don’t. Even results of randomized controlled trials often fail to replicate. This bothers me, but it doesn’t bother everyone, and many scholars either limit their attention to the results currently in front of them, or pick and choose from contradictory results according to which one best fits their general world view or policy preferences.
Of course, this state of affairs doesn’t provide a very firm foundation for real scientific progress. Both psychology and economics are starting to adopt practices which, when followed, produce findings that can more often be replicated across different labs and research teams. But a large number of questionable findings are still in the literature, and many of them are still getting cited at high rates. Many fields will continue to spin their wheels until they can tighten their focus to a small subset of replicable results, observed under controlled conditions, with major, practical applications in the real world.