Large Chemical Database Investigates Hundreds of Suspicious Crystal Structures

The CCDC says 992 entries in its crystallography database are under review. Credit: Patrick McCabe/Alamy

The Cambridge Crystallographic Data Centre (CCDC), a go-to resource for chemists seeking information on crystal structures, is reviewing nearly 1,000 database entries after a research-integrity sleuth reported that the underlying scientific papers might have come from paper mills – companies that sell fake scientific papers to researchers who need publications for their CVs.

The CCDC database has never seen so many entries flagged as suspicious. Scientists who use it in their daily research say they are shocked by the scale of the alleged fraud.

“It creates the possibility of people wasting their time looking at materials that were never made,” says Randall Snurr, a chemical engineer at Northwestern University in Evanston, Illinois. He was surprised that so many papers slipped through the system.

The CCDC says 992 entries are potentially affected, but that these represent a “very small amount of the total”. It is unusual for so many investigations into the underlying research to be under way at the same time, says Sophie Bryant, marketing manager at the CCDC in Cambridge, UK.

Crystal collection

The CCDC has been collecting data on the crystal structures of small organic and metallo-organic molecules since 1965 and currently lists more than one million structures. Its subscription database is accessible online and through a desktop application, and is an important resource for chemists and biochemists, who use it to study bonds and the geometry of molecular structures and interactions. Many journals in the field of crystallography require researchers to deposit their structural data with the CCDC.

The CCDC's Cambridge Structural Database removes entries from time to time when individual articles are retracted from the literature. In 2010, it removed 70 entries because of falsified data. But fewer than 300 structures have been retracted in its lifetime.

The latest expressions of concern were prompted by a preprint on the Research Square repository that flagged more than 800 questionable papers published in crystallography and exotic chemistry journals between 2015 and 2022¹. Many of the papers propose medical applications for metal-organic frameworks, a class of sponge-like materials that contain both metal ions and organic molecules. The author of the preprint, retired psychology researcher David Bimler, noted that images and spectra purporting to characterize organic or metallo-organic structures were repeated across these papers. The articles also bear the hallmarks of having been produced by a paper mill, including recycled and irrelevant references, suspicious e-mail addresses and odd turns of phrase that appear repeatedly in the methods sections of apparently unconnected articles.

CCDC staff members perform tests to review submitted data and manually verify each entry. Some were already wary of a handful of structures on Bimler’s list before the preprint was released. When they saw his analysis, they launched an investigation. This involves double-checking all reported structures, including tests to identify unusual bond lengths and angles, and looking for evidence that the underlying structures or data might be based on existing database entries.
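The article does not describe how the CCDC implements these checks. As a rough illustration only, the two kinds of test mentioned above (spotting unusual bond lengths, and spotting structures that closely resemble existing entries) might be sketched as follows. Every name, reference value and tolerance here is a hypothetical assumption made for the example, not CCDC code or CCDC data.

```python
from dataclasses import dataclass

# Hypothetical reference statistics: bond type -> (mean, standard deviation),
# in angstroms. Illustrative values only, not taken from the CCDC.
REFERENCE = {
    "C-C": (1.53, 0.02),
    "C=O": (1.21, 0.02),
}


@dataclass
class Bond:
    kind: str      # bond type label, e.g. "C-C"
    length: float  # bond length in angstroms


def flag_unusual_bonds(bonds, reference=REFERENCE, z_cutoff=3.0):
    """Return bonds lying more than z_cutoff standard deviations from the reference mean."""
    flagged = []
    for bond in bonds:
        if bond.kind not in reference:
            continue  # no reference data for this bond type, so it cannot be judged
        mean, std = reference[bond.kind]
        z_score = abs(bond.length - mean) / std
        if z_score > z_cutoff:
            flagged.append(bond)
    return flagged


def cells_match(cell_a, cell_b, rel_tol=0.005):
    """True if two unit cells (a, b, c, alpha, beta, gamma) agree on every parameter within rel_tol."""
    return all(
        abs(x - y) <= rel_tol * max(abs(x), abs(y))
        for x, y in zip(cell_a, cell_b)
    )


bonds = [Bond("C-C", 1.54), Bond("C-C", 1.80), Bond("N-N", 1.00)]
suspicious = flag_unusual_bonds(bonds)
near_duplicate = cells_match(
    (10.00, 12.0, 15.0, 90.0, 90.0, 90.0),
    (10.01, 12.0, 15.0, 90.0, 90.0, 90.0),
)
```

A real screen would need far more than this: validated reference distributions, bond angles as well as lengths, and a comparison of atomic coordinates rather than unit cells alone. The sketch only shows the general shape of such a check.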

So far, the CCDC has issued expressions of concern for the 992 entries covered by the preprint and removed 12 structures described in 9 articles that have been retracted. Because the investigation is ongoing, 277 of the flagged structures were left out of the latest offline data update in mid-June. However, these structures are still available in the online database. If the journals' editors decide to retract an article, its data will also be removed. “We mirror the literature,” says Bryant.

Ongoing investigations

The affected journals are also investigating the preprint's allegations. Chris Graf, director of research integrity at Springer Nature, says the publisher is investigating concerns about 157 papers published in at least 5 of its journals, but that it is too early to draw conclusions. “If these concerns were proven to be true, they would very much support the need for the publishing industry to work collaboratively to solve the paper-mill problem,” says Graf. (Nature's news team is editorially independent of Springer Nature, its publisher.)

The publisher Wiley says it has already retracted two articles from the Journal of the Chinese Chemical Society, both listed in Bimler’s preprint. It is investigating another 50 articles published in at least 15 journals – more than the 25 articles reported in the preprint. Elsevier, which published 88 of the articles in at least 4 journals, says it is investigating and will report its findings in due course. A spokesperson for Taylor & Francis, which published 204 of the articles in at least 2 journals, says it is actively investigating a large number of articles in those journals. “Our investigation began with an internal audit we conducted in 2021 and was expanded following concerns raised to us by researchers,” the spokesperson said.

“That’s probably a red flag,” says Suzanna Ward, the CCDC’s database manager. “We are fortunate in crystallography that there is a standard file format universally used to publish data. It’s not like the data is buried in PDFs.”

Chemist Filipe Almeida Paz at the University of Aveiro in Portugal is shocked by the situation. “It’s not in our DNA as scientists to try to fool others,” he says. Researchers use the CCDC database to inform drug discovery, he adds, and incorrect data will eventually waste their time, so it is important that the database doesn’t get “contaminated with misinformation”, even if only a small proportion of structures are affected.

Jon Clardy, a biochemist at Harvard Medical School in Boston, Massachusetts, says the potentially problematic data is only a small part of the database. “I’m not overly concerned that it undermines trust in the CCDC.” He adds that the paper mill has been “extraordinarily smart” to combine metal-organic frameworks with medical applications such as cancer immunotherapy, because the chances of people having studied both topics in depth are slim.

The CCDC is now considering whether its processes need to change. Discussions are continuing about developing more automated screening to help scientists on the CCDC's integrity team to identify and prioritize what to look at more closely, Ward says.