AI annotations added to the SureChEMBL database increase the depth of data made available to users
Summary
AI-generated annotations integrated into SureChEMBL have broadened the range of information made available to its users.
The AI integration, developed through a collaboration with EMBL-EBI’s Europe PMC team, improves the platform’s interpretation of complex patent data.
Future developments, including an open API and expanded patent coverage, aim to enhance the accessibility and capabilities of the platform.
SureChEMBL is a tool for accessing and interpreting chemical and compound information extracted from text and images found within patent documents. Patents are a type of intellectual property that give their owners the legal right to exclude others from making, using, or selling an invention for a limited period of time.
SureChEMBL operates using an entirely automated framework, extracting and annotating chemical structures from patents. SureChEMBL also identifies biological proteins, targets, and diseases, and so adds layers of value to the raw patent data.
How is SureChEMBL used?
Discover and analyse chemical structures: SureChEMBL allows users to identify and analyse chemical structures mentioned in patents. This feature is crucial for researchers developing new drugs or looking to understand existing medications.
Intellectual property (IP) research: SureChEMBL serves as a resource for IP research, enabling users to track the patenting activities in specific therapeutic areas or for particular compounds. This information is vital for pharmaceutical companies, helping them make informed decisions about their research and development strategies.
Commercial sector inquiries: Users from the commercial sector engage with SureChEMBL for specialised queries, including detailed analysis of drug targets, competitive intelligence, and exploration of new therapeutic areas.
Over the last few years, SureChEMBL has undergone a significant overhaul with funding from the Wellcome Trust. This rewriting of the SureChEMBL system has streamlined it for better compatibility with EMBL-EBI’s computing infrastructure and paved the way for the integration of new capabilities, including AI-generated annotations.
“SureChEMBL has evolved into a modern, sophisticated, and sustainable platform enriched with valuable annotations,” said Andrew Leach, Head of Chemical Biology and Head of Industry Partnerships at EMBL-EBI. “This transition has broadened the depth of data available to researchers and helps to streamline their exploration of complex patent documents. Looking ahead, we are committed to including more diverse and comprehensive information to aid drug discovery and development. This will be so much easier on the new SureChEMBL technical platform.”
AI annotation update
The recent integration of AI techniques for identifying and extracting annotation information has enabled the SureChEMBL platform to process more of the data in patents, moving beyond just chemical structures to include a broader range of biomedical entities. These include disease targets, proteins and genes linking chemicals or compounds to a specific disease. The platform also contains information on intellectual property considerations associated with the chemical or compound.
The development and integration of these biomedical annotations in SureChEMBL was made possible through a collaborative effort using an in-house AI algorithm developed within Europe PMC, EMBL-EBI’s database of life science literature. This algorithm, initially developed for annotating scientific publications, was adapted and refined to suit the unique requirements of patent texts used by SureChEMBL.
“Collaboration between different groups at EMBL-EBI is something that drove this project forward,” said Nicolas Bosc, Patent Data Scientist and Informatics Expert at EMBL-EBI. “Taking the expertise from the Europe PMC team and building upon it to address the unique challenge of patent data, we were able to enhance the capabilities of SureChEMBL. This is a fantastic example of the collaborations that are possible within the organisation.”
Future developments
The SureChEMBL team plans to continue its collaboration with Europe PMC to refine the AI model that generates the annotations, so that patent texts are interpreted more precisely. SureChEMBL is also set to further develop its open API to expand user accessibility and interaction with the platform.
Additionally, the SureChEMBL team plans to expand the database’s patent repository, including documents from major global patent offices. An upcoming update will also add an advanced filtering protocol to help users identify pharmacologically relevant chemical structures.
“This pipeline will facilitate the identification of pharmacological compounds exemplifying patent claims and their differentiation from other chemicals also mentioned in the patent document with little relevance to drug discovery efforts,” said Maria Falaguera, Postdoctoral Fellow at EMBL-EBI.
“We made a strategic decision to redevelop SureChEMBL from scratch,” said Tevfik Kizilören, Software Engineer at EMBL-EBI. “This was necessary to simplify the system and embrace the latest technologies. By doing so, we’ve significantly enhanced the platform’s efficiency and user experience and paved the way for many more updates in the future.”
Funding
The development of SureChEMBL was funded by the Wellcome Trust: Grant 223716/Z/21/Z, “SureChEMBL: open patent data for all”.