Currently there are over 100 million known chemical compounds stored in various datasets, which still represents only a tiny fraction of the entire ‘drug-like’ chemical space (a staggering 10⁶⁰ molecules!¹). Cheminformatics, positioned at the intersection of chemistry and data science, has made it possible to efficiently store and manage these vast amounts of chemical information. Building on this, machine learning algorithms are designed to use structural representations of molecules together with their associated metadata to model molecular properties and behaviours, enabling extensive applications in drug discovery, chemical synthesis, virtual screening, and more.²
It is well established that the performance of machine learning algorithms is heavily dependent on the quantity and quality of the data available. Publicly available databases such as PubChem and ChEMBL collect chemical information from multiple sources, including published literature. ChEMBL is manually curated, whereas PubChem acts primarily as a data aggregator. The quality of the deposited chemical data can therefore suffer: for example, experimental conditions may be missing, and errors can lead to a lack of reproducibility.
Additionally, there is a heavy bias towards positive results, as failed experiments are (for good reason) often not published. However, the data from such failed experiments may be vital in giving a model the balanced view it needs to make accurate predictions.³
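The effect of this bias can be illustrated with a minimal sketch. The labels below are made up purely for illustration (95 reported successes and 5 failures), and the "model" is a deliberately trivial one that always predicts success. It is not taken from any real dataset or study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 1 = successful experiment, 0 = failed experiment.
y_true = [1] * 95 + [0] * 5
# A trivial model that has only ever "seen" successes predicts success every time.
y_pred = [1] * len(y_true)

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

On plain accuracy the trivial model looks excellent, yet it never identifies a single failure. Without enough negative examples, a model has little opportunity to learn what failure looks like, and a headline metric may not reveal the problem.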
Although it may be more natural for chemists to think of molecules as skeletal formulae, an alternative representation must be used to enable computational processing. The Simplified Molecular Input Line Entry System (SMILES) is a text-based linear representation of molecular structures. These simple and compact strings are both intuitive for chemical understanding and quick to process computationally, and they are therefore widely adopted in machine learning.
Crucially, however, SMILES representations are not unique. A SMILES string is generated by following bonds from one atom to its neighbours, and depending on which atom is chosen as the starting point and which is visited next, a single molecule may end up with several different but equally correct representations. As a result, different toolkits implement different SMILES conventions, which acts as a barrier to interoperability and the exchange of information. To avoid confusion when training models, canonical SMILES algorithms and open specifications (such as OpenSMILES and IUPAC SMILES+) have been proposed with the aim of wide adoption.
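The idea behind canonicalisation can be sketched in a few lines. The code below is a toy illustration only: it handles a small, assumed subset of SMILES (the atoms C, N and O, implicit single bonds, and parenthesised branches, with no rings, charges or stereochemistry), and the function names `parse_smiles` and `canonical_form` are our own. It is not the algorithm used by OpenSMILES, IUPAC SMILES+ or any particular toolkit.

```python
def parse_smiles(smiles):
    """Parse a toy SMILES subset into a list of atoms and a list of bonds."""
    atoms, bonds = [], []
    stack, prev = [], None
    for ch in smiles:
        if ch == "(":            # start of a branch: remember where to return to
            stack.append(prev)
        elif ch == ")":          # end of a branch: go back to the branching atom
            prev = stack.pop()
        elif ch in "CNO":
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return atoms, bonds


def canonical_form(smiles):
    """Return one fixed string per (acyclic) molecule, however it was written."""
    atoms, bonds = parse_smiles(smiles)
    if not atoms:
        return ""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    def encode(node, parent):
        # Encode each branch, then sort the branches so their order is fixed.
        branches = sorted(encode(n, node) for n in neighbours[node] if n != parent)
        return atoms[node] + "".join(f"({b})" for b in branches)

    # Try every atom as the starting point and keep the smallest encoding.
    return min(encode(root, None) for root in range(len(atoms)))


# Three different ways of writing ethanol give the same canonical form...
print(canonical_form("CCO") == canonical_form("OCC") == canonical_form("C(O)C"))  # True
# ...whereas dimethyl ether, a different molecule, does not.
print(canonical_form("CCO") == canonical_form("COC"))  # False
```

The output of this toy function is an internal fingerprint rather than a conventional canonical SMILES string. Real toolkits such as RDKit and Open Babel must also deal with rings, aromaticity, charges and stereochemistry, which is part of the reason their canonical outputs can differ from one another.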
The four FAIR principles – Findability, Accessibility, Interoperability, and Reusability – were established in 2016 as a framework to tackle the growing challenges in accessing and using data. These principles can be applied across different fields and data outputs to make data usable by both humans and machines. This strategy offers a way to increase data standardisation and interoperability, supporting the growing integration of AI and machine learning.
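As a rough illustration of how these principles might translate into practice, the sketch below checks a dataset record for one piece of metadata per principle. The class `DatasetRecord`, its field names and the example values are all hypothetical and chosen only for illustration. They do not come from the FAIR principles themselves or from any particular database schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasetRecord:
    # Hypothetical fields, one per FAIR principle.
    identifier: Optional[str] = None   # Findable: a persistent identifier, e.g. a DOI
    access_url: Optional[str] = None   # Accessible: where the data can be retrieved
    data_format: Optional[str] = None  # Interoperable: an open, documented format
    licence: Optional[str] = None      # Reusable: a clear usage licence


def missing_fields(record):
    """List the metadata fields that are empty or absent."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Made-up example record with a placeholder identifier.
record = DatasetRecord(identifier="doi:10.0000/example", data_format="SMILES")
print(missing_fields(record))  # ['access_url', 'licence']
```

A check like this only confirms that the metadata is present. Whether a dataset is genuinely findable, accessible, interoperable and reusable depends on the quality of that metadata and on the infrastructure behind it.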
This collaborative article was led by Anna Kukushkina, who interned in our Chemistry team last year.
Matthew is a Partner and Patent Attorney at Mewburn Ellis. Working primarily in the chemical and materials science fields, he has significant experience of the intricacies of the EPO. Matthew advises and assists clients with all stages of drafting, prosecution, opposition and appeal before the EPO. Many of his clients are Japanese and Chinese businesses that are seeking European patent protection. These include multinational corporations in the fields of high-performance ceramics and carbon fibre technologies, as well as pharmaceutical and cosmetic companies. Matthew also works with several research institutions and university technology transfer departments across Europe.
Email: matthew.smith@mewburn.com