From SMILES to Frowns: Challenges in Cheminformatics

Matthew Smith

Currently, more than 100 million known chemical compounds are stored in various datasets, yet this still represents only a tiny fraction of the entire ‘drug-like’ chemical space (a staggering 10^60 molecules!) [1]. Cheminformatics, positioned at the intersection of chemistry and data science, has made it possible to efficiently store and manage these vast amounts of chemical information. Building on this, machine learning algorithms are designed to use structural representations of molecules together with their associated metadata to model molecular properties and behaviours, enabling extensive applications in drug discovery, chemical synthesis, virtual screening, and more [2].

Data quality and availability

It is well established that the performance of machine learning algorithms depends heavily on the quantity and quality of the data available. Publicly available databases such as PubChem and ChEMBL collect chemical information from multiple sources, including published literature; ChEMBL is manually curated, whereas PubChem is primarily a data aggregator. The quality of the deposited data can therefore suffer: for example, missing experimental conditions and errors in the records can lead to a lack of reproducibility.

Additionally, there is a heavy bias towards positive results, as failed experiments are (for good reason) often not published. However, the data from such failed experiments may be vital for providing the balanced view needed for accurate predictions [3].

Data representation

Although it may be more natural for chemists to think of molecules as skeletal formulae, an alternative representation must be used to enable computational processing. Simplified Molecular Input Line Entry System (SMILES) is a text-based linear representation of molecular structures; ethanol, for example, can be written as CCO. These simple and compact strings are both intuitive for chemical understanding and quick to process computationally, so they are widely adopted in machine learning.

However, and crucially, SMILES representations are not unique. A SMILES string is generated by following bonds from one atom to its neighbours, and depending on the starting atom and on which atom is visited next, a single molecule may end up with several different, equally correct representations. As a result, toolkits implement different SMILES conventions, which acts as a barrier to interoperability and the exchange of information. To avoid confusion when training models, canonical SMILES algorithms, which always produce the same string for a given molecule, and common specifications (such as OpenSMILES and IUPAC SMILES+) have been proposed with the intention of wide adoption.
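The non-uniqueness described above can be illustrated with a minimal sketch. It is a toy, not a real SMILES writer: it assumes an acyclic molecule with single bonds only and implicit hydrogens, and the function names (`enumerate_smiles`, `toy_canonical`) are hypothetical. The "canonical" form here is simply the alphabetically smallest string found by brute force; production toolkits instead rank atoms using graph invariants, so their canonical output will generally differ from this one.

```python
from itertools import permutations


def adjacency(atoms, bonds):
    """Build a neighbour list for each atom index from a list of bonds."""
    adj = {index: [] for index in atoms}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def subtree_smiles(atoms, adj, atom, parent=None):
    """Return every string obtainable by walking the tree rooted at `atom`.

    Assumes an acyclic graph with single bonds only (toy simplification).
    """
    children = [n for n in adj[atom] if n != parent]
    results = set()
    # Each ordering of the neighbours is a different, equally valid traversal.
    for order in permutations(children):
        partials = {atoms[atom]}
        for position, child in enumerate(order):
            is_last = position == len(order) - 1
            subs = subtree_smiles(atoms, adj, child, atom)
            # All but the last neighbour are written as parenthesised branches.
            partials = {
                p + (s if is_last else "(" + s + ")")
                for p in partials
                for s in subs
            }
        results |= partials
    return results


def enumerate_smiles(atoms, bonds):
    """Collect the strings produced from every possible starting atom."""
    adj = adjacency(atoms, bonds)
    found = set()
    for start in atoms:
        found |= subtree_smiles(atoms, adj, start)
    return found


def toy_canonical(atoms, bonds):
    """Toy canonical form: the alphabetically smallest of all variants."""
    return min(enumerate_smiles(atoms, bonds))


# Ethanol (heavy atoms only), written with two different atom numberings.
ETHANOL_ATOMS = {0: "C", 1: "C", 2: "O"}
ETHANOL_BONDS = [(0, 1), (1, 2)]
RENUMBERED_ATOMS = {0: "O", 1: "C", 2: "C"}
RENUMBERED_BONDS = [(0, 1), (1, 2)]

print(sorted(enumerate_smiles(ETHANOL_ATOMS, ETHANOL_BONDS)))
# ['C(C)O', 'C(O)C', 'CCO', 'OCC']
print(toy_canonical(ETHANOL_ATOMS, ETHANOL_BONDS)
      == toy_canonical(RENUMBERED_ATOMS, RENUMBERED_BONDS))
# True
```

Even a three-atom molecule yields four valid strings, which is why a model trained on unnormalised SMILES may treat one compound as several. Choosing a single representative per molecule, however it is chosen, removes that ambiguity as long as every dataset uses the same rule.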

FAIR data

The four FAIR principles – Findability, Accessibility, Interoperability, and Reusability – were established in 2016 as a framework to tackle the growing challenges in accessing and using data. These principles can be applied across different fields and data outputs to support both human and machine accessibility. Adopting them offers a way to increase data standardisation and interoperability, supporting the growing integration of AI and machine learning.

 




  1. J.-L. Reymond. Accounts of Chemical Research, 2015, 48(3), 722-730. DOI: 10.1021/ar500432k
  2. A. Raslan, S.A. Raslan, E.M. Shehata, A.S. Mahmoud and N.A. Sabri. Pharmaceuticals (Basel), 2023, 16(7), 1050. DOI: 10.3390/ph16071050
  3. B. Joshi. Artificial Intelligence Review, 2023, 56, 9089-9114. DOI: 10.1007/s10462-023-10391-w

 

This collaborative article was led by Anna Kukushkina, who interned in our Chemistry team last year. 
