Abstract
Introduction: Over the last decade there has been a proliferation of chemistry databases on the internet. We have gone from a point in the early 2000s, when little small-molecule and bioactivity data was available online, to today, when web-based, publicly accessible databases can contain tens of millions of molecules. Many of these databases hold over a million bioactivity data points [such as half-maximal inhibitory concentration (IC50) or inhibitor binding affinity (Ki)], and data are shared and propagated between them (e.g. ChEMBL, PubChem, and other databases mirror some of each other’s data). The evolution of these bioactivity databases has followed different routes. Examples include collections of molecules with one or more related bioactivities, collections of multiple curated sets of data, user-deposited datasets, and combinations of these. Databases were once mainly used to look up structures and properties, and as they expanded to include experimental and predicted properties their function shifted. Increasingly, these databases are used to predict potential targets based on the structure similarity principle, chemical–biological read-across and toxicology profiling, and in many ways they have evolved into portals for different data types. In parallel, commercial databases, such as Chemical Abstracts (CAS) SciFinder and GVKBio, have focused on curated chemical structures, some of which have been quantitatively assessed for their complementarity with public databases and found to contain unique content. We have previously discussed the potential for divergence of these commercial systems from the public databases. The focus of this chapter will be on freely accessible databases such as BindingDB, PubChem, ChEMBL, the International Union of Basic and Clinical Pharmacology (IUPHAR)/BPS Guide to PHARMACOLOGY (GtoPdb) and public data in the Collaborative Drug Discovery (CDD) Vault.
We also refer readers to earlier publications and discussions regarding public domain compound databases that have covered other systems and content. There have been numerous comparisons of public bioactivity databases at the level of molecules or targets that have suggested complementarity, and we do not intend to add any more from this perspective [18]. There have also been efforts to combine different bioactivity databases. For example, Confederated Annotated Research Libraries of Small Molecule Biological Activity Data (CARLSBAD) brought together ChEMBL, GtoPdb, PubChem, WOMBAT [19] and PDSP in order to facilitate chemical biology research and data mining. CARLSBAD is available only to academics and non-commercial researchers, and even then one must apply in order to access it, which would likely deter the casual user. Another example of such a combined database is the ChemProt database [22], which is made up of data from seven databases and contains 1.7 million compounds and 7.8 million bioactivity measurements. It uses Daylight-like fingerprints and can perform similarity ensemble approach (SEA) calculations. A naive Bayesian classifier was used with the Daylight-like and Morgan fingerprints to build models for 850 proteins. Performance was described for only one model, that for hERG, although models for 143 other proteins were also suggested to outperform SEA [22].
Original language | English |
---|---|
Title of host publication | High Throughput Screening Methods: Evolution and Refinement |
Editors | Joshua Bittker, Nathan Ross |
Publisher | Royal Society of Chemistry |
Pages | 344 |
Number of pages | 396 |
ISBN (Electronic) | 978-1-78262-979-5 |
ISBN (Print) | 978-1-78262-471-4 |
Publication status | Published - 8 Dec 2016 |