Traditional Machine Learning and Big Data Analytics in Virtual Screening: a Comparative Study



The volume of data that must be processed keeps growing, so high-performance computing (HPC) and big data analytics are increasingly required. In the same context, drug discovery research has reached a point where it has no choice but to use HPC and big data processing systems to meet its targets in a reasonable time. Virtual screening (VS) is one of the most computationally intensive tasks in this field; it plays an essential role in designing new drugs and needs to be made more efficient. In this research, machine learning and big data analytics are studied in the context of virtual screening, using ligand-based and structure-based approaches to rank the molecules in a database as active against a specific target protein. Both ligand-based and structure-based approaches have been implemented with machine learning algorithms, including random forests, naïve Bayesian classifiers, support vector machines, neural networks and decision trees, as well as deep learning techniques.

This paper offers a summary of the use of machine learning and big data analytics frameworks in virtual screening. It summarizes current progress in applying traditional machine learning methods and big data analytics frameworks to large datasets across multiple nodes, and it compares the performance of machine learning methods and big data analytics frameworks for ligand-based screening. It also discusses the feasibility of improving the performance of machine learning methods on large libraries for the various classification problems in virtual screening. Finally, challenges and solutions for virtual screening datasets in machine learning and big data analytics are discussed.

Keywords: drug discovery, virtual screening, descriptors, machine learning, big data analytics frameworks.


Unprecedented growth in biomedical data has been observed in recent years. The ability to analyze a large portion of this data will provide many opportunities that will in turn affect the future of health care. Older storage and processing technologies are no longer sufficient to meet the demand, so computing technologies must scale to handle the huge volume of data. The main difficulty in managing these data is the speed at which they are generated: data generation is much faster than the computing resources available for analysis can keep up with. The acquisition and processing of big data is useful for researchers in various fields, such as drug discovery, which involves searching for and identifying drugs. Drug discovery is an extremely long, complicated and expensive process. It may take 12 to 15 years and cost more than $1 billion, with a risk of failure. Thousands of molecules must be processed and filtered in order to limit the number of candidates.

However, the decision-making process is constrained by the growth in data generation, which poses a challenge to the development of data-based solutions that can effectively and accurately enhance decision-making in drug discovery. High-Throughput Screening (HTS) is an experimental tool widely used in the drug discovery process, in which large molecular libraries are screened in fully automated environments. However, the steadily increasing size of these libraries and the high cost of HTS mean that only a small number of hits is generated, with high false-positive and false-negative rates. As an alternative, Virtual Screening (VS) is a pre-screening technique that is cheaper and faster than HTS. It has been applied successfully to reduce the number of compounds that need to be screened and to generate new drug leads.

There are two virtual screening strategies: ligand-based virtual screening (LBVS) and structure-based virtual screening (SBVS). LBVS depends on existing information about the ligands. It utilizes knowledge about a set of ligands that are known to be active for the given drug target. This type of virtual screening relies on data mining and big data analytics. A binary classifier can be trained on a small subset of the ligands and then used to classify very large sets of ligands into two classes: dockable ligands and non-dockable ones. SBVS, on the other hand, relies on docking; however, it requires 3D structural information about the protein.
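The idea of training a binary classifier on a small labelled subset and then applying it to a larger library can be sketched as follows. This is a minimal illustration only, assuming ligands are already encoded as binary feature vectors; the Bernoulli naive Bayes model, the 4-bit toy vectors and the function names `train_bernoulli_nb` and `predict` are assumptions made for the example, not the method of any cited study.

```python
import math

def train_bernoulli_nb(fingerprints, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model on binary feature vectors.

    fingerprints: list of equal-length lists of 0/1 bits
    labels: list of 1 (active) or 0 (inactive); both classes must be present
    alpha: Laplace smoothing constant (an assumed default)
    """
    n_bits = len(fingerprints[0])
    model = {}
    for cls in (0, 1):
        rows = [fp for fp, y in zip(fingerprints, labels) if y == cls]
        # Smoothed probability that each bit is set within this class.
        bit_prob = [
            (sum(fp[i] for fp in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(n_bits)
        ]
        log_prior = math.log(len(rows) / len(labels))
        model[cls] = (log_prior, bit_prob)
    return model

def predict(model, fp):
    """Return 1 (active) or 0 (inactive) for a single feature vector."""
    scores = {}
    for cls, (log_prior, bit_prob) in model.items():
        score = log_prior
        for bit, p in zip(fp, bit_prob):
            score += math.log(p if bit else 1.0 - p)
        scores[cls] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: four labelled ligands, two unlabelled ones.
train_fps = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
train_labels = [1, 1, 0, 0]
model = train_bernoulli_nb(train_fps, train_labels)

library = [[1, 1, 0, 1], [0, 0, 1, 0]]
predictions = [predict(model, fp) for fp in library]
print(predictions)
```

On this toy data the first library entry is assigned to the active class and the second to the inactive class. In practice the training subset would contain far more ligands and far longer feature vectors.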

Machine Learning (ML), a branch of artificial intelligence, plays a vital role in VS for drug discovery. Adaptive machine learning algorithms can be used to model Quantitative Structure-Activity Relationships (QSAR) and to show, with high accuracy, how chemical modifications might influence biological behaviour. In chemoinformatics, machine learning is usually used to classify molecules as active or inactive against a specific target or against multiple targets. In addition, machine learning algorithms can be used in molecular docking methods. Traditional machine learning methods can be applied to datasets of small molecules and still give the best results. Nevertheless, because of the increasing number of molecules in the libraries and the unstructured data formats, traditional methods cannot achieve all the set objectives. Big data analytics techniques are therefore a promising solution for VS.

The number of compounds in chemical libraries has increased significantly. Libraries of molecules contain on the order of 10^10 records (the "volume" of the data), and this value continues to rise. These libraries can be stored in different formats such as SDF or SMILES files (the "variety" of the data), and the data are generated at a high rate (the "velocity"). Volume, variety and velocity are the characteristics that define big data. Applying machine learning techniques to massive libraries (big data) in the VS process is computationally expensive. There is a rising need to develop sophisticated frameworks for efficient big data analytics. Apache Hadoop and Apache Spark are two of the most popular platforms used for big data analytics. Recently, research in drug discovery has used big data analytics techniques to achieve its objectives with high performance and in a short period of time.
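The processing pattern behind platforms such as Hadoop and Spark can be illustrated with a small single-machine sketch: the library is split into partitions, each partition is processed independently (the "map" step), and the partial results are merged (the "reduce" step). The code below is plain Python and does not use the Hadoop or Spark APIs; the record layout, the score threshold and the function names are hypothetical.

```python
from functools import reduce

def partition(records, n_parts):
    """Split the library into n_parts roughly equal slices."""
    return [records[i::n_parts] for i in range(n_parts)]

def map_partition(part, threshold):
    """Process one partition independently: keep molecules scoring above threshold."""
    hits = [name for name, score in part if score >= threshold]
    return {"screened": len(part), "hits": hits}

def combine(a, b):
    """Merge two partial results into one."""
    return {"screened": a["screened"] + b["screened"], "hits": a["hits"] + b["hits"]}

# Hypothetical (name, predicted activity score) records.
library = [("mol_a", 0.91), ("mol_b", 0.12), ("mol_c", 0.77),
           ("mol_d", 0.40), ("mol_e", 0.85)]

partials = [map_partition(p, threshold=0.75) for p in partition(library, 2)]
summary = reduce(combine, partials)
print(summary["screened"], sorted(summary["hits"]))
```

Because each partition is handled without reference to the others, the map step is the part a cluster framework would distribute across nodes; here it simply runs in a loop.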

This article provides an overview of machine learning and big data analytics for drug discovery. First, it covers appropriate feature generation and provides a comprehensive analysis of the chemical descriptors and properties that contribute to virtual screening. Second, it compares data mining methods that are able to handle large-scale virtual screening for the biological activity of compounds. Third, it presents the experimental results of recent articles that used traditional machine learning, deep learning and big data frameworks. Finally, it highlights the problems of virtual screening and classification algorithms, especially on big datasets, as well as future research directions, and it suggests recommendations for overcoming these problems.

This paper is organized as follows: Section 2 explains the VS process in drug discovery. Section 3 reviews the literature on machine learning and big data in VS. Section 4 studies machine learning algorithms as a solution for VS. Experimental results and performance evaluations are explained in Section 5. Open problems and future areas of research are introduced in Section 6, and conclusions are presented in Section 7.


VS strategies can be classified into structure-based (SBVS) and ligand-based (LBVS). SBVS strategies model the physical interactions between the compound and a protein target. In recent times, SBVS has used machine learning algorithms and docking software to calculate the activity of molecules against a specific target. The difficulty with these techniques is that they require the three-dimensional (3D) structure of the protein, which is not available for all proteins. If the 3D structure of the target is not available, LBVS is used. This approach is generally referred to as similarity searching. Most classification methods use a small dataset of ligands that are known to be active or inactive against a specific protein as training data and then predict the activity of unknown ligands, as shown in Figure 1.

Virtual Screening

Figure 1: Taxonomy for 3D structure of virtual screening methods

Libraries of virtual screening datasets are usually saved, in their chemical form, as SDF or SMILES files. The training dataset plays an important role in the classification of molecules as active or inactive. To predict these molecules with high performance, labelled and unlabelled data are available in many public libraries. To date, numerous databanks have been established that focus on drug-like or non-drug-like ligands and on molecular docking to predict high- and low-scoring molecules. Many libraries of ligands and proteins are used in virtual screening, as shown in Table 1.

Table 1: Compound library datasets

  • Compound library | No. of compounds | Link
  • ChemSpider | ~62 million
  • ChEMBL | ~2 million
  • PubChem | ~92 million
  • Enamine | ~1.7 million
  • ChemBridge | ~1 million
  • DrugBank | ~9591 (D) | http://www.drugbank.
  • STITCH | ~500 000
  • BindingDB | ~635 301
  • Binding MOAD | ~12 440
  • KEGG | ~18 211
  • ZINC | ~35 million
  • eMolecules | ~7 million
  • ChemDiv | ~1.6 million

A preprocessing step that computes descriptors is used to convert the chemical library format to a CSV file. Two types of descriptors describe the features of molecules: chemical descriptors and chemical fingerprints. They are presented in the following subsections, and software for computing them is listed in Table 2.

Table 2: Descriptor software

  • Software | Description | Web site
  • PaDEL | Calculates 1876 molecular descriptors and 12 types of fingerprints
  • RDKit | Open-source collection of cheminformatics and machine learning software written in C++ and Python
  • Dragon | Computes 5,270 molecular descriptors using various theoretical approaches
  • Open Babel | Computes 4 types of fingerprints: FP2, FP3, FP4 and MACCS
  • Chemistry Development Kit (CDK) | Open-source Java library that computes descriptors and fingerprints of molecules

Chemical Descriptors

Chemical descriptors are numerical attributes extracted from chemical structures for ligand data processing, compound diversity analysis and compound activity prediction [19]. Descriptors may be one-, two-, three- or four-dimensional (1D, 2D, 3D or 4D). One-dimensional descriptors are scalars that capture information such as atom counts, bond counts, molecular weight, sums of atomic properties or fragment counts. Two-dimensional descriptors are topological: they describe the bonds between the atoms of a compound and features such as the number of atomic bonds, substructure information and the molecular connectivity index. Three-dimensional descriptors are geometrical and cover 3D auto-correlation and surface properties, while 4D descriptors combine 3D coordinates with conformations. Several descriptor software packages can be used to prepare a dataset as input to machine learning algorithms, including PaDEL, Dragon, RDKit and the CDK toolkit, as shown in Table 2. These packages generate high-dimensional features that depend on the software used and the type of descriptor.
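As a small illustration of 1D descriptors and of the CSV output produced by the preprocessing step, the sketch below computes atom counts and molecular weight from element counts and writes them as one CSV row. It is a simplified stand-in for what packages such as PaDEL or RDKit compute from full structures; the input format (a dictionary of element counts), the column names and the function name are assumptions for the example.

```python
import csv
import io

# Standard atomic masses in g/mol, rounded to three decimals.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def one_d_descriptors(name, atom_counts):
    """Compute a few scalar (1D) descriptors from element counts."""
    weight = sum(ATOMIC_MASS[el] * n for el, n in atom_counts.items())
    return {
        "name": name,
        "molecular_weight": round(weight, 3),
        "atom_count": sum(atom_counts.values()),
        "heavy_atom_count": sum(n for el, n in atom_counts.items() if el != "H"),
    }

# Aspirin has the molecular formula C9H8O4.
row = one_d_descriptors("aspirin", {"C": 9, "H": 8, "O": 4})

# Write the descriptor row as CSV text (in memory here, a file in practice).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Each molecule becomes one row and each descriptor one column, which is the tabular form that most machine learning libraries expect as input.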

Chemical Fingerprints

Chemical fingerprints are high-dimensional feature vectors whose elements are chemical descriptor values; they are normally used in chemometric analysis and similarity-based VS applications. Molecular ACCess System (MACCS) substructure fingerprints are 2D binary fingerprints (0 and 1) in which each bit represents the presence or absence of a specific substructure key; the commonly used public key set has 166 bits.
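Similarity-based screening with binary fingerprints is commonly done with the Tanimoto coefficient, the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below ranks a small library against a query ligand; the fingerprints are represented as sets of on-bit indices, and the ligand names and bit positions are invented for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

# Hypothetical query (a known active) and library fingerprints.
query = {1, 4, 7, 9}
library = {
    "lig_1": {1, 4, 7, 8},
    "lig_2": {2, 3, 5},
    "lig_3": {1, 4, 7, 9, 12},
}

# Rank library ligands from most to least similar to the query.
scores = {name: tanimoto(query, fp) for name, fp in library.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Here `lig_3` shares four of five distinct bits with the query (0.8), `lig_1` three of five (0.6) and `lig_2` none, so the ranking is `lig_3`, `lig_1`, `lig_2`.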

In the next section, we describe the taxonomy of virtual screening solutions. It is divided into traditional machine learning-based solutions and big data analytics framework-based solutions. Traditional machine learning-based solutions include algorithms such as support vector machines, decision trees, naïve Bayes, random forests, k-nearest neighbours, artificial neural networks and deep learning. Some of these methods are used in LBVS and others in SBVS. Big data analytics framework-based solutions employ Spark, Hadoop or MapReduce, as shown in Figure 2.

Figure 2: Taxonomy of virtual screening solutions

Literature Review

Numerous studies in the literature explore the performance of machine learning methods for virtual screening. Some of these articles used traditional machine learning methods and others used machine learning on big data platforms.

Traditional machine learning-based solutions

Virtual screening is typically employed to eliminate unwanted molecules (i.e. inactive or toxic ones) from a compound library. Machine learning methods can be used for VS by analyzing the structural features of molecules with known activity or inactivity. Support vector machine (SVM), Wrapper Method (WM) and Subset Selection (SS) have been used to classify ligands as drug-like or non-drug-like. PaDEL was used as the descriptor software to calculate the attributes of the ligands. The accuracy rates of the models were 88% for SVM, 90% for WM and 91% for SS. In another study, the authors used SVM, random forest and deep learning algorithms to classify compounds as active or inactive and compared the resulting classifications. Fingerprints were used as descriptors, and an accuracy of 94% was reported. Another work presented a model of three ensemble classifiers combined by voting; the model consists of a decision tree (DT), a multi-layer perceptron and a support vector machine. A further study used machine learning algorithms including decision trees, SVM, random forest (RF), naive Bayes, rotation forest and k-nearest neighbour (KNN), and compared the classification accuracy of all these algorithms. These models were used to predict drug-likeness with a tree-based ensemble classifier.

In another study, a data mining procedure was built as a workflow with two main stages: data visualization using the t-SNE method, followed by classification with six different algorithms, namely DT, RF, support vector machines, artificial neural network, KNN and linear regression (LG), and one new method, AL Boost. These were used to build the classification models, distinguish drugs from non-drugs and generate three major classes of drug compounds. Deep learning has also been used to predict potential drug targets and new drug indications from large-scale chemogenomics data while improving the performance of drug-target interaction prediction; this model reached an accuracy rate of 98%. In the same context, another author introduced a deep learning-based approach that can identify the ligands of targets. The experiments showed that deep learning outperforms two widely used methods, AutoDock Vina and Smina, achieving higher Area Under Curve (AUC) values.

Other authors developed a deep learning model, DeepSynergy, and predicted, with high accuracy, dozens of synergies of drug combinations for cancer cell lines. They showed that DeepSynergy provides the best predictions in a cross-validation setup with external test sets, outperforming other methods by a wide margin. Prioritizing drug combinations based on DeepSynergy predictions, at an AUC of 0.90, can already reduce the time and costs spent on experimental verification. Finally, deep learning and machine learning methods have been studied to determine the impact of these modern methods on predicting new compounds against specific targets. Target prediction and target similarity help to examine potential compounds based on already approved drugs.
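The studies above report accuracy and AUC. As a reminder of what the two metrics measure, the sketch below computes both for a toy set of predictions. Accuracy is the fraction of correct labels at a chosen threshold, while ROC AUC is the probability that a randomly chosen active is scored higher than a randomly chosen inactive. The labels, scores and the 0.5 threshold are made up for the example and do not come from any of the cited studies.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison of active and inactive scores (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical ground truth (1 = active) and classifier scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(round(acc, 3), round(auc, 3))
```

The pairwise AUC computation is quadratic in the number of molecules, which is fine for a toy example; rank-based formulas are preferable for large screening libraries.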

Big data analytics framework-based solutions

Models for big data analytics are used in many fields, such as social media, education, wireless networks and drug discovery (especially in virtual screening). Big data refers to a large amount of data that is hard to handle with traditional methods. Four V's describe the characteristics of big data: volume, velocity, variety and veracity. Several articles deal with VS on big data platforms such as Apache Spark and Apache Hadoop. In one study, deep learning algorithms running on Apache Spark with H2O were applied to a big dataset of up to one million ligands to classify compounds as drug-like or non-drug-like, with a high accuracy rate. Drug discovery datasets are highly dimensional.

Deep learning can handle thousands of dimensions without the need for feature selection; however, it needs very large training datasets. In another study, the authors used five models designed for big data platforms such as Hadoop/MapReduce: random forest, decision trees, naive Bayes, multilayer perceptron and a logistic regression classifier. The authors then selected three algorithms (random forest, naive Bayes and multilayer perceptron) to build an ensemble classifier and calculate the activity of ligands. A further work presented a new approach based on the ensemble learning paradigm and Apache Spark to enhance the performance of large-scale virtual screening. It combined three classifiers, namely multi-layer perceptron, decision trees and SVM, into an ensemble learning model in which the predictions were aggregated by majority vote.
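Majority-vote aggregation, as used by the ensemble approaches above, can be sketched in a few lines. The per-classifier predictions below are invented, and the three lists merely stand in for the outputs of a multi-layer perceptron, a decision tree and an SVM; no actual models are trained here.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine label lists from several classifiers by per-molecule majority vote.

    predictions: list of equal-length label lists, one list per classifier.
    """
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical outputs of three classifiers for four ligands (1 = active).
mlp_pred = [1, 0, 1, 0]
tree_pred = [1, 1, 0, 0]
svm_pred = [0, 1, 1, 0]

ensemble = majority_vote([mlp_pred, tree_pred, svm_pred])
print(ensemble)
```

With an odd number of binary classifiers there are no ties, which is one practical reason for combining three models. The first three ligands each receive two active votes and the last receives none, so the ensemble labels them 1, 1, 1 and 0.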


Cite this page

Traditional Machine Learning and Big Data Analytics in Virtual Screening: A Comparative Study. (2021, Oct 12). Retrieved July 13, 2024 , from
