Ezqsar: An R Package for Developing QSAR Models Directly From Structures

Background: Quantitative Structure Activity Relationship (QSAR) is a difficult computational chemistry approach for beginner scientists and a time consuming one for even more experienced researchers. Method and Materials: Ezqsar which is introduced here addresses both the issues. It considers important steps to have a reliable QSAR model. Besides calculation of descriptors using CDK library, highly correlated descriptors are removed, a provided data set is divided to train and test sets, descriptors are selected by a statistical method, statistical parameter for the model are presented and applicability domain is investigated. Results: Finally, the model can be applied to predict the activities for an extra set of molecules for a purpose of either lead optimization or virtual screening. The performance is demonstrated by an example. Conclusion: The R package, ezqsar, is freely available via https://github.com/shamsaraj/ezqsar, and it runs on Linux and MS-Windows.


BACKGROUND
Quantitative Structure Activity Relationship (QSAR) is an old but still applicable method for the various branches of chemistry. It attempts to find a model that can predict the biological activity of chemical compounds using their structural features [1]. There are several modules available in commercial tools [2] (SYBYL, MOE and Schrodinger suite) that make QSAR studies simpler than ever.

MATERIALS AND METHODS
There are also some open source tools available to facilitate the computation of QSAR models. Some of them are collection of individual tools; each performs specific step of a QSAR procedure such as modeling [3] statistical validation [4,5] and generation of descriptors [6 -7]. Some of them present some tools that can be used to generate CoMFA-like 3D QSAR models such as, Open3DQSAR or PyCoMFA [8]. CORAL is a freeware that is developed to build QSAR PLS models using a specific set of descriptors so-called SMILES notation based optimal descriptors [9 -10] QSARINS is another standalone freeware that can build QSAR MLR with various functions including data processing and partitioning, validation of the model, prediction of a new compound activity and determination of applicability [11]. However, it does not have any built-in descriptor generation ability. Among the available open source tools, the most similar one to ezqsar is another R-package that is called camb [12]. Some of the features are similar between those packages. They are intended to be used mostly by beginners. They provide a single function that can do the entire job starting with a data set and ends with a complete QSAR model that includes descriptor generation, data processing, modeling, internal and external validity assessment, presentation of a QSAR model with the ability to predict the activities of new structures. The advantages of ezqsar over camb are the applicability domain functionality, easier use and interpretability of the developed model. However, the camb package currently has more training options than ezqsar such as machine learning methods. The advanced users are advised to use caret package to model pregenerated descriptors. It wrapped up QSAR tools in several functions and user can tune several parameters for each one, but ezqsar could be used by advanced users to provide an easy and precise look on the modelability of a data set and prediction of the activity of a test set with estimation of applicability domain. They could also use the power of R scripting to enrich the output plots or automate model building for large number of data set or develop several models for a single data set and discover the best model among them. Finally, the selected descriptors by ezqsar for a reliable QSAR model could be used for mechanistic interpretation of the model.
Here, an open source R (R: a language and environment for statistical computing; R Foundation for Statistical Computing, Vienna, Austria; URL http://www.R-project.org/.) package is introduced that could develop Multiple Linear Regression (MLR) QSAR models from 2D or 3D structures and their corresponding activities via a single line command. Then, in an iterative process, the QSAR model could be refined by modifications in, for example, the number of selected descriptors and test set selection. Data set selection and preparation is a first step and the most important step in a QSAR study. The structures should be checked if they are retrieved from public databases. Data set should have the least possible experimental uncertainty. Experimental uncertainty arise from systematic error or in case of single point activity determination. The detection of possible experimental uncertainty in the data set may be detected by statistical methods but it is not easy [13 -15]. The descriptor generation in ezqsar is done using CDK library [16]. It computes 2D and 3D descriptors. They are classified into five groups "topological", "geometrical" "hybrid", "constitutional", and "electronic". If the input structures are in 3D coordinates, the 3D descriptors will be calculated otherwise, the value for the 3D descriptors would be zero. A list of the all-275 CDK descriptors is presented in Table (1). Now, ezqsar only accepts SDF file as an input and the structures should be verified beforehand regarding specific chirality, protonation state and tautomeric form. MLR is a simple, reproducible and easy interpretable method used usually in QSAR. An MLR equation can be expressed generally like the following [17,18]: In the above expression, Y is the dependent variable (here is activity), X 1 , X 2 , ..., X n are independent variables (descriptors) present in the model with the corresponding regression coefficients a 1 , a 2 , ..., a n , respectively, and a is the constant term of the model. The quality of a MLR model is evaluated using the number of metrics as described below [17,18]. ezqsar_f function uses Leave-one-out (LOO) cross-validation method for cross-validation: The determination coefficient (R 2 ) and predictive R 2 (R 2 pred ) are defined in the following manners: The applicability domain is defined as a space constructed (structural, chemical, etc.) by the model that plays a crucial role for estimating the applicability of the model for predicting the activity of new compounds [17]. In other word, the prediction of a property of interest (here, activity) of a new compound using a QSAR model is applicable only if it is within the applicability domain of the QSAR model. The final model could be used to predict the activity of an extra set of compounds that are within the applicability domain of the model. This is checked by finding possible outliers of the standardized values of the descriptors (S ki ) which are present in the model as well as by calculation of Tanimoto similarity index of a compound fingerprint to the each compound in the train set.
Standardized descriptor values can be calculated as follows, that is a simplified form suggested by Roy K. et al. [21]

(6)
Where, K is the number of compounds, i is the number of descriptors, S Ki is the standardized descriptor i for compound K (from the training or test set, X Ki is the original descriptor i for compound K (from the training or test set), is the mean value of the descriptor X i for the training set compounds only, σ xi is the standard deviation of the descriptor X i for the training set compounds only. The above calculation is meant for all descriptor values present in the Tanimoto similarity indexes are calculated as it follows [22 -24]: where N ab is the number of common "1" bits that occur in both fingerprint a and fingerprint b, N a is the number of "1" bits in fingerprint a, N b is the number of "1" bits in fingerprint b. In ezqsar_f, Tanimoto index is computed by fingerprint package.

IMPLEMENTATION
The codes were implemented in a package available at github and can be installed and loaded by the following commands in R environment: install.packages("devtools ") devtools::install_github("shamsaraj/ezqsar") library ("ezqsar") #This will load the package It depends on four packages: caret, fingerprint, leaps and rcdk.

RESULTS AND DISCUSSION
The performance of the ezqsar package in an example data set that is provided by the package after installation was demonstrated in the study. The data set Table (1 and Fig. 1) was taken from a study [25] and a HQSAR model was already available to them [26]. It has a single function called "ezqsar_f". Like other R functions, one can get help for the function by: Fig. (1). General structure for the dataset. help("ezqsar_f") An overview of the ezqsar_f workflow is demonstrated in Fig. (2). All of the molecules were collected in a single SDF file. Activities were provided in a separate csv file rank ordered same as the SDF file. The activities were expressed as pIC 50, however, they also can be expressed as IC 50 . Alternatively, the activities also can be provided in a SDF file with the header of "IC 50 ". Data set contains 24 MMP-12 inhibitors. There is an extra test set that includes three other MMP-12 inhibitors with similar structures. The example can be run by following the command after loading the installed package, ezqsar: example ("ezqsar_f") This will run following the script: file1<-system.file ("extdata", "molecules-3d.sdf", package = "ezqsar") file2<-system.file ("extdata", "IC50.csv", package = "ezqsar") file3<-system.file ("extdata", "newset-3d.sdf", package = "ezqsar") model<-ezqsar_f (SDFfile=file1, activityfile=file2, newdataset=file3, testset=c (4,6,12,22) attributes (model) Fig. (2). The workflow of ezqsar_f function from ezqsar package.

Tanimoto (a, b) = ( + )−
The input files were stored as files 1, 2 and 3 by executing lines 1, 2 and 3. The model is developed by executing line 4. This runs the function with default parameters and stores the results as an object called, model. Highly correlated descriptors (correlation coefficient over a defined threshold) can be removed from the descriptor table before further processing. By default, this threshold is 1 (correlated descriptors are not removed), test set ratio is 20% and the selection is based on the activities; descriptors are selected by forward selecting method before final MLR model development.
The calculated descriptors are reported as an csv file named "descriptors.csv". A plot ("Plot-MLR.pdf ") will also be available in the working directory (can be changed by setwd () in R) after a successful run regarding the observed vs predicted activity values for both train and test sets using the developed model Fig. (3). Available outputs are shown after executing line 6 and some of the most important ones were printed out by remaining lines of the example (lines 6 to 13). The statistical metrics for the example data set are shown in (Table 2).  Fig. (3). Plot of observed versus predicted activities obtained from model1 for training (blue circles) and test (red triangles) sets.