Note

As a prerequisite, this notebook requires the WikidataQueryServiceR package, which you can find here: https://github.com/bearloga/WikidataQueryServiceR

Project Synopsis

Before running a statistical analysis, we began with Wikidata data that needed to be completed, referenced, inspected, processed and cleaned. After this came the analysis and modelling and, finally, visualisation of the results. The process taught us that there are many choices to be made about methods, parameters and variables. I aimed to document this process in enough detail to make the project reproducible, which includes justifying many of the modelling choices. The following interactive notebook explains the steps I performed.

Computational Chemistry Background

In 1947 Harry Wiener showed a correlation between the physicochemical properties of organic compounds and their chemical structure, building a model that linked structural features of compounds to their boiling points.

Brief Description of the Assignment

In this assignment we use SPARQL to query physicochemical properties from Wikidata for a series of chemical compounds, then build a Partial Least Squares (PLS) model in an R Markdown notebook with RStudio. To use the model correctly for prediction, the dataset must be split into a training and a test set (using 10% or 20% as the test set and the remainder for training), and cross-validation must be applied (e.g. leave-one-out or leave-10%-out). An elbow plot helps to estimate the optimal number of latent variables to use in the predictive model. A scatterplot of the model predictions for the dependent variable, boiling point, on the test set versus the experimental (observed) values serves as a visualisation of the accuracy of the predictions, quantified as an RMSEP in kelvin. This assignment is part of weeks 4 and 5 of Scientific Programming in the Master’s in Systems Biology at Maastricht University (https://www.maastrichtuniversity.nl/education/master/master-systems-biology).

Useful sources

STEP 0 SETUP

Set up libraries required for the assignment

library("rcdk")
## Loading required package: rcdklibs
## Loading required package: rJava
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
library(WikidataQueryServiceR)

STEP 1 COLLECT and PREPARE DATA

We use the R package WikidataQueryServiceR, keeping the standard SPARQL query syntax. The first query checks for any entries still waiting for their boiling point statements to be entered in Wikidata.

Wikidata objects and properties used here

  • P31: instance of
  • P279: subclass of
  • Q41581: alkane (an acyclic saturated hydrocarbon)
  • P2102: boiling point
  • P233: canonical SMILES

Notes on SMILES

A SMILES string is a text string that represents the compound structure in 2D, i.e. it specifies elements, connectivity and bond order, but not the 3D position of each atom. Examples of SMILES: ethanol CCO; ethylene C=C; acetylene C#C; 2-propanol CC(O)C.

Now query Wikidata to find out whether there are many molecules with missing boiling point data.

sparql_query_missingbp <- 'SELECT DISTINCT ?alkane ?alkaneLabel ?SMILES ?boilingpoint
WHERE {
    ?alkane wdt:P31/wdt:P279* wd:Q41581 .
    ?alkane wdt:P233 ?SMILES .
    MINUS {?alkane wdt:P2102 ?boilingpoint .}
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en"}
}'
results_missingbp = query_wikidata(sparql_query_missingbp)
## 8 rows were returned by WDQS

While I was working on the assignment, the number of molecules missing their boiling point data in Wikidata fell from 11 to 8.

sparql_query <- 'SELECT DISTINCT ?alkane ?alkaneLabel ?boilingpoint 
WHERE {
    ?alkane wdt:P31/wdt:P279* wd:Q41581 .
    
    ?alkane wdt:P2102 ?boilingpoint .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en"}
}'
results_first = query_wikidata(sparql_query)
## 134 rows were returned by WDQS

First eyeball of the data: 131 unique (distinct) observations were returned initially (140 non-unique). Update, 2 October: there are now 134.

range(results_first$boilingpoint)
## [1] -44.00 994.15

In the first week of the project this returned a range of -44.00 to 994.15, units unknown. That range contains a suspiciously low value: in kelvin all values should be non-negative. Let's check the units of the boiling points and convert anything non-kelvin to kelvin.

Convert units

The following query is adapted from one provided by Egon, “set up to deal with the complex structure of statements with units in Wikidata”, i.e. it reaches down into the hierarchical structure of the Wikidata statements to extract the various units entered by different Wikidata users, along with a human-readable label for those units.

units_query <- 'SELECT DISTINCT ?alkane ?alkaneLabel ?SMILES ?boilingpoint ?bpUnit ?bpUnitLabel WHERE {
  ?alkane wdt:P31/wdt:P279* wd:Q41581 .
  ?alkane wdt:P233 ?SMILES ;
        p:P2102 [
          ps:P2102 ?boilingpoint ;
          psv:P2102/wikibase:quantityUnit  ?bpUnit
        ] .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}'
results_bp = query_wikidata(units_query)
## 134 rows were returned by WDQS
possible_units <- unique(results_bp$bpUnitLabel)
possible_units
## [1] "kelvin"            "degree Fahrenheit" "degree Celsius"

Having got the boiling points out of the Wikidata database, the problem is that some of them are in Celsius and others in Fahrenheit. Convert the values to kelvin and update the unit labels across the dataset.

# Convert Celsius values to kelvin, then relabel them
results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Celsius"] <- results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Celsius"] + 273.15
results_bp$bpUnitLabel[results_bp$bpUnitLabel == "degree Celsius"] <- "kelvin"
# Convert Fahrenheit values to kelvin, then relabel them
results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Fahrenheit"] <- (results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Fahrenheit"]-32)*5/9 + 273.15
results_bp$bpUnitLabel[results_bp$bpUnitLabel == "degree Fahrenheit"] <- "kelvin"
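The same conversions could be wrapped in a small helper function, shown below as a sketch (the function to_kelvin is a hypothetical helper, not used elsewhere in this notebook); it makes the temperature formulas explicit and reusable.

# Sketch of a reusable unit converter (hypothetical helper, not used elsewhere).
# Celsius:    K = C + 273.15
# Fahrenheit: K = (F - 32) * 5/9 + 273.15
to_kelvin <- function(value, unit) {
  switch(unit,
         "kelvin"            = value,
         "degree Celsius"    = value + 273.15,
         "degree Fahrenheit" = (value - 32) * 5/9 + 273.15,
         stop("unknown unit: ", unit))
}
to_kelvin(25, "degree Celsius")  # 298.15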

Quick check

unique(results_bp$bpUnitLabel)
## [1] "kelvin"
head(results_bp$boilingpoint)
## [1] 827.15 939.15 390.15 383.25 973.15 391.75

Eyeball the data

range(results_bp$boilingpoint)
## [1] 111.40 994.15
plot.default(results_bp$boilingpoint)

Check the distribution of the input data

hist(results_bp$boilingpoint)

The boiling points are not normally distributed. It will be important to check the distribution of the training set later to verify that it is not biased or skewed.

Chemistry beginner’s question: Does the length of the SMILES string correlate with the boiling point?

plot(nchar(results_bp$SMILES),results_bp$boilingpoint)

The correlation between the boiling point and the string length of the SMILES looks good!!

cor(nchar(results_bp$SMILES),results_bp$boilingpoint)
## [1] 0.9340854
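As a throwaway baseline (a sketch, not used in the later modelling), the same relationship can be expressed as a one-variable linear fit; for a single predictor the R-squared is simply the square of the correlation above.

# Minimal baseline sketch: boiling point regressed on SMILES string length.
baseline_lm <- lm(boilingpoint ~ nchar(SMILES), data = results_bp)
summary(baseline_lm)$r.squared  # equals cor()^2 for one predictor, roughly 0.87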

Collect variables which we expect to be informative of boiling point

Translate the data we have extracted, in the form of SMILES, into what we need from the CDK library in order to do the regression. For this we use the R package rcdk. This gives a collection of descriptors that can be used as variables; from these we will then generate latent variables with PLS.

Get Molecule Descriptors with rcdk

A key requirement for the predictive modelling of molecular properties and activities is molecular descriptors: numerical characterizations of the molecular structure. The CDK implements a variety of molecular descriptors, organised into categories, and rcdk allows us to access this cheminformatics functionality of the CDK from within R. There is a list of descriptors at https://cdk.github.io/cdk/1.5/docs/api/org/openscience/cdk/qsar/descriptors/molecular/package-tree.html and more about descriptors in the rcdk documentation: https://cran.r-project.org/web/packages/rcdk/vignettes/using-rcdk.html
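As a quick orientation (a sketch; the full set of descriptors is collected later in the notebook), the fully qualified descriptor class names can be listed per category with get.desc.names(), as in the rcdk vignette.

# Sketch: list some of the available descriptor class names for one category.
head(get.desc.names("constitutional"))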

Explore rcdk

The molecule objects (that result from load.molecules) are of class jobjRef (provided by the rJava package). As a result, they are pretty opaque to the user and are really meant to be processed using methods from the rcdk or rJava packages.

Another common way to obtain molecule objects is by parsing SMILES strings. The simplest way to do this is, for example:

smile <- 'c1ccccc1CC(=O)C(N)CC1CCCCOC1'
mol <- parse.smiles(smile)[[1]]

Usage is more efficient when multiple SMILES are supplied, since a single SMILES parser object is then used to parse all of them (from the rcdk tutorial; see useful sources). It is possible to evaluate all available descriptors in one go, or to evaluate individual descriptors. Each descriptor name (e.g. “org.openscience.cdk.qsar.descriptors.molecular.WHIMDescriptor”) is the fully qualified Java class name for the corresponding descriptor. These names can be supplied to eval.desc to evaluate one or more descriptors for one or more molecules.
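For example (a sketch using the example molecule parsed above; output not shown), a single descriptor can be evaluated by passing its fully qualified class name to eval.desc():

# Sketch: evaluate one descriptor for the example molecule parsed above.
eval.desc(parse.smiles(smile),
          'org.openscience.cdk.qsar.descriptors.molecular.HBondDonorCountDescriptor')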

Explore the descriptor categories

dc <- get.desc.categories()
dc
## [1] "hybrid"         "constitutional" "topological"    "electronic"    
## [5] "geometrical"

Choosing suitable variables for the predictive model

There are two opposing approaches to building the model. The first is purely data-driven, “throw the kitchen sink at it”: use all the available descriptors from the CDK (219 variables after cleaning), apply dimension reduction, and predict using the optimum number of latent variables (bp_model_2). The second is a more knowledge-based approach (bp_model_1): search the descriptor list for e.g. “Wiener”, “boiling point” or “complexity” and hand-pick likely important variables (this is the approach used in the rcdk vignette). Either way, the pls package projects the predictors onto latent components (in the spirit of principal component analysis), so our prediction will run on latent variables anyway.

Generate a set of descriptors for a very simple model

The list of descriptor names used here comes from the rcdk tutorial (https://cran.r-project.org/web/packages/rcdk/vignettes/using-rcdk.html), whose example is also concerned with predicting the boiling points of organic molecules.

mols <- parse.smiles(results_bp$SMILES)
descNames_limited <- c(
 'org.openscience.cdk.qsar.descriptors.molecular.KierHallSmartsDescriptor',
 'org.openscience.cdk.qsar.descriptors.molecular.APolDescriptor',
 'org.openscience.cdk.qsar.descriptors.molecular.HBondDonorCountDescriptor')
descriptors_limited <- eval.desc(mols, descNames_limited)

Note: this gives 81 variables across the list of molecules. Some may be NA; nevertheless, 81 variables from 3 descriptors is a lot more than I expected, prompting the question: how many variables would the complete set of around 50 descriptors give? (Answer: more than 200.)

anyNA(descriptors_limited)
## [1] FALSE

This confirms that, as mentioned in the tutorial, these descriptors were chosen so as not to return NAs. Nevertheless, the following code, preparatory to building a predictive model from the descriptor matrix, removes any columns with NAs and any constant columns.

#remove NAs
descriptors_limited <- descriptors_limited[, !apply(descriptors_limited, 2, function(x) any(is.na(x)) )]
#remove constant columns
descriptors_limited <- descriptors_limited[, !apply( descriptors_limited, 2, function(x) length(unique(x)) == 1 )] #no. of variables drops from 81 to 5 on running this

Add the boiling point to this descriptor matrix; this will enable us to use the formula syntax in plsr to build our model.

descriptor_matrix_1 <- cbind(descriptors_limited, results_bp$boilingpoint)
library(plyr)
descriptor_matrix_1 <- rename(descriptor_matrix_1,c("results_bp$boilingpoint" = "boilingpoint"))       #renames awkward column name (rename() is from the plyr package)
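A base-R alternative that avoids the plyr dependency (a sketch, not used in this notebook; the variable descriptor_matrix_1_alt is hypothetical) is to name the column directly when binding it:

# Sketch: build the matrix with the desired column name in one step (base R only).
descriptor_matrix_1_alt <- cbind(descriptors_limited, boilingpoint = results_bp$boilingpoint)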

STEP 2 MULTIVARIATE STATISTICS Partial Least Squares (PLS) with plsr

Split dataset into completely independent, randomly sampled Training and Test datasets

Make a random 80/20 split

number <- nrow(descriptor_matrix_1)
indexes <- c(1:number)
sample_indexes <- sample(indexes, size = (number*0.2))
bp_train <- descriptor_matrix_1[-sample_indexes,]
bp_test <- descriptor_matrix_1[sample_indexes,]
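One option worth noting (a sketch; the original run did not fix a seed): calling set.seed() before the sample() call above would make the 80/20 split, and hence the RMSEP values reported below, reproducible between runs. The seed value is arbitrary.

# Optional, not used in this run: fix the RNG seed before sampling so the split is reproducible.
set.seed(2019)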

Verify that the distribution of the training data is not skewed.

hist(bp_train$boilingpoint)

Check the distribution of the boiling points of the training dataset on a histogram. Satisfactory: the training set has a similar distribution to the full set of alkanes.

Make first model

There are 5 descriptor variables. The cross-validation and PLS results, displayed on an elbow plot, suggest that 1 latent variable is optimal for this model.

bp_model_1 <- plsr(boilingpoint ~. ,ncomp = 3, data = bp_train, validation = "CV")
summary(bp_model_1)
## Data:    X dimension: 108 5 
##  Y dimension: 108 1
## Fit method: kernelpls
## Number of components considered: 3
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps
## CV           261.4    90.73    91.08    89.92
## adjCV        261.4    90.63    90.92    88.56
## 
## TRAINING: % variance explained
##               1 comps  2 comps  3 comps
## X               99.96   100.00   100.00
## boilingpoint    88.27    88.45    90.11
plot(RMSEP(bp_model_1), legendpos = "topright")

What’s the predictive power of this very basic model with a single latent variable?

Predict on the test set

pred_bp_1 <- predict(bp_model_1, ncomp = 1, newdata = bp_test)
plot(pred_bp_1,bp_test$boilingpoint, abline(a= 0, b = 1))

RMSEP(bp_model_1, newdata = bp_test)
## (Intercept)      1 comps      2 comps      3 comps  
##      255.07        82.45        77.56        78.19

This is not a satisfactory model. With a single latent variable the RMSEP is somewhere around 80 to 100 kelvin, depending on the run, and it does not perform very well at predicting boiling points for compounds outside its training set. This choice of descriptors appears not to be sufficiently predictive.

Model 2: a data-driven model (bp_model_2)

Get all the descriptors possible from CDK

descNames <- unique(unlist(sapply(get.desc.categories(), get.desc.names))) #50 descriptors
dn_all <- eval.desc(mols, descNames) #286 variables

Remove NA-containing columns as before

anyNA(dn_all)
## [1] TRUE
#remove NAs
desc_all <- dn_all[, !apply(dn_all, 2, function(x) any(is.na(x)) )]
#results in 219 variables (before was 286 variables)
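For consistency with the limited descriptor matrix, the constant columns could also be dropped here; the sketch below stores the result under a new name (desc_all_noconst, a hypothetical variable) and is not applied in the analysis that follows.

# Optional sketch, not applied below: also drop constant columns, as was done
# for the limited descriptor matrix.
desc_all_noconst <- desc_all[, !apply(desc_all, 2, function(x) length(unique(x)) == 1)]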

Now create one big descriptor_matrix_2 containing the boiling point.

#Add boiling point to this to make the descriptor matrix
descriptor_matrix_2 <-  cbind(desc_all, results_bp$boilingpoint)
#note rename() needs library(plyr)
descriptor_matrix_2 <- rename(descriptor_matrix_2,c("results_bp$boilingpoint" = "boilingpoint"))       #renames awkward column name (rename() is from the plyr package)

Split into training and test sets, re-using the same sample_indexes as before so the two models are evaluated on the same test compounds.

bp_train_2 <- descriptor_matrix_2[-(sample_indexes),]
bp_test_2 <- descriptor_matrix_2[sample_indexes,]

Make and cross-validate the model using plsr

bp_model_2 <- plsr(boilingpoint ~., data = bp_train_2, ncomp = 12, validation = "CV" )
summary(bp_model_2)
## Data:    X dimension: 108 219 
##  Y dimension: 108 1
## Fit method: kernelpls
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           261.4    166.9    79.74    46.09    33.47    32.02    31.32
## adjCV        261.4    166.7    79.65    45.96    33.25    31.89    31.07
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       31.08    31.19    31.01     31.34     32.19     31.99
## adjCV    30.83    30.95    30.77     31.02     31.84     31.58
## 
## TRAINING: % variance explained
##               1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X               99.37   100.00   100.00   100.00   100.00   100.00
## boilingpoint    59.94    90.92    97.17    98.59    98.73    98.91
##               7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## X              100.00   100.00   100.00    100.00    100.00    100.00
## boilingpoint    98.95    98.96    98.97     99.01     99.01     99.04

Select the optimum number of latent variables

plot(RMSEP(bp_model_2), legendpos = "topright")

Based on the elbow plot, a model with 3 latent variables appears optimal, giving good performance without adding too much complexity. What is the predictive power of this data-driven model with 3 latent variables?
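As an alternative to picking the number of components off the elbow plot by eye, the pls package provides selectNcomp(), which applies a randomization test or a “one sigma” heuristic to the cross-validated RMSEP curve. A sketch (not used for the original choice of 3 components):

# Sketch: heuristic selection of the number of latent variables from the CV results.
selectNcomp(bp_model_2, method = "onesigma", plot = TRUE)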

STEP 3 VISUALISATION

Predict the boiling points of the test set and generate a scatterplot of predicted vs observed boiling points for the test dataset

pred_bp_2 <- predict(bp_model_2, ncomp = 3, newdata = bp_test_2)
plot(pred_bp_2,bp_test_2$boilingpoint, abline(a= 0, b = 1))

This model is more successful than bp_model_1, predicting reasonably well.

STEP 4 CONCLUSIONS

Report prediction errors

RMSEP(bp_model_2, newdata = bp_test_2)
## (Intercept)      1 comps      2 comps      3 comps      4 comps  
##      255.07       154.07        69.42        24.64        15.41  
##     5 comps      6 comps      7 comps      8 comps      9 comps  
##       16.98        18.79        13.40        15.14        14.53  
##    10 comps     11 comps     12 comps  
##       17.92        19.34        17.34

It is interesting to note that, based on the earlier elbow plot of the cross-validation RMSEPs from the training set alone, a choice of 2, 3, 4, 5 or 6 latent variables looked optimal; the final choice of 3 latent variables meets the criterion “as simple as possible, but not simpler than really needed”. However, the error measured at 10 latent variables (ncomp = 10) gave the lowest actual error on the first independent, randomly sampled test set that I ran, and on another run the error was still decreasing at ncomp = 12. That number and the RMSEP will vary depending on the particular test set, and, given the multimodal distribution of the boiling points and the small number (around 100) of distinct molecules used to build the model, they can vary considerably.

Comparison of correlations

At the start of the project, I observed that the correlation between SMILES string length and boiling point was 0.93. Let us compare that to the predicted vs observed correlation for bp_model_1:

cor(pred_bp_1,bp_test$boilingpoint)
## [1] 0.9464219

On this particular test set the correlation is about 0.95 (earlier runs gave 0.935 and 0.941), only a very small improvement on the SMILES string-length model. For comparison: what is the correlation between predicted and observed boiling points for my final model (bp_model_2)? About 0.99 (0.979 on another run):

cor(pred_bp_2,bp_test_2$boilingpoint)
## [1] 0.9955534

So the data-driven model, with a few latent variables created from a large number of variables with chemical-structural meaning, is an improvement on the first simple model that I created from hand-picked CDK descriptors with only one latent variable. That first model had a similar predicted/observed correlation to the initial ultra-simple SMILES string length vs boiling point comparison.

The RMSEP of the final model (data-driven, 3 latent variables) varies between 30 and 56 kelvin, depending on the randomly selected test set.
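To put a number on that variability (a sketch, not part of the original analysis), the split-fit-predict cycle can be repeated over several random 80/20 splits and the 3-component test RMSEP collected each time:

# Sketch: repeat the random 80/20 split a few times and record the test RMSEP
# of a 3-component PLS model on each split. The seed value is arbitrary.
set.seed(1)
rmsep_runs <- replicate(5, {
  n    <- nrow(descriptor_matrix_2)
  idx  <- sample(seq_len(n), size = round(0.2 * n))
  fit  <- plsr(boilingpoint ~ ., data = descriptor_matrix_2[-idx, ], ncomp = 3)
  obs  <- descriptor_matrix_2$boilingpoint[idx]
  pred <- predict(fit, ncomp = 3, newdata = descriptor_matrix_2[idx, ])
  sqrt(mean((as.numeric(pred) - obs)^2))
})
round(rmsep_runs, 1)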

Comparing this to the literature

Wiener constructed a relationship with three variables: n (the number of carbon atoms), the path number w (now known as the Wiener index) and the polarity number p.

Reusability of the code for other types of molecules

It would be interesting to repeat the analysis and apply it to other chemical structures, possibly alcohols. Currently there is a lack of boiling point data in Wikidata for alcohols and other organic molecules. We would need more data to create a good model, because we might expect a model trained on more complex organic molecules to be itself more complex (i.e. to require more latent variables) in order to make similarly successful predictions.
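As a starting point for such a follow-up, the sketch below adapts the units-aware query to another compound class. The class item wd:QXXXXXXX is a placeholder that would need to be looked up on Wikidata (e.g. the item for the class of alcohols); everything else matches the alkane query used above.

# Sketch: the same query for another compound class. Replace wd:QXXXXXXX with the
# Wikidata item for the class of interest (placeholder, to be looked up).
alcohol_query <- 'SELECT DISTINCT ?compound ?compoundLabel ?SMILES ?boilingpoint ?bpUnitLabel WHERE {
  ?compound wdt:P31/wdt:P279* wd:QXXXXXXX .
  ?compound wdt:P233 ?SMILES ;
        p:P2102 [
          ps:P2102 ?boilingpoint ;
          psv:P2102/wikibase:quantityUnit ?bpUnit
        ] .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}'
# results_alcohols <- query_wikidata(alcohol_query)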

END.