As a prerequisite, this notebook needs a certain Wikidata query R package which you can find here: https://github.com/bearloga/WikidataQueryServiceR
Before running a statistical analysis, we began with some WikiData data that needed to be completed, referenced, inspected, processed and cleaned. After this came the analysis and modelling, and, finally, visualisation of the results. The project process taught us that there are many choices to be made about methods, parameters and variables. I aimed to document this process in the right level of detail to render the project reproducible. This includes justifying many of the modelling choices. The following interactive notebook is an explanation of the steps I performed.
In 1947 Harry Wiener showed a correlation between the physicochemical properties of organic compounds and their chemical structure. He made a correlation model that linked structural features of compounds with boiling points.
In this assignment we use SPARQL to query physicochemical properties from Wikidata for a series of chemical properties, then make a (PLS) Partial Least Squares model in an R Markdown notebook with RStudio. In order to correctly use the model for prediction it is necessary to split the dataset into a training and a test set. (using 10% or 20% as test set and the remaining for training) and also to use cross-validation (e.g. Leave-one-Out or Leave-10%-Out) An elbow plot will help to estimate the optimal number of latent variables to use in the predictive model. A scatterplot of the model predictions for the test set of the dependent variable - boiling point (the modelled property) versus the experimental (real, observed) value will serve as a visualisation of the accuracy of the model predictions, quantified as an RMSEP in units Kelvin. This assignment is part of weeks 4 and 5 of Scientific Programming on the Master’s Systems Biology at Maastricht University. [https://www.maastrichtuniversity.nl/education/master/master-systems-biology ]
rcdk package for R Vignette/tutorial :[ https://cran.r-project.org/web/packages/rcdk/vignettes/using-rcdk.html]
The WikiData SPARQL endpoint is [https://query.wikidata.org/bigdata/namespace/wdq/sparql ]
Set up libraries required for the assignment
library("rcdk")
## Loading required package: rcdklibs
## Loading required package: rJava
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
library(WikidataQueryServiceR)
Using the R package WikidataQueryServiceR. Maintaining the SPARQL query syntax. The first query checks for any entries still waiting for their boiling point statements to be entered in WikiData.
is a text string that represents the compound structure in 2D. ie. it specifies elements, connectivity and bond order, but not the 3D position of each atom. Examples of SMILES: -ethanol CCO; -ethylene C=C; -acetylene C#C; -2-propanol CC(O)C Query WikiData to find if there are many molecules with missing boiling point data.
sparql_query_missingbp <- 'SELECT DISTINCT ?alkane ?alkaneLabel ?SMILES ?boilingpoint
WHERE {
?alkane wdt:P31/wdt:P279* wd:Q41581 .
?alkane wdt:P233 ?SMILES .
MINUS {?alkane wdt:P2102 ?boilingpoint .}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en"}
}'
results_missingbp = query_wikidata(sparql_query_missingbp)
## 8 rows were returned by WDQS
During the period of working on the assignment, this number of molecules missing their boiling point data in WikiData fell from 11 to 8.
sparql_query <- 'SELECT DISTINCT ?alkane ?alkaneLabel ?boilingpoint
WHERE {
?alkane wdt:P31/wdt:P279* wd:Q41581 .
?alkane wdt:P2102 ?boilingpoint .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en"}
}'
results_first = query_wikidata(sparql_query)
## 134 rows were returned by WDQS
first eyeball of the data : 131 unique / distinct observations are returned (140 non-unique) Update: 2 October. there are now 134.
range(results_first$boilingpoint)
## [1] -44.00 994.15
In the first week of the project this returned a range -44.00 to 994.15 - units unknown. That range has a suspiciously low value. In Kelvin expect all values non-negative! Let’s check out the units of the boiling points and convert anything non-Kelvin to Kelvin.
The following query is altered from the one provided by Egon “set up to deal with the complex structure of statements with units in Wikidata.” ie it reaches right down into the hierarchical structure of the WikiData data to extract the various units entered by different WikiData users, and also a human readable label for those units.
units_query <- 'SELECT DISTINCT ?alkane ?alkaneLabel ?SMILES ?boilingpoint ?bpUnit ?bpUnitLabel WHERE {
?alkane wdt:P31/wdt:P279* wd:Q41581 .
?alkane wdt:P233 ?SMILES ;
p:P2102 [
ps:P2102 ?boilingpoint ;
psv:P2102/wikibase:quantityUnit ?bpUnit
] .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}'
results_bp = query_wikidata(units_query)
## 134 rows were returned by WDQS
possible_units <- unique(results_bp$bpUnitLabel)
possible_units
## [1] "kelvin" "degree Fahrenheit" "degree Celsius"
getting the boiling points out of the WikiData DB the problem is that some of them are in Celsius and others in Fahrenheit. Convert units and change unitlabels across the dataset.
results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Celsius"] <- results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Celsius"] + 273.15
results_bp$bpUnitLabel[results_bp$bpUnitLabel == "degree Celsius"] <- "kelvin"
results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Fahrenheit"] <- (results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Fahrenheit"]-32)*5/9 + 273.15
results_bp$bpUnitLabel[results_bp$bpUnitLabel == "degree Fahrenheit"] <- "kelvin"
quick check
unique(results_bp$bpUnitLabel)
## [1] "kelvin"
head(results_bp$boilingpoint)
## [1] 827.15 939.15 390.15 383.25 973.15 391.75
range(results_bp$boilingpoint)
## [1] 111.40 994.15
plot.default(results_bp$boilingpoint)
Check the distribution of the input data
hist(results_bp$boilingpoint)
The boiling points are not normally distributed. It will be important to check the distribution of the training set later to verify that it is not biased or skewed.
Chemistry beginner’s question: Does the length of the SMILES string correlate with the boiling point?
plot(nchar(results_bp$SMILES),results_bp$boilingpoint)
The correlation between the boiling point and the string length of the SMILES looks good!!
cor(nchar(results_bp$SMILES),results_bp$boilingpoint)
## [1] 0.9340854
Translate the data that we have extracted, in the form of SMILES, into what we need out of the CDK database in order to do the regression. For this use R package rcdk. This will give a collection of descriptors which can be used as variables, following that we will then generate latent variables with PLS.
A key requirement for the predictive modeling of molecular properties and activities are molecular descriptors - numerical charaterizations of the molecular structure. The CDK implements a variety of molecular descriptors, categorized. rcdk allows us to access the cheminformatics functionality of the CDK from within R. There is a list of descriptors here https://cdk.github.io/cdk/1.5/docs/api/org/openscience/cdk/qsar/descriptors/molecular/package-tree.html I can also learn about descriptors in the rcdk documentation. https://cran.r-project.org/web/packages/rcdk/vignettes/using-rcdk.html
The molecule objects (that result from load.molecules) are of class jobjRef (provided by the rJava package). As a result,they are pretty opaque to the user and are really meant to be processed using methods from the rcdk or rJava packages.
Another common way to obtain molecule objects is by parsing SMILES strings. The simplest way to do this is eg
smile <- 'c1ccccc1CC(=O)C(N)CC1CCCCOC1'
mol <- parse.smiles(smile)[[1]]
Usage is more efficient when multiple SMILE are supplied, since then a single SMILES parser object is used to parse all the supplied SMILES. (from the rcdk tutorial -see useful sources) It is possible to evaluate all available descriptors at one go, or evaluate individual descriptors. Each descriptor name (eg “org.openscience.cdk.qsar.descriptors.molecular.WHIMDescriptor”) is a fully qualified Java class name for the corresponding descriptor. These names can be supplied to eval.desc to evaluate a single or multiple descriptors for one or more molecules.
Explore the descriptor categories
dc <- get.desc.categories()
dc
## [1] "hybrid" "constitutional" "topological" "electronic"
## [5] "geometrical"
Here there are two opposing approaches to building the model: purely data-driven throw the kitchen sink at it, uses all the available descriptors from CDK (how many datapoints is that? - 219 after cleaning), followed by dimension reduction and then predicts based on the optimum number of latent variables. (bp_model_2) vs a more knowledge-based approach (bp_model_1). By eg. searching the descriptor list on “Weiner”, “boiling point”, “complexity” and handpicking likely important variables (this is the approach used in the rcdk vignette). The r pls package PLS method still employs principal component analysis, so our prediction will still run on latent variables anyway.
The list of descriptor names used here comes from the rcdk tutorial (https://cran.r-project.org/web/packages/rcdk/vignettes/using-rcdk.html) the example in which the are also interested in predicting boiling points from organic molecules.
mols <- parse.smiles(results_bp$SMILES)
descNames_limited <- c(
'org.openscience.cdk.qsar.descriptors.molecular.KierHallSmartsDescriptor',
'org.openscience.cdk.qsar.descriptors.molecular.APolDescriptor',
'org.openscience.cdk.qsar.descriptors.molecular.HBondDonorCountDescriptor')
descriptors_limited <- eval.desc(mols, descNames_limited)
Note: this gives 81 variables across the molecules list. Some will be NA, nevertheless 81 variables from 3 descriptors is a lot more than I expected. Prompting the question how many variables would there be from the complete set of around 50 descriptors? (answer: more than 200)
anyNA(descriptors_limited)
## [1] FALSE
Confirms that , as mentioned in the tutorial, these descriptors have been chosen so as to not return NAs. Nevertheless, this code, preparatory to building a predictive model from the descriptor matrix, is set up to remove columns with NAs and any constant columns from the matrix.
#remove NAs
descriptors_limited <- descriptors_limited[, !apply(descriptors_limited, 2, function(x) any(is.na(x)) )]
#remove constant columns
descriptors_limited <- descriptors_limited[, !apply( descriptors_limited, 2, function(x) length(unique(x)) == 1 )] #no. of variables drops from 81 to 5 on running this
Add boiling point to this descriptor matrix, this will enable us to use the formula syntax in plsr to build our model.
descriptor_matrix_1 <- cbind(descriptors_limited, results_bp$boilingpoint)
library(plyr)
descriptor_matrix_1 <- rename(descriptor_matrix_1,c("results_bp$boilingpoint" = "boilingpoint")) #renames awkward column name (requires dplyr package)
make a random 80/20 split
number <- nrow(descriptor_matrix_1)
indexes <- c(1:number)
sample_indexes <- sample(indexes, size = (number*0.2))
bp_train <- descriptor_matrix_1[-sample_indexes,]
bp_test <- descriptor_matrix_1[sample_indexes,]
hist(bp_train$boilingpoint)
Check the distribution of the boiling points of the training dataset on a histogram. Satisfactory: the training set has a similar distribution to the full set of alkanes.
There are 5 descriptor variables Cross validation and PLS results, displayed on an elbow plot, suggest that 1 latent variable is optimum for this model.
bp_model_1 <- plsr(boilingpoint ~. ,ncomp = 3, data = bp_train, validation = "CV")
summary(bp_model_1)
## Data: X dimension: 108 5
## Y dimension: 108 1
## Fit method: kernelpls
## Number of components considered: 3
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps
## CV 261.4 90.73 91.08 89.92
## adjCV 261.4 90.63 90.92 88.56
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps
## X 99.96 100.00 100.00
## boilingpoint 88.27 88.45 90.11
plot(RMSEP(bp_model_1), legendpos = "topright")
Predict on the test set
pred_bp_1 <- predict(bp_model_1, ncomp = 1, newdata = bp_test)
plot(pred_bp_1,bp_test$boilingpoint, abline(a= 0, b = 1))
RMSEP(bp_model_1, newdata = bp_test)
## (Intercept) 1 comps 2 comps 3 comps
## 255.07 82.45 77.56 78.19
This is not a satisfactory model. With a single latent variables RMSEP is somewhere between 90 and 100 Kelvin. It does not perform very well at predicting the boiling points of compounds in the training set. This choice of descriptors appears not to be sufficiently predictive.
descNames <- unique(unlist(sapply(get.desc.categories(), get.desc.names))) #50 descriptors
dn_all <- eval.desc(mols, descNames) #286 variables
remove NA - containing columns as before
anyNA(dn_all)
## [1] TRUE
#remove NAs
desc_all <- dn_all[, !apply(dn_all, 2, function(x) any(is.na(x)) )]
#results in 219 variables (before was 286 variables)
and create one big decriptor_matrix_2 containing the boiling point
#Add boiling point to this to make the descriptor matrix
descriptor_matrix_2 <- cbind(desc_all, results_bp$boilingpoint)
#note rename() needs library(plyr)
descriptor_matrix_2 <- rename(descriptor_matrix_2,c("results_bp$boilingpoint" = "boilingpoint")) #renames awkward column name (uses dplyr package)
Split the training and test sets.
bp_train_2 <- descriptor_matrix_2[-(sample_indexes),]
bp_test_2 <- descriptor_matrix_2[sample_indexes,]
Make and cross-validate the model using plsr
bp_model_2 <- plsr(boilingpoint ~., data = bp_train_2, ncomp = 12, validation = "CV" )
summary(bp_model_2)
## Data: X dimension: 108 219
## Y dimension: 108 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 261.4 166.9 79.74 46.09 33.47 32.02 31.32
## adjCV 261.4 166.7 79.65 45.96 33.25 31.89 31.07
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 31.08 31.19 31.01 31.34 32.19 31.99
## adjCV 30.83 30.95 30.77 31.02 31.84 31.58
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 99.37 100.00 100.00 100.00 100.00 100.00
## boilingpoint 59.94 90.92 97.17 98.59 98.73 98.91
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## boilingpoint 98.95 98.96 98.97 99.01 99.01 99.04
Select the optimum number of latent variables
plot(RMSEP(bp_model_2), legendpos = "topright")
Based on the elbow plot , a model with 3 latent variables appears optimum with good performance without adding too much complexity. What’s the predictive power of this data-driven model with 3 latent variables?
Predict boiling points of the test set and generate a Scatterplot of Predicted boiling pt vs Observed boiling point in Test dataset
pred_bp_2 <- predict(bp_model_2, ncomp = 3, newdata = bp_test_2)
plot(pred_bp_2,bp_test_2$boilingpoint, abline(a= 0, b = 1))
This model is more successful than bp_model_1, predicting reasonably well.
RMSEP(bp_model_2, newdata = bp_test_2)
## (Intercept) 1 comps 2 comps 3 comps 4 comps
## 255.07 154.07 69.42 24.64 15.41
## 5 comps 6 comps 7 comps 8 comps 9 comps
## 16.98 18.79 13.40 15.14 14.53
## 10 comps 11 comps 12 comps
## 17.92 19.34 17.34
It is interesting to note that, based on the earlier elbow plot depicting the results of the cross-validation RMSEP’s from the training set alone, the choice of 2, 3,4,5 or 6 latent variables for the model looked optimum, with the final model choice of 3 latent variables meeting the criterion ‘As simple as possible, but not simpler than really needed’, Whereas,the error measurement at 10 latent variables (ncomp = 10) gave the lowest actual error on the first particular set of independent randomly sampled test data that I ran, and on another run, was still decreasing at ncomp = 12. That number and the RMSEP will vary depending on the particular test set, and considering the multimodal distribution of the boiling points and the small (c.100) number of distinct molecules used to build the model, it can vary considerably.
At the start of the project, I observed that between SMILES string length and boiling point the correlation was 0.93… Let us compare that to the correlation of bp_model_1
cor(pred_bp_1,bp_test$boilingpoint)
## [1] 0.9464219
On this particular test set, 0.935. (another run: 0.941) only a very small improvement on the SMILES model. Compare: what is now the correlation of my final model (bp_model_2) between predicted and observed boiling points? 0.99 (on another run : 0.979)
cor(pred_bp_2,bp_test_2$boilingpoint)
## [1] 0.9955534
So, the data-driven model with few latent variables created from a large number of variables with a chemical structural meaning is an improvement on the first simple model that I created from handpicked CDK descriptors with only one latent variable. That first model had a similar predicted/oberved correlation to the initial ultra-simple SMILES stringlength vs boiling point comparison.
The Final model (data-driven - 3 latent variables) RMSEP varies between 30 and 56, depending on the randomly selected test set.
Wiener constructed a relationship with 3 variables n (number of Carbon atoms) and w and p.
It would be interesting to repeat and apply to more chemical structures possibly alcohols. Currently there is a lack of available boiling point data in WikiData on alcohols and other organic molecules. We need more data to create a good model because we might expect a model trained for more complex organic molecules to be, itself, more complex (require more latent variables) in order to make similarly successful predictions.