SVM with recursive feature elimination in R

The support vector machine (SVM) is a very powerful classifier due to its inherent regularization properties as well as its ability to handle decision boundaries of arbitrary complexity by its formulation as a kernel method. However, suppose that we’re not so much interested in the decision boundary itself, but more in the relative importance of the features in specifying this decision boundary. For example, given two sets (i.e. ‘classes’) of samples, one from tumor tissue and one from normal tissue, and associated RNA-seq gene expression data, one may be interested in ranking genes (i.e. ‘features’) according to their importance in specifying the difference (i.e. ‘decision boundary’) between normal and tumor tissue. This represents an important problem in the field of machine learning, and is called feature ranking. Guyon et al. proposed combining SVMs with recursive feature elimination (SVM-RFE) for feature ranking. Recursive feature elimination in its simplest formulation starts with the complete set of features, and then repeats the following three steps until no more features are left:

1. Train a model (in the present case, an SVM).
2. Compute a ranking of features.
3. Remove the feature with the worst rank.

For step 2, ranking the features, Guyon et al. proposed to take the coefficients to the SVM primal problem (i.e. the coefficients to the SVM hyperplane) as feature weights. Since SVMs are computed by solving the dual problem using quadratic programming, these hyperplane coefficients first need to be computed from the coefficients to the dual problem. In R, for a matrix X with samples in the rows and features in the columns, and a factor y containing the classes, this looks as follows:

# single-step SVM-RFE
svm_rfe__single_step <- function(X, y, ...) {

library(kernlab)

# keep track of the iteration during which
# a feature was eliminated
ii <- rep(NA, ncol(X))
i <- 0
while ( any(is.na(ii)) ) {
# indices of remaining features
not_elim_yet <- which(is.na(ii))
# train the classifier on the remaining features
fit <- ksvm(X[,not_elim_yet], y, scaled = FALSE, ...)
# compute the primal problem coefficients from the dual
# problem coefficients
sv_i <- alphaindex(fit)[]
w <- t( coef(fit)[] ) %*% X[ sv_i, not_elim_yet ]
# eliminate the feature with the smallest squared weight
to_elim <- not_elim_yet[ head(order( w * w ), 1) ]
ii[to_elim] <- i
i <- i + 1
}
# convert iterations into ranks
i - ii

}

The above algorithm eliminates one feature (i.e. gene) per iteration. Typical RNA-seq expression datasets have hundreds to tens of thousands of features, so a call to the function above also to needs to train an SVM hundreds to tens of thousands of times. Needless to say, this can take quite a while. Note that we’re most likely mainly interested in the top-ranking genes, and less so in lower ranking genes. Therefore, we could also eliminate relatively many features during early iterations, and then decrease the number of features to be eliminated as the iteration progresses. Taking the approach of exponential decrease (i.e. eliminate a fixed fraction of remaining features), this looks as follows:

# multi-step SVM-RFE
# default elim_frac corresponds to single-step
svm_rfe <- function(X, y, elim_frac = 1 / ncol(X), ...) {

library(kernlab)

# keep track of the iteration during which
# a feature was eliminated
ii <- rep(NA, ncol(X))
i <- 0
while ( any(is.na(ii)) ) {
# indices of remaining features
not_elim_yet <- which(is.na(ii))
# number of features to eliminate
n_to_elim <- ceiling ( elim_frac * length(not_elim_yet) )
# train the classifier on the remaining features
fit <- ksvm(X[,not_elim_yet], y, scaled = FALSE, ...)
# compute the primal problem coefficients from the dual
# problem coefficients
sv_i <- alphaindex(fit)[]
w <- t( coef(fit)[] ) %*% X[ sv_i, not_elim_yet ]
# eliminate the features with the smallest squared weights
to_elim <- not_elim_yet[ head(order( w * w ), n_to_elim) ]
ii[to_elim] <- i
i <- i + 1
}
# convert iterations into ranks
i - ii

}

Let’s see how this works in practice. For this, we’ll use the ‘cervical’ data from the Bioconductor package MLseq. This dataset describes the expression levels as measured by RNA-seq of 714 miRNAs in 29 tumor and 29 normal human cervical samples. We can load and process the data, and run the above functions, using the following code:

library(MLSeq)

data(cervical)

# preprocess
y <- factor(grepl("^T[0-9].*", colnames(cervical)))
levels(y) <- c("normal", "tumor")
X <- scale(t(data.matrix(cervical)))

# SVM-RFE single-step
system.time(I <- svm_rfe(X, y))
# SVM-RFE multi-step
system.time(J <- svm_rfe(X, y, 0.1))

In the example above, using default values removes one feature at a time and thus trains 714 SVMs, one for each feature. On the other hand, setting elim_frac to 0.1 (i.e. at each iteration, 10% of features are eliminated), needs to train only 47 SVMs. Although setting elim_frac at 0.1 trains roughly 15 times fewer SVMs, it runs approximately 30 times as fast (~1 second vs. ~33 seconds on my laptop). This is because the number of SVMs trained on many features is relatively more constrained by the multi-step approach than the number of SVMs trained on fewer features. So, we lose some resolution in the relatively uninteresting lower ranking features, but gain a lot in terms of execution time. This gain in execution time is visualized in the following plot. Execution time of SVM-RFE in seconds (left) and number of SVMs trained (right); as a function of the fraction of features eliminated per iteration.

Faster Hamming distance in R (2)

In my last post, I proposed a very fast and simple matrix multiplication approach for calculating the Hamming distance between all columns of a matrix X. This approach was restricted to binary matrices:

hamming_binary <- function(X, Y = NULL) {
if (is.null(Y)) {
D <- t(1 - X) %*% X
D + t(D)
} else {
t(1 - X) %*% Y + t(X) %*% (1 - Y)
}
}

Since it only relies on low-level matrix algebra routines, the above function is extremely fast, much faster for example than the hamming.distance function from the e1071 package, which is implemented as an R-level nested loop and is also restricted to binary matrices. However, what if we would like to calculate the Hamming distance between strings of characters, such as DNA sequences, which are composed of As, Cs, Gs and Ts? The nice thing is, that matrix multiplication can also be used for matrices with more than two possible values, and different types of elements. Furthermore, the approach below is generalized to calculating the Hamming distances between all pairs of columns of two matrices X and Y:

hamming <- function(X, Y = NULL) {
if (is.null(Y)) {
uniqs <- unique(as.vector(X))
U <- X == uniqs
H <- t(U) %*% U
for ( uniq in uniqs[-1] ) {
U <- X == uniq
H <- H + t(U) %*% U
}
} else {
uniqs <- union(X, Y)
H <- t(X == uniqs) %*% (Y == uniqs)
for ( uniq in uniqs[-1] ) {
H <- H + t(X == uniq) %*% (Y == uniq)
}
}
nrow(X) - H
}

The function above calculates the Hamming distance between all columns of a matrix X, or two matrices X and Y. Again matrix multiplication is used, this time for counting, between two columns x and y, the number of cases in which corresponding elements have the same value (e.g. A, C, G or T). This counting is done for each of the possible values individually, while iteratively adding the results. The end result of the iterative adding is the sum of all corresponding elements that are the same, i.e. the inverse of the Hamming distance. Therefore, the last step is to subtract this end result H from the maximum possible distance, which is the number of rows of matrix X.

Interestingly, such a "multi-level" Hamming distance can also be quite elegantly expressed as a sum of multiple binary Hamming distances:

hamming <- function(X, Y = NULL) {
if (is.null(Y)) {
uniqs <- unique(as.vector(X))
H <- hamming_binary(X == uniqs)
for ( uniq in uniqs[-1] ) {
H <- H + hamming_binary(X == uniq)
}
} else {
uniqs <- union(X, Y)
H <- hamming_binary(X == uniqs, Y == uniqs)
for ( uniq in uniqs[-1] ) {
H <- H + hamming_binary(X == uniq, Y == uniq)
}
}
H / 2
}

Note that at the end, we need to divide by two, because by summing the individual binary Hamming distances, we are counting each element twice.

Also note that the function is not dependent on the mode of the matrix. In other words, the matrix can consist of characters, integers, and any other type of element you can store in an R matrix.

Fast Hamming distance in R using matrix multiplication

This post is strictly about binary matrices. For matrices of any data type, have a look at this post.

The hamming distance is perhaps the most frequently used distance measure for binary vectors. It is defined on two vectors of equal length as the number of corresponding elements that differ between the two vectors. A special case of the Hamming distance, is the Hamming distance between two binary vectors, which is often used e.g. in telecommunication to count the number of bit flips in a signal, and in bioinformatics as a similarity measure for hierarchical clustering. While this binary Hamming distance calculation between only two vectors is simple and computationally cheap, calculating all pairwise distances between the rows of a matrix quickly becomes prohibitive, as the computation time scales quadratically with the number of rows in a matrix. The following R function represents a basic implementation of calculating the Hamming distance between all rows of a matrix, using a nested for-loop:

hamming_loop <- function(X) {
d <- matrix(0, nrow = nrow(X), ncol = nrow(X))
for ( i in 1:(nrow(X) - 1) ) {
for ( j in (i + 1):nrow(X) ) {
d[i,j] <- d[j,i] <- sum(X[i,] != X[j,])
}
}
d
}

However, in R, the Hamming distance between the rows of a binary matrix can be computed much faster, by exploiting the fact that the dot product of two binary vectors x and (1 – y) counts the corresponding elements where x has a 1 and y has a 0:

hamming <- function(X) {
D <- (1 - X) %*% t(X)
D + t(D)
}

To demonstrate how much faster the matrix multiplication function is than the nested for-loop function, I ran some simulations. In addition to the above two functions, I included the function hamming.distance from the R-package e1071: Hamming distance computation time in seconds, as a function of number of rows, while keeping the number of columns at 100.

The relatively naive nested for-loop approach is faster than the hamming.distance function from the e1071 package. This is surprising, since the e1071 hamming.distance function is also implemented as a nested loop. Strikingly, the matrix multiplication-based function is by far the fastest function for computing the Hamming distance, the reason being that it does not use any expensive looping at R-level, only efficient lower level linear algebra routines. Note that the above approach is restricted to binary matrices. However, quite conveniently, matrix multiplication can also be used for computing the Hamming distance for matrices with more than two possible values, and other types of elements (character, numeric, …), as explained in this post.