SVM with recursive feature elimination in R

The support vector machine (SVM) is a very powerful classifier, due to its inherent regularization properties as well as its ability, through its formulation as a kernel method, to handle decision boundaries of arbitrary complexity. However, suppose that we’re not so much interested in the decision boundary itself, but rather in the relative importance of the features in specifying this decision boundary. For example, given two sets (i.e. ‘classes’) of samples, one from tumor tissue and one from normal tissue, and associated RNA-seq gene expression data, one may be interested in ranking genes (i.e. ‘features’) according to their importance in specifying the difference (i.e. ‘decision boundary’) between normal and tumor tissue. This is an important problem in machine learning, known as feature ranking. Guyon et al. proposed combining SVMs with recursive feature elimination (SVM-RFE) for feature ranking. Recursive feature elimination in its simplest formulation starts with the complete set of features, and then repeats the following three steps until no more features are left:

  1. Train a model (in the present case, an SVM).
  2. Compute a ranking of features.
  3. Remove the feature with the worst rank.

For step 2, ranking the features, Guyon et al. proposed taking the coefficients of the SVM primal problem (i.e. the coefficients of the SVM hyperplane) as feature weights, and ranking the features by their squared weights. Since SVMs are trained by solving the dual problem using quadratic programming, these hyperplane coefficients first need to be computed from the coefficients of the dual problem: for a linear SVM, the hyperplane coefficients are w = Σ_i α_i y_i x_i, where the sum runs over the support vectors. In R, for a matrix X with samples in the rows and features in the columns, and a factor y containing the classes, this looks as follows:


# single-step SVM-RFE
svm_rfe__single_step <- function(X, y, ...) {

    library(kernlab)

    # keep track of the iteration during which
    # a feature was eliminated
    ii <- rep(NA, ncol(X))
    i <- 0
    while ( any(is.na(ii)) ) {
        # indices of remaining features
        not_elim_yet <- which(is.na(ii))
        # train the classifier on the remaining features
        # (drop = FALSE keeps the subset a matrix, even when
        # only one feature remains)
        fit <- ksvm(X[, not_elim_yet, drop = FALSE], y, scaled = FALSE, ...)
        # compute the primal problem coefficients from the dual
        # problem coefficients
        sv_i <- alphaindex(fit)[[1]]
        w <- t( coef(fit)[[1]] ) %*% X[ sv_i, not_elim_yet, drop = FALSE ]
        # eliminate the feature with the smallest squared weight
        to_elim <- not_elim_yet[ head(order( w * w ), 1) ]
        ii[to_elim] <- i
        i <- i + 1
    }
    # convert iterations into ranks
    i - ii

}
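
As a quick sanity check, we can first run this on a small two-class problem. Below is a sketch using the iris data, restricted to two of its three species (the names X_toy and y_toy are just illustrative); kernel = "vanilladot" selects kernlab’s linear kernel, since interpreting the hyperplane coefficients as feature weights assumes a linear SVM:

# two-class toy problem: versicolor vs. virginica
X_toy <- scale(data.matrix(iris[iris$Species != "setosa", 1:4]))
y_toy <- factor(iris$Species[iris$Species != "setosa"])
# returns one rank per feature (rank 1 = eliminated last,
# i.e. most important)
svm_rfe__single_step(X_toy, y_toy, kernel = "vanilladot")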

The above algorithm eliminates one feature (i.e. gene) per iteration. Typical RNA-seq expression datasets have hundreds to tens of thousands of features, so a call to the function above also needs to train an SVM hundreds to tens of thousands of times. Needless to say, this can take quite a while. Note that we’re most likely mainly interested in the top-ranking genes, and less so in the lower-ranking ones. Therefore, we could eliminate relatively many features during the early iterations, and then decrease the number of features to be eliminated as the iteration progresses. Taking the approach of exponential decrease (i.e. eliminating a fixed fraction of the remaining features), this looks as follows:


# multi-step SVM-RFE
# default elim_frac corresponds to single-step
svm_rfe <- function(X, y, elim_frac = 1 / ncol(X), ...) {

    library(kernlab)

    # keep track of the iteration during which
    # a feature was eliminated
    ii <- rep(NA, ncol(X))
    i <- 0
    while ( any(is.na(ii)) ) {
        # indices of remaining features
        not_elim_yet <- which(is.na(ii))
        # number of features to eliminate this iteration
        n_to_elim <- ceiling(elim_frac * length(not_elim_yet))
        # train the classifier on the remaining features
        # (drop = FALSE keeps the subset a matrix, even when
        # only one feature remains)
        fit <- ksvm(X[, not_elim_yet, drop = FALSE], y, scaled = FALSE, ...)
        # compute the primal problem coefficients from the dual
        # problem coefficients
        sv_i <- alphaindex(fit)[[1]]
        w <- t( coef(fit)[[1]] ) %*% X[ sv_i, not_elim_yet, drop = FALSE ]
        # eliminate the features with the smallest squared weights
        to_elim <- not_elim_yet[ head(order( w * w ), n_to_elim) ]
        ii[to_elim] <- i
        i <- i + 1
    }
    # convert iterations into ranks
    i - ii

}

Let’s see how this works in practice. For this, we’ll use the ‘cervical’ data from the Bioconductor package MLSeq. This dataset contains the expression levels, as measured by RNA-seq, of 714 miRNAs in 29 tumor and 29 normal human cervical samples. We can load and preprocess the data, and run the above functions, using the following code:


library(MLSeq)

# load RNA-seq expression data
data(cervical)

# preprocess
y <- factor(grepl("^T[0-9].*", colnames(cervical)))
levels(y) <- c("normal", "tumor")
X <- scale(t(data.matrix(cervical)))

# SVM-RFE single-step
system.time(I <- svm_rfe(X, y))
# SVM-RFE multi-step
system.time(J <- svm_rfe(X, y, 0.1))
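
Since the returned ranks run from 1 (eliminated last, i.e. most important) upwards, the names of, say, the ten top-ranking miRNAs can then be listed as follows:

# names of the ten top-ranking miRNAs (rank 1 = best)
head(colnames(X)[order(I)], 10)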

In the example above, using the default value removes one feature at a time and thus trains 714 SVMs, one for each feature. Setting elim_frac to 0.1 (i.e. at each iteration, 10% of the remaining features are eliminated), on the other hand, requires training only 47 SVMs. Although setting elim_frac to 0.1 trains roughly 15 times fewer SVMs, it runs approximately 30 times as fast (~1 second vs. ~33 seconds on my laptop). This is because the multi-step approach saves mostly on SVMs trained on many features, which are the expensive ones, and much less on SVMs trained on few features. So, we lose some resolution among the relatively uninteresting lower-ranking features, but gain a lot in terms of execution time. This gain in execution time is visualized in the following plot.

Execution time of SVM-RFE in seconds (left) and number of SVMs trained (right), as a function of the fraction of features eliminated per iteration.
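
Where do these SVM counts come from? A small helper that mirrors the elimination schedule of svm_rfe() makes them easy to compute:

# number of SVMs trained by svm_rfe() for p features and
# elimination fraction f (mirrors its while loop)
n_svms <- function(p, f) {
    n <- 0
    while ( p > 0 ) {
        # each iteration trains one SVM, then eliminates
        # ceiling(f * p) of the p remaining features
        p <- p - ceiling(f * p)
        n <- n + 1
    }
    n
}

n_svms(714, 1 / 714)  # 714 (single-step)
n_svms(714, 0.1)      # 47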

Rcpp to integrate C with R

If my R script needs a speed boost, I normally use R’s .C() function to call a function written in C from within R. However, having read many good things about the Rcpp package lately, I decided to give it a try. In principle, Rcpp was developed to integrate C++ with R. However, C++ is almost a superset of C, so using Rcpp with C instead of C++ is mostly just a matter of knowing the Rcpp interface and writing some C code.

As a first step in getting to know the Rcpp interface, I set myself the goal of designing a function myOrder() that returns the ordering of the elements of a given numeric vector. In R, this would normally be done by calling the function order(). For example, ordering the vector [2.1, 4.5, 1] would return [3, 1, 2]:
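
# order() returns the permutation that sorts the vector
order(c(2.1, 4.5, 1))  # returns 3 1 2

I wrote two functions for doing this type of ordering and stored them in a text file myOrder.cpp, with the extension .cpp for C++: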


#include <Rcpp.h>
using namespace Rcpp;

int pointer_compare(
    const void* a,
    const void* b
) {
    /* Two double-pointers p1 and p2 */
    double* p1 = *(double**)a;
    double* p2 = *(double**)b;

    /* Compare pointers based on the value they point to */
    if ( *p1 < *p2 ) {
        return -1;
    } else if ( *p2 < *p1 ) {
        return 1;
    }
    return 0;
}

// [[Rcpp::export]]
IntegerVector myOrder(
    DoubleVector x
) {

    int i, n = x.size();
    IntegerVector ret(n); 

    double** p = (double**) malloc(n * sizeof(double*));

    /*
    Retrieve the addresses of the elements in x,
    and store these in the pointer array p.
    */
    for ( i = 0; i < n; i++ ) {
        p[i] = &x[i];
    }

    /*
    Sort the pointers based on their de-referenced
    values. (i.e. the values of the elements in x)
    */
    qsort(p, n, sizeof(double*), pointer_compare);

    /*
    'Normalize' the pointers to a range of [1...n]
    by subtracting the address of the first element
    in x.
    */
    for ( i = 0; i < n; i++ ) {
        ret[i] = p[i] - &x[0] + 1; // add 1 for R
    }

    free(p);

    return ret;
}

The above functions are essentially plain C, except for a few lines:

  • Line 2: defines a namespace for the code, so that anything defined in the header file Rcpp.h does not need to be prepended with “Rcpp::”. For example, the type DoubleVector is defined in Rcpp.h, and the DoubleVector parameter of myOrder() would normally need to be declared as “Rcpp::DoubleVector x”; with the namespace in place, “DoubleVector x” suffices.
  • Line 21: specifies that function myOrder() needs to be callable from within R.
  • Lines 22, 23, 27: IntegerVector and DoubleVector are Rcpp-specific C++ type definitions, corresponding to integer and numeric vectors in R, respectively.
  • Line 26: size() is a method for a vector object, which returns the length of that vector.

The above functions can be compiled, linked and called by running the following R code:


# load the library
library(Rcpp)
# compile, link and load the myOrder() function
sourceCpp("myOrder.cpp")
# generate some data
x <- rnorm(10)
# return the ordering of the data
myOrder(x)

Note that by calling the sourceCpp() function, the Rcpp package takes care of compiling and linking, and after that the C function can conveniently be called from within R.

It is interesting to compare the performance of R’s internal order() function with this custom-made Rcpp function.
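
One way to run such a comparison (a sketch, assuming the microbenchmark package is installed):

library(microbenchmark)

# random data: no NAs, and ties are essentially impossible
x <- rnorm(1e5)

# sanity check: both functions should agree on such input
stopifnot(identical(order(x), myOrder(x)))

# compare running times
microbenchmark(order(x), myOrder(x))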

Comparing the running time of R’s internal order() function to the myOrder() function for ordering numeric vectors.

You can see that the Rcpp function is quite a bit faster than R’s internal function. Obviously, R’s internal order() function is more generic than myOrder(): myOrder() does not handle NAs and only supports non-decreasing ordering, to name just two limitations. Still, the above figure shows that, with relatively little effort, and quite painlessly compared to R’s .C() interface, it is possible to speed up R code substantially, even relative to low-level internal R functions such as order().