The support vector machine (SVM) is a powerful classifier, due both to its inherent regularization properties and to its ability to handle decision boundaries of arbitrary complexity through its formulation as a kernel method. However, suppose that we’re not so much interested in the decision boundary itself as in the relative importance of the features in specifying this decision boundary. For example, given two sets (i.e. ‘classes’) of samples, one from tumor tissue and one from normal tissue, and associated RNA-seq gene expression data, one may be interested in ranking genes (i.e. ‘features’) according to their importance in specifying the difference (i.e. ‘decision boundary’) between normal and tumor tissue. This is an important problem in machine learning, known as feature ranking. Guyon et al. proposed combining SVMs with recursive feature elimination (SVM-RFE) for feature ranking. Recursive feature elimination in its simplest formulation starts with the complete set of features, and then repeats the following three steps until no features are left:

- Train a model (in the present case, an SVM).
- Compute a ranking of features.
- Remove the feature with the worst rank.

For step 2, ranking the features, Guyon et al. proposed taking the coefficients of the SVM primal problem (i.e. the coefficients of the SVM hyperplane) as feature weights. Since SVMs are fitted by solving the dual problem using quadratic programming, these hyperplane coefficients first need to be computed from the coefficients of the dual problem. In R, for a matrix X with samples in the rows and features in the columns, and a factor y containing the classes, this looks as follows:

```r
# single-step SVM-RFE
svm_rfe__single_step <- function(X, y, ...) {

  library(kernlab)

  # keep track of the iteration during which
  # a feature was eliminated
  ii <- rep(NA, ncol(X))
  i <- 0

  while ( any(is.na(ii)) ) {
    # indices of remaining features
    not_elim_yet <- which(is.na(ii))
    # train the classifier on the remaining features
    fit <- ksvm(X[, not_elim_yet], y, scaled = FALSE, ...)
    # compute the primal problem coefficients from the dual
    # problem coefficients
    sv_i <- alphaindex(fit)[[1]]
    w <- t( coef(fit)[[1]] ) %*% X[ sv_i, not_elim_yet ]
    # eliminate the feature with the smallest squared weight
    to_elim <- not_elim_yet[ head(order( w * w ), 1) ]
    ii[to_elim] <- i
    i <- i + 1
  }

  # convert iterations into ranks
  i - ii
}
```
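
For a linear kernel, this conversion is the standard identity relating the dual solution to the primal hyperplane normal (a textbook result, not spelled out in the original post): with dual coefficients $\alpha_i$, class labels $y_i \in \{-1, +1\}$ and support vectors $x_i$,

$$
w = \sum_{i \in \mathrm{SV}} \alpha_i y_i x_i
$$

In kernlab, coef(fit)[[1]] returns the products $\alpha_i y_i$ and alphaindex(fit)[[1]] the row indices of the support vectors, so the matrix product in the code computes exactly this sum. Note that interpreting $w$ as feature weights is exact for a linear kernel, and a heuristic otherwise.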

The above algorithm eliminates one feature (i.e. gene) per iteration. Typical RNA-seq expression datasets have hundreds to tens of thousands of features, so a call to the function above also needs to train an SVM hundreds to tens of thousands of times. Needless to say, this can take quite a while. Note that we’re most likely mainly interested in the top-ranking genes, and much less so in the lower-ranking genes. Therefore, we could eliminate relatively many features during the early iterations, and decrease the number of features to be eliminated as the iterations progress. Taking the approach of exponential decrease (i.e. eliminating a fixed fraction of the remaining features at each iteration), this looks as follows:

```r
# multi-step SVM-RFE
# default elim_frac corresponds to single-step
svm_rfe <- function(X, y, elim_frac = 1 / ncol(X), ...) {

  library(kernlab)

  # keep track of the iteration during which
  # a feature was eliminated
  ii <- rep(NA, ncol(X))
  i <- 0

  while ( any(is.na(ii)) ) {
    # indices of remaining features
    not_elim_yet <- which(is.na(ii))
    # number of features to eliminate
    n_to_elim <- ceiling( elim_frac * length(not_elim_yet) )
    # train the classifier on the remaining features
    fit <- ksvm(X[, not_elim_yet], y, scaled = FALSE, ...)
    # compute the primal problem coefficients from the dual
    # problem coefficients
    sv_i <- alphaindex(fit)[[1]]
    w <- t( coef(fit)[[1]] ) %*% X[ sv_i, not_elim_yet ]
    # eliminate the features with the smallest squared weights
    to_elim <- not_elim_yet[ head(order( w * w ), n_to_elim) ]
    ii[to_elim] <- i
    i <- i + 1
  }

  # convert iterations into ranks
  i - ii
}
```
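
Before turning to real data, here is a quick sanity check on simulated data (a sketch of my own, not from the original post; the variable names are illustrative, and I use kernlab’s linear kernel vanilladot, for which the primal-coefficient interpretation of the weights is exact):

```r
library(kernlab)

set.seed(1)

# 40 samples, 30 features; only the first 3 features separate the classes
X_sim <- matrix(rnorm(40 * 30), nrow = 40, ncol = 30)
y_sim <- factor(rep(c("a", "b"), each = 20))
X_sim[y_sim == "b", 1:3] <- X_sim[y_sim == "b", 1:3] + 2

# single-step ranking (default elim_frac) with a linear kernel
r <- svm_rfe(X_sim, y_sim, kernel = "vanilladot")

# the informative features should end up among the best (lowest) ranks
which(r <= 3)
```

If all goes well, this (mostly) recovers the three informative features.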

Let’s see how this works in practice. For this, we’ll use the ‘cervical’ data from the Bioconductor package MLSeq. This dataset contains RNA-seq expression levels of 714 miRNAs in 29 tumor and 29 normal human cervical samples. We can load and preprocess the data, and run the functions above, using the following code:

```r
library(MLSeq)

# load RNA-seq expression data
data(cervical)

# preprocess: derive the class labels from the column names,
# put samples in rows and features in columns, and standardize
y <- factor(grepl("^T[0-9].*", colnames(cervical)))
levels(y) <- c("normal", "tumor")
X <- scale(t(data.matrix(cervical)))

# SVM-RFE single-step
system.time(I <- svm_rfe(X, y))

# SVM-RFE multi-step
system.time(J <- svm_rfe(X, y, 0.1))
```
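
How many SVMs does each of these calls train? A small helper (my own illustration, not part of the original code) replays the elimination schedule without fitting anything:

```r
# count the iterations (= SVMs trained) of the elimination schedule
n_iterations <- function(n_features, elim_frac) {
  remaining <- n_features
  i <- 0
  while (remaining > 0) {
    remaining <- remaining - ceiling(elim_frac * remaining)
    i <- i + 1
  }
  i
}

n_iterations(ncol(X), 1 / ncol(X))  # single-step: 714
n_iterations(ncol(X), 0.1)          # multi-step: 47
```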

In the example above, using the default value removes one feature at a time and thus trains 714 SVMs, one for each feature. Setting elim_frac to 0.1 (i.e. eliminating 10% of the remaining features at each iteration), on the other hand, needs to train only 47 SVMs. Although setting elim_frac to 0.1 trains roughly 15 times fewer SVMs, it runs approximately 30 times as fast (~1 second vs. ~33 seconds on my laptop). This is because the multi-step approach saves most of its training runs in the early iterations, when many features remain and each SVM is relatively expensive to train. So, we lose some resolution in the relatively uninteresting lower-ranking features, but gain a lot in terms of execution time. This gain in execution time is visualized in the following plot.
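
A sketch along the following lines could be used to generate such a timing curve (the grid of elim_frac values is my own choice, not necessarily the one behind the original plot):

```r
# time svm_rfe for a range of elimination fractions
fracs <- c(1 / ncol(X), 0.01, 0.05, 0.1, 0.2)
times <- sapply(fracs, function(f) {
  system.time(svm_rfe(X, y, elim_frac = f))["elapsed"]
})

plot(fracs, times, type = "b", log = "x",
     xlab = "elim_frac", ylab = "elapsed time (s)")
```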