\documentclass[a4paper]{article}
\usepackage{hyperref, graphicx, color, alltt,a4wide}
\usepackage{Sweave}
\newcommand{\pkg}[1]{\texttt{#1}}
\definecolor{Red}{rgb}{0.7,0,0}
\definecolor{Blue}{rgb}{0,0,0.8}
\definecolor{hellgrau}{rgb}{0.55,0.55,0.55}
\newenvironment{smallexample}{\begin{alltt}\small}{\end{alltt}}
\begin{document}
\SweaveOpts{concordance=TRUE}
%\VignetteIndexEntry{svm() internals}
%\VignetteDepends{xtable}
%\VignetteKeywords{classification, regression, machine learning, benchmarking, support vector machines}
%\VignettePackage{e1071}
\SweaveOpts{engine=R,eps=FALSE}
\setkeys{Gin}{width=0.8\textwidth}
\title{\texttt{svm()} internals\\
\large Some technical notes about the \texttt{svm()} in package \pkg{e1071}}
\author{by David Meyer\\
FH Technikum Wien, Austria\\
\url{David.Meyer@R-Project.org}
}
\maketitle
\sloppy
This document explains how to use the parameters in an object returned
by \texttt{svm()} for own prediction functions.
\section{Binary Classifier}
For class prediction in the binary case, the class of a new data vector
$n$ is usually given by \emph{the sign} of
\begin{equation}
\sum_i{a_i y_i K(x_i, n)} + \rho
\end{equation}
\noindent where $x_i$ is the $i$-th support vector, $y_i$ the
corresponding label, $a_i$ the corresponding coefficiant, and $K$ is
the kernel (for example the linear one, i.e. $K(u,v) = u ^{\top} v$).
Now, the \texttt{libsvm} library interfaced by the \texttt{svm()} function
actually returns $a_i y_i$ as $i$-th coefficiant and the
\emph{negative} $\rho$, so in fact uses the formula:
\[
\sum_i{\mathrm{coef}_i K(x_i, n)} - \rho
\]
\noindent where the training examples (=training data) are labeled \{1,-1\} (!).
A simplified \textsf{R} function for prediction with linear kernel would be:
\begin{smallexample}
svmpred <- function (m, newdata, K=crossprod)
\{
## this guy does the computation:
pred.one <- function (x)
sign(sum(sapply(1:m$tot.nSV, function (j)
K(m$SV[j,], x) * m$coefs[j]
)
) - m$rho
)
## this is just for convenience:
if (is.vector(newdata))
newdata <- t(as.matrix(x))
sapply (1:nrow(newdata),
function (i) pred.one(newdata[i,]))
\}
\end{smallexample}
\noindent where \texttt{pred.one()} does the actual prediction for one
new data vector, the remainder is just a convenience for prediction of
multiple new examples. It is easy to extend this to other kernels,
just replace \texttt{K()} with the appropriate function (see the help
page for the formulas used) and supply the additional constants.
As we will see in the next section, the multi-class prediction is more
complicated, because the coefficiants of the diverse binary SVMs are
stored in a compressed format.
\section{Multiclass-classifier}
To handle $k$ classes, $k>2$, \texttt{svm()} trains all binary subclassifiers
(one-against-one-method) and then uses a voting mechanism to determine
the actual class.
Now, this means $k(k-1)/2$ classifiers, hence in principle $k(k-1)/2$
sets of SVs, coefficiants and rhos. These are stored in a compressed format:
\begin{enumerate}
\item Only one SV is stored in case it were used by several classifiers. The
\texttt{model\$SV-matrix}
is ordered by classes, and you find the starting indices
by using \texttt{nSV} (number of SVs):
\begin{smallexample}
start <- c(1, cumsum(model$nSV))
start <- start[-length(start)]
\end{smallexample}
\texttt{sum(nSV)} equals the total number of (distinct) SVs.
\item The coefficients of the SVs are stored in the
\texttt{model\$coefs}-matrix,
grouped by classes. Because the separating hyperplanes found by the
SVM algorithm has SVs on both sides, you will have two sets of
coefficients per binary classifier, and e.g., for 3 classes, you could
build a \emph{block}-matrix like this for the classifiers $(i, j)$
($i$,$j$=class numbers):
\begin{table}[h]
\center
\begin{tabular}{|c|c|c|c|}
\hline
i $\backslash$ j & 0 & 1 & 2 \\\hline
0 & X & set (0, 1)& set (0, 2)\\\hline
1 & set (1, 0) & X & set (1, 2)\\\hline
2 & set (2, 0) & set (2, 1) & X\\\hline
\end{tabular}
\end{table}
\noindent where set(i, j) are the coefficients for the classifier
(i,j), lying on the side of class j.
Because there are no entries for (i, i), we can save the diagonal and
shift up the lower triangular matrix to get
\begin{table}[h]
\center
\begin{tabular}{|c|c|c|c|}
\hline
i $\backslash$ j & 0 & 1 & 2 \\\hline
0 & set (1,0) & set (0,1) & set (0,2) \\\hline
1 & set (2,0) & set (2,1) & set (1,2) \\\hline
\end{tabular}
\end{table}
\noindent Each set (., j) has length \texttt{nSV[j]}, so of course,
there will be some filling 0s in some sets.
\texttt{model\$coefs} is the \emph{transposed}
of such a matrix, therefore for a data set with, say, 6 classes,
you get 6-1=5 columns.
The coefficients of (i, j) start at \texttt{model\$coefs[start[i],j]}
and those of (j, i) at \texttt{model\$coefs[start[j],i-1]}.
\item The $k(k-1)/2$ rhos are just linearly stored in the vector \texttt{model\$rho}.
\end{enumerate}
\newpage
\noindent The following code shows how to use this for prediction:
\begin{smallexample}
## Linear Kernel function
K <- function(i,j) crossprod(i,j)
predsvm <- function(object, newdata)
\{
## compute start-index
start <- c(1, cumsum(object$nSV)+1)
start <- start[-length(start)]
## compute kernel values
kernel <- sapply (1:object$tot.nSV,
function (x) K(object$SV[x,], newdata))
## compute raw prediction for classifier (i,j)
predone <- function (i,j)
\{
## ranges for class i and j:
ri <- start[i] : (start[i] + object$nSV[i] - 1)
rj <- start[j] : (start[j] + object$nSV[j] - 1)
## coefs for (i,j):
coef1 <- object$coefs[ri, j-1]
coef2 <- object$coefs[rj, i]
## return raw values:
crossprod(coef1, kernel[ri]) + crossprod(coef2, kernel[rj])
\}
## compute votes for all classifiers
votes <- rep(0,object$nclasses)
c <- 0 # rho counter
for (i in 1 : (object$nclasses - 1))
for (j in (i + 1) : object$nclasses)
if (predone(i,j) > object$rho[c <- c + 1])
votes[i] <- votes[i] + 1
else
votes[j] <- votes[j] + 1
## return winner (index with max. votes)
object$levels[which(votes %in% max(votes))[1]]
\}
\end{smallexample}
In case data were scaled prior fitting the model (note that this is the
default for \texttt{svm()}, the new data needs to be scaled as well
before applying the predition functions, for example using the
following code snipped (object is an object returned by \texttt{svm()},
\texttt{newdata} a data frame):
\begin{smallexample}
if (any(object$scaled))
newdata[,object$scaled] <-
scale(newdata[,object$scaled, drop = FALSE],
center = object$x.scale$"scaled:center",
scale = object$x.scale$"scaled:scale"
)
\end{smallexample}
\noindent For regression, the response needs to be scaled as well
before training, and the predictions need to be scaled back accordingly.
\end{document}