ST504 Prediction and Classification Computer lab notes week 7

This document illustrates how classification can be performed in the R software. The methodology is illustrated on the mower-owner data set.
1 Exploratory analysis
• Read the data from file
example <- read.table("t11-1.txt",header=TRUE)
• Split the data set into two new data sets according to the class variable y. Note: y indicates
the type of family, with the values 1 and 2 representing the riding-mower owners and
non-owners, respectively.
First, create logical ‘selection’ vectors:
sowner <- example$y==1
snowner <- example$y==2
Check the content of one of these vectors by typing its name in the command line, e.g.
sowner. Next, apply these ‘selection’ vectors to the first (row) index of the data frame example.
The rows corresponding to a TRUE in sowner and snowner are retained, and those
corresponding to a FALSE are dropped.
owner <- example[sowner,]
nowner <- example[snowner,]
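The logical selection mechanism can be tried on a tiny data frame. The sketch below is only an illustration: the data frame toy and its values are invented and are not the mower-owner data. It also shows the base R function split(), which produces both subsets in one call.

```r
# Toy data frame with the same column names as the lab data; values are invented.
toy <- data.frame(x1 = c(60, 85, 43, 75, 52),
                  x2 = c(18, 17, 20, 20, 16),
                  y  = c(1, 1, 2, 2, 2))

# Logical 'selection' vector: TRUE where the row belongs to class 1.
sel1 <- toy$y == 1
print(sel1)

# Indexing the rows with a logical vector keeps only the TRUE rows.
toy1 <- toy[sel1, ]
toy2 <- toy[!sel1, ]   # with two classes, negation selects class 2

# split() returns a list of data frames, one per value of y.
parts <- split(toy, toy$y)
print(names(parts))
```

The list elements parts[["1"]] and parts[["2"]] hold the same rows as toy1 and toy2, so either approach can be used.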
• Plot the data for the two classes, using a different symbol for each class (open circles for owners, filled circles for non-owners):
plot(owner$x1,owner$x2,xlim=c(20,120),ylim=c(10,25),
xlab="Income in thousands of dollars",
ylab="Lot size in thousands of square feet")
points(nowner$x1,nowner$x2,pch=19)
The result is shown in Figure 1.
Figure 1: Mower-owner example: scatter plot of lot size versus income.
2 Linear discriminant analysis - plotting the discriminant function
We now supplement the scatter plot constructed above with the linear classification function,
obtained from the multivariate normal based classification rule with equal misclassification costs
and equal prior probabilities.
• Select the X matrix for the owners and non-owners:
ownerx <- data.frame(owner$x1,owner$x2)
nownerx <- data.frame(nowner$x1,nowner$x2)
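• Compute the sample mean vectors, the sample covariance matrices and the pooled covariance matrix. Both classes contain 12 observations, so the pooled covariance matrix is the simple average of the two sample covariance matrices:
m1 <- as.vector(colMeans(ownerx))
m2 <- as.vector(colMeans(nownerx))
s1 <- cov(ownerx)
s2 <- cov(nownerx)
s <- (s1+s2)/2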
• Invert the pooled covariance matrix:
si <- solve(s)
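Averaging the two covariance matrices gives the pooled covariance matrix only because the two classes have the same size here (12 observations each). A minimal sketch of the general degrees-of-freedom weighting follows. The data are simulated and the names gA, gB, nA, nB, sA, sB are invented for this illustration, so the lab objects are not overwritten.

```r
set.seed(1)
# Simulated two-column groups of unequal size (invented data, not the lab file).
gA <- data.frame(x1 = rnorm(8, 80, 15),  x2 = rnorm(8, 20, 2))
gB <- data.frame(x1 = rnorm(12, 60, 15), x2 = rnorm(12, 18, 2))

nA <- nrow(gA); nB <- nrow(gB)
sA <- cov(gA);  sB <- cov(gB)

# Pooled covariance: each group's covariance weighted by its degrees of freedom.
s_pooled <- ((nA - 1) * sA + (nB - 1) * sB) / (nA + nB - 2)

# The simple average coincides with s_pooled only when nA == nB.
s_avg <- (sA + sB) / 2
print(s_pooled - s_avg)
```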
• Compute the components of the linear classification function
$$\hat{a}' x - \hat{m}$$
where
$$\hat{a} = S_{\text{pooled}}^{-1} (\bar{x}_1 - \bar{x}_2)$$
and
$$\hat{m} = \frac{1}{2} (\bar{x}_1 - \bar{x}_2)' S_{\text{pooled}}^{-1} (\bar{x}_1 + \bar{x}_2).$$
R commands:
a <- t(m1-m2)%*%si
m <- t(m1-m2)%*%si%*%(m1+m2)/2
• Plot the data and the linear classification function:
plot(owner$x1,owner$x2,xlim=c(20,120),ylim=c(10,25),
xlab="Income (in thousands of dollars)",
ylab="Lot size (in thousands of square feet)")
points(nowner$x1,nowner$x2,pch=19)
text(110,22,"R1",col="blue")
text(30,15,"R2",col="blue")
abline(m/a[2],-a[1]/a[2],lwd=2,col="blue")
The result is shown in Figure 2.
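The first two arguments of abline are the intercept and the slope of the boundary line. They follow from solving the boundary equation for x2. A short derivation, writing a_1 and a_2 for the two components of the coefficient vector (a[1] and a[2] in the R code) and assuming a_2 is not zero:

```latex
\begin{align*}
\hat{a}' x - \hat{m} = 0
  &\iff a_1 x_1 + a_2 x_2 = \hat{m} \\
  &\iff x_2 = \frac{\hat{m}}{a_2} - \frac{a_1}{a_2}\, x_1 \qquad (a_2 \neq 0).
\end{align*}
```

The intercept is therefore m/a[2] and the slope is -a[1]/a[2], as used in the abline call.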
3 Quadratic discriminant analysis - plotting the discriminant function
As in the linear case above, we can add the quadratic classification function to the scatter plot. We consider the case of equal misclassification costs and equal prior
probabilities. The decision rule is in this case given by:
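Assign x0 to π1 if
$$-\frac{1}{2} x_0' (S_1^{-1} - S_2^{-1}) x_0 + (\bar{x}_1' S_1^{-1} - \bar{x}_2' S_2^{-1}) x_0 - k \geq 0 \qquad (1)$$
where
$$k = \frac{1}{2} \ln\left(\frac{|S_1|}{|S_2|}\right) + \frac{1}{2} (\bar{x}_1' S_1^{-1} \bar{x}_1 - \bar{x}_2' S_2^{-1} \bar{x}_2),$$
and assign x0 to π2 otherwise.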
Figure 2: Mower-owner example: scatter plot of lot size versus income with the equal cost equal prior linear classification function.
• Invert the sample covariance matrices for classes 1 and 2:
s1i <- solve(s1)
s2i <- solve(s2)
• Compute the constant k:
k <- log(det(s1)/det(s2))/2+(t(m1)%*%s1i%*%m1-t(m2)%*%s2i%*%m2)/2
• Evaluate the left-hand side of the classifier (1) on a grid of (x1, x2) values:
x1 <- seq(20,120,2)
x2 <- seq(10,26,2)
n1 <- length(x1)
n2 <- length(x2)
f <- matrix(0,n1,n2)
for (i in 1:n1){
for (j in 1:n2){
xc <- c(x1[i],x2[j])
f[i,j] <- -t(xc)%*%(s1i-s2i)%*%xc/2+(t(m1)%*%s1i-t(m2)%*%s2i)%*%xc-k
}}
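The double loop above can also be written by wrapping the left-hand side of (1) in a function and filling the grid with outer(). The sketch below is one possible variant, not the method of these notes. It runs on simulated data, and the names gA, gB, mA, mB, sAi, sBi, kq, qscore, gx1, gx2 and fgrid are invented for the illustration.

```r
set.seed(2)
# Simulated stand-ins for the two classes (invented values, 12 rows each).
gA <- cbind(rnorm(12, 80, 15), rnorm(12, 20, 2))
gB <- cbind(rnorm(12, 60, 15), rnorm(12, 18, 2))
mA <- colMeans(gA); mB <- colMeans(gB)
sAi <- solve(cov(gA)); sBi <- solve(cov(gB))

# Constant k of rule (1), reduced from a 1x1 matrix to a plain number.
kq <- drop(log(det(cov(gA)) / det(cov(gB))) / 2 +
           (t(mA) %*% sAi %*% mA - t(mB) %*% sBi %*% mB) / 2)

# Left-hand side of rule (1) for a single point xc = c(x1, x2).
qscore <- function(xc) {
  drop(-t(xc) %*% (sAi - sBi) %*% xc / 2 +
         (t(mA) %*% sAi - t(mB) %*% sBi) %*% xc - kq)
}

gx1 <- seq(20, 120, 2)
gx2 <- seq(10, 26, 2)
# outer() passes paired coordinate vectors, so qscore is applied pair by pair.
fgrid <- outer(gx1, gx2,
               function(u, v) mapply(function(p, q) qscore(c(p, q)), u, v))
dim(fgrid)
```

The matrix fgrid has one row per value of gx1 and one column per value of gx2, which is the layout that contour() expects.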
• Plot the data and the boundary between the regions R1 and R2. In our case the boundary
is the zero contour of the left-hand side of (1).
contour(x1,x2,f,levels=c(0),col="blue",lwd=2,
xlab="Income (in thousands of dollars)",
ylab="Lot size (in thousands of square feet)")
points(owner$x1,owner$x2)
points(nowner$x1,nowner$x2,pch=19)
text(110,22,"R1",col="blue")
text(30,15,"R2",col="blue")
The result is shown in Figure 3.
Figure 3: Mower-owner example: scatter plot of lot size versus income with the equal cost equal prior quadratic classification function.
4 Linear discriminant analysis - R function lda
The multivariate normal based linear and quadratic classification rules are implemented in the R
functions lda and qda, respectively. These functions are part of the MASS package, which must
be loaded before the functions can be used. This is done by entering
library(MASS)
To perform a linear discriminant analysis, call the function lda, for instance
result <- lda(y~x1+x2,example,prior=c(.5,.5))
The first argument of the function is a model expression, similar to the model expressions
used with the function lm. The left-hand-side of the model expression contains the dependent/grouping variable, and the right-hand-side the set of predictors (explanatory variables).
The second argument is the data frame containing the variables used in the model expression.
The argument prior specifies the prior probabilities. If this argument is omitted, the
function uses the proportions of π1 and π2 objects in the sample. Check the content by entering
result in the command line. Predictions/classifications can be made with the function predict,
for instance
examplex <- example[,c(1,2)]
pr <- predict(result,examplex)
The first argument to be supplied to the function predict is an R object resulting from an lda
call; the second argument is a data frame that needs to be classified. In the above we supplied
the frame examplex, which will result in a classification of the observations in the training data
set. The function works of course also on validation data (group membership known) and on
other new data (group membership unknown) that require classification. The object pr contains
• class: predicted class membership based on maximum posterior probability,
• posterior: the posterior probability of group membership,
• x: a score on a linear discriminant.
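The three components can be inspected directly. The sketch below does so on simulated data. The data frame sim and the objects fit, pr_sim and by_hand are invented for this illustration and do not reproduce the mower-owner results.

```r
library(MASS)   # provides lda()

set.seed(3)
# Simulated two-class data with the lab's column names (invented values).
sim <- data.frame(
  x1 = c(rnorm(12, 80, 15), rnorm(12, 60, 15)),
  x2 = c(rnorm(12, 20, 2),  rnorm(12, 18, 2)),
  y  = rep(c(1, 2), each = 12)
)

fit <- lda(y ~ x1 + x2, data = sim, prior = c(0.5, 0.5))
pr_sim <- predict(fit, sim[, c("x1", "x2")])

str(pr_sim$class)        # factor of predicted class labels
head(pr_sim$posterior)   # one column per class; each row sums to 1
head(pr_sim$x)           # scores on the single linear discriminant

# The predicted class is the column with the largest posterior probability.
by_hand <- colnames(pr_sim$posterior)[apply(pr_sim$posterior, 1, which.max)]
all(by_hand == as.character(pr_sim$class))
```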
The confusion matrix for the training sample can be obtained as follows:
y <- example$y
yhat <- pr$class
table(y,yhat)
The result is
   yhat
y    1  2
  1 11  1
  2  2 10
giving an apparent error rate of 3/24 = 0.125. The function lda also has leave-one-out cross-validation (or
hold-out) functionality:
result <- lda(y~x1+x2,example,prior=c(.5,.5),CV=TRUE)
The object result contains, among others, the cross-validation classifications (to obtain these,
submit result$class) and the cross-validation posterior probabilities of group membership
(submit result$posterior). The cross-validation confusion matrix can be obtained as
yhatcv <- result$class
table(y,yhatcv)
The result is
   yhatcv
y    1  2
  1 10  2
  2  3  9
from which we easily derive the proportion of misclassified observations: 5/24 ≈ 0.208.
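Both error rates can be computed from the confusion matrices instead of by hand. The sketch below enters the counts reported in these notes as matrices. The names conf_train, conf_cv and error_rate are introduced here for illustration only.

```r
# Confusion matrices as reported in the notes (rows: true y, columns: prediction).
conf_train <- matrix(c(11, 2, 1, 10), nrow = 2,
                     dimnames = list(y = c("1", "2"), yhat = c("1", "2")))
conf_cv <- matrix(c(10, 3, 2, 9), nrow = 2,
                  dimnames = list(y = c("1", "2"), yhatcv = c("1", "2")))

# Misclassified observations are the off-diagonal counts.
error_rate <- function(tab) 1 - sum(diag(tab)) / sum(tab)

error_rate(conf_train)   # apparent error rate, 3/24
error_rate(conf_cv)      # cross-validation error rate, 5/24
```

The same function can be applied to the output of table(y,yhat) or table(y,yhatcv), since a two-way table behaves like a matrix here.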
5 Quadratic discriminant analysis - R function qda
The functionality of qda is completely analogous to that of lda. Therefore only the major steps
are explained.
• Call qda
result <- qda(y~x1+x2,example,prior=c(.5,.5))
• Obtain a classification of the training sample and compute the confusion matrix
pr <- predict(result,examplex)
yhat <- pr$class
table(y,yhat)
Result:
   yhat
y    1  2
  1 11  1
  2  2 10
• Perform a cross-validation and compute the confusion matrix
result <- qda(y~x1+x2,example,prior=c(.5,.5),CV=TRUE)
yhatcv <- result$class
table(y,yhatcv)
Cross-validation confusion matrix:
   yhatcv
y    1  2
  1 10  2
  2  3  9
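To see what the option CV=TRUE automates, the leave-one-out procedure can be written as an explicit loop: each observation is left out in turn, the rule is fitted on the remaining ones, and the left-out observation is classified. The sketch below is an illustration on simulated data. The names sim, loo_class, fit_i and cv_fit are invented, and the last table merely lets the two sets of classifications be compared.

```r
library(MASS)   # provides qda()

set.seed(4)
# Simulated two-class data with the lab's column names (invented values).
sim <- data.frame(
  x1 = c(rnorm(12, 80, 15), rnorm(12, 60, 15)),
  x2 = c(rnorm(12, 20, 2),  rnorm(12, 18, 2)),
  y  = rep(c(1, 2), each = 12)
)

n <- nrow(sim)
loo_class <- character(n)
for (i in seq_len(n)) {
  # Fit on all rows except row i, then classify row i.
  fit_i <- qda(y ~ x1 + x2, data = sim[-i, ], prior = c(0.5, 0.5))
  loo_class[i] <- as.character(predict(fit_i, sim[i, ])$class)
}
table(y = sim$y, yhat_loo = loo_class)

# Built-in leave-one-out classifications, shown next to the manual ones.
cv_fit <- qda(y ~ x1 + x2, data = sim, prior = c(0.5, 0.5), CV = TRUE)
table(manual = loo_class, builtin = cv_fit$class)
```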