SSL Chapter 4
Risk of Semi-supervised Learning: How Unlabeled Data Can Degrade Performance of Generative Classifiers
Amount of Data
Is it better to have more unlabeled data?
The literature generally presents unlabeled data as having positive value.
Unlabeled data should certainly not be discarded (O'Neill, 1978).

Model selection: Correct Model
Assume samples (Xv, Yv) are drawn from the joint distribution P(Xv, Yv).
Suppose we know there exists a parameter set Q such that
P(Xv, Yv | Q) = P(Xv, Yv)  =>  "correct model"
Extra labeled or unlabeled data will reduce the error.
Labeled data are more effective.

Detailed Analysis
Shahshahani & Landgrebe
Observed that unlabeled data can degrade the performance of NB with Gaussian variables.
Attributed this to deviations from the modeling assumptions.
Their suggestion: unlabeled data should be used when the labeled data alone produce poor performance.
Detailed Analysis
Nigam et al. (2000)
Reasons for poor performance:
Numerical problems in the learning method
Mismatch between the natural clusters in the data and the actual labels
Various studies report that adding unlabeled data degrades classification accuracy.
Empirical Study

Notation and assumptions
Binary classification
Xv is an instance of the data; Xvi is an attribute of Xv
All classifiers use EM to maximize the likelihood

Empirical Study

Bayes classifier with an increasing number of unlabeled samples.
Data generated randomly (artificial data).
Xi and Xj are independent given the class label.
Correct model (a data-generation sketch follows below).
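As an illustration only (not the chapter's actual experimental setup), here is a minimal sketch of how such artificial data can be generated: a binary class and binary attributes that are conditionally independent given the class, so the conditional-independence assumption holds by construction. The function and parameter names are hypothetical.

```python
import numpy as np

# Hypothetical sketch (not the chapter's code) of generating artificial data for
# the "correct model" experiment: a binary class Y and binary attributes X1..Xk
# that are conditionally independent given Y.  A fraction of the samples keeps
# its label; the rest are treated as unlabeled.
rng = np.random.default_rng(0)

def generate(n, k=10, p_y1=0.5, labeled_frac=0.1):
    theta = rng.uniform(0.1, 0.9, size=(2, k))   # theta[c, i] = P(Xi = 1 | Y = c)
    y = rng.binomial(1, p_y1, size=n)            # class labels
    x = rng.binomial(1, theta[y, :])             # attributes, shape (n, k)
    labeled = rng.random(n) < labeled_frac       # mask of samples that keep labels
    return x, y, labeled

x, y, labeled = generate(2000)
print(x.shape, int(labeled.sum()), "labeled samples")
```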
Empirical Study

A tree-augmented naive Bayes (TAN) classifier is used.
Each attribute depends directly on the class and on at most one other attribute.
The model is incorrect.
Empirical Study

A more complex generating model, still under the TAN assumptions.
With few labeled samples, performance improves as unlabeled data are added.
The model is still incorrect.
Empirical Study

NB classifier
Real data with binary classes (UCI repository).
Unlabeled data help when the labeled set is small.
Similar to the previous case.
Empirical Study
Summary of first part
Correct model => benefits from unlabeled data are guaranteed
Incorrect model => unlabeled data may degrade performance

The assumed model and the characteristics of the distribution of data and classes do not match.
How do we know a priori that the chosen model is the "correct" one?

Asymptotic Bias
AL: asymptotic bias obtained with labeled data
Au: asymptotic bias obtained with unlabeled data
AL and Au can be different.
Scenario:
Train with labeled data so that the result is close to AL.
Then add a huge amount of unlabeled data.
The result may now tend toward Au.
Toy Problem : Gender Prediction
G: Girl or Boy (the class to predict)
C: mother craved chocolate, Yes or No
W: mother's weight gain, More or Less
W and G are conditionally independent given C
Structure: G -> C -> W
P(G,C,W) = P(G) P(C|G) P(W|C)

Toy Problem : Gender Prediction
P(G = Boy) = 0.5
P(C = No | G = Boy) = 0.1
P(C = No | G = Girl) = 0.8
P(W = Less | C = No) = 0.7
P(W = Less | C = Yes) = 0.2
From these we can compute (see the sketch below):
P(W = Less | G = Boy) = 0.25
P(W = Less | G = Girl) = 0.6
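The last two probabilities follow by marginalizing C out of P(W | C) P(C | G). A quick sanity check with the numbers above (the variable names are mine, not the chapter's):

```python
# Sanity check of P(W=Less | G) = sum_C P(W=Less | C) P(C | G)
# with the toy-problem parameters above.
p_c_no = {"Boy": 0.1, "Girl": 0.8}        # P(C = No | G)
p_w_less = {"No": 0.7, "Yes": 0.2}        # P(W = Less | C)

for g in ("Boy", "Girl"):
    p = p_w_less["No"] * p_c_no[g] + p_w_less["Yes"] * (1 - p_c_no[g])
    print(f"P(W=Less | G={g}) = {p:.2f}")  # Boy: 0.25, Girl: 0.60
```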

Toy Problem : Gender Prediction

From the independence assumption, P(G | C, W) = P(G | C), and Bayes' rule gives:
P(G = Girl | C = No) = 0.89
P(G = Boy | C = No) = 0.11
P(G = Girl | C = Yes) = 0.18
P(G = Boy | C = Yes) = 0.82
So the optimal rule is: if C = No choose G = Girl, otherwise choose G = Boy (a small sketch of this computation follows).
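A minimal sketch of this Bayes-rule computation under the true model (not code from the chapter; names are illustrative):

```python
# Optimal classification under the true model G -> C -> W: since W is
# independent of G given C, P(G | C, W) = P(G | C), computed with Bayes' rule.
p_g = {"Boy": 0.5, "Girl": 0.5}
p_c_no = {"Boy": 0.1, "Girl": 0.8}        # P(C = No | G)

for c in ("No", "Yes"):
    joint = {g: p_g[g] * (p_c_no[g] if c == "No" else 1 - p_c_no[g]) for g in p_g}
    z = sum(joint.values())
    post = {g: joint[g] / z for g in joint}
    best = max(post, key=post.get)
    print(f"C={c}: P(Girl|C)={post['Girl']:.2f}, P(Boy|C)={post['Boy']:.2f} -> choose {best}")
# C=No -> Girl (0.89), C=Yes -> Boy (0.82)
```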

Toy Problem : Gender Prediction
Incorrect Model

Now assume the structure C <- G -> W:
C and W are conditionally independent given G
P(G,C,W) = P(G) P(C|G) P(W|G)
Suppose an "oracle" gave us P(C|G).
We still need to estimate P(G) and P(W|G).
Toy Problem : Gender Prediction
Incorrect Model

Using only labeled data:
The estimates are unbiased, with variance inversely proportional to the size of DL.
Even a small DL will produce good estimates (see the small simulation below).
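A tiny simulation of this point (illustrative only; the sample size is my choice): frequency estimates from a modest labeled set already land close to the true values P(G = Girl) = 0.5, P(W = Less | Girl) = 0.6, P(W = Less | Boy) = 0.25.

```python
import random

# Illustrative simulation: frequency estimates of P(G) and P(W=Less | G)
# from a small labeled set DL drawn from the true model G -> C -> W.
random.seed(0)

def sample():
    g = "Boy" if random.random() < 0.5 else "Girl"
    c = "No" if random.random() < (0.1 if g == "Boy" else 0.8) else "Yes"
    w = "Less" if random.random() < (0.7 if c == "No" else 0.2) else "More"
    return g, c, w

dl = [sample() for _ in range(200)]                 # a small labeled set
girls = [s for s in dl if s[0] == "Girl"]
boys = [s for s in dl if s[0] == "Boy"]
print("P(G=Girl)        ~", len(girls) / len(dl))
print("P(W=Less|G=Girl) ~", sum(s[2] == "Less" for s in girls) / len(girls))
print("P(W=Less|G=Boy)  ~", sum(s[2] == "Less" for s in boys) / len(boys))
```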

Toy Problem : Gender Prediction
Incorrect Model

Estimates from labeled data:
P(G = Boy) ~ 0.5
P(W = Less | G = Girl) ~ 0.6
P(W = Less | G = Boy) ~ 0.25

Resulting a posteriori probabilities for G (computed in the sketch below):

                  C=No,W=Less  C=No,W=More  C=Yes,W=Less  C=Yes,W=More
P(G=Girl|C,W)        0.95         0.81         0.35          0.11
P(G=Boy|C,W)         0.05         0.19         0.65          0.89
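The table can be reproduced by plugging the labeled-data estimates into the incorrect factorization P(G) P(C|G) P(W|G) and normalizing. A sketch (the function name and structure are mine, not the chapter's):

```python
# Posteriors under the *incorrect* model P(G,C,W) = P(G) P(C|G) P(W|G), with
# P(C|G) from the oracle and P(G), P(W|G) estimated from labeled data.
def incorrect_model_posteriors(p_w_less):            # p_w_less[g] = est. P(W=Less | G=g)
    p_g = {"Boy": 0.5, "Girl": 0.5}
    p_c_no = {"Boy": 0.1, "Girl": 0.8}               # oracle P(C = No | G)
    for c in ("No", "Yes"):
        for w in ("Less", "More"):
            joint = {}
            for g in p_g:
                pc = p_c_no[g] if c == "No" else 1 - p_c_no[g]
                pw = p_w_less[g] if w == "Less" else 1 - p_w_less[g]
                joint[g] = p_g[g] * pc * pw
            z = sum(joint.values())
            print(f"C={c}, W={w}: P(Girl)={joint['Girl']/z:.2f}, P(Boy)={joint['Boy']/z:.2f}")

# Labeled-data estimates: P(W=Less|Girl) ~ 0.6, P(W=Less|Boy) ~ 0.25
incorrect_model_posteriors({"Girl": 0.6, "Boy": 0.25})
```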
Toy Problem : Gender Prediction
Incorrect Model

Classify with the maximum a posteriori value of G.
The "bias" from the "true" a posteriori probabilities is not zero.
It nevertheless produces the same optimal Bayes rule as in the previous (correct-model) case.
The classifier is therefore likely to yield the minimum classification error.

Toy Problem : Gender Prediction
Incorrect Model + Unlabeled Data
In the limit DL / DU -> 0, the maximum-likelihood estimates tend to:
P(G = Boy) = 0.5
P(W = Less | G = Girl) = 0.78
P(W = Less | G = Boy) = 0.07

Toy Problem : Gender Prediction
Incorrect Model + Unlabeled Data

The a posteriori probabilities for G (computed as in the earlier sketch; see below):

                  C=No,W=Less  C=No,W=More  C=Yes,W=Less  C=Yes,W=More
P(G=Girl|C,W)        0.99         0.65         0.71          0.05
P(G=Boy|C,W)         0.01         0.35         0.29          0.95
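The same computation as in the earlier sketch, repeated here so the block runs on its own, now with the estimates reached in the unlabeled-data limit; it reproduces the table above.

```python
# Incorrect-model posteriors with the unlabeled-data limit estimates.
def incorrect_model_posteriors(p_w_less):
    p_g = {"Boy": 0.5, "Girl": 0.5}
    p_c_no = {"Boy": 0.1, "Girl": 0.8}               # oracle P(C = No | G)
    for c in ("No", "Yes"):
        for w in ("Less", "More"):
            joint = {g: p_g[g]
                        * (p_c_no[g] if c == "No" else 1 - p_c_no[g])
                        * (p_w_less[g] if w == "Less" else 1 - p_w_less[g])
                     for g in p_g}
            z = sum(joint.values())
            print(f"C={c}, W={w}: P(Girl)={joint['Girl']/z:.2f}, P(Boy)={joint['Boy']/z:.2f}")

# Limit estimates when DL / DU -> 0
incorrect_model_posteriors({"Girl": 0.78, "Boy": 0.07})
```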
Toy Problem : Gender Prediction
Incorrect Model + Unlabeled Data
The classifier now chooses Girl over Boy in 3 of the 4 configurations of (C, W).
The prediction has changed from the optimal rule (for C = Yes, W = Less it now outputs Girl).
The expected error rate increases.
What happened?
The unlabeled data changed the asymptotic limit of the estimates.
When the model is incorrect, the effect of unlabeled data becomes important.

Asymptotic Analysis
(Xv, Yv): instance vector and class label
Binary classes with values -1 and +1
Assume 0-1 loss
Apply the Bayes rule to obtain the Bayes error
n independent samples: l labeled and u unlabeled, with n = l + u

Asymptotic Analysis
With probability (1 – h) a sample is
unlabeled
 With probability h a sample is labeled
 P(Xv,Yv | Q) is the parametric form
 Use EM

Asymptotic Analysis
Likelihood of the labeled and unlabeled data:

L(Q) = ∏_{i=1}^{l} P(x_i, y_i | Q) · ∏_{j=1}^{u} P(x_j | Q)

where each unlabeled sample contributes the marginal

P(x_j | Q) = P(x_j | y = +1, Q) P(y = +1 | Q) + P(x_j | y = -1, Q) P(y = -1 | Q)
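The slides state that all classifiers maximize this likelihood with EM. As an illustration only, here is a compact sketch of semi-supervised EM for a naive Bayes model with binary attributes; em_nb, the data layout, and the smoothing constants are assumptions of this sketch, not the chapter's implementation.

```python
import numpy as np

# Sketch of semi-supervised EM for naive Bayes with binary attributes.
# Labeled samples contribute P(x, y | Q); unlabeled samples contribute
# P(x | Q) = sum_y P(x | y, Q) P(y | Q).
def em_nb(x, y, labeled, n_iter=50):
    n, k = x.shape
    # Responsibilities q(y | x): fixed at the true class for labeled samples,
    # 0.5 / 0.5 as a starting point for unlabeled samples.
    resp = np.zeros((n, 2))
    resp[labeled, y[labeled]] = 1.0
    resp[~labeled] = 0.5
    for _ in range(n_iter):
        # M-step: class prior and per-class Bernoulli parameters (add-one smoothing).
        prior = (resp.sum(axis=0) + 1) / (n + 2)
        theta = (resp.T @ x + 1) / (resp.sum(axis=0)[:, None] + 2)
        # E-step: recompute class posteriors, updating only the unlabeled samples.
        log_p = np.log(prior) + x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T
        post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        resp[~labeled] = post[~labeled]
    return prior, theta

# e.g. with the hypothetical generator from the earlier sketch:
#   prior, theta = em_nb(x, y, labeled)
```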
Asymptotic Analysis
Parameter estimation:

Q̂n is obtained by maximizing Σ_{i=1}^{n} log p(z_i | Q), where z_i denotes the i-th sample (with or without its label).

Q̂n -> Q* as n -> infinity, where Q* maximizes E_{P(z)}[log p(z | Q)].
Theorem on Asymptotic Analysis

The limiting value Q* of the maximum-likelihood estimates is

Q* = argmax_Q ( h · E_{P(Xv,Yv)}[log P(Xv, Yv | Q)] + (1 - h) · E_{P(Xv,Yv)}[log P(Xv | Q)] )
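To make the theorem concrete, here is a small numerical sketch (mine, not the chapter's) that grid-searches the free parameters of the incorrect model in the toy gender problem so as to maximize this convex combination of expected log-likelihoods. With h = 1 it recovers the labeled-data optimum (0.60, 0.25); with h = 0 it tends to roughly (0.78, 0.07), matching the unlabeled-data limit quoted earlier.

```python
import math
from itertools import product

# Under the incorrect model P(G) P(C|G) P(W|G), with P(G) and P(C|G) fixed by
# the oracle, grid-search a = P(W=Less|Girl), b = P(W=Less|Boy) maximizing
#   h * E[log P(G,C,W | Q)] + (1 - h) * E[log P(C,W | Q)],
# with the expectation under the true distribution P(G) P(C|G) P(W|C).
P_G = {"Girl": 0.5, "Boy": 0.5}
P_C_NO = {"Girl": 0.8, "Boy": 0.1}          # true (and oracle) P(C = No | G)
P_W_LESS_C = {"No": 0.7, "Yes": 0.2}        # true P(W = Less | C)

def true_joint():
    for g, c, w in product(("Girl", "Boy"), ("No", "Yes"), ("Less", "More")):
        pc = P_C_NO[g] if c == "No" else 1 - P_C_NO[g]
        pw = P_W_LESS_C[c] if w == "Less" else 1 - P_W_LESS_C[c]
        yield g, c, w, P_G[g] * pc * pw

def objective(a, b, h):
    p_w_less_g = {"Girl": a, "Boy": b}
    val = 0.0
    for g, c, w, p_true in true_joint():
        def model(gg):                       # incorrect-model joint for this (c, w)
            pc = P_C_NO[gg] if c == "No" else 1 - P_C_NO[gg]
            pw = p_w_less_g[gg] if w == "Less" else 1 - p_w_less_g[gg]
            return P_G[gg] * pc * pw
        val += p_true * (h * math.log(model(g))                               # labeled term
                         + (1 - h) * math.log(sum(model(gg) for gg in P_G)))  # unlabeled term
    return val

grid = [i / 100 for i in range(1, 100)]
for h in (1.0, 0.0):
    a, b = max(product(grid, grid), key=lambda ab: objective(ab[0], ab[1], h))
    print(f"h={h}: P(W=Less|Girl)={a:.2f}, P(W=Less|Boy)={b:.2f}")
# h=1.0 recovers the labeled optimum (0.60, 0.25); h=0.0 gives ~ (0.78, 0.07)
```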
Theorem on Asymptotic Analysis
Qh* is the value of Q that maximizes the expression in the theorem above, for labeling probability h.
Ql* (= Q1*) is the optimum obtained from labeled data alone.
Qu* (= Q0*) is the optimum obtained from unlabeled data alone.

Theorem on Asymptotic Analysis

Model is correct.
P(Xv, Yv | QT) = P(Xv, Yv) for some QT.
Then QT = Ql* = Qu* = Qh*.
In this case the asymptotic bias is zero.

Theorem on Asymptotic Analysis

Model is incorrect.
Assume P(Xv, Yv) does not belong to the family P(Xv, Yv | Q).
Let e(Q) be the classification error with parameter Q.
Assume e(Ql*) < e(Qu*).

Theorem on Asymptotic Analysis
Labeled data alone train the model so that the error approaches e(Ql*).
As unlabeled data are added, the error moves toward e(Qu*).

So, under these assumptions, using only labeled data results in a smaller classification error.