Learning the Learning Rate for Prediction with Expert Advice

Wouter M. Koolen, Tim van Erven, Peter D. Grünwald

Lorentz Workshop, Leiden, Thursday 20 November 2014
Online Learning Algorithms

[Figure: online learning algorithms placed along two axes, "Work in Practice?" versus "Theoretical Performance Guarantees"]
Learning as a Game

[Figure: regret, from 0 (perfect) to high (bad), across problem instances; a worst-case safe algorithm stays at the minimax level on every instance]
Practice is not Adversarial

[Figure: same axes; a special-purpose algorithm achieves much lower regret than the worst-case safe algorithm on some problem instances, but lacks the minimax guarantee]
Luckiness

[Figure: same axes; a question mark for an algorithm that matches the special-purpose algorithm on favourable instances while keeping the minimax guarantee of the worst-case safe algorithm]
Fundamental model for learning: Hedge setting

- K experts
- In round t = 1, 2, ...:
  - Learner plays distribution $w_t = (w_t^1, \dots, w_t^K)$ on experts
  - Adversary reveals expert losses $\ell_t = (\ell_t^1, \dots, \ell_t^K) \in [0,1]^K$
  - Learner incurs loss $w_t^\top \ell_t$
- Evaluation criterion is the regret:

  $$R_T := \underbrace{\sum_{t=1}^T w_t^\top \ell_t}_{\text{Learner}} \;-\; \underbrace{\min_k \sum_{t=1}^T \ell_t^k}_{\text{best expert}}$$
Canonical algorithm for the Hedge setting

Hedge algorithm with learning rate $\eta$:

  $$w_t^k := \frac{e^{-\eta L_{t-1}^k}}{\sum_k e^{-\eta L_{t-1}^k}} \qquad \text{where} \qquad L_{t-1}^k = \sum_{s=1}^{t-1} \ell_s^k.$$

The tuning $\eta = \eta^{\text{worst case}} := \sqrt{8 \ln K / T}$ results in

  $$R_T \le \sqrt{T/2 \, \ln K}$$

and we have matching lower bounds.

Case closed?
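To make the update concrete, here is a minimal Python sketch of Hedge with the worst-case tuning; the random loss matrix and horizon are illustrative stand-ins for whatever the adversary reveals.

import numpy as np

def hedge_regret(losses, eta):
    # Run Hedge with fixed learning rate eta on a (T x K) loss matrix in [0, 1];
    # return the regret against the best expert in hindsight.
    T, K = losses.shape
    L = np.zeros(K)                           # cumulative expert losses L_{t-1}^k
    learner = 0.0
    for t in range(T):
        w = np.exp(-eta * (L - L.min()))      # shift by the minimum for numerical stability
        w /= w.sum()
        learner += w @ losses[t]              # Learner incurs w_t . l_t
        L += losses[t]
    return learner - L.min()

rng = np.random.default_rng(0)
T, K = 10_000, 4
losses = rng.random((T, K))
eta_wc = np.sqrt(8 * np.log(K) / T)           # worst-case tuning
print(hedge_regret(losses, eta_wc), np.sqrt(T / 2 * np.log(K)))   # regret vs. the bound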
Gap between Theory and Practice

Practitioners report that tuning $\eta \gg \eta^{\text{worst case}}$ works much better. [DGGS13]
Theory and Practice getting closer

Practitioners report that tuning $\eta \gg \eta^{\text{worst case}}$ works much better. [DGGS13]

Series of worst-case data-dependent improvements, refining the $T$ in

  $$R_T \le \sqrt{T/2 \, \ln K}$$

successively to $L_T^*$, $V_T$, $\Delta_T$, and extension to scenarios where Follow-the-Leader ($\eta = \infty$) shines (IID losses):

  $$R_T \le \min\bigl\{R_T^{\text{worst case}},\ R_T^\infty\bigr\}.$$

Case closed?
Menu

Grand goal: be almost as good as the best learning rate $\eta$:

  $$R_T \approx \min_\eta R_T^\eta.$$

- Example problematic data
- Key ideas
Current η tunings miss the boat

[Figure: regret $R_T^\eta$ as a function of $\eta$ (from 0.001 to 100) at $T = 100000$, for four constructed data sets:
- Bad expert: expert 0 suffers loss 1 in every round
- FTL worst case: experts 1 and 2 have alternating losses 0,1,0,1,... and 1,0,1,0,...
- WC-eta killer: expert 1 starts with a long run of 0s and expert 2 with a long run of 1s, after which their losses alternate
- Combined: the patterns above combined]
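To get a feel for why no single tuning works, one can regenerate loss sequences in the same spirit and sweep η. The snippet below reuses the hedge_regret sketch from above; the sequences mimic the "FTL worst case" and "Bad expert" patterns and need not match the exact data behind the plot.

import numpy as np

T = 100_000
t = np.arange(T)

# "FTL worst case": two experts whose losses alternate 0,1,0,1,... and 1,0,1,0,...
ftl_worst = np.column_stack([t % 2, 1 - t % 2]).astype(float)

# "Bad expert": add a third expert that suffers loss 1 in every round
bad_expert = np.column_stack([ftl_worst, np.ones(T)])

for eta in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    print(eta, hedge_regret(ftl_worst, eta), hedge_regret(bad_expert, eta))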
LLR algorithm in a nutshell

LLR
- maintains a finite grid $\eta^1, \dots, \eta^{i_{\max}}, \eta^{\text{ah}}$
- cycles over the grid. For each $\eta^i$:
  - play the $\eta^i$ Hedge weights
  - evaluate $\eta^i$ by its mixability gap
  - until its budget has doubled
- adds the next lower grid point on demand

Resources:
- Time: O(K) per round (same as Hedge).
- Memory: O(ln T) → O(1).
Unavoidable notation

  $$h_t = w_t^\top \ell_t, \qquad \text{(Hedge loss)}$$
  $$m_t = \frac{-1}{\eta} \ln \sum_k w_t^k e^{-\eta \ell_t^k}, \qquad \text{(Mix loss)}$$
  $$\delta_t = h_t - m_t. \qquad \text{(Mixability gap)}$$

And capitals denote cumulatives:

  $$\Delta_T = \sum_{t=1}^T \delta_t, \quad \dots$$
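As a sanity check, all three quantities are one-liners; here is a small NumPy sketch for a single round (the weights, losses, and η are arbitrary illustrative values).

import numpy as np

def hedge_mix_gap(w, losses, eta):
    # Hedge loss, mix loss, and mixability gap for a single round
    h = w @ losses                                  # Hedge loss h_t
    m = -np.log(w @ np.exp(-eta * losses)) / eta    # Mix loss m_t
    return h, m, h - m                              # delta_t = h_t - m_t >= 0

w = np.array([0.5, 0.3, 0.2])                       # illustrative weights and losses
losses = np.array([0.0, 1.0, 0.4])
print(hedge_mix_gap(w, losses, eta=2.0))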
Key Idea 1: Monotone regret lower bound

Problem: the regret $R_T^\eta$ is not increasing with $T$. But we have a monotone lower bound:

  $$R_T^\eta \ge \Delta_T^\eta$$

Proof:

  $$R_T^\eta = H_T - L_T^* = \underbrace{H_T - M_T}_{\text{mixability gap}} + \underbrace{M_T - L_T^*}_{\text{mix loss regret}}$$

Now use

  $$M_T = \frac{-1}{\eta} \ln \sum_k \frac{1}{K} e^{-\eta L_T^k} \;\in\; L_T^* + \Bigl[0, \frac{\ln K}{\eta}\Bigr].$$

Upshot: measure the quality of each $\eta$ by its cumulative mixability gap.
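The bracketing of $M_T$ follows by comparing the uniform mixture to its largest term and to the best expert's term alone:

\begin{align*}
\sum_k \tfrac{1}{K} e^{-\eta L_T^k} \le \max_k e^{-\eta L_T^k} = e^{-\eta L_T^*}
  \quad &\Longrightarrow \quad M_T \ge L_T^*, \\
\sum_k \tfrac{1}{K} e^{-\eta L_T^k} \ge \tfrac{1}{K} e^{-\eta L_T^*}
  \quad &\Longrightarrow \quad M_T \le L_T^* + \frac{\ln K}{\eta}.
\end{align*}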
Key Idea 2: Grid of η suffices

For $\gamma \ge 1$:

  $$\delta_t^{\gamma\eta} \le \gamma \, e^{(\gamma-1)(\ln K + \eta)} \, \delta_t^\eta$$

I.e. $\delta_t^\eta$ cannot be much better than $\delta_t^{\gamma\eta}$. An exponentially spaced grid of $\eta$ suffices.
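The slides do not specify the grid itself; as an illustration only, a geometrically spaced grid reaching down from 1 to some lowest rate could be built like this.

import numpy as np

def eta_grid(eta_low, eta_high=1.0, gamma=2.0):
    # Geometric grid eta_high, eta_high/gamma, eta_high/gamma^2, ... down past eta_low
    n = int(np.ceil(np.log(eta_high / eta_low) / np.log(gamma))) + 1
    return eta_high / gamma ** np.arange(n)

print(eta_grid(1e-3))   # 1.0, 0.5, 0.25, ..., ~0.00098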
Key Idea 3: Lowest η is “AdaHedge”

AdaHedge:

  $$\eta_t^{\text{ah}} := \frac{\ln K}{\Delta_{t-1}^{\text{ah}}}$$

Result:

  $$R_T \le \sum_{i=1}^{i_{\max}} \Delta_T^i + c \, \Delta_T^{\text{ah}}$$
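A compact Python sketch of the AdaHedge recursion follows; one simplification is that the $\eta = \infty$ first round of the actual algorithm is replaced by a large finite learning rate.

import numpy as np

def logsumexp(a):
    # stable log(sum(exp(a)))
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def adahedge(losses):
    # AdaHedge sketch: eta_t = ln(K) / Delta_{t-1}, where Delta is the
    # cumulative mixability gap accumulated so far.
    T, K = losses.shape
    L = np.zeros(K)                                    # cumulative expert losses
    Delta, learner = 0.0, 0.0
    for t in range(T):
        eta = np.log(K) / Delta if Delta > 0 else 1e4  # large finite stand-in for eta = infinity
        logw = -eta * L - logsumexp(-eta * L)          # Hedge weights (log domain)
        h = np.exp(logw) @ losses[t]                   # Hedge loss h_t
        m = -logsumexp(logw - eta * losses[t]) / eta   # Mix loss m_t
        Delta += h - m                                 # add mixability gap delta_t
        learner += h
        L += losses[t]
    return learner - L.min()

rng = np.random.default_rng(1)
print(adahedge(rng.random((10_000, 5))))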
Key Idea 4: Budgeted timesharing

Active grid points

  $$\eta^1,\ \eta^2,\ \dots,\ \eta^{i_{\max}},\ \eta_t^{\text{ah}}$$

with (heavy-tailed) prior distribution

  $$\pi^1,\ \pi^2,\ \dots,\ \pi^{i_{\max}},\ \pi^{\text{ah}}.$$

LLR maintains the invariant

  $$\frac{\Delta_T^1}{\pi^1} \approx \frac{\Delta_T^2}{\pi^2} \approx \dots \approx \frac{\Delta_T^{i_{\max}}}{\pi^{i_{\max}}} \approx \frac{\Delta_T^{\text{ah}}}{\pi^{\text{ah}}}$$

by running each $\eta^i$ in turn until its cumulative mixability gap has doubled. Then, for any $j$,

  $$\sum_{i=1}^{i_{\max}} \Delta_T^i = \sum_{i=1}^{i_{\max}} \pi^i \, \frac{\Delta_T^i}{\pi^i} \approx \frac{\Delta_T^j}{\pi^j} \sum_{i=1}^{i_{\max}} \pi^i \le \frac{\Delta_T^j}{\pi^j}.$$
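The following is a rough, heavily simplified sketch of the budgeted timesharing loop, not the actual LLR algorithm: the grid and prior are fixed up front, the AdaHedge grid point and on-demand grid growth are omitted, and the doubling rule is only one plausible reading of the budget condition above.

import numpy as np

def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def llr_sketch(losses, etas, prior):
    # Cycle over a fixed grid of learning rates; keep playing the current eta^i
    # until its cumulative mixability gap, measured relative to its prior mass,
    # has doubled, then move to the next grid point.
    T, K = losses.shape
    Delta = np.full(len(etas), 1e-3)       # per-grid-point cumulative mixability gaps
    i = 0
    budget = 2 * Delta[i] / prior[i]       # doubling target for the current grid point
    L = np.zeros(K)
    learner = 0.0
    for t in range(T):
        eta = etas[i]
        logw = -eta * L - logsumexp(-eta * L)          # Hedge weights for eta^i
        h = np.exp(logw) @ losses[t]                   # Hedge loss
        m = -logsumexp(logw - eta * losses[t]) / eta   # Mix loss
        Delta[i] += h - m                              # charge the gap to eta^i
        learner += h
        L += losses[t]
        if Delta[i] / prior[i] >= budget:              # budget doubled: rotate the grid
            i = (i + 1) % len(etas)
            budget = 2 * Delta[i] / prior[i]
    return learner - L.min()

etas = np.array([1.0, 0.1, 0.01])                      # illustrative grid
prior = np.array([0.5, 0.3, 0.2])                      # illustrative (heavy-tailed) prior
rng = np.random.default_rng(2)
print(llr_sketch(rng.random((10_000, 3)), etas, prior))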
Putting it all together

Two bounds:

  $$R_T \le \tilde{O}\Bigl(\tfrac{1}{\eta} \ln K \, \ln \eta \; R_T^\eta\Bigr) \quad \text{for all } \eta \in [\eta_{t^*}^{\text{ah}}, 1]$$

  $$R_T \le \tilde{O}\bigl(R_T^\infty\bigr)$$
Run on synthetic data (T = 2 · 10^7)

[Figure: regret as a function of the learning rate η (from 10^-4 to 10^2) for Hedge(η), compared with the worst-case bound at the worst-case η, AdaHedge, FlipFlop, and LLR with $\eta_{t^*}^{\text{ah}}$]
Conclusion

- Higher learning rates often achieve lower regret
  - in practice
  - on constructed data
- Learning the Learning Rate (LLR) algorithm
  - performance close to the best learning rate in hindsight

Open problems:
- LLR is a proof of concept. Can we do it simpler, prettier, smoother and tighter?
Thank you!

[DGGS13] Marie Devaine, Pierre Gaillard, Yannig Goude, and Gilles Stoltz. Forecasting electricity consumption by aggregating specialized experts; a review of the sequential aggregation of specialized experts, with an application to Slovakian and French country-wide one-day-ahead (half-)hourly predictions. Machine Learning, 90(2):231–260, February 2013.