Learning the Learning Rate
for Prediction with Expert Advice
Wouter M. Koolen
Tim van Erven
Peter D. Grünwald
Lorentz Workshop Leiden, Thursday 20th November, 2014
Online Learning Algorithms
[Diagram: the tension between "Work in Practice?" and "Theoretical Performance Guarantees".]
Learning as a Game
[Figure: regret, from 0 (perfect) to high (bad), across problem instances; a worst-case safe algorithm sits at the flat minimax level.]
Practice is not Adversarial
[Figure: same axes, now also showing a special-purpose algorithm alongside the worst-case safe algorithm at the minimax level.]
Luckiness
[Figure: same axes; a question mark asks for an algorithm that is both worst-case safe and competitive with the special-purpose algorithm on favourable instances.]
Fundamental model for learning: Hedge setting

▶ $K$ experts
▶ In round $t = 1, 2, \ldots$
  ▶ Learner plays distribution $w_t = (w_t^1, \ldots, w_t^K)$ on experts
  ▶ Adversary reveals expert losses $\ell_t = (\ell_t^1, \ldots, \ell_t^K) \in [0, 1]^K$
  ▶ Learner incurs loss $w_t^\top \ell_t$
▶ Evaluation criterion is the regret:

  $$R_T := \underbrace{\sum_{t=1}^T w_t^\top \ell_t}_{\text{Learner}} - \underbrace{\min_k \sum_{t=1}^T \ell_t^k}_{\text{best expert}}$$
Canonical algorithm for the Hedge setting

Hedge algorithm with learning rate $\eta$:

$$w_t^k := \frac{e^{-\eta L_{t-1}^k}}{\sum_k e^{-\eta L_{t-1}^k}} \qquad \text{where} \qquad L_{t-1}^k = \sum_{s=1}^{t-1} \ell_s^k.$$

The tuning $\eta = \eta^{\text{worst case}} := \sqrt{8 \ln K / T}$ results in

$$R_T \le \sqrt{(T/2) \ln K}$$

and we have matching lower bounds.

Case closed?
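To make the update concrete, here is a minimal sketch of fixed-rate Hedge in Python (numpy assumed; the stability shift is an implementation detail, not part of the slide's definition):

    import numpy as np

    def hedge_regret(losses, eta):
        """Run Hedge with fixed learning rate eta on a T x K matrix of losses in [0, 1].
        Returns the regret R_T = sum_t w_t . l_t  -  min_k L_T^k."""
        T, K = losses.shape
        L = np.zeros(K)                        # cumulative expert losses L_{t-1}^k
        learner = 0.0
        for t in range(T):
            w = np.exp(-eta * (L - L.min()))   # w_t^k propto exp(-eta L_{t-1}^k)
            w /= w.sum()                       # (shift by L.min() for numerical stability)
            learner += w @ losses[t]           # Learner incurs w_t . l_t
            L += losses[t]
        return learner - L.min()               # subtract the best expert's loss

    # Worst-case tuning from the slide: eta = np.sqrt(8 * np.log(K) / T)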
Gap between Theory and Practice

Practitioners report that tuning $\eta \gg \eta^{\text{worst case}}$ works much better. [DGGS13]

Theory and Practice getting closer

Series of worst-case data-dependent improvements: in

$$R_T \le \sqrt{(T/2) \ln K}$$

the factor $T$ is successively replaced by the smaller data-dependent quantities $L_T^*$, $V_T$, $\Delta_T$; and an extension to scenarios where Follow-the-Leader ($\eta = \infty$) shines (IID losses):

$$R_T \le \min\big\{R_T^{\text{worst case}},\, R_T^\infty\big\}$$

Case closed?
Menu

Grand goal: be almost as good as the best learning rate $\eta$:

$$R_T \approx \min_\eta R_T^\eta.$$

▶ Example problematic data
▶ Key ideas
Current $\eta$ tunings miss the boat

[Plot: regret $R_T^\eta$ (0 to 800) as a function of $\eta$ (log scale, 0.001 to 100) at $T = 100000$, for four data sets built from the expert loss sequences below.]

▶ Bad expert:       expert 0:  1 1 1 1 1 1 ...
▶ FTL worst case:   expert 1:  0 1 0 1 0 1 ...
                    expert 2:  1 0 1 0 1 0 ...
▶ WC-eta killer:    expert 1:  0 ··· 0 1 0 1 0 1 ...
                    expert 2:  1 ··· 1 0 1 0 1 0 ...
▶ Combined
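For reproducibility, a sketch of loss generators matching the patterns above (the prefix length T0 in the WC-eta killer is a free parameter here; the slide only shows the pattern, not the split point):

    import numpy as np

    def bad_expert(T):
        # expert 0: 1 1 1 1 1 1 ...
        return np.ones((T, 1))

    def ftl_worst_case(T):
        # expert 1: 0 1 0 1 ...   expert 2: 1 0 1 0 ...
        e1 = np.arange(T) % 2
        return np.column_stack([e1, 1 - e1])

    def wc_eta_killer(T, T0):
        # expert 1: 0...0 1 0 1 0 1 ...   expert 2: 1...1 0 1 0 1 0 ...
        e1 = np.concatenate([np.zeros(T0), (np.arange(T - T0) + 1) % 2])
        return np.column_stack([e1, 1 - e1])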
LLR algorithm in a nutshell

LLR
▶ maintains a finite grid $\eta^1, \ldots, \eta^{i_{\max}}, \eta^{\text{ah}}$
▶ cycles over the grid. For each $\eta^i$:
  ▶ Play the $\eta^i$ Hedge weights
  ▶ Evaluate $\eta^i$ by its mixability gap
  ▶ Until its budget doubled
▶ adds the next lower grid point on demand

Resources:
▶ Time: $O(K)$ per round (same as Hedge).
▶ Memory: $O(\ln T) \to O(1)$.
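As a rough sketch of the bookkeeping this requires (the factor-2 spacing below is an assumption; Key Idea 2 below only requires some exponential spacing):

    import numpy as np

    class Grid:
        """Active learning rates eta^1 > eta^2 > ..., grown downward on demand."""
        def __init__(self, eta_max=1.0):
            self.etas = [eta_max]     # eta^1
            self.gaps = [0.0]         # Delta^i: cumulative mixability gap per point

        def add_lower(self):
            # assumption: exponential spacing by a factor 2
            self.etas.append(self.etas[-1] / 2)
            self.gaps.append(0.0)

Only the currently active $\eta^i$ plays in any given round, which is why the per-round cost stays at the $O(K)$ of a single Hedge update.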
Unavoidable notation

$$h_t = w_t^\top \ell_t, \qquad \text{(Hedge loss)}$$

$$m_t = \frac{-1}{\eta} \ln \sum_k w_t^k e^{-\eta \ell_t^k}, \qquad \text{(Mix loss)}$$

$$\delta_t = h_t - m_t. \qquad \text{(Mixability gap)}$$

And capitals denote cumulatives:

$$\Delta_T = \sum_{t=1}^T \delta_t, \ldots$$
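In code, the three per-round quantities are a direct transcription of the definitions (scipy's logsumexp with weight argument b keeps the mix loss numerically stable):

    import numpy as np
    from scipy.special import logsumexp

    def round_quantities(w, losses, eta):
        """Hedge loss h_t, mix loss m_t and mixability gap delta_t for one round."""
        h = w @ losses                               # h_t = w_t . l_t
        m = -logsumexp(-eta * losses, b=w) / eta     # m_t = -(1/eta) ln sum_k w^k e^{-eta l^k}
        return h, m, h - m                           # delta_t >= 0 by Jensen's inequality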
Key Idea 1: Monotone regret lower bound

Problem: Regret $R_T^\eta$ is not increasing with $T$.

But we have a monotone lower bound:

$$R_T^\eta \ge \Delta_T^\eta$$

Proof:

$$R_T^\eta = H_T - L_T^* = \underbrace{H_T - M_T}_{\text{mixability gap}} + \underbrace{M_T - L_T^*}_{\text{mix loss regret}}$$

Now use

$$M_T = \frac{-1}{\eta} \ln \sum_k \frac{1}{K} e^{-\eta L_T^k} \in L_T^* + \left[0, \frac{\ln K}{\eta}\right]$$

so the mix loss regret is nonnegative.

Upshot: measure quality of each $\eta$ using the cumulative mixability gap.
Key Idea 2: Grid of $\eta$ suffices

For $\gamma \ge 1$:

$$\delta_t^{\gamma\eta} \le \gamma e^{(\gamma-1)(\ln K + \eta)}\, \delta_t^\eta$$

I.e. $\delta_t^\eta$ cannot be much better than $\delta_t^{\gamma\eta}$.

Exponentially spaced grid of $\eta$ suffices.
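A quick numerical sanity check of this inequality on random rounds (purely illustrative; if the stated constant is right, the assertion never fires):

    import numpy as np

    def gap(w, l, eta):
        # delta = h - m = w.l + (1/eta) ln sum_k w^k e^{-eta l^k}
        return w @ l + np.log(np.sum(w * np.exp(-eta * l))) / eta

    rng = np.random.default_rng(0)
    K, eta, gamma = 5, 0.5, 2.0
    for _ in range(1000):
        w = rng.dirichlet(np.ones(K))      # random weight vector
        l = rng.random(K)                  # random losses in [0, 1]
        bound = gamma * np.exp((gamma - 1) * (np.log(K) + eta)) * gap(w, l, eta)
        assert gap(w, l, gamma * eta) <= bound + 1e-12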
Key Idea 3: Lowest $\eta$ is "AdaHedge"

AdaHedge:

$$\eta_t^{\text{ah}} := \frac{\ln K}{\Delta_{t-1}^{\text{ah}}}$$

Result:

$$R_T \le \sum_{i=1}^{i_{\max}} \Delta_T^i + c\, \Delta_T^{\text{ah}}$$
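A compact sketch of AdaHedge with this rate schedule (the $\eta = \infty$ start falls back to Follow-the-Leader; tie-breaking details are implementation choices, not prescribed by the slide):

    import numpy as np
    from scipy.special import logsumexp

    def adahedge(losses):
        """AdaHedge sketch: eta_t = ln K / Delta_{t-1}^{ah}."""
        T, K = losses.shape
        L = np.zeros(K)                              # cumulative expert losses
        Delta, H = 0.0, 0.0                          # cumulative gap and Hedge loss
        for t in range(T):
            if Delta > 0:
                eta = np.log(K) / Delta
                w = np.exp(-eta * (L - L.min())); w /= w.sum()
                h = w @ losses[t]
                m = -logsumexp(-eta * losses[t], b=w) / eta
            else:                                    # eta = infinity: follow the leader
                leaders = (L == L.min())
                w = leaders / leaders.sum()
                h = w @ losses[t]
                m = losses[t][leaders].min()         # mix loss in the eta -> inf limit
            Delta += h - m                           # update Delta_t^{ah}
            H += h
            L += losses[t]
        return H - L.min()                           # regret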
Key Idea 4: Budgeted timesharing

Active grid points

$$\eta^1, \eta^2, \ldots, \eta^{i_{\max}}, \eta_t^{\text{ah}}$$

with (heavy-tailed) prior distribution

$$\pi^1, \pi^2, \ldots, \pi^{i_{\max}}, \pi^{\text{ah}}$$

LLR maintains the invariant:

$$\frac{\Delta_T^1}{\pi^1} \approx \frac{\Delta_T^2}{\pi^2} \approx \cdots \approx \frac{\Delta_T^{i_{\max}}}{\pi^{i_{\max}}} \approx \frac{\Delta_T^{\text{ah}}}{\pi^{\text{ah}}}$$

Run each $\eta^i$ in turn until its cumulative mixability gap doubled. The invariant then gives, for any $j$,

$$\sum_{i=1}^{i_{\max}} \Delta_T^i = \sum_{i=1}^{i_{\max}} \pi^i \frac{\Delta_T^i}{\pi^i} \approx \sum_{i=1}^{i_{\max}} \pi^i \frac{\Delta_T^j}{\pi^j} \le \frac{\Delta_T^j}{\pi^j}$$
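Putting Key Ideas 2-4 together, a schematic of the timesharing loop (the grid spacing, the prior, and the initial budget scale below are assumptions; the slides specify only the invariant and the doubling rule, and the real LLR additionally tracks the AdaHedge rate as its lowest grid point):

    import numpy as np
    from scipy.special import logsumexp

    def llr_schematic(losses, eta_max=1.0, scale=0.01):
        """Cycle over grid points; each plays until its gap exceeds its budget share."""
        T, K = losses.shape
        etas, gaps = [eta_max], [0.0]
        prior = lambda i: 1.0 / ((i + 1) * (i + 2))  # assumption: heavy-tailed prior pi^i
        L, H, i = np.zeros(K), 0.0, 0
        for t in range(T):
            eta = etas[i]                            # currently active grid point
            w = np.exp(-eta * (L - L.min())); w /= w.sum()
            h = w @ losses[t]
            m = -logsumexp(-eta * losses[t], b=w) / eta
            gaps[i] += h - m                         # Delta^i grows while eta^i plays
            H += h
            L += losses[t]
            if gaps[i] >= scale * prior(i):          # eta^i spent its budget share: move on
                i += 1
                if i == len(etas):                   # full cycle done:
                    etas.append(etas[-1] / 2)        #   add next lower point (factor 2 assumed)
                    gaps.append(0.0)
                    scale *= 2                       #   double budgets, keeping Delta^i/pi^i level
                    i = 0
        return H - L.min()                           # regret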
Putting it all together

Two bounds:

$$R_T \le \tilde{O}\left(\frac{1}{\eta} \ln K \, \ln\!\big(\eta\, R_T^\eta\big)\right) \quad \text{for all } \eta \in [\eta_{t^*}^{\text{ah}}, 1]$$

$$R_T \le \tilde{O}\left(R_T^\infty\right)$$
Run on synthetic data ($T = 2 \cdot 10^7$)

[Plot: regret (0 to 8000) versus learning rate $\eta$ ($10^{-4}$ to $10^2$, log scale), showing the worst-case bound and worst-case $\eta$, the Hedge($\eta$) curve, AdaHedge, FlipFlop, and LLR with $\eta_{t^*}^{\text{ah}}$.]
Conclusion

▶ Higher learning rates often achieve lower regret
  ▶ In practice
  ▶ On constructed data
▶ Learning the Learning Rate (LLR) algorithm
  ▶ Performance close to the best learning rate in hindsight

Open problems:
▶ LLR as proof of concept: can we do it simpler, prettier, smoother and tighter?
Thank you!
[DGGS13] Marie Devaine, Pierre Gaillard, Yannig Goude, and Gilles Stoltz. Forecasting electricity consumption by aggregating specialized experts; a review of the sequential aggregation of specialized experts, with an application to Slovakian and French country-wide one-day-ahead (half-)hourly predictions. Machine Learning, 90(2):231–260, February 2013.