ST5224: Advanced Statistical Theory II
Chen Zehua
Department of Statistics & Applied Probability
2:00-5:00 pm, Thursday, January 12, 2011
Lecture 1: Bayesian Analysis
X is from a population in a parametric family P = {Pθ : θ ∈ Θ},
where Θ ⊂ R^k for a fixed integer k ≥ 1
The elements of Bayesian analysis
• θ is viewed as a realization of a random vector θ̃ ∈ Θ whose prior distribution is Π
• Prior distribution: past experience, past data, or a statistician's belief (subjective)
• Sample X ∈ X : from Pθ = Px|θ, the conditional distribution of X given θ̃ = θ
• Posterior distribution: the prior distribution updated using the sample X = x
Construction of posterior distribution
Theorem 4.1 (Bayes formula)
Assume P = {Px|θ : θ ∈ Θ} is dominated by a σ-finite measure ν and fθ(x) = dPx|θ/dν is a Borel function on (X × Θ, σ(BX × BΘ)). Let Π be a prior distribution on Θ. Suppose that m(x) = ∫_Θ fθ(x) dΠ > 0.
(i) The posterior distribution Pθ|x ≪ Π and
    dPθ|x/dΠ = fθ(x)/m(x)
(ii) If Π ≪ λ for a σ-finite measure λ and dΠ/dλ = π(θ), then
    dPθ|x/dλ = fθ(x)π(θ)/m(x)
Proof:
Result (ii) follows from result (i) and Proposition 1.7(iii)
Proof for (i)
    ∫_X m(x) dν = ∫_X ∫_Θ fθ(x) dΠ dν = ∫_Θ ∫_X fθ(x) dν dΠ = 1
The second equality follows from Fubini’s theorem
m(x) is integrable w.r.t. ν and m(x) < ∞ a.e. ν
Because of this, m(x) is called the marginal p.d.f. of X w.r.t. ν
Without loss of generality we may assume m(x) > 0
If m(x) = 0 for an x ∈ X , then fθ (x) = 0 a.s. Π
Either x should be eliminated from X or the prior Π is incorrect
and a new prior should be specified
For x ∈ X with m(x) < ∞, define
    P(B, x) = (1/m(x)) ∫_B fθ(x) dΠ,    B ∈ BΘ
Then P(·, x) is a probability measure on Θ a.e. ν.
By Theorem 1.7, it remains to show that
P(B, x) = P(θ̃ ∈ B|X = x)
By Fubini’s theorem, P(B, ·) is a measurable function of x
Let Px,θ denote the “joint” distribution of (X , θ̃)
For any A ∈ σ(X ),
    ∫_{A×Θ} I_B(θ) dPx,θ = ∫_A ∫_B fθ(x) dΠ dν
                         = ∫_A [∫_B fθ(x)/m(x) dΠ] [∫_Θ fθ(x) dΠ] dν
                         = ∫_Θ ∫_A [∫_B fθ(x)/m(x) dΠ] fθ(x) dν dΠ
                         = ∫_{A×Θ} P(B, x) dPx,θ
where the third equality follows from Fubini’s theorem
This completes the proof.
Discrete X and θ̃: The Bayes formula in elementary probability
    P(θ̃ = θ|X = x) = P(X = x|θ̃ = θ) P(θ̃ = θ) / Σ_{θ∈Θ} P(X = x|θ̃ = θ) P(θ̃ = θ)
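As a small illustration (not from the text), the elementary Bayes formula can be evaluated directly when Θ is finite: the posterior is the normalized product of the likelihood and the prior. A minimal Python sketch, assuming a Binomial(n, θ) likelihood and a made-up three-point prior:

```python
from math import comb

def discrete_posterior(x, n, thetas, prior):
    """P(theta_tilde = theta | X = x) over a finite grid, for X ~ Binomial(n, theta)."""
    likelihood = [comb(n, x) * t**x * (1 - t)**(n - x) for t in thetas]
    joint = [lik * pr for lik, pr in zip(likelihood, prior)]   # P(X = x | theta) P(theta)
    m_x = sum(joint)                                           # marginal P(X = x)
    return [j / m_x for j in joint]                            # elementary Bayes formula

# hypothetical numbers: 7 successes in 10 trials, prior mass on three candidate values
print(discrete_posterior(7, 10, thetas=[0.3, 0.5, 0.7], prior=[0.2, 0.5, 0.3]))
```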
Remarks on the Bayesian approach
• The posterior Pθ|x contains all the information we have about θ
• Statistical decisions and inference should be made based on Pθ|x, conditional on the observed X = x
• In estimating θ, Pθ|x can be viewed as a randomized decision rule under the approach discussed in §2.3.
  After X = x is observed, Pθ|x is a randomized rule, which is a probability distribution on the action space A = Θ
• The Bayesian method can be applied iteratively
Bayes action and generalized Bayes action
Definition 4.1 (Bayes action)
Let A be an action space in a decision problem and L(θ, a) ≥ 0 be
a loss function
For any x ∈ X , a Bayes action w.r.t. Π is any δ(x) ∈ A such that
    E[L(θ̃, δ(x))|X = x] = min_{a∈A} E[L(θ̃, a)|X = x]
where the expectation is w.r.t. the posterior distribution Pθ|x
Remarks
• The Bayes action minimizes the posterior expected loss
• x is fixed, although δ(x) depends on x
• The Bayes action depends on the prior
• The Bayes action depends on the loss function
Proposition 4.1 (Existence and uniqueness of Bayes actions)
Assume the conditions in Theorem 4.1; L(θ, a) is convex in a for
each fixed θ; for each x ∈ X , E [L(θ̃, a)|X = x] < ∞ for some a.
(i) If A ⊂ R^p is compact, a Bayes action exists for each x ∈ X .
(ii) If A = R^p and L(θ, a) tends to ∞ as ‖a‖ → ∞ uniformly in θ ∈ Θ0 ⊂ Θ with Π(Θ0) > 0, a Bayes action exists for each x ∈ X .
(iii) In (i) or (ii), if L(θ, a) is strictly convex in a for each fixed θ,
then the Bayes action is unique.
Proof
The convexity of L implies that E [L(θ̃, a)|X = x] as a function of a
with any fixed x is convex and continuous. Result (i) follows from
the fact that any continuous function on a compact set attains its
minimum. Result (ii) follows from the fact that
    lim_{‖a‖→∞} E[L(θ̃, a)|X = x] ≥ lim_{‖a‖→∞} ∫_{Θ0} L(θ, a) dPθ|x = ∞
Result (iii) holds because E [L(θ̃, a)|X = x] is strictly convex in a.
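When E[L(θ̃, a)|X = x] has no closed form, the minimization in Definition 4.1 can be carried out numerically. Below is a minimal sketch (an illustration, not from the text) that approximates the posterior expected loss by Monte Carlo draws from an assumed Beta(3, 5) posterior and minimizes it over a; under the (convex) absolute error loss the minimizer should be close to the posterior median.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_draws = rng.beta(3.0, 5.0, size=100_000)   # assumed posterior draws of theta | X = x

def posterior_expected_loss(a, loss):
    """Monte Carlo approximation of E[L(theta_tilde, a) | X = x]."""
    return loss(theta_draws, a).mean()

abs_loss = lambda theta, a: np.abs(theta - a)    # convex in a, so Proposition 4.1 applies

res = minimize_scalar(posterior_expected_loss, bounds=(0.0, 1.0),
                      args=(abs_loss,), method="bounded")
print(res.x, np.median(theta_draws))             # the two values should nearly coincide
```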
Example 4.1: the estimation of ϑ = g (θ)
Assume ∫_Θ [g(θ)]² dΠ < ∞, A = the range of g(θ), and L(θ, a) = [g(θ) − a]² (squared error loss).
Using the argument in Example 1.22, we obtain the Bayes action
    δ(x) = ∫_Θ g(θ)fθ(x) dΠ / m(x) = ∫_Θ g(θ)fθ(x) dΠ / ∫_Θ fθ(x) dΠ,
which is the posterior expectation of g (θ̃), given X = x.
A more specific case
g(θ) = θ^j for some integer j ≥ 1
fθ(x) = e^{−θ} θ^x I_{0,1,2,...}(x)/x! (the Poisson distribution) with θ > 0
Π has a Lebesgue p.d.f. π(θ) = θ^{α−1} e^{−θ/γ} I_(0,∞)(θ)/[Γ(α)γ^α]
(the gamma distribution Γ(α, γ) with known α > 0 and γ > 0)
Then, for x = 0, 1, 2, ..., and some function c(x),
    fθ(x)π(θ)/m(x) = c(x) θ^{x+α−1} e^{−θ(γ+1)/γ} I_(0,∞)(θ).
This is the gamma distribution Γ(x + α, γ/(γ + 1)).
Without actually working out the integral m(x), we know that
    c(x) = (1 + γ^{−1})^{x+α} / Γ(x + α),
    δ(x) = c(x) ∫_0^∞ θ^{j+x+α−1} e^{−θ(γ+1)/γ} dθ.
The integrand is proportional to the p.d.f. of the gamma distribution Γ(j + x + α, γ/(γ + 1)). Hence
    δ(x) = c(x) Γ(j + x + α)/(1 + γ^{−1})^{j+x+α}
         = (j + x + α − 1) · · · (x + α)/(1 + γ^{−1})^j.
In particular, δ(x) = (x + α)γ/(γ + 1) when j = 1.
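The closed form can be checked numerically. The sketch below (illustrative values of α, γ, and x assumed) computes the posterior mean of θ (the case j = 1) three ways: from the conjugate Γ(x + α, γ/(γ + 1)) posterior, by direct numerical integration, and from the formula (x + α)γ/(γ + 1).

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

alpha, gamma_, x = 2.0, 1.5, 4          # assumed hyperparameters and observed count

# conjugate update: posterior is Gamma(shape = x + alpha, scale = gamma/(gamma + 1))
conjugate_mean = stats.gamma(a=x + alpha, scale=gamma_ / (gamma_ + 1)).mean()

# direct numerical integration of the posterior expectation of g(theta) = theta
prior_pdf = stats.gamma(a=alpha, scale=gamma_).pdf
lik = lambda t: stats.poisson(t).pmf(x)
num, _ = quad(lambda t: t * lik(t) * prior_pdf(t), 0, np.inf)
den, _ = quad(lambda t: lik(t) * prior_pdf(t), 0, np.inf)

print(conjugate_mean, num / den, (x + alpha) * gamma_ / (gamma_ + 1))   # all agree
```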
Conjugate prior
In the example above, the prior and posterior are in the same
parametric family. In general, if a prior is in the same parametric
family as the posterior, it is called a conjugate prior.
Remarks
• Whether a prior is conjugate involves a pair of families; one is the family P = {fθ : θ ∈ Θ} and the other is the family from which Π is chosen.
• Example 4.1 shows that the Poisson family and the gamma family produce conjugate priors.
• Many pairs of families in Table 1.1 (page 18) and Table 1.2 (pages 20-21) produce conjugate priors.
• Under a conjugate prior, Bayes actions often have explicit forms (in x) when the loss function is simple.
• Even under a conjugate prior, the integral in δ(x) in Example 4.1 involving a general g may not have an explicit form.
• In general, numerical methods have to be used in evaluating the integrals in δ(x) under general loss functions.
Generalized Bayes action
The minimization in Definition 4.1 is the same as minimizing
    ∫_Θ L(θ, δ(x))fθ(x) dΠ = min_{a∈A} ∫_Θ L(θ, a)fθ(x) dΠ
This is still defined even if Π is not a probability measure but a
σ-finite measure on Θ, in which case m(x) may not be finite.
The δ(x) above which attains the minimum is called a generalized
Bayes action.
If Π(Θ) ≠ 1, Π is called an improper prior.
A prior with Π(Θ) = 1 is then called a proper prior.
In many cases, one has no past information and has to choose a
prior subjectively.
In such cases, one would like to select a noninformative prior that
tries to treat all parameter values in Θ equitably.
A noninformative prior is often improper.
Example 4.3
Suppose that X = (X1 , ..., Xn ) and Xi ’s are i.i.d. from N(µ, σ 2 ),
where µ ∈ Θ ⊂ R is unknown and σ 2 is known.
Consider the estimation of ϑ = µ under the squared error loss.
If Θ = [a, b] with −∞ < a < b < ∞, then a noninformative prior
that treats all parameter values equitably is the uniform
distribution on [a, b].
If Θ = R, however, the corresponding “uniform distribution” is the
Lebesgue measure on R, which is an improper prior.
If Π is the Lebesgue measure on R, then
    (2πσ²)^{−n/2} ∫_{−∞}^{∞} µ² exp{−Σ_{i=1}^n (xi − µ)²/(2σ²)} dµ < ∞.
By differentiating a in
    (2πσ²)^{−n/2} ∫_{−∞}^{∞} (µ − a)² exp{−Σ_{i=1}^n (xi − µ)²/(2σ²)} dµ
Example 4.3 (continued)
and using the fact that Σ_{i=1}^n (xi − µ)² = Σ_{i=1}^n (xi − x̄)² + n(x̄ − µ)², where x̄ is the sample mean of the observations x1, ..., xn, we obtain that
    δ(x) = ∫_{−∞}^{∞} µ exp{−n(x̄ − µ)²/(2σ²)} dµ / ∫_{−∞}^{∞} exp{−n(x̄ − µ)²/(2σ²)} dµ = x̄.
Thus, the sample mean is a generalized Bayes action under the
squared error loss.
From Example 2.25 and Exercise 91 in §2.6, if Π is N(µ0, σ0²), then the Bayes action is
    µ*(x) = [σ²/(nσ0² + σ²)] µ0 + [nσ0²/(nσ0² + σ²)] x̄
and
    c² = σ0²σ²/(nσ0² + σ²)
Note that in this case x̄ is the limit of µ*(x) as σ0² → ∞.
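This limit is easy to see numerically: as σ0² grows, the weight on µ0 vanishes and µ*(x) approaches x̄. A small sketch with simulated data (all values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n = 4.0, 20
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=n)    # assumed sample
xbar = x.mean()

def mu_star(xbar, n, sigma2, mu0, sigma02):
    """Bayes action under the N(mu0, sigma0^2) prior and squared error loss."""
    w = n * sigma02 / (n * sigma02 + sigma2)               # weight on the sample mean
    return (1 - w) * mu0 + w * xbar

for s02 in [0.1, 1.0, 10.0, 1e4]:
    print(s02, mu_star(xbar, n, sigma2, mu0=0.0, sigma02=s02))
print("xbar =", xbar)      # mu_star(x) approaches xbar as sigma0^2 -> infinity
```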
Empirical and hierarchical Bayes actions
Hyperparameters and empirical Bayes
A Bayes action depends on the chosen prior with a vector ξ of
parameters called hyperparameters.
So far, hyperparameters are assumed to be known.
If the hyperparameter ξ is unknown, one way to solve the problem
is to estimate ξ using some historical data; the resulting Bayes
action is called an empirical Bayes action.
If there is no historical data, we may estimate ξ using data x and
the resulting Bayes action is also called an empirical Bayes action.
The simplest empirical Bayes method is to estimate ξ by viewing x
as a “sample” from the marginal distribution
    Px|ξ(A) = ∫_Θ Px|θ(A) dΠθ|ξ,    A ∈ BX,
where Πθ|ξ is a prior depending on ξ, or from the marginal p.d.f. m(x) = ∫_Θ fθ(x) dΠθ|ξ, if Px|θ has a p.d.f. fθ.
Example 4.4
Let X = (X1, ..., Xn) and Xi's be i.i.d. from N(µ, σ²) with an unknown µ ∈ R and a known σ². Consider the prior Πµ|ξ = N(µ0, σ0²) with ξ = (µ0, σ0²).
To obtain a moment estimate of ξ, we need to calculate
    ∫_{R^n} x1 m(x) dx    and    ∫_{R^n} x1² m(x) dx,    x = (x1, ..., xn).
These can be obtained without calculating m(x). Note that
    ∫_{R^n} x1 m(x) dx = ∫_R ∫_{R^n} x1 fµ(x) dx dΠµ|ξ = ∫_R µ dΠµ|ξ = µ0
and
    ∫_{R^n} x1² m(x) dx = ∫_R ∫_{R^n} x1² fµ(x) dx dΠµ|ξ = σ² + ∫_R µ² dΠµ|ξ = σ² + µ0² + σ0²
Example 4.4 (continued)
Thus, by viewing x1, ..., xn as a sample from m(x), we obtain the moment estimates
    µ̂0 = x̄    and    σ̂0² = (1/n) Σ_{i=1}^n (xi − x̄)² − σ²,
where x̄ is the sample mean of the xi's.
Replacing µ0 and σ0² in
    µ*(x) = [σ²/(nσ0² + σ²)] µ0 + [nσ0²/(nσ0² + σ²)] x̄    and    c² = σ0²σ²/(nσ0² + σ²)
(Example 2.25) by µ̂0 and σ̂0², respectively, we find that the empirical Bayes action under the squared error loss is simply the sample mean x̄ (which is the generalized Bayes action in Example 4.3).
Note that σ̂0² in Example 4.4 can be negative.
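A minimal sketch of this moment-based empirical Bayes computation on simulated data follows (all values are assumed; truncating a negative σ̂0² at zero is a common practical fix and is not part of the text):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n = 1.0, 50
mu = rng.normal(3.0, 2.0)                              # a draw of mu from some true prior
x = rng.normal(mu, np.sqrt(sigma2), size=n)            # simulated sample

xbar = x.mean()
mu0_hat = xbar                                         # moment estimate of mu_0
sigma02_hat = np.mean((x - xbar) ** 2) - sigma2        # moment estimate; may be negative
sigma02_hat = max(sigma02_hat, 0.0)                    # common fix (assumption, not in text)

w = n * sigma02_hat / (n * sigma02_hat + sigma2)
empirical_bayes_action = (1 - w) * mu0_hat + w * xbar
print(empirical_bayes_action, xbar)    # with mu0_hat = xbar the action collapses to xbar
```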
Hierarchical Bayes
Instead of estimating hyperparameters, in the hierarchical Bayes
approach we put a prior on hyperparameters.
Let Πθ|ξ be a (first-stage) prior with a hyperparameter vector ξ
and let Λ be a prior on Ξ, the range of ξ. Then the “marginal”
prior for θ is defined by
    Π(B) = ∫_Ξ Πθ|ξ(B) dΛ(ξ),    B ∈ BΘ.
If the second-stage prior Λ also depends on some unknown
hyperparameters, then one can go on to consider a third-stage
prior.
In most applications, however, two-stage priors are sufficient, since
misspecifying a second-stage prior is much less serious than
misspecifying a first-stage prior (Berger, 1985, §4.6). In addition,
the second-stage prior can be noninformative (improper).
Bayes actions can be obtained in the same way as before.
Remarks
• Empirical Bayes methods deviate from the Bayes method since x is used to estimate hyperparameters.
• The hierarchical Bayes method is generally better than empirical Bayes methods.
Suppose that X has a p.d.f. fθ (x) w.r.t. a σ-finite measure ν and
Πθ|ξ has a p.d.f. πθ|ξ (θ) w.r.t. a σ-finite measure κ.
Then the prior Π has a p.d.f. (w.r.t. κ)
    π(θ) = ∫_Ξ πθ|ξ(θ) dΛ(ξ)
and
    m(x) = ∫_Θ ∫_Ξ fθ(x)πθ|ξ(θ) dΛ dκ.
Let Pθ|x,ξ be the posterior distribution of θ̃ given x and ξ and
    mx|ξ(x) = ∫_Θ fθ(x)πθ|ξ(θ) dκ,
which is the marginal of X given ξ.
Then the posterior distribution Pθ|x has a p.d.f.
    dPθ|x/dκ = fθ(x)π(θ)/m(x)
             = ∫_Ξ [fθ(x)πθ|ξ(θ)/m(x)] dΛ(ξ)
             = ∫_Ξ [fθ(x)πθ|ξ(θ)/mx|ξ(x)] [mx|ξ(x)/m(x)] dΛ(ξ)
             = ∫_Ξ (dPθ|x,ξ/dκ) dPξ|x,
where Pξ|x is the posterior distribution of ξ given x.
Thus, under the estimation problem considered in Example 4.1, the
(hierarchical) Bayes action is
    δ(x) = ∫_Ξ δ(x, ξ) dPξ|x,
where δ(x, ξ) is the Bayes action when ξ is known.
Example 4.5
Consider Example 4.4 again.
Suppose that µ0 in the first-stage prior N(µ0, σ0²) is unknown and σ0² is known.
Let the second-stage prior for ξ = µ0 be the Lebesgue measure on
R (improper prior).
From Example 2.25,
    δ(x, ξ) = [σ²/(nσ0² + σ²)] ξ + [nσ0²/(nσ0² + σ²)] x̄.
To obtain the Bayes action δ(x), it suffices to calculate Eξ|x (ξ),
where the expectation is w.r.t. Pξ|x .
Note that the p.d.f. of Pξ|x is proportional to
    ψ(ξ) = ∫_{−∞}^{∞} exp{−n(x̄ − µ)²/(2σ²) − (µ − ξ)²/(2σ0²)} dµ.
Example 4.5 (continued)
Using the properties of normal distributions, one can show that
    ψ(ξ) = C1 exp{[n/(2σ²) + 1/(2σ0²)]^{−1} [nx̄/(2σ²) + ξ/(2σ0²)]² − ξ²/(2σ0²)}
         = C2 exp{−nξ²/[2(nσ0² + σ²)] + nx̄ξ/(nσ0² + σ²)}
         = C3 exp{−n(ξ − x̄)²/[2(nσ0² + σ²)]},
where C1 , C2 , and C3 are quantities not depending on ξ.
Hence Eξ|x (ξ) = x̄.
The (hierarchical) generalized Bayes action is then
    δ(x) = [σ²/(nσ0² + σ²)] Eξ|x(ξ) + [nσ0²/(nσ0² + σ²)] x̄ = x̄.
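The identity Eξ|x(ξ) = x̄ can also be checked numerically by integrating ψ(ξ) directly, without the normal-distribution algebra above. A sketch with assumed values of n, σ², σ0², and x̄:

```python
import numpy as np
from scipy.integrate import quad

n, sigma2, sigma02, xbar = 10, 2.0, 3.0, 1.7          # assumed values

def psi(xi):
    """psi(xi): the joint density with mu integrated out, up to a constant."""
    integrand = lambda mu: np.exp(-n * (xbar - mu) ** 2 / (2 * sigma2)
                                  - (mu - xi) ** 2 / (2 * sigma02))
    return quad(integrand, -np.inf, np.inf)[0]

norm_const = quad(psi, -np.inf, np.inf)[0]            # normalizes the p.d.f. of P_{xi|x}
post_mean_xi = quad(lambda xi: xi * psi(xi), -np.inf, np.inf)[0] / norm_const
print(post_mean_xi, xbar)                             # both approximately 1.7
```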
Bayes estimators
• If a Bayes action δ(x) is a measurable function of x, then δ(X) is a nonrandomized decision rule. It also minimizes the Bayes risk
      rT(Π) = ∫_Θ RT(θ) dΠ
  over all decision rules T, where RT(θ) = E[L(θ, T(X))] is the risk function of T.
• Thus, δ(X) is a Bayes rule as defined in §2.3.2. In an estimation problem, a Bayes rule is called a Bayes estimator. Generalized Bayes risks, generalized Bayes rules (or estimators), and empirical Bayes rules (or estimators) can be defined similarly.
• In view of the discussion in §2.3.2, even if we do not adopt the Bayesian approach, the method described in §4.1.1 can be used as a way of generating decision rules.
Frequentist properties of Bayes rules/estimators
Admissibility
Given RT(θ) = E[L(T(X), θ)], T is ℑ-admissible iff there is no T0 ∈ ℑ with RT0(θ) ≤ RT(θ) for all θ and RT0(θ) < RT(θ) for some θ. Admissible = ℑ-admissible with ℑ = {all rules}.
Bayes rules are typically admissible: if T is better than a Bayes rule δ, then T has the same Bayes risk as δ and is itself a Bayes rule.
Theorem 4.2 (Admissibility of Bayes rules)
In a decision problem, let δ(X ) be a Bayes rule w.r.t. a prior Π.
(i) If δ(X ) is a unique Bayes rule, then δ(X ) is admissible.
(ii) If Θ is a countable set, the Bayes risk rδ (Π) < ∞, and Π gives
positive probability to each θ ∈ Θ, then δ(X ) is admissible.
(iii) Let ℑ be the class of decision rules having continuous risk functions. If δ(X) ∈ ℑ, rδ(Π) < ∞, and Π gives positive probability to any open subset of Θ, then δ(X) is ℑ-admissible.
Generalized Bayes rules or estimators are not necessarily admissible.
Many generalized Bayes rules are limits of Bayes rules (Examples
4.3 and 4.7), which are often admissible.
Theorem 4.3
Suppose that Θ is an open set in R^k. In a decision problem, let ℑ be the class of decision rules having continuous risk functions. A decision rule T ∈ ℑ is ℑ-admissible if there exists a sequence {Πj} of (possibly improper) priors such that (a) the generalized Bayes risks rT(Πj) are finite for all j; (b) for any θ0 ∈ Θ and η > 0,
    lim_{j→∞} [rT(Πj) − r*_j(Πj)] / Πj(Oθ0,η) = 0,
where r*_j(Πj) = inf_{T∈ℑ} rT(Πj) and Oθ0,η = {θ ∈ Θ : ‖θ − θ0‖ < η} with Πj(Oθ0,η) < ∞ for all j.
Proof
Suppose that T is not ℑ-admissible.
Then there exists T0 ∈ ℑ such that RT0(θ) ≤ RT(θ) for all θ and RT0(θ0) < RT(θ0) for a θ0 ∈ Θ.
From the continuity of the risk functions, we conclude that
    RT0(θ) < RT(θ) − ε,    θ ∈ Oθ0,η
for some constants ε > 0 and η > 0.
Then, for any j,
    rT(Πj) − r*_j(Πj) ≥ rT(Πj) − rT0(Πj)
                      ≥ ∫_{Oθ0,η} [RT(θ) − RT0(θ)] dΠj(θ)
                      ≥ ε Πj(Oθ0,η),
which contradicts condition (b). Hence, T is ℑ-admissible.
Example 4.6 (An application of Theorem 4.3)
Let X1 , ..., Xn be iid from N(µ, σ 2 ) with unknown µ and known σ 2 .
Consider the squared error loss.
By Theorem 2.1, the risk function of any decision rule is
continuous in µ if the risk is finite.
Apply Theorem 4.3 to the sample mean X̄ and let Πj = N(0, j).
Since RX̄ (µ) = σ 2 /n, rX̄ (Πj ) = σ 2 /n for any j. Hence, condition
(a) in Theorem 4.3 is satisfied.
From Example 2.25, the Bayes estimator w.r.t. Πj is
    δj(X) = [nj/(nj + σ²)] X̄
Thus,
    Rδj(µ) = (σ²nj² + σ⁴µ²)/(nj + σ²)²
and (continued on next slide)
    r*_j(Πj) = ∫ Rδj(µ) dΠj = σ²j/(nj + σ²).
For any Oµ0,η = {µ : |µ − µ0| < η},
    Πj(Oµ0,η) = Φ((µ0 + η)/√j) − Φ((µ0 − η)/√j) = 2ηΦ′(ξj)/√j
for some ξj satisfying (µ0 − η)/√j ≤ ξj ≤ (µ0 + η)/√j, where Φ is the standard normal c.d.f. and Φ′ is its derivative.
Since Φ′(ξj) → Φ′(0) = (2π)^{−1/2},
    [rX̄(Πj) − r*_j(Πj)] / Πj(Oµ0,η) = σ⁴√j / [2ηΦ′(ξj)n(nj + σ²)] → 0
as j → ∞. Thus, condition (b) in Theorem 4.3 is satisfied. Hence,
Theorem 4.3 applies and the sample mean X̄ is admissible.
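Condition (b) can also be checked numerically for this example. A short sketch (σ², n, µ0, and η assumed) evaluates the excess Bayes risk rX̄(Πj) − r*_j(Πj) = σ⁴/[n(nj + σ²)] and the prior mass Πj(Oµ0,η), and prints their ratio for increasing j:

```python
import numpy as np
from scipy import stats

sigma2, n, mu0, eta = 1.0, 5, 0.3, 0.2             # assumed constants

for j in [1.0, 10.0, 1e2, 1e4, 1e6]:
    excess_risk = sigma2**2 / (n * (n * j + sigma2))             # r_Xbar(Pi_j) - r*_j(Pi_j)
    prior_mass = (stats.norm(0, np.sqrt(j)).cdf(mu0 + eta)
                  - stats.norm(0, np.sqrt(j)).cdf(mu0 - eta))    # Pi_j(O_{mu0, eta})
    print(j, excess_risk / prior_mass)             # the ratio tends to 0 as j grows
```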
Bias
A Bayes estimator is usually biased.
Consistency
Bayes estimators are usually consistent and approximately
unbiased.
When Bayes estimators have explicit forms, it is usually easy to
check directly whether Bayes estimators are consistent and
approximately unbiased (Examples 4.7-4.9).
Bayes estimators also have some other good asymptotic properties,
which are studied in §4.5.3.
Markov chain Monte Carlo (MCMC)
Often, Bayes actions or estimators have to be computed
numerically. Typically we need to compute
    Ep(g) = ∫_Θ g(θ)p(θ) dν
with some function g , where p(θ) is a p.d.f. w.r.t. a σ-finite
measure ν on (Θ, BΘ ) and Θ ⊂ Rk . If g is an indicator function of
A ∈ BΘ and p(θ) is the posterior p.d.f. of θ given X = x, then
Ep (g ) is the posterior probability of A.
The simple Monte Carlo method
Generate i.i.d. θ^(1), ..., θ^(m) from a p.d.f. h(θ) > 0 w.r.t. ν.
By the SLLN, as m → ∞,
    Êp(g) = (1/m) Σ_{j=1}^m g(θ^(j)) p(θ^(j))/h(θ^(j)) → ∫_Θ [g(θ)p(θ)/h(θ)] h(θ) dν = Ep(g)    a.s.
Hence Êp (g ) is a numerical approximation to Ep (g ).
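A minimal sketch of this simple Monte Carlo (importance sampling) approximation, with an assumed target density p and trial density h on Θ = (0, ∞) and g(θ) = θ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
m = 200_000

p = stats.gamma(a=3.0, scale=1.0).pdf      # assumed "posterior" density p(theta)
h = stats.expon(scale=5.0).pdf             # trial density h(theta) > 0 on (0, inf)
g = lambda t: t                            # target: the mean of p, which equals 3.0

theta = rng.exponential(scale=5.0, size=m)        # i.i.d. draws theta^(1),...,theta^(m) from h
E_hat = np.mean(g(theta) * p(theta) / h(theta))   # the estimator E_hat_p(g)
print(E_hat)                                      # should be close to 3.0
```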
The simple Monte Carlo method may not work well because
• the convergence of Êp(g) is very slow when k (the dimension of Θ) is large
• generating a random vector from some k-dimensional distribution is difficult, if not impossible.
More sophisticated MCMC methods
These differ from the simple Monte Carlo method in two aspects:
• generating random vectors can be done using distributions whose dimensions are much lower than k
• θ^(1), ..., θ^(m) are not independent, but form a homogeneous Markov chain.
Many MCMC methods have been developed over the last 20 years; we consider only one of them here as an example.
Gibbs sampler
Let y = (y1 , y2 , ..., yk ). (yj ’s may be vectors with different
dimensions)
At step t = 1, 2, ..., given y^(t−1), generate
    y1^(t) from P(y1 | y2^(t−1), ..., yk^(t−1)), ...,
    yj^(t) from P(yj | y1^(t), ..., y_{j−1}^(t), y_{j+1}^(t−1), ..., yk^(t−1)), ...,
    yk^(t) from P(yk | y1^(t), ..., y_{k−1}^(t)).
The notation P(yj | x) denotes the conditional distribution of yj given that X = x.
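A Gibbs step thus cycles through the components, drawing each from its full conditional given the most recent values of the others. As an illustration (not the text's example), the sketch below runs this scheme for a bivariate normal target with correlation ρ, where both full conditionals are univariate normals:

```python
import numpy as np

def gibbs_bivariate_normal(rho, m, rng):
    """Gibbs sampler for (y1, y2) ~ N(0, [[1, rho], [rho, 1]])."""
    y1, y2 = 0.0, 0.0                        # arbitrary starting point y^(0)
    chain = np.empty((m, 2))
    cond_sd = np.sqrt(1.0 - rho**2)
    for t in range(m):
        y1 = rng.normal(rho * y2, cond_sd)   # y1 | y2 ~ N(rho*y2, 1 - rho^2)
        y2 = rng.normal(rho * y1, cond_sd)   # y2 | y1 ~ N(rho*y1, 1 - rho^2)
        chain[t] = (y1, y2)
    return chain

chain = gibbs_bivariate_normal(rho=0.8, m=50_000, rng=np.random.default_rng(4))
print(chain.mean(axis=0), np.corrcoef(chain.T)[0, 1])   # approx. (0, 0) and 0.8
```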
Example 4.10
Consider a linear model
    Xij = β^τ Zi + εij,    j = 1, ..., ni, i = 1, ..., k,
where β ∈ R^p is unknown, Zi's are known vectors, εij's are independent, and εij is N(0, σi²), j = 1, ..., ni, i = 1, ..., k.
Let X be the sample vector containing all Xij's.
The parameter vector is θ = (β, ω), where ω = (ω1, ..., ωk) and ωi = (2σi²)^{−1}.
Example 4.10 (continued)
Assume the prior for θ has the Lebesgue p.d.f.
    c π(β) ∏_{i=1}^k ωi^α e^{−ωi/γ},
where α > 0, γ > 0, and c > 0 are known constants and π(β) is a known Lebesgue p.d.f. on R^p.
The posterior p.d.f. of θ is then proportional to
    h(X, θ) = π(β) ∏_{i=1}^k ωi^{ni/2+α} e^{−[γ^{−1}+vi(β)]ωi},
where vi(β) = Σ_{j=1}^{ni} (Xij − β^τ Zi)².
To apply a Gibbs sampler with y = θ, y1 = β, and y2 = ω, we
need to generate random vectors from the posterior of β, given x
and ω, and the posterior of ω, given x and β.
Example 4.10 (continued)
The posterior of ω = (ω1, ..., ωk), given x and β, is a product of marginals of ωi's that are the gamma distributions Γ(α + 1 + ni/2, [γ^{−1} + vi(β)]^{−1}), i = 1, ..., k.
Assume now that π(β) ≡ 1 (noninformative prior for β).
The posterior p.d.f. of β, given x and ω, is proportional to
    ∏_{i=1}^k e^{−ωi vi(β)} ∝ e^{−‖W^{1/2}Zβ − W^{1/2}X‖²},
where W is the block diagonal matrix whose ith block is ωi I_{ni}.
Let n = Σ_{i=1}^k ni.
The posterior of W^{1/2}Zβ, given X and ω, is Nn(W^{1/2}X, 2^{−1}In) and the posterior of β, given X and ω, is Np((Z^τWZ)^{−1}Z^τWX, 2^{−1}(Z^τWZ)^{−1}) (Z^τWZ is assumed to be of full rank for simplicity), since β = [(Z^τWZ)^{−1}Z^τW^{1/2}]W^{1/2}Zβ.
Random generation using these two posterior distributions is easy.
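A minimal sketch of this Gibbs sampler on simulated data follows (π(β) ≡ 1 and illustrative values of α, γ, k, ni, and p are assumed; it is an illustration of the two conditional draws above, not a production implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
k, ni, p = 3, 30, 2
alpha, gamma_ = 1.0, 2.0                               # assumed prior constants
beta_true = np.array([1.0, -0.5])
Zi = rng.normal(size=(k, p))                           # known covariate vector per group
sigma_i = np.array([0.5, 1.0, 2.0])
X = np.vstack([rng.normal(Zi[i] @ beta_true, sigma_i[i], size=ni) for i in range(k)])

Z = np.repeat(Zi, ni, axis=0)                          # n x p design matrix, n = k * ni
x = X.ravel()
group = np.repeat(np.arange(k), ni)

beta = np.zeros(p)
draws = []
for t in range(5000):
    # omega_i | x, beta ~ Gamma(alpha + 1 + ni/2, scale = [1/gamma + v_i(beta)]^{-1})
    resid2 = (x - Z @ beta) ** 2
    v = np.array([resid2[group == i].sum() for i in range(k)])
    omega = rng.gamma(alpha + 1 + ni / 2, 1.0 / (1.0 / gamma_ + v))
    # beta | x, omega ~ N_p((Z'WZ)^{-1} Z'Wx, (1/2)(Z'WZ)^{-1}), W = diag of omega by group
    w = omega[group]
    ZtWZ = Z.T @ (w[:, None] * Z)
    mean = np.linalg.solve(ZtWZ, Z.T @ (w * x))
    cov = 0.5 * np.linalg.inv(ZtWZ)
    beta = rng.multivariate_normal(mean, cov)
    draws.append(beta)

print(np.mean(draws[1000:], axis=0), beta_true)        # posterior mean vs. true beta
```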