An Algorithm for Unbiased Random Sampling
Jarmo Ernvall and Olli Nevalainen
Department of Computer Science, University of Turku, SF-20500 TURKU 50, Finland
An algorithm is given for selecting an unbiased sample, without replacement, from a set of possible elements. The
storage space depends linearly on the sample size, and the running time of the algorithm is low.
INTRODUCTION
The problem of random sampling occurs in many
different contexts. For example, we may wish to study
experimentally the behaviour of a new data structure for
searching. Then the easiest way is to generate a set of
data elements, construct the corresponding structure and
then perform searching of some elements belonging (or
not belonging) to the data structure. A usual approach is
to use a set of data elements where the elements are
randomly sampled from a universal set. If we allow
multiple occurrences of elements (sampling with replacement) we have no difficulties: we repeat m times a step
where each of the possible n elements is chosen with
equal probability 1/n (m being the size of the final
multiset sample).
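As a minimal sketch (the routine name RSEL is ours; RAN(Y) is assumed to be a system routine returning a pseudorandom number between 0 and 1, as in the listing of HSEL below), sampling with replacement amounts to:

      SUBROUTINE RSEL(M,N,IS)
C     SKETCH OF SAMPLING WITH REPLACEMENT (ILLUSTRATION ONLY).
C     EACH OF THE M DRAWS PICKS ONE OF THE N POSSIBLE INDEXES
C     WITH PROBABILITY 1/N AND STORES IT INTO IS(L); DUPLICATES
C     MAY OCCUR. RAN(Y) RETURNS A PSEUDORANDOM NUMBER IN (0,1).
      DIMENSION IS(M)
      DO 1 L=1,M
    1 IS(L)=N*RAN(Y)+1
      RETURN
      END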
If we demand in contrast to the above that each
element occurs at most once in the sample (random
sampling without replacement) the sampling may be very
time- and space-consuming. Goodman and Hedetniemi
give and analyse four sampling algorithms for this case.1
The most effective of those, called SELECT, needs an
O(m) running time for actual sampling, O(n) running
time for preprocessing and O(n) storage space. The
significance of an algorithm of this kind becomes evident
if we recall a result of Ref. 1: if we sample elements with
replacement and accept only those which have not yet
been selected, then for finding m different elements we
have on average to sample n Σ_{i=n−m+1}^{n} 1/i elements.
The value of this expression may be very large in
comparison to m; for instance, when m = n it equals
n(1 + 1/2 + . . . + 1/n) ≈ n ln n. In addition, the decision whether or
not a new element is really accepted demands sorting or
O(n) storage.
In the following we give an algorithm using the basic
idea of SELECT but consuming only O(m) storage. The
running time of the algorithm is on average proportional
to m and in the worst case proportional to m². In addition,
a version demanding O(m log m) time, both on average
and in the worst case, is pointed out.
THE SAMPLING ALGORITHM
The following algorithm creates a random sample of m
elements without replacement out of a set of n elements
E° = {e1, e2, . . ., en}. At the termination of the algorithm
the array s(1:m) contains the indexes of the selected elements.
We denote by E the set of unselected elements and by
last the number of elements in E. Initially let E = E°
and last = n. Further, let the operator o(i) give the E°-index
of the ith element of E. Thus initially we have o(i) = i, for
i = 1, 2, . . ., last.
The first sample element is found by choosing a
random integer z ∈ [1, last] and writing the index o(z)
down into the array s. The element chosen is e_o(z), and it
is deleted from E. The value of o(z) is redefined to o(last)
and last is reduced by one. The same actions are then
repeated for each sample element. It is easy to see that an
unbiased random sample is really found by the above
method.1
Goodman and Hedetniemi implement the above
algorithm by using a vector of n elements holding the
current indexes of the E set.1 This makes the algorithm
very simple and the actual sampling very fast.
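A minimal sketch of this vector-based scheme (the routine VSEL below is our illustration, not the code of Ref. 1; RAN(Y) is again assumed to return a pseudorandom number between 0 and 1) is:

      SUBROUTINE VSEL(M,N,IS,IND)
C     SKETCH OF THE VECTOR-BASED SAMPLING SCHEME (ILLUSTRATION ONLY).
C     IND(1:N) HOLDS THE CURRENT E0-INDEXES OF THE UNSELECTED
C     ELEMENTS AND IS(1:M) RECEIVES THE SAMPLE. THE CHOSEN POSITION
C     IS OVERWRITTEN BY THE LAST ONE AND THE RANGE IS SHRUNK.
      DIMENSION IS(M),IND(N)
      DO 1 I=1,N
    1 IND(I)=I
      LAST=N
      DO 2 L=1,M
      K=LAST*RAN(Y)+1
      IS(L)=IND(K)
      IND(K)=IND(LAST)
    2 LAST=LAST-1
      RETURN
      END

The O(n) initialization loop and the n-element array IND are exactly the costs that the hashing algorithm HSEL given below avoids.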
In certain cases n may be intolerably large, however,
though the size of the sample is not. Let us suppose, for
example, that we are generating an artificial file to study
the behaviour of a new accessing method. We want to
generate 5000 different records, each with 10 fields and
with 6 possible different values in a field. The demand
that duplicates do not occur follows from the nature of
the accessing method. Now we have a natural correspondence between positive integers and records: the first
possible record is of the form (1, 1, . . . , 1) and the last
(6, 6, . . . , 6). The number of possible records is
60 466 176. To store the indexes of the records is
impossible in most present-day computers. Of course we
could sample the records with replacement but this would
possibly not give the right file size in one phase.
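The correspondence itself is cheap to compute. A hypothetical routine (the name DECODE and the field order are our choices) maps an index k to the corresponding record by reading k − 1 as a ten-digit base-6 number:

      SUBROUTINE DECODE(K,IREC)
C     ILLUSTRATION ONLY: MAP AN INDEX K (1 ... 6**10) TO A RECORD
C     IREC(1:10) WITH FIELD VALUES 1 ... 6 BY READING K-1 AS A
C     TEN-DIGIT BASE-6 NUMBER. K=1 GIVES (1,1,...,1) AND
C     K=6**10 GIVES (6,6,...,6).
      DIMENSION IREC(10)
      KK=K-1
      DO 1 I=1,10
      IREC(I)=MOD(KK,6)+1
    1 KK=KK/6
      RETURN
      END

Thus only the integer indexes of the sampled records need to be produced; the records themselves can be generated from the indexes afterwards.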
In the next algorithm we omit the explicit list of
indexes of unselected elements by keeping track only of
the changes in the indexes in E. To do this we need a data
structure supporting INSERT, UPDATE and SEARCH
operations. Further, we know that finally at most m - 1
nodes are stored in the data structure. To obtain a good
running time we use hashing with separate chaining in
our algorithm (see Knuth,2 p. 514). The primary area
serves only as a set of pointers to the linearly linked
collision lists. For each random integer which has been
selected there exists a list element of the form (link, z,
o(z)). We use z as the searching key in the hash structure.
As an address transformation function we apply the
remainder of division technique. Because of the random
nature of index selection the technique gives a rather
ideal scatter function. Furthermore, the size of the hash
table is not critical; table size m works well.
The algorithm is as follows:
      SUBROUTINE HSEL(M,N,IS,IH,LINK,IA,IB)
C     THIS PROGRAM SELECTS M DIFFERENT ELEMENTS OUT OF N
C     POSSIBLE AND STORES THEIR INDEXES INTO AN ARRAY IS(I).
C     ARRAY IH(I) SERVES AS A PRIMARY AREA OF THE HASH
C     TABLE; COLLISION LISTS ARE IMPLEMENTED BY THE
C     ARRAYS (LINK(I),IA(I),IB(I)). THE CALLER MUST CONTAIN
C     DIMENSION IS(M),IH(M),LINK(M),IA(M),IB(M). RAN(Y)
C     GENERATES A PSEUDORANDOM NUMBER FROM THE INTERVAL 0 TO 1.
C     Y IS A DUMMY VARIABLE.
      DIMENSION IH(M),LINK(M),IA(M),IB(M),IS(M)
C     INITIALIZATIONS
      LAST=N
      NEW=0
      DO 1 L=1,M
    1 IH(L)=0
C     THE MAIN LOOP
      DO 2 L=1,M
      KSI=LAST*RAN(Y)+1
C     SEARCH FROM THE HASH TABLE A NODE WITH KEY=LAST AND
C     STORE THE ORIGINAL INDEX NUMBER IN MLAST. IF SUCH A
C     NODE IS NOT FOUND, LET MLAST=LAST.
      I=MOD(LAST,M)+1
      J=IH(I)
    4 IF(J.EQ.0) GO TO 3
      IF(IA(J).EQ.LAST) GO TO 3
      J=LINK(J)
      GO TO 4
    3 MLAST=LAST
      IF(J.NE.0) MLAST=IB(J)
C     SEARCH FROM THE HASH TABLE A NODE WITH KEY=KSI.
C     IF SUCH A NODE (IA(J),IB(J)) IS FOUND LET THE NEW
C     VALUE OF IB(J) BE MLAST. OTHERWISE CREATE A NEW NODE
C     CONTAINING (KSI,MLAST).
      I=MOD(KSI,M)+1
      J=IH(I)
    9 IF(J.EQ.0) GO TO 7
      IF(IA(J).EQ.KSI) GO TO 8
      J=LINK(J)
      GO TO 9
C     UPDATE THE NODE
    8 IS(L)=IB(J)
      IB(J)=MLAST
      GO TO 2
C     CREATE A NEW NODE
    7 IS(L)=KSI
      NEW=NEW+1
      IA(NEW)=KSI
      IB(NEW)=MLAST
      LINK(NEW)=IH(I)
      IH(I)=NEW
    2 LAST=LAST-1
      RETURN
      END
C
C     MAIN PROGRAM
C
      DIMENSION IS(100),IH(100),LINK(100),IA(100),IB(100)
      N=30000
      M=100
      CALL HSEL(M,N,IS,IH,LINK,IA,IB)
      WRITE(5,70) (IS(I),I=1,M)
   70 FORMAT(10I6)
      STOP
      END
EFFICIENCY

As noted in the previous section the auxiliary storage
needed (for the selected indexes, hash table and collision
lists) is about m + m + 3m = 5m storage locations. Hence
the demand of the storage space is less for HSEL than
that for SELECT if 4m < n. Furthermore for HSEL the
need for the storage space is proportional to m.

We can reduce further the size of the collision lists to
about 2m storage locations by applying the abbreviation
technique proposed by Knuth for a different hash table
organization (see Ref. 2, exercise 6.4.13). When using the
remainder of division technique (with m as a divisor) we
may thus, instead of the key z, store only the value a = ⌊z/m⌋.
Because at the final step (L = M) we could omit the
modifications in the hash table (i.e. after statement '8'
the line 'IB(J) = MLAST' and after statement '7' the lines
'NEW = NEW + 1' up to 'IH(I) = NEW'), the size of
the list area is at most m − 1 list elements. Then, instead
of the link and key fields we can store a single integer of
the form am + b, where b is the displacement inside the
linking area pointing to the next node in the same chain
and b = 0 in the case of the last element in the chain.
Note that it is also easy to reduce somewhat the storage
demand of the algorithm SELECT: one can leave out
the array of the selected indexes by collecting them at the
end of the original index vector.
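A small sketch of the packing arithmetic just described (the helper functions below are ours and are not part of HSEL; they assume the hash function I = MOD(KEY,M) + 1 used in HSEL) shows how the key and the link can share a single integer while the field IB is kept as before:

      INTEGER FUNCTION IPACK(KSI,IB,M)
C     ILLUSTRATION ONLY: PACK THE KEY KSI AND THE CHAIN DISPLACEMENT
C     IB (0 = LAST NODE OF THE CHAIN, OTHERWISE 1 ... M-1) INTO THE
C     SINGLE INTEGER A*M + B WITH A = KSI/M.
      IPACK=(KSI/M)*M+IB
      RETURN
      END

      INTEGER FUNCTION KEYOF(NODE,I,M)
C     RECOVER THE KEY FROM A PACKED NODE STORED IN HASH SLOT I;
C     THIS WORKS BECAUSE I = MOD(KEY,M)+1.
      KEYOF=(NODE/M)*M+I-1
      RETURN
      END

      INTEGER FUNCTION NXT(NODE,M)
C     RECOVER THE DISPLACEMENT OF THE NEXT NODE IN THE CHAIN.
      NXT=MOD(NODE,M)
      RETURN
      END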
The essential part of the program when examining the
execution time of HSEL is the DO 2-loop. The loop is
executed m times. Each time a function call and two hash
table searches are made. Because the running time of the
function RAN is constant we need study only the
execution times of the hash table operations. Furthermore
these operations are very much alike; hence we consider
only the first one (lines 22 up to 29 of HSEL), which
searches the hash table for the node with key LAST.
The running time of this operation is dependent on the
number of key comparisons made while following the
collision list.
The load factor α is defined as the ratio of the number
x of nodes in the hash table to the number of positions
m in the hash table. In the course of the algorithm x is
between 0 and m − 1. For large n and m the expected
value of x before choosing the mth sample element is

x = Σ_{i=1}^{n} P(i is in the set of the m − 1 first generated random integers).
The probability that i is in the set of the m − 1 first generated
random integers is equal for i = 1, 2, . . ., n − (m − 1) + 1.
For the values i = n − (m − 1) + 2, . . ., n the probability
depends on i, because the variable last decreases. In
total we have
x = Σ_{i=1}^{n−m+2} (m − 1)/n + Σ_{i=n−m+3}^{n} (n − i + 1)/n
= m − 1 − (m − 2)(m − 1)/(2n).
Thus the expected load factor at that moment is
α = 1 − 1/m − (m − 1)/(2n) + (m − 1)/(mn)    (1)
On the other hand we know that for a hash table
organization like that in the section on the sampling
algorithm the expected number of comparisons for an
unsuccessful search is
c'_E ≈ e^(−α) + α

and for a successful search

c_E ≈ 1 + α/2

(see Knuth,2 p. 518). In addition the probability p'_i of
making an unsuccessful search when determining the ith
item is p'_i = 1 − (i − 1)/n. Thus the expected number of
comparisons when searching for the mth element is
(1 − (m − 1)/n)(e^(−α) + α) + ((m − 1)/n)(1 + α/2)
where α is as in (1).
Because α < 1, the expected number of comparisons
needed in the hash table for the ith sample element is in
the case of unsuccessful search

c'_E < 1 + 1/e
and in the case of successful search
c_E < 1 + 1/2.
From this we conclude that the expected number of
comparisons needed in the hash table for all elements is
less than
m(c'_E + c_E) < 3m.
Hence the expected running time of the algorithm HSEL
is proportional to m.
In the worst case the length of the hash table lists is
proportional to m and hence the running time in the
worst case is proportional to the square of m.
In a DEC-10 with a KA10 processor we observed for
n = 30 000; m = 1000, 2000, 3000, 4000, and 5000, the
running times 0.42, 0.74, 1.24, 1.52 and 2.00 sec,
respectively. The dependence on m is here essentially linear. For a
sample size < 1700 HSEL is even faster than SELECT
(n = 30 000).
CONCLUDING REMARKS

It has been shown how hashing techniques can be used
for reducing the working storage space of a sampling
algorithm. The space used by our algorithm is smaller
than that of SELECT of Ref. 1 for a sample size m < n/4.
The opposite is true when m ∈ [n/4, n/2]. For m > n/2 the
roles of 'selected' and 'unselected' can be interchanged.
In the uncommon situation where m > 3n/4 we can again
sample by HSEL by determining the list of the elements
to be left outside the sample. In all cases the expected
running time of the new algorithm is low, proportional to
m.
Instead of hashing we could use several other data
structures to maintain the information about the changed
indexes in the set of unselected elements. For example,
by applying search trees we could find a sampling
technique with storage O(m) and running time, both on
average and in the worst case, O(m log m).
REFERENCES
1. S. Goodman and S. Hedetniemi, Introduction to the Design and
Analysis of Algorithms. McGraw-Hill (1977).
2. D. Knuth, The Art of Computer Programming. Vol. 3, 2nd Edn,
Addison-Wesley, Reading, Massachusetts (1975).
Received October 1980