BladeLet: Rand
Introduction:
~~~~~~~~~~~~
This BladeLet implements several Random number generation UDRs. The
idea is to be able to generate random numbers that follow several
different distributions; Uniform, Binomial and Normal (or Gaussian.)
These are particularly useful generating large data sets for testing,
and for certain simple kinds of sampling.
Overview:
The Blade consists of a number of UDRs (user-defined routines) that are
designed to return sequences of random variables that comply with
one or another distribution function. The most common such distribution
is simply Uniform, which means the random function returns values in
a range, all outcomes equally probable. Another set of UDRs deal with
Binomial distributions (sequences of trials of fixed probability and
boolean result) and Normal (or Gaussian) distributions (distribution
around a mean of known variance).
For an overview of the mathematics behind this I recommend an introduction
to statistical theory. For an overview of the algorithms used in this
bladelet I would go to the source[1]. This blade uses its own random
number generation algorithms and is therefore platform neutral.
The following sections explain each of the blade UDRs in turn, and
demonstrates their usage.
Functions:
Note: All of the examples below illustrate how to use the function
in queries involving a table called 'Numbers', which contains a list
of all integers between 0 and 100000. The
Rand() RETURNING REAL
This function returns a random value between 0 and 1.0. It uses the
current time for a random seed, so that if you use it twice in
a given query, or in different queries that run in rapid
succession, the sequence of values it returns will be the same. In
fact, this can be desirable behavior. The series of values it
returns comply with UNIFORM RANDOM properties.
Usage:
SELECT COUNT(*) -- This query will return
FROM Customers C -- a count that is about
WHERE Rand() < 0.5; -- half of the rows in the table.
Rand( INTEGER ) RETURNING INTEGER
This function returns a random value in the range 0 to one less than the
argument passed in.
SeededRand( INTEGER ) RETURNING REAL
This variation on the Rand() function allows you to set the seed.
If you use this function twice in the same query, but use different
seed values, then each function will return a different series of
values. This function is useful in circumstances where you are flinging
the same query at the ORDBMS in very rapid succession, because the
'default' seed value used by the other functions is the current time to
second.
SeededRand( INTEGER, INTEGER ) RETURNING INTEGER
SELECT * -- This query returns a random
FROM Customers C -- customer, assuming that the
WHERE C.Id = SeededRand(101, 10000); -- range of Ids is 0 to 10000.
-- The trick is that if you substitute
SELECT * -- the Rand(10000) function you will
FROM Customers C -- get one Customer for each second.
WHERE C.Id = SeededRand(102, 10000); -- The example here overcomes this.
Binomial ( INTEGER, REAL ) RETURNS INTEGER
This function returns a series of integer values that follow a
BINOMIAL DISTRIBUTION. To understand what this means, consider
tossing a coin ten times and counting the number of heads you
get. If you repeat this experiement many times, and count up
the number of times you get each number of heads (0 through 10)
then the number of numbers will follow a BINOMIAL DISTRIBUTION.
In this function, the first argument corresponds to the number
of trials (10) and the second to the probability of 'success'
(.5 in our case: the chance of getting heads).
SeededBinomial( INTEGER, REAL, INTEGER )
This function is the same as the binomial, only like the
SeededRand() function is takes a seed value, allowing you to
generate two different sequences of results in the one query.
The best way to use this is when you are generating data for
testing. For example, I use Binomials when generating spatial
data for testing. You can also **kind of** use Binomial distributions
with small success probability and resonably large samples to
approximate a ZipF distribution, and other queuing stuff.
INSERT INTO Customers
SELECT .
sp2Pnt( X(T.Location) + (Binomial(0.333, 1000) / 100.0),
Y(T.Location) + (Binomial(0.333, 1000) / 100.0)
)
.
FROM Towns T;
This example illustrates how to get a binomial distribution around
the location of the Town center.
Binomial distributions are useful for generating skewed
data sets. For example, consider the distribution you achieve
with 100 trials, and a probability of 0.3. As with the Rand()
UDR, by default Biomial() uses the current time (to second).
Normal() RETURNING REAL
This function uses the current time as a seed value, and returns
a series of values that correspond to a NORMAL or GAUSSIAN
distribution. i.e. the distribution has a mean of 0.0 and a
variance of 1.0. In theory the resulting values range from
large negative to very large positive.
SeededNormal( INTEGER ) RETURNING REAL
Same as Normal only for user-specified seed value.
SeededNormal( REAL, REAL, INTEGER ) RETURNING REAL
This function returns values with a mean of the first arguments, and
a variance of the second argument. The third integer argument is
the seed.
Usage:
INSERT INTO OtherTable SELECT Normal( 100, 10 ) FROM Students;
[1] Press, William H., Teukolsky, Saul A., Vetterling, William T.,
and Flannery, Brian P. _Numerical_Recipes_in_C_:_The_Art_of_
Scientific_Computing_Second_Edition_ Cambridge University Press,
New York. 1992.