Simulating sampling in PL/SQL
The standard PL/SQL package dbms_random only allows for drawing uniformly distributed (all numbers equally likely) random numbers. As far as I know, other discrete probability distributions have to be simulated. This is easily accomplished using a little elementary probability theory.
The following code allows for drawing a sample of a size to be specified at runtime from a discrete probability distribution on an arbitrary (though unfortunately necessarily finite) set of positive integers. It is just meant for illustrating a standard math trick and is neither robust nor optimized (nor wellformed
. It’s just a quick hack. Perhaps I will look into this.
The sample space used and its discrete probability distribution are presented as the table prob_tab (num number, weight number), where the first column represents the set numbers to be drawn from and the second the probability of each element presented as a weighing factor. Hence the law is also relatively arbitrary. The sample drawn is inserted into a table sample_tab (num number,freq number), where freq counts the occurences of num. I assume both tables to be present at runtime.
CREATE OR REPLACE<br />PROCEDURE draw_sample (p_size IN NUMBER) AS -- define datatypes to store the probability distribution<br /> type prob_rec is record (weight number, distrib number);<br /> type prob_tab is table of prob_rec index by pls_integer;<br /> <br /> prec prob_rec;<br /> prob prob_tab; total_weight number; -- set max.precision for the uniform distribution on 0<=x<=1<br /> -- dbms_random.random produces -2^31 <= integer <= 2^31<br /> uni_prec constant number := power(2,31);<br /> uni number;<br /> <br /> -- load probability data <br /> cursor c_prob is <br /> select * from prob_dist order by num asc;<br /> cursor c_sample (p_num number) is <br /> select 'x' from sample_tab where num=p_num;<br /> <br /> i number;<br /> cumul_weight number;<br /> <br />BEGIN<br /> delete sample_tab;<br /> select sum(weight) into total_weight from prob_dist; for r_prob in c_prob loop<br /> -- initialize sample_tab table <br /> insert into sample_tab values (r_prob.num,0);<br /> <br /> -- we need the prob.distribution function<br /> select sum(weight) into cumul_weight<br /> from prob_dist<br /> where num <= r_prob.num;<br /> <br /> prec.weight:=r_prob.weight;<br /> prec.distrib:=cumul_weight/total_weight; <br /> prob(r_prob.num) := prec; end loop; -- draw the sample<br /> for t in 1..p_size loop<br /> -- draw a unif.distributed random number<br /> uni := mod(abs(dbms_random.random),uni_prec)/uni_prec;<br /> <br /> -- find the corresp.subinterval of [0,1]<br /> i := prob.first; <br /> loop<br /> prec:=prob(i); <br /> exit when uni <= prec.distrib;<br /> i := prob.next(i);<br /> exit when (i is null);<br /> end loop;<br /> <br /> -- store the sample<br /> if (i is not null) then<br /> update sample_tab set freq = freq + 1 where num = i;<br /> end if;<br /> end loop;<br /> commit;<br />END;<br />
The trick consists of partitioning the unit interval into subintervals of length corresponding to the weight of the numbers in the sample space and determining in which subinterval a dbms_random gererated uniformly distributed number falls. The index of the subinterval yields the number drawn.
Put a little more mathematically, what is essentially does is apply the generalized or Penrose inverse of the desired probabilty distribution function to a (continuous) uniformly distributed (on 0<=x<=1) random variable. This yields a random variable with the correct distribution.
The code can of course be generalized eg.to allow for sampling from a set of strings but my point should be clear.
Recent Comments