Simulating sampling in PL/SQL


The standard PL/SQL package dbms_random only allows for drawing uniformly distributed (all numbers equally likely) random numbers. As far as I know, other discrete probability distributions have to be simulated. This is easily accomplished using a little elementary probability theory.

The following code allows for drawing a sample of a size to be specified at runtime from a discrete probability distribution on an arbitrary (though unfortunately necessarily finite) set of positive integers. It is just meant for illustrating a standard math trick and is neither robust nor optimized (nor wellformed :-(. It’s just a quick hack. Perhaps I will look into this.....

The sample space used and its discrete probability distribution are presented as the table prob_tab (num number, weight number), where the first column represents the set numbers to be drawn from and the second the probability of each element presented as a weighing factor. Hence the law is also relatively arbitrary. The sample drawn is inserted into a table sample_tab (num number,freq number), where freq counts the occurences of num. I assume both tables to be present at runtime.

  &nbsp; -- define datatypes to store the probability distribution<br />&nbsp; type&nbsp; prob_rec is record (weight number, distrib number);<br />&nbsp; type&nbsp; prob_tab is table of prob_rec index by pls_integer;<br />&nbsp; <br />&nbsp; prec&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; prob_rec;<br />&nbsp; prob&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; prob_tab;
  &nbsp; total_weight&nbsp; number;
&nbsp; -- set max.precision for the uniform distribution on 0&lt;=x&lt;=1<br />&nbsp; -- dbms_random.random produces -2^31 &lt;= integer &lt;= 2^31<br />&nbsp; uni_prec&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; constant number := power(2,31);<br />&nbsp; uni&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; number;<br />&nbsp; <br />&nbsp; -- load probability data <br />&nbsp; cursor c_prob is <br />&nbsp;&nbsp;&nbsp; select * from prob_dist order by num asc;<br />&nbsp; cursor c_sample (p_num number) is <br />&nbsp;&nbsp;&nbsp; select 'x' from sample_tab where num=p_num;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />&nbsp; i&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; number;<br />&nbsp; cumul_weight&nbsp; number;<br />&nbsp; <br />BEGIN<br />&nbsp; delete sample_tab;<br />&nbsp; select sum(weight) into total_weight from prob_dist;
&nbsp; for r_prob in c_prob loop<br />&nbsp;&nbsp;&nbsp; -- initialize sample_tab table <br />&nbsp;&nbsp;&nbsp; insert into sample_tab values (r_prob.num,0);<br />&nbsp; <br />&nbsp;&nbsp;&nbsp; -- we need the prob.distribution function<br />&nbsp;&nbsp;&nbsp; select sum(weight) into cumul_weight<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; from prob_dist<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; where num &lt;= r_prob.num;<br />&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; prec.weight:=r_prob.weight;<br />&nbsp;&nbsp;&nbsp; prec.distrib:=cumul_weight/total_weight;&nbsp; <br />&nbsp;&nbsp;&nbsp; prob(r_prob.num) := prec;
  &nbsp; end loop;
  &nbsp; -- draw the sample<br />&nbsp; for t in 1..p_size loop<br />&nbsp;&nbsp;&nbsp; -- draw a unif.distributed random number<br />&nbsp;&nbsp;&nbsp; uni := mod(abs(dbms_random.random),uni_prec)/uni_prec;<br />&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; -- find the corresp.subinterval of [0,1]<br />&nbsp;&nbsp;&nbsp; i := prob.first; <br />&nbsp;&nbsp;&nbsp; loop<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; prec:=prob(i); <br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; exit when uni &lt;= prec.distrib;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i :=;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; exit when (i is null);<br />&nbsp;&nbsp;&nbsp; end loop;<br />&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; -- store the sample<br />&nbsp;&nbsp;&nbsp; if (i is not null) then<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; update sample_tab set freq = freq + 1 where num = i;<br />&nbsp;&nbsp;&nbsp; end if;<br />&nbsp; end loop;<br />&nbsp; commit;<br />END;<br />

The trick consists of partitioning the unit interval into subintervals of length corresponding to the weight of the numbers in the sample space and determining in which subinterval a dbms_random gererated uniformly distributed number falls. The index of the subinterval yields the number drawn.

Put a little more mathematically, what is essentially does is apply the generalized or Penrose inverse of the desired probabilty distribution function to a (continuous) uniformly distributed (on 0<=x<=1) random variable. This yields a random variable with the correct distribution.

The code can of course be generalized allow for sampling from a set of strings but my point should be clear.


About Author

Comments are closed.