Simulating sampling in PL/SQL Oracle Headquarters Redwood Shores1 e1698667100526

Simulating sampling in PL/SQL

The standard PL/SQL package dbms_random only allows for drawing uniformly distributed (all numbers equally likely) random numbers. As far as I know, other discrete probability distributions have to be simulated. This is easily accomplished using a little elementary probability theory.

The following code allows for drawing a sample of a size to be specified at runtime from a discrete probability distribution on an arbitrary (though unfortunately necessarily finite) set of positive integers. It is just meant for illustrating a standard math trick and is neither robust nor optimized (nor wellformed :-(. It’s just a quick hack. Perhaps I will look into this.

The sample space used and its discrete probability distribution are presented as the table prob_tab (num number, weight number), where the first column represents the set numbers to be drawn from and the second the probability of each element presented as a weighing factor. Hence the law is also relatively arbitrary. The sample drawn is inserted into a table sample_tab (num number,freq number), where freq counts the occurences of num. I assume both tables to be present at runtime.

CREATE OR REPLACE
PROCEDURE draw_sample (p_size IN NUMBER) AS
    -- define datatypes to store the probability distribution
  type  prob_rec is record (weight number, distrib number);
  type  prob_tab is table of prob_rec index by pls_integer;
  
  prec        prob_rec;
  prob        prob_tab;
    total_weight  number;
  -- set max.precision for the uniform distribution on 0<=x<=1
  -- dbms_random.random produces -2^31 <= integer <= 2^31
  uni_prec      constant number := power(2,31);
  uni           number;
  
  -- load probability data 
  cursor c_prob is 
    select * from prob_dist order by num asc;
  cursor c_sample (p_num number) is 
    select 'x' from sample_tab where num=p_num;
      
  i             number;
  cumul_weight  number;
  
BEGIN
  delete sample_tab;
  select sum(weight) into total_weight from prob_dist;
  for r_prob in c_prob loop
    -- initialize sample_tab table 
    insert into sample_tab values (r_prob.num,0);
  
    -- we need the prob.distribution function
    select sum(weight) into cumul_weight
      from prob_dist
        where num <= r_prob.num;
    
    prec.weight:=r_prob.weight;
    prec.distrib:=cumul_weight/total_weight;  
    prob(r_prob.num) := prec;
    end loop;
    -- draw the sample
  for t in 1..p_size loop
    -- draw a unif.distributed random number
    uni := mod(abs(dbms_random.random),uni_prec)/uni_prec;
    
    -- find the corresp.subinterval of [0,1]
    i := prob.first; 
    loop
      prec:=prob(i); 
      exit when uni <= prec.distrib;
      i := prob.next(i);
      exit when (i is null);
    end loop;
    
    -- store the sample
    if (i is not null) then
      update sample_tab set freq = freq + 1 where num = i;
    end if;
  end loop;
  commit;
END;

The trick consists of partitioning the unit interval into subintervals of length corresponding to the weight of the numbers in the sample space and determining in which subinterval a dbms_random gererated uniformly distributed number falls. The index of the subinterval yields the number drawn.

Put a little more mathematically, what is essentially does is apply the generalized or Penrose inverse of the desired probabilty distribution function to a (continuous) uniformly distributed (on 0<=x<=1) random variable. This yields a random variable with the correct distribution.

The code can of course be generalized eg.to allow for sampling from a set of strings but my point should be clear.