Hidden PL/SQL Gem in 10g: DBMS_FREQUENT_ITEMSET for PL/SQL based Data Mining

Data Mining in PL/SQL applications – Supplied Package DBMS_FREQUENT_ITEMSET (new in 10gR1)

Resource: Oracle 10g Supplied Packages – dbms_frequent_itemset.
This package exposes some part of the Data Mining capability of the Oracle 10g Server through PL/SQL and indirectly SQL. It allows us to quickly scan sets of data to find out if certain items frequently appear in combination. “frequent itemsets provides an efficient mechanism for counting how often multiple events occur together”. The example often used for this is the (super)market basket; upon investigation you will probably find out that pizza and parmezan cheese are often bought together, as are steak, sausages, spareribs, charcoal and barbecue-sauce. The fun of data mining is of course that a) you may find things you did not expect and b) you can apply this technique to many other aspects of life as well.

For example:

  • the occurrence of a certain disease in combination with behavioral or diet-related characteristics,
  • the preference for a certain type of car in combination with properties such as income, profession, preferred make of father/neighbour/friend, age
  • the preference for a political candidate or party; you can think of some of the associated properties…(religion, income, skincolor, corporate associations etc.)
  • the occurrence of a software bug given some of the characteristics of the application, the organization, the technology and the programmer

Of course when you find out that there is a strong likelyhood of certain combinations, the next step is predicting the occurrence of one item as a function of the others.

The resource referred to gives a good code example of applying this package dbms_frequent_itemset to a very simple basket-analysis.

Other resources:
Computing Frequent Itemsets inside Oracle 10G – white paper – (“somewhat scientific, interesting as background but not your best introduction”)
Frequent Itemsets in Oracle10g An Oracle White Paper (August 2003) (“very good overview of what it is and does, how it fits in with Data Mining, and where Data Mining can take you a step further”).

Using Frequent Itemset to spot patterns in Movie Preferences or: How to please a 5-year old?

In this example, I will apply PL/SQL based Data Mining exposed through dbms_frequent_itemset to a collection of movie-preference data. The following very simple data model shows three tables that hold respectively data on AGE_CATEGORIES, MOVIES and the Movie Top 3 selections made by a bunch of fictive people.
Hidden PL/SQL Gem in 10g: DBMS_FREQUENT_ITEMSET for PL/SQL based Data Mining FrequentItemSet PopularMovies.png
I have collected 27 statements from people of both genders in five different age categories, each picking their favorite three movies from a list with 10 fairly recent movies. From this rather small data-set for performing data-mining, I will nevertheless demonstrate the use and power of Frequent Itemset.

Note: you can download the following files to install the sample-tables and demo-data: MovieTables and MovieData.txt

A brief look into each table:

AGE_CATEGORIES:

 ID DESCRIPTION
--- ----------------
  1 2-6 years
  2 7-12 years
  3 12-15 years
  4 16-21 years
  5 21+ years

MOVIES:

 ID TITLE
--- ------------------------------------------
  1 Shrek 2
  2 Monster Inc
  3 Sharktale
  4 Bridget Jones - On the edge of reason
  5 Love Actually
  6 Dodgeball
  7 The Matrix Reloaded
  8 K3 en het betoverde medaillon
  9 Lord of The Rings
 10 Finding Nemo

MOVIE_PREFERENCES:

   ID AGE_CATEGORY G   MVE_ID_1   MVE_ID_2   MVE_ID_3
----- ------------ - ---------- ---------- ----------
    1            5 M          9          7          5
    2            5 M          9         10          2
    3            5 M         10          7          9
    4            4 M          6          7          9
    5            4 M          6          7          9
    6            4 F          4          7          9
    7            4 F          4          3          9
    8            4 F          4          5          6
    9            5 F          4          5          3
   10            5 F          4          5          3
   ..... 26 records

It is important to realize exactly what we see in the MOVIE_PREFERENCES table: 26 people, categorized by age and gender, pick their three favorite movies out of 10. For example, record with id=3: A Male person, highest age category and fan of Finding Nemo, The Matrix Reloaded and Lord of the Rings. We will investigate whether certain movies often occur together in the personal top three AND we will see whether certain movies frequently pair up with specific gender or age categories.

The most basic of analysis is made with a query such as this one. It first requires the creating of a Nested Table type in the database:

CREATE TYPE fi_number_nt AS TABLE OF NUMBER;

Then we ask dbms_frequent_itemset for a first rough analysis:

SELECT CAST (itemset as FI_number_NT) itemset
,      support
,      length
,      total_tranx
FROM   table( DBMS_FREQUENT_ITEMSET.FI_TRANSACTIONAL
              (  cursor
                 ( SELECT id, mve_id_1 FROM movie_preferences mpe
                   union
                   SELECT id, mve_id_2 FROM movie_preferences mpe
                   union
                   SELECT id, mve_id_3 FROM movie_preferences mpe
                   union
                   SELECT id, 1000* age_category FROM movie_preferences mpe
                   union
                   SELECT id, case mpe.gender when 'M' then -1 else -2 end FROM movie_preferences mpe
                 )
              , 0.05
              , 4
              , 6
              , NULL
              , NULL
              )
            )
order
by     support desc
,      length  desc

I have requested a list of all frequently occurring combinations where frequently is defined as 5% or more of the cases (support_threshold = 0.05). The combinations I am interested in should contain at least 4 and no more than 6 items. (itemset_length_min = 4 and itemset_length_max = 6). I have not excluded items from the analysis (excluding_items = null), nor have I listed any items as required in sets that are to be considered (including_items = null).

Note that I had to use the CASE..WHEN construction to turn Gender (values M or F ) into a numeric value: all values in the transaction have to be of the same type – either VARCHAR2 or NUMBER. Furthermore, since all values are considered equal and position in a transaction plays no part, I must ensure that the numeric values used for Gender and Age_Category do not overlap with the values used for MVE_ID. I multiply the age_category (acy_id) by 1000 to make sure those values are in an entirely different range from the mve_id values, and therefore easily recognizable. For the same reason I assign negative numbers for the Gender values.

We used the dbms_frequent_itemset.FI_TRANSACTIONAL procedure, which is defined as follows in the documentation:

DBMS_FREQUENT_ITEMSET.FI_TRANSACTIONAL (
   tranx_cursor         IN    SYSREFCURSOR, -- The cursor parameter that the user will supply when calling the function. It should return two columns in its returning row, the first column being the transaction id, the second column being the item id. The item id must be number or character type (for example, VARCHAR2(n)).
   support_threshold    IN    NUMBER, -- A fraction number of total transaction count. An itemset is termed "frequent" if [the number of transactions it occurs in] divided by [the total number of transactions] exceed the fraction. The parameter must be a NUMBER.
   itemset_length_min   IN    NUMBER, -- The minimum length for interested frequent itemset. The parameter must be a NUMBER between 1 and 20, inclusive
   itemset_length_max   IN    NUMBER, -- The maximum length for interested frequent itemset. This parameter must be a NUMBER between 1 and 20, inclusive, and must not be less than itemset_length_min.
   including_items      IN    SYS_REFCURSOR DEFAULT NULL, -- A cursor from which a list of items can be fetched. At least one item from the list must appear in frequent itemsets that will be returned. The default is NULL.
   excluding_items      IN    SYS_REFCURSOR DEFAULT NULL) -- A cursor from which a list of items can be fetched. No item from the list can appear in frequent itemsets that will returned. The default is NULL.
  RETURN TABLE OF ROW (
     itemset [Nested Table of Item Type DERIVED FROM tranx_cursor], -- A collection of items which is computed as frequent itemset. This will be returned as a nested table of item type which is the item column type of the input cursor.
     support        NUMBER, -- The number of transactions in which a frequent itemset occurs. This will be returned as a NUMBER.
     length         NUMBER, -- Number of items in a frequent itemset. This will be returned as a NUMBER
     total_tranx    NUMBER);-- The total transaction count. This will be returned as a NUMBER, and will be the same for all returned rows, similar to a reporting aggregate.
Usage Notes

The results:

ITEMSET                                     SUPPORT   LENGTH      TOTAL_TRANX
----------------------------------------   ---------- ---------- -----------
FI_NUMBER_NT(-2, 3, 5, 5000)                 3          4          26
FI_NUMBER_NT(-2, 3, 10, 3000)                3          4          26
FI_NUMBER_NT(-2, 4, 5, 5000)                 3          4          26
FI_NUMBER_NT(1, 3, 10, 3000)                 3          4          26
FI_NUMBER_NT(-1, 2, 3, 1000)                 3          4          26
FI_NUMBER_NT(-2, 1, 3, 10, 3000)             2          5          26
... 38 records

This tells us for example that 3 out of 26 transactions contained a combination of: Female, 21+ years, Sharktale, Love Actually. We also found 3 transactions that supported the combination of Female, 12-16 years, Sharktale and Finding Nemo. There is one combination of five values that occurs more than once: Female, 12-16 years, Sharktale, Shrek 2 and Finding Nemo..

To present the data in better way, we have to rewrite the query somewhat:

SELECT present_items( CAST (itemset as FI_number_NT)) itemset
,      support
,      length
,      total_tranx
FROM   table( DBMS_FREQUENT_ITEMSET.FI_TRANSACTIONAL
              (  cursor
                 ( SELECT id, mve_id_1 FROM movie_preferences mpe
                   union
                   SELECT id, mve_id_2 FROM movie_preferences mpe
                   union
                   SELECT id, mve_id_3 FROM movie_preferences mpe
                   union
                   SELECT id, 1000* age_category FROM movie_preferences mpe
                   union
                   SELECT id, case mpe.gender when 'M' then -1 else -2 end FROM movie_preferences mpe
                 )
              , 0.05
              , 4
              , 6

              , NULL
              , NULL
              )
            )
order
by     support desc
,      length  desc

The results this time are more meaningful:

ITEMSET                                                                          SUPPORT     LENGTH TOTAL_TRANX
-------------------------------------------------------------------------------- ---------- ---------- -----------
Female, Sharktale, Love Actually, 21+ years                                        3          4          26
Female, Sharktale, Finding Nemo, 12-15 years                                       3          4          26
Female, Bridget Jones - On the edge of reason, Love Actually, 21+ years            3          4          26
Shrek 2, Sharktale, Finding Nemo, 12-15 years                                      3          4          26
Female, Shrek 2, Sharktale, Finding Nemo, 12-15 years                              2          5          26
Male, Dodgeball, The Matrix Reloaded, Lord of The Rings, 16-21 years               2          5          26
Female, Sharktale, Bridget Jones - On the edge of reason, Love Actually, 21+ years 2          5          26
...

The function present_items that I created to support this query is defined as follows:

create or replace
function present_items
( p_itemset in FI_NUMBER_NT
) return varchar2
is
  l_return_value varchar2(1000):='';
  l_help varchar2(100):='';

  procedure add(p_string in varchar2)
  is
  begin
    l_return_value:= l_return_value||p_string||', ';
  end add;

begin
  for i in 1..p_itemset.count loop
    if p_itemset(i) < 0
    then
      case p_itemset(i) when '-1' then add('Male'); else add('Female'); end case;
    elsif p_itemset(i) > 1000
    then
      select description
      into   l_help
      from   age_categories
      where  id = p_itemset(i)/1000
      ;
      add(l_help);
    else
      select title
      into   l_help
      from   movies
      where  id = p_itemset(i)
      ;
      add(l_help);
    end if;
  end loop;
  return substr(l_return_value, 1, length(l_return_value)-2);
end;

siness Problem – What movie to rent to entertain my nephew?

Now we can try to attack a more interesting business issue: Our nephew is visiting; he is 5 years old and he enjoyed Finding Nemo. What movie should we rent for him?

DBMS_FREQUENT_ITEMSET can find out for us which combinations of movie-preferences occur for Males in the youngest age-category that also include Finding Nemo:

select items.column_value "Recommended Movie"
from   ( SELECT CAST(itemset as FI_varchar_NT) itemset
         ,      support
         ,      length
         ,      total_tranx
         FROM   table ( DBMS_FREQUENT_ITEMSET.FI_TRANSACTIONAL
                        ( cursor
                          ( SELECT mpe.id, title FROM movie_preferences mpe, movies mve where mve.id = mpe.mve_id_1
                            union
                            SELECT mpe.id, title FROM movie_preferences mpe, movies mve where mve.id = mpe.mve_id_2
                            union
                            SELECT mpe.id, title FROM movie_preferences mpe, movies mve where mve.id = mpe.mve_id_3
                            union
                            SELECT mpe.id, acy.description FROM movie_preferences mpe, age_categories acy  where acy.id = mpe.age_category
                            union
                            SELECT mpe.id, mpe.gender FROM movie_preferences mpe
                          )
                        , 0.05
                        , 4
                        , 6
                        , CURSOR( SELECT * FROM table(FI_VARCHAR_NT('Finding Nemo')))
                        , NULL
                        )
                      )
        ) it
,       table(it.itemset) items
,       movies mve
where support > 1
and   set_contains_item( itemset, '2-6 years' ) = 1
and   set_contains_item( itemset, 'M' ) = 1
and   mve.title = items.column_value
and   mve.title != 'Finding Nemo'

At the heart of this query, we ask for Frequent Itemsets that contain at least Finding Nemo, in total between 4 and 6 items and a support of at least 5% (meaning that at least 5% of the transactions contains the frequent combination -otherwise we do not want to consider it frequent). Furthermore – and somewhat superfluously – we only care for frequent itemsets supported by at least two (>1) transactions. Note that this time we choose to present all data as VARCHAR2 rather than NUMBER. In this query we make use of stored function set_contains_item that was defined as follows:

create or replace
function set_contains_item
( p_itemset in FI_varchar_NT
, p_value   in varchar2
) return number
is
begin
  for i in 1..p_itemset.count loop
    if p_itemset(i) = p_value
    then
      return 1;
    end if;
  end loop;
  return 0;
end;

The result of this query is quite clear:

Recommended Movie
------------------------------
Sharktale

So we will install His Nephewness in front of the TV set, watching Sharktale! Data Mining showed us the way.

Some similar investigations: Recommended Movie for a young adult male (16-21 years) who expressed some enjoyment over Dodgeball:

select items.column_value "Recommended Movie"
from   ( SELECT CAST(itemset as FI_varchar_NT) itemset
         ,      support
         ,      length
         ,      total_tranx
         FROM   table ( DBMS_FREQUENT_ITEMSET.FI_TRANSACTIONAL
                        ( cursor
                          ( SELECT mpe.id, title FROM movie_preferences mpe, movies mve where mve.id = mpe.mve_id_1
                            union
                            SELECT mpe.id, title FROM movie_preferences mpe, movies mve where mve.id = mpe.mve_id_2
                            union
                            SELECT mpe.id, title FROM movie_preferences mpe, movies mve where mve.id = mpe.mve_id_3
                            union
                            SELECT mpe.id, acy.description FROM movie_preferences mpe, age_categories acy  where acy.id = mpe.age_category
                            union
                            SELECT mpe.id, mpe.gender FROM movie_preferences mpe
                          )
                        , 0.05
                        , 4
                        , 6
                        , CURSOR( SELECT * FROM table(FI_VARCHAR_NT('Dodgeball')))
                        , NULL
                        )
                      )
        ) it
,       table(it.itemset) items
,       movies mve
where support > 1
and   set_contains_item( itemset, '16-21 years' ) = 1
and   set_contains_item( itemset, 'M' ) = 1
and   mve.title = items.column_value
and   mve.title != 'Dodgeball'

As far as DBMS_FREQUENT_ITEMSET can tell based on a relatively small set of data, he might be interested in:

Recommended Movie
-------------------------
The Matrix Reloaded

Finally I try to find a movie that my girlfriend will enjoy – knowing that she liked “Love Actually”:

select items.column_value "Recommended Movie"
from   ( SELECT CAST(itemset as FI_varchar_NT) itemset
         ,      support
         ,      length
         ,      total_tranx
         FROM   table ( DBMS_FREQUENT_ITEMSET.FI_TRANSACTIONAL
                        ( cursor
                          ( SELECT mpe.id, title FROM movie_preferences mpe, movies mve where mve.id = mpe.mve_id_1
                            union
                            SELECT mpe.id, title FROM movie_preferences mpe, movies mve where mve.id = mpe.mve_id_2
                            union
                            SELECT mpe.id, title FROM movie_preferences mpe, movies mve where mve.id = mpe.mve_id_3
                            union
                            SELECT mpe.id, acy.description FROM movie_preferences mpe, age_categories acy  where acy.id = mpe.age_category
                            union
                            SELECT mpe.id, mpe.gender FROM movie_preferences mpe
                          )
                        , 0.05
                        , 4
                        , 6
                        , CURSOR( SELECT * FROM table(FI_VARCHAR_NT('Love Actually')))
                        , NULL
                        )
                      )
        ) it
,       table(it.itemset) items
,       movies mve
where support > 1
and   set_contains_item( itemset, '21+ years' ) = 1
and   set_contains_item( itemset, 'F' ) = 1
and   mve.title = items.column_value
and   mve.title != 'Love Actually'
Recommended Movie
------------------------------------------------
Bridget Jones - On the edge of reason

Associations in a strange country

Another way to make use of the data we assembled and the functionality offered by DBMS_FREQUENT_ITEMSET could be this situation:

I arrive in a strange city in a foreign country. I walk around town and see billboard for a certain movie: “K3 en het betoverde medaillon”. I do not know the movie, and I am wondering whether I might like it. So now based on the movie-preference data gathered I would like to know:
– what age groups and genders are interested in this movie?
– what other movies can this movie be compared to – or at least, which movies are often liked by people who also like this movie?
– or, putting it altogether, which movies that I do know should be liked by someone with my age-category and my gender to have a good possibility to also like this movie? For example: if I see posters for Lord of the Rings part 3, chances are that in general people who liked part 2 also liked part 3.

Issue with Horizontal Data in DBMS_FREQUENT_ITEMSET

Note: dbms_frequent_itemset allows us to present data either transactionally (all data in records that consist of a transaction identifier and a single value, i.e. many records per transaction) or horizontally (one record per transaction that contains all values). Our data in MOVIE_PREFERENCES is already in horizontal format, so that is the obvious choice when using dbms_frequent_itemset. However, I ran into some problems – presumably bugs – with the horizontal data approach, so I switched to the transactional procedure in dbms_frequent_itemset.

The following query should have worked I believe:

SELECT CAST(itemset as FI_varchar_NT) itemset, support, length, total_tranx
FROM   table( DBMS_FREQUENT_ITEMSET.FI_HORIZONTAL
              ( CURSOR ( SELECT acy.description
                         ,      gender
                         ,      mve1.title
                         ,      mve2.title
                         ,      mve3.title
                         FROM   movie_preferences mpe
                         ,      movies  mve1
                         ,      movies  mve2
                         ,      movies  mve3
                         ,      age_categories acy
                         where  mpe.age_category = acy.id
                         and    mpe.mve_id_1 = mve1.id
                         and    mpe.mve_id_2 = mve2.id
                         and    mpe.mve_id_3 = mve3.id
                       )
               , 0.2
               , 3
               , 6
               , null
               , null
               )
              );

However, after some serious processor activity, it returns with:

SELECT CAST(itemset as FI_varchar_NT) itemset, support, length, total_tranx
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qqesfCBM_QBReadNext.1], [0], [],
[], [], [], [], []

4 Comments

  1. Pingback: glamour-agency September 28, 2007
  2. Marcos M. Campos January 7, 2006
  3. Wei September 22, 2005
  4. Mike Friedman July 30, 2005
  5. Wei Li June 20, 2005