During one of the sessions on last week’s ODTUG 2006 conference, I saw a very interesting demo by Keith Laker. He discussed the problem of having several dozens of dimensions or descriptors for a particular fact and utter uselessness of a report with more than 40 or 50 qualifiers. Like an overview of salesrecords with per customer information such as name, birthdate, gender, zipcode, number of pets, political affiliation, job, health history, favorite author et. Each one of these pieces of information may be meaningful, but when all presented in a report, it becomes impossible to extract meaningful information. So he suggested an approach for determining which of all these qualifiers are the most useful ones and only including those in the report.
Using Oracle Data Mining PL/SQL API – embedded in the Oracle 10gR2 database – we can easily find out which dimensions are important ones. In his demonstration, he extracted the top 5 attributes from a set of over 50. These top five together determined the fact for more than 95 % and therefore the analysis could focus on only these attributes.
The PL/SQL package to use is called DBMS_PREDICTIVE_ANALYTICS. An introduction can be found here. You can the EXPLAIN procedure the name of a Table or View and the name of a target column – the column (fact) whose value you are interested in and for which you would like to know which of the other columns have the largest determining contribution. The results of the EXPLAIN are written to a results table:
From this table, we can easily find the columns with the largest contribution in determining the target column’s value.
For a very accessible introduction to the use of Data Mining from simple SQL queries, read Marcos Campos’ great article: The SQL of Analytics 1 – Data Mining. He shows first how existing Data Mining Models can easily be incorporated into normal SQL queries using the PREDICTION_PROBABILITY operator. With this operator, we can apply a pre-created Data Mining Model to every record evaluated in the SQL query and filter it based on the result of applying the model. He continues to demonstrate how the Data Mining model itself can be created, using pretty straightforward PL/SQL APIs. Data Mining can easily be included in our ‘normal’ applications to further refine our queries and support decision processes.