Tag Cloud analysis in SQL and PL/SQL - Using Collections, Table Operator, Multiset and Collect operator 20188367001

Tag Cloud analysis in SQL and PL/SQL – Using Collections, Table Operator, Multiset and Collect operator

Another world shattering topic: analyzing tag clouds. This article describes a very simple first step for some analysis at database level of tags that have been associated with database records this analysis can be the foundation for presenting the tag cloud in the user interface that enables the user to quickly filter on records or list ‘similar’ records. We will use some valuable SQL and PL/SQL facilities: Collections, the Table operator and the Collect aggregator.

The tags for Employee records are stored in a new tags column in table EMP:

alter table emp
add ( tags varchar2(4000))
/

We create some tag values with these update statements:

update emp
set    tags = 'sales, deal, offer, discount'
where  job='SALESMAN'
/

update emp
set    tags = 'sales, chief, boss, in charge '
where  ename = 'KING'
/

update emp
set    tags = 'discount, deal, cafetaria, sushi'
where  deptno = 10
/

update emp
set    tags = 'saving, discount, money'
where  sal < 1650
/

Note that tags are comma separated; a tag can consist of multiple words and the tags column may contain many tags for a record.

We will be dealing with Collections of Tags. The collection (nested table) type that we will use for handling the tags is defined like this:

create or replace type tags_tbl_type
as table of varchar2(4000)

One of the things we need to do is extract a table of tags from a plain comma separated string. This PL/SQL function accepts a string and returns the table with strings that are comma separated in the input parameter:

  create or replace
  function csv_string( p_string in varchar2)
  return tags_tbl_type
  is
    l_string varchar2(32000):= p_string;
    l_comma_index integer:=0;
    l_segments_tbl tags_tbl_type  := tags_tbl_type();
    l_ctr number(2):=1;
  begin
    loop
      l_comma_index:= instr( l_string,',');
      if nvl(l_comma_index,0) > 0
      then
        l_segments_tbl.extend;
        l_segments_tbl(l_segments_tbl.last):= substr(l_string, 1, l_comma_index - 1);
        l_string := trim (both from substr(l_string, l_comma_index + 1));
      else
        l_segments_tbl.extend;
        l_segments_tbl(l_segments_tbl.last):= trim(both from l_string);
        exit;
      end if;
      l_ctr:= l_ctr+1;
    end loop;
    return l_segments_tbl;
  end csv_string;

A simple test for this function:

  1  select csv_string('One tag, and another one, my last tag')
  2* from   dual
SQL> /

CSV_STRING('ONETAG,ANDANOTHERONE,MYLASTTAG')
-----------------------------------------------------------------
TAGS_TBL_TYPE('One tag', 'and another one', 'my last tag')

With this function under our belt, we can create a second function. This second function accepts a table of varchar2 and returns an even longer table of varchar2, where the returned table contains all tags that were extracted from the strings in the first table:

create or replace
function csv_to_table
( p_tags_tbl in tags_tbl_type
) return tags_tbl_type
is
  l_tags_tbl tags_tbl_type := tags_tbl_type('');
  l_index integer;
begin
  l_index:= p_tags_tbl.first;
  loop
    exit when l_index is null;
    l_tags_tbl := l_tags_tbl multiset union  csv_string( p_string => p_tags_tbl(l_index));
    l_index:= p_tags_tbl.next(l_index);
  end loop;
  return l_tags_tbl;
end csv_to_table;

A test of this function:

SQL> select csv_to_table( tags_tbl_type('tag one, tag two','tag 3, tag4 , tag5'))
  2  from   dual
  3  /

CSV_TO_TABLE(TAGS_TBL_TYPE('TAGONE,TAGTWO','TAG3,TAG4,TAG5'))
--------------------------------------------------------------------------------
TAGS_TBL_TYPE('', 'tag one', 'tag two', 'tag 3', 'tag4 ', 'tag5')

The Collect aggregator is used to aggregate string values together in a collection (of type table of varchar2). See this example:

SQL> select cast(collect(ename) as tags_tbl_type)
  2  from   emp
  3  /

CAST(COLLECT(ENAME)ASTAGS_TBL_TYPE)
--------------------------------------------------------------------------------
TAGS_TBL_TYPE('SMITH', 'ALLEN', 'WARD', 'JONES', 'MARTIN', 'BLAKE', 'CLARK', 'SC
OTT', 'KING', 'TURNER', 'ADAMS', 'JAMES', 'FORD', 'MILLER')

Using the Collect operator, we can quickly construct the collection with all values in the tags column for all employees. This collection can be passed to the csv_to_table function that will return the overall set of all tags for all employees:

  1  select column_value tag
  2  from   table (csv_to_table( ( select cast(collect(tags) as tags_tbl_type)
  3                                from   emp
  4                               )
  5                            )
  6*              )
SQL> /

TAG
------------------------------------------------------------------------------

saving
discount
money
sales
deal
offer
discount
saving
discount
.... (32 rows in total)

Now we can do a little analysis to find out which tags occur most frequently:

select column_value tag
,      count(column_value) number_of_tag_occurrences
from   table (csv_to_table( ( select cast(collect(tags) as tags_tbl_type)
                              from   emp
                             )
                          )
             )
group
by     column_value
order
by     2 desc

And the results of this query:

TAG                                      NUMBER_OF_TAG_OCCURRENCES
---------------------------------------- -------------------------
discount                                                         9
money                                                            5
saving                                                           5
deal                                                             4
sales                                                            2
cafetaria                                                        2
sushi                                                            2
offer                                                            2

Interesting, is it not?

Now we might be interested in the Employees who are associated with the most popular tag ‘discount’. We can find this out using a query that uses csv_string to turn the tags value to a collection, and the table operator to access that collection as a ‘table’  or at least intermediate query result:

select ename
from   emp
where  exists ( select 0
                from   table( csv_string( emp.tags)) tags
                where  tags.column_value = 'discount'
              )

The outcome, in case you are interested:

ENAME
----------
SMITH
ALLEN
WARD
MARTIN
KING
TURNER
ADAMS
JAMES
MILLER

9 rows selected.

We can also use some of the more recent (10gR2) collection manipulation operators.

For example to find out which employees have at least the tags discount and sales (and perhaps more), we could do:

select ename
from   emp
where  0 = ( select count(*)
             from   table(tags_tbl_type('discount','sales') MULTISET EXCEPT csv_string( emp.tags) ))

And to find the employees who have at least two of the tags ‘chief’,’money’ and ‘discount’, we can use the MULTISET INTERSECT operator:

select ename
from   emp
where  2 <= ( select count(*)
              from   table(  tags_tbl_type('chief','money','discount')
                             MULTISET INTERSECT
                             csv_string(emp.tags)
                          )
            )

with the interesting result:

ENAME
----------
SMITH
WARD
MARTIN
ADAMS
JAMES

Resources

Download Source Code:  tagcloud_analysis_collection_manipulation.txt.

3 Comments

  1. Anton Scheffer March 15, 2009
  2. Lucas Jellema March 7, 2009
  3. Anton Scheffer March 6, 2009