Tag Cloud analysis in SQL and PL/SQL – Using Collections, Table Operator, Multiset and Collect operator

3

Another world shattering topic: analyzing tag clouds. This article describes a very simple first step for some analysis at database level of tags that have been associated with database records this analysis can be the foundation for presenting the tag cloud in the user interface that enables the user to quickly filter on records or list ‘similar’ records. We will use some valuable SQL and PL/SQL facilities: Collections, the Table operator and the Collect aggregator.

The tags for Employee records are stored in a new tags column in table EMP:

alter table emp
add ( tags varchar2(4000))
/

We create some tag values with these update statements:

update emp
set    tags = 'sales, deal, offer, discount'
where  job='SALESMAN'
/

update emp
set    tags = 'sales, chief, boss, in charge '
where  ename = 'KING'
/

update emp
set    tags = 'discount, deal, cafetaria, sushi'
where  deptno = 10
/

update emp
set    tags = 'saving, discount, money'
where  sal < 1650
/

Note that tags are comma separated; a tag can consist of multiple words and the tags column may contain many tags for a record.

We will be dealing with Collections of Tags. The collection (nested table) type that we will use for handling the tags is defined like this:

create or replace type tags_tbl_type
as table of varchar2(4000)

One of the things we need to do is extract a table of tags from a plain comma separated string. This PL/SQL function accepts a string and returns the table with strings that are comma separated in the input parameter:

  create or replace
  function csv_string( p_string in varchar2)
  return tags_tbl_type
  is
    l_string varchar2(32000):= p_string;
    l_comma_index integer:=0;
    l_segments_tbl tags_tbl_type  := tags_tbl_type();
    l_ctr number(2):=1;
  begin
    loop
      l_comma_index:= instr( l_string,',');
      if nvl(l_comma_index,0) > 0
      then
        l_segments_tbl.extend;
        l_segments_tbl(l_segments_tbl.last):= substr(l_string, 1, l_comma_index - 1);
        l_string := trim (both from substr(l_string, l_comma_index + 1));
      else
        l_segments_tbl.extend;
        l_segments_tbl(l_segments_tbl.last):= trim(both from l_string);
        exit;
      end if;
      l_ctr:= l_ctr+1;
    end loop;
    return l_segments_tbl;
  end csv_string;

A simple test for this function:

  1  select csv_string('One tag, and another one, my last tag')
  2* from   dual
SQL> /

CSV_STRING('ONETAG,ANDANOTHERONE,MYLASTTAG')
-----------------------------------------------------------------
TAGS_TBL_TYPE('One tag', 'and another one', 'my last tag')

With this function under our belt, we can create a second function. This second function accepts a table of varchar2 and returns an even longer table of varchar2, where the returned table contains all tags that were extracted from the strings in the first table:

create or replace
function csv_to_table
( p_tags_tbl in tags_tbl_type
) return tags_tbl_type
is
  l_tags_tbl tags_tbl_type := tags_tbl_type('');
  l_index integer;
begin
  l_index:= p_tags_tbl.first;
  loop
    exit when l_index is null;
    l_tags_tbl := l_tags_tbl multiset union  csv_string( p_string => p_tags_tbl(l_index));
    l_index:= p_tags_tbl.next(l_index);
  end loop;
  return l_tags_tbl;
end csv_to_table;

A test of this function:

SQL> select csv_to_table( tags_tbl_type('tag one, tag two','tag 3, tag4 , tag5'))
  2  from   dual
  3  /

CSV_TO_TABLE(TAGS_TBL_TYPE('TAGONE,TAGTWO','TAG3,TAG4,TAG5'))
--------------------------------------------------------------------------------
TAGS_TBL_TYPE('', 'tag one', 'tag two', 'tag 3', 'tag4 ', 'tag5')

The Collect aggregator is used to aggregate string values together in a collection (of type table of varchar2). See this example:

SQL> select cast(collect(ename) as tags_tbl_type)
  2  from   emp
  3  /

CAST(COLLECT(ENAME)ASTAGS_TBL_TYPE)
--------------------------------------------------------------------------------
TAGS_TBL_TYPE('SMITH', 'ALLEN', 'WARD', 'JONES', 'MARTIN', 'BLAKE', 'CLARK', 'SC
OTT', 'KING', 'TURNER', 'ADAMS', 'JAMES', 'FORD', 'MILLER')

Using the Collect operator, we can quickly construct the collection with all values in the tags column for all employees. This collection can be passed to the csv_to_table function that will return the overall set of all tags for all employees:

  1  select column_value tag
  2  from   table (csv_to_table( ( select cast(collect(tags) as tags_tbl_type)
  3                                from   emp
  4                               )
  5                            )
  6*              )
SQL> /

TAG
------------------------------------------------------------------------------

saving
discount
money
sales
deal
offer
discount
saving
discount
.... (32 rows in total)

Now we can do a little analysis to find out which tags occur most frequently:

select column_value tag
,      count(column_value) number_of_tag_occurrences
from   table (csv_to_table( ( select cast(collect(tags) as tags_tbl_type)
                              from   emp
                             )
                          )
             )
group
by     column_value
order
by     2 desc

And the results of this query:

TAG                                      NUMBER_OF_TAG_OCCURRENCES
---------------------------------------- -------------------------
discount                                                         9
money                                                            5
saving                                                           5
deal                                                             4
sales                                                            2
cafetaria                                                        2
sushi                                                            2
offer                                                            2

Interesting, is it not?

Now we might be interested in the Employees who are associated with the most popular tag ‘discount’. We can find this out using a query that uses csv_string to turn the tags value to a collection, and the table operator to access that collection as a ‘table’  or at least intermediate query result:

select ename
from   emp
where  exists ( select 0
                from   table( csv_string( emp.tags)) tags
                where  tags.column_value = 'discount'
              )

The outcome, in case you are interested:

ENAME
----------
SMITH
ALLEN
WARD
MARTIN
KING
TURNER
ADAMS
JAMES
MILLER

9 rows selected.

We can also use some of the more recent (10gR2) collection manipulation operators.

For example to find out which employees have at least the tags discount and sales (and perhaps more), we could do:

select ename
from   emp
where  0 = ( select count(*)
             from   table(tags_tbl_type('discount','sales') MULTISET EXCEPT csv_string( emp.tags) ))

And to find the employees who have at least two of the tags ‘chief’,’money’ and ‘discount’, we can use the MULTISET INTERSECT operator:

select ename
from   emp
where  2 <= ( select count(*)
              from   table(  tags_tbl_type('chief','money','discount')
                             MULTISET INTERSECT
                             csv_string(emp.tags)
                          )
            )

with the interesting result:

ENAME
----------
SMITH
WARD
MARTIN
ADAMS
JAMES

Resources

Download Source Code:  tagcloud_analysis_collection_manipulation.txt.

Share.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor on diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book: Oracle SOA Suite 11g Handbook. Frequent presenter on conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

3 Comments

  1. Anton Scheffer on

    There are more useful collection conditions. For instance where you use
    select ename
    from emp
    where 0 = ( select count(*)
    from table(tags_tbl_type(‘discount’,’sales’) MULTISET EXCEPT csv_string( emp.tags) ));
    you can also use
    select ename
    from emp
    where tags_tbl_type(‘discount’,’sales’) submultiset csv_string( emp.tags);
    And there’s also the cardinality function which you can use instead of “select count(*) from table( …. )

  2. Anton,
    Thanks! That member of had escaped my notice. That is quite useful. You are right that the first query is much more compact too.

    groeten, Lucas

  3. Anton Scheffer on

    I have some more tag queries, to make your queries a little shorter

    select column_value, count(*)
    from emp
    , table( csv_string( emp.tags ) )
    group by column_value
    order by 2;

    select ename
    from emp
    where ‘discount’ member of csv_string( emp.tags );