Another world shattering topic: analyzing tag clouds. This article describes a very simple first step for some analysis at database level of tags that have been associated with database records this analysis can be the foundation for presenting the tag cloud in the user interface that enables the user to quickly filter on records or list ‘similar’ records. We will use some valuable SQL and PL/SQL facilities: Collections, the Table operator and the Collect aggregator.
The tags for Employee records are stored in a new tags column in table EMP:
alter table emp add ( tags varchar2(4000)) /
We create some tag values with these update statements:
update emp set tags = 'sales, deal, offer, discount' where job='SALESMAN' / update emp set tags = 'sales, chief, boss, in charge ' where ename = 'KING' / update emp set tags = 'discount, deal, cafetaria, sushi' where deptno = 10 / update emp set tags = 'saving, discount, money' where sal < 1650 /
Note that tags are comma separated; a tag can consist of multiple words and the tags column may contain many tags for a record.
We will be dealing with Collections of Tags. The collection (nested table) type that we will use for handling the tags is defined like this:
create or replace type tags_tbl_type as table of varchar2(4000)
One of the things we need to do is extract a table of tags from a plain comma separated string. This PL/SQL function accepts a string and returns the table with strings that are comma separated in the input parameter:
create or replace function csv_string( p_string in varchar2) return tags_tbl_type is l_string varchar2(32000):= p_string; l_comma_index integer:=0; l_segments_tbl tags_tbl_type := tags_tbl_type(); l_ctr number(2):=1; begin loop l_comma_index:= instr( l_string,','); if nvl(l_comma_index,0) > 0 then l_segments_tbl.extend; l_segments_tbl(l_segments_tbl.last):= substr(l_string, 1, l_comma_index - 1); l_string := trim (both from substr(l_string, l_comma_index + 1)); else l_segments_tbl.extend; l_segments_tbl(l_segments_tbl.last):= trim(both from l_string); exit; end if; l_ctr:= l_ctr+1; end loop; return l_segments_tbl; end csv_string;
A simple test for this function:
1 select csv_string('One tag, and another one, my last tag') 2* from dual SQL> / CSV_STRING('ONETAG,ANDANOTHERONE,MYLASTTAG') ----------------------------------------------------------------- TAGS_TBL_TYPE('One tag', 'and another one', 'my last tag')
With this function under our belt, we can create a second function. This second function accepts a table of varchar2 and returns an even longer table of varchar2, where the returned table contains all tags that were extracted from the strings in the first table:
create or replace function csv_to_table ( p_tags_tbl in tags_tbl_type ) return tags_tbl_type is l_tags_tbl tags_tbl_type := tags_tbl_type(''); l_index integer; begin l_index:= p_tags_tbl.first; loop exit when l_index is null; l_tags_tbl := l_tags_tbl multiset union csv_string( p_string => p_tags_tbl(l_index)); l_index:= p_tags_tbl.next(l_index); end loop; return l_tags_tbl; end csv_to_table;
A test of this function:
SQL> select csv_to_table( tags_tbl_type('tag one, tag two','tag 3, tag4 , tag5')) 2 from dual 3 / CSV_TO_TABLE(TAGS_TBL_TYPE('TAGONE,TAGTWO','TAG3,TAG4,TAG5')) -------------------------------------------------------------------------------- TAGS_TBL_TYPE('', 'tag one', 'tag two', 'tag 3', 'tag4 ', 'tag5')
The Collect aggregator is used to aggregate string values together in a collection (of type table of varchar2). See this example:
SQL> select cast(collect(ename) as tags_tbl_type) 2 from emp 3 / CAST(COLLECT(ENAME)ASTAGS_TBL_TYPE) -------------------------------------------------------------------------------- TAGS_TBL_TYPE('SMITH', 'ALLEN', 'WARD', 'JONES', 'MARTIN', 'BLAKE', 'CLARK', 'SC OTT', 'KING', 'TURNER', 'ADAMS', 'JAMES', 'FORD', 'MILLER')
Using the Collect operator, we can quickly construct the collection with all values in the tags column for all employees. This collection can be passed to the csv_to_table function that will return the overall set of all tags for all employees:
1 select column_value tag 2 from table (csv_to_table( ( select cast(collect(tags) as tags_tbl_type) 3 from emp 4 ) 5 ) 6* ) SQL> / TAG ------------------------------------------------------------------------------ saving discount money sales deal offer discount saving discount .... (32 rows in total)
Now we can do a little analysis to find out which tags occur most frequently:
select column_value tag , count(column_value) number_of_tag_occurrences from table (csv_to_table( ( select cast(collect(tags) as tags_tbl_type) from emp ) ) ) group by column_value order by 2 desc
And the results of this query:
TAG NUMBER_OF_TAG_OCCURRENCES ---------------------------------------- ------------------------- discount 9 money 5 saving 5 deal 4 sales 2 cafetaria 2 sushi 2 offer 2
Interesting, is it not?
Now we might be interested in the Employees who are associated with the most popular tag ‘discount’. We can find this out using a query that uses csv_string to turn the tags value to a collection, and the table operator to access that collection as a ‘table’ or at least intermediate query result:
select ename from emp where exists ( select 0 from table( csv_string( emp.tags)) tags where tags.column_value = 'discount' )
The outcome, in case you are interested:
ENAME ---------- SMITH ALLEN WARD MARTIN KING TURNER ADAMS JAMES MILLER 9 rows selected.
We can also use some of the more recent (10gR2) collection manipulation operators.
For example to find out which employees have at least the tags discount and sales (and perhaps more), we could do:
select ename from emp where 0 = ( select count(*) from table(tags_tbl_type('discount','sales') MULTISET EXCEPT csv_string( emp.tags) ))
And to find the employees who have at least two of the tags ‘chief’,’money’ and ‘discount’, we can use the MULTISET INTERSECT operator:
select ename from emp where 2 <= ( select count(*) from table( tags_tbl_type('chief','money','discount') MULTISET INTERSECT csv_string(emp.tags) ) )
with the interesting result:
ENAME ---------- SMITH WARD MARTIN ADAMS JAMES
Resources
Download Source Code: tagcloud_analysis_collection_manipulation.txt.
There are more useful collection conditions. For instance where you use
select ename
from emp
where 0 = ( select count(*)
from table(tags_tbl_type(‘discount’,’sales’) MULTISET EXCEPT csv_string( emp.tags) ));
you can also use
select ename
from emp
where tags_tbl_type(‘discount’,’sales’) submultiset csv_string( emp.tags);
And there’s also the cardinality function which you can use instead of “select count(*) from table( …. )
Anton,
Thanks! That member of had escaped my notice. That is quite useful. You are right that the first query is much more compact too.
groeten, Lucas
I have some more tag queries, to make your queries a little shorter
select column_value, count(*)
from emp
, table( csv_string( emp.tags ) )
group by column_value
order by 2;
select ename
from emp
where ‘discount’ member of csv_string( emp.tags );