Writing a pure SQL based String Tokenizer

9

A quite common concept in many programming langiages, though perhaps strongest in Java, is the String Tokenizer. This is a utility class or function that takes a String as input, and optionally one or more delimiters, and breaks up the string in individual elements, returning those as an appropriate collection. As an example: a String Tokenizer could split the string “red,blue,yellow” into a collection of the strings red, blue and yellow – if the comma had been specified as delimiter or no delimiter had been defined and the comma was the default. With the “e” as delimiter, the String Tokenizer would have produced: “r”, “d,blu”,”,y” and “llow”.

PL/SQL has something akin to the String Tokenizer, in the DBMS_UTILITY.COMMA_TO_TABLE function. By the way, my colleague Alex discussed this function in quite a bit of detail in his recent article Never Assume, Always Test.

While preparing for our session “SQL Puzzles” (next Tuesday, 13th February) I needed a String Tokenizer in SQL, ideally nothing but SQL. Again, with a different objective in mind, I wrote a SQL Query that indeed acts as a Tokenizer. I am quite sure it can be done much better – Anton: please share with me your thoughts on this – but I will share my attempt with you anyway.

The query is this one:

with numbers as
(  select rownum rn
   from   ( select count(*) rn
            from   dual
            group
            by     cube(1,2,3,4)
          )
)
, input as
( select 'red,blue,yellow' string_to_tokenize
  ,      'e' delimiter
  from   dual
)
, positions as
  (  select rn pos
   from   numbers
          join
          input
   on     instr(string_to_tokenize, delimiter, 1, rn) > 0
   union
   select 0 pos
   from   input
   where  instr(string_to_tokenize, delimiter) > 0
)
select  substr( string_to_tokenize
        , 1 + case pos when 0 then 0 else instr(string_to_tokenize, delimiter, 1, pos) end
        , case instr(string_to_tokenize, delimiter, 1, pos+1) when 0 then length(string_to_tokenize)+2 else instr(string_to_tokenize, delimiter, 1, pos+1)+1 end
          - ( 2 + case pos when 0 then 0 else instr(string_to_tokenize, delimiter, 1, pos) end)) token
from     positions
,        input

When I run the query on , the results are:

TOKEN
------------------
red
blue
yellow

And when I use “e” as the delimiter, the results are:

TOKEN
-------------
r
d,blu
,y
llow

 

Note that the inline-view numbers is somewhat limited as right now it will return no more than 16 rows. There are of course several ways to increase or tune that number, one would be to have numbers adapt to the inline-view input, that should probably be defined first in the query.

Also note that this query could be run using SQL*Plus parameters, like this:

SQL> with numbers as
  2  (  select rownum rn
  3     from   ( select count(*) rn
  4              from   dual
  5              group
  6              by     cube(1,2,3,4)
  7            )
  8  )
  9  , input as
 10  ( select '&String_To_Tokenizer' string_to_tokenize
 11    ,      '&Delimiter' delimiter
 12    from   dual
 13  )
 14  , positions as
 15    (  select rn pos
 16     from   numbers
 17            join
 18            input
 19     on     instr(string_to_tokenize, delimiter, 1, rn) > 0
 20     union
 21     select 0 pos
 22     from   input
 23     where  instr(string_to_tokenize, delimiter) > 0
 24  )
 25  select  substr( string_to_tokenize
 26          , 1 + case pos when 0 then 0 else instr(string_to_tokenize, delimiter, 1, pos) end
 27          , case instr(string_to_tokenize, delimiter, 1, pos+1) when 0 then length(string_to_toke
nize)+2 else instr(string_to_tokenize, delimiter, 1, pos+1)+1 end
 28            - ( 2 + case pos when 0 then 0 else instr(string_to_tokenize, delimiter, 1, pos) end)
) token
 29  from     positions
 30  ,        input
 31  /
Enter value for string_to_tokenizer: in the land of the pigs, the butcher is king (by Meat Loaf)
old  10: ( select '&String_To_Tokenizer' string_to_tokenize
new  10: ( select 'in the land of the pigs, the butcher is king (by Meat Loaf)' string_to_tokenize
Enter value for delimiter:  
old  11:   ,      '&Delimiter' delimiter
new  11:   ,      ' ' delimiter

TOKEN
--------------------------------------------------------------------------------
in
the
land
of
the
pigs,
the
butcher
is
king
(by

TOKEN
--------------------------------------------------------------------------------
Meat
Loaf)

13 rows selected.

Finally, the query should be extended to support multiple delimiters.

Share.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor on diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book: Oracle SOA Suite 11g Handbook. Frequent presenter on conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

9 Comments

  1. SELECT SUBSTR(X,Y + 1, Z – 1 – Y)
    FROM (
    SELECT X, INSTR(X,DELIMITER,1,ROWNUM) Y, INSTR(X,DELIMITER,1,ROWNUM+1) Z
    FROM (SELECT ‘,’ || ‘A,BB,CC’ || ‘,’ X, ‘,’ DELIMITER FROM DUAL), ALL_TABLES
    WHERE INSTR(X,DELIMITER,1,ROWNUM+1) != 0
    )

  2. hi guyz i tried to develop a number puzzle in forms6i got strucked in middle .. can any one help me out ..
    declare
    temp varchar2(40);
    lbl varchar2(40);
    begin
    lbl:=get_item_property(name_in(‘system.trigger_item’),label);
    temp:= :system.trigger_item;
    if get_item_property(temp,VISIBLE)= ‘TRUE’ then
    set_item_property(:parameter.HIDENUM,VISIBLE,PROPERTY_TRUE);
    set_item_property(:parameter.HIDENUM,ENABLED,PROPERTY_TRUE);
    set_item_property(:parameter.HIDENUM,label,lbl);
    go_item(:parameter.HIDENUM);
    set_item_property(temp,VISIBLE,PROPERTY_FALSE);
    :parameter.HIDENUM:=temp;
    end if;
    /*if get_item_property(‘number_game.TWO’,displayed)=’TRUE’ then
    set_item_property(‘number_game.TWO’,displayed,’FALSE’);
    end if;
    if get_item_property(‘number_game.THREE’,displayed)=’TRUE’ then
    set_item_property(‘number_game.THREE’,displayed,’FALSE’);
    end if;*/
    if lbl =’EXIT’ then exit_form(no_validate);
    end if;
    end;

  3. Hi,
    I found and already used a nice solution from O’Reillys "SQL Cookbook".
    Those days I needed to convert delimited data into a multi-valued IN-List.
    I just changed the query a bit to work as a string tokenizer. See:

    SELECT RTRIM(
    SUBSTR( x.token
    , INSTR(x.token, ‘&Delimiter’, 1, iter.pos) + 1
    ,INSTR(x.token, ‘&Delimiter’, 1, iter.pos + 1) – INSTR(x.token, ‘&Delimiter’, 1, iter.pos)
    )
    , ‘&Delimiter’
    ) token
    , iter.pos AS Position
    FROM (SELECT ‘&Delimiter’ || ‘&StringToTokenize’ || ‘&Delimiter’ token
    FROM dual
    ) x
    , (SELECT rownum AS pos
    FROM all_objects
    ) iter
    WHERE iter.pos

  4. My own variant using the MODEL clause

    SELECT var_list, var_list2
    FROM ( select ‘red,blue,yellow’ var_list,’,’ delim from dual )
    MODEL
    PARTITION BY (var_list, delim)
    DIMENSION BY (0 AS z)
    MEASURES (var_list var_list2,
    0 vcnt, 0 as vstart, 0 as vend)
    RULES ITERATE (4000) UNTIL (ITERATION_NUMBER+1 = vcnt[0] )
    (vcnt[ITERATION_NUMBER]=
    nvl(length(cv(var_list)) – length(replace(cv(var_list),cv(delim))) + 1,1),
    vstart[ITERATION_NUMBER] =
    case when ITERATION_NUMBER = 0 then 1
    else instr(cv(var_list),cv(delim),1,ITERATION_NUMBER)+1 end,
    vend[ITERATION_NUMBER] =
    case when nvl(instr(cv(var_list),cv(delim),1,ITERATION_NUMBER+1),0) = 0
    then nvl(length(cv(var_list))+1,1)
    else instr(cv(var_list),cv(delim),1,ITERATION_NUMBER+1) end,
    var_list2[ITERATION_NUMBER] =
    substr(cv(var_list),vstart[ITERATION_NUMBER],
    vend[ITERATION_NUMBER] – vstart[ITERATION_NUMBER])
    )
    order by var_list, vstart;

  5. My thoughts about this (only because you asked for it :) ):

    with input as <br />( select 'red,blue,yellow' string_to_tokenize <br />  , ',' delimiter <br />  from dual <br />)<br />, temp as <br />( select delimiter || string_to_tokenize || delimiter stt <br />, delimiter del <br />  from input <br />) <br />select substr( stt, instr( stt, del, 1, d ) + 1<br />,      instr( stt, del, 1, d + 1 ) - instr( stt, del, 1, d ) - 1 ) <br />from   temp <br />model <br />  partition by ( stt, del ) <br />  dimension by ( 0 x ) <br />  measures ( 1 d, ( select length( translate( stt, del || stt, del ) ) - 2 <br />                    from   temp <br />                  ) noe <br />            ) <br />  rules iterate( 1000 ) <br />  until iteration_number &gt;= noe[0] ( d[iteration_number] = iteration_number + 1 ) <br />

    For the thoughts of some other people on something simulare, see Adrian’s Oracle Page at http://mysite.wanadoo-members.co.uk/adriansoraclepages/demos/10g/scripts/string_to_table_sql_only.sql
    Anton

  6. Hi Lucas,

    you are keeping me away from work! But a SQL puzzle in the morning is good for my brain! :-)
    Here is my alternate solution using CONNECT BY 

    SELECT LEVEL <br />,      SUBSTR ( STRING_TO_TOKENIZE <br />              , DECODE( LEVEL, 1, 1, INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL-1)+1)<br />              , INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL) <br />                - DECODE( LEVEL, 1, 1, INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL-1)+1)<br />              ) <br />FROM ( SELECT '&amp;String_To_Tokenize'||'&amp;Delimiter' AS STRING_TO_TOKENIZE <br />       ,      '&amp;Delimiter' AS DELIMITER <br />       FROM DUAL <br />     ) <br />CONNECT <br />BY     INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL) &gt; 0 <br />ORDER <br />BY     LEVEL ASC ; <br />

    Patrick
    Check out my APEX blog: http://inside-apex.blogspot.com/