The Oracle 7Up workshop that I teach every few months introduces many Oracle 8i, 9i and some 10g features to experienced SQL and PL/SQL developers with a firm footing in Oracle 7. Many developers have only a vague notion of many of the facilities and objects that Oracle has made available since 1996. Especially now that 8i is no longer supported, we see that many organizations have adopted 9iR2 or will do so in the near future. Besides, many features that previously were Enterprise Edition only have now been made available in the Standard Edition as well. This seems like perfect timing to reacquaint one with very useful SQL and PL/SQL functions. One of the bonus topics in the first two-day block – focused on SQL- is Oracle Text, previously known as InterMedia, the Context Option and SQL*TextRetrieval.
Oracle Text provides functionality search through documents in very advanced ways. See for more details the Oracle Text Application Developer’s Guide 9.2:
Oracle Text is a tool that enables you to build text query applications and document classification applications. Oracle Text provides indexing, word and theme searching, and viewing capabilities for text. The purpose of a text query application is to enable users to find text that contains one or more search terms. The text is usually a collection of documents. A good application can index and search common document formats such as plain text, HTML, XML, or Microsoft Word. For example, an application with a browser interface might enable users to query a company Web site consisting of HTML files, returning those files that match a query.
The second type of application supported by Oracle Text is a document classification application. A document classification application is one that classifies an incoming stream of documents based on its content. They are also known as document routing or filtering applications. For example, an online news agency might need to classify its incoming stream of articles as they arrive into categories such as politics, crime, and sports.
One of the interesting things to note is that while Oracle Text can index and search all types of documents that are stored in your database, it is equally well equipped to provide this functionality for documents that are NOT in the database. They can be on a file system or anywhere on your intranet or internet. You provides references to the documents through directories and filenames or through URLs and Oracle Text will find and index those documents. In this post we will see how we can build a very simple Oracle Text search application for our own Weblog – but we could have done it for any collection of documents on the Internet, such as the Oracle Documentation On Line or any section on OTN. Note: a colleague of mine is currently involved with a document library system that involves Oracle Text indexing and searching of up to tens of millions of records. I hope he will write on this weblog, telling about his field experiences, from a functional as well as from a DBA perspective.
Searching the AMIS Technology Weblog using Oracle Text
We will create a local table that contains URLs that reference the articles on our weblog. Subsequently, we will build a Text Index on this table and more specifically its URL column. During the index creation process, Oracle Text will call out to our weblog to retrieve each document that the URL values refer to. The document will be processed and the results are added to the Text Index. Once the index is built, we no longer need internet access to search the weblog! The Oracle Text search is done against the (local) Text Index and the results returned are the values from the local table.
Loading the document table and building the Text Index
In three steps we will create the Text Index:
1. Create the local table
create table live_posts ( id number(10) primary key , url varchar2(200) , title varchar2(200) ) /
2. Insert references to the documents on the weblog, including the URL and the Title of each document:
insert into live_posts (id, url) select num , 'https://technology.amis.nl/blog/index.php?p='||to_char(num) from ( select rownum num from ( select 1 from dual group by cube(1,2,3,4,5,6,7,8,9) -- power(2,9) = 512 ) ) / update live_posts set title = regexp_substr(utl_http.request(url), '<title(.*)</title') /
We now have the titles, but we have several small problems: not all URL actually refer to valid documents; some do not exist, some are internally available only and not accessible to our search engine. Furthermore, the title column still contains the tags. Using the following statements we resolve these issues. Note: that probably can be done more efficiently, but I am not very experienced with Regular Expressions, hence this somewhat awkward approach:
update live_posts set title = regexp_replace ( regexp_replace(title,'<tit(.)*; ','') , '</ti(.)*' , '' ) / delete live_posts where title= '<title>' /
The insert statement creates 512 records with as many different URLs, based on our knowledge of the format of URLs to the articles on the weblog. Note the use of the CUBE operator to generate rows from DUAL (this is described in more detail in the Weblog article Courtesy of Tom Kyte: Generating rows in SQL with the CUBE statement – no dummy table or table function required). The update statement is used to add the title of each article on the weblog to the live_posts table. It calls out to the weblog, requesting the document for the given URL and then searches through that document for the text between <title
and </title
.
3. Constructing the Oracle Text Index
begin ctx_ddl.create_preference('url','URL_DATASTORE'); ctx_ddl.set_attribute('url','NO_PROXY','us.oracle.com'); ctx_ddl.set_attribute('url','Timeout','300'); end; create index live_posts_index on live_posts ( url ) indextype is ctxsys.context parameters ( 'datastore url' ) / EXEC DBMS_STATS.GATHER_TABLE_STATS('ROOT', 'LIVE_POSTS', cascade=>TRUE);
We can verify whether the index is created and what the number of Tokens is that it has collected. Note that I did not use specific stop-lists to exclude frequently occurring non-meaningful words. I also did not make explicit use of the structure of the documents: they are all HTML and I could have excluded certains tags and have focused the index on the htmlbody or even more specific on certain sections within the document. To inspect the index as it has been built for us:
SELECT count(token_text) , sum(token_count) FROM DR$live_posts_index$I / COUNT(TOKEN_TEXT) SUM(TOKEN_COUNT) ----------------- ---------------- 11369 143945
Building the text index – which involved retrieving over 350 documents with a total size of several dozen MBs – took less than 5 minutes. I am somehwat curious as to what the top ten tokens in our weblog are:
select token , token_count from ( SELECT token_text token , token_count , rank() over (order by token_count desc) rnk FROM DR$live_posts_index$I ) ordered_tokens where rnk < 11 order by rnk desc /
The results are somewhat disappointing: I now see why I should have indicated to Oracle Text not to index the HTML tags and also to focus on a certain section of the document. The highest ranking tokens are either HTML-tags or elements that appear on every article in the weblog, such as the links to other weblogs and the recent comments. However, I suspect that this will not really hamper my searches so I leave the index as it is. We are now ready to search the weblog, just from the comfort of our own database.
Querying the AMIS Technology Weblog using the Oracle Text Index
See the online Oracle Text documentation for details on the Oracle Text CONTAINS operator grammar. We can learn from there the use of several operators such as AND, OR, NOT, ASSUM, EQUIV, NEAR, $ (stem), ? (fuzzy) and ! (soundex). Let's try to find all AMIS Weblog articles on 'J2EE' :
select title , score(1) score from live_posts where contains(url, 'j2ee', 1) > 20 order by score desc /
Even with a relatively high threshold (score must be over 20), we find 24 articles with this query. Let us narrow down the search. We are interested in J2EE and in particular in OO/R Mapping frameworks:
select title , score(1) score from live_posts where contains(url, 'j2ee and $mapping', 1) > 20 order by score desc / TITLE SCORE ------------------------------------------------------------------------------------------ --------- Expert Session on Persistency Framework: Toplink 36 A single POJO persistence model to replace JDO and take over much of EJB?? 36
Now for a slightly different approach: it should be about J2EE alright but please let’s not talk about Struts this time:
select title , score(1) score from live_posts where contains(url, 'j2ee not struts', 1) > 2 order by score desc /
Unfortunately, no rows are returned. What we can do now, is lower the score for any article that discusses Struts in too much earnest:
select substr(title,1,90) title , score(1) score from live_posts where contains(url, 'j2ee minus struts', 1) > 20 order by score desc / TITLE SCORE ------------------------------------------------------------------------------------------ --------- BPEL-Soft - Oracle’s previous acquisition showcased during AMIS Query 39 On-line Java/J2EE courses! 33 A single POJO persistence model to replace JDO and take over much of EJB?? 33 Interesting articles on various subjects 33 Expert Session on Persistency Framework: Toplink 33 Combine Oracle Toplink and the Spring Framework - Rod Johnson & Jim Clark 30 Great document on Java/J2EE Guidelines 27 AppFuse - To get a J2EE project going… 24 Free J2EE and XML Development book 24 Eclipse webtools 21 Core J2ee patterns, session facade: always useful 21 Summary “Expert one-on-one J2EE design and development” 21
Now we do have some results that seem meaningful.
On to a more complex example: articles that are about something we heard mentioning, but do not know for sure how to spell: “ENT” and it should have something to do with Java and Building stuff:
select title , score(1) score from live_posts where contains(url, 'near((!ent,$build),6) and java', 1) > 20 order by score desc / TITLE SCORE ------------------------------------------------------------------------------------------ --------- (Pre)-compiling JSP for Tomcat with Jasper 27 AMIS Library System - JHeadstart 9.0.5.x (BC4J, JSP+JSTL) Demo Application released 24
Note than in the above query we have used ! for anything that sounds like ENT (we expect it to return articles about ANT) and $ for all tokens with the same stem as build, such as building, where these two tokens are within a range of six tokens of each other.
The NEAR operator can be very useful for finding contents where you know to tokens should be there but not necessarily next to each other in a certain static sequence. Looking with just the AND operator would potentially return far too many results. To find articles that deal with something that sounds like BEPEL (presumably I am interested in BPEL) I could try the query
select title , score(1) score from live_posts where contains(url, '!bepel', 1) > 10 order by score desc /
. However, it returns 40 records, which I find way too many. When I narrow down the search, as follows:
select title , score(1) score from live_posts where contains(url, 'near((!bepel, process),10) and "web service" ', 1) > 10 order by score desc / TITLE SCORE ------------------------------------------------------------------------------------------ --------- Consuming Web Services from PL/SQL - Part I: Using Java Stored Procedures 43 Consuming Web Services from PL/SQL - Part II: A pure PL/SQL solution using UTL_HTTP (Oracl 41 BPEL is great – and so is Oracle BPEL 23 Publishing PL/SQL Based WebServices 12 Friday on Oracle Open world 12
I am down to five, which is easier to deal with.
Next I want to find articles about Oracle Designer, either the 9i or 10g release – which are equivalent. I tell Oracle Text exactly that: 9i and 10g are equivalent:
select title , score(1) score from live_posts where contains(url, 'near((oracle, designer),3) and 10g=9i minus jdeveloper', 1) > 10 order by score desc / TITLE SCORE ------------------------------------------------------------------------------------------ --------- Oracle Designer/Oracle SCM Meta Model 45 9.0.4.5 Maintenance Release Oracle Designer available 27 Enabling Role-based security management in Oracle Designer 6i/9i/10g through the Repositor 27 The future of CDM RuleFrame - the Oracle framework for server-side business rules 15
Finally to find articles on the declarative, server side implementation of the business rule (or constraint) that a master must have details (mandatory master detail constraint), we use the following query:
select title , score(1) score from live_posts where contains(url, 'declarative,integrity and near((mandatory,master,detail),8) and "business rule"=constraint', 1) > 10 order by score desc / TITLE SCORE ------------------------------------------------------------------------------------------ --------- Declarative Implementation of Mandatory Master-Detail relationship (Order must have Order 63 The future of CDM RuleFrame - the Oracle framework for server-side business rules 61
Resources
- Oracle Text Application Developer’s Guide 9.2
- Optimize Your Text Retrieval (Oracle Magazine, September/October 2004)
- Oracle Text Home Page on OTN
Oracle 10g Ultra Search
We have been building a very simple SQL and Oracle Text based search application. That has already been done for us in Oracle 10g Ultra Search, see Ultra Search on OTN:
Oracle Ultra Search allows you to find what you want — on corporate Web servers, databases, mail servers, fileservers, Oracle10g Portal instances, Oracle Files and Mail. Ultra Search is based on Oracle10g Text technology and is an out-of-the box solution that requires no SQL coding. Ultra Search is included with the Oracle database, Application Server and Collaboration Suite and is available free of charge for license holders of these products. It uses a crawler to index documents; the documents stay in their own repositories, and the crawled information is used to build an index that stays within your firewall in an Oracle10g database.
More details on Ultra Search in the white paper Oracle Ultra Search Architecture Version 10.1 for Oracle10g Database Release 1 Version 9.0.3 for Oracle Collaboration Suite Release 2 and Oracle 10g Application Server Version 10g January 1994 (where 1994 presumably should be 2004, or someone with incredible vision has written this white paper!).
There were some pitfalls with East Asia Languages.
Nice article. In my experience it is very powerful but not well known feature of the database. Could (should) be used much more often. One of the more interesting options is diacritic insensitive searching, which is especially handy with foreign names or case insensitive searching.
But it also has it’s pitfall (off course). If I remember correctly the index is not rebuild automatically when the data is updated.
As you mentioned the tag indexing the use of a stoplist is very advisable, but also requieres some finetuning especially when indexing names.