Creating your own advanced search engine for any website - using Oracle Text - Searching the AMIS Technology Weblog html

Creating your own advanced search engine for any website – using Oracle Text – Searching the AMIS Technology Weblog

The Oracle 7Up workshop that I teach every few months introduces many Oracle 8i, 9i and some 10g features to experienced SQL and PL/SQL developers with a firm footing in Oracle 7. Many developers have only a vague notion of many of the facilities and objects that Oracle has made available since 1996. Especially now that 8i is no longer supported, we see that many organizations have adopted 9iR2 or will do so in the near future. Besides, many features that previously were Enterprise Edition only have now been made available in the Standard Edition as well. This seems like perfect timing to reacquaint one with very useful SQL and PL/SQL functions. One of the bonus topics in the first two-day block – focused on SQL- is Oracle Text, previously known as InterMedia, the Context Option and SQL*TextRetrieval.

Oracle Text provides functionality search through documents in very advanced ways. See for more details the Oracle Text Application Developer’s Guide 9.2:

Oracle Text is a tool that enables you to build text query applications and document classification applications. Oracle Text provides indexing, word and theme searching, and viewing capabilities for text. The purpose of a text query application is to enable users to find text that contains one or more search terms. The text is usually a collection of documents. A good application can index and search common document formats such as plain text, HTML, XML, or Microsoft Word. For example, an application with a browser interface might enable users to query a company Web site consisting of HTML files, returning those files that match a query.

The second type of application supported by Oracle Text is a document classification application. A document classification application is one that classifies an incoming stream of documents based on its content. They are also known as document routing or filtering applications. For example, an online news agency might need to classify its incoming stream of articles as they arrive into categories such as politics, crime, and sports.

One of the interesting things to note is that while Oracle Text can index and search all types of documents that are stored in your database, it is equally well equipped to provide this functionality for documents that are NOT in the database. They can be on a file system or anywhere on your intranet or internet. You provides references to the documents through directories and filenames or through URLs and Oracle Text will find and index those documents. In this post we will see how we can build a very simple Oracle Text search application for our own Weblog – but we could have done it for any collection of documents on the Internet, such as the Oracle Documentation On Line or any section on OTN. Note: a colleague of mine is currently involved with a document library system that involves Oracle Text indexing and searching of up to tens of millions of records. I hope he will write on this weblog, telling about his field experiences, from a functional as well as from a DBA perspective.

Searching the AMIS Technology Weblog using Oracle Text

We will create a local table that contains URLs that reference the articles on our weblog. Subsequently, we will build a Text Index on this table and more specifically its URL column. During the index creation process, Oracle Text will call out to our weblog to retrieve each document that the URL values refer to. The document will be processed and the results are added to the Text Index. Once the index is built, we no longer need internet access to search the weblog! The Oracle Text search is done against the (local) Text Index and the results returned are the values from the local table.

Loading the document table and building the Text Index

In three steps we will create the Text Index:
1. Create the local table

create table live_posts
( id number(10) primary key
, url varchar2(200)
, title varchar2(200)
)
/

2. Insert references to the documents on the weblog, including the URL and the Title of each document:

insert into live_posts
(id, url)
select num
,      'https://technology.amis.nl/blog/index.php?p='||to_char(num)
from   ( select rownum num
         from   ( select 1
                  from   dual
                  group
                  by     cube(1,2,3,4,5,6,7,8,9) -- power(2,9) = 512
                )
       )
/

update live_posts
set    title = regexp_substr(utl_http.request(url), '<title(.*)</title')
/

We now have the titles, but we have several small problems: not all URL actually refer to valid documents; some do not exist, some are internally available only and not accessible to our search engine. Furthermore, the title column still contains the tags. Using the following statements we resolve these issues. Note: that probably can be done more efficiently, but I am not very experienced with Regular Expressions, hence this somewhat awkward approach:

update live_posts
set    title = regexp_replace
               ( regexp_replace(title,'<tit(.)*; ','')
               , '</ti(.)*'
               , ''
               )
/

delete live_posts
where  title= '<title>'
/

The insert statement creates 512 records with as many different URLs, based on our knowledge of the format of URLs to the articles on the weblog. Note the use of the CUBE operator to generate rows from DUAL (this is described in more detail in the Weblog article Courtesy of Tom Kyte: Generating rows in SQL with the CUBE statement – no dummy table or table function required). The update statement is used to add the title of each article on the weblog to the live_posts table. It calls out to the weblog, requesting the document for the given URL and then searches through that document for the text between <title and </title.

3. Constructing the Oracle Text Index

begin
ctx_ddl.create_preference('url','URL_DATASTORE');
 ctx_ddl.set_attribute('url','NO_PROXY','us.oracle.com');
 ctx_ddl.set_attribute('url','Timeout','300');
end;

create index live_posts_index
on live_posts ( url )
indextype is ctxsys.context
parameters ( 'datastore url' )
/
EXEC DBMS_STATS.GATHER_TABLE_STATS('ROOT', 'LIVE_POSTS', cascade=>TRUE);

We can verify whether the index is created and what the number of Tokens is that it has collected. Note that I did not use specific stop-lists to exclude frequently occurring non-meaningful words. I also did not make explicit use of the structure of the documents: they are all HTML and I could have excluded certains tags and have focused the index on the htmlbody or even more specific on certain sections within the document. To inspect the index as it has been built for us:

SELECT count(token_text)
,      sum(token_count)
FROM   DR$live_posts_index$I
/
COUNT(TOKEN_TEXT) SUM(TOKEN_COUNT)
----------------- ----------------
            11369           143945

Building the text index – which involved retrieving over 350 documents with a total size of several dozen MBs – took less than 5 minutes. I am somehwat curious as to what the top ten tokens in our weblog are:

select token
,      token_count
from   ( SELECT token_text token
         ,      token_count
         ,      rank() over (order by token_count desc)  rnk
         FROM   DR$live_posts_index$I
       ) ordered_tokens
where  rnk < 11
order
by     rnk desc
/

The results are somewhat disappointing: I now see why I should have indicated to Oracle Text not to index the HTML tags and also to focus on a certain section of the document. The highest ranking tokens are either HTML-tags or elements that appear on every article in the weblog, such as the links to other weblogs and the recent comments. However, I suspect that this will not really hamper my searches so I leave the index as it is. We are now ready to search the weblog, just from the comfort of our own database.

Querying the AMIS Technology Weblog using the Oracle Text Index

See the online Oracle Text documentation for details on the Oracle Text CONTAINS operator grammar. We can learn from there the use of several operators such as AND, OR, NOT, ASSUM, EQUIV, NEAR, $ (stem), ? (fuzzy) and ! (soundex). Let's try to find all AMIS Weblog articles on 'J2EE' :

select title
,      score(1) score
from   live_posts
where  contains(url, 'j2ee', 1) > 20
order
by     score desc
/

Even with a relatively high threshold (score must be over 20), we find 24 articles with this query. Let us narrow down the search. We are interested in J2EE and in particular in OO/R Mapping frameworks:

select title
,      score(1) score
from   live_posts
where  contains(url, 'j2ee and $mapping', 1) > 20
order
by     score desc
/
TITLE                                                                                           SCORE
------------------------------------------------------------------------------------------ ---------
Expert Session on Persistency Framework: Toplink                                                   36
A single POJO persistence model to replace JDO and take over much of EJB??                         36

Now for a slightly different approach: it should be about J2EE alright but please let’s not talk about Struts this time:

select title
,      score(1) score
from   live_posts
where  contains(url, 'j2ee not struts', 1) > 2
order
by     score desc
/

Unfortunately, no rows are returned. What we can do now, is lower the score for any article that discusses Struts in too much earnest:

select substr(title,1,90) title
,      score(1) score
from   live_posts
where  contains(url, 'j2ee minus struts', 1) > 20
order
by     score desc
/
TITLE                                                                                           SCORE
------------------------------------------------------------------------------------------ ---------
BPEL-Soft - Oracle’s previous acquisition showcased during AMIS Query                        39
On-line Java/J2EE courses!                                                                         33
A single POJO persistence model to replace JDO and take over much of EJB??                         33
Interesting articles on various subjects                                                           33
Expert Session on Persistency Framework: Toplink                                                   33
Combine Oracle Toplink and the Spring Framework - Rod Johnson & Jim Clark                          30
Great document on Java/J2EE Guidelines                                                             27
AppFuse - To get a J2EE project going…                                                       24
Free J2EE and XML Development book                                                                 24
Eclipse webtools                                                                                   21
Core J2ee patterns, session facade: always useful                                                  21
Summary “Expert one-on-one J2EE design and development”                                21

Now we do have some results that seem meaningful.

On to a more complex example: articles that are about something we heard mentioning, but do not know for sure how to spell: “ENT” and it should have something to do with Java and Building stuff:

select title
,      score(1) score
from   live_posts
where  contains(url, 'near((!ent,$build),6) and java', 1) > 20
order
by     score desc
/
TITLE                                                                                           SCORE
------------------------------------------------------------------------------------------ ---------
(Pre)-compiling JSP for Tomcat with Jasper                                                         27
AMIS Library System - JHeadstart 9.0.5.x (BC4J, JSP+JSTL) Demo Application released                24

Note than in the above query we have used ! for anything that sounds like ENT (we expect it to return articles about ANT) and $ for all tokens with the same stem as build, such as building, where these two tokens are within a range of six tokens of each other.

The NEAR operator can be very useful for finding contents where you know to tokens should be there but not necessarily next to each other in a certain static sequence. Looking with just the AND operator would potentially return far too many results. To find articles that deal with something that sounds like BEPEL (presumably I am interested in BPEL) I could try the query

select title
,      score(1) score
from   live_posts
where  contains(url, '!bepel', 1) > 10
order
by     score desc
/

. However, it returns 40 records, which I find way too many. When I narrow down the search, as follows:

select title
,      score(1) score
from   live_posts
where  contains(url, 'near((!bepel, process),10) and "web service" ', 1) > 10
order
by     score desc
/
TITLE                                                                                           SCORE
------------------------------------------------------------------------------------------ ---------
Consuming Web Services from PL/SQL - Part I: Using Java Stored Procedures                          43
Consuming Web Services from PL/SQL - Part II: A pure PL/SQL solution using UTL_HTTP (Oracl         41
BPEL is great – and so is Oracle BPEL                                                              23
Publishing PL/SQL Based WebServices                                                                12
Friday on Oracle Open world                                                                        12

I am down to five, which is easier to deal with.

Next I want to find articles about Oracle Designer, either the 9i or 10g release – which are equivalent. I tell Oracle Text exactly that: 9i and 10g are equivalent:

select title
,      score(1) score
from   live_posts
where  contains(url, 'near((oracle, designer),3) and 10g=9i minus jdeveloper', 1) > 10
order
by     score desc
/
TITLE                                                                                           SCORE
------------------------------------------------------------------------------------------ ---------
Oracle Designer/Oracle SCM Meta Model                                                              45
9.0.4.5 Maintenance Release Oracle Designer available                                              27
Enabling Role-based security management in Oracle Designer 6i/9i/10g through the Repositor         27
The future of CDM RuleFrame - the Oracle framework for server-side business rules                  15

Finally to find articles on the declarative, server side implementation of the business rule (or constraint) that a master must have details (mandatory master detail constraint), we use the following query:

select title
,      score(1) score
from   live_posts
where  contains(url, 'declarative,integrity and near((mandatory,master,detail),8) and "business rule"=constraint', 1) > 10
order
by     score desc
/
TITLE                                                                                           SCORE
------------------------------------------------------------------------------------------ ---------
Declarative Implementation of Mandatory Master-Detail relationship (Order must have Order          63
The future of CDM RuleFrame - the Oracle framework for server-side business rules                  61

Resources

Oracle 10g Ultra Search

We have been building a very simple SQL and Oracle Text based search application. That has already been done for us in Oracle 10g Ultra Search, see Ultra Search on OTN:

Oracle Ultra Search allows you to find what you want — on corporate Web servers, databases, mail servers, fileservers, Oracle10g Portal instances, Oracle Files and Mail. Ultra Search is based on Oracle10g Text technology and is an out-of-the box solution that requires no SQL coding. Ultra Search is included with the Oracle database, Application Server and Collaboration Suite and is available free of charge for license holders of these products. It uses a crawler to index documents; the documents stay in their own repositories, and the crawled information is used to build an index that stays within your firewall in an Oracle10g database.

More details on Ultra Search in the white paper Oracle Ultra Search Architecture Version 10.1 for Oracle10g Database Release 1 Version 9.0.3 for Oracle Collaboration Suite Release 2 and Oracle 10g Application Server Version 10g January 1994 (where 1994 presumably should be 2004, or someone with incredible vision has written this white paper!).

2 Comments

  1. Pingback: AdSense Money Maker April 20, 2007
  2. jametong March 14, 2005
  3. Aino Andriessen March 13, 2005