Getting started with Lucene 2.0 – A powerful Java search engine


Lucene is a text search engine written in Java. It’s very
easy to use (for both developers and users) and fast. Its creator, Doug
Cutting, started Lucene in 1997, and it is still a big player in Java
search. After that many years of use you are unlikely to run into a bug,
and there is little left to gain in performance (Lucene is blazingly fast).

This article handles the basics of Lucene and is not only
intended for developers. The first part describes what Lucene is, how to use
it as an end user and what it can do for you: the possibilities Lucene gives
you and how you create queries (still understandable for everybody, queries
aren’t scary, it’s just another word for question ;-) ). After that I will
start coding.

 A lot of developers tend to do everything themselves, but
with libraries like Lucene it’s highly unlikely that you will create something
that’s faster, easier to use and more reliable than Lucene.

When you want to search things you probably have a large
dataset: a database or a bunch of files (Word, Excel, PDF, txt, csv and any
other format that can be read by Java). Lucene creates an index from this
data. An index is a kind of trick to search through your data more quickly.
You can compare an index with the way you organize business cards. Those cards
are ordered by name or company name so you can find a card quicker than by
scanning the cards one by one. Lucene is smart enough to remember all kinds of
orderings and is much more efficient than sorting on a name only.
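To make the business card comparison a bit more concrete, here is a minimal sketch of the idea behind an index: a map from each word to the texts that contain it. The class name TinyIndex and its methods are made up for this illustration, and this is not how Lucene stores its index internally; it only shows why a lookup beats scanning everything one by one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy index (hypothetical, not Lucene): maps each lower-cased word
// to the ids of the texts that contain it.
class TinyIndex {
    private final List<String> texts = new ArrayList<>();
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Stores a text and registers every word in it; returns the id of the text.
    int add(String text) {
        int id = texts.size();
        texts.add(text);
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Looks a single word up directly instead of scanning every text.
    List<String> find(String word) {
        List<String> result = new ArrayList<>();
        for (int id : postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>())) {
            result.add(texts.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("Pulp Fiction");
        index.add("The Big Lebowski");
        index.add("Pulp");
        System.out.println(index.find("pulp")); // prints [Pulp Fiction, Pulp]
    }
}
```

The work is done once while adding; after that, finding a word is a single map lookup, no matter how many texts there are.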

The index Lucene creates is stored in a couple of files
inside a directory on your hard disk. This makes it possible to transfer
the index between systems (you can even transfer an index made on Windows to a
Linux workstation). Because the index can also store the original field values,
you don’t even need your data anymore to show search results; this is great
when you access your data via a web service or get it from a slow medium like
a floppy disk.

The Lucene Index

So how is the data stored? Lucene uses a Lucene Document
(when I use the word Document with a capital D I mean a Lucene Document) to
store the data. A Document consists of Fields. A Field is a key+value pair
of some data. For example, the field name “title” and the field data
“Pulp Fiction” make up a Field. The next topic I will handle is queries, and I
need some example data to make things clearer. The dataset I will use is a set
of movies.

A movie is a Document. The fields of the Document are:

  • title:
    Title of movie (“Pulp Fiction”, “The Big Lebowski” etc.)
  • date:
    Release date in the format YYYYMMDD (19991231)
  • director:
    Director of movie (“Quentin Tarantino”, “Tony Scott” etc.)
  • genre:
    Genre of movie, there can be more than one genre

Lucene has a notion of dates and numbers, but for now you
should remember that everything is a string. The date is formatted in YYYYMMDD
notation so that sorting and searching on date ranges work on plain strings.
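Why YYYYMMDD? A small sketch in plain Java (no Lucene involved) shows the reason: when dates are written this way, ordinary string comparison gives the same order as the calendar. The class name DateStrings and the example dates are only made up for this illustration.

```java
import java.util.Arrays;

// Plain-Java illustration: YYYYMMDD strings sort chronologically,
// so range checks can be done with string comparison alone.
class DateStrings {
    // True if date lies in [from, to]; all three are YYYYMMDD strings.
    static boolean inRange(String date, String from, String to) {
        return date.compareTo(from) >= 0 && date.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        String[] dates = {"19991231", "19940923", "19980306"};
        Arrays.sort(dates); // alphabetical sort equals chronological sort here
        System.out.println(String.join(", ", dates)); // prints 19940923, 19980306, 19991231
        System.out.println(inRange("19940923", "19940101", "19941231")); // prints true
    }
}
```

A format like DDMMYYYY would break this, because the day would be compared before the year.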

It is allowed to have multiple fields with the same name in
a Document; there is no fixed structure at all. If you want to have a few
Documents with only a field named foobar and no other fields, that’s perfectly
legal (but a bit stupid of course).

Queries

Let’s start querying our data. The queries of Lucene are
very much like the advanced Google syntax.

In principle you should provide the field name of the field
you want to search on. But there always is a default search field (which is
provided programmatically). In our case the default field is title because you
probably want to search on a title of a movie.

Suppose you want to search for Pulp Fiction. You just enter
pulp fiction in your search box. Queries are case insensitive. All the results
for movies that contain pulp OR fiction are returned. That’s right, you’d
expect pulp AND fiction, but it is OR. Pulp Fiction is probably the first
result because the results are ordered by relevance. (Update: I first wrote
that Google and most search engines work the same way. That is not true,
Google uses AND search, thanks Alper.)

If you want to search for pulp AND fiction you just enter
pulp AND fiction in your search box. In Google you can also enter +pulp
+fiction; this is also allowed in Lucene. If you want an exact match (so only
Pulp Fiction and not Pulp Foo Fiction is found) you surround the two words with
double quotes: "Pulp Fiction"

Instead of + you can also use - if you don’t want a word to
appear in the result list. Suppose you search +pulp -fiction: all movies with
pulp in the title will be returned, except those that also have fiction in the
title.
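The rules for plain terms, + and - can be sketched in a few lines of plain Java. This is only my simplified reading of the behaviour described above, not Lucene’s query parser: the class PlusMinusSketch is made up, it only looks at whole words in a single title and it ignores relevance ordering, quotes, wildcards and field names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified sketch (not Lucene code) of plain, +required and -prohibited terms.
class PlusMinusSketch {
    static boolean matches(String query, String title) {
        Set<String> words = new HashSet<>(
                Arrays.asList(title.toLowerCase(Locale.ROOT).split("\\W+")));
        boolean hasRequired = false;
        boolean optionalHit = false;
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (term.startsWith("+")) {
                // A required term must be present.
                hasRequired = true;
                if (!words.contains(term.substring(1))) return false;
            } else if (term.startsWith("-")) {
                // A prohibited term must be absent.
                if (words.contains(term.substring(1))) return false;
            } else if (words.contains(term)) {
                // A plain term is optional: one hit is enough (the OR behaviour).
                optionalHit = true;
            }
        }
        return hasRequired || optionalHit;
    }
}
```

With this sketch, matches("pulp fiction", "Pulp") is true (OR), matches("+pulp +fiction", "Pulp") is false, and matches("+pulp -fiction", "Pulp Fiction") is false as well.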

Sometimes you don’t know the exact word you’re searching
for. Was it Pulp Fiction or Polp Fiction? You can replace the character you’re
not sure about with a question mark (p?lp). The asterisk sign (*) is used for
zero or more characters you’re not sure about. Note that ? and * cannot be the
first character of the search term. This is due to the internal working of
Lucene and also explains the performance.

There is still one thing of the basic syntax that needs to be
explained. Instead of the minus sign you can use the word NOT: pulp AND NOT
fiction is an example. Also note that AND, OR and NOT must be written in
capitals, otherwise you’re just searching for the words and, or and not.
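If you know regular expressions, the ? and * wildcards described above will look familiar. The following sketch translates a wildcard term into a Java regular expression to show what each sign stands for. It is an illustration only: the class WildcardSketch is made up, and Lucene does not work with regular expressions here.

```java
import java.util.regex.Pattern;

// Illustration only (not Lucene code): ? stands for exactly one character,
// * for zero or more characters.
class WildcardSketch {
    static boolean matches(String pattern, String word) {
        if (pattern.startsWith("?") || pattern.startsWith("*")) {
            // Mirrors the rule that a term cannot start with a wildcard.
            throw new IllegalArgumentException("a term cannot start with ? or *");
        }
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                // Quote everything else so it is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), word);
    }
}
```

So p?lp matches both pulp and polp, and pul* matches pul, pulp and pulpy.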

More advanced queries

We don’t want to search on titles only of course. Date,
director and genre are also important. When you want to search on a field other
than the default field just put the field name followed by a colon before the
term you want to search for.

Some examples:

+genre:action +director:scott

Search for action movies from directors with Scott in their
name.

+title:pulp -director:tarantino

Search for movies with pulp in the title that are not
directed by a certain Tarantino.

I only scratched the surface of querying in Lucene, but I
think it’s enough for now and you’ll find most of the stuff you’re looking for.
In a follow-up of this article I will take a look at the very advanced querying.
Things like proximity, range, grouping, boosting, fuzzy and stemming are all
possible.

Creating the index

Now it’s finally time to get coding. If you don’t know what
an Object with a capital O is, it’s time to stop reading and wait for the next
blog article about Lucene with advanced querying and some other nice things.
But now you know what’s possible with Lucene.

 

First download Lucene 2.0 and make sure the .jar file is on your classpath.

 

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexModifier;

// Analyzer that lower-cases text and drops stop words and punctuation
Analyzer an = new StandardAnalyzer();
// First argument: directory name of the index; true = create a new index
IndexModifier im = new IndexModifier("c:/temp/index", an, true);

// One Document per movie; a field name such as "genre" may occur more than once
Document d = new Document();
d.add(new Field("title", "Pulp Fiction", Field.Store.YES, Field.Index.TOKENIZED));
d.add(new Field("date", "19940923", Field.Store.YES, Field.Index.TOKENIZED));
d.add(new Field("director", "Quentin Tarantino", Field.Store.YES, Field.Index.TOKENIZED));
d.add(new Field("genre", "Action", Field.Store.YES, Field.Index.TOKENIZED));
d.add(new Field("genre", "Crime", Field.Store.YES, Field.Index.TOKENIZED));

// Add the Document to the index, otherwise nothing is written
im.addDocument(d);

im.close();
```

 

Let’s walk through the code statement by statement. The first statement
creates an analyzer. An analyzer throws away needless information. Needless
information for a standard analyzer is the difference between upper and lower
case and stop words (words that are used so often that it is better not to
allow searching on them). Punctuation is also removed. Your original field
values can also be stored if you want to (when you use a field for both
searching and displaying). There are numerous analyzers; I usually use the
StandardAnalyzer. Consult the Javadoc on the Lucene website for all the other
analyzers.
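To give a feeling for what “throwing away needless information” means, here is a very rough imitation in plain Java. It is not the StandardAnalyzer (which does more and has its own stop word list); the class SimpleAnalyzerSketch and its short stop word list are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Rough imitation of an analyzer (not Lucene code): lower-case the text,
// split on punctuation and whitespace, and drop stop words.
class SimpleAnalyzerSketch {
    // A tiny, made-up stop word list for the example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Everything that is not a letter or a digit separates two words.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }
}
```

analyze("The Big Lebowski!") gives [big, lebowski]: the capitals, the exclamation mark and the stop word are gone. This is also why queries are case insensitive when the same kind of analysis is applied to the query.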

Next we create an IndexModifier. The IndexModifier is a
wrapper object for IndexWriter and IndexReader (for the people that worked with
Lucene 1.4). It makes sure there is only one instance modifying the index
(imagine what will happen when 2 people write to the index at the same time).
The first argument is the directory name of the index. We use a file-based
index, but it’s also possible to create an in-memory index. The second argument
is the analyzer we just created and the third argument is whether we want to
create a new index (true creates a new index, false does an incremental
index). Incremental indexing can be very useful, but you need to be careful.
Since Lucene has no notion of primary keys or unique keys like we have in a
database, you have to remove your old Document first if you want to update it.
If you add a Document twice it will be found twice.

The next line creates the actual Lucene Document. Nothing
special, just an object we’ll use later. Now let’s add Fields to the document.
The argument list of Field:

  1. field name (String)
  2. field value
  3. Store field? (Field.Store)
  4. How to index the field (Field.Index)

The first two arguments are pretty clear. The third and
fourth are a bit similar. The third argument is of the type Field.Store. “I
always want to store my field, that’s why I put it in my Document”, you’re
probably thinking right now. That’s true, but the index is always stored.
Field.Store determines whether you also want to store the original field value
(the value that went through the analyzer is already stored). The choices are
Field.Store.YES, Field.Store.NO and Field.Store.COMPRESS. The first two options
are quite obvious. The third option can be used when you want to compress the
contents of the field (when you use binary or base64 data for example). The
fourth argument is of the type Field.Index and determines how the value is
indexed; Field.Index.TOKENIZED, used here, sends the value through the analyzer.

After adding the fields, add the Document to the index with
im.addDocument(d), and don’t forget to close the IndexModifier at the end. When
you have done this for all your documents you’ll see some files in
c:\temp\index\. What those files do isn’t that important yet. A nice thing is
that you can move those files to another file system and operating system and
they will still work.

Searching your index with Luke

Now it is time to search your index. The normal way would be
coding it in Java, but fortunately there is a nice tool for this. The coding
will come later. The magic tool is called Luke and can be downloaded from
http://www.getopt.org/luke/

Start Luke with java -jar lukeall.jar

Now click on File, Open Lucene Index and go to c:\temp\ in
the file dialog and click once at index and then open. The index is the whole
directory, not a single file.

 

When you click on the Documents tab you can browse through
the documents in your index. In the search tab you can enter queries. Pick the
StandardAnalyzer from the right dropdown box and select the preferred default
field (usually title).

Since my movie database was a little bit imaginary I have
indexed a part of the HR schema from Oracle for the following screenshot. I
indexed information about department, hire date, location, managers (and the
managers of the managers) and the name of the employee. The location field also
has a locationt field; why I did that is explained in the follow-up of this
article.

Let’s select all the employees that work in Seattle. That query is location:seattle

 

That’s pretty cool huh? Play around with Luke a little bit
and then I’ll explain how you can make the code to search the index yourself.

Searching your index in Java

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

// Open the directory that holds the index
Searcher searcher = new IndexSearcher("c:/temp/index");
// Use the same analyzer as during indexing
Analyzer an = new StandardAnalyzer();
// "title" is the default search field
QueryParser queryParser = new QueryParser("title", an);
Query query = queryParser.parse("pulp");
Hits hits = searcher.search(query);
// Print the stored date of the first hit (assumes there is at least one hit)
System.out.println(hits.doc(0).get("date"));
searcher.close();
```

The first line of code is creating the searcher. Quite straightforward,
just open the directory where the index is stored (Note that Luke locks the
index, even in read-only mode, so close Luke before you start coding).

The next line is the analyzer. It does the same with your
query as it did with the fields while indexing, so it’s wise to use the
same analyzer.

Now create a QueryParser object. The first argument is the
default search field (the same as in Luke) and the second is the analyzer you
just created.

Now let the QueryParser turn the real query into a Query object
with its parse method. Our query today is pulp. That will return all movies
with pulp in the title.

Now invoke the search method on the searcher object. This
will return a Hits object. This is comparable with a List object, consult the
Javadoc of Lucene for more details about it. With the doc(0) method you will
get the first result in the result list.

Now don’t forget to close the searcher object and you’re
done.

Final words

I tried to give a quick introduction to Lucene, but it seems
that it became a bit elaborate. I’ll write a follow-up later with more advanced
stuff in it. I will also explain the differences between 2.0 and the pre-2.0
versions of Lucene. What you need to remember is that the differences between
2.0 and 1.9 and earlier are very small: 2.0 just dropped the deprecated
functions. Removing all the deprecated code from your files is quite easy, so
reading tutorials about Lucene 1.4.3 isn’t bad at all; you just have to
remember that some functions have different arguments or different names in
2.0. The Javadoc is pretty good. I learned Lucene with some help from a
colleague and by reading a lot in the Javadoc. When you’re starting with the
advanced stuff a book is nice, but not necessary.

Feel free to ask any questions (this can be done in the
comments without registering, I will receive an e-mail when someone commented
and everybody else can read the question too). 

Other blogs about Lucene

Using Lucene with Spring – Introduction to Spring Modules

4 Comments

  1. In the code section where you are creating the index, you left out the step where you add the document to the IndexModifier. Should be: im.addDoc(d);

    Great intro to Lucene 2.2.0. Thanks!

  2. Jeroen van Wilgenburg on

    Thanks for pointing the AND thing out, I have changed it. I once made a very stupid typo in a query in Google and it still returned results, so I assumed it was OR search. But apparently the typo wasn’t that stupid.

    What you can do for implicit AND-search is put a + before every search term if there isn’t a - or a + yet. Lucene rewrites the query internally to the +/- format (so pulp AND fiction becomes +pulp +fiction internally), the query object you get after invoking queryParser.parse() displays the rewritten query (it’s just the toString method). You can see the rewritten query in Luke just under the search box. And then you have to build in a check to see whether a user used OR in his original query. I think this is all that needs to be done, I might have overlooked something.

    I have to check if it can be done via some setting, but as far as I can remember it isn’t possible.

  3. From experience I know search based on OR is completely useless.

    Google does not OR their searches -as you say- AND is implicit. From the Google website: “The “AND” operator is unnecessary — we include all search terms by default.”

    Is it possible to make Lucene see AND as implicit? Do you have to rewrite the query parser or is there a setting you can set?