I found a very interesting article on how to get started with Lucene, and what to expect from it. Lucene is the Apache Jakarta text search engine library, written entirely in Java. The article, “I Love Lucene”, was written by Dion Almaer and discusses how TheServerSide (TSS) used Lucene to build its search facility. As such, it is a clear introduction to what Lucene can do and how you can get going. Their conclusion:
Having said that, we don’t see any reason to move away from Lucene. It has been a pleasure to work with, and is one of the best pieces of open source software that I have personally ever worked with.
TheServerSide search used to be a weak link on the site. Now it is a powerhouse. I am constantly using it as Editor, and now manage to find exactly what I want.
Indexing our data is so fast, that we don’t even need to run the incremental build plan that we developed. At one point we mistakenly had an IndexWriter.optimize() call every time we added a document. When we relaxed that to run less frequently we brought down the index time to a matter of seconds. It used to take a LOT longer, even as long as 45 minutes.
So to recap: We have gained relevance, speed, and power with this approach. We can tweak the way we index and search our content with little effort.
Thanks SO much to the entire Lucene team.
Note that Lucene is not just for web applications: it can be used in any Java application that needs to search text, as long as that text can be presented through interfaces that Lucene understands. Jakarta Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Jakarta Lucene is an open source project available for free download from Apache Jakarta.
I hope we will have an opportunity to try out Lucene one of these days; it seems fun to work with. At the same time, most of our data is in the database, and wherever Oracle is on site, Oracle Text probably offers even better functionality.
Resources
The Lucene home page on Apache Jakarta
A tutorial on Lucene: Lucene Tutorial By Steven J. Owens
Other Lucene articles – Lucene Resources on Jakarta
I work as an architect for a large publishing company, so we have a lot of content. Based on my experience with Lucene, I can say that its performance as a search engine is more than good enough for over 100 concurrent users on an index of several thousand documents. The benchmarks published at http://jakarta.apache.org/lucene/docs/benchmarks.html show results that are as good or even better.
Currently we don’t put a lot of effort into optimizing our indexes, simply because I think “early optimization is the root of much evil” (well, actually Knuth said that and I’m just repeating him). We just make our indexes domain specific and optimize them when it is useful, for example every 10 minutes when a batch of documents is added to the index.
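To make the "optimize per batch, not per document" idea concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene itself: `IndexSink`, `CountingSink` and `BatchIndexerSketch` are hypothetical names invented for this example, and the sink only counts calls so the difference is visible. In a real setup the sink would wrap whatever index writer you use.

```java
import java.util.List;

// Illustrative sketch only: IndexSink is a hypothetical stand-in,
// not Lucene's IndexWriter.
interface IndexSink {
    void addDocument(String document);
    void optimize();
}

// Fake sink that only counts calls, so the effect of batching is visible.
class CountingSink implements IndexSink {
    int added = 0;
    int optimized = 0;

    @Override
    public void addDocument(String document) {
        added++;
    }

    @Override
    public void optimize() {
        optimized++;
    }
}

class BatchIndexerSketch {
    /** Adds every document in the batch, then optimizes once at the end. */
    static void indexBatch(IndexSink sink, List<String> batch) {
        for (String document : batch) {
            sink.addDocument(document);
        }
        if (!batch.isEmpty()) {
            sink.optimize(); // once per batch, not once per document
        }
    }

    public static void main(String[] args) {
        CountingSink sink = new CountingSink();
        indexBatch(sink, List.of("doc one", "doc two", "doc three"));
        System.out.println(sink.added + " added, " + sink.optimized + " optimize call(s)");
    }
}
```

Running `main` should report three added documents and a single optimize call. An empty batch triggers no optimize at all, which is the behaviour you would want from a job that fires every 10 minutes whether or not new documents arrived.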
Furthermore, it may be nice to know that there is an open source Lucene index browser, Luke (http://www.getopt.org/luke/), which we use to do a quick search on an index from a developer’s point of view. The application is actually built with Thinlet, a technology I like to support, but that is another topic.
What I would also like to add to your post is that one core part of Lucene is breaking a text up into useful words, which is the actual analysis of the text. As you can imagine, how a text is analyzed directly affects which results can be found. Lucene provides some analyzers that are useful and can be language specific, but they are very basic.
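For readers who have not seen an analyzer before, here is a rough sketch of what such a step does, written in plain Java. It is not Lucene's Analyzer API: the class name `AnalyzerSketch` and the tiny stop-word list are assumptions made up for this example. It splits on anything that is not a letter or digit, lowercases the tokens, and drops a few common English words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: this is NOT Lucene's Analyzer API.
class AnalyzerSketch {
    // Hypothetical, tiny English stop-word list chosen for the example.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "is", "of", "the", "to");

    /** Lowercases the text, splits on non-letter/non-digit runs, and drops stop words. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // split() can yield an empty first token when the text starts with a delimiter
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Analysis of a Text is the core of Lucene."));
        // prints [analysis, text, core, lucene]
    }
}
```

The link with finding results is that the same analysis has to be applied to the query as to the indexed text. With this sketch, a query for "TEXT" and a document containing "the text" both reduce to the single term `text`, so they match. With a different analyzer on either side they might not.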
Currently I’m trying to find out which analyzers give me enough power to find words based on an ontology.
I think it should be possible in some way.