This blog is one of a series of 6 blogs around the combination of Elasticsearch (part of ‘the ELK stack’) and Oracle Adaptive Case Management.
The series covers:
1. Elasticsearch and Oracle Middleware – is there an opportunity?
2. Installation of Elasticsearch: installation and the indexing of – human generated – documents
3. Elasticsearch and Oracle ACM data: example with ACM data
4. Kibana for ACM dashboards: an example of dashboards with ACM data
5. Logstash and Fusion Middleware: how to get log file data into Elasticsearch
6. Beats and Fusion Middleware: a more advanced way to handle log files
Now we can get going with Elasticsearch. As Elastic.co puts it on https://www.elastic.co/start:
Grab Your Towel.
Adventures Await.
In this blog we will show:
- an outline of Elasticsearch concepts
- installation of Elasticsearch
- basic operations
- how to index documents in PDF, Word, and other formats
Outline of Elasticsearch concepts
Elasticsearch stores data. That data is indexed inside Elasticsearch and can then be queried/searched in ‘google-style’. The interfaces for Elasticsearch are RESTful APIs and JSON-based. Of course, clients for various programming languages are available, but we will just use ‘curl’.
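To give an idea of that interface: once Elasticsearch is running (we will start it below), a single curl call against the root endpoint (default HTTP port 9200) returns basic node information as JSON:
curl 'localhost:9200/?pretty'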
Elasticsearch concepts that are important to know:
- index: a collection of documents that have somewhat similar characteristics
- shard: an index can be stored in several parts named ‘shard’
- replica: a copy of a shard
Replicas and shards help in achieving high availability and good performance/throughput.
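To make this concrete: the number of shards and replicas can be specified in the settings when an index is created. A minimal sketch (the index name and the numbers are purely illustrative):
curl -XPUT 'localhost:9200/my-index?pretty' -d '
{
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 1
  }
}'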
Installation of Elasticsearch
My set-up is a laptop that runs BPM Suite in a VirtualBox machine:
- BPM Suite 12.2.1
- Oracle XE database 11.2.0
- Oracle Enterprise Linux 7.2 64bit
- BPM deployed to the Admin server
Examples will be done with user ‘developer’ with home directory ‘/home/developer’.
We will work with version 5.0.0 that was released on October 26th 2016:
- download zip file (there are other installation options): https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.0.0.zip
Installation:
[developer@localhost ~]$ mkdir elastic
[developer@localhost ~]$ cd elastic
[developer@localhost elastic]$ mv ~/Downloads/elasticsearch-5.0.0.zip .
[developer@localhost elastic]$ unzip elasticsearch-5.0.0.zip
Archive:  elasticsearch-5.0.0.zip
   creating: elasticsearch-5.0.0/
   creating: elasticsearch-5.0.0/lib/
  inflating: elasticsearch-5.0.0/lib/elasticsearch-5.0.0.jar
  inflating: elasticsearch-5.0.0/lib/lucene-core-6.2.0.jar
  inflating: elasticsearch-5.0.0/lib/lucene-analyzers-common-6.2.0.jar
...
Basic operations
Starting up Elasticsearch:
[developer@localhost ~]$ cd
[developer@localhost ~]$ cd elastic/elasticsearch-5.0.0/bin
[developer@localhost bin]$ ./elasticsearch
[2016-11-05T10:47:53,156][INFO ][o.e.n.Node               ] [] initializing ...
[2016-11-05T10:47:53,288][INFO ][o.e.e.NodeEnvironment    ] [XBCLgTj] using [1] data paths, mounts [[/home (/dev/mapper/ol-home)]], net usable_space [4.6gb], net total_space [23.4gb], spins? [possibly], types [xfs]
[2016-11-05T10:47:53,289][INFO ][o.e.e.NodeEnvironment    ] [XBCLgTj] heap size [1.9gb], compressed ordinary object pointers [true]
[2016-11-05T10:47:53,291][INFO ][o.e.n.Node               ] [XBCLgTj] node name [XBCLgTj] derived from node ID; set [node.name] to override
[2016-11-05T10:47:53,293][INFO ][o.e.n.Node               ] [XBCLgTj] version[5.0.0], pid[26764], build[253032b/2016-10-26T04:37:51.531Z], OS[Linux/3.8.13-118.2.2.el7uek.x86_64/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_66/25.66-b17]
[2016-11-05T10:47:54,411][INFO ][o.e.p.PluginsService     ] [XBCLgTj] loaded module [aggs-matrix-stats]
[2016-11-05T10:47:54,411][INFO ][o.e.p.PluginsService     ] [XBCLgTj] loaded module [ingest-common]
[2016-11-05T10:47:54,411][INFO ][o.e.p.PluginsService     ] [XBCLgTj] loaded module [lang-expression]
[2016-11-05T10:47:54,411][INFO ][o.e.p.PluginsService     ] [XBCLgTj] loaded module [lang-groovy]
[2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService     ] [XBCLgTj] loaded module [lang-mustache]
[2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService     ] [XBCLgTj] loaded module [lang-painless]
[2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService     ] [XBCLgTj] loaded module [percolator]
[2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService     ] [XBCLgTj] loaded module [reindex]
[2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService     ] [XBCLgTj] loaded module [transport-netty3]
[2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService     ] [XBCLgTj] loaded module [transport-netty4]
[2016-11-05T10:47:54,413][INFO ][o.e.p.PluginsService     ] [XBCLgTj] no plugins loaded
[2016-11-05T10:47:57,921][INFO ][o.e.n.Node               ] [XBCLgTj] initialized
[2016-11-05T10:47:57,921][INFO ][o.e.n.Node               ] [XBCLgTj] starting ...
[2016-11-05T10:47:58,233][INFO ][o.e.t.TransportService   ] [XBCLgTj] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2016-11-05T10:47:58,246][WARN ][o.e.b.BootstrapCheck     ] [XBCLgTj] max file descriptors [4096] for elasticsearch process likely too low, increase to at least [65536]
[2016-11-05T10:47:58,246][WARN ][o.e.b.BootstrapCheck     ] [XBCLgTj] max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]
[2016-11-05T10:48:01,394][INFO ][o.e.c.s.ClusterService   ] [XBCLgTj] new_master {XBCLgTj}{XBCLgTjrSVmrMh69SynfFQ}{5dLnGLzpTXCBGIzvyqIb8g}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2016-11-05T10:48:01,427][INFO ][o.e.h.HttpServer         ] [XBCLgTj] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2016-11-05T10:48:01,427][INFO ][o.e.n.Node               ] [XBCLgTj] started
[2016-11-05T10:48:01,440][INFO ][o.e.g.GatewayService     ] [XBCLgTj] recovered [0] indices into cluster_state
The start-up log shows some warnings about OS limits (max file descriptors and vm.max_map_count) that are too low for a production set-up, but for this exercise we don’t do anything with them. Furthermore, the recommended Java version is 1.8.0_73 or later. The version we used is 1.8.0_66 – the one that is used by WebLogic. Fingers crossed :-S
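If you do want to get rid of those warnings, the limits the log complains about can be raised at the OS level. A minimal sketch, assuming Oracle Linux and that Elasticsearch runs as the ‘developer’ user (run as root; adjust to your environment):
# raise the maximum number of memory-mapped areas; add to /etc/sysctl.conf to make it permanent
sysctl -w vm.max_map_count=262144
# raise the open-file limit for the user that runs Elasticsearch (requires a re-login)
echo 'developer - nofile 65536' >> /etc/security/limits.conf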
First test: use curl to put a simple JSON document into the index named ‘acm-twitter’:
[developer@localhost ~]$ curl -XPUT 'http://localhost:9200/acm-twitter/user/LucGorissen?pretty' -d '{ "name" : "Luc Gorissen" }'
{
  "_index" : "acm-twitter",
  "_type" : "user",
  "_id" : "LucGorissen",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}
[developer@localhost ~]$
Looking up the document:
[developer@localhost ~]$ curl 'localhost:9200/acm-twitter/_search?q=Luc&pretty'
{
  "took" : 67,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.25811607,
    "hits" : [
      {
        "_index" : "acm-twitter",
        "_type" : "user",
        "_id" : "LucGorissen",
        "_score" : 0.25811607,
        "_source" : {
          "name" : "Luc Gorissen"
        }
      }
    ]
  }
}
Note the _score field, which indicates how well the document matches the search query.
So far, all is quite simple. This would be a good time to fiddle around a little bit with Elasticsearch yourself.
Elasticsearch command reference
Cluster Health
curl 'localhost:9200/_cat/health?v'
List all indices
curl 'localhost:9200/_cat/indices?v'
Create index named test
curl -XPUT 'localhost:9200/test?pretty'
Delete index named test
curl -XDELETE 'localhost:9200/test?pretty'
Put in a document (explicit id, here: 1)
curl -XPUT 'localhost:9200/test/external/1?pretty' -d ' { "caseMilestone": { "caseId": "103242", "state": "ATTAINED", "name": "TweetScreenedMilestone", "updatedDate": "2016-05-25T10:27:34.111+02:00" } } '
Put in a document, NO explicit id (Elasticsearch generates one; note that this requires POST instead of PUT)
curl -XPOST 'localhost:9200/test/external?pretty' -d ' { "caseMilestone": { "caseId": "103242", "state": "ATTAINED", "name": "TweetScreenedMilestone", "updatedDate": "2016-05-25T10:27:34.111+02:00" } } '
Get a document
curl -XGET 'localhost:9200/test/external/1?pretty'
Delete a document
curl -XDELETE 'localhost:9200/test/external/1?pretty'
Search all documents in test index
curl 'localhost:9200/test/_search?q=*&pretty'
or
curl -XPOST 'localhost:9200/test/_search?pretty' -d ' { "query": { "match_all": {} } } '
Search all documents and return specific fields
curl -XPOST 'localhost:9200/test/_search?pretty' -d ' { "query": { "match_all": {} }, "_source": ["caseMilestone.caseId", "caseMilestone.name"] }} '
Search all documents for a specific term in a field, and return specific fields
curl -XPOST 'localhost:9200/test/_search?pretty' -d ' { "query": { "match": { "caseMilestone.state": "ATTAINED" } }, "_source": ["caseMilestone.caseId", "caseMilestone.name"] } '
After fiddling around, please clean up your installation, i.e. delete the indexes.
Index documents in PDF, Word, and other formats
The next challenge is how to put PDF and Word documents into Elasticsearch and have them indexed, so that it is possible to search through them.
In Elasticsearch, this is handled with the ‘Ingest Attachment Processor Plugin’, which is based on the Apache text extraction library Tika.
First, we need to install the Ingest Attachment Processor Plugin (https://www.elastic.co/guide/en/elasticsearch/plugins/5.0/ingest-attachment.html):
[developer@localhost elasticsearch-5.0.0]$ pwd
/home/developer/elastic/elasticsearch-5.0.0
[developer@localhost elasticsearch-5.0.0]$ sudo bin/elasticsearch-plugin install ingest-attachment
[sudo] password for developer:
-> Downloading ingest-attachment from elastic
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.lang.RuntimePermission getClassLoader
* java.lang.reflect.ReflectPermission suppressAccessChecks
* java.security.SecurityPermission createAccessControlContext
* java.security.SecurityPermission insertProvider
* java.security.SecurityPermission putProviderProperty.BC
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed ingest-attachment
[developer@localhost elasticsearch-5.0.0]$
I ignored the warnings in the installation…
Now, define a pipeline:
- named ‘amis-attachment’
- with “description” set to “Handle PDF attachment information”
- that contains the processor “attachment” (= the ingest-attachment processor)
- that will take the base64-encoded content from the field “data”
[developer@localhost testdata]$ curl -XPUT 'localhost:9200/_ingest/pipeline/amis-attachment?pretty' -d'
> {
>   "description" : "Handle PDF attachment information",
>   "processors" : [
>     {
>       "attachment" : {
>         "field" : "data"
>       }
>     }
>   ]
> }'
{
  "acknowledged" : true
}
[developer@localhost testdata]$
Now, the pipeline is ready.
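To check that the pipeline was indeed stored, you can fetch it back:
curl -XGET 'localhost:9200/_ingest/pipeline/amis-attachment?pretty'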
Insert a document into index ‘documents’, of type ‘expenses’, and id ‘1’:
[developer@localhost testdata]$ curl -XPUT 'localhost:9200/documents/expenses/1?pipeline=amis-attachment&pretty' -d'
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}'
{
  "_index" : "documents",
  "_type" : "expenses",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}
[developer@localhost testdata]$
And retrieving that document:
[developer@localhost testdata]$ curl -XGET 'localhost:9200/documents/expenses/1?pretty'
{
  "_index" : "documents",
  "_type" : "expenses",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "data" : "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment" : {
      "content_type" : "application/rtf",
      "language" : "ro",
      "content" : "Lorem ipsum dolor sit amet",
      "content_length" : 28
    }
  }
}
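Note that the base64 ‘data’ field is stored in the document alongside the extracted text. For large documents you may not want to retrieve it every time; source filtering can exclude it (assuming the Elasticsearch 5.x request parameter ‘_source_exclude’):
curl -XGET 'localhost:9200/documents/expenses/1?pretty&_source_exclude=data'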
For inserting a larger document, you can use the script ‘insertDoc’ below:
[developer@localhost testdata]$ more insertDoc
#!/bin/bash
# base64-encode the file given as first argument; -w 0 keeps the output on a
# single line, so the resulting JSON stays valid for larger files
coded=`base64 -w 0 $1`
json="{\"data\":\"${coded}\"}"
echo "$json" > json.file
curl -X POST "localhost:9200/documents/expenses/?pipeline=amis-attachment&pretty" -d @json.file
rm json.file
[developer@localhost testdata]$
That makes inserting a document as simple as:
[developer@localhost testdata]$ ./insertDoc WhoWeAre.docx
{
  "_index" : "documents",
  "_type" : "expenses",
  "_id" : "AViurBsirQVDoHLCTe-O",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}
[developer@localhost testdata]$
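From there it is a small step to load a whole directory of documents. A sketch, assuming the PDF and Word files sit in the current directory:
for f in *.pdf *.docx; do
  ./insertDoc "$f"
done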
And now, search for
- the text ‘WordPress’ in the index ‘documents’, with document type ‘expenses’, and
- return the ‘attachment.content’ field
[developer@localhost testdata]$ curl -XPOST 'localhost:9200/documents/expenses/_search?pretty' -d '
{
  "query": { "match": { "attachment.content": "WordPress"}},
  "_source": ["attachment.content"]
}
'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2824934,
    "hits" : [
      {
        "_index" : "documents",
        "_type" : "expenses",
        "_id" : "AViurBsirQVDoHLCTe-O",
        "_score" : 0.2824934,
        "_source" : {
          "attachment" : {
            "content" : "Who We Are\nElastic believes getting immediate, actionable insight from data matters. As the company behind the open source projects — Elasticsearch, Logstash, Kibana, and Beats — designed to take data from any source and search, analyze, and visualize it in real time, Elastic is helping people make sense of data. From stock quotes to Twitter streams, Apache logs to WordPress blogs, our products are extending what's possible with data, delivering on the promise that good things come from connecting the dots."
          }
        }
      }
    ]
  }
}
Or, a simpler search:
[developer@localhost testdata]$ curl -XPOST 'localhost:9200/documents/expenses/_search?q=Wordpress&pretty'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.26545224,
    "hits" : [
      {
        "_index" : "documents",
        "_type" : "expenses",
        "_id" : "AViurBsirQVDoHLCTe-O",
        "_score" : 0.26545224,
        "_source" : {
          "data" : "UEsDBBQABgAIAAAAIQAJJIeCgQEAAI4FAAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAIooAAC...
...
Architecture considerations
Elasticsearch has a REST API that is rather straightforward to use. It can easily handle JSON documents, and then perform a search on them. It can also handle various document formats by using the ‘ingest-attachment’ plugin. That plugin is based on Apache Tika (http://tika.apache.org/) and can handle lots of formats: http://tika.apache.org/1.14/formats.html. Most likely, in most organizations this Elasticsearch capability will have to compete with a Document Management System that also has this type of document search capability.