Installation of Elasticsearch

This blog is one of a series of 6 blogs around the combination of Elasticsearch (‘the ELK stack’) and Oracle Adaptive Case Management.

The series covers:

1. Elasticsearch and Oracle Middleware – is there an opportunity?
2. Installation of Elasticsearch: installation and the indexing of – human generated – documents
3. Elasticsearch and Oracle ACM data: example with ACM data
4. Kibana for ACM dashboards: an example of dashboards with ACM data
5. Logstash and Fusion Middleware: how to get log file data into Elasticsearch
6. Beats and Fusion Middleware: a more advanced way to handle log files

Now we can get going with Elasticsearch. As Elastic.co puts it on https://www.elastic.co/start:

Grab Your Towel.
Adventures Await.

In this blog we will show:

  • an outline of Elasticsearch concepts
  • the installation of Elasticsearch
  • basic operations
  • how to index documents in PDF, Word and other formats

Outline of Elasticsearch concepts

Elasticsearch stores data. That data is indexed inside Elasticsearch and can then be searched ‘Google-style’. Elasticsearch exposes RESTful, JSON-based APIs. Clients for various programming languages are available, but we will just use ‘curl’.

Elasticsearch concepts that are important to know:

  • index: a collection of documents that have somewhat similar characteristics
  • shard: an index can be stored in several parts, each of which is called a ‘shard’
  • replica: a copy of a shard

Replicas and shards help in achieving high availability and good performance/throughput.
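To make the shard and replica concepts a bit more concrete, here is a minimal sketch of creating an index with an explicit number of shards and replicas. The index name, the counts and the `ES_URL` variable are illustrative assumptions; `localhost:9200` is the default address of a local node.

```shell
#!/bin/bash
# Sketch: create an index with an explicit number of shards and replicas.
# ES_URL, the index name and the shard/replica counts are illustrative assumptions.
ES_URL="${ES_URL:-localhost:9200}"

# Print the JSON settings body for the given shard and replica counts.
index_settings_json() {
  printf '{"settings":{"number_of_shards":%d,"number_of_replicas":%d}}' "$1" "$2"
}

# Create index $1 with $2 primary shards and $3 replicas per shard.
create_index() {
  curl -XPUT "${ES_URL}/$1?pretty" -d "$(index_settings_json "$2" "$3")"
}
```

Once a node is running, `create_index test 3 1` would create an index named test with three primary shards and one replica of each. On a single node the replicas stay unassigned, because a replica is never placed on the same node as its primary shard.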

Installation of Elasticsearch

My set-up is a laptop that runs BPM Suite in a VirtualBox machine:

  • BPM Suite 12.2.1
  • Oracle XE database 11.2.0
  • Oracle Enterprise Linux 7.2 64bit
  • BPM deployed to the Admin server

The examples are run as user ‘developer’, with home directory ‘/home/developer’.

We will work with version 5.0.0, which was released on October 26th 2016.

Installation:

[developer@localhost elastic]$ cd
 [developer@localhost ~]$ mkdir elastic
 [developer@localhost ~]$ cd elastic
 [developer@localhost elastic]$ mv ~/Downloads/elasticsearch-5.0.0.zip .
 [developer@localhost elastic]$ unzip elasticsearch-5.0.0.zip
 Archive: elasticsearch-5.0.0.zip
 creating: elasticsearch-5.0.0/
 creating: elasticsearch-5.0.0/lib/
 inflating: elasticsearch-5.0.0/lib/elasticsearch-5.0.0.jar
 inflating: elasticsearch-5.0.0/lib/lucene-core-6.2.0.jar
 inflating: elasticsearch-5.0.0/lib/lucene-analyzers-common-6.2.0.jar
 ...

Basic operations

Starting up Elasticsearch:

[developer@localhost ~]$ cd
 [developer@localhost ~]$ cd elastic/elasticsearch-5.0.0/bin
 [developer@localhost bin]$ ./elasticsearch
 [2016-11-05T10:47:53,156][INFO ][o.e.n.Node ] [] initializing ...
 [2016-11-05T10:47:53,288][INFO ][o.e.e.NodeEnvironment ] [XBCLgTj] using [1] data paths, mounts [[/home (/dev/mapper/ol-home)]], net usable_space [4.6gb], net total_space [23.4gb], spins? [possibly], types [xfs]
 [2016-11-05T10:47:53,289][INFO ][o.e.e.NodeEnvironment ] [XBCLgTj] heap size [1.9gb], compressed ordinary object pointers [true]
 [2016-11-05T10:47:53,291][INFO ][o.e.n.Node ] [XBCLgTj] node name [XBCLgTj] derived from node ID; set [node.name] to override
 [2016-11-05T10:47:53,293][INFO ][o.e.n.Node ] [XBCLgTj] version[5.0.0], pid[26764], build[253032b/2016-10-26T04:37:51.531Z], OS[Linux/3.8.13-118.2.2.el7uek.x86_64/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_66/25.66-b17]
 [2016-11-05T10:47:54,411][INFO ][o.e.p.PluginsService ] [XBCLgTj] loaded module [aggs-matrix-stats]
 [2016-11-05T10:47:54,411][INFO ][o.e.p.PluginsService ] [XBCLgTj] loaded module [ingest-common]
 [2016-11-05T10:47:54,411][INFO ][o.e.p.PluginsService ] [XBCLgTj] loaded module [lang-expression]
 [2016-11-05T10:47:54,411][INFO ][o.e.p.PluginsService ] [XBCLgTj] loaded module [lang-groovy]
 [2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService ] [XBCLgTj] loaded module [lang-mustache]
 [2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService ] [XBCLgTj] loaded module [lang-painless]
 [2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService ] [XBCLgTj] loaded module [percolator]
 [2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService ] [XBCLgTj] loaded module [reindex]
 [2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService ] [XBCLgTj] loaded module [transport-netty3]
 [2016-11-05T10:47:54,412][INFO ][o.e.p.PluginsService ] [XBCLgTj] loaded module [transport-netty4]
 [2016-11-05T10:47:54,413][INFO ][o.e.p.PluginsService ] [XBCLgTj] no plugins loaded
 [2016-11-05T10:47:57,921][INFO ][o.e.n.Node ] [XBCLgTj] initialized
 [2016-11-05T10:47:57,921][INFO ][o.e.n.Node ] [XBCLgTj] starting ...
 [2016-11-05T10:47:58,233][INFO ][o.e.t.TransportService ] [XBCLgTj] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
 [2016-11-05T10:47:58,246][WARN ][o.e.b.BootstrapCheck ] [XBCLgTj] max file descriptors [4096] for elasticsearch process likely too low, increase to at least [65536]
 [2016-11-05T10:47:58,246][WARN ][o.e.b.BootstrapCheck ] [XBCLgTj] max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]
 [2016-11-05T10:48:01,394][INFO ][o.e.c.s.ClusterService ] [XBCLgTj] new_master {XBCLgTj}{XBCLgTjrSVmrMh69SynfFQ}{5dLnGLzpTXCBGIzvyqIb8g}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
 [2016-11-05T10:48:01,427][INFO ][o.e.h.HttpServer ] [XBCLgTj] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
 [2016-11-05T10:48:01,427][INFO ][o.e.n.Node ] [XBCLgTj] started
 [2016-11-05T10:48:01,440][INFO ][o.e.g.GatewayService ] [XBCLgTj] recovered [0] indices into cluster_state

The start-up log shows two warnings: the limit on open file descriptors and the value of vm.max_map_count are both lower than Elasticsearch would like. For this exercise we leave them as they are. Furthermore, the recommended Java version is 1.8.0_73 or later, while the version we used is 1.8.0_66, the one that is used by WebLogic. Fingers crossed :-S
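The two BootstrapCheck warnings in the start-up log concern the open-file limit and vm.max_map_count. As a minimal sketch, the current values can be compared with the minimums named in the log like this. The thresholds are copied from the warnings; reading /proc/sys/vm/max_map_count assumes Linux, and the helper name `meets_minimum` is made up for this example.

```shell
#!/bin/bash
# Sketch: compare the current limits with the minimums named in the warnings.
# The thresholds are copied from the log; /proc/sys/vm/max_map_count is Linux-specific.
REQUIRED_FILES=65536
REQUIRED_MAPS=262144

# Succeeds when the current value $1 is at least the required value $2.
# "unlimited" (a possible ulimit answer) always passes.
meets_minimum() {
  [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

current_files="$(ulimit -n)"
current_maps="$(cat /proc/sys/vm/max_map_count 2>/dev/null || echo 0)"

meets_minimum "$current_files" "$REQUIRED_FILES" \
  || echo "max file descriptors is $current_files, want at least $REQUIRED_FILES"
meets_minimum "$current_maps" "$REQUIRED_MAPS" \
  || echo "vm.max_map_count is $current_maps, want at least $REQUIRED_MAPS"
```

If you do want to get rid of the warnings, `sudo sysctl -w vm.max_map_count=262144` raises the second value until the next reboot, and the open-file limit can be raised with `ulimit -n 65536` in the shell that starts Elasticsearch, provided the hard limit allows it.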

First test: use curl to put a simple JSON document into the index named ‘acm-twitter’:

[developer@localhost ~]$ curl -XPUT 'http://localhost:9200/acm-twitter/user/LucGorissen?pretty' -d '{ "name" : "Luc Gorissen" }'
{
  "_index" : "acm-twitter",
  "_type" : "user",
  "_id" : "LucGorissen",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}
[developer@localhost ~]$

Looking up the document:

[developer@localhost ~]$ curl 'localhost:9200/acm-twitter/_search?q=Luc&pretty'
{
  "took" : 67,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.25811607,
    "hits" : [
      {
        "_index" : "acm-twitter",
        "_type" : "user",
        "_id" : "LucGorissen",
        "_score" : 0.25811607,
        "_source" : {
          "name" : "Luc Gorissen"
        }
      }
    ]
  }
}

Note the _score, which indicates how well the document matches the search query.

So far, all is quite simple. This would be the time to fiddle a little bit with Elasticsearch.

Elasticsearch command reference

Cluster health

 curl 'localhost:9200/_cat/health?v'

List all indices

 curl 'localhost:9200/_cat/indices?v'

Create index named test

 curl -XPUT 'localhost:9200/test?pretty'

Delete index named test

 curl -XDELETE 'localhost:9200/test?pretty'

Put in a document (explicit id, here: 1)

curl -XPUT 'localhost:9200/test/external/1?pretty' -d '
{
  "caseMilestone": {
    "caseId": "103242",
    "state": "ATTAINED",
    "name": "TweetScreenedMilestone",
    "updatedDate": "2016-05-25T10:27:34.111+02:00"
  }
}
'

Put in a document, NO explicit id (Elasticsearch generates the id, which requires POST instead of PUT)

curl -XPOST 'localhost:9200/test/external?pretty' -d '
{
  "caseMilestone": {
    "caseId": "103242",
    "state": "ATTAINED",
    "name": "TweetScreenedMilestone",
    "updatedDate": "2016-05-25T10:27:34.111+02:00"
  }
}
'

Get a document

 curl -XGET 'localhost:9200/test/external/1?pretty'

Delete a document

 curl -XDELETE 'localhost:9200/test/external/1?pretty'

Search all documents in test index

 curl 'localhost:9200/test/_search?q=*&pretty'

or

curl -XPOST 'localhost:9200/test/_search?pretty' -d '
{
  "query": { "match_all": {} }
}
'

Search all documents and return specific fields

 curl -XPOST 'localhost:9200/test/_search?pretty' -d '
 {
 "query": { "match_all": {} },
 "_source": ["caseMilestone.caseId", "caseMilestone.name"]
 }
 '

Search all documents for a specific term in a field, and return specific fields

 curl -XPOST 'localhost:9200/test/_search?pretty' -d '
 {
 "query": { "match": { "caseMilestone.state": "ATTAINED" } },
 "_source": ["caseMilestone.caseId", "caseMilestone.name"]
 }
 '

After fiddling around, please clean up your installation, i.e. delete the indexes.
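A minimal sketch of that clean-up, assuming the indexes are the two created earlier in this post (acm-twitter and test). The `ES_URL` variable and the function names are made up for this example; the delete index API accepts a comma-separated list of index names.

```shell
#!/bin/bash
# Sketch: remove the indexes created while experimenting.
# ES_URL and the function names are illustrative assumptions.
ES_URL="${ES_URL:-localhost:9200}"

# Join the given index names with commas and build the delete URL.
delete_url() {
  local IFS=,
  printf '%s/%s?pretty' "$ES_URL" "$*"
}

# Delete all given indexes in one request.
delete_indices() {
  curl -XDELETE "$(delete_url "$@")"
}
```

With a running node, `delete_indices acm-twitter test` removes both indexes, and `curl 'localhost:9200/_cat/indices?v'` should then show that they are gone.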

Index documents in PDF, Word and other formats

The next challenge is to put PDF and Word documents into Elasticsearch and have them indexed, so that their content can be searched.

In Elasticsearch, this is handled by the ‘Ingest Attachment Processor Plugin’, which is based on the Apache Tika text extraction library.

First, we need to install the Ingest Attachment Processor Plugin (https://www.elastic.co/guide/en/elasticsearch/plugins/5.0/ingest-attachment.html):

[developer@localhost elasticsearch-5.0.0]$ pwd
/home/developer/elastic/elasticsearch-5.0.0
[developer@localhost elasticsearch-5.0.0]$ sudo bin/elasticsearch-plugin install ingest-attachment
[sudo] password for developer:
-> Downloading ingest-attachment from elastic
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: plugin requires additional permissions @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.lang.RuntimePermission getClassLoader
* java.lang.reflect.ReflectPermission suppressAccessChecks
* java.security.SecurityPermission createAccessControlContext
* java.security.SecurityPermission insertProvider
* java.security.SecurityPermission putProviderProperty.BC
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed ingest-attachment
[developer@localhost elasticsearch-5.0.0]$

I ignored the warnings in the installation…

Now, define a pipeline:

  • named ‘amis-attachment’
  • with “description” set to “Handle PDF attachment information”
  • that contains the processor “attachment” (= ingest-attachment processor)
  • that reads the base64-encoded document from the field “data”

[developer@localhost testdata]$ curl -XPUT 'localhost:9200/_ingest/pipeline/amis-attachment?pretty' -d'
> {
>   "description" : "Handle PDF attachment information",
>   "processors" : [
>     {
>       "attachment" : {
>         "field" : "data"
>       }
>     }
>   ]
> }'
{
  "acknowledged" : true
}
[developer@localhost testdata]$

Now, the pipeline is ready.
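Before indexing anything, the pipeline can be tried out with the ingest `_simulate` API, which runs the processors on a sample document without storing it. The sketch below only wraps that call; `ES_URL` and the function names are assumptions made for this example.

```shell
#!/bin/bash
# Sketch: dry-run an ingest pipeline with the _simulate API.
# ES_URL and the function names are illustrative assumptions.
ES_URL="${ES_URL:-localhost:9200}"

# Wrap one base64 string in the request body expected by _simulate.
simulate_body() {
  printf '{"docs":[{"_source":{"data":"%s"}}]}' "$1"
}

# Run the pipeline named $1 against the base64 document $2 without storing it.
simulate_pipeline() {
  curl -XPOST "${ES_URL}/_ingest/pipeline/$1/_simulate?pretty" -d "$(simulate_body "$2")"
}
```

For example, `simulate_pipeline amis-attachment "$(base64 -w 0 some.pdf)"` would show what the attachment processor extracts from a hypothetical file some.pdf. The stored pipeline definition itself can be read back with `curl 'localhost:9200/_ingest/pipeline/amis-attachment?pretty'`.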

Insert a document into index ‘documents’, of type ‘expenses’, and id ‘1’:

[developer@localhost testdata]$ curl -XPUT 'localhost:9200/documents/expenses/1?pipeline=amis-attachment&pretty' -d'
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}'
{
  "_index" : "documents",
  "_type" : "expenses",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}
[developer@localhost testdata]$ 

And retrieving that document:

[developer@localhost testdata]$ curl -XGET 'localhost:9200/documents/expenses/1?pretty'
{
  "_index" : "documents",
  "_type" : "expenses",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "data" : "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment" : {
      "content_type" : "application/rtf",
      "language" : "ro",
      "content" : "Lorem ipsum dolor sit amet",
      "content_length" : 28
    }
  }
}
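The “data” field is nothing more than the base64 encoding of the original file. As a small sketch, decoding the string from the example shows the tiny RTF document that the attachment processor received:

```shell
#!/bin/bash
# Sketch: decode the "data" field to see the document that was indexed.
DATA='e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0='

# base64 -d turns the encoded string back into the original bytes (a small RTF file).
decoded="$(printf '%s' "$DATA" | base64 -d)"
printf '%s\n' "$decoded"
```

The decoded text is an RTF wrapper around the sentence “Lorem ipsum dolor sit amet”, which is the text that ends up in attachment.content.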

For inserting a larger document, you can use the script ‘insertDoc’ below:

[developer@localhost testdata]$ more insertDoc
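#!/bin/bash
# Usage: ./insertDoc <file>

# Base64-encode the file on a single line (-w 0 disables GNU base64 line wrapping)
coded=$(base64 -w 0 "$1")
json="{\"data\":\"${coded}\"}"
# Write the request body to a temporary file; curl reads the body from it
echo "$json" > json.file
curl -X POST "localhost:9200/documents/expenses/?pipeline=amis-attachment&pretty" -d @json.file
rm json.file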
[developer@localhost testdata]$

That makes inserting a document as simple as:

[developer@localhost testdata]$ ./insertDoc WhoWeAre.docx 
{
  "_index" : "documents",
  "_type" : "expenses",
  "_id" : "AViurBsirQVDoHLCTe-O",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}
[developer@localhost testdata]$
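To load a whole directory of documents in one go, the same idea can be put in a loop. This is a sketch under assumptions: the function names and `ES_URL` are made up for this example, and the index, type and pipeline names follow the examples above. The payload is piped straight into curl (`-d @-` reads the body from standard input), so no temporary file is needed.

```shell
#!/bin/bash
# Sketch: build the JSON payload for one file and index every file in a directory.
# ES_URL and the function names are illustrative assumptions.
ES_URL="${ES_URL:-localhost:9200}"

# Print {"data":"<base64 of file $1>"}; tr removes the line breaks base64 inserts.
make_payload() {
  printf '{"data":"%s"}' "$(base64 < "$1" | tr -d '\n')"
}

# Index every regular file in directory $1 through the amis-attachment pipeline.
index_directory() {
  local f
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    make_payload "$f" | curl -XPOST \
      "${ES_URL}/documents/expenses/?pipeline=amis-attachment&pretty" -d @-
  done
}
```

With a running node and the pipeline in place, `index_directory .` would index all files in the current directory, each with an id generated by Elasticsearch.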

And now:

  • search for the text ‘WordPress’ in the index ‘documents’, with document type ‘expenses’, and
  • return only the ‘attachment.content’ field

[developer@localhost testdata]$ curl -XPOST 'localhost:9200/documents/expenses/_search?pretty' -d '
{
  "query": { "match": { "attachment.content": "WordPress"}},
 "_source": ["attachment.content"]
}
'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2824934,
    "hits" : [
      {
        "_index" : "documents",
        "_type" : "expenses",
        "_id" : "AViurBsirQVDoHLCTe-O",
        "_score" : 0.2824934,
        "_source" : {
          "attachment" : {
            "content" : "Who We Are\nElastic believes getting immediate, actionable insight from data matters. As the company behind the open source projects — Elasticsearch, Logstash, Kibana, and Beats — designed to take data from any source and search, analyze, and visualize it in real time, Elastic is helping people make sense of data. From stock quotes to Twitter streams, Apache logs to WordPress blogs, our products are extending what's possible with data, delivering on the promise that good things come from connecting the dots."
          }
        }
      }
    ]
  }
}

Or, a simpler search that passes the query in the URL:

[developer@localhost testdata]$ curl -XPOST 'localhost:9200/documents/expenses/_search?q=Wordpress&pretty' 
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.26545224,
    "hits" : [
      {
        "_index" : "documents",
        "_type" : "expenses",
        "_id" : "AViurBsirQVDoHLCTe-O",
        "_score" : 0.26545224,
        "_source" : {
          "data" : "UEsDBBQABgAIAAAAIQAJJIeCgQEAAI4FAAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAIoolE1Pg0AQhu8m/geyVwPbejDGlPag9ahNrPG8LkPZyH5kZ/v17x1KS6qhpVq9kMAy7/vMCzOD0UqX0QI8KmtS1k96LAIjbabMLGWv08f4lkUYhMlEaQ2kbA3IRsPLi8F07QAjqjaYsiIEd8c5ygK0wMQ6MHSSW69FoFs/407IDzEDft3r3XBpTQAT4lBpsOHgAXIxL0M0XtHjmsRDiSy6r1+svFImnCuVFIFI+cJk31zirUNClZt3sFAOrwiD8VaH6uSwwbbumaLxKoNoInx4Epow+NL6jGdWzjX1kByXaeG0ea4kNPWVmvNWAiJlrsukOdFCmR3/QQ4M6xLw7ylq3RPt31QoxnkOkj52dx4a46rppLbYq+12gxAopFNMvv6CcVfouFXuRFjC+8u/UeyJd4LkNBpT8V7CCYn/MIxGuhMi0LwD31z7Z3NsZI5Z0mRMvHVI+8P/ou3dgqiqYxo5Bz4oaFZE24g1jrR7zu4Pqu2WQdbizTfbdPgJAAD//wMAUEsDBBQABgAIAAAAIQAekRq38wAAAE4CAAALAAgCX3JlbHMvLnJlbHMgogQCKKAAAg
...
...

Architecture considerations

Elasticsearch has a REST API that is rather straightforward to use. It can easily handle JSON documents and then perform a search on them. It can also handle various document formats by using the ‘ingest-attachment’ plugin. That plugin is based on Apache Tika (http://tika.apache.org/) and can handle lots of formats: http://tika.apache.org/1.14/formats.html. In most organizations, this Elasticsearch capability will most likely have to compete with a Document Management System that offers the same kind of document search.