This blog is one in a series of 6 blogs about the combination of Elasticsearch (‘the ELK stack’) and Oracle Adaptive Case Management.
The series covers:
1. Elasticsearch and Oracle Middleware – is there an opportunity?
2. Installation of Elasticsearch: installation and the indexing of – human generated – documents
3. Elasticsearch and Oracle ACM data: example with ACM data
4. Kibana for ACM dashboards: an example of dashboards with ACM data
5. Logstash and Fusion Middleware: how to get log file data into Elasticsearch
6. Beats and Fusion Middleware: a more advanced way to handle log files
This blog will cover loading a sample data set into Elasticsearch. We assume that you cleaned up your installation after going through the previous blog, i.e. deleted the indexes.
The ACM sample case
At AMIS, just before the AMIS25 conference, we wanted to reach out to people who needed more information on the AMIS25 conference. One of the sources was the tweets of the Twitter account ‘@AMIS’. First, a tweet would be screened by ‘Office Management’. They would determine whether that tweet needed additional screening by Marketing or by the CTO. That resulted in an ACM case implementation according to the diagram below:
The case has case activities:
- ScreenTweetProcess – ‘Office Management’ screens a tweet
- MarketingScreeningProcess – Marketing department screens a tweet
- CtoScreeningProcess – the CTO screens a tweet
The case has case milestones:
- TweetScreened – Office Management has screened the tweet
- MarketingScreened – Marketing Department has screened the tweet
- CtoScreened – CTO has screened the tweet
- TweetCompleted – indicates whether the case has completed
- TweeterContacted – not used in this case
The case has case data:
- tweetId – the tweet identification by Twitter
- tweetText – the text of the tweet
- tweetCreated – time-stamp when the tweet was sent
- marketingScreening – indicates whether the Marketing Department has to screen a tweet
- ctoScreening – indicates whether the CTO has to screen a tweet
The above-described ACM sample case was implemented and run (simulated with SoapUI). The simulation generated JSON documents that now have to be inserted into Elasticsearch.
The JSON documents were generated with some quick-and-dirty usage of the API. Next, they will be inserted into Elasticsearch with a script. Should you want to do this in a real production setup, you should consider a well-designed implementation for getting data from ACM into Elasticsearch!
Important: part of the implementation of this case was the generation of (JSON) documents that hold information on the case execution. More specifically, JSON documents for case activities, milestones and data were generated at the end of the case execution. Sample JSON documents are:
Case Activity:
{
  "caseActivityDefinition" : {
    "caseId" : "72046d01-bcc4-464c-a601-b2fa94bbe0fd",
    "completedDate" : "2016-11-19T13:57:18.186+01:00",
    "definitionId" : "default/TwitterSupport!1.0/CtoScreeningProcess",
    "displayName" : "CtoScreeningProcess",
    "instanceId" : "600227",
    "name" : "CtoScreeningProcess",
    "nameSpace" : "http://xmlns.amis.nl/TwitterSupport/CtoScreeningProcess",
    "startDate" : "2016-11-19T13:57:16.859+01:00"
  }
}
The document has:
- a caseId, which makes it possible to link it to other documents from that same case instance
- a start and end date – so the duration of the activity is known
- some technical data
Case Milestone:
{
  "caseMilestone" : {
    "caseId" : "c637ec12-8ca1-492b-9822-5fb03f378122",
    "state" : "ATTAINED",
    "name" : "TweetCompleted",
    "updatedDate" : "2016-11-19T13:57:41.416+01:00"
  }
}
Case Data:
{
  "caseData" : {
    "caseId" : "c637ec12-8ca1-492b-9822-5fb03f378122",
    "value" : "RT @robbrecht: Orcas - Automatic deployment for the database https://t.co/4U6QSuROjf @amisnl @OC_WIRE",
    "name" : "tweetText"
  }
}
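For reference, a single document of this shape can be inserted by hand with curl. This is a sketch, not the data set's actual insert script: the index name (casemilestones) and type name (external) are taken from the data set as it appears later in this blog, and the curl command is echoed rather than executed, so it can be inspected first. Drop the 'echo' to run it against your own instance.

```shell
# Insert one case milestone document by hand.
# Index/type names (casemilestones/external) match the sample data set.
ES=localhost:9200
doc='{ "caseMilestone" : { "caseId" : "c637ec12-8ca1-492b-9822-5fb03f378122", "state" : "ATTAINED", "name" : "TweetCompleted", "updatedDate" : "2016-11-19T13:57:41.416+01:00" } }'
# Echoed so the sketch runs without a live cluster; drop 'echo' to execute.
echo curl -XPOST "$ES/casemilestones/external" -d "$doc"
```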
A simulation of the case, handling 3212 tweets, resulted in:
- 3387 case activity json documents
- 16060 case data json documents
- 16060 case milestone json documents
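The case-data and milestone counts follow directly from the case definition: each of the 3212 cases produces one JSON document per case-data item and one per milestone (five of each, as listed above), while the activity count exceeds 3212 because some tweets were additionally screened by Marketing or the CTO. A quick sanity check:

```shell
# Every case emits one document per case-data item and one per milestone.
cases=3212
data_fields=5   # tweetId, tweetText, tweetCreated, marketingScreening, ctoScreening
milestones=5    # TweetScreened, MarketingScreened, CtoScreened, TweetCompleted, TweeterContacted
echo "expected case data docs: $((cases * data_fields))"   # prints 16060
echo "expected milestone docs: $((cases * milestones))"    # prints 16060
```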
The case data set can be found here: run2-final.zip
Cases, their data and Elasticsearch
It is important to think about what data from a case instance has to be put into Elasticsearch. In the above sample, JSON documents were generated when the case execution was completed. That may cover your ‘search use case’. Or maybe you need more ‘realtime’ data generation, e.g. generating a milestone JSON document as soon as a milestone has changed. Furthermore, it is important to realize that this is not ‘something that comes for free’: the document generation, and how to get these documents into Elasticsearch, is something that has to be carefully designed and built.
How to store that data in Elasticsearch
Back to our sample case. Now that we have a set of documents, we must consider how to store them in Elasticsearch. One question that has to be answered is whether to:
- put the data in 1 index and have different types for ‘case data’, ‘milestones’ and ‘case activities’
- put the data in 3 different indexes, one for ‘case data’, one for ‘milestones’ and one for ‘case activities’
A good article on that question is https://www.elastic.co/blog/index-vs-type.
Also, an ‘infrastructural’ design has to be made on how to implement the Elasticsearch layout in terms of replicas and shards.
As the above questions are beyond the scope of this blog series, we continue with:
- the default Elasticsearch installation – and not worry about replicas and shards
- 3 indexes, one for ‘case data’, one for ‘milestones’ and one for ‘case activities’
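With those choices made, the three indexes could also be created explicitly before loading, although a default installation will auto-create an index on the first document insert. A sketch, with the index names matching what the data set uses; the commands are echoed so they can be inspected first:

```shell
# Create the three indexes up front; a default installation would also
# auto-create them on the first document insert.
ES=localhost:9200
for idx in caseactivities casedata casemilestones; do
  echo curl -XPUT "$ES/$idx"   # drop 'echo' to execute
done
```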
Uploading the sample data set
The above-mentioned sample case data set can be uploaded to Elasticsearch using the scripts that are included in the zip file (OEL 7.2):
run2-final.zip
[developer@localhost ~]$ cd
[developer@localhost ~]$ unzip run2-final.zip
....
[developer@localhost ~]$ cd run2/datainsert/
[developer@localhost datainsert]$ ls
insertData
[developer@localhost datainsert]$ ./insertData
...
[developer@localhost datainsert]$ curl 'localhost:9200/_cat/indices?v'
health status index          uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   caseactivities TupoWoM-TUm3gbRdK1fuKw   5   1       3387            0      1.8mb          1.8mb
yellow open   casedata       feHn0AxLQ8Sg50a0SKDcpg   5   1      16060            0      5.7mb          5.7mb
yellow open   casemilestones jU2LVuiiSBeA30PsZnZuGQ   5   1      16060            0      4.1mb          4.1mb
[developer@localhost datainsert]$
In my setup, the 35507 documents were uploaded in roughly 8 minutes and 11 seconds. That’s about 72 documents per second.
Then there is the question of how to interpret that number. One could argue that the test only covers small documents on a virtually empty system. On the other hand, during the upload disk I/O was the bottleneck, not CPU. More information on the tuning and scaling options of Elasticsearch can be found here: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html. Given that this runs on my VirtualBox set-up with default Elasticsearch settings, I consider this a fairly good result.
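One common way to improve indexing throughput, instead of sending one HTTP request per document, is the Elasticsearch _bulk API, which accepts many documents in a single request: an action line precedes each source document in a newline-delimited payload. A minimal sketch with two stand-in documents (the file names and contents here are made up for illustration; the final curl is echoed so the sketch runs without a live cluster):

```shell
# Create two stand-in source documents (one JSON object per file):
echo '{ "caseData" : { "caseId" : "c1", "name" : "tweetId", "value" : "1" } }' > doc1.json
echo '{ "caseData" : { "caseId" : "c1", "name" : "ctoScreening", "value" : "false" } }' > doc2.json

# Build the _bulk payload: an action line before each document.
: > bulk.ndjson
for f in doc1.json doc2.json; do
  echo '{ "index" : { "_index" : "casedata", "_type" : "external" } }' >> bulk.ndjson
  cat "$f" >> bulk.ndjson
done

wc -l < bulk.ndjson   # 4 lines: two action lines plus two documents
# Drop 'echo' to send the payload to a running cluster:
echo curl -XPOST 'localhost:9200/_bulk' --data-binary @bulk.ndjson
```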
Sample queries
Looking for a tweet containing ‘Katwijk’:
[developer@localhost ~]$ curl 'localhost:9200/_search?q=Katwijk&pretty'
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 15,
    "successful" : 15,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 4.6824956,
    "hits" : [
      {
        "_index" : "casedata",
        "_type" : "external",
        "_id" : "AViU-ClPZ_yJneU3z56D",
        "_score" : 4.6824956,
        "_source" : {
          "caseData" : {
            "caseId" : "69ebc23b-1e19-4da7-a8ed-544417b8bbd3",
            "value" : "Hash joins and Bloom filters by @ToonKoppelaars at the #AMIS25 Conference in Katwijk, NL. https://t.co/8PY2BvbIAB @Oracle_NL",
            "name" : "tweetText"
          }
        }
      }
    ]
  }
}
In the above query, all indexes were searched. Because we are looking for a tweet, we know that only the casedata index has to be searched. So, a more efficient search would look only in that index:
[developer@localhost ~]$ curl 'localhost:9200/casedata/_search?q=Katwijk&pretty'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 4.6824956,
    "hits" : [
      {
        "_index" : "casedata",
        "_type" : "external",
        "_id" : "AViU-ClPZ_yJneU3z56D",
        "_score" : 4.6824956,
        "_source" : {
          "caseData" : {
            "caseId" : "69ebc23b-1e19-4da7-a8ed-544417b8bbd3",
            "value" : "Hash joins and Bloom filters by @ToonKoppelaars at the #AMIS25 Conference in Katwijk, NL. https://t.co/8PY2BvbIAB @Oracle_NL",
            "name" : "tweetText"
          }
        }
      }
    ]
  }
}
[developer@localhost ~]$
The ‘took’ value changed from 7 to 4 ms, indicating that this search is indeed more efficient.
Now, look for all documents with caseId 69ebc23b-1e19-4da7-a8ed-544417b8bbd3:
[developer@localhost ~]$ curl 'localhost:9200/_search?q=69ebc23b-1e19-4da7-a8ed-544417b8bbd3&pretty'
{
  "took" : 52,
  "timed_out" : false,
  "_shards" : {
    "total" : 15,
    "successful" : 15,
    "failed" : 0
  },
  "hits" : {
    "total" : 12,
    "max_score" : 44.742836,
    "hits" : [
      {
        "_index" : "casedata",
        "_type" : "external",
        "_id" : "AViU-CkZZ_yJneU3z55_",
        "_score" : 44.742836,
        "_source" : {
          "caseData" : {
            "caseId" : "69ebc23b-1e19-4da7-a8ed-544417b8bbd3",
            "value" : "false",
            "name" : "ctoScreening"
          }
        }
      },
      {
        "_index" : "casedata",
        "_type" : "external",
        "_id" : "AViU-CkmZ_yJneU3z56A",
        "_score" : 39.95956,
        "_source" : {
          "caseData" : ...
The above search returns 12 results.
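The q= parameter above matched the caseId wherever it occurred in any document. With the Lucene query-string syntax, the search can also be restricted to a specific field in a specific index. A sketch (echoed rather than executed, and assuming the default mapping; %22 is a URL-encoded double quote, so the UUID is searched as a phrase):

```shell
# Search only the casedata index, and only the caseData.caseId field.
# Drop 'echo' to run against a live instance.
echo curl "localhost:9200/casedata/_search?q=caseData.caseId:%2269ebc23b-1e19-4da7-a8ed-544417b8bbd3%22&pretty"
```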
Elasticsearch indexes – targeted search by clients
In case a web application accesses Elasticsearch to search through cases, the indexes offer a means to better target the search. Your web application could offer a ‘Google-style’ tabbed search, where the user can decide what areas/results the search query has to cover. In our situation, it would be straightforward to, e.g., target a search at ‘only case data’. The web application would then only have to access/target the corresponding Elasticsearch index.
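Elasticsearch makes such a tabbed search easy to implement, because the search URL accepts a comma-separated list of index names: each tab can simply map to one or more indexes. A sketch with hypothetical tab names, using the index names from this blog (commands echoed so the sketch runs without a live cluster):

```shell
ES=localhost:9200
# Tab 'Case data': search a single index.
echo curl "$ES/casedata/_search?q=Katwijk&pretty"
# Tab 'Milestones and activities': search two indexes at once.
echo curl "$ES/casemilestones,caseactivities/_search?q=TweetCompleted&pretty"
```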