What is Apache Drill and how to setup your Proof-of-Concept

2

Apache Drill is a schema-free SQL query engine. Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.

Data Analytics

If you do any kind of data analytics, you most likely agree that there is one major problem in the industry. Data volumes have increased exponentially in the last 10 years and so have the kind of systems and formats. All of this comes back to the basic fact that data is not optimally arranged for ad hoc analysis. Since a data scientist will often be confronted with a wide variety of formats such as JSON, XML, CSV stored in MongoDB, Hadoop and MySQL your job is basically gathering and cleaning data. This is a time-consuming process that most people do not enjoy. This process is often called ETL, extract, transform and load. 

Hence the reason why our lord and savior Apache Drill is here! At its core it’s basically a SQL engine for big data. It’s a query engine on top of multiple data sources that allows you to query self-describing data using ANSI SQL. It enables you to interact with your data as if it were a table in a SQL-like database. With only the knowledge of SQL you can easily extract your data and load it into a BI tool like Microsoft PowerBI or Tableau and analyse/query it without having to transform the data or more it to a centralized data store. 


To get a better idea of how it works, look at the image above. As you can see it’s basically a universal translator for different data sources. You can easily load your data into your favorite business intelligence tool or expose it with REST. 

It’s easy to get started since most people are already familiar with SQL. Of course they have some additional SQL commands you can view in their documentation. It has Open Database connectivity and Java Database Connectivity interfaces so you can easily connect most BI tools. If you think SQL is for n00bs and prefer using a scripting language such as Python or R, there are modules available to import those languages.

Out of the box

Drill is very versatile, you can query a wide variety of data sources and formats including

Formats:

  • CSV, TSV, PSV or any other delimited data
  • Parquet
  • JSON
  • Avro
  • Hadoop Sequence Files
  • Apache and Nginx server logs
  • Logs files
  • PCAP/PCAP-NG

External Systems:

  • Hbase
  • Hive
  • Kafka (streaming data)
  • MapR-DB
  • MongoDB
  • Open Time Series Database
  • Nearly all relational databases with a JDBC driver
  • Hadoop Distributed File System
  • MapR-FS
  • Amazon Simple Storage Service

Since Drill looks like a relational database to the user, users often expect a database-like performance. Although Drill is very fast and optimized but don’t expect nanosecond response time. Of course you can tune the performance and it heavily depends on data source, function and the amount of data.

Benefits

  • Drill can scale data from a single node to thousands of nodes and query petabytes of data within seconds.
  • Drill supports user defined functions.
  • Drill’s symmetrical architecture and simple installation makes it easy to deploy and operate very large clusters.
  • Drill has flexible data model and extensible architecture.
  • Drill columnar execution model performs SQL processing on complex data without flattening into rows.
  • Supports large datasets

Key Features

  • Drill’s pluggable architecture enables connectivity to multiple datastores.
  • Drill has a distributed execution engine for processing queries. Users can submit requests to any node in the cluster.
  • Drill supports complex/multi-structured data types.
  • Drill uses self-describing data where a schema is specified as a part of the data itself, so no need for centralized schema definitions or management.
  • Flexible deployment options either local node or cluster.
  • Specialized memory management that reduces the amount of main memory that a program uses or references while running and eliminates garbage collections.
  • Decentralized data management.

When to use it?

Apache Drill is mostly used for data analytics. When a lot of databases, files, logs and other datatypes are spread across VM’s, filesystems, databases and more Apache Drill saves the day. It works flawlessly with popular BI tools like Tablaeu, Qlik or PowerBI. But is it worth it?

Apache Drill brings a lot of value when your data is spread across your IT infrastructure, but only when you’re actually doing analytics on a regular basis. I would not recommend using it for the occasional data export. When using Drill in production you need to run it on multiple nodes in distributed mode, if you’re not running Drill on a regular basis you’re basically throwing away your money since it requires a lot of power. In short, use when:

  • I have big datasets 
  • Our data is spread across our IT-landscape
  • I am interested in BI/data science on a regular basis
  • I have enough money to fully setup Apache Drill in distributed mode and take the appropriate security measures.

Learning Curve

Setting up Apache Drill in distributed mode and configuring it properly requires advanced computer knowledge. 

  • Knowledge and experience in setting up a proper VM with user management, security and Drill.
  • Knowledge about Zookeeper and nodes.
  • Knowledge about every database/file you’re going to use since you are going to need credentials

Using Apache Drill is really easy, especially when you’re taking advantage of the open source technology called Dremio (fancy UI). You need to know:

  • Advanced SQL
  • Drill platform knowledge
  • CLI

Overall Apache Drill doesn’t have a steep learning curve. 

Apache Drill Architecture

The image above shows the Apache Drill Architecture core modules. Apache Drill consists of a daemon service called the DrillBit. It is responsible for accepting requests from the client, processing queries and returning results to the client. When executing a query it will go to the SQL parser, this is based on the open source framework Calcite. Afterwards it goes to the Logical Plan which is responsible for determining the most efficient execution plan using a variety of techniques, it also translates a logical plan into a physical plan.The optimizer uses various database optimizations. The physical plan is also called as the execution plan. And finally it goes to the storage engine interface, this represents an interface that is used to interact with the data sources. The plugins are extensible allowing you to write new plugins for any additional data sources.

When using Apache Drill in distributed mode, you have multiple instances of drillbits. We are only using one instance (embedded).

Proof of concept

In this paragraph we will go through an installation in Docker. I am running a Drill container on the cheapest VM from Azure using Ubuntu 18.04 LTS. Don’t forget to open up port :8047 to access the web ui.

Prerequisites:

We will run our apache drill in a docker container.

sudo docker run -i --name drill -p 8047:8047 --detach -t harisekhon/apache-drill

If you ever want to update the container to add restart on failure.

sudo docker update drill --restart=always

We now have Drill running in embedded mode rather than distributed mode. Embedded  mode requires less configuration and it is preferred for testing purpose, hence the reason why we are using it. Distributed mode runs on one or more nodes in a clustered environment. Running a ZooKeeper quorum is required. If you ever going to use Drill in production, you should use distributed mode.

Let’s go into our container and start Apache Dril Localhost

sudo docker exec -it drill /bin/bash
cd /apache-drill/bin
./drill-localhost

Wait a couple of seconds till you see a drill quote. 

Nice! Let’s query our version.

select version from sys.version;

As you can see Drill heavenly uses the optimizer for faster queries. 

Let’s access the webui and see what’s going on there. Go to:  

[yourip]:8047

You will see:

1 – Drillbits, since we are running embedded mode you will only see one drill bit. When running a lot of queries across multiple big data sources you want to carefully monitor your drill bits since it’s such a demanding task.
2 – Query, this is where you can execute your queries. When you successfully ran your query you can click on it to view the results and detailed metrics.
3 – Profiles, this shows your completed queries. You can click on them to view the results again.
4 – Storage, here you can enable and update storage plugins like Mongo,S3 or you can add a new one. 
5 – Metrics contains very detailed metrics about your running system
6 – Threads got an auto-refresh function and shows logs about the running threads.
7 – Logs, you can view your logs here. Especially handy when something crashes. 

Enough explanation, lets start using Apache Drill and see what it’s able to do. Since it’s too much of a hassle to setup multiple databases with relevant data we are going to make use of some cryptocurrency data. Visit https://www.cryptodatadownload.com/ and copy a link, I chose kraken btc/usb hourly. You can also copy the data below.

2018-10-03,BTCUSD,6498.46,6507.95,6398.96,6439.99,5502.61,35471949.91
2018-10-02,BTCUSD,6572.83,6591.37,6452,6498.46,5283.88,34517949.11
2018-10-01,BTCUSD,6605,6639.33,6479.03,6572.83,6869.97,45113832.24
2018-09-30,BTCUSD,6582,6640.37,6523,6605,3604.16,23739483.01
2018-09-29,BTCUSD,6620.01,6620.01,6452.02,6582,4164.89,27236990.33
2018-09-28,BTCUSD,6680.01,6786.73,6528.13,6620.01,8630.93,57429631.92
2018-09-27,BTCUSD,6455.66,6734.78,6430,6680.01,7629.43,50049900.06
2018-09-26,BTCUSD,6436.89,6537.26,6380.62,6455.66,5767.65,37350373.43
2018-09-25,BTCUSD,6582.09,6582.1,6320.5,6436.89,8918.53,57200568.82
2018-09-24,BTCUSD,6696.99,6716.16,6555.99,6582.09,5245.81,34758698.5
2018-09-23,BTCUSD,6707.33,6777,6660.05,6696.99,3403.22,22806982.88
2018-09-22,BTCUSD,6750,6823,6628.01,6707.33,5114.24,34363292.87
2018-09-21,BTCUSD,6493.11,6777.1,6492.5,6750,10972.56,73208617.45
2018-09-20,BTCUSD,6386.94,6530,6337,6493.11,6573.75,42236058.52
2018-09-19,BTCUSD,6335.7,6511,6106.37,6386.94,9736.67,61712813.0
2018-09-18,BTCUSD,6250.7,6384.04,6228,6335.7,6009.83,37901046.74
2018-09-17,BTCUSD,6498,6529.75,6207.55,6250.7,8901.88,56553438.08
2018-09-16,BTCUSD,6518.68,6521.75,6324.9,6498,3191.64,20649801.91
2018-09-15,BTCUSD,6478.04,6565,6470,6518.68,3344.58,21811302.81
2018-09-14,BTCUSD,6485.99,6580,6378.05,6478.04,7312.68,47509635.88
2018-09-13,BTCUSD,6326,6530,6323.65,6485.99,9327.54,60165545.77
2018-09-12,BTCUSD,6282.53,6340.94,6194.88,6326,7259.91,45544454.52
2018-09-11,BTCUSD,6299.99,6405,6176.76,6282.53,7223.25,45346177.73
2018-09-10,BTCUSD,6235.01,6375.13,6221.49,6299.99,6841.81,43020287.34

Lets copy this data and go back into our Docker container.

cd apache-drill/sample-data
vi crypto.csv
*paste your data and save*

Now we have a file called crypto with some CSV data. Let’s test it!

You can either go to the web interface and execute your queries there. Personally I’m going to use the CLI. First make sure you’re running /apache-drill/bin/drill-localhost

Running our first query and you will see our data. 

select * from dfs.`/apache-drill/sample-data/crypto.csv`;

Well that doesn’t look too good.

As you can see there are 9 columns defined in our CSV file. Let’s split them.

SELECT columns[0],columns[1],columns[2],columns[3],columns[4],columns[5],columns[6],columns[7] FROM dfs.`/apache-drill/sample-data/crypto.csv`;

That looks way better already! But we still need to give the column a name. Remember this is literally SQL, a simple “as” statement is enough.

SELECT columns[0] as `Date`,columns[1] as `Symbol`,columns[2] as `Open`,columns[3] as `High`,columns[4] as `Low`,columns[5] as `Close`,columns[6] as `Volume From`,columns[7] as `Volume To` FROM dfs.`/apache-drill/sample-data/crypto.csv`;

This is exactly how we want our table to be shown. Now we want to save our table with our amazing markup to Apache Drill’s storage. First we need to change our database or schema. If you execute the codeblock below (the create table) you will get an error, you cannot write to dfs.root. That’s why we are going to change our schema by typing:

use tmp;

If you want to change the settings regarding these storage plugins, go to the Web UI and click storage. Click DFS update.

This is where you can change settings, update and make new storage plugins. For this proof-of-concept we are going to use the default tmp folder. Now let’s execute the query below. It will create a table named “BTCUSD” and uses our fancy formatting command and data from our .csv file.

create table BTCUSD as SELECT columns[0] as `Date`,columns[1] as `Symbol`,columns[2] as `Open`,columns[3] as `High`,columns[4] as `Low`,columns[5] as `Close`,columns[6] as `Volume From`,columns[7] as `Volume To` FROM dfs.`/apache-drill/sample-data/crypto.csv`;

If everything went correctly you should see:

Now let’s use basic SQL table syntax.

select * from BTCUSD;

Maybe you’re thinking “Well, this is just some stupid csv file, what’s so special about this?!”. When adding a data source, the process is exactly the same. You use the Drill SQL syntax to get your MongoDB data (or something else) visualized in a table. And afterwards you can join data across multiple sources! Even with something like InfluxDB you can export a lot of data to CSV and load it in Apache Drill. 

MongoDB

The example with a CSV file was just a warm up, let’s spin up a MongoDB instance (a NoSQL database) with sample data. We are going to make a folder on our VM and pull the standard MongoDB data from the interwebs. Afterwards we are going to run a standard MongoDB and bindmount our datafolder to a folder in the mongo container directory. We use the mongoimport functionality to make a database with a collection of mongodb test data.

Execute the following command in the VM. 

mkdir sample-data
cd sample-data
wget http://media.mongodb.org/zips.json -O zips.json
sudo docker run --name mongodb -p 27017:27017  -v ~/sample-data:/data/sample-data -d mongo
mongoimport -d sample -c zips --file /data/sample-data/zips.json

To verify that you have successfully loaded your data.

mongouse 
sampledb.zips.find()

It works! Don’t forget to open up your port to get access. Yes this is dangerous and insecure, but for this demo it’s okay. If you want to do it the secure way; setup a Docker network. We are not doing this because I can write an entire blogpost about Docker networks.

Let’s to back to our Apache Drill web interface and click on “Storage”.

Click on update and edit the connection string and the attribute enabled. Afterwards click on update.

To verify that it’s actually working go to Apache Drill Query and use the following command.

show databases;

You should get databases like mongo.admin and of course our sample collection.

use mongo.sample;
ALTER SYSTEM SET `store.mongo.read_numbers_as_double` = true;
select * from zips limit 10;
SELECT city, avg(pop) FROM zips GROUP BY city LIMIT 10;

As you can see it’s working like a charm. As you can see we have now successfully loaded a MongoDB database and a .csv file. Both are read as tables so you can easily query them and make complex joins. Just treat your data like regular SQL tables and everything will be fine.

If you want to know everything there is to know read the book “Learning Apache Drill” by Paul Rogers and Charles Givre. It’s a great book and covers everything Drill has to offer.

Sources

https://drill.apache.org/
Learning Apache Drill by Paul Rogers and Charles Givre 
Figure one: https://www.thegalleria.eu/apache-drill-architecture-the-ultimate-guide-mapr.html
https://drill.apache.org/docs/mongodb-storage-plugin/
https://drill.apache.org/docs/develop-custom-functions-introduction/
https://docs.mongodb.com/

About Author

Passionate nerd who wants to know everything about technology, besides that I try to pick up as many skills as possible besides my passion for IT.

2 Comments

  1. It is probably a combination of IP address, port, username and password. All stuff the author wants to keep private!

  2. Hi,
    I followed this blog but I couldn’t mongo.sample in the listing of command ‘show databases’.
    In the configuration, what is that blurred part? What to enter over there?

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.