In this short blog post I explain when to use a time series database and why InfluxDB is a good choice for it.
First, though, I need to explain what time series data is.
Time series data is data where the time aspect is the most important characteristic. This means every data record contains a timestamp. Time series doesn’t mean the data is always ingested sequentially, and the intervals between records can be regular or irregular.
Examples of time series data are:
- Sensor data from IoT sensors.
- Metrics from monitored IT systems.
- Click data from users on websites.
Basically, time series data contains, next to the timestamp, a key (a string) and one or more values (usually numbers, but strings are possible too).
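To make this structure concrete, here is a minimal sketch of a single data record in Python. The class name `DataPoint`, the key `sensor-42` and the field names are made up for illustration and do not come from any particular database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Union

# A value is usually a number, but can also be a string.
Value = Union[int, float, str]


@dataclass
class DataPoint:
    """Hypothetical model of one time series record."""
    timestamp: datetime          # every record carries a timestamp
    key: str                     # identifies the source, e.g. a sensor
    values: Dict[str, Value]     # one or more named values


point = DataPoint(
    timestamp=datetime(2019, 1, 15, 12, 0, 0, tzinfo=timezone.utc),
    key="sensor-42",
    values={"temperature": 21.5, "status": "ok"},
)
```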
Another characteristic of time series data is that usually each data point is new or is interpreted as new. This reflects the data source characteristic: sensor data from IoT sensors or performance metrics don’t need an update.
When the volume of data is very high and we also need to query specific time periods, then regular databases, whether relational or even NoSQL, can’t keep up with this pace. A new type of database is needed: a time series database.
Time series database characteristics
Time series databases don’t have the concept of prescriptive table definitions. The closest comparison is a NoSQL database (https://en.wikipedia.org/wiki/NoSQL). The big difference is that each record in a time series database has a timestamp, as explained in the previous paragraph. The concept of a primary key also applies (some user-defined fields plus a timestamp).
For all other fields the column definition is determined by the field type of the first written data record!
The database will check this and will reject the record when this is violated.
When new fields are added later (by just sending new data containing these new fields), then these fields are added automatically and will be available as null values for all older data.
So there is a (sort of) concept of a table definition but it is dynamic and you can only add new fields!
The primary key fields are fixed and need to be designed well!
Data writes are handled as upserts, meaning that existing data (identified by the primary key) is always overwritten. There is no concept of foreign keys, so referential integrity checks are not needed; the only check is on field type (for example, you can’t insert a string into a float field).
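The rules above (field type fixed by the first write, new fields added on the fly, writes handled as upserts) can be sketched with a toy in-memory store. This is only an illustration: `TinySeriesStore` is a hypothetical class, not how any real time series database is implemented, and it assumes an upsert replaces the whole record, whereas a real database may merge field by field.

```python
from typing import Dict, Optional, Tuple, Union

Value = Union[int, float, str]


class TinySeriesStore:
    """Toy model of a dynamic schema with upsert semantics (illustrative only)."""

    def __init__(self) -> None:
        self.field_types: Dict[str, type] = {}
        self.rows: Dict[Tuple[str, int], Dict[str, Value]] = {}

    def write(self, key: str, timestamp: int, fields: Dict[str, Value]) -> None:
        # Check every field first, so a rejected record changes nothing.
        for name, value in fields.items():
            expected = self.field_types.get(name)
            if expected is not None and type(value) is not expected:
                raise TypeError(
                    f"field {name!r} is {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        # The first write of a field determines its type; new fields are simply added.
        for name, value in fields.items():
            self.field_types.setdefault(name, type(value))
        # Upsert: the primary key is (key, timestamp); an existing record is overwritten.
        self.rows[(key, timestamp)] = dict(fields)

    def read(self, key: str, timestamp: int) -> Dict[str, Optional[Value]]:
        # Fields that were added later show up as None (null) for older records.
        row = self.rows[(key, timestamp)]
        return {name: row.get(name) for name in self.field_types}


store = TinySeriesStore()
store.write("sensor-42", 1000, {"temperature": 21.5})
store.write("sensor-42", 2000, {"temperature": 22.0, "humidity": 40.0})
old_row = store.read("sensor-42", 1000)   # humidity is None for the older record
store.write("sensor-42", 1000, {"temperature": 19.0})  # overwrites, no second row
```

Writing `{"temperature": "hot"}` to this store raises a `TypeError`, because the first write fixed `temperature` as a float.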
The database needs to be able to handle ACID well:
| ACID property | Time series database capability |
| --- | --- |
| Atomicity | For time series databases this is less complicated to achieve because there is only one kind of transaction, usually an insert (a single record or a batch). |
| Consistency | Consistency checking for time series databases is easy because they don’t have foreign keys or constraints, only a primary key. When the primary key value is the same, the record is simply overwritten, so basically every write is an upsert. Consistency checking is only needed for field types: you can’t insert a string into a float field. |
| Isolation | Time series databases usually have the concept of nodes for concurrency and availability. Data is always sent to one node, and usually a replication server deals with the data replication between nodes. Most time series databases use caching to solve the isolation issue, so a query will always read the newest data. |
| Durability | Data needs to be saved to storage. Some time series databases use the WAL (Write-Ahead Log) concept for this. A WAL is write-optimized storage that is fast and durable. The database itself periodically updates the time series data files (these are used for queries). |
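The write-ahead-log idea from the durability row can be sketched as follows. This is a simplified illustration under my own assumptions (one JSON record per line, an in-memory cache standing in for the query side, the hypothetical class name `WalStore`); real databases use far more compact and robust formats.

```python
import json
import os
import tempfile
from typing import Dict, Tuple


class WalStore:
    """Toy write-ahead log: append to disk first, then update the cache."""

    def __init__(self, wal_path: str) -> None:
        self.wal_path = wal_path
        self.cache: Dict[Tuple[str, int], dict] = {}
        self._replay()  # recover earlier writes after a restart
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self) -> None:
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as wal:
            for line in wal:
                entry = json.loads(line)
                self.cache[(entry["key"], entry["ts"])] = entry["fields"]

    def write(self, key: str, ts: int, fields: dict) -> None:
        record = json.dumps({"key": key, "ts": ts, "fields": fields})
        self._wal.write(record + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())    # force the append onto storage
        self.cache[(key, ts)] = fields  # only now is the point visible to queries

    def close(self) -> None:
        self._wal.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "points.wal")
    store = WalStore(path)
    store.write("sensor-42", 1000, {"temperature": 21.5})
    store.close()

    # A fresh instance rebuilds its cache purely from the log on disk.
    recovered = WalStore(path)
    recovered_point = recovered.cache[("sensor-42", 1000)]
    recovered.close()
```

Appending to a single file is cheap, which is why the log can be both fast and durable, while the query-optimized data files are rewritten less often in the background.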
The following aspects also need to be taken into account:
- Availability: Does it have the concept of different nodes? Is it resilient (can you kill a node without corrupting data while the database stays operational)? What is the RTO (recovery time objective)?
- Scalability: Can nodes be added or removed? Can this be done in real time?
- Resilience: Can it recover quickly from major incidents?
- Security: Is the data stored encrypted? Are the connections to the database encrypted? Is user management (Access Control List) in place? How fine grained can the ACL be defined?
- Backup and Recovery: Cold or hot backup and recovery? Can it be done incrementally? What is the RPO (recovery point objective)?
Other selection criteria:
- Can it run in a docker container?
- What is the community traction?
- What is the company vision?
- What is the product roadmap?
- Are there bugs that have been open for longer than 6 months?
Why is InfluxDB no. 1?
There are a lot of time series databases available.
The site https://db-engines.com/en/ranking/time+series+dbms provides a good overview and ranking.
Below you can find the ranking per January 2019:
Or the time series trend chart:
As you can see, InfluxDB is the no. 1 time series database (according to db-engines.com).
Which time series database best fits your situation depends on more requirements than this ranking alone. For instance, the fit with your current architecture landscape also drives the decision.
The latest version of InfluxDB is version 1.7. You can find documentation about this here: https://docs.influxdata.com/influxdb/v1.7/
We have implemented this version in production for a couple of our clients, and we were pleasantly surprised by how fast it is.
The nice thing is that InfluxDB has an HTTP endpoint for data ingestion, and you can use an SQL-like query language (InfluxQL) to query the database (also over HTTP if you like).
Our use case for choosing InfluxDB was IoT related (in the industrial area).
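As a rough sketch of what talking to these HTTP endpoints looks like, the Python snippet below builds (but does not send) a write request and a query request for the InfluxDB 1.x `/write` and `/query` endpoints. The host `localhost:8086`, the database name `iot`, the measurement `temperature` and the tag `sensor` are assumptions for illustration, and the formatting helper skips the escaping of spaces and commas that real names may need.

```python
from typing import Dict, List, Union
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8086"  # assumed local InfluxDB 1.x instance
DATABASE = "iot"                    # hypothetical database name

FieldValue = Union[bool, int, float, str]


def format_field(value: FieldValue) -> str:
    """Format one field value the way the line protocol expects it."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"           # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'            # strings are double-quoted


def to_line(measurement: str, tags: Dict[str, str],
            fields: Dict[str, FieldValue], timestamp_s: int) -> str:
    """Build one line: measurement,tags fields timestamp (names assumed escape-free)."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_s}"


def build_write_request(lines: List[str]) -> Request:
    url = f"{BASE_URL}/write?" + urlencode({"db": DATABASE, "precision": "s"})
    return Request(url, data="\n".join(lines).encode("utf-8"), method="POST")


def build_query_request(statement: str) -> Request:
    url = f"{BASE_URL}/query?" + urlencode({"db": DATABASE, "q": statement})
    return Request(url, method="GET")


line = to_line("temperature", {"sensor": "s42"}, {"value": 21.5}, 1547553600)
write_request = build_write_request([line])
query_request = build_query_request(
    'SELECT mean("value") FROM "temperature" '
    "WHERE time > now() - 1h GROUP BY time(10m)"
)
```

Passing either request to `urllib.request.urlopen` would send it to a running server. The requests are only constructed here so the example works without one.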
Below you can find some of our findings for the Open Source edition:
- Very high write and query performance.
- The SQL-like query language (InfluxQL) is nice, but limited. Combining data from different measurements is not possible.
- Define the primary key (which is called “key” in InfluxDB) well so you can make better use of the GROUP BY functionality.
- Save each type of data in its own table (called “measurement” in InfluxDB).
- Real-time aggregation functionality (called Continuous Queries) is really cool.
- Make use of retention policies. These are defined per database, and each write targets one retention policy, so you can keep raw and aggregated data for different periods.
- HTTP endpoint, so it can easily be integrated into our architecture.
- Various SDKs available (Python, JavaScript) for tight integration.
- User access management is only possible at the database level, so you need to implement fine-grained ACLs in the logic on top of this database.
- Backup and recovery can be done on the file system level.
- InfluxDB runs in a docker container so it can be managed well.
- Log files and metrics are available out of the box.
- Data compression is performed automatically, which is nice.
- The community traction is good. It’s a popular database.
- InfluxDB is part of a complete stack that also includes a dashboard (Chronograf), a plugin-driven server agent (Telegraf) and a streaming engine (Kapacitor). Together these are called the TICK stack. For details see: https://www.influxdata.com/time-series-platform/
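To give an idea of what the retention policies and Continuous Queries from the list look like, here is a small Python sketch that builds the corresponding InfluxQL statements for InfluxDB 1.x. All names (`iot`, `raw_30d`, `temperature`, `temperature_1h`, `cq_temperature_1h`) are hypothetical, and the durations are arbitrary examples rather than recommendations.

```python
def create_retention_policy(name: str, database: str, duration: str,
                            replication: int = 1, default: bool = False) -> str:
    """Build a CREATE RETENTION POLICY statement (InfluxQL, InfluxDB 1.x)."""
    statement = (
        f'CREATE RETENTION POLICY "{name}" ON "{database}" '
        f"DURATION {duration} REPLICATION {replication}"
    )
    return statement + (" DEFAULT" if default else "")


def create_hourly_mean_cq(name: str, database: str, source: str,
                          target: str, field: str) -> str:
    """Build a continuous query that stores hourly means of one field."""
    return (
        f'CREATE CONTINUOUS QUERY "{name}" ON "{database}" BEGIN '
        f'SELECT mean("{field}") INTO "{target}" FROM "{source}" '
        f"GROUP BY time(1h), * END"
    )


# Keep raw data for 30 days (hypothetical policy name and duration).
rp_statement = create_retention_policy("raw_30d", "iot", "30d", default=True)

# Downsample raw temperature readings into hourly means, grouped by all tags.
cq_statement = create_hourly_mean_cq(
    "cq_temperature_1h", "iot", "temperature", "temperature_1h", "value"
)
```

The resulting strings can be sent through the query endpoint or pasted into the `influx` command line client.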
InfluxData (the company behind InfluxDB) is currently (January 2019) working on InfluxDB 2.0 (an alpha release is available). This release contains a new architecture with lots of new capabilities.
The strong points mentioned in this blog post, together with the promising InfluxDB roadmap, make this our no. 1 database for time series data.
If you want to know more about how InfluxDB could help your business, please contact me.
For more information about our IoT services see: https://www.conclusionconnect.nl/
The majority of the findings are valid, except for the first one:
> Very high write and query performance.
There are systems with much higher insert and query performance than InfluxDB:
– VictoriaMetrics – https://medium.com/@valyala/insert-benchmarks-with-inch-influxdb-vs-victoriametrics-e31a41ae2893
– ClickHouse – https://medium.com/@AltinityDB/clickhouse-for-time-series-be35342bf31d
Thanks for reading my blog post and thanks for being critical. I know that there are time series databases which are faster than InfluxDB. Speed is important but certainly not the only selection criterion.