In our daily work we are dealing with data from many sources. Data in CSV files, from Cloud APIs, in mail servers, configuration files, Terraform plans, in logging systems, source code repositories and many more. Different formats, access methods, tools. And retrieving data from one such source can be challenging enough – combining data from different sources for enrichment, filtering, aggregation and consolidation is really a big effort.
Steampipe (first released in early 2021) changes this. Steampipe turns dozens of sources of such data into SQL views that can be queried using good old – well, actually blazingly modern – SQL select statements. Reading, filtering, sorting, converting and aggregating data from individual sources becomes much easier (“how many compute instances on Azure are currently in status Stopped?”, “show all entries in this Kubernetes yaml file that apply to Pods based on a MySQL container image”). And to take it to the next level entirely: data from different sources can be joined together in SQL queries, making it possible to enrich, compare, consolidate across wildly different sources. Count all storage buckets across AWS, Azure and Google Cloud Platform in a single query.
Let’s look at a simple example to start with.
A local csv file contains data on books. Each row contains fields that represent the title, author, publisher and genre of the book. To get a list (or JSON or CSV file) with the title and author for all books in the genre economics we can ask Steampipe to execute a simple SQL Select statement.
The Steampipe CSV plugin was installed, the query is executed from the command line in the directory that contains the books.csv file.
With two different CSV files, we can join the data. Here files books.csv and genres.csv are used. The latter provides additional information about the genres, and in a single straightforward query, records are produced with data from both CSV files (that are joined using the value for genre).
Retrieving data from JSON files is a tiny bit harder. Here we leverage the JSON capabilities in PostgreSQL. For example: file books.json contains a JSON object with a single property books that in turn consists of an array of book objects. This example shows how to present a “relational” result with book records. Note: json_file is the reference to the Steampipe plugin for JSON documents; all local files with extension .json are available from json_file, Column content contains the JSON payload and column path gives the fully qualified file name.
(the big trick here: how to retrieve individual “records” for each of the elements in the books array in the JSON document; json_path_query performs that magic)
Data from the books.csv and books.json file can easily be combined – with UNION and MINUS and with joins. The WITH construction allows us to define inline views for each of the data sources and subsequently build the join using only these views that hide the underlying source details
The final query is as simple as this one select statement:
The tagline of Steampipe reads “select * from cloud”.
This represents one valuable application of Steampipe: query meta data regarding cloud resources, in AWS, Azure, Oracle Cloud Infrastructure, IBM Cloud, Alibaba Cloud and Google Cloud Platform. In a simple SQL select, we can ask for details on virtual networks, VMs, databases and other aspects of the cloud environment. Steampipe extracts from our SQL questions the corresponding API calls to be made, makes them, gathers the JSON responses and makes them available in a relational structure.
Here an example with Oracle Cloud. The query at the bottom asks for compute instances in a specific compartment named “go-on-oci”. Steampipe makes corresponding calls to the OCI APIs, performs the transformation and subsequent filtering/sorting/conversion/aggregation specified in the SQL statement. The result is produced in one of various formats.
The essence: a simple SQL query returns information that is far less available without Steampipe through CLIs, SDKs, consoles and REST APIs. This SQL query can span multiple (types of) resources in one cloud environment and even join this data with from other sources, either on or off the cloud.
How does this work under the hood?
Steampipe leverages PostgreSQL and the Foreign Data Wrapper mechanism. The plugins installed with Steampipe for various data sources each register one or more database views in an embedded PostgreSQL database. Each view is populated with data through the Foreign Data Wrapper that invokes the plugin with the values provided for key parameters. The plugin performs the required actions to retrieve the requested data based on type of data source, the connection configured and the parameters passed in; this can mean calling a REST API, reading a file, invoking local platform API or inspecting operating system status.
The plugin returns the data in a relational structure as view query result and subsequently the PostgreSQL SQL engine does what it always does to process a SQL query. Note: the filter condition in the where clause (except for filtering on key properties) and the group by aggregation are performed in PostgreSQL. That means that all data is first retrieved from the source system to the embedded database. Steampipe data sets are cached – by default for 5 minutes (this cache timeout is configurable).
Each plugin corresponds to a database schema. Each individual resource type is exposed as a database view.
The Steampipe PostgreSQL database can be accessed directly from any tool that can interact with PostgreSQL That means you can benefit from Steampipes data retrieval capabilities in any SQL enabled tool you fancy.
Custom Steampipe plugins can be created and easily be integrated; plugins are programmed in Go. An overview of all plugins can be found at: https://hub.steampipe.io/plugins. At the time of writing, 73 plugins have been published. For a tool that is not much more than one year old, that is an impressive score. And likely to continue to increase.
In addition to only querying data into text, Steampipe can run dashboards that consolidate and visualize results from Steampipe queries. Dashboards are defined in HCL (language of Terraform). Per dashboard element, you can define title, SQL query, type of widget, width. 100s of predefined dashboards are available (see https://hub.steampipe.io/mods?objectives=dashboard).
Data can be downloaded from Dashboard as JSON. The SQL queries used in a dashboard can easily be copied from Dashboard – and pasted elsewhere. Note that queries can be parametrized – with values coming from
user input UI elements – text or select widgets (that are populated from SQL executed by Steampipe).
Steampipe also offers Benchmarks & Controls: a generic mechanism for defining and running control frameworks such as CIS, NIST, HIPAA, etc, as well as your own customized groups of controls. A simplified, consistent mechanism for defining, running, and returning output from disparate benchmark and control frameworks. It checks in the SQL results for violations of rules & recommendations, on security, cost control.
Code Sources for this article on GitHub – https://github.com/lucasjellema/steampipe-codecafe-may2022
website for Steampipe.io- https://steampipe.io/
An overview of all plugins can be found at: https://hub.steampipe.io/plugins.
Community Built Checks and Controls for Compliance, Security and Cost – https://hub.steampipe.io/mods?objectives=compliance,cost,security
My slides for the Conclusion Code Cafe on Steampipe – https://www.slideshare.net/lucasjellema/steampipe-use-sql-to-retrieve-data-from-cloud-platforms-and-files-code-caf-may-2022