Apache Superset is an open source end user tool for analyzing and visualizing data. Dozens of chart types are available out of the box, dashboards can be created, finetuned and shared/published. Superset can work with virtually any SQL data source. (I have recently written an introduction to Superset and created a Gitpod workspace for trying it out).
Of course there is a lot of data that is not in SQL data sources. All cloud resource information about for example our AWS EC2 instances or the AKS nodes or the GCP Cloud Functions or Oracle Cloud Autonomous Databases. And data exposed in RSS feeds, secret vaults, GitHub, LinkedIn, Twitter, in script files such as Terraform and Wikis. This data cannot be accessed from Superset. Except that it can! By combining Superset with Steampipe we can bring data from hundreds of non-SQL data sources to Superset.
Steampipe is a great technology that turns data from virtually any source into a relational view that can accessed like any truly relational data set. Steampipe leverages the PostgreSQL and its Foreign Data Wrapper mechanism that makes it possible to turn a SQL query into what read action or API call is required to fetch data from the remote, non-relational data source and pass the resulting data to the SQL engine as to make it “relationally processable”. I have written and presented about Steampipe in an earlier article and presentation: Steampipe–analyze data from cloud, file, platform, IaC using SQL queries (may 2022) and SlideShare Steampipe – use SQL to retrieve data from cloud, platforms and files (Conclusion Code Café May 2022).
Superset connects to the PostgreSQL database spun up by Steampipe and defines data sets on top of views that in turn link to these external data sources.
By combining Superset with Steampipe and installing the required plugins into Steampipe, we can analyze and visualize all data in Superset that Steampipe can unlock.
The remainder of this article introduces a Gitpod workspace definition into which both Superset and Steampipe have been installed, along with the Steampipe RSS plugin. I will describe how this combination allows us to define a Superset data set for the RSS feed for the AMIS Technology Blog that we can then use to analyze and visualize. Of course you can extend the workspace with additional plugins and additional external data source to create very rich and advanced dashboards.
You can get going by simply accessing this URL https://gitpod.io/#https://github.com/lucasjellema/gitpod-superset-steampipe-postgresql that takes you right into the Gitpod workspace. After half a minute or so, Superset will be running (using docker-compose and six containers) as well as Steampipe and the PostgreSQL database.
Gitpod Workspace with Apache Superset and Steampipe
Once the workspace is up and running, the following components will have been installed:
- Apache Superset (six Docker containers – coordinated by Docker Compose)
- PostgreSQL (metadata store for SuperSet)
- Steampipe (including RSS plugin)
This picture shows the contents of the workspace:
Once the workspace has fully started and Steampipe is available, you can execute batch-mode (non-interactive) queries like this one:
steampipe query “select title, link, description from rss_channel where feed_link = ‘https://technology.amis.nl/feed/'”
to query details about the RSS feed published for the AMIS Technology Blog. This next query will list title, publication timestamp and URL for the most recent articles on the AMIS blog (latest first):
`steampipe query “
feed_link = ‘https://technology.amis.nl/feed/’
Other properties available in the RSS_ITEM “view” include author_name and email, image_title and image_url, categories, description and content. Let’s see if we can inspect these properties in Superset.
In order to look at data from Steampipe in Superset, we first need to add a database connection in Superset to Steampipe’s PostgreSQL database.
Superset – and Connection to Steampipe
Open port 8088 to enter the Superset web UI.
Login with user admin and password admin.
Click on the plus icon. In the dropdown list select *Data*. In the child menu, click on *Connect database*.
The Database Connection configuration wizard appears. Click on the PostgreSQL tile.
To configure the connection to the PostgreSQL database that Steampipe is running on, enter the following values:
- Host: 172.18.0.1
- Port: 9193
- database name: steampipe
- user: steampipe
- password: check in terminal window where steampipe was installed (or run: `steampipe service status –show-password `)
- Display Name: anything you like, for example *Steampipe*
The password created for the *steampipe* user can be seen in the terminal window labeled *Install Steampipe with RSS Plugin*:
Click on the Connect button.
A window appears that indicates that the database has been connected. Press the Finish button.
Define Dataset and Chart
Define the Dataset rss_item based on the “table” *rss_item* in the *rss* schema in the *Steampipe* database. Note: *rss_item* is not actually a table, but it is interpreted by Superset as one. For the purposes of what we are doing that is completely fine.
We have not yet specified from which RSS Feed this data set should draw its data. We will define that through a filter on the data set that is defined as part of the chart definition. Alternatively we can define a View in the PostgreSQL database that selects from RSS_ITEM and contains the filter on feed_link.
Click on the button “Create Dataset and Create Chart”.
The Chart editor is opened with the RSS_ITEM dataset already selected.
In the next screenshot, you see how the columns title, author_name and published are dragged to the *columns* box. Note that the toggle *Raw Records* is set. And very importantly: the filter condition has been specified: feed_link = https://technology.amis.nl/feed/
Click on “Update Chart” to see the data that is the result of the selection and filtering specified.
Now let’s turn it into a proper chart – one that shows us the number of articles published over the last few weeks.
Set the chart type to *Time Series Line*. Make sure the Time Granularity is set to Week (and the Time Column should be *published* ). The aggregate operator for the *title* metric should be *count*.
A little editing of the chart – to set titles on the axes – is easily done:
Analyze the BCC News Top Stories Newsfeed
If we want to work with data from a different RSS feed, we only have to modify the filter condition for the feed_link column. Here we switch from the AMIS Technology Blog feed to the feed that publishes the most recent BBC News Top Stories: http://feeds.bbci.co.uk/news/rss.xml
This example obviously is only the scratching the surface. So many data sources can be unlocked in Superset through Steampipe. And with PostgreSQL views that join multiple Steampipe sources together we can combine, enrich, filter and otherwise prepare data in countless ways.
Steampipe Homepage – https://steampipe.io/
Steampipe RSS Plugin – https://hub.steampipe.io/plugins/turbot/rss
My article: Steampipe–analyze data from cloud, file, platform, IaC using SQL queries (may 2022)
My presentation on SlideShare Steampipe – use SQL to retrieve data from cloud, platforms and files (Conclusion Code Café May 2022).
Apache Superset Homepage https://superset.apache.org/
Apache Superset GitHub – https://github.com/apache/superset