Hey folks, sorry I missed the last couple of weeks (was out on vacation or busy at work). However, I'm back just in time for the holidays with another edition of "WMMH" in the software world.
The theme for this week is data analytics. I've been working on a metrics project at work and have been pleasantly surprised at how much better the tooling has become since I first started data engineering. While there is a breadth of tools to be thankful for, I wanted to highlight three in particular that are making me happy:
Grafana is a dashboard/visualization tool that's feature-rich and easy to install. The tool is widely used by engineers; in my experience, though, it has traditionally been employed for monitoring infrastructure metrics. That doesn't mean it can't be used to track your business's KPIs. Over the last few weeks, that's exactly what we've been doing at Peachjar.
What I really love about Grafana is how capable the open source version is (unlike competitors where the features are locked in the enterprise version). Some of the best features include:
- Google Auth
- Persistence of dashboards and preferences with an external DB (none of that HSQL or SQLite crap for the FOSS version). This means you can redeploy without fear of losing data or scale out the number of instances.
- Configurable session storage (allowing you to have multiple instances behind a load balancer)
- A plethora of data source plugins
- An alert system integrated with Slack
- Highly configurable, whether by config file or environment variables
- Playlists! (a set of dashboards in rotation)
I was able to get a custom deployment of Grafana via Helm/Kubernetes done in only a couple of hours. We simply extended the official Dockerfile with the configuration we needed, passed in any dynamic values via environment variables (sourced from Kubernetes secrets), and tied it to an internal instance of Postgres. Needless to say, everyone is very excited to use this product.
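For the environment-variable side of that setup, Grafana lets you override any config file setting with a variable named `GF_<SECTION>_<KEY>`. Here's a minimal sketch of building those names in Python. The helper name `grafana_env`, the hostname, and the database name are made up for illustration; this is not our actual deployment config.

```python
def grafana_env(section, key):
    """Return the env var name that overrides `key` in `[section]` of grafana.ini."""
    # Grafana's convention: GF_<SECTION>_<KEY>, upper-cased,
    # with "." and "-" replaced by "_".
    def norm(part):
        return part.upper().replace(".", "_").replace("-", "_")
    return "GF_{}_{}".format(norm(section), norm(key))


# Hypothetical values; a real password would come from a Kubernetes secret,
# never from source code.
overrides = {
    grafana_env("database", "type"): "postgres",
    grafana_env("database", "host"): "postgres.internal:5432",
    grafana_env("database", "name"): "grafana",
    grafana_env("auth.google", "enabled"): "true",
}

for name, value in overrides.items():
    print("{}={}".format(name, value))
```

Anything you don't override this way falls back to whatever is in the config file baked into the image.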
Timescale is not actually a database, but rather an extension that adds improved time series functionality to Postgres. It does this by introducing a new data structure called a "hypertable" that indexes and queries timestamped data more efficiently. The killer feature of Timescale is that a hypertable works basically like a Postgres table -- this means you can join data from other tables and use any other Postgres feature in conjunction with the extension. This is a huge win for modeling time series; bespoke time series data stores tend to lack the rich query support of an RDBMS. With Timescale, you get the best of both worlds.
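To give a feel for the "works like a Postgres table" point, here's a rough sketch of the SQL involved, wrapped in Python so it's easy to hand to whatever Postgres driver you use. The `page_views` and `schools` tables are hypothetical; `create_hypertable` and `time_bucket` are the Timescale functions, and everything else is plain Postgres.

```python
# Hypothetical schema: a "page_views" hypertable joined to an ordinary "schools" table.
CREATE_TABLE = """
CREATE TABLE page_views (
    time      TIMESTAMPTZ NOT NULL,
    school_id INTEGER     NOT NULL,
    views     INTEGER     NOT NULL
);
"""

# create_hypertable() converts a regular table into a hypertable
# partitioned on the given time column.
MAKE_HYPERTABLE = "SELECT create_hypertable('page_views', 'time');"

# time_bucket() groups rows into fixed-width time intervals;
# the JOIN against a normal table is just regular Postgres.
HOURLY_VIEWS = """
SELECT time_bucket('1 hour', pv.time) AS bucket,
       s.name,
       sum(pv.views) AS views
FROM page_views pv
JOIN schools s ON s.id = pv.school_id
GROUP BY bucket, s.name
ORDER BY bucket;
"""


def setup_statements():
    """Statements to run once, in order, when creating the hypertable."""
    return [CREATE_TABLE.strip(), MAKE_HYPERTABLE]


if __name__ == "__main__":
    for stmt in setup_statements() + [HOURLY_VIEWS.strip()]:
        print(stmt)
        print()
```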
And there's more, folks! The Postgres data source plugin for Grafana was written by the Timescale crew. When you add a Postgres data source (that has Timescale installed), you can check a box in Grafana designating the data source as Timescale-compliant. This lets Grafana use Timescale's language extensions, boosting the performance of your queries!
There are a ton of stream processing tools and frameworks out there. If you are not working full time in the space, it's often hard to evaluate the merits of each when selecting a platform (should I choose Spark? Flink? Samza?). This is why Apache Beam is so exciting. Apache Beam is a stream processing abstraction that allows you to write portable stream jobs compatible with the top six frameworks on the market.
Beam comes with its own simple yet powerful programming model, which generates the bindings needed to launch jobs on platforms like Spark and Flink. While Apache Beam is an open source project, it's also the API for Google Cloud Dataflow. So if you are using GCP, you have the option of launching jobs on hosted infrastructure.
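If the "write once, run on many engines" idea sounds abstract, here's a toy sketch of it in plain Python. To be clear, this is NOT Beam's API; the `Pipeline`, `EagerRunner`, and `LazyRunner` classes are invented for illustration. The point is only that the job definition knows nothing about how it gets executed.

```python
class Pipeline:
    """A job is just an ordered list of steps; it contains no execution logic."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, fn):
        self.steps.append(("filter", fn))
        return self


class EagerRunner:
    """Runs each step over the whole dataset before moving to the next step."""

    def run(self, pipeline, data):
        items = list(data)
        for kind, fn in pipeline.steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items


class LazyRunner:
    """Pushes elements through all steps one at a time using iterators."""

    def run(self, pipeline, data):
        stream = iter(data)
        for kind, fn in pipeline.steps:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return list(stream)


# The job is defined once...
job = Pipeline().map(str.strip).filter(bool).map(str.lower)
events = [" Grafana ", "", "BEAM"]

# ...and handed to whichever runner you like.
print(EagerRunner().run(job, events))
print(LazyRunner().run(job, events))
```

In real Beam, the runners are engines like Spark, Flink, or Dataflow rather than two loops, but the separation between job and runner is the same idea.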
Thoughtworks Technology Radar - Q4 2018
Finally, we have a new edition of Technology Radar. If you are unfamiliar with Tech Radar, it's a publication of recommended software engineering technologies and practices. Keep in mind, Thoughtworks makes very conservative recommendations -- so if you see a recommendation to "adopt", you probably should have been doing it for 3 years now!
In this edition, there are a couple of interesting items of note:
- Adopt -- Event Storming: I wholeheartedly agree with this notion. Event Storming is probably the best way to explore your business domain and is great at helping engineers define the commands, queries, and events therein.
- Trial -- Crypto Shredding: the practice of encrypting PII with separate keys (keys stored in a separate table or database) and destroying the key when you need to get rid of the PII. Instead of having to remove records in databases, logs, Kafka, etc., you can simply "forget the key". The data left in persistent storage is effectively lost by virtue of being undecipherable.
- Trial -- TypeScript: TS continues to win mindshare amongst Web and Node.js developers. The latest versions of the language bring even richer support to the type system, making developers' lives easier. TS also has fantastic tooling; I use both VS Code and IntelliJ IDEA and they feature integrated type checking, auto-completion, and more.
- Assess -- Debezium: this is a tool that watches databases and pushes changes to Kafka. If you are working with a legacy product, this is a fantastic way to synchronize a new system from changes in the old.
- Assess -- Apache Beam: hey, looks like I'm not crazy!
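To make the crypto shredding item above a bit more concrete, here's a minimal sketch in Python. Everything in it is hypothetical (the `KeyStore` class, the `log` list standing in for your databases and Kafka topics), and the cipher is a toy XOR keystream used only to keep the example dependency-free. Do not use it for real data; a real system would use a vetted cipher such as AES-GCM from a proper crypto library.

```python
import hashlib
import secrets


def _keystream(key, length):
    # TOY keystream (SHA-256 over key + counter). NOT secure; it stands in
    # for a real cipher so the example runs on the standard library alone.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor_cipher(key, data):
    """Encrypts or decrypts (XOR is symmetric) `data` with the toy keystream."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


class KeyStore:
    """Per-user keys, kept apart from the data they protect."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get(self, user_id):
        return self._keys.get(user_id)

    def shred(self, user_id):
        # "Forget the key": the ciphertext stays where it is, but is now unreadable.
        self._keys.pop(user_id, None)


keys = KeyStore()
log = []  # append-only store standing in for databases, logs, Kafka, etc.


def record(user_id, pii):
    log.append((user_id, xor_cipher(keys.key_for(user_id), pii.encode())))


def read(user_id):
    key = keys.get(user_id)
    if key is None:
        return []  # key was shredded (or never existed); nothing can be decrypted
    return [xor_cipher(key, blob).decode() for uid, blob in log if uid == user_id]


record("u1", "alice@example.com")
record("u2", "bob@example.com")
keys.shred("u1")
print(read("u1"))  # u1's records are still in `log`, but no longer decipherable
print(read("u2"))
```

Note that nothing was ever deleted from `log`; only the key went away.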
There are many other great things in the report, so check it out.
That's it for this week. I'll try to keep these coming weekly, but I may not be able to keep up over the holidays. If you have any comments, reach out to me on Twitter or LinkedIn.