Modern Data Engineering practices at Buffer with David Gasquez
May 19, 2021
Hey, I’m David and I work as a Data Engineer at Buffer. My love for tinkering with digital stuff started early in my life, so when the time came to go to university, I applied to Computer Science (it was either that or Philosophy).
It wasn’t fun at all initially. I wanted to do something related to robotics but I wasn’t even sure what that meant. Then, around the 3rd year, I took a course on metaheuristics and my first Machine Learning classes. After that year, I realized that the thing I enjoyed the most was playing with data and algorithms. I became much more interested in the backend/data side of things.
During my last year, I participated in a Kaggle competition organized by Airbnb with the hope of learning more about the hyped “Data Science” world. I learnt a huge amount and ended up in a great position, which gave me a morale boost (and was also a great thing to put on my CV).
Meanwhile I was learning more about Buffer and its culture. One day I saw a comment on a HN thread saying they were looking for a Data Analyst. At that time I was working as a backend engineer for a small company in Granada and decided to apply even though the only thing I knew about SQL was that it was used to select data from databases. After a few interviews, I got in as a Data Analyst.
I started as a Data Analyst right when I joined Buffer. After some time I started to miss the engineering side of things and began to drift slowly into a Data Engineering role. I’ve been doing that for around 4 years now. That said, with a small team like ours, we all have to wear some extra hats!
Buffer is a fully remote company and distributed across the globe. I have a few hours of overlap with my teammates in the afternoon so I have to take advantage of these. I do most of my focused work in the mornings and have most of my syncs in the afternoon while the rest of the Data team is online.
We aim for mostly async communication for most tasks. That means I have around 3 meetings in a given week.
A typical day involves spending some time catching up on what’s going on with other teams, then some focused time on a project. My day usually ends either with a meeting or with some writing.
Data at Buffer
I would say Buffer’s data architecture is very much what is being called the “Modern Data Stack”. That means we have some tools that write data to the warehouse, then some tools to model the data, and some extra tools to visualize and analyze the data.
In our case, these are the tools we’re currently using:
There are 4 of us. Buffer’s data team consists of 2 Data Scientists semi-embedded in the Marketing and Product teams, one Data Engineer, and a Data Architect who also acts as the team lead.
As a small team managing a huge number of datasets, pipelines, and resources, one of the hardest problems we’ve had to solve repeatedly is figuring out the best way to scale ourselves. Even though it is not a pure data problem, iterating on our tools and automating as many things as we can is something we’ve had to do a few times in the past and will need to keep doing in the future.
For example, at some point we had some custom pipelines running that processed and stored all production data in our data warehouse. Keeping these pipelines working reliably was tricky with the resources we were spending there. I’d love to say that we did some super smart trick that made the pipelines much faster and more reliable, but that wasn’t what happened. We decided that developing and especially maintaining custom ETLs wasn’t worth it for us at that point, so we started using managed services for these kinds of data pipelines. That was our super smart trick. We had to increase the data team’s budget a bit, but we got a lot of extra time in our weeks to work on the next layer of the data pyramid. In data (as almost everywhere), there are no solutions, only trade-offs!
Around the same time, we started adopting dbt, BigQuery, Segment, and other tools that moved us closer to the “Modern Data Stack”. We now had more time to focus on more important things and new data problems to solve.
Another good example of a problem we managed to solve early, with a solution that has been working really well for us, is having a declarative tracking plan. Inspired by dbt and other declarative tools, we developed a tool that helps us make our Segment tracking plan easier to audit and edit.
It takes care of reading a set of YAML files that contain all the event definitions (properties, documentation, types, ...) and syncs them back to Segment. There is no need to do anything from the Segment UI and we can enforce types and some properties at tracking time. It also gives us history, PR-style collaboration, and a way to know what’s currently being tracked and who is making changes where!
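To make the idea concrete, here is a minimal sketch of what enforcing a declarative plan at tracking time could look like. It is not Buffer’s actual tool: the dictionary stands in for what one YAML event definition might look like once parsed, and the event name, property names, and the validate_event helper are all hypothetical.

```python
# Hypothetical tracking plan, shaped like a parsed YAML event definition.
TRACKING_PLAN = {
    "Post Scheduled": {
        "description": "A user scheduled a post.",
        "properties": {
            "channel": {"type": str, "required": True},
            "post_id": {"type": str, "required": True},
            "media_count": {"type": int, "required": False},
        },
    },
}


def validate_event(name, properties, plan=TRACKING_PLAN):
    """Return a list of problems found; an empty list means the event is valid."""
    if name not in plan:
        return [f"unknown event: {name}"]
    errors = []
    spec = plan[name]["properties"]
    # Check declared properties for presence and type.
    for prop, rule in spec.items():
        if prop not in properties:
            if rule["required"]:
                errors.append(f"missing required property: {prop}")
        elif not isinstance(properties[prop], rule["type"]):
            errors.append(f"wrong type for property: {prop}")
    # Reject anything the plan does not declare.
    for prop in properties:
        if prop not in spec:
            errors.append(f"undeclared property: {prop}")
    return errors
```

Because the plan is plain data, it can live in version control and be reviewed like any other change, which is where the history and PR-style collaboration come from.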
A problem we are always trying to solve is making sure that, as a company, most of the decisions we make are guided by Data™. Cleaned up, accurate and properly collected data.
Real life gives everyone lots of datasets to explore and learn from. On the other hand, you need to take care of a few things if you want to use them as input for your decisions: evolving business rules, technical caveats, weird relationships between entities, missing data, data that seems correct but is subtly wrong, etc.
Right now, one of the things we’re doing to make “real life data” more actionable is designing better metrics and providing useful dashboards to stakeholders. We have always had some kind of self-serve analysis tools open to everyone, but this has been a great exercise that helped us figure out priorities and caveats, and gain a deeper understanding of our datasets.
There’s an interesting tradeoff between providing your stakeholders many datasets and the proper self-serve tools, and having the data team abstract some of the complexity, making the final datasets much easier to explore at the cost of flexibility.
We’ve swung back and forth between these modes for a while and we’re still trying to figure out the best place to be: when it makes sense to provide the data, and when it makes sense to have someone from data working alongside the decision makers, hiding some of the complexities of the data.
Another interesting problem that we share with many other fields is managing technical debt. Data technical debt specifically. Tables begin to pile up, dashboards become stale, and at a certain point you might not know whether a certain dataset is still being used or hasn’t been updated in a few months. Right now, we’re keeping it under control with very ad-hoc methods while thinking about ways to address it in a smarter way (resource owners, dashboards with TTL, more usage metrics, ...).
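As an illustration of the TTL and usage-metrics ideas, here is a rough sketch that flags resources nobody has updated or queried recently. The metadata fields, the 90-day TTL, and the find_stale helper are assumptions made for the example, not something we run today.

```python
from datetime import datetime, timedelta


def find_stale(resources, now, ttl=timedelta(days=90)):
    """Return names of resources neither updated nor queried within the TTL."""
    cutoff = now - ttl
    return [
        r["name"]
        for r in resources
        if r["last_updated"] < cutoff and r["last_queried"] < cutoff
    ]


# Hypothetical metadata; in practice this would come from warehouse usage logs.
now = datetime(2021, 5, 19)
resources = [
    {"name": "daily_signups", "owner": "data",
     "last_updated": datetime(2021, 5, 18), "last_queried": datetime(2021, 5, 19)},
    {"name": "old_experiment", "owner": "product",
     "last_updated": datetime(2020, 11, 1), "last_queried": datetime(2020, 12, 1)},
]
stale = find_stale(resources, now)
```

Keeping an owner next to each resource means a flagged table has someone to ask before it gets removed.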
Managing our own services worked well for a while. We were managing a Redshift cluster, a Kubernetes cluster to run some jobs, and also a bunch of pipelines.
Over time we had to switch to managed services to better make use of our time. As I mentioned previously, we started using BigQuery, dbt Cloud and Fivetran. We can now focus on the data we have instead of figuring out all the different configurations for each service.
We have around 10 terabytes of data stored in BigQuery. In the last 30 days, we’ve scanned or processed an average of 22 terabytes per day. Those 22 terabytes account for everything we’ve done on the BigQuery data: ingestion, preprocessing, analysis, dashboards, etc.
Our website receives almost 4 million visits and we have around 170k monthly active users.
We do! Buffer Analyze is the part of the product that takes care of that. This was completely developed by the engineering team.
A couple of years ago we started replicating the data into our warehouse and built a “Best Time to Post” feature. That was done in collaboration with the engineering team and is a great example of how a data team can help provide extra features to a product!
It can get expensive really quickly for sure. Just a few weeks ago, we saw a 50% cost increase because we added a new table whose build scanned a very large table on each run.
That said, on BigQuery there are a couple of approaches you can take to mitigate these kinds of large and potentially slow scans. The most common solution would be to add a partition key on a certain field. In our case, we use the timestamps of the events. Since most of the analysis/queries only want recent data, we can use that field to tell BigQuery to only scan and return recent data.
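The effect of filtering on the partition key can be shown with a toy model. This sketch does not call BigQuery; it only mimics the accounting, assuming a hypothetical events table with one 1 GB partition per day.

```python
from datetime import date, timedelta

# Hypothetical events table: one partition per day, each holding 1 GB.
GB = 10**9
today = date(2021, 5, 19)
partitions = {today - timedelta(days=i): 1 * GB for i in range(365)}


def bytes_scanned(partitions, since=None):
    """Bytes read by a query; a filter on the partition key skips older partitions."""
    return sum(
        size for day, size in partitions.items() if since is None or day >= since
    )


# Without a filter, every partition is read.
full_scan = bytes_scanned(partitions)
# With a filter on the event date, only the last 7 daily partitions are read.
last_week = bytes_scanned(partitions, since=today - timedelta(days=6))
```

In this model the filtered query reads 7 partitions instead of 365, which is why a query that only needs recent data gets so much cheaper once the table is partitioned on the event timestamp.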
Another approach is to break down the query into smaller steps and materialize them individually. For us, that works really well since we use dbt to control which tables are materialized and which aren’t. Having some intermediate tables materialized also reduces costs on the occasions where we don’t need to scan the original tables.
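A back-of-the-envelope sketch of why materializing an intermediate step can pay off. The table sizes and query counts are made-up numbers, and the daily_bytes helper is a simplified model rather than how BigQuery bills.

```python
def daily_bytes(raw_bytes, intermediate_bytes, downstream_queries, materialized):
    """Bytes scanned per day by a set of downstream queries."""
    if materialized:
        # Build the intermediate table once, then every query reads the small table.
        return raw_bytes + downstream_queries * intermediate_bytes
    # Every query recomputes the intermediate step from the raw table.
    return downstream_queries * raw_bytes


GB = 10**9
without = daily_bytes(500 * GB, 5 * GB, downstream_queries=10, materialized=False)
with_table = daily_bytes(500 * GB, 5 * GB, downstream_queries=10, materialized=True)
```

With ten downstream queries the materialized version scans 550 GB instead of 5,000 GB. With a single downstream query it scans slightly more (505 GB versus 500 GB), which is why it only helps on certain occasions.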
Running experiments has been an interesting use case to see at Buffer. We started with a custom framework that we’ve been evolving over time and we’re now at a point in which we run and visualize how experiments are doing quite easily.
We’ve also played a bit with ML. We currently have 2 features powered by the data and some models on top of it. One is providing customers an estimate of the best time to post for each social network they connect and the other is detecting toxicity and negativity in comments in real time.
One thing that is top of mind for me these days is putting the data we have where it’s going to be used. The idea is to do some kind of “Reverse ETL” and push some of the data we have in the warehouse (MRR, complex metrics, custom cohorts) to external tools and make use of the data there.
Right now, we’re using Segment for something similar to reverse ETL, where we send our events to different places in real time. We can’t send data that is sitting in BigQuery though! Doing that would mean moving from real time to batch and would impact some use cases (e.g., monitoring live launches).
Regarding tools, we’ve been testing both Census and Hightouch, which are very similar in features. That said, we’re still evaluating our options and what our real needs are in this area.
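To show what the batch side of that trade-off looks like, here is a minimal sketch of a reverse ETL sync. It does not use Census, Hightouch, or any real API: the rows, the batch size, and the send callback are placeholders for a warehouse query result and an external tool’s client.

```python
def build_batches(rows, batch_size):
    """Split warehouse rows into fixed-size batches for an external tool's API."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def sync(rows, send, batch_size=2):
    """Push every batch through `send` and return how many rows were sent."""
    sent = 0
    for batch in build_batches(rows, batch_size):
        send(batch)
        sent += len(batch)
    return sent


# Hypothetical rows, as they might come out of a warehouse query.
rows = [
    {"user_id": "u1", "mrr": 15},
    {"user_id": "u2", "mrr": 99},
    {"user_id": "u3", "mrr": 0},
]
received = []  # stands in for the external tool
total = sync(rows, send=received.append)
```

A job like this runs on a schedule, so the destination is only as fresh as the last run, which is the real-time versus batch difference mentioned above.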
I’ve learned a lot of lessons since I joined Buffer and I’m sure I still have a lot more to learn. Besides the obvious one that data is very messy, I think a great lesson is that data is not enough. You need to present it properly and then take action to make a real impact.
On the technical side, I’ve learned to appreciate a few things:
The power of automation and standardization. For example, all data team repositories have a Makefile as the entrypoint. Alongside a Dockerfile, it makes getting started with a repository as easy as make dev.
General questions about Data Engineering
There are a few clear trends that are now settling in as the default for lots of companies.
One is the use of a pattern similar to the “Modern Data Stack”. The space is converging a bit more and we now have better patterns to solve common use cases, for example using something like dbt to model and manage the raw data in the warehouse. Also, more vendors and tools are popping up that make managing large amounts of data easier.
Another trend is that data engineering is moving even closer to software engineering. That means more monitoring, testing, and other patterns that have been established in software engineering for a while.
At the same time, as more and more data engineering tasks get solved by vendors or open source tools like dbt or Meltano, data engineers will have the opportunity to move closer to ML engineers.
You’ve probably realized it by now, but I love dbt. Besides that, I’m a huge Python fan (that’s the language I use at Buffer).
There are lots of great resources around Data Engineering. Some of my favourites are:
I’d love to learn how companies like Netflix or Twitter manage their data pipelines and what details and tradeoffs they have to deal with.
Recently I’ve also been curious about how organizations like Our World in Data manage their data, especially how they maintain all their different and specific pipelines.