Musings of a DevOps Data Pro

Continuous delivery and automation are a couple of the core concepts of DevOps. As a Data Professional, I have spent countless hours pushing through the resistance of people who feel the data tier should be left out of source control and automated deployments, and who want to make the data tier a separate work stream from the applications. I have also spent many hours building DevOps solutions for the data tier and teaching open-minded data pros how we can overcome the challenges of working with stateful technology.

We have a lot to learn from application developers. Our challenges can be a bit different from those the typical application developer deals with, but there is major overlap between our problems and theirs. We may not be quite as special as we think we are, and many strategies they have already perfected are relevant to us.

I run a team of seventeen data professionals spanning Data Architects, ETL Developers, Business Intelligence Developers, and Quality Engineers. We are currently working on a complete re-platforming and partial re-write of our primary enterprise data warehouse.

When planning this project, I knew that the team could afford to take a couple of steps forward in the way we released our processes and managed the code of our data warehouse and databases. One such improvement was to design a deployment workflow and automate it. Technology-enforced standards and best practices would be built right in, guaranteeing adherence, because anything that did not comply would never make it to production.

The basic design

We put our meta database scripts, data warehouse scripts, and JSON-based ETL process files into a single git repository. We then created an automated deployment workflow which would only consume from our release branch in git, guaranteeing that everything deployed had been committed to source control.
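Roughly speaking, the repository layout looked something like this (the directory names here are illustrative, not our exact structure):

    meta-database/migrations/    incremental scripts for the meta database
    warehouse/migrations/        incremental scripts for the data warehouse
    etl/processes/               JSON process definitions consumed by the ETL tool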

We would then deploy sequentially: first to the meta database, then to the warehouse, and then to our ETL tool. That sequence would run in our quality environment, then our performance test environment, and then production.
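To make the ordering concrete, here is a minimal Python sketch of that flow. The print statements are stand-ins for the real steps, and the environment names are placeholders; this is an illustration of the sequence, not our actual orchestration tooling.

    # A minimal sketch of the deployment ordering. Print statements stand in for
    # the real work (migration runner for the databases, API call for the ETL tool).
    ENVIRONMENTS = ["quality", "performance", "production"]

    def deploy_meta_database(release, env):
        print(f"{env}: apply meta database migration scripts for {release}")

    def deploy_data_warehouse(release, env):
        print(f"{env}: apply data warehouse migration scripts for {release}")

    def deploy_etl_processes(release, env):
        print(f"{env}: push desired state of ETL processes for {release}")

    def deploy_release(release, env):
        # Order matters: the warehouse depends on the meta database,
        # and the ETL processes depend on both.
        deploy_meta_database(release, env)
        deploy_data_warehouse(release, env)
        deploy_etl_processes(release, env)

    for env in ENVIRONMENTS:
        deploy_release("v1.0.65", env)  # in practice, each environment gates the next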

The database and data warehouse steps use an incremental, migration-script method, while the ETL tool supports an API-based, desired-state deployment. I would have preferred desired-state deployments in the data tier as well, but the products we are using did not have the tooling to make that easy for us.
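The core idea of the incremental approach is simple enough to sketch: apply numbered scripts in order and record which ones have already run, so a deployment only executes the scripts the target database has not yet seen. The sketch below is a hedged illustration under assumed names, not our actual tooling; the deploy_log table and the sortable file-name convention are assumptions.

    # A minimal sketch of an incremental migration runner, not our actual tooling.
    # Assumes each script is a .sql file with a sortable version prefix
    # (e.g. 0001_create_dim_customer.sql) and that the target database keeps a
    # deploy_log table of scripts it has already run.
    from pathlib import Path

    def apply_migrations(script_dir, connection):
        applied = {row[0] for row in connection.execute("SELECT script_name FROM deploy_log")}
        for script in sorted(Path(script_dir).glob("*.sql")):
            if script.name in applied:
                continue                                # already deployed, skip it
            connection.execute(script.read_text())      # run the migration script
            connection.execute(
                "INSERT INTO deploy_log (script_name) VALUES (?)", (script.name,)
            )
        connection.commit()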

The resistance

Going into this project I expected to have to guide the team through certain DevOps challenges. I knew, however, that once we implemented the processes the team would experience the major benefits first-hand. In many ways that has been true.

We did run into some unexpected resistance and experienced pain in areas I did not fully anticipate. The one I am going to focus on in this post is “irrelevant bugs as a release blocker.”

Problem statement

I build a release for v1.0.65 and deploy it to the QA environment. In that release there are 10 new integration processes and 10 enhanced integration processes. A scattering of those processes has associated database changes. Version 1.0.65 passes testing and moves into the performance environment. In the meantime, v1.0.66 is built and deployed to QA.

Next, 5 of the new integrations and 5 of the enhanced integrations in v1.0.65 fail performance testing and are rejected. Now there is a complication, because the failing integrations are not related to the ones that passed. We want the 10 processes which passed testing to go to production, but we have bundled our release in such a way that they cannot move forward without modifying the deployment artifacts, which is an anti-pattern we have disallowed. In addition, v1.0.66 is blocked behind v1.0.65 because our database scripts must be deployed incrementally.

Learning from the other side

When thinking through this problem, I asked myself, “what would the application developer do?” I believe that most IT problems have already been solved; we just need to learn from those who are already doing it better than we are and adopt the pieces required to solve our problems.

This thought process brought me to consider four mitigating strategies which I could pick from or blend together to deal with my problem.

Smaller units

In a past job, I was a C# developer on the business services team of a service-oriented application. We had approximately six databases, twelve business services, a content layer, and a presentation layer. I worked on the databases, the data access layer, and the business services.

The deployments were automated, and they were built in such a way that we could mix and match pieces of the tiered architecture on a release-by-release basis. The general rule was to never deploy to anything which did not have changes. Common sense, right? Well, that common-sense lesson was violated in my deployment workflow above.

I have come to think of ETL processes as similar to a collection of services. The main difference between them and what I described from my C# days is that we have over 400 ETL processes, not twelve. The concepts still apply, though.

One way to mitigate the stated problem is to break the repository or deployment workflow down into smaller pieces. Maybe the data warehouse has different domains, subject areas, or logical data marts which can be developed and deployed independently. It may be impractical to create a repository and deployment workflow per ETL process to make them truly independent, but reducing the number of processes in each release chain will proportionally reduce the delays caused by bugs in irrelevant processes that just happen to share the pipeline.

Feature kill switches

I enjoy challenging all assumptions, especially when challenging them seems ridiculous or downright impossible. Recently, I challenged a peer of mine. I asked her, “why not just deploy the broken process to production? That would prevent it from blocking up your release pipeline.”

I do enjoy being dramatic in my professional communications. I was serious, though. Many applications embrace the concept of a feature kill switch. Stubbed code or simply incomplete code is often allowed to reside in the released code base, as long as the feature is turned off and has no impact on the users or the process.

ETL processes are cool in this respect because most of them are triggered by a schedule. For new integrations which did not pass QA or are incomplete, one option is to deploy them to production with their schedules or trigger endpoints disabled. This allows you to keep your repositories, branching strategies, and deployment workflows simple, while avoiding the lengthy delivery delays of a clogged release pipeline.
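As a rough illustration, the deployment step could consult a flag in each process definition and simply leave the schedule off for anything not yet production-ready. The "enabled" field and the etl_client methods below are hypothetical assumptions, not a real tool's API.

    # A rough sketch of a schedule-based kill switch at deployment time.
    # The "enabled" flag and the etl_client methods are hypothetical.
    import json
    from pathlib import Path

    def deploy_process(process_file, etl_client):
        definition = json.loads(Path(process_file).read_text())
        etl_client.deploy(definition)                        # ship the process itself
        if not definition.get("enabled", True):
            etl_client.disable_schedule(definition["name"])  # code ships, trigger stays off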

A / B releases

I know what you are thinking: I mentioned disabling schedules to kill a feature, but what about when you enhance an existing process? You cannot disable the schedule and just wait for the next release. That would mean a functioning process absorbs an outage until you can promote the necessary bug fixes.

This is where I think we can learn from A/B testing. The concept is that you can show two different sets of users two different sets of features. The reasons to do this are many and largely off-topic here, such as measuring the impact of a new feature in order to properly assess the value it added or the disruption it caused.

The specific ETL tool my team is using has the concept of process versioning baked in. This is not meant as a source control replacement, however, because the tool also integrates directly with git. While it is not the feature's intended purpose, I proposed that we allow buggy enhancements to be released into production. During deployment we would then map the schedule for each affected process to its last stable version. 400 processes could promote from v1.0.64 to v1.0.65, while those five buggy processes simply roll back to, and execute on, their v1.0.64 copies.
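Here is a sketch of what that mapping step might look like during deployment, assuming the ETL tool exposes process versions and schedule bindings through some API. The etl_client methods and the known_bad list are hypothetical placeholders, not our tool's actual interface.

    # A rough sketch of pinning schedules to the last stable process version.
    # The known_bad set and the etl_client methods are hypothetical.
    def point_schedules(process_names, etl_client, new_version, last_stable, known_bad):
        for name in process_names:
            target = last_stable if name in known_bad else new_version
            etl_client.bind_schedule(process=name, version=target)  # schedule runs the chosen copy

    # e.g. point_schedules(all_processes, client, "1.0.65", "1.0.64", failed_in_perf)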

Iteration speed

I was once told, “deadlocks are a performance problem, not a transaction flow problem, because if all queries completed instantaneously there would never be any opportunity to deadlock.” Our release coordination problem is similar. There is a direct correlation between the length of time it takes for a release to get from dev to prod and the risk of release timing challenges.

By reducing the size of each release, we reduce the time it takes to test it. By implementing automated tests and automated deployments, we further reduce the time any release sits in any given environment. By optimizing the end-to-end release process, we keep fewer versions, of smaller scope, in the pipeline, which reduces the likelihood and size of any problem which may come up.

Sometimes we don’t need to install traffic lights and stop signs; we just need to pick up the pace when crossing the street.

Takeaways

I intentionally left specific products and tools out of this post because I did not want anyone to come away thinking that I presented a how-to or a set of best practices. The point of this post is, hopefully, to broaden your way of thinking about how to deal with data tier DevOps challenges. Look at the similarities and differences between what data professionals deal with and what application developers deal with. It is too easy to constrain our minds to “the way you do data,” or to dismiss concepts like disposable code and environments just because disposable and stateful seem like opposites.

Maybe some day I’ll tell you about when I went on a week’s vacation the day before I began dropping my team’s development environment and restoring it from an upper environment, every morning.
