DevOps - Start of Collaboration
In a recent meeting of the Octopus Deploy Workgroup, I posed the question, "What do we have to change in our process to be able to deploy to pre-production every 45 minutes?" The Octopus Deploy Workgroup was created a year ago, when Farm Credit Services of America decided to adopt Octopus Deploy (shocking!) as the deployment tool and VSTS as the build tool. The group consists of two DBAs, three web admins, four developers, two lead developers, and an enterprise architect.
We have been going strong for about a year. Every C# project has been converted from TFS over to Git as the version control system, VSTS as the build tool, and Octopus Deploy as the deployment tool. That was a relatively painless conversion; the tooling was different (in the case of Git, very different), but the core concept was the same: a tool automatically builds and deploys the code. The database piece is a different story. Placing a database under source control and having a tool automatically deploy changes was very new to us, and as a result, a number of databases are still being moved into that deployment process.
But for the most part, the overall process has been stable and successful. In the case of my team, very successful: we are averaging a deployment every 10 days (both major and minor), a typical deployment takes less than 5 minutes to get the whole application out, and we can respond to production issues very quickly, in some cases faster than users can report them. This is a marked improvement over the previous deployment cycle, which was much, much slower. Deployments occurred every 2-3 months, which resulted in large releases that took a lot of time to deploy and verify. The point is, we have made a good start; now it is time to make it even better.
Back to the start of the story, which was posing the question, "What do we have to change in our process to be able to deploy to pre-production every 45 minutes?" I posed that question because answering it would force both developers and operations people to collaborate, which, hopefully, would help us on our path to DevOps.
Why Pre-Production instead of Production
Despite our improvements, going to production is still, well, quite a production. Get it?!? Pun intended!
For example, in order to go to production, we have to meet with our change review committee, or CRC, every day at 11:00 AM. If I had used the word production, the group would have immediately started focusing on CRC and the need to get rid of it, and my fear was we would spend the entire time on that. Another team is already working on streamlining CRC, because it involves a lot of leaders (our term for a manager or vice president) and, if I had to guess, some C-level executives.
Pre-production is another beast entirely. Only developers and operations people know about it and use it, and deploying to pre-production doesn't require a CRC. As a result, we can focus on building the collaboration, process, and any required tooling between developers and operations (DevOps!).
Why 45 minutes instead of once a day or every couple of hours
Most people can handle doing something once a day, even if it sucks. There is always some grumbling, but rarely enough to make any sort of meaningful change. Inertia will kick in for enough people and no progress will be made.
Flip that on its head to every 45 minutes. If you have to do something that sucks every 45 minutes everybody will be screaming, "oh god make it stop, I'll do anything to make this suckfest stop."
One example: every database deployment to pre-production requires approval of a delta script by both the development team and a DBA. The development team might be able to handle approving the delta script every 45 minutes, but a DBA is responsible for dozens of databases. They would be approving something basically every minute of their 8-hour day. Having manual approvals means the process, as I like to say, "doesn't scale for shit."
The Answers
The workgroup has a great cross-section of job roles, and naturally, each role had its own set of answers. The only rule was that there were no rules; the sky was the limit.
DBAs
The manual approval step is the biggest hurdle to overcome in getting to pre-production every 45 minutes. Going to the leaders and telling them we want to remove it without anything to replace it will fail.
- Involve the DBAs earlier in the process; pre-production is just too late. We need to start the collaboration a lot earlier.
- See if there is a way to scan the delta scripts for certain breaking changes, such as dropped tables or dropped columns. That is basically what the manual review does now; there is no reason we couldn't automate it.
- Start making use of some sort of tool to verify database guidelines are being followed much earlier in the process, ideally during the build steps.
Web Admins
The process for getting C# code up to the pre-production environment is fairly streamlined. The only time a manual approval is required is when a new project is about to be deployed or something in the deployment pipeline has changed. The expectation is that Octopus Deploy will push the code to pre-production within a minute or two.
- Regularly scheduled, realistic load testing. Review the logs to determine which endpoints are being hit the most and how often, and run load tests against them. Don't have the load tests push the system to the absolute extreme every time; start by going maybe 20% above the heaviest observed load, then 50%.
- Research what it would take to do blue/green deployments: what tooling we can use, how we should change Octopus Deploy, and so on.
- Use New Relic to be proactive instead of reactive. If an endpoint's performance drops 5% after a deployment, the development team should be notified before it degrades to a point that is unacceptable to our users.
Developers
Change has to occur on both sides of the DevOps equation. It can't just be the operations people making changes. There are a number of strategies developers can start looking at to help get to pre-production every 45 minutes.
- Research additional analysis tools to help enforce C# guidelines. For example, the loan origination system my team is responsible for has a guideline that any time a database is hit or a service is called, the code must use .NET's async programming (see the sketch after this list).
- Automated testing right after the deployment, not just unit testing during the build. This would allow us to make use of a blue/green deployment strategy.
- Start looking into breaking apart large monolithic applications into smaller components and only deploying the components that actually change. For example, if there is a Windows service that only changes once a quarter, it doesn't make sense to rebuild and redeploy it all the time. Only build and deploy it when a change needs to happen.
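To make that async guideline concrete, here is a minimal sketch of the pattern it asks for: a data access method that awaits the database call instead of blocking the thread. The class, table, and connection string are hypothetical, invented purely for illustration.

```csharp
using System.Data.SqlClient;
using System.Threading.Tasks;

public class LoanRepository
{
    private readonly string _connectionString;

    public LoanRepository(string connectionString)
    {
        _connectionString = connectionString;
    }

    // Guideline: anything that hits the database is async all the way down.
    // OpenAsync/ExecuteScalarAsync free the thread while waiting on SQL Server
    // instead of blocking it the way Open/ExecuteScalar would.
    public async Task<int> GetActiveLoanCountAsync()
    {
        using (var connection = new SqlConnection(_connectionString))
        using (var command = new SqlCommand(
            "SELECT COUNT(*) FROM dbo.Loans WHERE Status = 'Active'", connection))
        {
            await connection.OpenAsync();
            return (int)await command.ExecuteScalarAsync();
        }
    }
}
```

An analysis tool (or even a code review checklist) could then flag any data access or service call that uses the synchronous versions of those methods.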
The Next Steps
We identified the manual verification of the delta script as the first thing to remove to help us get to pre-production every 45 minutes. Remember earlier how I said going to a leader and asking to remove the manual DBA approval step without replacing it with anything will fail? Guess what, it did fail. I tried a couple of months ago. I don't blame them; my approach was all wrong. That step provides something beneficial, although exactly what has been debated quite a bit. Knowing what I know now, I should have approached it more along the lines of, "Hey, the DBAs and developers have collaborated and come up with these automated steps that we think can replace the manual DBA approval step."
My first focus is collaborating with the DBAs and database developers on some automated verifications that can be added throughout the database development lifecycle. Right now, we have a couple of ideas in the hopper that are being evaluated.
- Delta script verification - we are making use of Redgate's DLM tooling for our deployments, and that process kicks out a delta script of changes. We have started looking into running some regular expressions using PowerShell to look for certain key phrases we know will cause issues when going to production (a rough sketch follows this list). Our goal is to run this verification on every environment being deployed to and perhaps stop the deployment if a bad change is about to happen.
- tSQLt unit tests - we are looking into what kind of verification these tests can provide out of the box, as well as any other tests we can add. These tests would run during the build and would help prevent bad changes from ever reaching Octopus Deploy.
- Scheduled script verification - the DBAs and enterprise architects have created a set of scripts that tear through a database and check whether guidelines are being followed. Where they are not, the scripts spit out a SQL statement to update the database so it does follow the guidelines. We are looking at ways to schedule them and/or where to include them in the process (build or deployment).
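To give a feel for the delta script verification idea, here is a minimal PowerShell sketch. It assumes the Redgate tooling has already written the delta script to a file, and the patterns shown are illustrative only; the real list of key phrases, and exactly how this hooks into Octopus Deploy, is still being worked out.

```powershell
# Sketch of a delta script check. The path and patterns below are assumptions
# for illustration; the real list of "key phrases" is still being decided.
param(
    [Parameter(Mandatory = $true)]
    [string]$DeltaScriptPath
)

# Changes we know (or suspect) will cause problems in production.
$breakingChangePatterns = @(
    'DROP\s+TABLE',
    'DROP\s+COLUMN',
    'TRUNCATE\s+TABLE'
)

# Read the whole delta script as a single string.
$scriptText = Get-Content -Path $DeltaScriptPath -Raw

# Collect every pattern that appears in the script (-match is case-insensitive).
$violations = $breakingChangePatterns | Where-Object { $scriptText -match $_ }

if ($violations) {
    Write-Warning "Potential breaking changes found: $($violations -join ', ')"
    # A non-zero exit code lets the deployment step fail (or pause) the release.
    exit 1
}

Write-Output "No breaking change patterns found in $DeltaScriptPath"
```

Something like this could run as a script step right after the step that generates the delta script, with the exit code used to stop the deployment or route it to a manual intervention.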
Conclusion
After the meeting, one of the DBAs stopped me in the hall and said he is all for DevOps and most of the DBAs are on board as well, but that we have to be very careful not to ram it down their throats. I couldn't agree more. That was more or less my approach with the automated database deployments, and I received a lot of pushback. In an enterprise, change takes time. In some cases, a lot of time. Some people take longer than others to accept the change.
A good chunk of the time people are for the change, but because of the demands of the business (you know, the people we are all there to serve), their focus is elsewhere. From March of 2014 until November 15th, 2016, my team was working on a major project. During the last six or so months of that project, my focus was on getting it done, and any changes being proposed had to be postponed until basically now. It is important for me, and really anyone starting on this, to understand that this will happen.
It is also human nature to fear change. That is completely understandable, and a lot of people have very valid concerns, which will require collaboration to address and resolve.