There are millions of web applications on the Internet that are under constant development. Paying software developers to work on bug fixes and new features is expensive enough, but what’s often neglected is the cost of deployment and operation. Well-run organizations invest in their deployment and runtime infrastructure and are rewarded with fewer errors, shorter downtimes, and lower costs in the long run.
Professional IT organizations run dozens of different applications on thousands of machines with a surprisingly small number of administrators. I’ve seen rollouts on 20+ machines during the day without any downtime; it’s a matter of having the right infrastructure in place.
In this article I’ll discuss some best practices I’ve learned in the field, concentrating on three important aspects: automation, monitoring, and standardization.
You should never need to ask yourself what it takes to roll out the latest release on production systems. Rollouts shouldn’t require arcane knowledge or lots of detailed installation instructions. In a good organization, there’s a general rollout process in place and things are automated as much as possible. A deployment process that involves admins doing an "svn checkout" in your web server’s htdocs directory and then manually adjusting configuration on each host is not exactly the most robust approach.
Your build system is the best starting point for improving things there. Provide a one-button build with different profiles for development, QA, and production systems. It should be absolutely painless to create a release artifact that relies on as little external configuration as possible. Don’t create a release from a developer’s workstation though. Use a dedicated, well-configured integration machine for that.
Usually it’s a good idea to store production configuration (with the possible exception of passwords) in your source repository or configuration management database. You don’t want to lose your hand-crafted configuration if a machine crashes beyond repair. Release artifacts can be tarballs, WARs, ZIPs, or even RPMs, but make sure you actually have a self-contained, versioned, and installable artifact.
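As a rough illustration of a self-contained, versioned artifact, here is a minimal Python sketch that packs the application files together with the configuration of one build profile into a tarball. The profile names, the `config/<profile>` directory layout, the `VERSION` stamp file, and the function name `build_artifact` are all assumptions made for this example; your build system will have its own conventions.

```python
import io
import tarfile
import time
from pathlib import Path

PROFILES = ("development", "qa", "production")  # assumed profile names


def build_artifact(app_dir, config_dir, out_dir, name, version, profile):
    """Pack application files plus one profile's configuration into a tarball."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    profile_config = Path(config_dir) / profile
    if not profile_config.is_dir():
        raise FileNotFoundError(f"no configuration for profile {profile!r}")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    root = f"{name}-{version}"
    artifact = out_dir / f"{root}-{profile}.tar.gz"

    with tarfile.open(artifact, "w:gz") as tar:
        # Everything lives under one versioned top-level directory.
        tar.add(str(app_dir), arcname=f"{root}/app")
        tar.add(str(profile_config), arcname=f"{root}/config")
        # Stamp the artifact so an installed copy can always be identified.
        stamp = f"{name} {version} {profile}\n".encode("utf-8")
        info = tarfile.TarInfo(f"{root}/VERSION")
        info.size = len(stamp)
        info.mtime = int(time.time())
        tar.addfile(info, io.BytesIO(stamp))
    return artifact
```

Calling `build_artifact("app", "config", "dist", "shop", "1.4.2", "production")` on the integration machine would produce `dist/shop-1.4.2-production.tar.gz`, which unpacks into a single `shop-1.4.2` directory.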
Installing the software on multiple machines has to work with as little human interaction as possible, too. Nobody should have to log into the box and fiddle with configuration settings. Write scripts and test them thoroughly so that people trust them. Robustness and transparency (in case things go wrong) are key here.
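What robustness and transparency could mean for such a script is sketched below in Python. The function walks through the hosts one by one, logs every step, and stops at the first failure instead of leaving an unknown number of machines half-installed. The `install` callable is a placeholder for whatever actually installs the artifact on one host; it and the name `roll_out` are hypothetical.

```python
def roll_out(hosts, install, log=print):
    """Install on each host in turn; stop at the first failure.

    Returns (hosts_done, failed_host); failed_host is None on full success.
    """
    done = []
    for host in hosts:
        log(f"installing on {host}")
        try:
            install(host)  # hypothetical per-host installation step
        except Exception as exc:
            # Be loud about what failed and where, then stop.
            log(f"FAILED on {host}: {exc}; stopping rollout")
            return done, host
        log(f"finished {host}")
        done.append(host)
    return done, None
```

The return value tells the operator exactly which machines already run the new release and which one needs attention, which is the information needed to decide between fixing forward and rolling back.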
If you’re working with Java web applications on Tomcat, for example, why not use a fresh Tomcat installation for each release? Copy it to a new directory on the target machine, shut down the old server and then start the new server that already contains your web application. No cruft is left between installations, no long-forgotten configuration, no questions why one host works and another one mysteriously doesn’t. Because the previous installation is still on disk, you can usually revert to it in case of trouble, unless the release came with changes that can’t simply be switched back, such as database schema changes.
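The fresh-installation-per-release idea can be sketched in a few lines of Python. This version makes some assumptions that go beyond the description above: each release gets its own directory under a common `releases` directory, a `current` symlink points at the active one (so a POSIX file system is assumed), and `stop_server` and `start_server` are hypothetical callables that wrap whatever your server’s shutdown and startup scripts are.

```python
import os
import shutil
from pathlib import Path


def _point_current(current, version):
    """Re-point the 'current' symlink by renaming a temporary link over it."""
    tmp = current.with_name("current.tmp")
    if tmp.is_symlink():
        tmp.unlink()
    tmp.symlink_to(version)  # relative link inside the releases directory
    os.replace(tmp, current)


def install_release(prepared_dir, releases_dir, version, stop_server, start_server):
    """Copy a prepared installation into a fresh directory and switch to it."""
    releases_dir = Path(releases_dir)
    target = releases_dir / version
    if target.exists():
        raise FileExistsError(f"release {version} is already installed")
    shutil.copytree(prepared_dir, target)  # never reuse an old directory

    current = releases_dir / "current"
    previous = os.readlink(current) if current.is_symlink() else None
    if previous is not None:
        stop_server(releases_dir / previous)
    _point_current(current, version)
    start_server(target)
    return previous  # remember this for a possible rollback


def rollback(releases_dir, version, stop_server, start_server):
    """Switch back to a release directory that is still on disk."""
    releases_dir = Path(releases_dir)
    current = releases_dir / "current"
    stop_server(releases_dir / os.readlink(current))
    _point_current(current, version)
    start_server(releases_dir / version)
```

Old release directories are never modified, so cleaning up is a separate, deliberate step, and a rollback only touches the symlink and the running server.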
No application will work indefinitely like it did when it was first installed. To provide a reliable service, you need notification when (not if) things go wrong. Fortunately, with tools like Nagios the infrastructure isn’t difficult to set up.
Noticing a crashed machine is good but certainly not enough. There are many things that can go wrong in your application even if all your machines are running happily. Meaningful monitoring works on the application level, too, and of course it is highly application-specific. You could test if external resources are still available (like databases or web services) or if important use cases still work (like an ordering process). Careful analysis is needed here.
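To make this a bit more concrete, here is a minimal Python sketch of an application-level check runner. Each check is a small function that returns true if a resource or use case works; the result is summarized using the exit codes that Nagios plugins conventionally use (0 for OK, 2 for CRITICAL). The check names and the helper `tcp_reachable` are made up for this example, and a real ordering-process check would of course be far more involved than a TCP connect.

```python
import socket

OK, CRITICAL = 0, 2  # exit codes as used by Nagios plugins


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks):
    """Run named checks; a check fails if it returns false or raises."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return CRITICAL, "CRITICAL - failed: " + ", ".join(failed)
    return OK, f"OK - {len(checks)} checks passed"
```

A check set could look like `{"database": lambda: tcp_reachable("db.internal", 5432)}`, where the host name and port are placeholders. A small wrapper script would print the message and exit with the returned code so that the monitoring system can pick it up.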
Your application has to provide interfaces to the monitoring framework so that regular checks of the application’s health are possible. The interface can be JMX-based or there might just be a web page with an easily parseable status format that’s only available from within your network. There are many ways to do this; the difficult part is to figure out when your application works perfectly and when it doesn’t. Typically, you want to be notified if a web application generates 4xx or 5xx pages above a given threshold. But be careful: an attacker who knows about the threshold can trigger such errors on purpose, for example by requesting nonexistent pages, and ruin your admins’ Sundays.
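An easily parseable status format doesn’t have to be fancy. The following Python sketch computes the share of 4xx and 5xx responses among recent requests and renders it as plain `key=value` lines that a monitoring check can read back. The 5% threshold, the field names, and the function names are assumptions for illustration; the right threshold depends entirely on your application.

```python
from collections import Counter


def error_rate(status_codes):
    """Fraction of responses whose HTTP status is 4xx or 5xx."""
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        return 0.0
    return (classes[4] + classes[5]) / total


def render_status(status_codes, threshold=0.05):
    """Render a status page as key=value lines (threshold is an assumption)."""
    rate = error_rate(status_codes)
    state = "FAIL" if rate > threshold else "OK"
    return (
        f"status={state}\n"
        f"requests={len(status_codes)}\n"
        f"error_rate={rate:.3f}\n"
    )


def parse_status(text):
    """What the monitoring side does: turn the page back into a dict."""
    return dict(line.split("=", 1) for line in text.splitlines() if line)
```

On the monitoring side, `parse_status(page)["status"]` is all a check needs to look at, while the other fields help a human figure out what is going on.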
Standardization is important if you’re operating many similar applications. You typically want your applications running 24/7, and every sysadmin (even one who isn’t familiar with the particular application) should be able to perform basic tasks like figuring out the status, fixing minor problems, restarting the application, and finding the documentation.
Good and simple things to standardize are the location of program and data files, configuration, and logging output. Provide documentation for developers to follow, or even better, provide a project template that already contains the framework that’s necessary to make an application blend into your environment nicely.
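One way to keep such a convention from drifting is to write it down in code that both the project template and the operations scripts use. The Python sketch below derives all standard locations from the application name. The base directory `/srv/apps` and the subdirectory names are invented for this example; what matters is that every application follows the same scheme, not which scheme it is.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AppLayout:
    """Standard file locations for one application (hypothetical scheme)."""

    name: str
    base: PurePosixPath = PurePosixPath("/srv/apps")

    @property
    def home(self):
        return self.base / self.name

    @property
    def program_dir(self):
        return self.home / "current"

    @property
    def config_dir(self):
        return self.home / "config"

    @property
    def data_dir(self):
        return self.home / "data"

    @property
    def log_dir(self):
        return self.home / "logs"
```

With this in place, a sysadmin who has never seen the application `shop` still knows that its logs are in `AppLayout("shop").log_dir`.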
Work out a general deployment process to reduce the risk of rollouts. Even if you don’t have a dedicated rollout manager, you can still cut down the stress for everyone involved. Create a rollout plan for each release that contains all required information like involved people, affected systems, step-by-step instructions, expected effects and success or failure conditions. Don’t forget to provide a rollback plan in case the new release doesn’t work out as expected.
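A rollout plan can live in a wiki page or a ticket, but giving it a fixed structure makes it easy to verify that nothing has been left out. Here is a minimal Python sketch of such a structure with a completeness check. The field names simply mirror the items listed above; the class itself is an illustration, not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutPlan:
    """One plan per release; every section must be filled in before rollout."""

    release: str
    people: list = field(default_factory=list)
    systems: list = field(default_factory=list)
    steps: list = field(default_factory=list)
    expected_effects: list = field(default_factory=list)
    success_criteria: list = field(default_factory=list)
    rollback_steps: list = field(default_factory=list)

    def missing_sections(self):
        """Names of all sections that are still empty."""
        sections = (
            "people", "systems", "steps", "expected_effects",
            "success_criteria", "rollback_steps",
        )
        return [name for name in sections if not getattr(self, name)]

    def is_ready(self):
        return not self.missing_sections()
```

A plan that lists people, systems, and steps but no rollback procedure would report `rollback_steps` among its missing sections, which is exactly the kind of gap you want to find before the rollout rather than during it.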