Surviving AWS Failures with a Node.js and MongoDB Stack
Node+Mongo on EC2 is a very popular software stack among web services developers. There are many guides on designing such a system with built-in redundancy so that even correlated failures don't bring down the service. The absolute minimum for a resilient service is a MongoDB replica set behind a load-balanced farm of node servers.
However, you are not ready for an EC2 outage until you have deliberately shut down components in your system and verified the expected behavior. As you do this periodically, you might discover gaps you did not account for. In this blog I want to share our experiences beyond the initial configuration and add some fine details on building a Node+Mongo application that is both secure and resilient to common EC2 failures.
What you should worry about upfront:
- Node.js's single event loop will by default crash the process on an unhandled exception, so you need to worry about restarting it. Upstart or forever will do the job.
- You should also have an external process on your server making liveness checks and potentially restarting your service. Monit is a good one; it has the added benefit of emailing you when it has to restart. While upstart ensures that your process is up, monit ensures that it is responsive, and those are two different things.
- Your application instances and your MongoDB instances should each be spread across multiple Availability Zones. The more the better.
- Place the node servers and the mongo servers all in one security group which, beyond ssh, allows only the Mongo ports internally and your application ports externally. This is trivial to set up and protects your database from external requests.
- If you are like us, you have the additional protection of MongoDB authentication. Mongo's security model has limited robustness, but having authentication in your MongoDB store is still useful even if the application and the database are inside an EC2 security group. For your data to get exposed, you would have to make multiple mistakes at the same time, which happens, but the chances are greatly reduced.
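As a sketch of the monit side of this, a check along the following lines restarts the app when its HTTP endpoint stops answering and emails on each restart. The paths, port, pidfile, /health route, and address are placeholder assumptions, not from our actual setup:

```
# /etc/monit/conf.d/myapp -- hypothetical names and paths
check process myapp with pidfile /var/run/myapp.pid
  start program = "/sbin/start myapp"
  stop program  = "/sbin/stop myapp"
  if failed port 3000 protocol http
     request "/health"
     then restart
alert ops@example.com
```

The http test is what distinguishes monit from upstart here: the process can be alive but wedged, and only the protocol check notices.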
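For reference, such a group can be sketched with the AWS CLI; the group name and ports below are illustrative assumptions, and your application port and ssh exposure will differ:

```
# Hypothetical group name and ports: one group for both tiers.
aws ec2 create-security-group --group-name app-tier --description "node+mongo"
# Mongo port reachable only from members of the same group (internal traffic)
aws ec2 authorize-security-group-ingress --group-name app-tier \
    --protocol tcp --port 27017 --source-group app-tier
# Application port open externally
aws ec2 authorize-security-group-ingress --group-name app-tier \
    --protocol tcp --port 80 --cidr 0.0.0.0/0
# ssh
aws ec2 authorize-security-group-ingress --group-name app-tier \
    --protocol tcp --port 22 --cidr 0.0.0.0/0
```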
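The event-loop crash is easy to demonstrate with plain Node. In this minimal sketch (not from the original post), an error thrown from a timer callback sails past any surrounding try/catch, and only an uncaughtException handler or a supervisor restart stands between it and a dead process:

```javascript
// Demonstrates why a supervisor is needed: an exception thrown from an
// asynchronous callback escapes every try/catch on the stack, and an
// unhandled one kills the whole process.
var caught = false;

process.on('uncaughtException', function (err) {
  caught = true;
  console.log('would have crashed: ' + err.message);
  // In production, log the error and exit; let upstart/forever restart you.
});

try {
  setTimeout(function () {
    throw new Error('boom'); // thrown on a later tick of the event loop
  }, 10);
} catch (e) {
  // Never reached: the surrounding try/catch has long since returned
  // by the time the timer callback throws.
  console.log('never printed');
}
```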
Are you good to go for the next AZ outage?
Not quite. You need to ensure that failover happens smoothly. Let's shut down the primary Mongo instance and see what happens as requests keep coming in. The replicas notice that the primary is down and one of them takes over, but upon the next incoming request you see this error message:
“unauthorized db:mydb lock type:-1 client:127.0.0.1”
What this means is that the failover happened, but your application's requests are not authenticated against the new primary. This is an example of an esoteric bug that may not show up until you do a full end-to-end test. The fix is now in a pull request. Since pull requests don't get released quickly, use npm git dependencies to pull the fix into your app from your forked repo.
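A git dependency in package.json might look like this; the fork URL and branch name are placeholders for your own:

```
{
  "dependencies": {
    "mongodb": "git+https://github.com/yourname/node-mongodb-native.git#your-fix-branch"
  }
}
```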
Ready to go? Not quite.
While you have a down MongoDB instance, you may come across other trouble. If your application has to restart for whatever reason, the restart may take a long time. This is because the driver, on connect, won't call your callback until it times out connecting to the down instance.
The default in node-mongodb-native is no timeout, which means leaving it to the OS (unless your instance is down in a way that results in RSTs being sent in response to SYNs, in which case the connection fails immediately). On standard Linux systems this is controlled by tcp_syn_retries and comes to about 3 minutes. On EC2 instances I see it take ~20s, regardless of what /proc/sys/net/ipv4/tcp_syn_retries says.
What makes matters worse is that your service checker may find your application unresponsive while it is connecting to the database. It tries to be helpful and restarts the process, leading to yet another timeout-then-restart cycle…
The solution is to use Mongo’s connectTimeoutMS setting.
This ensures that your restarts will take a little over 500ms when an instance is down. However, don't assume that your Mongo driver supports it: node-mongodb-native doesn't, unless you use my patch.
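With a driver that honors the setting, it is just a connection option. Here is a sketch of the options shape, assuming the patched driver; the host, database name, and the 500ms value are illustrative:

```javascript
// Sketch only: fail fast on connect instead of waiting for OS SYN retries.
// Host/db names are placeholders; connectTimeoutMS needs the patch above.
var serverOptions = {
  auto_reconnect: true,
  socketOptions: { connectTimeoutMS: 500 }
};

// With the driver itself this would be wired up roughly as:
//   var mongodb = require('mongodb');
//   var server = new mongodb.Server('mongo1.internal', 27017, serverOptions);
//   var db = new mongodb.Db('mydb', server, { safe: true });
//   db.open(function (err, db) { /* connect fails within ~500ms if down */ });
console.log('connect timeout: ' + serverOptions.socketOptions.connectTimeoutMS + 'ms');
```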
You are now ready for an AWS AZ outage.