No. You can’t fairly compare Amazon OpsWorks to a CoreOS environment, so we’re not going to do that. All I will say is that OpsWorks and Chef combined for an extremely slow and buggy deployment process, which led us to look for an alternate solution.
We started experimenting with Docker a few months back to accelerate our testing environment. The thought of building the components of Turret.IO into self-reliant containers was most welcome, and it provided an easy way to ship images around. Deploying to our development environment took a few minutes rather than 15. Making a simple change to one of our Go binaries? No problem. Commit and push our changes to our Git repository, trigger a Jenkins build, and our new image is sitting in our private Docker repository ready to go. A fleetctl destroy and fleetctl start later, and we’re in business.
After several weeks of testing under this model, we decided to migrate from OpsWorks. OpsWorks and Chef are incredibly powerful and together provide a deployment strategy that’s suitable for many applications. OpsWorks has the ability to support entire platforms and can scale instances as needed, while Chef can handle virtually any job you throw at it (deploying software, installing libraries, configuration, etc.). Even with all this support, it just didn’t feel right. CoreOS, Docker, and Fleet are relatively new players in the game, but the concepts they brought with them meshed well with our goals and expectations.
Process Communication
One of our first migration steps was figuring out how our Docker containers would find one another. Our existing solution relied on configuration files and was hardly resilient. Ambassador containers provide a way to create links between containers, but in an effort to keep things simple and reduce the overall number of containers we were going to run, we decided against using them for now.
Etcd is part of CoreOS and provides a REST-style API for reading and writing to its key-value store. Since the CoreOS cluster relies on Etcd, it’s already set up and ready to go. We added a few lines of code to our service startup scripts to insert or update their IP addresses into their service keys. For instance, our Redis slaves first check for a REDIS_MASTER before starting. Once they’re up, they append their IP address to REDIS_SLAVES. Each service only needs to know the Etcd key of the service it wants to connect with. We’re using this model for the majority of our internal services like Cassandra, Redis, and RabbitMQ.
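As a rough sketch of what one of those startup additions looks like for a Redis slave (assuming Etcd’s v2-style keys API on the usual port, an Etcd address reachable from inside the container, and the host IP passed in by whatever launches it):

#!/bin/bash
# Illustrative startup wrapper for a Redis slave container.
ETCD="http://172.17.42.1:4001/v2/keys"   # Etcd as seen from inside the container (assumed address)
HOST_IP=${HOST_IP:?host IP must be passed in at container start}

# Wait until a master has registered itself under the REDIS_MASTER key.
while true; do
  MASTER_IP=$(curl -fs "$ETCD/REDIS_MASTER" | sed -n 's/.*"value":"\([^"]*\)".*/\1/p')
  [ -n "$MASTER_IP" ] && break
  sleep 2
done

# Append our own address to REDIS_SLAVES (simple read-modify-write for brevity).
SLAVES=$(curl -fs "$ETCD/REDIS_SLAVES" | sed -n 's/.*"value":"\([^"]*\)".*/\1/p')
curl -fs -X PUT "$ETCD/REDIS_SLAVES" -d value="${SLAVES:+$SLAVES,}$HOST_IP" > /dev/null

exec redis-server --slaveof "$MASTER_IP" 6379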
Docker
The next part of our migration was determining how we were going to use Docker. Since CoreOS (currently) promotes the use of Docker for running virtually all user-land programs on the system, we needed to make sure we weren’t making any poor decisions. We started by planning an image for each service. All of our Go binaries could share the same image, but to keep the bloat down on our other images, we constructed a Dockerfile for each service. In total, our six Dockerfiles would give us the minimal images needed to support each individual service.
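As an illustration (not our exact file), the Dockerfile for the shared Go-binary image can stay very small; the base image and paths below are assumptions:

# Minimal shared image for the Go binaries; they're compiled outside
# this image and copied in as the final step.
FROM debian:wheezy
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*

# Binaries and startup scripts produced by the build script (see Building below).
COPY bin/ /opt/turretio/bin/
COPY scripts/ /opt/turretio/scripts/

CMD ["/opt/turretio/scripts/start.sh"]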
To make the most efficient use of Docker for deployments, we found it was practically a requirement to have our own privately hosted Docker Registry. After setting up NGiNX as an SSL-terminating and authenticating reverse proxy, we were able to log in to our own registry from the command line. Amazon S3 was chosen as our image back-end so we wouldn’t have to concern ourselves with the size of a growing pile of images. Now that our registry was set up, we were able to push and pull images, which, in my opinion, is one of the things that sets Docker apart. Without the registry, we’d be forced to come up with our own way of pushing images around. I’m sure this could be done with scp or rsync, but using the registry just felt easier.
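For reference, the NGiNX piece boils down to a config along these lines; the hostname, certificate paths, and upstream port are placeholders rather than our real values:

# SSL-terminating, basic-auth reverse proxy in front of the Docker registry.
server {
    listen 443 ssl;
    server_name registry.example.com;

    ssl_certificate     /etc/nginx/ssl/registry.crt;
    ssl_certificate_key /etc/nginx/ssl/registry.key;

    # Image layers can be large; don't let NGiNX cap the upload size.
    client_max_body_size 0;
    chunked_transfer_encoding on;

    location / {
        auth_basic           "Docker Registry";
        auth_basic_user_file /etc/nginx/htpasswd;
        proxy_pass           http://127.0.0.1:5000;   # the registry container
        proxy_set_header Host              $host;
        proxy_set_header X-Real-IP         $remote_addr;
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    }
}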
Building
Now that we had our Dockerfiles and could push and pull images, it was time to figure out how to build and test. Our build process consisted of a simple bash script that accepted a suffix (dev, integration, production, etc.) and ran a go-runtime Docker container with our Go sources mounted as a shared folder. The binaries it spit out after testing were then copied to the directory containing the Dockerfile for our Go programs, and the script continued by building that Docker image, copying the Go binaries into the image as the last step.
Next was our Python app that supports the front-end. The Dockerfile installs Python libraries and copies the application files into the image as the last step.
The remaining Dockerfiles are then built and we start running our tests. If all tests succeed, each image is pushed to the registry with the appropriate suffix.
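A stripped-down sketch of that flow, with hypothetical image names, directories, and registry URL (the real script also builds the rest of our six images and runs the full test suites):

#!/bin/bash
# Usage: ./build.sh dev|integration|production
set -e
SUFFIX=$1
REGISTRY=registry.example.com            # placeholder for our private registry

# 1. Build and test the Go binaries inside a go-runtime container,
#    with the sources mounted as a shared folder.
docker run --rm -v "$PWD/go":/usr/src/turretio -w /usr/src/turretio golang:1.3 \
  bash -c "go test ./... && go build -o bin/service ./cmd/service"

# 2. Copy the binaries next to the Dockerfile for our Go image and build it.
cp go/bin/* images/go-services/bin/
docker build -t "$REGISTRY/go-services-$SUFFIX" images/go-services

# 3. Build the remaining service images the same way.
docker build -t "$REGISTRY/frontend-$SUFFIX" images/frontend

# 4. If everything passed, push each image to the registry with its suffix.
docker push "$REGISTRY/go-services-$SUFFIX"
docker push "$REGISTRY/frontend-$SUFFIX"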
Testing Locally
One of the important features of Docker is the ability to easily run containers in almost any environment. The downside to our CoreOS deployment is that testing locally wouldn’t give us access to CoreOS-specific programs like Etcd. To address this, we adjusted the run scripts we use to start each service (the ones that connect to Etcd) to check for environment variables first. That way, starting a Docker container with -e REDIS_MASTER=x.x.x.x lets us bypass Etcd and use the specific environment variables we provided without any other modifications.
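In the run scripts, that check is roughly the following (key names as above; ETCD_HOST is an illustrative variable for wherever Etcd is reachable from the container):

# Prefer an address handed to us explicitly (docker run -e REDIS_MASTER=x.x.x.x);
# only fall back to Etcd when the variable isn't set.
if [ -z "$REDIS_MASTER" ]; then
  REDIS_MASTER=$(curl -fs "http://$ETCD_HOST:4001/v2/keys/REDIS_MASTER" \
    | sed -n 's/.*"value":"\([^"]*\)".*/\1/p')
fi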
Testing in AWS
The next part of our migration involved setting up a test environment in AWS to match our production environment. CoreOS requires a minimum of three machines (always an odd number), so we created a VPC with public and private subnets. A two-machine cluster would make up our Cassandra test installation in the private subnet, while a single machine in the public subnet would run our front-end. Because CoreOS instances must be able to reach the discovery.etcd.io endpoint to receive a cluster key, we needed to enable NAT and IP forwarding on the public instance (so as not to be forced to run a fourth instance merely for NAT).
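On the public instance itself, that amounts to a couple of commands like the following (the interface name is an assumption, the instance’s EC2 source/destination check has to be disabled, and the private subnet’s route table points its default route at this machine):

# Forward and masquerade traffic arriving from the private subnet so those
# instances can reach discovery.etcd.io (and anything else on the internet).
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -o eth0 -s 10.0.1.0/24 -j MASQUERADE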
Fleet and Systemd
Once our instances were up and running, it was time to launch some services. fleetctl can be thought of as a clustered version of systemctl, and it’s responsible for submitting our unit files to the cluster and managing them. In total, we have 12 unit files required to run all of the components of Turret.IO. Here’s an example of our first uWSGI unit file, aptly named uwsgi@.service:
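(Sketched in outline here; the registry URL and container names are placeholders rather than our exact values.)

[Unit]
Description=uWSGI front-end instance %i
After=docker.service
Requires=docker.service

[Service]
EnvironmentFile=/home/core/environ
# Stop and remove anything left over under the same name, then pull the
# image matching our deploy suffix before starting.
ExecStartPre=-/usr/bin/docker kill uwsgi-%i
ExecStartPre=-/usr/bin/docker rm uwsgi-%i
ExecStartPre=/usr/bin/docker pull registry.example.com/uwsgi${DEPLOY_SUFFIX}
ExecStart=/usr/bin/docker run --name uwsgi-%i registry.example.com/uwsgi${DEPLOY_SUFFIX}
ExecStop=/usr/bin/docker stop uwsgi-%i

[X-Fleet]
Conflicts=uwsgi@*.service
MachineMetadata=subnet=10.0.0.0/24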
Here, we introduce an EnvironmentFile that points to /home/core/environ (see cloud-config below) and is responsible for setting the $DEPLOY_SUFFIX environment variable. This gives us the flexibility to pull and run Docker images that match a given suffix (e.g. dev, production, feature branches, etc.). We run several Docker commands to make sure the desired container is stopped and we have the latest version before running, which allows us to update an entire service with two simple commands: fleetctl destroy uwsgi@1.service && fleetctl start uwsgi@1.service. The unit file executes the ExecStartPre directives on each startup to stop and remove any existing image of the same name and download the latest image from our private repository (which requires a login; see cloud-config below). The next lines handle startup and shutdown of the service.
The last interesting bits of this unit file are within the X-Fleet section. Our Conflicts directive ensures that all instances of uwsgi@[1..100?].service are distributed to different machines to create a highly available service. The MachineMetadata directive instructs Fleet to only run this unit file on machines with metadata matching subnet=10.0.0.0/24 (see cloud-config below).
cloud-config
Next in our migration process was bootstrapping instances with some data that’s required before we can properly start our Systemd units with Fleet. We have three main dependencies that our cloud-config file is responsible for:
1. Machine metadata
Part of our infrastructure involves keeping Cassandra on a private subnet, not directly accessible from the outside. So when we launch instances on that subnet, we specify which subnet we’re launching on in the cloud-config for that instance using the following:
coreos:
  fleet:
    metadata: subnet=10.0.1.0/24
This corresponds with the MachineMetadata directive in our unit file and allows us to start specific Systemd units only on particular subnets, exactly what we need for Cassandra.
2. Authentication to our private Docker registry
Since we’re making heavy use of Docker to pull images and start services, we need to authenticate against our private Docker registry. This is accomplished by writing a JSON file to /home/core/.dockercfg:
write_files:
  - path: /home/core/.dockercfg
    owner: core:core
    permissions: 0600
    content: |
      {
        "https://private_repository_url:1234/v1/": {
          "auth": "basic_auth_encoded",
          "email": "email_address"
        }
      }
basic_auth_encoded is simply base64(username:password).
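For example, the encoded value can be produced with something like:
echo -n 'username:password' | base64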
Once this exists, Docker will automatically authenticate against your private repository when your images are prefixed with your URL (e.g. registry.yourdomain.com/docker_image).
3. Establishing an EnvironmentFile
Our environment file can contain any environment variables that we want our unit files to have access to. In our case, we launch each instance with a DEPLOY_SUFFIX set in this file:
  - path: /home/core/environ
    owner: core:core
    permissions: 0655
    content: |
      DEPLOY_SUFFIX=-dev
Now whenever we pull an image from our registry, we can make sure the image we’re pulling matches the environment we’re using. This works well to manage multiple environments like dev, staging, production, etc.
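With DEPLOY_SUFFIX=-dev in place, for instance, the pull in the unit file sketch above works out to something like:
/usr/bin/docker pull registry.example.com/uwsgi-dev
(again with the placeholder registry name), so each environment pulls its own flavor of every image.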
Deployment and Future
After testing this setup, we decided to bring up our CoreOS production cluster, copy our Cassandra data from our old environment, and finally switch load balancers in our DNS to conclude our migration.
We do have the occasional hiccup, mostly due to the immaturity of the platform we’ve chosen. One of our services needs to be restarted periodically and Docker has misbehaved with its memory use, forcing us to restart it on two occasions.
At this stage we have not yet completed our auto-scaling strategy, but we are evaluating the use of AWS Auto Scaling combined with automatic unit file submission of highly available Systemd units.
The future seems bright for CoreOS: their recently announced plans for their own container run-time could make container interactions even more seamless. While we’ll most likely be sticking with Docker for at least the immediate future, we’re interested to see where this takes them!