Original post

Not FAANG but for small to medium “cloud native” businesses I like to use this approach with minimal dependencies:

Managed Kubernetes cluster such as GKE for each environment, set up in the cloud provider UI since this is not done often. If you automate it with Terraform, chances are that the next time you run it, the cloud provider has subtly changed some options and your automation is out of date.

Cluster services repository with Helm charts for ingress controller, centralized logging and monitoring, etc. Use a values-${env}.yaml for environment differences. Deploy with CI service such as Jenkins.

Configuration repository for each application with Helm Chart. If it’s an app with one service or all services in a single repo this can go in the same repo. If it’s an app with services across multiple repos, create a new repo. Use a values-${env}.yaml for environment differences. Deploy with CI service such as Jenkins.
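
To illustrate the values-${env}.yaml idea, here is a minimal Python sketch of layering per-environment overrides on top of chart defaults. It is only an approximation of what Helm does itself when you pass an extra values file with `-f`; the function name and the sample keys are made up for the example, and real Helm has extra rules (for instance around null values) that are not modeled here.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_values(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return base with override layered on top; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists from the override replace the default outright.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults (what values.yaml might hold).
defaults = {"replicaCount": 1, "resources": {"cpu": "100m", "memory": "128Mi"}}
# Hypothetical per-environment overrides (what values-prod.yaml might hold).
prod = {"replicaCount": 3, "resources": {"memory": "512Mi"}}

effective = merge_values(defaults, prod)
print(effective)
```

The point of the layout is that the environment file only lists what differs, so a reviewer can see the whole dev/prod delta in one small file.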

Store secrets in cloud secrets manager and interpolate to Kubernetes secrets at deploy time.
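
A rough sketch of the "interpolate at deploy time" step, assuming the values have already been fetched from whatever secrets manager you use (that fetch is not shown, and the names `app-secrets` and `DB_PASSWORD` are placeholders). It only builds the Secret manifest; Kubernetes expects the entries under `data` to be base64-encoded.

```python
import base64
import json
from typing import Dict


def build_secret_manifest(name: str, namespace: str, values: Dict[str, str]) -> dict:
    """Build a Kubernetes Secret manifest with base64-encoded data entries."""
    encoded = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": encoded,
    }


# Stand-in for values fetched from the cloud secrets manager at deploy time.
fetched = {"DB_PASSWORD": "example-only"}
manifest = build_secret_manifest("app-secrets", "default", fetched)
print(json.dumps(manifest, indent=2))
```

In a pipeline the JSON would typically be piped straight into `kubectl apply -f -` so the plaintext never lands in the repository.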

Cloud provider keeps the cluster and VMs up-to-date, CI pipelines do the builds and deployments. No terraform/ansible/other required. Again, this only works for “cloud native” models.

Yeah, in a decent architecture the only place state is located is in the datastore layer.

The goal is to make servers disposable, able to be destroyed and created at will, so configuration management becomes kind of a legacy technology at that point.

For traditional datastore, I usually do:

Dev/QA/Similar: Either containerize the datastore and back it with a persistent volume, or use a managed DB service such as RDS or Cloud SQL and create a schema per environment. Include a deployment pipeline argument to reset to a known state. The CI pipeline can be tuned to handle dynamic environments in either case.

Stage/Prod: Use managed DB service such as RDS or Cloud SQL.

The time and cost to automate a DB upgrade with every edge case considered is huge. Rarely makes sense for small/medium business.

Nitpick: I really don’t suggest a divergence in the DB/stack-of-choice between Dev/QA/Stage/Prod. I’ve chased so many issues that were in the planning process dismissed as “yeah that’s an edge case and most likely won’t happen”.

The reasons I’ve seen for doing so are usually penny-wise, pound-foolish. Penny-wise in saving a few dollars (conceptually) on a spreadsheet for per-env/per-cycle, while neglecting the long-tail consequence of your labor factor just growing, potentially forever, without regard for total cost of ownership.

Sorry didn’t mean to rant. Hope this helps.

I favor Ansible for 2 main reasons:

– If you have SSH access, you can use it. No matter what environment or company you work for, there’s no agent to install and no need to get approval to use the tool. It’s easy to build up a reproducible library of your shell habits that works locally or remotely, where each step can avoid being repeated in case there’s a need to rerun things.

– If you get into an environment where performance across many machines is more important you can switch to pull based execution. Because of that, I see very little advantage to any of the other tools that outweighs the advantages of Ansible.
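
The "each step can avoid being repeated" part is idempotency. This is not Ansible code, just a small Python sketch of the idea with a made-up helper: check the current state first, change it only if needed, and report whether anything changed so a rerun is safe.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if it changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, nothing to do
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.conf"  # throwaway file for the demo
    first = ensure_line(target, "PermitRootLogin no")
    second = ensure_line(target, "PermitRootLogin no")
    print(first, second)
```

The second call is a no-op, which is what makes it safe to rerun a whole playbook after a failure halfway through.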

Try Puppet Bolt. Workflow is similar to Ansible. No pesky master or certificate setup, no preparation needed, just an inventory file and SSH to the remote server. You get the entire ecosystem of Puppet modules and the Puppet language scales well when your configuration becomes larger.

> If you have SSH access, you can use it. No matter what environment or company you work for, there’s no agent to install

I don’t get why this is always brought up as a major advantage when discussing CM. Ansible actually installs its Python runtime to target systems. Once I had a server with a full root disk, and Ansible failed to work because there was no space left to copy tons of its Python code.

Ansible doesn’t install a runtime on the target machine, it temporarily copies over the scripts that do the work and removes them after the run is complete. These are a few kilobytes typically.

No configuration system is likely to work with a full root partition, though.

I still prefer the Open Source edition of https://puppet.com/ to manage larger, diverse environments – which may include not just servers, but workstations, network appliances and so on. It’s well established with lots of quite portable modules. But it can also be a bit on the slower side and comes with a steeper learning curve than some of the others.

https://www.ansible.com/ is surely a good solution for bootstrapping Linux cloud machines and can be quite flexible. I personally feel like its usage of YAML manifests instead of a domain-specific language can make complex playbooks harder to read and to maintain.

If all you do is to deploy containers on a managed Kubernetes or a similar platform, you might get away with some solution to YAML templating (jsonnet et al) and some shell glue.

I am keeping an eye on https://github.com/purpleidea/mgmt, a newer contender with many interesting features, though it lacks more complex examples.

Others like saltstack and chef still see some usage as far as I know, but I’ve got no personal experience with them.

Ansible is amazing for configuration management, much better than Puppet. Storing the config in YAML makes it super easy to read and maintain, also much better than Puppet’s method.

As you mention, puppet has a steep learning curve, whereas Ansible has a very shallow one. It’s easy to get running in a few minutes!

We use both Puppet and Ansible at work, and it’s constant complaints and delays with Puppet, whereas with Ansible it’s few complaints and no delays.

> We use both Puppet and Ansible at work, and it’s constant complaints and delays with Puppet, whereas with Ansible it’s few complaints and no delays.

That’s probably because you are not running masterless, which means your puppet master is a bottleneck.

I used to use Puppet back when they were ruby based. I dropped them once they switched to Java, not interested in pushing Java onto every host when it’s not in our stack.

It’s still good in Enterprise land where taking the time to work out the declarative style and dependency chains is worth it (and you have the people to put on it and the CAB process to review infrastructure changes). For a small to mid sized company I find it gets in the way of iterating fast. I spent waaaay too much time there either fighting the tooling or having to work out dependency chains. Redhat and I-think-AWS-but-I-might-be-thinking-of-Chef also have tooling in this space.

I’ll take Chef or Ansible’s imperative approach in the environments I work in any day (Mostly ansible playbooks for baking hosts only, I’ve never been entirely comfortable with having one Ansible Tower/Chef Server/Puppetmaster/etc be authoritative over everything, too large a failure pattern if security controls fail). But again, I’m working in many younger small environments and not large mature ones.

Most of this is also irrelevant for us as we’re all in on Docker/ECS for anything new. Config management plays a limited role there over having your tasks/services checked into the individual repos.

Just for reference, the clients are all still Ruby based. It’s only the web servers for the puppet masters (the parsing code is still JRuby) and PuppetDB that are written in Clojure and run on the JVM.

I’m curious why people use configuration management software in 2020. All of that seems like the old way of approaching problems to me.

What I prefer to do is use Terraform to create immutable infrastructure from code. CoreOS and most Linux variants can be configured at boot time (cloud-config, Ignition, etc) to start and run a certain workload. Ideally, all of your workloads would be containerised, so there’s no need for configuration drift, or for any management software to be running on the box. If you need to update something, create the next version of your immutable machine and replace the existing ones.

“Immutable infrastructure” what a laugh. In a large deployment, configuration somewhere is always changing – preferably without restarting tasks because they’re constantly loaded. We have (most) configuration under source control, and during the west-coast work day it is practically impossible to commit a change without hitting conflicts and having to rebase. Then there are machines not running production workloads, such as development machines or employees’ laptops, which still need to have their configuration managed. Are you going to “immutable infrastructure” everyone’s laptops?

(Context: my team manages dozens of clusters, each with a score of services across thousands of physical hosts. Every minute of every day, multiple things are being scaled up or down, tuned, rearranged to deal with hardware faults or upgrades, new features rolled out, etc. Far from being immutable, this infrastructure is remarkably fluid because that’s the only way to run things at such scale.)

Beware of Chesterton’s Fence. Just because you haven’t learned the reasons for something doesn’t mean it’s wrong, and the new shiny often re-introduces problems that were already solved (along with some of its own) because of that attitude.

Are you sure you two are talking about the same thing?

My understanding of immutable infrastructure is the same as immutable data structures: once you create something, you don’t mess with it. If you need a different something, you create a new one and destroy the old one.

That doesn’t mean that the whole picture isn’t changing all the time. Indeed, I think immutability makes systems overall more fluid, because it’s easier to reason about changes. Mutability adds a lot of complexity, and when mutable things interact, the number of corner cases grows very quickly. In those circumstances, people can easily learn to fear change, which drastically reduces fluidity.
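
To make the data structure analogy concrete, here is a small Python sketch using a frozen dataclass. The `ServerSpec` type and the image names are invented for illustration: an existing value cannot be edited in place, and a change means creating a new value next to the old one.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ServerSpec:
    image: str
    instance_type: str


current = ServerSpec(image="app-image-v1", instance_type="small")

try:
    current.image = "app-image-v2"  # in-place mutation is rejected
except FrozenInstanceError:
    print("cannot modify an existing spec")

# A change means building a new value; the old one stays as it was
# until it is discarded.
next_spec = replace(current, image="app-image-v2")
print(current, next_spec)
```

Nothing here stops the overall picture from changing constantly; it just means every change shows up as a new thing rather than an edit to an old one.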

Yup. We do this. When our servers need a change, we change the AMI for example, and then re-deployment just replaces everything. Most servers survive a day, or a few hours.

Makes sense to me. I was talking with a group of CTOs a couple of years back. One of them mentioned that they had things set up so that any machine more than 30 days old was automatically murdered, and others chimed in with similar takes.

It seemed like a fine idea to me. The best way to be sure that everything can be rebuilt is to regularly rebuild everything. It also solves some security problems, simplifies maintenance, and allows people to be braver around updates.
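
A maximum-age policy like that boils down to a very small selection rule. This is a hedged sketch with invented types and instance names; how the launch times are obtained and how the machines are actually terminated depends on your provider and is not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class Instance:
    instance_id: str
    launched_at: datetime


def expired(instances: List[Instance], now: datetime, max_age: timedelta) -> List[Instance]:
    """Return the instances that have been running longer than max_age."""
    return [i for i in instances if now - i.launched_at > max_age]


now = datetime(2020, 6, 1, tzinfo=timezone.utc)
fleet = [
    Instance("old-one", datetime(2020, 4, 1, tzinfo=timezone.utc)),
    Instance("new-one", datetime(2020, 5, 20, tzinfo=timezone.utc)),
]
to_replace = expired(fleet, now, max_age=timedelta(days=30))
print([i.instance_id for i in to_replace])
```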

Configuration Management is still present in this process, it’s just moved from the live system to the image build step.

Probably the most insightful comment in this entire thread. Thank you. In many cases, an “image” is just a snapshot of what configuration management (perhaps not called such but still) gives you. As with compiled programming languages, though, doing it at build time makes future change significantly slower and more expensive. Supposedly this is for the sake of consistency and reproducibility, but since those are achievable by other means it’s a false tradeoff. In real deployments, this just turns configuration drift into container sprawl.

Is this still as painful as it used to be? AMI building took ages, so iteration (“deployment”) speed is really awful.

Personally that’s why I avoid Packer (or other AMI builders) and keep very tightly focussed machines set up by the cloud-init type process.

So, once you create a multi-thousand-node storage cluster, if you need to change some configuration, replace the whole thing? Even if you replace onto the same machines – because that’s where the data is – that’s an unacceptable loss of availability. Maybe that works for a “stateless” service, but for those who actually solve persistence instead of passing the buck it just won’t fly.

Could you say more about why your particular service can’t tolerate rolling replacement of nodes? You’re going to have to rebuild nodes eventually, so it seems to me that you might as well get good at it.

And just to be clear, I’m very willing to believe that your particular legacy setup isn’t a good match for cattle-not-pets practices. But I think that’s different than saying it’s impossible for anybody to bring an immutable approach to things like storage.

The person you’re replying to didn’t say “replace every node,” they said “replace the whole thing.”

To give a really silly example, adding a node to a cluster is a configuration change. It wouldn’t make sense to destroy the cluster and recreate it to add a new node. There are lots of examples like this where if you took the idea of immutable infrastructure to the extreme it would result in really large wastes of effort.

Could you please point me at prominent advocates of immutable infrastructure who propose destroying whole clusters to add a node? Because from what I’ve seen, that’s a total misunderstanding.

As I said, it’s a silly example just to highlight an extreme. In between there are more fluid examples. I don’t think it’s that ridiculous to propose destroying and recreating the cluster in its entirety when you’re deploying a new node image. However as you say I’m not sure anyone would advocate that except in specific circumstances.

On the other hand, while my suggestion of doing it to add a node sounds ridiculous I’m sure there are circumstances in which it’s not only understandable but necessary, due to some aspect of the system.

I’m saying it’s not even an extreme, in that I don’t believe what people are calling “immutable infrastructure” includes that.

If your biggest objection to an idea is that you can make up a silly thing that sounds like it might be related, I’m not understanding why we need to have this discussion. I’d like to focus on real issues, thanks.

Wow, look at those goalposts go! If you make enough exceptions to allow incremental change, then “immutable” gets watered down to total meaninglessness. That’s not an interesting conversation. This conversation is about configuration management, which is still needed in a “weakly immutable” world.

Interesting to say you’ve “solve[d] persistence” when you seem to be limited by it here. Is there a particular reason your services can’t be architected in less stateful, more 12-factor way?

Kick the persistence can down the road some more? Sure, why not? But sooner or later, somebody has to write something to disk (or flash or whatever that doesn’t disappear when the power’s off). A system that stores data is inherently stateful. Yes, you can restart services that provide access or auxiliary services (e.g. repair) but the entire purpose of the service as a whole is to retain state. It’s the foundation on top of which all the slackers get to be stateless themselves.

The vast majority of people simply redefine the terms to fit whatever they are selling.

If your systems are immutable they can run read-only. In the nineties, Tripwire, the integrity checker, popularized it. You could run it off a CD-ROM. Today immutable infrastructure is VMs/containers that can be run off a SAN or a pass-through file system that is read-only. It means snapshots are completely and immediately replicable. When you need to deploy, you take a base image/container, install code onto it, run tests to ensure that it is not broken and replicate it as many times as you need, in a read-only state. This approach also has an interesting property: because the system is read-only (as in exported to the instance read-only/mounted by the instance read-only), it is extremely difficult to do nasty things to it after a break-in. If it is difficult to create files, it is difficult to stage exploits.

That’s the only kind of infrastructure where configuration management on the instances themselves is not needed

What sort of stack do you all use then to manage these clusters? Have you found any solutions to your conflicts?

The hosts are managed via chef, the jobs/tasks running on those hosts by something roughly equivalent to k8s.

As for the conflicts, I have to say I loathe the way the more dynamic part of configuration works. It might be the most ill conceived and poorly implemented system I’ve seen in 30+ years of working in the industry. Granted, it does basically work, but at the cost of wasting thousands of engineers’ time every day. The conflicts occur because (a) it abuses source control as its underlying mechanism and (b) it generates the actual configs (what gets shipped to the affected machines) from the user-provided versions in a non-deterministic way which causes spurious differences. All of its goals – auditability, validation, canaries, caching, etc. – could be achieved without such aggravation if the initial design hadn’t been so mind-bogglingly stupid.
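
On point (b): I can’t show the actual generator, but one common source of spurious differences in generated configs is output that depends on iteration or insertion order. As an illustration only (the settings below are invented, and this is an assumption about the kind of problem, not a description of that system), rendering with a fixed key order makes equal inputs produce byte-identical output:

```python
import hashlib
import json


def render_config(settings: dict) -> str:
    """Serialize settings so that equal inputs always give byte-identical output."""
    return json.dumps(settings, sort_keys=True, indent=2) + "\n"


def fingerprint(settings: dict) -> str:
    """Hash of the rendered config, handy for detecting real changes."""
    return hashlib.sha256(render_config(settings).encode("utf-8")).hexdigest()


a = {"timeout_s": 30, "replicas": 3}
b = {"replicas": 3, "timeout_s": 30}  # same content, different insertion order

print(render_config(a) == render_config(b))
print(fingerprint(a) == fingerprint(b))
```

With a deterministic renderer, a diff in the generated file always corresponds to a diff in what the user actually asked for.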

But I digress. Sorry not sorry. 😉 To answer your question, my personal solution is to take advantage of the fact that I’m on the US east coast and commit most of my changes before everybody else gets active.

Sometimes you have to work with what you’re given in a brownfield env and a config management tool is useful in that case, but it’s possible that you are working with a less than ideal architecture with less than ideal time/money to make changes.

State is always the enemy in technology.

I can’t even imagine managing hundreds of servers whose state is unpredictable at any moment and they can’t be terminated and replaced with a fresh instance for fear of losing something.

> State is always the enemy in technology.

I work in data storage. Am I the enemy, then? 😉

> can’t even imagine managing hundreds of servers whose state is unpredictable at any moment

Be careful not to conflate immutability with predictability. The state of these servers is predictable. All of the information necessary to reconstruct them is on a single continuous timeline in source control. But that doesn’t mean they’re immutable because the head of that timeline is moving very rapidly.

> can’t be terminated and replaced with a fresh instance for fear of losing something.

No, there’s (almost) no danger of losing any data because everything’s erasure-coded at a level of redundancy that most people find surprising until they learn the reasons (e.g. large-scale electrical outages). But there’s definitely a danger of losing availability. You can’t just cold-restart a whole service that’s running on thousands of hosts and being used continuously by even more thousands without a lot of screaming. Rolling changes are an absolute requirement. Some take minutes. Some take hours. Some take days. Many of these services have run continuously for years, barely resembling the code or config they had when they first started, and their users wouldn’t have it any other way. It might be hard to imagine, but it’s an every-day reality for my team.
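
For anyone unfamiliar with what a rolling change means mechanically, here is a deliberately simplified Python sketch. The host names and the `max_unavailable` parameter are illustrative, and a real rollout would also drain, health-check, and pause between batches, none of which is modeled.

```python
from typing import Iterator, List, Sequence


def rolling_batches(hosts: Sequence[str], max_unavailable: int) -> Iterator[List[str]]:
    """Yield hosts in batches so at most max_unavailable are out of service at once."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    for start in range(0, len(hosts), max_unavailable):
        yield list(hosts[start:start + max_unavailable])


hosts = [f"host-{n}" for n in range(5)]
for batch in rolling_batches(hosts, max_unavailable=2):
    # A real rollout would drain, change, and health-check each batch here.
    print(batch)
```

The smaller the batch relative to the fleet, the less capacity is missing at any moment, which is also why some of these changes take hours or days.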

> I work in data storage. Am I the enemy, then? 😉

You’re the prison guard.

> State is always the enemy in technology.

Except that state and its manipulation is usually the primary value in technology.

> I can’t even imagine managing hundreds of servers whose state is unpredictable at any moment and they can’t be terminated and replaced with a fresh instance for fear of losing something.

Yes, that sounds awful. That’s why we have backups and, if necessary, redundancy and high availability.

I’m going to agree with you. In 2020 (and really the last few years), configuration management is outdated. IaC (infrastructure as code) is the current approach. Containerize everything you can, use terraform or cloudformation, or azure devops.

Avoid managing the underlying OS as much as possible. Use vanilla or prebuilt images to deploy these containers on: CoreOS, Amazon’s new Bottlerocket (maybe). Or use a service like Fargate when possible. All configuration should be declarative to avoid errors.

If you need to build images, tools like Packer are great. AWS has a recommended “golden AMI pipeline” pattern and a new image builder service if you can’t use community images.

I’m speaking imperatively but read these as my own directives. I work for a company that consults and actively helps fortune 500’s migrate to the cloud. So some of what I’m saying is not possible or harder on prem and I recognize that.

If I had to, I still like Chef, with Puppet as second favorite, mostly because of familiarity. Ansible can be used with either of these. And tools like serverspec to validate your images. I don’t really use any of this anymore though.

But… How do you configure the hosts where your containers are running on? How do you configure your storage (NAS/SAN)? How do you configure your routers and switches? …

The original question didn’t have much context, and I guess my answer assumed someone would be using a cloud provider as opposed to anything on premise.

Are Ansible/Puppet/Chef any good for managing the hardware you mentioned?

> Are Ansible/Puppet/Chef any good for managing the hardware you mentioned?

Yes. Well, OK, maybe not good but better than ad hoc.

Build the underlying vms with packer. Or use cloud-init as the parent mentioned – I think it has a bunch of knobs.

The question asks what I would consider to be the right approach for 2020, and also what my team is doing. This is the design pattern I’ve been following for 5 years, but obviously your mileage may vary, it won’t work for everyone, etc.

But you still need to configure things, even if they are immutable at runtime. And you need to manage that configuration over time in some systematic way.

You always have a configuration management system.

Yes, but to be clear, some of those containers have been .Net Core containers (running in Kubernetes) for me. I appreciate not having Windows in an estate isn’t common to all setups.

I mostly don’t for new stuff (all in on Docker/ECS), however we have a lot of old stuff and things in the process of being migrated where it makes sense. There’s also always the odd bird thing you use that needs to run on a regular host.

(Genuinely curious) what old stuff do you think doesn’t make sense to be set up immutably? and what odd stuff needs to run on a regular host?

Example: How do you do “immutable” management of Mac OS machines? Taking what’s typically described as such there, you’ve just turned a 30s deploy of software into a multi-hour “let’s reimage the entire machine”?

(although that’s of course not strictly “old stuff”)

Were Macs in scope of the original post? I assumed it was server side stuff, rather than office hardware. For that, though, I’d use Jamf (Pro) or some other MDM option.

Even if you exclude “office hardware” from configuration management, our Mac OS build and test farm is “servers” I’d say. Not everyone running servers is doing so to run an online service on a platform of their choice.

Not the GP, but some proprietary software requires license activation and you only get a certain (small) number of activates/deactivates.

> What I prefer to do is use Terraform to create immutable infrastructure from code.

Can you mount all your volumes read-only and run all of your stack? If you cannot, then you do not have immutable infrastructure. You simply happen to agree that no one writes anything useful, which with time will absolutely fail, because someone, somewhere is going to start storing state on a stateless system, giving you “a cow #378 called ‘Betsy’”.
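
A quick way to test that claim on a Linux box is to look for anything mounted read-write. This is a minimal sketch that parses text in the layout of `/proc/mounts` (device, mount point, filesystem type, options, two numeric fields); the sample text is made up, and on a real host you would pass in the contents of that file instead.

```python
from typing import List


def writable_mounts(mounts_text: str) -> List[str]:
    """Return mount points whose option list contains 'rw'."""
    writable = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if "rw" in options:
            writable.append(mount_point)
    return writable


# Made-up sample in the /proc/mounts layout.
sample = """\
/dev/sda1 / ext4 ro,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""
print(writable_mounts(sample))
```

Anything this reports beyond scratch space is a place where state can quietly accumulate.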

In the current state of infrastructure, an accepted definition of “immutable infrastructure” is that:

1. You deploy a completely fresh instance/container, instead of in-place updates
2. You don’t actively push changes on a running instance/container

Of course you might have stuff written to disk, such as logs, temp files, etc. But it should be non-essential data, and potentially pushed to a central place in near real-time.

Interesting. How would you do that if your deployment is, say, a couple of new tables in a 50TB Oracle database?

It only works with stateless resources.

There’s no point in trying to manage a database or similar resources this way.

Surprised more people here are not using Salt. Having used both Salt and Ansible, I much prefer Salt, especially when working with larger teams.

When working solo I use Guix, both Guix and Nix are _seriously_ amazing.

Wow really? I tried to learn Salt and it was way too complex. Comparatively Ansible was amazing to learn.

Salt has much nicer configs and feels superior to Ansible. The main disadvantage that I had with Salt was the need to have a salt master server. I read that this is no longer needed but I have not tried it myself. Keeping secrets outside of the repo was not a trivial task; Ansible has an easy way to encrypt secrets.

What’s salt? Any link? I found something called SaltStack but that appears to be enterprise security software.


Salt (also known as SaltStack) was right.

> Salt is a new approach to infrastructure management built on a dynamic communication bus. Salt can be used for data-driven orchestration, remote execution for any infrastructure, configuration management for any app stack, and much more.


I’m ultra confused about their marketing btw. Their website doesn’t even say it’s open source. You have to sign up to “try it now”. It’s like they don’t want customers? Or are people who want to understand what they’re buying not the target market, somehow?

For reference, this appears to be the Salt primer: https://docs.saltstack.com/en/getstarted/system/index.html

Sure, but understanding what the thing is, that’s part of the buying decision right? I have no clue what https://www.saltstack.com/ is about.

How do I get from this:

> Drive IT security into the 21st century. Amplify the impact of your entire SecOps team with global orchestration and automation that remediates security issues in minutes, not weeks.

to “it’s a provisioning tool for servers, like ansible but faster”?

SaltStack is actually a configuration management system, but they’re re-characterizing themselves as security management software.

They had a very large presence at RSAC 2020. I was very confused by it. They are not security software. However, I suppose there are no “configuration management” tradeshows.

I haven’t looked into what they added specifically for the space, but I think a configuration management company at a security trade show makes sense: Configuration management is a very useful tool for various security goals.

I have been using Ansible for over four years now, my current use case has around 1k VMs and a handful of baremetal in a couple of different datacenters running 100s of services.

No orchestration as well FWIW, we usually have ansible configuring Docker to run and pulling the images…

As for the future I have been meaning to explore Terraform and some Orchestration platforms (Nomad).

I use Ansible, mostly because it works pretty well for deployments (on traditional, non-dockerized applications), and then I can just gradually put more configuration under management.

So it’s a very good tool to gradually get a legacy system under configuration management and thus source control.

My default tends to be Ansible because it is really versatile and lightweight on the systems being managed. That versatility can bite you though because it’s easy to use it as a good solution and miss a great one. Also, heaven help you if you need to make a change on 1000s of hosts quickly.

I also use (In order of frequency): Terraform, Invoke (Sometimes there is no substitute for a full programming language like python), Saltstack (1000’s of machines in a heterogenous environment)

If I were going to deploy a new app on k8s today, I would probably use something like https://github.com/fluxcd/flux.

I haven’t really had a pleasant time with the tooling around serverless ecosystem yet once you get beyond hello worlds and canned code examples.

> Also, heaven help you if you need to make a change on 1000s of hosts quickly.

Why? I would have seen that as Ansible’s strong point.

Re: performance: That’s fair. I didn’t realize it scaled that badly.

Re: mitogen: Thanks! I saw that once, a long time ago, but couldn’t find it again. I’ll have to try it; vanilla ansible is fine for me so far, but I’m hardly going to ignore a speed boost that looks basically free to implement.

I might be a fanboy of type safety and a quick feedback loop, but I cannot imagine a better configuration management system than just straight configuration as code, e.g. in Go: https://github.com/bwplotka/mimic

I really don’t see why so many weird, unreadable languages like jsonnet or CUE were created when there is already a language that is type safe and script-like (Go compiles in milliseconds, and there is even the go run command), with full-fledged IDE autocompletion support, abstractions and templating capabilities, mature dependency management and much more. Please tell me why we are inventing thousands of weird things if we have ready tools that help with configuration as well! (:
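
To show what "configuration as code" can look like in a general-purpose language, here is a minimal sketch. It is written in Python with dataclasses rather than Go, it does not use mimic’s API, and the `Service` type and its fields are invented for the example. The idea is the same: the config is ordinary typed values, validation is ordinary code, and the output is generated.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class Service:
    name: str
    image: str
    replicas: int
    port: int

    def __post_init__(self) -> None:
        # Validation runs when the config is constructed, before anything is rendered.
        if self.replicas < 1:
            raise ValueError(f"{self.name}: replicas must be >= 1")
        if not 0 < self.port < 65536:
            raise ValueError(f"{self.name}: invalid port {self.port}")


def render(services: List[Service]) -> str:
    """Generate the config file contents; deploying them is a separate step."""
    return json.dumps([asdict(s) for s in services], indent=2, sort_keys=True)


api = Service(name="api", image="example/api:1.0", replicas=2, port=8080)
print(render([api]))
```

A mistake such as `replicas=0` fails when the config is built rather than at deploy time, which is most of the feedback-loop argument.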

The tool you link recommends “kubectl apply, ansible, puppet, chef, terraform” to actually apply the changes, at least 3 of those I’d classify as configuration management. Generating the configuration is only a small part of it, and the traditional tools typically have some way to do that too because they were designed to be used by non-/almost-non-coders too.

I agree. I wish we could just use EDN and Clojure, but your DevOps guy is not writing Go or Clojure code.

They are also not doing code reviews to enforce security policies.

If you have DevOps guys who are also software developers, more power to you, but if I approach my DevOps team with:

“Hey, just code your scripts in this Turing-complete language”, they will ask me “what’s your username again?” BOFH-style 😉

I find this post very condescending. You are not better than your devops guy at managing infrastructure because you can code in Clojure or GoLang or whatever other programming language.

Hah, true.

I totally see this being difficult: getting DevOps/Ops to actually do code reviews and use versioned, type-safe configuration. But once you accomplish it, the payoff is really worth it!

Please consider that you’re a principal engineer with a BS and a Master’s. And you’ve achieved all those things quite quickly! You’re on the far end of a bell curve.

A full programming language is the natural choice for people who are full programmers. But for people who aren’t, they’re intimidating and add a lot of complexity. Templating systems are much more approachable for people who have a lot of experience configuring things via big blobs of text.

As a programmer, I would personally rather express everything in a programming language, so I get your perspective here. But it isn’t an accident that there are so many ops-focused systems that are different takes on just automating the things people were previously doing manually.

I totally understand that, but why not aim high? Why do we just say “you have experience configuring stuff, so we will just give you some extended JSON with templating, you won’t be able to code…”? I think this is a bad approach (: We should always aim high and mentor those with less experience to use programming languages for this. They don’t need to know complex algorithms, distributed systems and performance optimization. It’s really just smarter templating that can actually be easier to use! (:

> Templating systems are much more approachable for people who have a lot of experience configuring things via big blobs of text.

I have mixed feelings about this statement. How is reading or using jsonnet easier? I am a principal engineer and I struggle to work with it, so how can less experienced people deal with it efficiently? (:

They can deal with it more efficiently because its model is closer to their current mental model, and so requires less cognitive load to achieve initial results.

Gabriel talks about this under the label “worse is better”. [1] I agree your preferred approach is better in the long term and at scale. But that only matters in the long term, and only if scale is eventually achieved. Tool adoption is generally a series of short-term decisions, and most projects start small.

I agree mentoring people is great, but neither you nor I have time to mentor all the people just configuring things into becoming good programmers.

[1] https://en.wikipedia.org/wiki/Worse_is_better

Secretaries at Multics sites not only used Emacs in the 70s — they customized it with Lisp. They were only ever told they were customizing their editor, not programming it, so they never got intimidated!

Thank you! I’ve seen that and I don’t fully like it. I am not interested in deploying the configuration. I believe that generating configuration, versioning it, and baking it should be a totally separate process from deploying, rolling out, reverting, etc.

That’s why IMO we should separate those. (:
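One way to picture that separation, as a rough sketch only: a “bake” step turns generated configuration into an immutable artifact named by its content hash, and a separate rollout step only ever refers to that version. The function names, the 12-character version, and the directory layout are assumptions for illustration, not any particular tool’s behaviour.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def bake(config: dict, store: Path) -> str:
    """Serialize config deterministically and store it under its content hash."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    version = hashlib.sha256(payload).hexdigest()[:12]
    store.mkdir(parents=True, exist_ok=True)
    (store / f"{version}.json").write_bytes(payload)
    return version


def load_baked(version: str, store: Path) -> dict:
    """What a separate rollout step would read: an already-baked artifact."""
    return json.loads((store / f"{version}.json").read_bytes())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "artifacts"
        version = bake({"replicas": 3, "service": "web"}, store)
        print(version, load_baked(version, store))
```

With this split, rolling out or reverting is just a matter of pointing at a different version string; the deploy side never regenerates anything.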

HashiCorp tools are quite solid, and give you a lot for free. Ansible can automate host-level changes in places where HashiCorp tools cannot reach. There shouldn’t be many such places.

Alternatively, if you have the option of choosing the whole stack, Nix/NixOS and their deployment tools.

I would recommend staying away from large systems like k8s.

We use Ansible with Packer to create immutable OS images for VMs.

Or Dockerfile/compose for container images.

Cloud resources are managed by Terraform/Terragrunt.

I think this is the ideal scenario for Ansible: one-time configuration of throwaway environments, basically as a more hygienic and structured alternative to shell scripts.

My experience trying to manage longer lived systems like robot computers over time with Ansible has been that it quickly becomes a nightmare as your playbook grows cruft to try to account for the various states the target may be coming from.

Could you say more about why ansible is better than shell scripts for one-time configuration? In my mind, ansible’s big advantage over shell scripts is that it has good support for making incremental changes to the configuration of existing resources. In a situation like packer, where the configuration script only gets run once, I prefer the conciseness of a shell script.

I see the incremental piece as a dev-time bonus rather than something to try to leverage much in production— it lets you iterate more quickly against an already-there target, but that target is still basically pristine in that any accumulated state is well understood. But that’s very much not the case if you’re trying to do an Ansible-driven incremental change against a machine that was deployed weeks or months earlier.

Even in the run-once case, though, I think there’s a benefit to Ansible’s role-based approach to modularization. And again for the dev scenario, it’s much easier to run only portions of a playbook than it is to run portions of a shell script.

And finally, the diagnostics and overall failure story are obviously way better for Ansible, too.

Now, all this said, I do still go back and forth. For example, literally right now in another window I’m working on a small wrapper that prepares clean environments to build patched Ubuntu kernels in, and it’s all just debootstrap, systemd-nspawn, and a bunch of shell script glue.

That’s a very good point: I also found that the core feature of configuration management – idempotency – actually becomes mostly useless in this case, as Ansible applies a playbook only once.

I still use it as it allows more portability across OS releases and families (as in easier migration), but it also increases the complexity when creating a new task/role/playbook.

In that sense, Dockerfiles with shell-based RUN commands are much easier to manage.

Another advantage of config management over shell might be a better integration with the underlying cloud provider. For instance, Ansible supports AWS SSM Parameter Store, which allows me to use dynamic definitions of some configuration data (RDS database endpoints, for instance) or secrets (no need for Ansible Vault).

Here’s what we’re using which I’m pretty happy with:

0. Self-hosted Gitlab and Gitlab CI.

1. Chef. I’d hardly mention it because its use is so minimal, but we have it set up for our base images for the nitpicky stuff like connecting to LDAP/AD.

2. Terraform for setting up base resources (network, storage, allocating infrastructure VMs for Grafana).

3. Kubernetes. We use a bare minimum of manually maintained configuration files: basically only for the long-lived services hosted in the cluster, the resources they need (i.e. databases and persistent volumes), and ACL configuration.

4. Spinnaker for managing deployments into Kubernetes. It really simplifies a lot of the day-to-day headaches; we have it poll our Gitlab container repository and deploy automatically when new containers are available. Works tremendously well and is super responsive.

Nix (nixos, nixops) is worth looking into if you want a full solution and can dedicate the time and energy.

Fabric https://www.fabfile.org/ (just one step above shell scripts, using Python), on 1.x as the 2.x stuff is still missing things. The key is to structure it almost like Ansible, where you kind of have “playbooks” and “roles” (I had this structure going before Ansible) … I’ll probably have to move out of this soon though.

Could you please explain why you think you’ll have to move out of it soon?

Works for smaller teams and smaller # of hosts. I would say that it would start getting harder with 5-6 people and > 100 hosts. But for small stuff, it is the most awesome thing in the world. I had a structure I used a long time ago (https://github.com/chrisgo/fabric-example) but have broken it up now differently in the last 2 years (looks more like Ansible)

… and peer pressure (which is probably not a good reason)

Shameless plug for a thing I maintain, which is in the config management space but a little bit different from the usual tools: https://github.com/sipb/config-package-dev#config-package-de…

config-package-dev is a tool for building site-specific Debian packages that override the config files in other Debian packages. It’s useful when you have machines that are easy to reimage / you have some image-based infrastructure, but you do want to do local development too, since it integrates with the dpkg database properly and prevents upgraded distro packages from clobbering your config.

My current team uses it – and started using it before I joined the company (I didn’t know we were using it when I joined, and they didn’t know I was applying; I discovered this after starting on another team and eventually moved to this team). I take that as a sign that it’s objectively useful and I’m not biased 🙂 We also use some amount of CFEngine, and we’re generally shifting towards config-package-dev for sitewide configuration / things that apply to a group of machines (e.g. “all developer VMs”) and CFEngine or Ansible for machine-specific configuration. Our infrastructure is large but not quite FAANG-scale, and includes a mix of bare metal, private cloud and self-run Kubernetes, and public cloud.

I’ve previously used it for

– configuring Kerberos, AFS, email, LDAP, etc. for a university, both for university-run computer labs where we owned the machines and could reimage them easily and for personal machines that we didn’t want to sysadmin and only wanted to install some defaults

– building an Ubuntu-based appliance where we shipped all updates to customers as image-based updates (a la CrOS or Bottlerocket) but we’d tinker with in-place changes and upgrades on our test machines to keep the edit/deploy/test cycle fast

Thanks for posting this. I’ve rolled my own version of this in the past and was very happy with the end results.

You can never go wrong with bash. You should not put secrets in metadata, and you should not have IAM profiles with overreaching privileges. For any IAM profile, or whatever the equivalent is on Azure or GCP, you should always consider what somebody could do with it if they got access to it.

Probably also just straight-up Docker and docker compose is another good idea, and Terraform and possibly HashiCorp Vault are really high on the list, too. Ansible, Chef and Puppet are all pretty esoteric, and I thought Chef was great till I just got good with bash and GNU parallel.

I haven’t used either (yet) but Dhall or Cue lang should be on your list of candidates IMO.


(To me things like puppet or ansible seem like thin layers over shell and ssh, whereas both Dhall and Cue seem to innovate in ways that are more, uh, je ne sais quoi 😉 YMMV)

Funnily, I wrote my take on this not too long back:


Don’t be distracted by FAANG scale. It’s not relevant to most software and is usually dictated by what they started using and then deployed lots of engineering time to make work.

My suggestion is to figure out how you will manage your database server and monitoring for it. If you can do that, almost everything else can fall into line as needed.

I operate a couple of Elixir apps and so far a simple Makefile with a couple of shell scripts has been enough. This simplicity is due to the fact that the only external dependency is a database server, everything else (language runtime, web server, caching, job scheduling, etc.) is baked in the Elixir release. One unfortunate annoyance though is that Elixir releases are not portable and can’t be cross-compiled (e.g. building on latest Ubuntu and deploying to Debian stable won’t work) so we have to build them in a container matching the target OS version. So to be really honest I should mention that Docker is also part of our deployment stack, although we don’t run it on production hosts.

How do you handle multiple servers? Eg for fallback, vertical scaling, whatever

I think we’ve developed multiple layers in our infrastructure (cloud infra: AWS, GCP, ...; PaaS: Kubernetes, ECS, ...; service mesh: Istio, Linkerd, ...; application containers, ...). So it depends on how many layers you have and how you want to manage a particular layer. Companies at any scale can get away with just using Google App Engine (Snap) or have 5+ layers in their infrastructure.

I find Jenkins X really interesting for my applications. It seems to solve a lot of issues related to CI/CD and automation in Kubernetes. However, it still lacks multi-cluster support.

I’ve prototyped ansible for rolling out ssl certs to a handful of unfortunately rather heterogeneous Linux boxes – and it worked pretty well for that.

I still think there’s too much setup to get started – but am somewhat convinced Ansible does a better job than a bunch of bespoke shell would (partly because Ansible comes with some “primitives”/concepts such as “make sure this version of this file is in this location on that server”, which is quick to get wrong across heterogeneous distributions).
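For a feel of what such a “make sure this file is in this location” primitive involves, here is a minimal local-only sketch in Python. It is not how Ansible implements it (Ansible’s copy module also deals with remote transport, ownership, and modes); the function name and the temp-file suffix are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def ensure_file(path: Path, content: bytes) -> bool:
    """Make sure `path` holds exactly `content`; return True if a change was made."""
    wanted = hashlib.sha256(content).hexdigest()
    if path.is_file() and hashlib.sha256(path.read_bytes()).hexdigest() == wanted:
        return False  # already in the desired state, nothing to do
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(content)
    tmp.replace(path)  # rename into place so readers never see a half-written file
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "etc" / "app.conf"
        print(ensure_file(target, b"a=1\n"))  # first run writes the file
        print(ensure_file(target, b"a=1\n"))  # second run finds nothing to change
```

Even this toy version has to handle a missing parent directory, an unchanged file, and a partial write, which is roughly the class of detail that bespoke shell tends to skip.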

We’re moving towards managed kubernetes (for applications currently largely deployed with Docker and docker-compose on individual vms).

I do think the “make an appliance;run an appliance;replace the appliance” life cycle makes a lot of sense – I’m not sure if k8s does yet.

I think we could be quite happy on a docker swarm style setup – but apparently everything but k8s is being killed or at least left for dead by various upstreams.

And k8s might be expensive to run in the cloud (a VM per pod?) – but it comes with abstractions we (everyone) need.

Trying to offload to SaaS that which makes sense as SaaS – primarily managed DB (we’re trying out ElephantSQL) – and some file storage (PDF files that are 100s of MB large).

For bespoke servers we lean a bit on etckeeper in order to at least keep track of changes. If we were to invest in something beyond k8s (it’s such a big hammer that one becomes a bit reluctant to put it down once picked up..) I’d probably look at GNU Guix.

Ansible Ansible Ansible for me!

I’ve tried Puppet and SaltStack, and I constantly find they are harder and more complex than Ansible. I can get something going in Ansible in short order.

Ansible really is my hammer.

I typically use terraform and ansible. tf creates/manages the infrastructure and then ansible completes any configuration.

This is the approach we take. We don’t track states or do continuous config management either as we’re all in on cattle > pets (and we don’t typically have the time to maintain terraforms properly enough to do anything but cut new environments). Something gets sick? Shoot it and stand up another one.

Ansible for dev boxes or smaller deployments. For large-scale deployments, CFEngine 3. When deployed within a cloud environment, one doesn’t even need a master node for CFE3; the agents can just pull the latest config state from some object storage.
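The masterless pull idea can be sketched roughly like this in Python. This is not CFEngine’s actual mechanism; `fetch` is a hypothetical stand-in for an object-storage read that returns a version marker plus the config text, and `apply` is whatever the host does with it.

```python
from typing import Callable, Optional, Tuple

# fetch() is a hypothetical stand-in for an object-storage read; it returns
# (version, config_text). A real agent would call its provider's SDK here.
Fetch = Callable[[], Tuple[str, str]]
Apply = Callable[[str], None]


class PullAgent:
    def __init__(self, fetch: Fetch, apply: Apply) -> None:
        self._fetch = fetch
        self._apply = apply
        self.current_version: Optional[str] = None

    def run_once(self) -> bool:
        """Fetch the published config and apply it only if the version changed."""
        version, config_text = self._fetch()
        if version == self.current_version:
            return False
        self._apply(config_text)
        self.current_version = version
        return True
```

Each host runs `run_once()` on a timer, so publishing a new config is just an upload; there is no central server that needs to know which agents exist.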

We use terraform to describe cloud infrastructure, check all k8s configmaps and secrets into source control (using sops to securely store secrets in git).

I’m pretty happy using both Puppet and ansible. I use Puppet for configuring hosts and rolling out configuration changes (because immutable infrastructure isn’t a thing you can just do; there’s overhead and it does not fit all problems) and ansible for orchestrating actions such as upgrades. They work well together.

I very much dislike ansible’s YAML-based language and would hate to use it for configuration management beyond tiny systems, but it’s pretty decent as a replacement for clusterssh and custom scripts.

I’m using Puppet for everything, including nearly immutable infrastructure (if you can’t mount your disks read-only and run that way, you don’t have immutable infrastructure).

Puppet maintains the base image with the core system.

Special systems are recreated by applying system specific classes to a base image.

Application software is installed via packages with git commit-ids being versions.

Nothing is upgraded; rather, new instances are rolled out and the old instances are destroyed.

This also ensures that we always know that we can recreate our entire infrastructure because we do that for rapidly changing systems several times a day and for all systems at least monthly.

This makes our operational workflow match the disaster recovery, which is a godsend.

I used to use Chef, but I really didn’t like it. For small projects now, I just use a set of shell scripts, where each installs and/or configures one thing. Pair it with a Phoenix server pattern. It has treated me very well the last two years

I would go with Ansible for side projects/smaller tasks, and use Puppet at large.

Ansible is just extremely easy to begin with, and comfortable to use since it is an agentless solution using SSH. As for Puppet, well, it could largely depend on your team. Is it a devops one or a strictly dev one? Puppet seems to be the perfect balance for us (devops mostly, but devs can touch it with confidence too).

If you already know and/or use Ruby, use Chef.

It is silly to ask “what should be used at FAANG scale”, because either you are working at a FAANG and you are using what they use, or you are very unlikely to ever be at that scale — and somewhere along the journey to getting there, you will either find or write the system that you need.

It’s not a silly question if you want to learn. Just because you don’t need it doesn’t mean it isn’t worth learning about.

Easy and flexible, but not super fast: Ansible (SSH). Still pretty easy, but very fast: SaltStack (ZMQ).

For anyone here who isn’t yet using an end-to-end setup like Terraform, Ansible, Puppet, etc. and has more basic needs around managing environment variables and application properties, I highly recommend https://configrd.io.

We’re using Terraform for infrastructure and Ansible for deployments with great success.

Ansible where possible, Chef when I have to (for legacy reasons, usually), and Terraform/Docker/Packer when given the option.

Shameless self-plug: ChangeGear. We’re cheapest in-class for medium-sized companies.

I’m also really interested in what companies at scale are using. Anyone here from FAANG?

I’ll tell you the one tool I DON’T use. Cloudformation. I’ve touched it a grand total of once and it burned me so hard I set a company policy to never use it again.

It’s like terraform, except you can’t review things for mistakes until it’s already in the process of nuking something. Which is terrible when you’re inheriting an environment.

And a set of environments along the lines of, at least, dev, test, preview, and production.

At G scale you could never afford to run something as grossly wasteful as chef. It would be cheaper to have several full-time engineers maintaining a dedicated on-host config service daemon and associated tools, than it would be for some ruby script to cron itself every 15 minutes.

Here also docker-compose. Easy to separate tenants using same stack (nginx+django+postgres+minio).

Question though: how do you manage the possible rebooting-containers-loop after a host reboot? I had to throw in more memory to prevent this but it feels like a (expensive|unnecessary) workaround. Anyone figured out how to let multiple containers start after each other (while not in 1 docker-compose.yaml)?
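One common workaround for start ordering across separate compose files is to have each container wait for its dependency before launching the real process. Below is a minimal sketch of such a wait helper in Python; the function name and defaults are assumptions, and it only addresses ordering, not the memory pressure mentioned above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An entrypoint wrapper could call something like `wait_for_port("postgres", 5432)` (hostname and port are hypothetical) and only then start the application, so containers that come up too early wait quietly instead of crashing and restarting in a loop.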

Kind of surprised there isn’t really a consistent answer for this. Just skimming through these answers.