Not in large-scale production

I love Docker. It's so freakin' useful for distributing predictable runtime environments.

Using it during development means radically less time configuring my dev machine. It also means I can "install" different versions of PostgreSQL, Redis, etc. for different projects.
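
For example (a minimal sketch; the container names, host ports, and password are placeholders), two projects can each get their own PostgreSQL without either one touching the host:

# Project A gets PostgreSQL 9.4 on the host's port 5432
docker run -d --name projA-pg -e POSTGRES_PASSWORD=devonly -p 5432:5432 postgres:9.4

# Project B gets PostgreSQL 9.5 on port 5433
docker run -d --name projB-pg -e POSTGRES_PASSWORD=devonly -p 5433:5432 postgres:9.5

# Tear either one down without leaving a trace
docker rm -f projA-pg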

In this way, Docker is very useful.

Not in production

But in production, Docker has a ton of problems.

Each version of the Docker CLI is incompatible with the previous one.

I mean, WTF. This means that if I have Docker 1.9 in production someplace, I have to keep a machine with the Docker 1.9 CLI installed in order to control it.
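
Here's roughly what that looks like in practice (the host name is a placeholder, and the exact numbers depend on the versions involved: Docker 1.9 speaks API 1.21, Docker 1.12 speaks API 1.24):

$ docker -H tcp://prod-host:2376 ps
Error response from daemon: client is newer than server (client API version: 1.24, server API version: 1.21)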

Just imagine if someone emailed you a PDF or an Excel file and you didn't have the exact version of the PDF reader or Excel that would open that file. Your head would blow clean off.

It means that if I built a production instance with Docker Machine version X, I can only control that instance with version X installed. What's worse, the paths to the certs and such that Docker Machine generates are full paths like /home/dpp/.docker/xxx. That means I can't take my Docker Machine credentials from my Linux box to my Mac. It means I can't put the credentials in a shared secret system (e.g., LastPass) for others to use.
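
To make that concrete, here's roughly what the generated config looks like (field names quoted from memory; the actual file lives at ~/.docker/machine/machines/<name>/config.json and has many more entries):

{
  "AuthOptions": {
    "CaCertPath": "/home/dpp/.docker/machine/certs/ca.pem",
    "ClientCertPath": "/home/dpp/.docker/machine/certs/cert.pem",
    "ClientKeyPath": "/home/dpp/.docker/machine/certs/key.pem"
  }
}

Moving those credentials to a Mac means rewriting every path by hand, with something like:

# BSD sed on macOS; the target path is a placeholder
sed -i '' 's|/home/dpp|/Users/dpp|g' config.json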

Both of the above issues would be trivial to fix. The fact that Docker isn't spending any resources on them is really frightening.

Docker Engine 1.12/Swarm is brittle

I set up a 10-node, 3-data-center Swarm cluster with Docker Engine 1.12 (and tried it with 1.12.1 as well).
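
Standing up such a cluster with 1.12 looks roughly like this (the IPs are placeholders; docker swarm init prints the real join tokens):

# on the first manager
docker swarm init --advertise-addr 10.0.0.1

# print the token that lets other nodes join as managers
docker swarm join-token manager

# on each additional node, using the token from above
docker swarm join --token SWMTKN-1-... 10.0.0.1:2377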

Within 40 hours, the cluster failed. None of the manager nodes could answer:

root@f1 ~ # docker node ls
Error response from daemon: rpc error: code = 2 desc = raft: no elected cluster leader

The Swarm lost Raft consensus, and none of the manager nodes was able to assert that it was the leader. Raft needs a majority of the managers (two of three, three of five) to agree before it can elect a leader, so a partition that splinters the managers halts the whole cluster. Granted, a geographically distributed swarm means the probability of a network partition is high. But network partitions happen all the time. The fact that, out of the box, Swarm is way too sensitive to network partitions means something bad will happen in a single data center deployment as well.

This, combined with the CLI's version fragility, makes it unlikely that I can run a mixed 1.12/1.13 swarm. That means whatever version my swarm is started on is the version of Docker I'm stuck with until I build a new swarm and migrate my code onto it.

UPDATE 2016/08/30

It turns out that a manager advertising an IP address different from its public IP address triggers the swarm instability. While Docker 1.12 had consensus-related issues, 1.12.1 is stable as long as all the managers can see each of the live manager nodes. Here's the ticket.
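
The workaround, in sketch form (the addresses are placeholders; the point is to advertise the IP the other managers can actually reach):

# on the first manager, advertise the publicly reachable address
docker swarm init --advertise-addr 203.0.113.10

# additional managers advertise their own public addresses when joining
docker swarm join --advertise-addr 203.0.113.11 --token SWMTKN-1-... 203.0.113.10:2377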

Wanting to love Docker

I really want to love Docker. There's so much good in it.

But "in production, at scale, with real-world constraints" are all attributes Docker needs to have in order to make it into a large scale production environment.

Yes, easy set-up and a great UX are parts of the equation... parts that Docker gets right.

But whatever is built on Docker today may be around in production systems for the next 20-30 years. Docker has to get the versioning and stability stuff right.