If you follow me on Twitter, you may think I hate YAML.

I'm not against YAML, just against abuse of YAML. I want to help prevent people from abusing YAML and being cruel to themselves and their coworkers in the process.

YAML's strength is as a structured data format. Yes, it has issues. Whitespace is a minefield. Its syntax is surprisingly complex. It has gotchas: "Anyone who uses YAML long enough will eventually get burned when attempting to abbreviate Norway." But YAML is human readable and supports comments: two key benefits that drive its popularity.
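The Norway gotcha comes from YAML 1.1's boolean rules, under which unquoted scalars such as NO, yes and off resolve to booleans rather than strings. Here is a minimal Python sketch of that rule. It uses a hypothetical helper, `resolve_scalar`, rather than a real YAML parser, and real parsers differ in how much of the rule they implement.

```python
import re

# Alternatives listed in the YAML 1.1 boolean type definition: an unquoted
# scalar matching one of these is resolved as a boolean, not a string.
YAML_11_BOOL = re.compile(
    r"y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF"
)
TRUTHY = {"y", "yes", "true", "on"}


def resolve_scalar(text: str):
    """Hypothetical helper mimicking YAML 1.1 boolean resolution of an unquoted scalar."""
    if YAML_11_BOOL.fullmatch(text):
        return text.lower() in TRUTHY
    return text


# ISO country codes for Sweden, Denmark and Norway, written unquoted.
countries = ["SE", "DK", "NO"]
resolved = [resolve_scalar(code) for code in countries]
print(resolved)  # ['SE', 'DK', False]
```

Quoting the value ("NO") keeps it a string, which is exactly the kind of thing you only learn after being burned.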

Where it goes wrong is when we use YAML to describe behavior.

Consider some examples from the CI domain. This isn't the only domain in which YAML is abused this way, but it's among the worst offenders.

Take GitLab's pipeline definition for delivering itself: an 1170(!) line YAML file rife with sections like this:

    gitlab:assets:compile:
      <<: *dedicated-no-docs-pull-cache-job
      image: dev.gitlab.org:5005/gitlab/gitlab-build-images:ruby-2.5.3-git-2.18-chrome-71.0-node-8.x-yarn-1.12-graphicsmagick-1.3.29-docker-18.06.1
      dependencies:
        - setup-test-env
      services:
        - docker:stable-dind
      variables:
        NODE_ENV: "production"
        RAILS_ENV: "production"
        SETUP_DB: "false"
        SKIP_STORAGE_VALIDATION: "true"
        WEBPACK_REPORT: "true"
        # we override the max_old_space_size to prevent OOM errors
        NODE_OPTIONS: --max_old_space_size=3584
        DOCKER_DRIVER: overlay2
        DOCKER_HOST: tcp://docker:2375
      script:
        - node --version
        - yarn install --frozen-lockfile --production --cache-folder .yarn-cache
        - free -m
        - bundle exec rake gitlab:assets:compile
        - time scripts/build_assets_image
        - scripts/clean-old-cached-assets
      artifacts:
        name: webpack-report
        expire_in: 31d
        paths:
          - webpack-report/
          - public/assets/

Note the script block containing a list of shell scripts. Does this look like data? Is this the right model for specifying execution?

There are many similar cases. Here is a fragment from an example of Tekton, a newish Kubernetes-based delivery solution:

    apiVersion: tekton.dev/v1alpha1
    kind: Task
    metadata:
      name: build-push
    spec:
      inputs:
        resources:
          - name: workspace
            type: git
        params:
          - name: pathToDockerFile
            description: The path to the dockerfile to build
            default: /workspace/workspace/Dockerfile
          - name: pathToContext
            description: The build context used by Kaniko (https://github.com/GoogleContainerTools/kaniko#kaniko-build-contexts)
            default: /workspace/workspace
      outputs:
        resources:
          - name: builtImage
            type: image
      steps:
        - name: build-and-push
          image: gcr.io/kaniko-project/executor
          command:
            - /kaniko/executor
          args:
            - --dockerfile=${inputs.params.pathToDockerFile}
            - --destination=${outputs.resources.builtImage.url}
            - --context=${inputs.params.pathToContext}

Ouch. Variables. Qualified names. Arguments. This is not structured data. This is programming masquerading as configuration.

Haven't we met concepts like variables and successive instructions before? Why clumsily reinvent imperative programming? What about modularity and testability? What about toolability, which we'd get for free with a programming language? Why reinvent exception handling, which is rigorously defined in modern languages? What about logical operations, let alone more advanced and elegant FP or OOP concepts?
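To make the contrast concrete, here is a rough sketch of the Kaniko step from the Tekton example as ordinary Python. The names `BuildParams`, `kaniko_args` and `InvalidBuildConfig` are hypothetical and not part of Tekton or Kaniko. Only the three flags and the default paths are taken from the YAML above.

```python
from dataclasses import dataclass
from typing import List


class InvalidBuildConfig(ValueError):
    """Hypothetical error type raised when a build is misconfigured."""


@dataclass(frozen=True)
class BuildParams:
    # Defaults copied from the params section of the Tekton Task.
    path_to_dockerfile: str = "/workspace/workspace/Dockerfile"
    path_to_context: str = "/workspace/workspace"


def kaniko_args(image_url: str, params: BuildParams = BuildParams()) -> List[str]:
    """Build the argument list that would be passed to /kaniko/executor."""
    if not image_url:
        # An ordinary exception replaces whatever the YAML engine does on bad input.
        raise InvalidBuildConfig("destination image URL must not be empty")
    return [
        f"--dockerfile={params.path_to_dockerfile}",
        f"--destination={image_url}",
        f"--context={params.path_to_context}",
    ]


# The registry URL below is a made-up placeholder.
print(kaniko_args("registry.example.com/app:latest"))
```

Variables are just parameters, defaults are just default values, and the whole thing can be unit tested, refactored and navigated with the tools we already have.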

The best argument in favor of such YAML-based syntax is that it's an external DSL, enforcing a beneficial structure. However, even this doesn't stack up, for several reasons:

  • The prescriptive structure is largely an illusion. The bulk of the work is pushed into shell commands, like those in the script block of the GitLab example, which have no structure beyond the environment. In practice it's the Wild West.
  • If a step is missing in the design of the DSL, you hit a wall. For example, CI tools typically model delivery phases as YAML stanzas. If you need a unique phase, you're probably out of luck.
  • YAML is a poor format for an external DSL, just as XML was. The popular configuration format du jour is always misused this way.

You probably don't want an external DSL, anyway: something we learnt the hard way at Atomist.

"External DSLs...are like puppies, they all start out cute and happy, but without exception turn into vicious beasts as they grow up."

Modern programming languages are flexible enough to make internal DSLs more and more compelling, with far superior tooling and extensibility.
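As an illustration only, here is a minimal sketch of what an internal DSL for delivery phases might look like in Python. The `Pipeline` class and the phase names are hypothetical and do not describe any real CI product or the Atomist API.

```python
from typing import Callable, Dict, List

Phase = Callable[[Dict[str, str]], None]


class Pipeline:
    """Hypothetical internal DSL: phases are plain functions registered in order."""

    def __init__(self) -> None:
        self.phases: List[Phase] = []

    def phase(self, fn: Phase) -> Phase:
        # Used as a decorator; registration order is execution order.
        self.phases.append(fn)
        return fn

    def run(self, context: Dict[str, str]) -> List[str]:
        # Run each phase against the shared context and report what ran.
        executed = []
        for fn in self.phases:
            fn(context)
            executed.append(fn.__name__)
        return executed


pipeline = Pipeline()


@pipeline.phase
def build(context):
    context["artifact"] = f"{context['name']}.tar.gz"


@pipeline.phase
def license_audit(context):
    # A "unique phase" that no YAML schema anticipated: just another function.
    context["audited"] = "yes"


@pipeline.phase
def publish(context):
    context["published"] = context["artifact"]


print(pipeline.run({"name": "demo"}))  # ['build', 'license_audit', 'publish']
```

The structure a DSL is meant to enforce is still there, but adding a phase the designers never thought of is a matter of writing one more function, not waiting for a new YAML stanza.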

Trying to use a data format as a programming language is wrong. Calling it out has nothing to do with the merits of the data format for what it was designed for.

YAML as a data format is defensible. YAML as a programming language is not. If you're programming, use a programming language. You owe it to Turing, Hopper, Dijkstra and the countless other computer scientists and practitioners who've built our discipline. And you owe it to yourself.