Securing Nomad, Consul and Gluster

Part 6 of Avoiding the Cloud series

This is Part 6 of the Avoiding the Cloud series. This step is slightly dependent on the provider you use. I’ll use TransIP in this example.

In Part 1: Building a Docker cluster with Nomad, Consul and SaltStack on TransIP we defined our technology stack, our IP address space and node types and roles.

In Part 2: Reproducibly provisioning Salt Minions on TransIP we provisioned 4 different VPSs with just a Salt Minion, their hostnames and their IP addresses.

In Part 3: Provisioning Consul, Nomad and Gluster with SaltStack you created a central SaltStack configuration that defined all nodes and brought them to their correct configuration.

In Part 4: Getting Traefik, Health checks and Dashboards on a Nomad Cluster you made your Traefik, Consul and Nomad dashboards available to yourself and you made sure your provider’s load balancer could direct traffic to your cluster.

In Part 5: Adding Redundancy to Consul, Nomad and Gluster you deployed additional nodes to your cluster to create redundancy.

In this part of the series we will take our cluster and start bolting on security to make it more resilient in this hostile world.

Just like in the other parts, whenever I talk about putting specific contents in a file, I’ll leave the choice of editor up to you. Personally I use vi, but nano is available as well.

Consul Security

TLS Communication

By default it’s possible for a compromised Consul client to simply change its configuration and rejoin the cluster as a server. At that point it can change configuration values. To prevent a compromised client from restarting as a server and performing a Man-in-the-Middle attack, Consul nodes need to be able to verify the identities of at least the Consul server nodes.

Verification is performed with TLS certificates, so this measure requires a long-lived CA certificate. The CA certificate needs to survive restarts of cluster nodes, so our Salt Master is the best place to create and keep it.

That means we should at least install the Consul binary on our Salt Master.

To install the Consul binary on our Salt Master (which is based on Ubuntu 20.04) we run:

curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt-get update && sudo apt-get install consul
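
A quick version check confirms the binary is installed and on your PATH:

$ consul version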

Luckily Consul has built-in CA functionality, so now we can create our CA, and server certificates for our three nodes:

$ mkdir -p /srv/data/ca
$ ( cd /srv/data/ca && consul tls ca create )
==> Saved consul-agent-ca.pem
==> Saved consul-agent-ca-key.pem
$ ( cd /srv/data/ca && consul tls cert create -server )
==> WARNING: Server Certificates grants authority to become a
    server and access all state in the cluster including root keys
    and all ACL tokens. Do not distribute them to production hosts
    that are not server nodes. Store them as securely as CA keys.
==> Using consul-agent-ca.pem and consul-agent-ca-key.pem
==> Saved dc1-server-consul-0.pem
==> Saved dc1-server-consul-0-key.pem
$ ( cd /srv/data/ca && consul tls cert create -server )
==> WARNING: Server Certificates grants authority to become a
    server and access all state in the cluster including root keys
    and all ACL tokens. Do not distribute them to production hosts
    that are not server nodes. Store them as securely as CA keys.
==> Using consul-agent-ca.pem and consul-agent-ca-key.pem
==> Saved dc1-server-consul-1.pem
==> Saved dc1-server-consul-1-key.pem
$ ( cd /srv/data/ca && consul tls cert create -server )
==> WARNING: Server Certificates grants authority to become a
    server and access all state in the cluster including root keys
    and all ACL tokens. Do not distribute them to production hosts
    that are not server nodes. Store them as securely as CA keys.
==> Using consul-agent-ca.pem and consul-agent-ca-key.pem
==> Saved dc1-server-consul-2.pem
==> Saved dc1-server-consul-2-key.pem

Ok.. Now we have a CA key and certificate and three server keys and certificates.

Warning: It’s never a good plan to put keys and certificates in a repository. So, if you plan to put /srv/ in a repository, add /srv/data to .gitignore!
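
Before distributing anything you can sanity-check what was generated. Assuming openssl is available on the Salt Master, this prints the subject, issuer and validity of the first server certificate:

$ openssl x509 -in /srv/data/ca/dc1-server-consul-0.pem -noout -subject -issuer -dates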

Now it’s time to distribute the right key and certificate to the right Consul server. We’ll need a bit of Jinja templating for that. If you remember from one of the earlier articles, node-specific information should be in a pillar and then used in a state. Let’s start by adding a new pillar in /srv/pillar/data/consul_ca.sls:

consul_ca_tls:
  user: consul
  group: consul
  filename: 'consul-agent-ca.pem'
  mode: 600
  contents: |
    -----BEGIN CERTIFICATE-----
    SNIPPED
    -----END CERTIFICATE-----    

And on for the server certificates in /srv/pillar/data/consul_server_tls.sls:

consul_server_tls:
  user: consul
  group: consul
  cert:
    filename: 'dc1-server-consul.pem'
    mode: 640
{% if grains['id'] == 'consul-server-01' %}
    contents: |
      -----BEGIN CERTIFICATE-----
      SNIPPED
      -----END CERTIFICATE-----
{% endif %}
{% if grains['id'] == 'consul-server-02' %}
    contents: |
      -----BEGIN CERTIFICATE-----
      SNIPPED
      -----END CERTIFICATE-----
{% endif %}
{% if grains['id'] == 'consul-server-03' %}
    contents: |
      -----BEGIN CERTIFICATE-----
      SNIPPED
      -----END CERTIFICATE-----
{% endif %}

  key:
    filename: 'dc1-server-consul-key.pem'
    mode: 640
{% if grains['id'] == 'consul-server-01' %}
    contents: |
      -----BEGIN EC PRIVATE KEY-----
      SNIPPED
      -----END EC PRIVATE KEY-----
{% endif %}
{% if grains['id'] == 'consul-server-02' %}
    contents: |
      -----BEGIN EC PRIVATE KEY-----
      SNIPPED
      -----END EC PRIVATE KEY-----
{% endif %}
{% if grains['id'] == 'consul-server-03' %}
    contents: |
      -----BEGIN EC PRIVATE KEY-----
      SNIPPED
      -----END EC PRIVATE KEY-----
{% endif %}

Here we use our knowledge of each node’s ID to put the right key and certificate on the right Consul server node.

We’ll add these pillars to /srv/pillar/top.sls:

base:
  '*':
    - ufw_private_only
    - unbound_consul_dns
    - data/consul_ca
  'consul-*':
    - data/consul_server_tls
<skipped everything after>
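
If you want to check that the Jinja conditions put the right certificate on the right node before writing any state, you can refresh and query the pillar for a single minion:

$ sudo salt 'consul-server-01' saltutil.refresh_pillar
$ sudo salt 'consul-server-01' pillar.get consul_server_tls:cert:filename

The second command should return dc1-server-consul.pem; querying the contents key works the same way.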

Let’s add one new state to /srv/salt/consul_node.sls before ‘consul running’:

<SKIPPED>

required private consul files:
  file.managed:
    - names:
      - /etc/consul.d/{{ pillar['consul_ca_tls']['filename'] }}:
        - contents_pillar: consul_ca_tls:contents
        - user: {{ pillar['consul_ca_tls']['user'] }}
        - group: {{ pillar['consul_ca_tls']['group'] }}
        - mode: {{ pillar['consul_ca_tls']['mode'] }}
    - require_in:
      - service: consul running
    - watch_in:
      - service: consul running

consul running:
<SKIPPED>

Then we add one state for the Consul Server certificates in /srv/salt/consul_server.sls after our consul.hcl state:

<SKIPPED>

consul server certificate files:
  file.managed:
    - names:
{% for item in ['cert', 'key'] %}
      - /etc/consul.d/{{ pillar['consul_server_tls'][item]['filename'] }}:
        - contents_pillar: consul_server_tls:{{ item }}:contents
        - user: {{ pillar['consul_server_tls']['user'] }}
        - group: {{ pillar['consul_server_tls']['group'] }}
        - mode: {{ pillar['consul_server_tls'][item]['mode'] }}
{% endfor %}

Now for the actual configuration files. Let’s add the TLS configuration as required in /srv/salt/data/consul.d/server.hcl by appending:

verify_incoming = true,
verify_outgoing = true,
verify_server_hostname = true,

ca_file = "/etc/consul.d/consul-agent-ca.pem",
cert_file = "/etc/consul.d/dc1-server-consul.pem",
key_file = "/etc/consul.d/dc1-server-consul-key.pem",

auto_encrypt {
  allow_tls = true
}

And similarly for the Consul clients in /srv/salt/data/consul.d/client.hcl by appending:

verify_incoming = false,
verify_outgoing = true,
verify_server_hostname = true,
ca_file = "/etc/consul.d/consul-agent-ca.pem",
auto_encrypt = {
  tls = true
}

Now finally apply our new config:

$ sudo salt '*' saltutil.refresh_pillar
$ sudo salt '*' state.apply

Now all communication between our Consul nodes is TLS protected, and it’s not possible anymore to turn a compromised Consul client into a new Server.
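
A simple way to convince yourself the cluster is still healthy after the switch to TLS is to check the member list on one of the servers (the local HTTP API is still plain HTTP, since we didn’t enable an HTTPS port):

$ sudo salt 'consul-server-01' cmd.run 'consul members'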

Consul ACLs

By default every Consul node (client or server) in the cluster can read and write any value within Consul. That means access to all the Key/Value pairs (which contain important configuration data) and the ability to write the current location of services, nodes, etc.

In production environments it’s thus important to make sure each service and/or node can only read the data that it needs to, and write the data it needs to write.

Consul has Access Control Lists for this purpose and an extensive tutorial on Consul ACLs. But I found the ACL tutorial severely lacking in real-world useful knowledge. There is no advice on what permissions are needed on which kinds of nodes, and it takes trial and error, a lot of failing, and staring at logs to figure out what is actually needed to make a set-up work.

So let’s make things easy, skip all the reading and go straight to the end result. What we will do:

  1. Enable ACL support within our Consul cluster and get a Bootstrap ACL
  2. Define ACL policies on our Salt master
  3. Create the actual policies and roles
  4. Create tokens for our nodes and services
  5. Distribute tokens to our nodes and services

Bootstrap Consul ACL system

To enable and enforce the ACL on all Consul nodes, you need to add the following ACL configuration block in a new file called /srv/salt/data/consul.d/acl.hcl:

acl = {
  enabled = true
  default_policy = "deny"
  enable_token_persistence = true
}

And let’s add it to /srv/salt/consul_node.sls just before consul running: to make sure it gets to our nodes:

<SNIPPED>

/etc/consul.d/acl.hcl:
  file.managed:
    - source: salt://data/consul.d/acl.hcl
    - user: consul
    - group: consul
    - mode: 644
    - require_in:
      - service: consul running
    - watch_in:
      - service: consul running

consul running:
<SNIPPED>

Warning: Do not roll this out on an already running datacenter, as the default_policy of deny will immediately block all access to Consul information. In that case you should first use a default_policy of allow and change it to deny after you have put all ACLs in place.

In all cases, we should now apply our new default policy, so we can start bootstrapping the system.

$ sudo salt '*' state.apply
$ sudo salt 'consul-server-01' cmd.run "consul acl bootstrap"
consul-server-01:
    AccessorID:       <SOME_ID>
    SecretID:         <SOME_ID>
    Description:      Bootstrap Token (Global Management)
    Local:            false
    Create Time:      2021-03-27 16:24:02.075981348 +0000 UTC
    Policies:
       00000000-0000-0000-0000-000000000001 - global-management

Success! You’ll need to keep this token (the SecretID) secret and store it somewhere safe, as it’s essentially your cluster’s root password. It’s very useful in case of an emergency and not needed for regular operations other than bootstrapping the ACL system. In the rest of this article I’ll reference this value as <BOOTSTRAP_SECRET>.
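
A quick way to verify the bootstrap token works is to export it and list the tokens Consul knows about; at this point that should only be the anonymous token and your new bootstrap token:

$ ssh boss@192.168.0.10
boss@consul-server-01:~$ export CONSUL_HTTP_TOKEN=<BOOTSTRAP_SECRET>
boss@consul-server-01:~$ consul acl token list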

Define cluster ACL policies

For storing our ACL policies, let’s make a new directory at /srv/salt/data/acls. We’ll store all ACL policies here in text so we can version them. We can define a policy for each node here, but we don’t need to. We can make use of Salt’s smart features to use a default policy in case we don’t define a node-specific policy here.

Within Consul there are two types of tokens for each Consul node: an agent token, which is only used for internal agent actions, and a default token, which is used for everything else the node needs to do.

For us the agent policy has the same shape on every node. Let’s create a single templated agent policy at /srv/salt/data/acls/agent-default.hcl with:

node "{{ node_id }}" {
  policy = "write"
}
agent "{{ node_id }}" {
  policy = "write"
}

This gives each node’s agent the rights to write and update its own node information and its own agent information. That’s all it needs.

By default we give all nodes read access to the list of nodes and services by creating /srv/salt/data/acls/node-default.hcl:

node_prefix "" {
  policy = "read"
}
service_prefix "" {
  policy = "read"
}

But that’s not enough for our Nomad server nodes. Each individual Nomad server node is allowed to write or update its own node and service as well. So let’s create /srv/salt/data/acls/node-nomad-server-01.hcl with:

node "{{ node_id }}" {
  policy = "write"
}
node_prefix "" {
  policy = "read"
}

service "{{ node_id }}" {
  policy = "write"
}
service_prefix "" {
  policy = "read"
}

And create symlinks from /srv/salt/data/acls/node-nomad-server-02.hcl and /srv/salt/data/acls/node-nomad-server-03.hcl to /srv/salt/data/acls/node-nomad-server-01.hcl because they will be identical:

$ ( cd /srv/salt/data/acls && ln -s node-nomad-server-01.hcl node-nomad-server-02.hcl )
$ ( cd /srv/salt/data/acls && ln -s node-nomad-server-01.hcl node-nomad-server-03.hcl )

Our docker hosts will need some node-specific policies as well and are allowed to read their agent status and write their service status. So let’s create /srv/salt/data/acls/node-docker-01.hcl with:

node_prefix "" {
  policy = "read"
}

agent "{{ node_id }}" {
  policy = "read"
}

service "{{ node_id }}" {
  policy = "write"
}

service_prefix "" {
  policy = "read"
}

And create a symlink for our second docker host:

$ ( cd /srv/salt/data/acls && ln -s node-docker-01.hcl node-docker-02.hcl )

Ok. All node-specific policies are done. Luckily Consul also supports ACL roles to group role-specific policies in one place. Both our Nomad servers and Docker hosts will need some more access to work properly, but as that access is not node-specific (i.e. not dependent on the node’s name/ID), we can more cleanly put it into a role instead.

So let’s start with our Nomad server role in /srv/salt/data/acls/role-nomad-server.hcl:

service "nomad" {
  policy = "write"
}

agent_prefix "" {
  policy = "read"
}

Within Consul’s service discovery the Nomad servers will register and update the nomad service, so they should all be allowed to update it. And although this is not really documented, a Nomad server also needs to read its own agent status in order to start properly.

The Docker hosts need a lot more access, since they keep all the actual workloads in our cluster up and running. This policy will back the nomad-client role we create later on, so let’s create /srv/salt/data/acls/role-nomad-client.hcl with:

service "nomad-client" {
  policy = "write"
}

service_prefix "nomad-task" {
  policy = "write"
}

key_prefix "job_tokens" {
  policy = "read"
}

agent_prefix "" {
  policy = "read"
}

# Traefik
service_prefix "traefik" {
  policy = "write"
}

# Whoami
service_prefix "whoami" {
  policy = "write"
}

Our docker hosts will be allowed to update the nomad-client service in Consul, to update the running Nomad tasks, to read tokens we can make available to jobs via Consul KV, and to update the services traefik and whoami that we specified earlier.

And finally we allow the Traefik service to read and write keys under the traefik prefix in the Consul KV store and to create sessions, by creating /srv/salt/data/acls/service-traefik.hcl:

key_prefix "traefik" {
  policy = "write"
}
session "" {
  policy = "write"
}
node_prefix "" {
  policy = "read"
}
service_prefix "" {
  policy = "read"
}

Ok.. That should do it. Now let’s make sure your policies are delivered to your main Consul server by creating /srv/salt/consul_acls.sls:

{# we use the cluster node list defined in config.jinja below #}
{% from 'config.jinja' import cluster %}

/etc/consul.d/acls:
  file.directory:
    - makedirs: True
    - user: consul
    - group: consul
    - dir_mode: 755
    - file_mode: 644

consul acls:
  file.managed:
    - names:
{% for node_id in cluster.nodes %}
      - /etc/consul.d/acls/node-{{ node_id }}.hcl:
        - source:
          - salt://data/acls/node-{{ node_id }}.hcl
          - salt://data/acls/node-default.hcl
        - defaults:
            node_id: {{ node_id }}
      - /etc/consul.d/acls/agent-{{ node_id }}.hcl:
        - source:
          - salt://data/acls/agent-{{ node_id }}.hcl
          - salt://data/acls/agent-default.hcl
        - defaults:
            node_id: {{ node_id }}
{% endfor %}
{% for role in ['nomad-server', 'nomad-client'] %}
      - /etc/consul.d/acls/role-{{ role }}.hcl:
        - source: salt://data/acls/role-{{ role }}.hcl
{% endfor %}
{% for service in ['traefik'] %}
      - /etc/consul.d/acls/service-{{ service }}.hcl:
        - source: salt://data/acls/service-{{ service }}.hcl
{% endfor %}
    - user: consul
    - group: consul
    - mode: 644
    - template: jinja

This state makes sure there is a directory /etc/consul.d/acls with two policy files for each node (one default policy and one agent policy), a policy file for each role and for each service. Salt’s smart logic uses a node-specific policy file if it’s present and otherwise the default policy file.

We’ll need to define that cluster.nodes parameter first in /srv/salt/config.jinja:

{% set cluster = ({
    "nodes": ["consul-server-01", "consul-server-02", "consul-server-03",
              "nomad-server-01", "nomad-server-02", "nomad-server-03",
              "docker-01", "docker-02", "docker-03",
              "gluster-01", "gluster-02", "gluster-03"],
    "subnet": "192.168.0.0/24",
    "ip_start": "192.168.0.",
    "consul_server_ips": ["192.168.0.10", "192.168.0.11", "192.168.0.12"],
    "storage_server_mappings": { "gluster-01": "192.168.0.100", "gluster-02": "192.168.0.101", "gluster-03": "192.168.0.102" },
}) %}

Let’s only push these to our main Consul server node and enable this state by adding to /srv/salt/top.sls:

<SNIPPED>
  'consul-server-01':
    - consul_acls
<SNIPPED>

After we apply our new states, our node consul-server-01 will have all our ACL policy files:

$ sudo salt '*' state.apply
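
A quick listing on that node should now show all the generated policy files:

$ sudo salt 'consul-server-01' cmd.run 'ls /etc/consul.d/acls'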

Create the Consul policies and roles

Now you actually need to create the policies within Consul and create the relevant tokens for them so you can distribute them to your nodes. For the sake of brevity, let’s go to our actual node and perform the relevant commands there. First let’s create the default and agent policies for each node:

$ ssh boss@192.168.0.10
boss@consul-server-01:~$ export CONSUL_HTTP_TOKEN=<BOOTSTRAP_SECRET>
boss@consul-server-01:~$ for i in consul-server-01 consul-server-02 consul-server-03 nomad-server-01 nomad-server-02 nomad-server-03 docker-01 docker-02 gluster-01 gluster-02 gluster-03; do consul acl policy create -name agent-$i -rules @/etc/consul.d/acls/agent-$i.hcl; consul acl policy create -name default-$i -rules @/etc/consul.d/acls/node-$i.hcl; done
ID:           XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
Name:         agent-consul-server-01
Description:
Datacenters:
Rules:
node "consul-server-01" {
  policy = "write"
}
agent "consul-server-01" {
  policy = "write"
}

<SNIPPED>

That will create 22 policies. Now let’s create the role policies and actual roles we defined:

boss@consul-server-01:~$ for i in nomad-server nomad-client; do consul acl policy create -name role-$i -rules @/etc/consul.d/acls/role-$i.hcl;done
<SNIPPED>

boss@consul-server-01:~$ consul acl role create -description "Nomad servers" -name nomad-server -policy-name=role-nomad-server
ID:           XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
Name:         nomad-server
Description:  Nomad servers
Policies:
   XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX - role-nomad-server

boss@consul-server-01:~$ consul acl role create -description "Nomad clients" -name nomad-client -policy-name=role-nomad-client
ID:           XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
Name:         nomad-client
Description:  Nomad clients
Policies:
   XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX - role-nomad-client

And finally let’s create the service policy for the Traefik service that we’ll need:

boss@consul-server-01:~$ consul acl policy create -name service-traefik -rules @/etc/consul.d/acls/service-traefik.hcl
<SNIPPED>

Ok.. All policies are now loaded into Consul’s ACL system.
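
You can verify that everything arrived by listing the policies and roles, still using the bootstrap token from before:

boss@consul-server-01:~$ consul acl policy list
boss@consul-server-01:~$ consul acl role list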

Generate Consul ACL tokens

Now we need to generate both a default and an agent token for each node (taking a role into account where relevant), as well as our Traefik service token.

The agent tokens for each node are all generated the same way:

boss@consul-server-01:~$ for i in consul-server-01 consul-server-02 consul-server-03 nomad-server-01 nomad-server-02 nomad-server-03 docker-01 docker-02 gluster-01 gluster-02 gluster-03; do consul acl token create -description "$i agent token" -policy-name agent-$i; done
AccessorID:       XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
SecretID:         XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
Description:      consul-server-01 agent token
Local:            false
Create Time:      2021-03-27 19:56:40.447714661 +0000 UTC
Policies:
  XXXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX - agent-consul-server-01

<SNIPPED>

You’ll need to save all the SecretID agent token values for each node somewhere at this point as we’ll need them later when we distribute them to each node.

The default tokens for the Consul servers and the Gluster nodes are created in a similar way, but those for the Nomad servers and Docker hosts also need their role attached. So that will take three commands:

boss@consul-server-01:~$ for i in consul-server-01 consul-server-02 consul-server-03 gluster-01 gluster-02 gluster-03; do consul acl token create -description "$i default token" -policy-name default-$i; done
<SNIPPED>

boss@consul-server-01:~$ for i in nomad-server-01 nomad-server-02 nomad-server-03; do consul acl token create -description "$i default token" -policy-name default-$i -role-name nomad-server; done
<SNIPPED>

boss@consul-server-01:~$ for i in docker-01 docker-02; do consul acl token create -description "$i default token" -policy-name default-$i -role-name nomad-client; done
<SNIPPED>

As with the previous command, you will need to save all these SecretID default tokens as well.

And our final token is the Traefik service token:

boss@consul-server-01:~$ consul acl token create -description "traefik service token" -policy-name service-traefik
<SNIPPED>

For which you’ll need to save the SecretID as well.

Ok.. Now you should have a list of:

  • 11 Node default tokens
  • 11 Node agent tokens
  • 1 Traefik service token
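
If you lose track of a SecretID you don’t need to recreate the token: with the bootstrap token you can always look it up again via its AccessorID:

boss@consul-server-01:~$ consul acl token list
boss@consul-server-01:~$ consul acl token read -id <ACCESSOR_ID>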

Distribute Consul ACL tokens

The node tokens need to be distributed to their respective nodes. How you place or distribute these tokens on your nodes I’ll leave up to you. You could do it by hand, or you could use Pillar data and a Salt state to distribute them, but that’s a security design decision you’ll have to make for yourself.

The result of your distribution should be that each node contains a file called /etc/consul.d/tokens.hcl with file mode 0600 and the following content. Of course you need to replace the agent and default token values with the SecretID values that were generated by the previous commands:

acl = {
  tokens = {
    agent = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    default = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
  }
}

Now, back on the Salt Master, you can restart Consul on all your nodes so they can load their tokens and get back online now that the ACL system is in place:

$ sudo salt '*' cmd.run 'service consul restart'
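
If a node doesn’t come back up cleanly, the Consul logs are the place to look; operations blocked by ACLs show up clearly there. Assuming Consul runs as a systemd service (as set up earlier in this series), a quick peek looks like:

$ sudo salt '*' cmd.run 'journalctl -u consul -n 20 --no-pager'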

All the nodes should now be able to operate properly, but Traefik can no longer read its configuration from the Consul KV store or read the services and their tags from Consul’s service discovery. That’s why we made a special token for Traefik as well. For simplicity of this example series, and to show the impact ACLs have, I’ve chosen to distribute Traefik’s token via Consul KV. As you might remember, the Docker hosts’ role grants them read access to the job_tokens key prefix, which is where we’ll store Traefik’s service token. That way the Docker hosts can retrieve the token and pass it on to Traefik, which can then use it to read its configuration.

Let’s start by adding the value to the Consul Key Value store:

$ ssh boss@192.168.0.10
boss@consul-server-01:~$ export CONSUL_HTTP_TOKEN=<BOOTSTRAP_SECRET>
boss@consul-server-01:~$ consul kv put job_tokens/traefik <TRAEFIK_SERVICE_SECRET_ID>
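
A quick read-back, still with the bootstrap token exported, confirms the value is stored:

boss@consul-server-01:~$ consul kv get job_tokens/traefik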

We already made sure only the Docker hosts can read that service token back, so now let’s adapt Traefik’s job file to receive the token and set it as an environment variable inside its Docker container, by adding the following template stanza to the traefik task after its existing configuration template stanza within /srv/salt/data/plans/traefik.nomad:

job "traefik" {
<SNIPPED>

    task "traefik" {
<SNIPPED>

    template {
        data = <<EOF
CONSUL_HTTP_TOKEN = "{{ key "job_tokens/traefik" }}"
EOF

        destination = "secrets/file.env"
        env         = true
      }
    }
  }
}

Now redeploy Traefik, and everything should be back up:

$ sudo salt 'nomad-server-01' state.apply
$ sudo salt 'nomad-server-01' cmd.run "nomad job plan /home/boss/plans/traefik.nomad"

<SNIPPED>

    nomad job run -check-index XXXX /home/boss/plans/traefik.nomad

<SNIPPED>

And then run the job with the correct index (replace XXXX):

$ sudo salt 'nomad-server-01' cmd.run "nomad job run -check-index XXXX /home/boss/plans/traefik.nomad"
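
You can follow the deployment and check that Traefik’s allocation becomes healthy with something like:

$ sudo salt 'nomad-server-01' cmd.run "nomad job status traefik"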

After a minute or so, all your dashboards should be back up, and everything deployed as planned. Great work!

Nomad security

Nomad gossip encryption

Similar to Consul, you can encrypt the Nomad servers’ gossip protocol with symmetric encryption. Setup is simple: first generate a symmetric key, then add that key to the configuration of all Nomad servers.

To generate a new gossip encryption key run:

$ sudo salt 'nomad-server-01' cmd.run 'nomad operator keygen'
nomad-server-01:
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX=

Then add this key to /srv/salt/nomad.d/server.hcl:

server {
  enabled = true
  bootstrap_expect = 3
  encrypt = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX="
}

Then apply the config change to all our Nomad server nodes.

$ sudo salt 'nomad-server-*' state.apply
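
To confirm the Nomad servers still find each other after picking up the new gossip key, a quick look at the server members can’t hurt:

$ sudo salt 'nomad-server-01' cmd.run 'nomad server members'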

And that’s that. Well done!

Nomad ACL security

Just as with Consul, Nomad supports ACLs. Configuring ACLs and policies for Nomad is simpler, though: in Consul every agent needs a token, while Nomad’s tokens are more persona-based. You need a token to perform an action in Nomad or to read back a status, but the Nomad agents themselves don’t need tokens to communicate with each other.

So we just need to enable the ACL system within Nomad and then generate a management token for you to operate the cluster with. From that point on, anyone without a token cannot perform any commands against Nomad.

Let’s start by enabling Nomad’s ACL system in /srv/salt/nomad.d/nomad.hcl:

datacenter = "dc1"
data_dir = "/opt/nomad/data"

acl {
  enabled = true
}

Then apply the config change to all our nodes, and check if we can still access Nomad’s job status:

$ sudo salt '*' state.apply
$ sudo salt 'docker-01' cmd.run 'nomad job status -address=http://192.168.0.30:4646'
Error querying jobs: Unexpected response code: 403 (Permission denied)

Great. Now create the bootstrap token:

$ sudo salt 'nomad-server-01' cmd.run 'nomad acl bootstrap'
Accessor ID  = XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
Secret ID    = XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
Name         = Bootstrap Token
Type         = management
Global       = true
Policies     = n/a
Create Time  = 2021-04-06 19:03:12.773503056 +0000 UTC
Create Index = 32444
Modify Index = 32444
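
From now on every Nomad command needs a token, for example via the -token flag or the NOMAD_TOKEN environment variable. Repeating the earlier (denied) status command with the bootstrap token’s SecretID should work again:

$ sudo salt 'docker-01' cmd.run 'nomad job status -token=<SECRET_ID> -address=http://192.168.0.30:4646'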

If you need more granular policies within your cluster, you’ll need to define policies and create tokens for them as well, but for this series, we are done here.

Nomad TLS Security

Just like Consul, Nomad supports TLS security using certificates, and we can use Consul’s built-in CA functionality to generate the Nomad certificates as well. The big difference is that you only need a single certificate for all server nodes and one for all client nodes; individual certificates per server or client don’t add any additional security. So let’s create our CA, one server certificate and one client certificate:

$ ( cd /srv/data/ca && consul tls ca create -common-name="Nomad Agent CA" -domain=nomad )
==> Saved nomad-agent-ca.pem
==> Saved nomad-agent-ca-key.pem
$ ( cd /srv/data/ca && consul tls cert create -server -domain=nomad -dc=global )
==> WARNING: Server Certificates grants authority to become a
    server and access all state in the cluster including root keys
    and all ACL tokens. Do not distribute them to production hosts
    that are not server nodes. Store them as securely as CA keys.
==> Using nomad-agent-ca.pem and nomad-agent-ca-key.pem
==> Saved global-server-nomad-0.pem
==> Saved global-server-nomad-0-key.pem
$ ( cd /srv/data/ca && consul tls cert create -client -domain=nomad -dc=global )
==> Using nomad-agent-ca.pem and nomad-agent-ca-key.pem
==> Saved global-client-nomad-0.pem
==> Saved global-client-nomad-0-key.pem

Ok.. Now we have a CA key and certificate, one server certificate and one client certificate.

Warning: It’s never a good plan to put keys and certificates in a repository. So, if you plan to put /srv/ in a repository, add /srv/data to .gitignore!

Now it’s time to distribute the right key and certificate to the right node type: the server certificate and key go to the Nomad servers, and the client certificate and key to the Docker hosts.

Similar to the Consul certificates, we’ll need to create pillars for this: one for the client information and one for the server information. The client one goes into /srv/pillar/data/nomad_client_tls.sls:

nomad_client_tls:
  user: nomad
  group: nomad
  cert:
    filename: 'client.pem'
    mode: 640
    contents: |
      -----BEGIN CERTIFICATE-----
      <SNIPPED>
      -----END CERTIFICATE-----      
  key:
    filename: 'client-key.pem'
    mode: 640
    contents: |
      -----BEGIN EC PRIVATE KEY-----
      <SNIPPED>
      -----END EC PRIVATE KEY-----      
  ca:
    filename: 'nomad-ca.pem'
    mode: 640
    contents: |
      -----BEGIN CERTIFICATE-----
      <SNIPPED>
      -----END CERTIFICATE-----      

And the server one in /srv/pillar/data/nomad_server_tls.sls:

nomad_server_tls:
  user: nomad
  group: nomad
  cert:
    filename: 'server.pem'
    mode: 640
    contents: |
      -----BEGIN CERTIFICATE-----
      <SNIPPED>
      -----END CERTIFICATE-----      
  key:
    filename: 'server-key.pem'
    mode: 640
    contents: |
      -----BEGIN EC PRIVATE KEY-----
      <SNIPPED>
      -----END EC PRIVATE KEY-----      
  ca:
    filename: 'nomad-ca.pem'
    mode: 640
    contents: |
      -----BEGIN CERTIFICATE-----
      <SNIPPED>
      -----END CERTIFICATE-----      

And then add them to /srv/pillar/top.sls:

<snipped everything before>
  'nomad-*':
    - data/nomad_server_tls
  'docker-*':
    - data/nomad_client_tls
<snipped everything after>

Moving on to the states, we add one new state for the Nomad server certificates at the end of /srv/salt/nomad_server.sls:

<SNIPPED>

nomad server certificate files:
  file.managed:
    - names:
{% for item in ['cert', 'key', 'ca'] %}
      - /etc/nomad.d/{{ pillar['nomad_server_tls'][item]['filename'] }}:
        - contents_pillar: nomad_server_tls:{{ item }}:contents
        - user: {{ pillar['nomad_server_tls']['user'] }}
        - group: {{ pillar['nomad_server_tls']['group'] }}
        - mode: {{ pillar['nomad_server_tls'][item]['mode'] }}
{% endfor %}
    - watch_in:
      - service: nomad running

And similarly you can then add one new state for the Nomad Client certificate in /srv/salt/nomad_client.sls at the end:

<SNIPPED>

nomad client certificate files:
  file.managed:
    - names:
{% for item in ['cert', 'key', 'ca'] %}
      - /etc/nomad.d/{{ pillar['nomad_client_tls'][item]['filename'] }}:
        - contents_pillar: nomad_client_tls:{{ item }}:contents
        - user: {{ pillar['nomad_client_tls']['user'] }}
        - group: {{ pillar['nomad_client_tls']['group'] }}
        - mode: {{ pillar['nomad_client_tls'][item]['mode'] }}
{% endfor %}
    - watch_in:
      - service: nomad running

Now for the actual configuration files. Let’s add the TLS configuration as required in /srv/salt/nomad.d/server.hcl by appending:

tls {
  http = true
  rpc  = true

  ca_file   = "/etc/nomad.d/nomad-ca.pem"
  cert_file = "/etc/nomad.d/server.pem"
  key_file  = "/etc/nomad.d/server-key.pem"

  verify_server_hostname = true
  verify_https_client    = true
}

And similarly for the Nomad clients in /srv/salt/nomad.d/client.hcl by appending:

tls {
  http = false
  rpc  = true

  ca_file   = "/etc/nomad.d/nomad-ca.pem"
  cert_file = "/etc/nomad.d/client.pem"
  key_file  = "/etc/nomad.d/client-key.pem"

  verify_server_hostname = true
  verify_https_client    = true
}

As you can see, we don’t use HTTPS for the UI / Dashboard on the docker nodes. It requires a lot of Traefik hassle with serversTransports to get it working, and for now that’s too much. If you do get it to work, leave your Traefik dynamic config below in the comments and I’ll add it.

Now finally apply our new config:

$ sudo salt '*' saltutil.refresh_pillar
$ sudo salt '*' state.apply

Now all communication between our Nomad nodes is TLS protected, and it’s not possible anymore to turn a compromised Nomad client into a new server.
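
If you want to see the difference for yourself, you can poke the HTTP API on one of the server nodes (assuming curl is present there): the plain HTTP request should be rejected, while the HTTPS request with the certificate returns a proper answer. Depending on your Nomad version, some endpoints may additionally ask for an ACL token.

$ sudo salt 'nomad-server-01' cmd.run "curl -s http://localhost:4646/v1/agent/health"
$ sudo salt 'nomad-server-01' cmd.run "curl -s --cacert /etc/nomad.d/nomad-ca.pem --cert /etc/nomad.d/server.pem --key /etc/nomad.d/server-key.pem https://localhost:4646/v1/agent/health"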

Gluster security

Gluster natively supports only two types of security: TLS, if you assign all nodes a certificate, and IP-based access control.

Without any further configuration, all systems with access to the private network can mount the Gluster volumes. From a containment perspective we want to limit mounting of Gluster volumes to just the Docker hosts.

To enable IP-based access control you run a single command on one of the Gluster server nodes to limit access to a list of IP addresses. To limit access to our 3 Gluster nodes and 2 Docker nodes, run:

$ sudo salt '*' cmd.run "gluster volume set volume_name auth.allow 192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.30,192.168.0.31"

If you want to go a step further you can also require certificates and TLS for all communication. But that is outside of the scope of this article.

Conclusion

This concludes (at least for now) this series on Avoiding the Cloud. Any questions or requests? Dump them in the comments below and I might use that for a follow-up part in this Series.
