Kunal Virwal

Maintaining a Kubernetes cluster is a complex process where a million things can go wrong at any time. Imagine your cluster is showing high latency: how will you know what the problem is? Is it a faulty deployment, or is it old RAM sticks? Diagnosing such issues can waste a lot of time, only for you to realise that some misconfiguration was causing excessive resource utilization. To make fault isolation easier, we deploy monitoring for our cluster.

K3s monitoring Grafana Dashboard

Monitoring for a cluster can be deployed in many ways, the easiest of which is probably the Helm chart for prometheus-operator, more recently known as kube-prometheus-stack. This approach is easy and requires minimal configuration, but it might not be a perfect fit for homelabbers who can't run heavy workloads or who want to handle things themselves. The alternative is to host your Prometheus server externally, outside the cluster, possibly on another node connected over Tailscale.

Creating External Monitoring Setup

Let's get the constraints for this project straight. We have multiple nodes running a K3s cluster, which we want to monitor for both node and cluster metrics. Let's also assume the nodes don't have publicly routable IP addresses, a common situation for homelabbers, and are therefore connected over Tailscale.

First, we need to set up an external suite of containers that collects useful metrics from the cluster and from any non-K3s workloads, such as standalone Docker containers. These include Prometheus, Grafana, cAdvisor, and Node-exporter. The only downside I see compared to the prometheus-operator Helm chart is that the chart runs Node-exporter as a DaemonSet, so you don't have to deploy Node-exporter manually on each node.

To deploy the monitoring suite, we can use a single Docker Compose file, which also lets us use container names as hostnames for routing requests in prometheus.yml, because all containers in a Compose project share a common bridge network.

./docker-compose.yaml
version: "2"
services:
# This container should only be used if you want to scrape metrics from your tailscale nodes
prometheus-sidecar:
image: tailscale/tailscale:latest
hostname: prometheus-sidecar
environment:
- TS_AUTHKEY=tskey-client-notAReal-OAuthClientSecret1Atawk
- "TS_EXTRA_ARGS=--advertise-tags=tag:prometheus --reset" # for an OAuth token the joining container will be by default epimeral and have key expiry disabled.
- TS_STATE_DIR=/var/lib/tailscale
- TS_USERSPACE=false
volumes:
- ./prometheus-sidecar/config:/config
- ./prometheus-sidecar/state:/var/lib/tailscale
- /dev/net/tun:/dev/net/tun
cap_add:
- net_admin
- sys_module
restart: unless-stopped
prometheus:
image: prom/prometheus:latest
container_name: prometheus
restart: unless-stopped
network_mode: service:prometheus-sidecar # use network_mode and depends_on field only if using a tailscale sidecar containers
depends_on:
- prometheus-sidecar
command:
- '--config.file=/etc/prometheus/prometheus.yml'
# ports: # use this only if not using for scraping tailscale nodes with sidecar container
# - 7070:9090
volumes:
- ./prometheus-dev:/etc/prometheus
- ./prometheus-data:/prometheus
grafana:
image: grafana/grafana:latest
container_name: grafana
restart: unless-stopped
ports:
- 3030:3000
volumes:
- ./grafana:/var/lib/grafana
node_exporter:
image: quay.io/prometheus/node-exporter:latest
container_name: node_exporter
restart: unless-stopped
pid: host
command:
- '--path.rootfs=/host'
volumes:
- '/:/host:ro,rslave'
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
restart: unless-stopped
devices:
- /dev/kmsg
ports:
- 5050:8080
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro

A few things to note: cAdvisor and Node-exporter need the root filesystem mounted because they collect system-level metrics from it. The prometheus-sidecar container connects the Prometheus container to your tailnet with tag:prometheus, a custom tag you can reference in Tailscale ACLs to restrict which resources this container can access in your tailnet.

When using multiple nodes, you only need to deploy Prometheus and Grafana on one node, while Node-exporter and cAdvisor go on every node. The benefit of this setup is that you can now monitor both K3s and Docker containers, because K3s has its own built-in cAdvisor, which only exports metrics for K3s containers. If you don't want to monitor any Docker containers, you can omit the cAdvisor parts from the Docker Compose file and from prometheus.yml.

The OAuth client you use here should have the DNS read, Devices Core write, and Auth Keys write scopes.

If you have followed all the steps, you should be able to access the Prometheus UI at prometheus-sidecar:9090 and Grafana at nodeIP:3030 from inside your tailnet.
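
As a quick sanity check, assuming MagicDNS is enabled so the sidecar's hostname resolves inside your tailnet, you can probe both UIs from any other tailnet machine:

Terminal window
# Prometheus exposes a health endpoint through the Tailscale sidecar
curl -s http://prometheus-sidecar:9090/-/healthy
# Grafana is published directly on the host port 3030
curl -s -o /dev/null -w "%{http_code}\n" http://<nodeIP>:3030/login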

Sidecar container added in the Tailnet

K3s Setup for Exporting Metrics to Prometheus

Before configuring Prometheus, we need to set up K3s to export all of its metrics. By default, K3s only exposes the metrics of the kubelet and the kube-apiserver. To expose the metrics of kube-controller-manager and kube-scheduler, we need to start K3s with the following values, which can be added to /etc/rancher/k3s/config.yaml.

/etc/rancher/k3s/config.yaml
# This config file is present in /etc/rancher/k3s/
# It defines the options the cluster should run with.
# If any change is made to this config, the k3s service must be restarted.
kube-controller-manager-arg:
  - "bind-address=0.0.0.0"
kube-scheduler-arg:
  - "bind-address=0.0.0.0"
tls-san:
  - "localhost"
  - "127.0.0.1"
  - "<Node-TailnetIP>"
  - "<Node-Tailnet-Hostname>"
etcd-expose-metrics: true
# etcd metrics are exported over HTTP on port 2381
etcd-arg:
  - "listen-metrics-urls=http://127.0.0.1:2381,http://<Node-TailnetIP>:2381" # for metrics
node-ip: <Node-TailnetIP>
node-external-ip: <Node-TailnetIP>
advertise-address: <Node-TailnetIP>
flannel-iface: tailscale0

These settings make the metrics of all control plane and node-level components accessible on every IP interface so that Prometheus can scrape them. This might be fine for dev setups, but in prod, make sure to bind them to the Tailscale interface, or even the loopback interface, instead. In either case, these endpoints still require a TLS certificate and an SA token to connect.

You can check /var/lib/rancher/k3s/server/db/etcd/config to see the state of etcd and where it is listening for metrics. If these settings don't apply even after restarting K3s, try deleting /var/lib/rancher/k3s/server/db to reset the cluster, but note that any configuration previously done on the cluster will be permanently lost.
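
After editing the config, restart K3s and spot-check one of the newly exposed endpoints. etcd is the easiest to verify since it serves plain HTTP; the sketch below assumes K3s runs as a systemd service:

Terminal window
sudo systemctl restart k3s
# etcd metrics should now be reachable over the tailnet without any auth
curl -s http://<Node-TailnetIP>:2381/metrics | head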

Configuring RBAC in K3s

Now that we have exposed the endpoints for Prometheus to scrape, we still need to authenticate with the cluster to access any of the metrics. To do so, we need to configure RBAC (Role-Based Access Control).

Service Account

To keep things clean, we will create a new namespace, monitoring, for all Prometheus-related components. The following YAML file creates both the namespace and a Service Account in it. Service Accounts are used in Kubernetes to identify non-user agents, like Prometheus, that need to access the cluster's resources.

./namespace-and-sa.yaml
# This namespace is created for all monitoring related components.
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
# This service account is for the external prometheus server to scrape k3s metrics.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prom-ext-api-user
  namespace: monitoring
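
Apply the manifest and confirm that the namespace and service account exist, for example:

Terminal window
kubectl apply -f namespace-and-sa.yaml
kubectl -n monitoring get serviceaccount prom-ext-api-user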

ClusterRole

We need to give this service account read-only access to the resources in our cluster, so we define a ClusterRole that grants it cluster-wide permission to scrape metrics. We could also use the pre-built cluster-admin role, but that is like root for the cluster, with full access to create, read, update, and delete every resource. That is not only overkill but also a security risk.

./prometheus-cluster-role.yaml
# This cluster role grants cluster-wide read-only access to the resources needed by the external prometheus.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-external
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]

The ClusterRole defined in this file grants access to metrics related to nodes, services, and pods, as well as the /metrics endpoint of the kube-apiserver.

The metrics under nodes/metrics are considered a subresource, since nodes themselves are regular API objects that can also be accessed with kubectl. The /metrics endpoint of the kube-apiserver, however, is not considered a resource, as it is only accessible by making an HTTPS request and exports data in Prometheus format. That is why it is included under nonResourceURLs.

The get, list, and watch verbs provide read-only access to resources. Kubernetes resources are grouped into API groups, which is why the apiVersion here is rbac.authorization.k8s.io/v1, the RBAC group, and not just v1.
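
If you are curious which resources live in that group, kubectl can list them:

Terminal window
kubectl api-resources --api-group=rbac.authorization.k8s.io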

ClusterRoleBindings

Now we create a ClusterRoleBinding between the ClusterRole and the service account to actually grant the read access. As defined in the YAML, it links prom-ext-api-user to the prometheus-external ClusterRole, both of which we defined earlier.

./prometheus-cluster-role-binding.yaml
# This cluster role binding is between prom-ext-api-user in the monitoring namespace and the prometheus-external
# cluster role, to grant read access of cluster resources to the external prometheus.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-external
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-external
subjects:
  - kind: ServiceAccount
    name: prom-ext-api-user
    namespace: monitoring
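
After applying the ClusterRole and ClusterRoleBinding, you can verify the permissions by impersonating the service account, assuming you run kubectl with the cluster-admin kubeconfig that K3s generates:

Terminal window
kubectl apply -f prometheus-cluster-role.yaml -f prometheus-cluster-role-binding.yaml
# The service account should be able to read resources but not modify them
kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prom-ext-api-user
kubectl auth can-i delete pods --as=system:serviceaccount:monitoring:prom-ext-api-user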

Service Account token and CA certificate

Lastly, we need the service account token and CA cert to authenticate as the service account.

In Kubernetes v1.24 and later, Secrets are not automatically created when a ServiceAccount is created. My k3s version is based on Kubernetes v1.32, so we will need to manually create a Secret for our service account.

./sa-secret.yaml
# This is a secret to access the prom-ext-api-user ServiceAccount in the monitoring namespace.
apiVersion: v1
kind: Secret
metadata:
  name: prom-ext-api-user-token
  namespace: monitoring
  annotations:
    kubernetes.io/service-account.name: prom-ext-api-user
type: kubernetes.io/service-account-token

We can describe the Secret to get the base64-encoded SA token, or we can extract the decoded token directly into a file kube-sa-token using:

Terminal window
kubectl -n monitoring get secret prom-ext-api-user-token -o jsonpath='{.data.token}' | base64 -d > kube-sa-token

Similarly, the CA certificate can be extracted into a file ca.crt using:

Terminal window
kubectl -n monitoring get secret prom-ext-api-user-token -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt

Now we are done with the cluster-side configuration.
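
As an optional spot-check from the monitoring host, you can hit the kube-apiserver's /metrics endpoint with the extracted token and CA cert. This assumes <MasterNode> is the tailnet IP or hostname you listed under tls-san, so the certificate validates:

Terminal window
curl -s --cacert ca.crt -H "Authorization: Bearer $(cat kube-sa-token)" https://<MasterNode>:6443/metrics | head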

Prometheus Configuration

Now we need to configure Prometheus to pull metrics from the cluster, the additional cAdvisors, and the Node-exporters. Prometheus is configured through a prometheus.yml file, in which we define the jobs we want it to scrape.

./prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      - targets: ["node_exporter:9100"]
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]
  - job_name: "k3s" # kube-apiserver metrics
    scheme: https
    metrics_path: /metrics
    tls_config:
      ca_file: /etc/prometheus/ca.crt
    bearer_token_file: /etc/prometheus/kube-sa-token
    static_configs:
      - targets: ["<MasterNode>:6443"] # master node IP:6443
  - job_name: "kube-cadvisor" # internal cAdvisor managed by the kubelets
    scheme: https
    metrics_path: /metrics/cadvisor
    tls_config:
      ca_file: /etc/prometheus/ca.crt
    bearer_token_file: /etc/prometheus/kube-sa-token
    static_configs:
      - targets: ["<WorkerNodes>:10250"] # all worker nodes running the kubelet; the kubelet exposes metrics on port 10250
  - job_name: "kube-scheduler"
    scheme: https
    metrics_path: /metrics
    tls_config:
      ca_file: /etc/prometheus/ca.crt
    bearer_token_file: /etc/prometheus/kube-sa-token
    static_configs:
      - targets: ["<MasterNode>:10259"] # master nodes run kube-scheduler, which exposes its metrics on port 10259
  - job_name: "kube-controller-manager"
    scheme: https
    metrics_path: /metrics
    tls_config:
      ca_file: /etc/prometheus/ca.crt
    bearer_token_file: /etc/prometheus/kube-sa-token
    static_configs:
      - targets: ["<MasterNode>:10257"] # master nodes run kube-controller-manager, which exposes its metrics on port 10257
  - job_name: "etcd"
    scheme: http # etcd metrics are served over HTTP on port 2381
    metrics_path: /metrics
    static_configs:
      - targets: ["<MasterNode>:2381"] # master nodes run embedded etcd, whose metrics are exposed on port 2381

Now we can copy the prometheus.yml, ca.crt, and kube-sa-token files into the prometheus-dev/ directory (relative to the Docker Compose file), which is mounted into the Prometheus container as /etc/prometheus.
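
Assuming the three files sit next to docker-compose.yaml, that is just:

Terminal window
cp prometheus.yml ca.crt kube-sa-token ./prometheus-dev/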

On starting the container stack, Prometheus will begin fetching metrics from all these sources, and they will show up in the Prometheus UI. NOTE: We are running a separate cAdvisor for Docker containers, so in all the Grafana dashboards set up for K3s pods, add a job filter with the value "kube-cadvisor", while Docker-specific dashboards use the value "cadvisor" in their PromQL queries.
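
To bring the stack up and confirm that every scrape job is healthy, you can query the Prometheus HTTP API. This sketch assumes the Docker Compose v2 CLI and jq are available, and that you are on the tailnet so prometheus-sidecar resolves:

Terminal window
docker compose up -d
# List each scrape job and its health (up/down)
curl -s http://prometheus-sidecar:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'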

Using job=”kube-cadvisor” in promQL queries

Congrats!🎉 You have now set up a robust external monitoring system for your K3s cluster.
