Monitoring etcd with Nightingale

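etcd is a critical component and dependency of the Kubernetes control plane. The state of every Kubernetes component is stored in etcd, so monitoring etcd matters as well.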

etcd metrics are usually collected in one of three ways:

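  • Use the agent on the host where etcd runs. etcd runs as a static Pod, so the host agent can connect to it directly.
  • Run the collector as a sidecar alongside etcd.
  • Use service discovery with a centrally deployed collector, the same approach used for apiserver and controller-manager.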

etcd has a built-in /metrics endpoint served on port 2379, but accessing it requires certificates, as shown in the figure:

In a cluster installed with kubeadm, the certificates are usually under /etc/kubernetes/pki/etcd. With those certificates and categraf's prometheus plugin you can collect the data.
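Before configuring a collector, it can help to fetch the certificate-protected endpoint by hand. The following is a minimal sketch using only the Python standard library, not the categraf configuration itself. The file names ca.crt, healthcheck-client.crt and healthcheck-client.key are assumed to be the kubeadm defaults under /etc/kubernetes/pki/etcd, and the helper names metrics_url and fetch_metrics are made up for illustration.

```python
import ssl
import urllib.request

# Assumed kubeadm default location; adjust if your cluster differs.
PKI_DIR = "/etc/kubernetes/pki/etcd"


def metrics_url(host: str, port: int = 2379, scheme: str = "https") -> str:
    """Build the /metrics URL for an etcd member."""
    return f"{scheme}://{host}:{port}/metrics"


def fetch_metrics(host: str = "127.0.0.1", pki_dir: str = PKI_DIR) -> str:
    """Fetch etcd metrics over mutual TLS and return the raw text."""
    # Trust the etcd CA and present a client certificate signed by it.
    ctx = ssl.create_default_context(cafile=f"{pki_dir}/ca.crt")
    ctx.load_cert_chain(
        f"{pki_dir}/healthcheck-client.crt",
        f"{pki_dir}/healthcheck-client.key",
    )
    with urllib.request.urlopen(metrics_url(host), context=ctx, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_metrics() on the control-plane node that hosts etcd should return the same text a collector would scrape. A certificate error at this stage usually points to a wrong path or a host name that is not in the server certificate.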

Besides port 2379, etcd also exposes metrics on port 2381, which does not require certificates. Check etcd.yaml, as shown in the figure:

cat /etc/kubernetes/manifests/etcd.yaml

Testing against this address shows that data can already be collected. Note that the scheme is http, as shown in the figure:

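1. This example again uses service discovery to collect the data. Continue editing the prometheus.yml file used for apiserver and controller-manager, and add the etcd job as follows: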

- job_name: 'etcd'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: http
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: kube-system;etcd;http

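Note: the three values in source_labels above correspond, in order, to the three values in regex: the namespace is kube-system, the Service name is etcd, and the endpoint port name is http.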
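The matching logic of this keep rule can be sketched in a few lines of Python. This is an illustration of how the rule behaves, not Prometheus code: the source label values are joined with the default separator ";" and the regex has to match the whole joined string. The function name keep_target and the kube-dns example target are made up for the demonstration, and Python's re module stands in for the RE2 engine Prometheus uses, which makes no difference for a literal pattern like this one.

```python
import re


def keep_target(labels: dict, source_labels: list, regex: str, separator: str = ";") -> bool:
    """Return True if a discovered target survives a `keep` relabel rule."""
    # The source label values are joined with the separator ...
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # ... and the regex must match the whole joined string (it is anchored).
    return re.fullmatch(regex, joined) is not None


SOURCE_LABELS = [
    "__meta_kubernetes_namespace",
    "__meta_kubernetes_service_name",
    "__meta_kubernetes_endpoint_port_name",
]
REGEX = "kube-system;etcd;http"

etcd_target = {
    "__meta_kubernetes_namespace": "kube-system",
    "__meta_kubernetes_service_name": "etcd",
    "__meta_kubernetes_endpoint_port_name": "http",
}
# Same namespace and port name, but a different Service (hypothetical).
dns_target = dict(etcd_target, __meta_kubernetes_service_name="kube-dns")

print(keep_target(etcd_target, SOURCE_LABELS, REGEX))  # True
print(keep_target(dns_target, SOURCE_LABELS, REGEX))   # False
```

Only endpoints whose namespace, Service name and port name all match are kept, which is why the Service and port name have to exist exactly as written in the regex.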

2. The Service named etcd and the endpoint port name used above do not exist yet, so create them as follows:

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: etcd
  labels:
    k8s-app: etcd
spec:
  selector:
    component: etcd                    # this label is already present on the etcd Pod from installation
  type: ClusterIP
  clusterIP: None
  ports:
    - name: http
      port: 2381
      targetPort: 2381
      protocol: TCP

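3. Edit the etcd manifest and change the metrics listen address (--listen-metrics-urls, which kubeadm sets to http://127.0.0.1:2381) to 0.0.0.0 so that the port is reachable from other Pods, as shown in the figure:

vim /etc/kubernetes/manifests/etcd.yaml

4. Apply the resource file, then restart the Prometheus agent-mode Pod so that it loads the new configuration:

kubectl apply -f prometheus.yml
kubectl rollout restart deploy prometheus-agent -n monitor

5. Check the metrics on the instant query page of the Nightingale console, as shown in the figure: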

Note: if no metrics show up, check whether the etcd job has been loaded into /etc/prometheus/prometheus.yml inside the Prometheus agent container (the file is mounted from a ConfigMap).

Appendix:

If several jobs expose metrics with the same name, you can tell them apart by the job name, as shown in the figure:

Key metrics:

# HELP etcd_server_is_leader Whether or not this member is a leader. 1 if is, 0 otherwise.
# TYPE etcd_server_is_leader gauge
Whether this member is the leader: 1 means leader, 0 means not the leader (for example, a follower).

# HELP etcd_server_health_success The total number of successful health checks
# TYPE etcd_server_health_success counter
Number of successful etcd server health checks.

# HELP etcd_server_health_failures The total number of failed health checks
# TYPE etcd_server_health_failures counter
Number of failed etcd server health checks.

# HELP etcd_disk_defrag_inflight Whether or not defrag is active on the member. 1 means active, 0 means not.
# TYPE etcd_disk_defrag_inflight gauge
Whether defragmentation is running on this member: 1 means running, 0 means not running.

# HELP etcd_server_snapshot_apply_in_progress_total 1 if the server is applying the incoming snapshot. 0 if none.
# TYPE etcd_server_snapshot_apply_in_progress_total gauge
Whether the server is applying an incoming snapshot: 1 means in progress, 0 means not.

# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
Number of leader changes seen in the cluster.
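Because this metric is a counter, its raw value matters less than how much it grows over a time window. The sketch below shows one simple way to compute that growth from a series of scraped values while tolerating counter resets after an etcd restart. It is a simplified illustration, not the exact behaviour of the PromQL increase() function (which also extrapolates to the window boundaries), and the function name counter_increase and the sample values are made up.

```python
def counter_increase(samples: list) -> float:
    """Total increase across time-ordered counter samples, tolerating resets."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            # Normal case: the counter only moved forward.
            total += cur - prev
        else:
            # A drop means the process restarted and counting began again at 0.
            total += cur
    return total


# Hypothetical scrapes of etcd_server_leader_changes_seen_total.
print(counter_increase([3, 3, 4, 6]))  # 3.0
print(counter_increase([5, 7, 1, 2]))  # 4.0
```

A steadily growing result over short windows suggests an unstable cluster. What counts as too many leader changes depends on the environment and is left to the reader.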

# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
Total number of gRPC calls completed.

# HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by WAL.
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
Latency of WAL fsync calls in etcd.

# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
Total number of failed etcd proposals (a proposal is one request carried through the Raft protocol).

# HELP etcd_server_proposals_pending The current number of pending proposals to commit.
# TYPE etcd_server_proposals_pending gauge
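Current number of etcd proposals waiting to be committed. This is a gauge of the present backlog, not a running total (a proposal is one request carried through the Raft protocol).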

# HELP etcd_server_read_indexes_failed_total The total number of failed read indexes seen.
# TYPE etcd_server_read_indexes_failed_total counter
Number of failed read index requests (read index is the Raft mechanism etcd uses to serve linearizable reads).

# HELP etcd_server_slow_read_indexes_total The total number of pending read indexes not in sync with leader's or timed out read index requests.
# TYPE etcd_server_slow_read_indexes_total counter
Number of read index requests that were pending because they were not in sync with the leader's, or that timed out.

# HELP etcd_server_quota_backend_bytes Current backend storage quota size in bytes.
# TYPE etcd_server_quota_backend_bytes gauge
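Current backend storage quota (the upper limit on the db size).
Adjust it with the quota-backend-bytes parameter. The default is 2 GB, and the official recommendation is not to exceed 8 GB.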

# HELP etcd_mvcc_db_total_size_in_bytes Total size of the underlying database physically allocated in bytes.
# TYPE etcd_mvcc_db_total_size_in_bytes gauge
Size of the db physically allocated by etcd (space in use plus free space).

# HELP etcd_mvcc_db_total_size_in_use_in_bytes Total size of the underlying database logically in use in bytes.
# TYPE etcd_mvcc_db_total_size_in_use_in_bytes gauge
Size of the etcd db logically in use.
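The quota and the two size metrics are most useful when read together. As a rough sketch, dividing the allocated size by the quota shows how close the db is to its limit, and subtracting the in-use size from the allocated size estimates how much space is free inside the db file. The function name db_usage and the sample numbers below are made up for illustration.

```python
def db_usage(quota_bytes: float, total_bytes: float, in_use_bytes: float) -> dict:
    """Relate etcd db size metrics to the backend quota.

    quota_bytes   <- etcd_server_quota_backend_bytes
    total_bytes   <- etcd_mvcc_db_total_size_in_bytes
    in_use_bytes  <- etcd_mvcc_db_total_size_in_use_in_bytes
    """
    return {
        # Fraction of the quota taken by the physically allocated db.
        "quota_used_ratio": total_bytes / quota_bytes,
        # Allocated but logically unused space inside the db file.
        "free_in_db_bytes": total_bytes - in_use_bytes,
    }


GIB = 1024 ** 3
# Hypothetical values: 2 GiB quota, 1 GiB allocated, 0.5 GiB in use.
usage = db_usage(quota_bytes=2 * GIB, total_bytes=1 * GIB, in_use_bytes=GIB // 2)
print(usage["quota_used_ratio"])   # 0.5
print(usage["free_in_db_bytes"])   # 536870912
```

A ratio approaching 1 is a sign that the quota is about to be reached, while a large free value suggests that much of the allocated file is no longer in use.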

# HELP etcd_mvcc_range_total Total number of ranges seen by this member.
# TYPE etcd_mvcc_range_total counter
Number of range operations performed by this etcd member.

# HELP etcd_mvcc_put_total Total number of puts seen by this member.
# TYPE etcd_mvcc_put_total counter
Number of put operations performed by this etcd member.

# HELP etcd_mvcc_txn_total Total number of txns seen by this member.
# TYPE etcd_mvcc_txn_total counter
Number of transactions performed by this etcd member.

# HELP etcd_mvcc_delete_total Total number of deletes seen by this member.
# TYPE etcd_mvcc_delete_total counter
Number of delete operations performed by this etcd member.

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
Cumulative CPU time consumed by the etcd process, in seconds.

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
Resident memory used by the etcd process.

# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
Number of file descriptors opened by the etcd process.
