Is VictoriaMetrics failing to fetch data from a scrape target?
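VictoriaMetrics cannot discover a scrape target
Problem description
I recently deployed a service in a new environment. It exposes its metrics at :10299/metrics, and its configuration is shown below (name fields have been changed):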
apiVersion: v1
items:
- apiVersion: operator.victoriametrics.com/v1beta1
  kind: VMServiceScrape
  metadata:
    labels:
      app_id: audit
    name: audit
    namespace: default
  spec:
    endpoints:
    - path: /metrics
      targetPort: 10299
    namespaceSelector:
      matchNames:
      - default
    selector:
      matchLabels:
        app_id: audit
However, the status page in vmagent showed that vmagent had not discovered the target.
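General troubleshooting steps
- Make sure the service itself is healthy and that the metrics can be reached at ${podIp}:10299/metrics
- Make sure the vmservicescrape --> service --> endpoints chain is intact, i.e. the configured selector field actually matches the corresponding resources
- Make sure the vmservicescrape is well-formed. Note: a malformed vmservicescrape resource may prevent vmagent from loading the configuration, which can be detected with step 5
- Make sure vmagent is allowed to discover targets in this namespace
- Run reload in the vmagent UI and check the vmagent logs for related errors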
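Step 2 above can be reasoned about by hand: a selector with matchLabels matches a resource only when every key/value pair in the selector is present in the resource's labels. The following is a minimal Go sketch of that rule, assuming plain equality matching only. The helper name matchLabels and the sample label sets are illustrative and are not taken from the operator's code.

```go
package main

import "fmt"

// matchLabels reports whether every key/value pair in selector is present
// in labels. Only matchLabels semantics are modeled here; matchExpressions
// is out of scope for this sketch.
func matchLabels(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	// Selector taken from the VMServiceScrape in this article.
	selector := map[string]string{"app_id": "audit"}

	// Hypothetical Service labels, used only for illustration.
	matching := map[string]string{"app_id": "audit", "tier": "backend"}
	mismatched := map[string]string{"app": "audit"}

	fmt.Println(matchLabels(selector, matching))   // true
	fmt.Println(matchLabels(selector, mismatched)) // false
}
```

If the second case is what the cluster looks like, the Service is never selected and no later step can succeed.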
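None of the checks above resolved the issue. Stranger still, the target did not appear in vmagent's api/v1/targets at all, which means vmagent had never discovered the service, i.e. the vmservicescrape configuration had not taken effect. The configuration that vmagent generated from the vmservicescrape above is shown below (it is concatenated with the static configuration). It uses kubernetes_sd_configs to discover targets: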
- job_name: serviceScrape/default/audit/0
  metrics_path: /metrics
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app_id]
    regex: audit
    action: keep
  - source_labels: [__meta_kubernetes_pod_container_port_number]
    regex: "10299"
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    target_label: node
    regex: Node;(.*)
    replacement: ${1}
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    target_label: pod
    regex: Pod;(.*)
    replacement: ${1}
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  - source_labels: [__meta_kubernetes_pod_container_name]
    target_label: container
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_service_name]
    target_label: service
  - source_labels: [__meta_kubernetes_service_name]
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: "8080"
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      own_namespace: false
      names:
      - default
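The two keep rules at the top of this job decide whether a discovered target survives relabeling. The following is a minimal Go sketch of that filtering, assuming the regex must match the whole label value, as in Prometheus-compatible relabeling. The keepRule type and the sample label maps are hypothetical and do not reproduce vmagent's implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepRule is a simplified stand-in for a relabel rule with "action: keep"
// and a single source label.
type keepRule struct {
	sourceLabel string
	regex       *regexp.Regexp
}

// newKeepRule anchors the pattern so that it has to match the whole value.
func newKeepRule(sourceLabel, pattern string) keepRule {
	return keepRule{
		sourceLabel: sourceLabel,
		regex:       regexp.MustCompile("^(?:" + pattern + ")$"),
	}
}

// keep reports whether a target with the given labels passes every rule.
// A missing label is treated as an empty string.
func keep(labels map[string]string, rules []keepRule) bool {
	for _, r := range rules {
		if !r.regex.MatchString(labels[r.sourceLabel]) {
			return false
		}
	}
	return true
}

func main() {
	rules := []keepRule{
		newKeepRule("__meta_kubernetes_service_label_app_id", "audit"),
		newKeepRule("__meta_kubernetes_pod_container_port_number", "10299"),
	}

	// Hypothetical target for a pod that only declares containerPort 8080.
	target := map[string]string{
		"__meta_kubernetes_service_label_app_id":      "audit",
		"__meta_kubernetes_pod_container_port_number": "8080",
	}
	fmt.Println(keep(target, rules)) // false: the second rule drops the target
}
```

A target dropped this way never reaches api/v1/targets, which is consistent with the symptom described above.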
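Code analysis
Since the configuration looked fine, the only option left was to study how kubernetes_sd_configs works in VictoriaMetrics and find out where things went wrong. In the VictoriaMetrics source code, the target URL is assembled as follows: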
scrapeURL := fmt.Sprintf("%s://%s%s%s%s", schemeRelabeled, addressRelabeled, metricsPathRelabeled, optionalQuestion, paramsStr)
Where:
- schemeRelabeled: defaults to http
...
	// See kubernetes.io/docs/reference/labels-annotations-taints/#endpoints-kubernetes-io-over-capacity
	// and github.com/kubernetes/kubernetes/pull/99975
	switch eps.Metadata.Annotations.GetByName("endpoints.kubernetes.io/over-capacity") {
	case "truncated":
		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and has been truncated; please use "role: endpointslice" instead`, eps.Metadata.key())
	case "warning":
		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and will be truncated in the next k8s releases; please use "role: endpointslice" instead`, eps.Metadata.key())
	}
	// Append labels for skipped ports on seen pods.
	portSeen := func(port int, ports []int) bool {
		for _, p := range ports {
			if p == port {
				return true
			}
		}
		return false
	}
	for p, ports := range podPortsSeen {
		for _, c := range p.Spec.Containers {
			for _, cp := range c.Ports {
				if portSeen(cp.ContainerPort, ports) {
					continue
				}
				addr := discoveryutils.JoinHostPort(p.Status.PodIP, cp.ContainerPort)
				m := map[string]string{
					"__address__": addr,
				}
				p.appendCommonLabels(m)
				p.appendContainerLabels(m, c, &cp)
				if svc != nil {
					svc.appendCommonLabels(m)
				}
				ms = append(ms, m)
			}
		}
	}
	return ms
}
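As the code shows, "__address__" is simply p.Status.PodIP joined with cp.ContainerPort, where p represents a Kubernetes pod object. The label __meta_kubernetes_pod_container_port_number, which the generated keep rule matches against "10299", is likewise taken from the ports declared in the pod spec. Discovery therefore requires that:
- the pod is running and has been assigned a PodIP
- the port that exposes the metrics target is declared in p.Spec.Containers[].ports[].ContainerPort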
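The two requirements above can be illustrated with a simplified Go sketch. It assumes a pod reduced to an IP and its declared container ports, and it ignores the targets that the real discovery code also derives from the Endpoints object. The pod type and the helpers addresses and scrapeURL are hypothetical names, not VictoriaMetrics APIs, and the IP address is made up.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// pod is a reduced, hypothetical view of a Kubernetes pod that keeps only
// the fields needed for this sketch.
type pod struct {
	IP             string
	ContainerPorts []int
}

// addresses returns one host:port pair per declared container port.
// A pod without an IP yields no addresses at all.
func addresses(p pod) []string {
	if p.IP == "" {
		return nil
	}
	var addrs []string
	for _, port := range p.ContainerPorts {
		addrs = append(addrs, net.JoinHostPort(p.IP, strconv.Itoa(port)))
	}
	return addrs
}

// scrapeURL joins scheme, address and metrics path. Query parameters are
// left out to keep the sketch short.
func scrapeURL(scheme, addr, path string) string {
	return fmt.Sprintf("%s://%s%s", scheme, addr, path)
}

func main() {
	// Only 8080 is declared, so no address for 10299 is ever produced.
	p := pod{IP: "10.0.0.5", ContainerPorts: []int{8080}}
	for _, addr := range addresses(p) {
		fmt.Println(scrapeURL("http", addr, "/metrics")) // http://10.0.0.5:8080/metrics
	}
}
```

Under these assumptions an undeclared port can never show up as a candidate address, no matter what the process inside the container actually listens on.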
Based on this analysis, I checked the deployment in the environment and found that it only declared port 8080 and did not declare the metrics port 10299. Problem solved.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app_id: audit
  name: audit
  namespace: default
spec:
  ...
  template:
    metadata:
      ...
    spec:
      containers:
      - env:
        - name: APP_ID
          value: audit
        ports:
        - containerPort: 8080
          protocol: TCP
        ...
Summary
The kubernetes_sd_configs mechanism uses list-watch to fetch the objects for the configured role and then assembles each target's __address__ from them. It also exposes additional meta labels, such as:
- __meta_kubernetes_endpoint_hostname: Hostname of the endpoint.
- __meta_kubernetes_endpoint_node_name: Name of the node hosting the endpoint.
- __meta_kubernetes_endpoint_ready: Set to true or false for the endpoint's ready state.
- __meta_kubernetes_endpoint_port_name: Name of the endpoint port.
- __meta_kubernetes_endpoint_port_protocol: Protocol of the endpoint port.
- __meta_kubernetes_endpoint_address_target_kind: Kind of the endpoint address target.
- __meta_kubernetes_endpoint_address_target_name: Name of the endpoint address target.

