hello云胜

技术与生活

0%

Skywalking配置告警

在skywalking目录的config目录下的alarm-setings.yml文件中进行配置。

我前面使用docker部署的skywalking

docker下修改

docker ps 查看下skywalking-oap容器的id

1
docker exec -it 容器id bash

进入oap容器

1
2
3
4
5
6
7
8
bash-5.0# ls
LICENSE README.txt cli docker-entrypoint.sh ext-libs oap-libs
NOTICE bin config ext-config licenses tools
bash-5.0# cd config/
bash-5.0# ls
alarm-settings-sample.yml application.yml endpoint-name-grouping.yml gateways.yml meter-analyzer-config otel-oc-rules ui-initialized-templates
alarm-settings.yml component-libraries.yml fetcher-prom-rules log4j2.xml oal service-apdex-threshold.yml

修改alarm-settings.yml文件

重启容器

配置规则

触发条件:在一定的时间段内满足告警条件,通过webhook接口对外发送告警信息。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
rules:
# Rule unique name, must be ended with `_rule`.
endpoint_percent_rule:
# Metrics value need to be long, double or int
metrics-name: endpoint_percent
threshold: 75
op: <
# The length of time to evaluate the metrics
period: 10
# How many times after the metrics match the condition, will trigger alarm
count: 3
# How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
silence-period: 10
message: Successful rate of endpoint {name} is lower than 75%
  • 规则名:必须唯一,必须以_rule结尾。message中的{name}会替换成该名字
  • metrics-name:指标名。oal脚本中的指标名。
  • threshold:阈值。单个指标的很好理解。对于多个值指标,例如百分比percentile,阈值是一个数组。像value1 value2 value3 value4 value5这样描述. 每个值可以作为度量中每个值的阈值。如果不想通过此值或某些值触发警报,则将值设置为 -.
    例如在百分比中,value1是P50的阈值,且 -,-,value3, value4, value5的意思是,没有阈值的P50和P75百分位告警规则
  • op:操作符,目前支持>>=<<==
  • period:周期窗口。
  • count:计数次数或时间。在周期窗口内,如果超过阈值的次数达到设定的这个count,触发告警
  • silence-period:静默时间。触发告警后,多少时间保持静默,不要告警。
  • message:告警信息

默认规则

alarm-setings.yml文件中已经配置了7个默认规则

1.最近 3 分钟内服务平均响应时间超过 1 秒。

2.服务成功率在最近 2 分钟内低于80%。

3.服务响应时间在最近 3 分钟内低于 1000 毫秒.

4.服务实例在最近 2 分钟内的平均响应时间超过 1 秒。

5.端点平均响应时间在最近 2 分钟内超过1秒。

6.数据库访问平均响应时间在过去 2 分钟内超过 1 秒。

7.端点之间平均响应时间在最近 2 分钟内超过 1 秒。

webhooks

1
2
webhooks:
- http://127.0.0.1/notify/

webhooks里配置接收告警的服务。这个就需要我们自己开发了。

告警的消息会通过 HTTP 请求进行发送, 请求方法为 POST, Content-Typeapplication/json, JSON 格式基于 List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage, 包含以下信息.

  • scopeId. 所有可用的 Scope 请查阅 org.apache.skywalking.oap.server.core.source.DefaultScopeDefine.
  • name. 目标 Scope 的实体名称.
  • id0. Scope 实体的 ID.
  • id1. 未使用.
  • ruleName. 您在 alarm-settings.yml 中配置的规则名.
  • alarmMessage. 报警消息内容.
  • startTime. 告警时间, 位于当前时间与 UTC 1970/1/1 之间.

源码中的类

1
2
3
4
5
6
7
8
9
10
11
12
13
@Setter
@Getter
public class AlarmMessage {
private int scopeId;
private String scope;
private String name;
private String id0;
private String id1;
private String ruleName;
private String alarmMessage;
private long startTime;
private transient boolean onlyAsCondition;
}

动态配置

在config/application.yml文件中,有以下配置

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
configuration:
selector: ${SW_CONFIGURATION:none}
none:
grpc:
host: ${SW_DCS_SERVER_HOST:""}
port: ${SW_DCS_SERVER_PORT:80}
clusterName: ${SW_DCS_CLUSTER_NAME:SkyWalking}
period: ${SW_DCS_PERIOD:20}
apollo:
apolloMeta: ${SW_CONFIG_APOLLO:http://localhost:8080}
apolloCluster: ${SW_CONFIG_APOLLO_CLUSTER:default}
apolloEnv: ${SW_CONFIG_APOLLO_ENV:""}
appId: ${SW_CONFIG_APOLLO_APP_ID:skywalking}
period: ${SW_CONFIG_APOLLO_PERIOD:5}
zookeeper:
period: ${SW_CONFIG_ZK_PERIOD:60} # Unit seconds, sync period. Default fetch every 60 seconds.
nameSpace: ${SW_CONFIG_ZK_NAMESPACE:/default}
hostPort: ${SW_CONFIG_ZK_HOST_PORT:localhost:2181}
# Retry Policy
baseSleepTimeMs: ${SW_CONFIG_ZK_BASE_SLEEP_TIME_MS:1000} # initial amount of time to wait between retries
maxRetries: ${SW_CONFIG_ZK_MAX_RETRIES:3} # max number of times to retry
etcd:
period: ${SW_CONFIG_ETCD_PERIOD:60} # Unit seconds, sync period. Default fetch every 60 seconds.
group: ${SW_CONFIG_ETCD_GROUP:skywalking}
serverAddr: ${SW_CONFIG_ETCD_SERVER_ADDR:localhost:2379}
clusterName: ${SW_CONFIG_ETCD_CLUSTER_NAME:default}
consul:
# Consul host and ports, separated by comma, e.g. 1.2.3.4:8500,2.3.4.5:8500
hostAndPorts: ${SW_CONFIG_CONSUL_HOST_AND_PORTS:1.2.3.4:8500}
# Sync period in seconds. Defaults to 60 seconds.
period: ${SW_CONFIG_CONSUL_PERIOD:60}
# Consul aclToken
aclToken: ${SW_CONFIG_CONSUL_ACL_TOKEN:""}
k8s-configmap:
period: ${SW_CONFIG_CONFIGMAP_PERIOD:60}
namespace: ${SW_CLUSTER_K8S_NAMESPACE:default}
labelSelector: ${SW_CLUSTER_K8S_LABEL:app=collector,release=skywalking}
nacos:
# Nacos Server Host
serverAddr: ${SW_CONFIG_NACOS_SERVER_ADDR:127.0.0.1}
# Nacos Server Port
port: ${SW_CONFIG_NACOS_SERVER_PORT:8848}
# Nacos Configuration Group
group: ${SW_CONFIG_NACOS_SERVER_GROUP:skywalking}
# Nacos Configuration namespace
namespace: ${SW_CONFIG_NACOS_SERVER_NAMESPACE:}
# Unit seconds, sync period. Default fetch every 60 seconds.
period: ${SW_CONFIG_NACOS_PERIOD:60}
# Nacos auth username
username: ${SW_CONFIG_NACOS_USERNAME:""}
password: ${SW_CONFIG_NACOS_PASSWORD:""}
# Nacos auth accessKey
accessKey: ${SW_CONFIG_NACOS_ACCESSKEY:""}
secretKey: ${SW_CONFIG_NACOS_SECRETKEY:""}

可以看到,skywalking支持非常多的外部配置中心。默认是不开启的,设置为none。

要使用哪个,就将selector配置为哪个。

配置的key为alarm.default.alarm-settings

配置的值格式必须和alarm-settings.yml 文件一致