
Shipping System and Docker Monitoring Metrics to ElasticSearch with Telegraf, and Visualizing Them in Kibana


In this post, I'll cover how to ship various metrics to the ElasticSearch instance we set up last time using Telegraf, and how to visualize them with Kibana.

2020/08/30 - [Tech/Tips] - Running ElasticSearch and Kibana with docker-compose


2020/09/06 - [Tech/Tips] - Running MariaDB, PostgreSQL, PGAdmin, and Redis with Docker


See the posts above for how to bring up ElasticSearch, Kibana, and the various databases!

 

Telegraf is an open-source, plugin-driven server agent from InfluxData (the makers of InfluxDB) that collects metrics from systems, databases, IoT sensors, and more.

Written in Go, it ships as a single binary, so there is no pile of dependencies to install. It also provides system monitoring out of the box, which means there is no need to run a separate metric collector like the Logstash + Beats combination. That's why I chose it.

 

The most common stack has Telegraf write its collected metrics into InfluxDB, InfluxData's own time-series DB, and visualize them with Grafana. However, Telegraf also includes an ElasticSearch output plugin, so it integrates with the Elastic stack as well.

 

Telegraf official site: www.influxdata.com/time-series-platform/telegraf/


 

Installation on CentOS 7 is very simple.

As of Telegraf 1.15.2, it can be installed by running the following two commands.

Alternatively, use the download page: portal.influxdata.com/downloads/


wget https://dl.influxdata.com/telegraf/releases/telegraf-1.15.2-1.x86_64.rpm
sudo yum localinstall telegraf-1.15.2-1.x86_64.rpm

After installation, go to /etc/telegraf and edit the telegraf.conf file.

 

1. Configuring the ElasticSearch output plugin

# # Configuration for Elasticsearch to send metrics to.
  [[outputs.elasticsearch]]
#   ## The full HTTP endpoint URL for your Elasticsearch instance
#   ## Multiple urls can be specified as part of the same cluster,
#   ## this means that only ONE of the urls will be written to each interval.
    urls = [ "http://localhost:{{YOUR_ELASTIC_SEARCH_PORT}}" ] # required.
#   ## Elasticsearch client timeout, defaults to "5s" if not set.
    timeout = "5s"
#   ## Set to true to ask Elasticsearch a list of all cluster nodes,
#   ## thus it is not necessary to list all nodes in the urls config option.
    enable_sniffer = false
#   ## Set the interval to check if the Elasticsearch nodes are available
#   ## Setting to "0s" will disable the health check (not recommended in production)
    health_check_interval = "10s"
#   ## HTTP basic authentication details
#   # username = "telegraf"
#   # password = "mypassword"
#
#   ## Index Config
#   ## The target index for metrics (Elasticsearch will create if it not exists).
#   ## You can use the date specifiers below to create indexes per time frame.
#   ## The metric timestamp will be used to decide the destination index name
#   # %Y - year (2016)
#   # %y - last two digits of year (00..99)
#   # %m - month (01..12)
#   # %d - day of month (e.g., 01)
#   # %H - hour (00..23)
#   # %V - week of the year (ISO week) (01..53)
#   ## Additionally, you can specify a tag name using the notation {{tag_name}}
#   ## which will be used as part of the index name. If the tag does not exist,
#   ## the default tag value will be used.
#   # index_name = "telegraf-{{host}}-%Y.%m.%d"
#   # default_tag_value = "none"
    index_name = "telegraf-%Y.%m.%d" # required.
#
#   ## Optional TLS Config
#   # tls_ca = "/etc/telegraf/ca.pem"
#   # tls_cert = "/etc/telegraf/cert.pem"
#   # tls_key = "/etc/telegraf/key.pem"
#   ## Use TLS but skip chain & host verification
#   # insecure_skip_verify = false
#
#   ## Template Config
#   ## Set to true if you want telegraf to manage its index template.
#   ## If enabled it will create a recommended index template for telegraf indexes
    manage_template = true
#   ## The template name used for telegraf indexes
    template_name = "telegraf"
#   ## Set to true if you want telegraf to overwrite an existing template
    overwrite_template = false

 

2. Configuring the input plugins for system monitoring (these are the defaults)

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states.
  report_active = false


# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]

  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]


# Read metrics about disk IO by device
[[inputs.diskio]]
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb", "vd*"]
  ## Uncomment the following line if you need disk serial numbers.
  # skip_serial_number = false
  #
  ## On systems which support it, device metadata can be added in the form of
  ## tags.
  ## Currently only Linux is supported via udev properties. You can view
  ## available properties for a device by running:
  ## 'udevadm info -q property -n /dev/sda'
  ## Note: Most, but not all, udev properties can be accessed this way. Properties
  ## that are currently inaccessible include DEVTYPE, DEVNAME, and DEVPATH.
  # device_tags = ["ID_FS_TYPE", "ID_FS_USAGE"]
  #
  ## Using the same metadata source as device_tags, you can also customize the
  ## name of the device via templates.
  ## The 'name_templates' parameter is a list of templates to try and apply to
  ## the device. The template may contain variables in the form of '$PROPERTY' or
  ## '${PROPERTY}'. The first template which does not contain any variables not
  ## present for the device is used as the device name tag.
  ## The typical use case is for LVM volumes, to get the VG/LV name instead of
  ## the near-meaningless DM-0 name.
  # name_templates = ["$ID_FS_LABEL","$DM_VG_NAME/$DM_LV_NAME"]


# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration


# Read metrics about memory usage
[[inputs.mem]]
  # no configuration


# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration


# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration


# Read metrics about system load & uptime
[[inputs.system]]
  ## Uncomment to remove deprecated metrics.
  # fielddrop = ["uptime_format"]

 

3. Configuring the input plugin for Docker monitoring

# # Read metrics about docker containers
 [[inputs.docker]]
#   ## Docker Endpoint
#   ##   To use TCP, set endpoint = "tcp://[ip]:[port]"
#   ##   To use environment variables (ie, docker-machine), set endpoint = "ENV"
   endpoint = "unix:///var/run/docker.sock"
#
#   ## Set to true to collect Swarm metrics(desired_replicas, running_replicas)
   gather_services = false
#
#   ## Only collect metrics for these containers, collect all if empty
   container_names = []
#
#   ## Set the source tag for the metrics to the container ID hostname, eg first 12 chars
   source_tag = false
#
#   ## Containers to include and exclude. Globs accepted.
#   ## Note that an empty array for both will include all containers
   container_name_include = []
   container_name_exclude = []
#
#   ## Container states to include and exclude. Globs accepted.
#   ## When empty only containers in the "running" state will be captured.
#   ## example: container_state_include = ["created", "restarting", "running", "removing", "paused", "exited", "dead"]
#   ## example: container_state_exclude = ["created", "restarting", "running", "removing", "paused", "exited", "dead"]
   # container_state_include = []
   # container_state_exclude = []
#
#   ## Timeout for docker list, info, and stats commands
   timeout = "5s"
#
#   ## Whether to report for each container per-device blkio (8:0, 8:1...) and
#   ## network (eth0, eth1, ...) stats or not
   perdevice = true
#
#   ## Whether to report for each container total blkio and network stats or not
   total = false
#
#   ## Which environment variables should we use as a tag
#   ##tag_env = ["JAVA_HOME", "HEAP_SIZE"]
#
#   ## docker labels to include and exclude as tags.  Globs accepted.
#   ## Note that an empty array for both will include all labels as tags
   docker_label_include = []
   docker_label_exclude = []
#
#   ## Optional TLS Config
#   # tls_ca = "/etc/telegraf/ca.pem"
#   # tls_cert = "/etc/telegraf/cert.pem"
#   # tls_key = "/etc/telegraf/key.pem"
#   ## Use TLS but skip chain & host verification
#   # insecure_skip_verify = false

 

4. Configuring the input plugin for MariaDB monitoring

# # Read metrics from one or many mysql servers
 [[inputs.mysql]]
   ## specify servers via a url matching:
#   ##  [username[:password]@][protocol[(address)]]/[?tls=[true|false|skip-verify|custom]]
#   ##  see https://github.com/go-sql-driver/mysql#dsn-data-source-name
#   ##  e.g.
#   ##    servers = ["user:passwd@tcp(127.0.0.1:3306)/?tls=false"]
#   ##    servers = ["user@tcp(127.0.0.1:3306)/?tls=false"]
#   #
#   ## If no servers are specified, then localhost is used as the host.
   servers = ["root:{{YOUR_PASSWORD}}@tcp(127.0.0.1:{{YOUR_PORT}})/"]
#
#   ## Selects the metric output format.
#   ##
#   ## This option exists to maintain backwards compatibility, if you have
#   ## existing metrics do not set or change this value until you are ready to
#   ## migrate to the new format.
#   ##
#   ## If you do not have existing metrics from this plugin set to the latest
#   ## version.
#   ##
#   ## Telegraf >=1.6: metric_version = 2
#   ##           <1.6: metric_version = 1 (or unset)
#   metric_version = 2
#
#   ## if the list is empty, then metrics are gathered from all databasee tables
#   # table_schema_databases = []
#
#   ## gather metrics from INFORMATION_SCHEMA.TABLES for databases provided above list
#   # gather_table_schema = false
#
#   ## gather thread state counts from INFORMATION_SCHEMA.PROCESSLIST
#   # gather_process_list = false
#
#   ## gather user statistics from INFORMATION_SCHEMA.USER_STATISTICS
#   # gather_user_statistics = false
#
#   ## gather auto_increment columns and max values from information schema
#   # gather_info_schema_auto_inc = false
#
#   ## gather metrics from INFORMATION_SCHEMA.INNODB_METRICS
#   # gather_innodb_metrics = false
#
#   ## gather metrics from SHOW SLAVE STATUS command output
#   # gather_slave_status = false
#
#   ## gather metrics from SHOW BINARY LOGS command output
#   # gather_binary_logs = false
#
#   ## gather metrics from PERFORMANCE_SCHEMA.GLOBAL_VARIABLES
#   # gather_global_variables = true
#
#   ## gather metrics from PERFORMANCE_SCHEMA.TABLE_IO_WAITS_SUMMARY_BY_TABLE
#   # gather_table_io_waits = false
#
#   ## gather metrics from PERFORMANCE_SCHEMA.TABLE_LOCK_WAITS
#   # gather_table_lock_waits = false
#
#   ## gather metrics from PERFORMANCE_SCHEMA.TABLE_IO_WAITS_SUMMARY_BY_INDEX_USAGE
#   # gather_index_io_waits = false
#
#   ## gather metrics from PERFORMANCE_SCHEMA.EVENT_WAITS
#   # gather_event_waits = false
#
#   ## gather metrics from PERFORMANCE_SCHEMA.FILE_SUMMARY_BY_EVENT_NAME
#   # gather_file_events_stats = false
#
#   ## gather metrics from PERFORMANCE_SCHEMA.EVENTS_STATEMENTS_SUMMARY_BY_DIGEST
#   # gather_perf_events_statements = false
#
#   ## the limits for metrics form perf_events_statements
#   # perf_events_statements_digest_text_limit = 120
#   # perf_events_statements_limit = 250
#   # perf_events_statements_time_limit = 86400
#
#   ## Some queries we may want to run less often (such as SHOW GLOBAL VARIABLES)
#   ##   example: interval_slow = "30m"
#   # interval_slow = ""
#
#   ## Optional TLS Config (will be used if tls=custom parameter specified in server uri)
#   # tls_ca = "/etc/telegraf/ca.pem"
#   # tls_cert = "/etc/telegraf/cert.pem"
#   # tls_key = "/etc/telegraf/key.pem"
#   ## Use TLS but skip chain & host verification
#   # insecure_skip_verify = false
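As the comments above note, the `servers` entry follows the go-sql-driver DSN format `[username[:password]@][protocol[(address)]]/`. A minimal sketch of assembling one (Python used purely for illustration; the credentials are placeholders):

```python
def mysql_dsn(user: str, password: str, host: str, port: int) -> str:
    # DSN shape per go-sql-driver/mysql: [username[:password]@][protocol[(address)]]/
    return f"{user}:{password}@tcp({host}:{port})/"

print(mysql_dsn("root", "secret", "127.0.0.1", 3306))  # root:secret@tcp(127.0.0.1:3306)/
```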

 

5. Configuring the input plugin for Redis monitoring

# # Read metrics from one or many redis servers
 [[inputs.redis]]
#   ## specify servers via a url matching:
#   ##  [protocol://][:password]@address[:port]
#   ##  e.g.
#   ##    tcp://localhost:6379
#   ##    tcp://:password@192.168.99.100
#   ##    unix:///var/run/redis.sock
#   ##
#   ## If no servers are specified, then localhost is used as the host.
#   ## If no port is specified, 6379 is used
   servers = ["tcp://localhost:{{YOUR_REDIS_PORT}}"]
#
#   ## specify server password
    password = "{{YOUR_REDIS_PASSWORD}}"
#
#   ## Optional TLS Config
#   # tls_ca = "/etc/telegraf/ca.pem"
#   # tls_cert = "/etc/telegraf/cert.pem"
#   # tls_key = "/etc/telegraf/key.pem"
#   ## Use TLS but skip chain & host verification
#   # insecure_skip_verify = true

 

6. Configuring the input plugin for PostgreSQL monitoring

# # Read metrics from one or many postgresql servers
 [[inputs.postgresql]]
#   ## specify address via a url matching:
#   ##   postgres://[pqgotest[:password]]@localhost[/dbname]\
#   ##       ?sslmode=[disable|verify-ca|verify-full]
#   ## or a simple string:
#   ##   host=localhost user=pqotest password=... sslmode=... dbname=app_production
#   ##
#   ## All connection parameters are optional.
#   ##
#   ## Without the dbname parameter, the driver will default to a database
#   ## with the same name as the user. This dbname is just for instantiating a
#   ## connection with the server and doesn't restrict the databases we are trying
#   ## to grab metrics for.
#   ##
   address = "host=localhost port={{YOUR_PG_PORT}} user={{YOUR_PG_USERNAME}} password={{YOUR_PG_PASSWORD}} sslmode=disable"
#   ## A custom name for the database that will be used as the "server" tag in the
#   ## measurement output. If not specified, a default one generated from
#   ## the connection address is used.
#   # outputaddress = "db01"
#
#   ## connection configuration.
#   ## maxlifetime - specify the maximum lifetime of a connection.
#   ## default is forever (0s)
#   max_lifetime = "0s"
#
#   ## A  list of databases to explicitly ignore.  If not specified, metrics for all
#   ## databases are gathered.  Do NOT use with the 'databases' option.
#   # ignored_databases = ["postgres", "template0", "template1"]
#
#   ## A list of databases to pull metrics about. If not specified, metrics for all
#   ## databases are gathered.  Do NOT use with the 'ignored_databases' option.
#   # databases = ["app_production", "testing"]

 

Once telegraf.conf is set up, run a config test to confirm that metrics are being collected correctly.

telegraf -config telegraf.conf -test

 

(...the output is an enormous amount of metrics, so I'll omit it here.)
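For reference, `telegraf -test` prints each metric in InfluxDB line protocol: a measurement name with comma-attached tags, a space, the fields, a space, and a nanosecond timestamp. A minimal sketch of how one such line breaks down (the sample line is made up, and real line protocol also allows escaped commas and spaces, which this ignores):

```python
def parse_line(line: str):
    """Split one InfluxDB line-protocol record into its three sections."""
    header, field_part, timestamp = line.split(" ")
    measurement, *tag_pairs = header.split(",")
    tags = dict(pair.split("=") for pair in tag_pairs)
    fields = dict(pair.split("=") for pair in field_part.split(","))
    return measurement, tags, fields, int(timestamp)

# A made-up sample resembling one line of telegraf -test output:
sample = "mem,host=myserver used_percent=54.2,available=1024i 1599634137000000000"
measurement, tags, fields, ts = parse_line(sample)
```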

 

If the Docker metrics are not being collected here, add the telegraf user to the docker group.

Before that, check whether the docker group exists with the command vi -R /etc/group.

 

If the docker group does not exist, create one.

sudo groupadd docker

 

Then add telegraf to the docker group.

sudo usermod -aG docker telegraf

 

Check with vi -R /etc/group that it was added correctly.

# Example
docker:x:1001:root,telegraf
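Each /etc/group line is colon-separated, with the member list comma-separated in the last field. A small sketch of that membership check, using the example line above:

```python
def user_in_group(group_line: str, user: str) -> bool:
    # /etc/group format: name:password:GID:member1,member2,...
    _name, _password, _gid, members = group_line.strip().split(":")
    return user in members.split(",")

print(user_in_group("docker:x:1001:root,telegraf", "telegraf"))
```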

 

Next, enable and start the service with systemctl.

sudo systemctl enable telegraf
sudo systemctl start telegraf

 

 

Then check that the telegraf service is running properly.

sudo systemctl status telegraf

● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
   Active: active (running) since 수 2020-09-09 16:08:57 KST; 1 day 6h ago
     Docs: https://github.com/influxdata/telegraf
 Main PID: 21523 (telegraf)
    Tasks: 19
   Memory: 70.9M
   CGroup: /system.slice/telegraf.service
           └─21523 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

 

Use curl to check that the metrics shipped by telegraf are actually being stored in ElasticSearch.

My ElasticSearch runs on port 9200, so I query port 9200.

curl -XGET 'localhost:9200/telegraf-YYYY.MM.DD/_search?pretty'

Here, the YYYY.MM.DD part follows the date specifiers in index_name, so ElasticSearch creates one index per day based on the metric timestamps. If the telegraf service started shipping metrics on September 9, 2020, you should query telegraf-2020.09.09.
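The date specifiers in `index_name` behave like `strftime` patterns, so you can predict which daily index a given metric lands in. A quick sketch (Python used purely for illustration):

```python
from datetime import datetime

def index_for(ts: datetime, pattern: str = "telegraf-%Y.%m.%d") -> str:
    # Telegraf picks the destination index from the metric's own timestamp
    return ts.strftime(pattern)

print(index_for(datetime(2020, 9, 9)))  # telegraf-2020.09.09
```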

 

(...again, an enormous amount of metrics, so I'll omit the output here too.)

 

Afterwards, when creating a Visualize in Kibana, set the data index pattern to telegraf-* ; then queries keep working correctly no matter when the telegraf service restarts.
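The `telegraf-*` pattern matches every daily index regardless of its date suffix, which is why restarts (and day rollovers) do not break the visualization. Illustrated with Python's `fnmatch`, which uses the same shell-style wildcard (the index names are hypothetical):

```python
from fnmatch import fnmatch

indices = ["telegraf-2020.09.09", "telegraf-2020.09.10", "kibana_task_manager"]
# Keep only the indices the Kibana index pattern would match
matched = [name for name in indices if fnmatch(name, "telegraf-*")]
print(matched)
```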

 

Kibana Visualize example: monitoring System Memory

 
