Monitoring Metrics

We’ve standardised on Grafana for viewing metric data in dashboards and for sending alerts to PagerDuty and Slack. Below is a rough list of the different types of data we’re monitoring. The majority of these metrics are collected via collectd and stored in Graphite, although some dashboards pull data directly from our Elasticsearch logging infrastructure.
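
Metrics reach Graphite over its plaintext protocol, one line per data point in the form metric.path value timestamp, which collectd's write_graphite plugin would normally handle for us. As a minimal sketch of that interface (the host name and metric path are placeholders):

```python
import socket
import time

GRAPHITE_HOST = "graphite.example.internal"  # placeholder: the real Graphite relay
GRAPHITE_PORT = 2003                         # Graphite's default plaintext port


def send_metric(path, value, timestamp=None):
    """Send one data point using the Graphite plaintext protocol."""
    timestamp = int(timestamp or time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example: record CPU idle percentage for a hypothetical gateway VM
send_metric("servers.gateway01.cpu.idle", 87.5)
```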

System Metrics

Standard system metrics are collected by collectd from the individual VMs and pushed to Graphite; a minimal collection sketch follows the list below.

  • CPU usage
  • Network utilisation
    • Socket states (e.g. count of established connections, whether expected processes are listening)
    • Transfer rate
  • Memory usage
    • System overall
    • Process size (e.g. DataPower)
  • Disk usage
  • Load
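
Most of these come from stock collectd plugins (cpu, memory, df, load, interface). The socket-state counts are the sort of thing a small custom script can provide; a sketch of one written for collectd's exec plugin, emitting PUTVAL lines (the metric names and interval are illustrative, and it assumes a Linux /proc filesystem):

```python
#!/usr/bin/env python3
"""Sketch of a collectd exec-plugin script reporting socket states."""
import os
import time

HOSTNAME = os.uname().nodename
INTERVAL = 10  # seconds; would normally match the collectd Interval setting

TCP_ESTABLISHED = "01"  # state codes as they appear in /proc/net/tcp
TCP_LISTEN = "0A"


def count_socket_states():
    established = listening = 0
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        for line in f:
            state = line.split()[3]
            if state == TCP_ESTABLISHED:
                established += 1
            elif state == TCP_LISTEN:
                listening += 1
    return established, listening


while True:
    est, lsn = count_socket_states()
    now = int(time.time())
    # PUTVAL "<host>/<plugin>-<instance>/<type>-<instance>" interval=<n> <time>:<value>
    print(f'PUTVAL "{HOSTNAME}/exec-sockets/gauge-established" interval={INTERVAL} {now}:{est}', flush=True)
    print(f'PUTVAL "{HOSTNAME}/exec-sockets/gauge-listening" interval={INTERVAL} {now}:{lsn}', flush=True)
    time.sleep(INTERVAL)
```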

Product Metrics

API Invocation

  • Transactions per second [ Count of ExtLatency logs from DataPower ] l
  • Rate of errors (for stable APIs) [ Analysis of response codes from analytics ] a
  • Response time a
  • Spread of traffic across gateways a
  • Deep dive into individual APIs a
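
Most of these panels are driven by the analytics records in Elasticsearch. As an illustration of the query behind an error-rate panel, here is a sketch of an aggregation over response codes; the index name and field names are assumptions and would need to match the real analytics mapping:

```python
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # assumption
INDEX = "analytics-*"                                   # assumption

# Count 5xx responses against total calls over the last 15 minutes,
# assuming each analytics record carries a numeric status_code field.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-15m"}}},
    "aggs": {
        "calls": {"value_count": {"field": "status_code"}},
        "errors": {"filter": {"range": {"status_code": {"gte": 500}}}},
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=10)
resp.raise_for_status()
aggs = resp.json()["aggregations"]

calls = aggs["calls"]["value"]
errors = aggs["errors"]["doc_count"]
print(f"error rate: {errors}/{calls}")
```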

Analytics

Ingestion of Analytics Events
  • Error level for inbound data to Analytics l
  • WSM Agent lost record count from DataPower cs
  • Logstash pipeline stats s
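
For the Logstash pipeline stats, versions of Logstash with the monitoring API (5.0 and later) expose event counters over HTTP on port 9600. A minimal sketch of a script that reads them, with the host name as a placeholder:

```python
import requests

LOGSTASH_API = "http://logstash.example.internal:9600"  # placeholder host

# Node stats API: event counters for the running pipeline
stats = requests.get(f"{LOGSTASH_API}/_node/stats/events", timeout=5).json()
events = stats["events"]
print("in:", events["in"], "filtered:", events["filtered"], "out:", events["out"])
```
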
Health of Analytics Cluster
  • Cluster health cs
  • Shard Counts cs
  • Pending Tasks cs
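
The cluster health figures come straight from Elasticsearch's _cluster/health endpoint, which also reports shard counts and pending tasks. A sketch of a custom script in the collectd exec style (host identifier, metric names and interval are placeholders):

```python
import time
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # assumption
HOSTNAME = "analytics-monitor"                          # placeholder identifier
INTERVAL = 60

STATUS_VALUES = {"green": 0, "yellow": 1, "red": 2}  # numeric encoding for graphing/alerting

health = requests.get(f"{ES_URL}/_cluster/health", timeout=10).json()
now = int(time.time())

metrics = {
    "gauge-cluster_status": STATUS_VALUES.get(health["status"], 2),
    "gauge-active_shards": health["active_shards"],
    "gauge-unassigned_shards": health["unassigned_shards"],
    "gauge-pending_tasks": health["number_of_pending_tasks"],
}

# Emit collectd exec-plugin PUTVAL lines so the values land in Graphite
for name, value in metrics.items():
    print(f'PUTVAL "{HOSTNAME}/exec-es_cluster/{name}" interval={INTERVAL} {now}:{value}')
```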

API Manager

  • Apache connection levels cp
  • Kibana health s
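
The collectd Apache plugin reads the connection levels from mod_status. For reference, the machine-readable ?auto output is simple to parse by hand; a sketch, with the status URL as an assumption:

```python
import requests

STATUS_URL = "http://apimanager.example.internal/server-status?auto"  # assumption

# mod_status ?auto output is "Key: value" per line, e.g. "BusyWorkers: 12"
text = requests.get(STATUS_URL, timeout=5).text
status = {}
for line in text.splitlines():
    if ":" in line:
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()

print("busy:", status.get("BusyWorkers"), "idle:", status.get("IdleWorkers"))
```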

Informix

  • Tablespace usage cs
  • Bufferpool usage cs

Overall Product usage

Internal REST endpoints are used to generate regular reports covering data such as:

  • Number of provider orgs
  • Number of API definitions
  • Policy usage
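
A sketch of what such a report script might look like; the endpoint path, response shape and credential handling are entirely hypothetical, since the real internal endpoints are product-specific:

```python
import requests

BASE_URL = "https://apimanager.example.internal"   # assumption
REPORT_ENDPOINT = "/v1/reports/catalog-summary"    # hypothetical reporting endpoint
TOKEN = "..."                                      # placeholder credential

resp = requests.get(
    f"{BASE_URL}{REPORT_ENDPOINT}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
summary = resp.json()

# Hypothetical response shape: counts keyed by object type
for key in ("provider_orgs", "api_definitions", "policies"):
    print(key, summary.get(key, "n/a"))
```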

Externals

  • Synthetic monitoring - response status and times using Hem and Uptime
    • Sample APIs - echo, proxy, etc.
    • Product APIs - LB Healthcheck, v1/me/orgs
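
Each synthetic check boils down to calling an endpoint on a schedule and recording the response status and elapsed time, with Hem and Uptime handling the scheduling and alerting. A sketch of a single check (the URL is a placeholder):

```python
import time
import requests

CHECK_URL = "https://api.example.internal/sample/echo"  # placeholder sample API

start = time.monotonic()
try:
    resp = requests.get(CHECK_URL, timeout=10)
    status = resp.status_code
except requests.RequestException:
    status = 0  # treat connection failures as status 0
elapsed_ms = (time.monotonic() - start) * 1000

print(f"status={status} response_time_ms={elapsed_ms:.1f}")
# In practice these values would be pushed to Graphite and alerted on from Grafana.
```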

Key to data sources:

  • cs - collectd custom script
  • cp - collectd plugin
  • s - Custom script
  • a - Analytics records from ELK
  • l - Log analysis in ELK

Log Analysis

We use ELK as our centralised logging infrastructure, with all of our systems offloading their logs to it via syslog.

The logs are parsed and indexed by Logstash on the way into the cluster. Everything gets indexed, and certain patterns are picked out to raise PagerDuty alerts.

Some examples of the patterns we alert on include (a sketch of one such check follows the list):

  • Higher than normal rate of 502 or 503 errors in the Analytics inbound logs
  • Errors for known failure states, e.g. out-of-memory and database error conditions.
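
As a sketch of the first kind of check: count 502/503 responses in the Analytics inbound logs over a short window and trigger a PagerDuty incident via the Events API v2 if the count passes a threshold. The index name, field names, threshold and routing key are all assumptions:

```python
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # assumption
INDEX = "logstash-*"                                    # assumption
ROUTING_KEY = "PAGERDUTY-INTEGRATION-KEY"               # placeholder
THRESHOLD = 50                                          # errors per 5 minutes; illustrative

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-5m"}}},
                {"terms": {"response_code": [502, 503]}},       # assumed field name
                {"term": {"log_source": "analytics-inbound"}},  # assumed field name
            ]
        }
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=10)
resp.raise_for_status()
count = resp.json()["hits"]["total"]
if isinstance(count, dict):  # newer Elasticsearch wraps the total in an object
    count = count["value"]

if count > THRESHOLD:
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": f"{count} 502/503 errors in Analytics inbound logs in the last 5 minutes",
            "source": "log-analysis",
            "severity": "error",
        },
    }
    requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=10)
```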