Monitoring Metrics

We’ve standardised on Grafana for viewing metric data in dashboards and for sending alerts to PagerDuty and Slack. Below is a rough list of the different types of data we’re monitoring. The majority of these metrics are collected via collectd and stored in Graphite, although some dashboards pull data directly from our Elasticsearch logging infrastructure.
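
Metrics reach Graphite over its plaintext protocol, one line per data point in the form metric.path value timestamp, which collectd's write_graphite plugin would normally handle for us. As a minimal sketch of that interface (the host name and metric path are placeholders):

```python
import socket
import time

GRAPHITE_HOST = "graphite.example.internal"  # placeholder: the real Graphite relay
GRAPHITE_PORT = 2003                         # Graphite's default plaintext port


def send_metric(path, value, timestamp=None):
    """Send one data point using the Graphite plaintext protocol."""
    timestamp = int(timestamp or time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example: record CPU idle percentage for a hypothetical gateway VM
send_metric("servers.gateway01.cpu.idle", 87.5)
```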

System Metrics

Standard system metrics are collected by collectd from the individual VMs and pushed to Graphite; a minimal collection sketch follows the list below.

  • CPU usage
  • Network utilisation
    • Socket states (e.g. count of established connections, whether expected processes are listening)
    • Transfer rate
  • Memory usage
    • System overall
    • Process size (e.g. DataPower)
  • Disk usage
  • Load
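
Most of these come from stock collectd plugins (cpu, memory, df, load, interface). The socket-state counts are the sort of thing a small custom script can provide; a sketch of one written for collectd's exec plugin, emitting PUTVAL lines (the metric names and interval are illustrative, and it assumes a Linux /proc filesystem):

```python
#!/usr/bin/env python3
"""Sketch of a collectd exec-plugin script reporting socket states."""
import os
import time

HOSTNAME = os.uname().nodename
INTERVAL = 10  # seconds; would normally match the collectd Interval setting

TCP_ESTABLISHED = "01"  # state codes as they appear in /proc/net/tcp
TCP_LISTEN = "0A"


def count_socket_states():
    established = listening = 0
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        for line in f:
            state = line.split()[3]
            if state == TCP_ESTABLISHED:
                established += 1
            elif state == TCP_LISTEN:
                listening += 1
    return established, listening


while True:
    est, lsn = count_socket_states()
    now = int(time.time())
    # PUTVAL "<host>/<plugin>-<instance>/<type>-<instance>" interval=<n> <time>:<value>
    print(f'PUTVAL "{HOSTNAME}/exec-sockets/gauge-established" interval={INTERVAL} {now}:{est}', flush=True)
    print(f'PUTVAL "{HOSTNAME}/exec-sockets/gauge-listening" interval={INTERVAL} {now}:{lsn}', flush=True)
    time.sleep(INTERVAL)
```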

Product Metrics

API Invocation

  • Transactions per second [ Count of ExtLatency logs from DataPower ] l
  • Rate of errors (for stable APIs) [ Analysis of response codes from analytics ] a
  • Response time a
  • Spread of traffic across gateways a
  • Deep dive into individual APIs a
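
Most of these panels are driven by the analytics records in Elasticsearch. As an illustration of the query behind an error-rate panel, here is a sketch of an aggregation over response codes; the index name and field names are assumptions and would need to match the real analytics mapping:

```python
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # assumption
INDEX = "analytics-*"                                   # assumption

# Count 5xx responses against total calls over the last 15 minutes,
# assuming each analytics record carries a numeric status_code field.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-15m"}}},
    "aggs": {
        "calls": {"value_count": {"field": "status_code"}},
        "errors": {"filter": {"range": {"status_code": {"gte": 500}}}},
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=10)
resp.raise_for_status()
aggs = resp.json()["aggregations"]

calls = aggs["calls"]["value"]
errors = aggs["errors"]["doc_count"]
print(f"error rate: {errors}/{calls}")
```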

Analytics

Ingestion of Analytics Events
  • Error level for inbound data to Analytics l
  • WSM Agent lost record count from DataPower cs
  • Logstash pipeline stats s
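
For the Logstash pipeline stats, versions of Logstash with the monitoring API (5.0 and later) expose event counters over HTTP on port 9600. A minimal sketch of a script that reads them, with the host name as a placeholder:

```python
import requests

LOGSTASH_API = "http://logstash.example.internal:9600"  # placeholder host

# Node stats API: event counters for the running pipeline
stats = requests.get(f"{LOGSTASH_API}/_node/stats/events", timeout=5).json()
events = stats["events"]
print("in:", events["in"], "filtered:", events["filtered"], "out:", events["out"])
```
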
Health of Analytics Cluster
  • Cluster health cs
  • Shard Counts cs
  • Pending Tasks cs
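
The cluster health figures come straight from Elasticsearch's _cluster/health endpoint, which also reports shard counts and pending tasks. A sketch of a custom script in the collectd exec style (host identifier, metric names and interval are placeholders):

```python
import time
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # assumption
HOSTNAME = "analytics-monitor"                          # placeholder identifier
INTERVAL = 60

STATUS_VALUES = {"green": 0, "yellow": 1, "red": 2}  # numeric encoding for graphing/alerting

health = requests.get(f"{ES_URL}/_cluster/health", timeout=10).json()
now = int(time.time())

metrics = {
    "gauge-cluster_status": STATUS_VALUES.get(health["status"], 2),
    "gauge-active_shards": health["active_shards"],
    "gauge-unassigned_shards": health["unassigned_shards"],
    "gauge-pending_tasks": health["number_of_pending_tasks"],
}

# Emit collectd exec-plugin PUTVAL lines so the values land in Graphite
for name, value in metrics.items():
    print(f'PUTVAL "{HOSTNAME}/exec-es_cluster/{name}" interval={INTERVAL} {now}:{value}')
```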

API Manager

  • Apache connection levels cp
  • Kibana health s
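
The collectd Apache plugin reads the connection levels from mod_status. For reference, the machine-readable ?auto output is simple to parse by hand; a sketch, with the status URL as an assumption:

```python
import requests

STATUS_URL = "http://apimanager.example.internal/server-status?auto"  # assumption

# mod_status ?auto output is "Key: value" per line, e.g. "BusyWorkers: 12"
text = requests.get(STATUS_URL, timeout=5).text
status = {}
for line in text.splitlines():
    if ":" in line:
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()

print("busy:", status.get("BusyWorkers"), "idle:", status.get("IdleWorkers"))
```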

Informix

  • Tablespace usage cs
  • Bufferpool usage cs

Overall Product usage

Internal REST endpoints are used to generate regular reports covering data such as:

  • Number of provider orgs
  • Number of API definitions
  • Policy usage
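
A sketch of what such a report script might look like; the endpoint path, response shape and credential handling are entirely hypothetical, since the real internal endpoints are product-specific:

```python
import requests

BASE_URL = "https://apimanager.example.internal"   # assumption
REPORT_ENDPOINT = "/v1/reports/catalog-summary"    # hypothetical reporting endpoint
TOKEN = "..."                                      # placeholder credential

resp = requests.get(
    f"{BASE_URL}{REPORT_ENDPOINT}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
summary = resp.json()

# Hypothetical response shape: counts keyed by object type
for key in ("provider_orgs", "api_definitions", "policies"):
    print(key, summary.get(key, "n/a"))
```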

Externals

  • Synthetic monitoring - response status and times using Hem and Uptime
    • Sample APIs - echo, proxy, etc.
    • Product APIs - LB Healthcheck, v1/me/orgs
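
Each synthetic check boils down to calling an endpoint on a schedule and recording the response status and elapsed time, with Hem and Uptime handling the scheduling and alerting. A sketch of a single check (the URL is a placeholder):

```python
import time
import requests

CHECK_URL = "https://api.example.internal/sample/echo"  # placeholder sample API

start = time.monotonic()
try:
    resp = requests.get(CHECK_URL, timeout=10)
    status = resp.status_code
except requests.RequestException:
    status = 0  # treat connection failures as status 0
elapsed_ms = (time.monotonic() - start) * 1000

print(f"status={status} response_time_ms={elapsed_ms:.1f}")
# In practice these values would be pushed to Graphite and alerted on from Grafana.
```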

Key to data sources:

  • cs - collectd custom script
  • cp - collectd plugin
  • s - Custom script
  • a - Analytics records from ELK
  • l - Log analysis in ELK

Log Analysis

We use ELK as our centralised logging infrastructure, with all of our systems offloading their logs to it via syslog.

The logs are parsed and indexed by Logstash on the way into the cluster. Everything gets indexed, and certain patterns are picked out to raise PagerDuty alerts.

Some examples of the patterns we alert on include (a sketch of one such check follows the list):

  • Higher than normal rate of 502 or 503 errors in the Analytics inbound logs
  • Errors for known failure states, e.g. out-of-memory and database error conditions.
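
As a sketch of the first kind of check: count 502/503 responses in the Analytics inbound logs over a short window and trigger a PagerDuty incident via the Events API v2 if the count passes a threshold. The index name, field names, threshold and routing key are all assumptions:

```python
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # assumption
INDEX = "logstash-*"                                    # assumption
ROUTING_KEY = "PAGERDUTY-INTEGRATION-KEY"               # placeholder
THRESHOLD = 50                                          # errors per 5 minutes; illustrative

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-5m"}}},
                {"terms": {"response_code": [502, 503]}},       # assumed field name
                {"term": {"log_source": "analytics-inbound"}},  # assumed field name
            ]
        }
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=10)
resp.raise_for_status()
count = resp.json()["hits"]["total"]
if isinstance(count, dict):  # newer Elasticsearch wraps the total in an object
    count = count["value"]

if count > THRESHOLD:
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": f"{count} 502/503 errors in Analytics inbound logs in the last 5 minutes",
            "source": "log-analysis",
            "severity": "error",
        },
    }
    requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=10)
```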