Monitoring

Even though a lot has changed since I first posted an overview of monitoring API Connect - both in the product itself and in the types and number of stacks we’re running - the main areas we monitor haven’t.

We are still using Grafana as a central location for dashboarding and analysing data across different data sources, but some of the tools we’re using to collect the data have changed. Having access to all the data in a single UI is really powerful, especially when troubleshooting or investigating events across the systems: being able to identify correlations between data from external load balancing, response times parsed from logs and pod utilisation metrics can really help narrow in on specific components and how they impact the wider solution.

Metrics

Metrics flow

For metrics we’re making use of IBM Cloud Monitoring with Sysdig to gather metrics from across the kubernetes deployment, including metrics from kubernetes itself and recognisable container applications such as nginx. We also supplement this with our own custom metrics exporter, Trawler, which we built for API Connect to extract key application-specific data and expose it to a prometheus compatible monitoring tool or send it to graphite. Examples of data gathered are counts of objects within API Manager and DataPower, and analytics call counts. For endpoint and availability monitoring we are continuing to use Hem, a simple python application that calls HTTP(S) endpoints and sends the metrics to our graphite stack. All of these then come together in our grafana dashboards - and can be pulled into new exploratory dashboards whilst problem solving as needed.
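To give a feel for what exposing metrics to a prometheus compatible tool looks like in practice, here is a minimal sketch of a custom exporter in Python using the prometheus_client library. It is illustrative only - not Trawler’s actual code - and get_catalog_count is a hypothetical stand-in for a call to the management APIs.

    # Illustrative sketch of a custom exporter, not Trawler's actual code
    from prometheus_client import Gauge, start_http_server
    import time

    # gauge named in the same style as the metrics listed later in this post
    catalogs = Gauge('apiconnect_catalogs_total', 'Number of catalogs in API Manager')

    def get_catalog_count():
        # hypothetical stand-in for a call to the management subsystem APIs
        return 3

    if __name__ == '__main__':
        start_http_server(8000)   # serves the /metrics endpoint on port 8000
        while True:
            catalogs.set(get_catalog_count())
            time.sleep(60)        # refresh the value once a minute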

Logging

Logging flow

For our logging infrastructure, we continue to use Elastic, making use of the filebeat agent within our clusters to gather and tag the container logs, then some custom parsing in logstash to pull out the significant elements from the different logs so that we can easily correlate them with events going on in the system. A lot of the time this data is then viewed in timeseries graphs within grafana, but it is also linked to Kibana views to dig deeper into the logs themselves.
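The actual parsing lives in our logstash filter configuration, but as an illustration of the kind of extraction involved, here is a small Python sketch that pulls the status code and response time out of an nginx-style access log line. It assumes a log format that appends the request time - adjust the pattern to whatever your logs actually contain.

    # Illustration of the kind of field extraction our logstash filters do;
    # assumes an nginx-style access log with the request time appended.
    import re

    LOG_PATTERN = re.compile(
        r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" '
        r'(?P<status>\d{3}) (?P<bytes>\d+) (?P<request_time>[\d.]+)'
    )

    line = '10.0.0.1 - - [01/Jan/2024:12:00:00 +0000] "GET /api/orders HTTP/1.1" 200 512 0.042'
    match = LOG_PATTERN.match(line)
    if match:
        fields = match.groupdict()
        print(fields['status'], fields['request_time'])  # -> 200 0.042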

As part of running and monitoring our API Connect cloud deployments we’ve built some of our own tooling to help keep track of what is going on inside them. Trawler is one of these: it gathers metrics from a Kubernetes-based deployment of API Connect.

Trawler runs within kubernetes alongside API Connect, identifies the API Connect components and exposes metrics to prometheus (or other compatible monitoring tooling).

This data can then be used to feed into dashboards such as this one in Grafana:

Grafana dashboard

Trawler is open-source and available on GitHub and Docker Hub - see the installation guide for more information on using Trawler for yourself.

The kinds of metrics that Trawler currently collects are as follows (a short sketch of reading them back follows the lists):

Management subsystem:

  • API Connect version information (apiconnect_build_info)
  • Total users (apiconnect_users_total)
  • Number of provider orgs (apiconnect_provider_orgs_total)
  • Number of consumer orgs (apiconnect_consumer_orgs_total)
  • Number of catalogs (apiconnect_catalogs_total)
  • Number of draft products / apis (apiconnect_draft_products_total / apiconnect_draft_apis_total)
  • Number of products / apis (apiconnect_products_total / apiconnect_apis_total)
  • Number of subscriptions (apiconnect_subscriptions_total)

DataPower subsystem:

  • TCP connection stats (datapower_tcp…)
  • Log target stats: events processed, dropped, pending (datapower_logtarget…)
  • Object counts e.g. SSLClientProfile, APICollection, APIOperation etc. (datapower_{object}_total)
  • HTTP Stats (datapower_http_tenSeconds/oneMinute/tenMinutes/oneDay)

Analytics subsystem:

  • Cluster health status (analytics_cluster_status)
  • Number of nodes in the cluster (analytics_data_nodes_total/analytics_nodes_total)
  • Number of shards in states - active, relocating, initialising, unassigned (analytics_{state}_shards_total)
  • Number of pending tasks (analytics_pending_tasks_total)
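As a quick sketch of reading these back, the following fetches Trawler’s prometheus-format output over HTTP and prints the API Connect gauges. The URL is an assumption - substitute whatever address and port your deployment actually exposes.

    # Minimal sketch: fetch Trawler's prometheus-format output and print the
    # API Connect metrics. The URL below is an assumption - adjust it to the
    # address and port your Trawler instance actually listens on.
    import requests

    TRAWLER_METRICS_URL = 'http://localhost:8000/metrics'  # assumed address

    resp = requests.get(TRAWLER_METRICS_URL, timeout=10)
    for line in resp.text.splitlines():
        # skip the HELP/TYPE comment lines and keep the API Connect metrics
        if line.startswith('apiconnect_'):
            name, _, value = line.rpartition(' ')
            print(name, value)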

Introducing hem

hem is a synthetic monitoring tool which checks HTTP resources on a regular schedule, storing details of the time taken and the response code returned.

I’ve been using Uptime at work for a while for endpoint monitoring, and over the time we’ve been using it we’ve made a few tweaks and plugins for it - in particular being able to send metrics from Uptime to Graphite. There were also some more substantial changes we were considering making, and we’d built up a number of supporting scripts to populate the checks via the Uptime API when hosts changed. We also have all our other monitoring dashboards in Grafana. With all that in mind, I decided that what would be nice is a simple tool that could replace the checking piece and feed that data into our graphite data store, to be viewed and alerted on from Grafana.
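At its heart that checking piece is small. The sketch below shows the general idea - time an HTTP request and push the duration and status code to graphite over its plaintext protocol - but it is illustrative only, not hem’s actual implementation, and the metric names are made up for the example.

    # Illustrative only, not hem's actual implementation. Assumes a graphite
    # carbon listener on 127.0.0.1:2003 (the plaintext protocol port).
    import socket
    import time
    import requests

    GRAPHITE = ('127.0.0.1', 2003)

    def check(url, metric_prefix):
        start = time.time()
        response = requests.get(url, timeout=10)
        elapsed = time.time() - start
        now = int(time.time())
        lines = (
            f'{metric_prefix}.duration {elapsed:.3f} {now}\n'
            f'{metric_prefix}.status {response.status_code} {now}\n'
        )
        with socket.create_connection(GRAPHITE) as sock:
            sock.sendall(lines.encode())

    check('https://example.com/index.html', 'hem.homepage.example_com')  # example names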

hem runs from a simple config file with three main sections in it - discovery, tests and metrics. Both discovery and metrics have been designed to be pluggable to give hem versatility - so far I’ve built discovery drivers for dns, consul and json/yaml, and metrics drivers for graphite, kafka and the console. hem will iterate over the tests on a configurable interval, performing discovery each time to ensure it has the latest list of hosts for that test.
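For a rough idea of what the dns discovery step does - resolving a name to whatever set of hosts currently sits behind it before each round of tests - here is a small sketch (again illustrative, not hem’s actual driver):

    # Rough illustration of dns-based discovery: resolve a name to the
    # current list of addresses before each round of checks.
    import socket

    def discover(name):
        _, _, addresses = socket.gethostbyname_ex(name)
        return addresses

    print(discover('example.com'))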

hem stats in Grafana steps

Getting started with hem

To start using hem, you can install it from PyPI with pip:

pip install hemApp

Then create a config file - it will look something like this:

    discovery:
      type: dns
    metrics:
      type: graphite
      server: 127.0.0.1
      port: 2003
    tests:
      homepage:
        path: /index.html
        secure: false
        hosts:
           - example.com
           - example.org

Run hem and start to see metrics flowing to graphite:

hem -c config.yaml
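One way to confirm the data is arriving is to query graphite’s render API. The host, port and target pattern below are assumptions - point it at your graphite-web instance and adjust the target to match the metric names you see appearing.

    # Check recent datapoints via graphite's render API. The URL and the
    # target pattern are assumptions - adjust them for your own setup.
    import requests

    params = {'target': 'hem.homepage.*', 'format': 'json', 'from': '-10min'}
    resp = requests.get('http://127.0.0.1:8080/render', params=params, timeout=10)
    for series in resp.json():
        print(series['target'], series['datapoints'][-1])  # latest [value, timestamp]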

In Grafana I have the Discrete plugin installed to give the coloured bar look you see above.