Brain Droppings

From the mind of Chris Weibel


Maintenance For Cloud Foundry Autoscaler

22 Mar 2021


Photo by Bofu Shaw on Unsplash

This is a pretty niche area of Cloud Foundry; move along unless you've been running CF Autoscaler for, oh, 4 years with several thousand apps being monitored for scaling. My copy of autoscaler is a wee bit older as well; I believe modern versions keep much of this in memory.

Patient Symptoms

I had an issue where the SHIELD backup of the autoscaler databases was taking longer and longer, and I finally got a ticket so I could take a look. On first inspection the db was pretty chubby; I don't run into 50GB+ Postgres boxes often anymore. I hopped onto the instance and dumped the top 10 relations by size:

    schema_name     |       relname        |  size   | table_size
--------------------+----------------------+---------+-------------
 public             | idx_instance_metrics | 23 GB   | 24695881728
 public             | appinstancemetrics   | 15 GB   | 15856132096
 public             | index_app_metrics    | 4722 MB |  4951744512
 public             | app_metric           | 2753 MB |  2886754304
 public             | scalinghistory       | 686 MB  |   719765504
 public             | scalingcooldown      | 784 kB  |      802816
 public             | policy_json          | 96 kB   |       98304
 public             | binding              | 72 kB   |       73728
 public             | pk_binding           | 64 kB   |       65536
 information_schema | sql_features         | 56 kB   |       57344
(10 rows)
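If you want to reproduce that listing, a query along these lines against the pg_class catalog does the trick (this is a sketch; any of the usual "top relations by size" queries will give you the same picture):

-- Top 10 relations (tables and indexes) by on-disk size,
-- skipping the system catalogs.
SELECT n.nspname AS schema_name,
       c.relname,
       pg_size_pretty(pg_relation_size(c.oid)) AS size,
       pg_relation_size(c.oid) AS table_size
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE n.nspname NOT IN ('pg_catalog', 'pg_toast')
 ORDER BY pg_relation_size(c.oid) DESC
 LIMIT 10;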

After a bit of poking, I grabbed the schemas for the appinstancemetrics, app_metric, and scalinghistory tables and eventually figured out that each holds 30 days of history.
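A quick way to confirm the retention window is to compare the oldest and newest rows. The timestamp column name and epoch-seconds units below are assumptions based on my copy of the schema (check \d appinstancemetrics first); adjust for your version:

-- Rough retention check: oldest vs newest metric row.
-- The "timestamp" column name and epoch-seconds units are assumptions
-- from my schema; your autoscaler version may store these differently.
SELECT to_timestamp(min("timestamp")) AS oldest,
       to_timestamp(max("timestamp")) AS newest
  FROM appinstancemetrics;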

At this point, cool, I don't have to do anything fancy to rotate out the data, but the INDEXes are bigger than the tables themselves. It turns out the code prunes old rows with DELETE FROM queries, which leads to index bloat. After running REINDEX TABLE commands I was able to shrink down the monster indexes:

    schema_name     |       relname        |  size   | table_size
--------------------+----------------------+---------+-------------
 public             | appinstancemetrics   | 15 GB   | 15856132096
 public             | idx_instance_metrics | 8746 MB |  9170403328
 public             | app_metric           | 2753 MB |  2886754304
 public             | index_app_metrics    | 1783 MB |  1869365248
 public             | scalinghistory       | 686 MB  |   719765504
 public             | scalingcooldown      | 784 kB  |      802816
 public             | policy_json          | 96 kB   |       98304
 public             | binding              | 72 kB   |       73728
 public             | pk_binding           | 64 kB   |       65536
 information_schema | sql_features         | 56 kB   |       57344
(10 rows)
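The reindexing itself was nothing fancy, just one statement per bloated table. Keep in mind that plain REINDEX blocks writes on the table while it runs (REINDEX CONCURRENTLY didn't show up until Postgres 12), so pick a quiet window:

-- Rebuild the indexes bloated by the DELETE FROM pruning.
-- Plain REINDEX holds locks that block writes for the duration.
REINDEX TABLE appinstancemetrics;
REINDEX TABLE app_metric;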

Surgery To Fix The Problem

Cool, SELECT statements are peppier now and it's using less disk, but the time to do a backup hasn't changed: pg_dump only records the index definitions, so the real weight in the dump is the table data, which is still the same 30 days it always was. To make this faster I need to either throw more hardware at it (nope) or configure it to hold less data. Eventually I found that the number of days of history in each of these tables is controlled in the BOSH release at:

  autoscaler.operator.instance_metrics_db.cutoff_days:
    description: "the cutoff days when pruning instancemetrics database"
    default: 30
  autoscaler.operator.app_metrics_db.cutoff_days:
    description: "the cutoff days when pruning appmetrics database"
    default: 30
  autoscaler.operator.scaling_engine_db.cutoff_days:
    description: "the cutoff days when pruning scalingengine database"
    default: 30
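Dialing those down is a standard BOSH property override. Here's a sketch of an ops file that trims everything to 7 days; the instance group and job names are assumptions on my part, so match them against your actual autoscaler deployment manifest:

# Sketch only: trims retention to 7 days of history.
# The instance group name (asoperator) and job name (operator) are
# assumptions; check your autoscaler deployment manifest for the real ones.
- type: replace
  path: /instance_groups/name=asoperator/jobs/name=operator/properties/autoscaler/operator/instance_metrics_db/cutoff_days
  value: 7
- type: replace
  path: /instance_groups/name=asoperator/jobs/name=operator/properties/autoscaler/operator/app_metrics_db/cutoff_days
  value: 7
- type: replace
  path: /instance_groups/name=asoperator/jobs/name=operator/properties/autoscaler/operator/scaling_engine_db/cutoff_days
  value: 7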

I got to play with Postgres. 'Twas a good day.