Amazon Web Services (AWS) is one of the most popular public cloud providers today. Over the years, AWS’ services have expanded from cloud computing to application development and security. To maintain the reliability, availability, and performance of your AWS instances, an AWS cloud monitoring solution is a must. It’s critical for AWS monitoring tools to collect data from all parts of your AWS service, so that multi-point failures can be easily debugged.
In this blog, we’ll focus on some key metrics that most AWS monitoring tools use to monitor a widely-used AWS service: Amazon Relational Database Service (RDS). We’ll also touch on Amazon Aurora, a MySQL and PostgreSQL-compatible database (DB) that’s available as part of Amazon RDS.
Security, scalability, easy setup, high availability, and cost-effectiveness are some of the most prominent features of Amazon RDS. RDS supports six major DB engines: MySQL, PostgreSQL, MariaDB, Oracle Database, SQL Server, and Amazon Aurora. This wide support helps any application or tool work seamlessly with Amazon RDS. To comprehensively monitor Amazon RDS using an AWS monitoring tool like Applications Manager, there are some key metrics you need to track.
CPU utilization measures the percentage of allocated compute units currently used by your RDS instances, and can also be used to track CPU performance regressions or improvements. Applications can become unavailable when they reach their upper limits on CPU usage. Each instance is limited to a certain amount of CPU. Tracking CPU utilization across your RDS instance can help you determine if your applications are overworked or underworked.
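As a rough illustration of the idea above, the sketch below builds the request parameters for the CloudWatch `CPUUtilization` metric in the `AWS/RDS` namespace and labels an instance from the returned averages. The helper names and the 80% and 10% thresholds are assumptions made for this example, not recommended values; pick limits that suit your workload.

```python
from datetime import datetime, timedelta, timezone


def cpu_query_params(instance_id, hours=1, period_seconds=300):
    """Build keyword arguments for a CloudWatch GetMetricStatistics request."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average"],
    }


def classify_cpu(datapoints, high=80.0, low=10.0):
    """Label an instance from the mean of its 'Average' CPU datapoints.

    The high/low thresholds are illustrative assumptions.
    """
    if not datapoints:
        return "no data"
    mean_cpu = sum(point["Average"] for point in datapoints) / len(datapoints)
    if mean_cpu >= high:
        return "overworked"
    if mean_cpu <= low:
        return "underworked"
    return "normal"
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").get_metric_statistics(**cpu_query_params("my-db"))`, and the `Datapoints` list in the response handed to `classify_cpu`. The instance name `my-db` is a placeholder.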
Maintain system performance and availability by setting up alerts for changes in memory usage patterns. Lack of storage space in DB instances can lead to data loss and application bottlenecks. Scale up your DB instance when you approach your storage capacity limits. To accommodate any unforeseen demands from your applications, it’s critical to have some buffer in storage and memory.
A very low free memory value indicates that the DB is under memory pressure. If you encounter performance issues or there’s no free memory left, you need to upgrade to a larger instance. Also, for optimal RDS monitoring, always ensure that your DB instance isn’t memory-constrained.
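One simple way to reason about this buffer is to express free memory or free storage as a percentage of the total and compare it with a minimum you are comfortable with. The sketch below assumes you already have the free and total values in bytes (for example, from the CloudWatch `FreeableMemory` and `FreeStorageSpace` metrics plus your instance's known capacity). The function names and the 20% default are assumptions for illustration.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def headroom_percent(free_bytes, total_bytes):
    """Return the free capacity as a percentage of the total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def needs_scale_up(free_bytes, total_bytes, min_free_percent=20.0):
    """Return True when the remaining buffer drops below the chosen minimum.

    The 20% default is an illustrative assumption, not an AWS recommendation.
    """
    return headroom_percent(free_bytes, total_bytes) < min_free_percent
```

For example, `needs_scale_up(10 * GIB, 100 * GIB)` returns `True`, because only 10% of the capacity is left.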
Network traffic is highly dependent on the expected throughput. Maintain the expected throughput for your network by keeping track of critical network traffic metrics like receive throughput and transmit throughput. Like CPU, memory, and storage, each instance needs to have a certain amount of network bandwidth dedicated to it.
The amount of network bandwidth allocated to your DB instance is determined by the instance size. Smaller instances have less bandwidth, whereas bigger instances have more.
Capture query latency to measure how long your I/O operations take at the disk level. To maintain the expected values of your IOPS metrics, set up a baseline value and investigate if the results vary from it.
Read IOPS: Sudden spikes in read IOPS might indicate runaway queries.
Write IOPS: Sudden spikes in write IOPS might indicate large data modification.
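The baseline approach described above can be sketched as follows: take an initial window of IOPS samples as the baseline, then flag later samples that rise well above it. Using the mean plus three standard deviations as the cutoff is an assumption made for this example; other baselining methods work just as well.

```python
from statistics import mean, stdev


def find_spikes(samples, baseline_size=12, sigmas=3.0):
    """Return the indexes of samples that exceed the baseline threshold.

    The first `baseline_size` samples define the baseline; the threshold is
    their mean plus `sigmas` standard deviations (an illustrative choice).
    """
    baseline = samples[:baseline_size]
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    threshold = mean(baseline) + sigmas * stdev(baseline)
    return [
        index
        for index, value in enumerate(samples[baseline_size:], start=baseline_size)
        if value > threshold
    ]
```

The same function can be applied separately to read IOPS and write IOPS series, since a spike in each points to a different kind of cause.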
Keep your storage volumes in pace with the volume of read and write requests by tracking the I/O operations queue. To minimize read and write operations and optimize your applications’ performance, ensure that your typical working set fits into the allocated memory.
Measuring latency can help you identify and investigate resource constraints affecting DB performance. Monitor latency in transactions for slow reads or writes of any application running in your RDS environment.
An Amazon Aurora DB cluster consists of one or more DB instances: a primary DB instance and up to 15 Aurora replicas, which you can distribute across multiple availability zones. The primary instance handles read and write operations, while the Aurora replicas support read-only queries. The primary instance and the Aurora replicas all store their data in a common cluster volume. Because the cluster volume is shared by all instances in the DB cluster, it’s easy to replicate data for each Aurora replica.
An Aurora DB cluster provides three types of endpoints:
Cluster endpoint: Connects to the primary DB instance for that DB cluster.
Reader endpoint: Connects to one of the available Aurora replicas for that DB cluster. The cluster and reader endpoints provide support for high-availability scenarios.
Instance endpoint: Connects to a specific DB instance within an Aurora cluster. Each DB instance in a DB cluster, regardless of instance type, has its own unique instance endpoint.
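To make the split between the cluster and reader endpoints concrete, here is a minimal sketch that routes write traffic to the cluster endpoint and read-only traffic to the reader endpoint. The class, the function, and the hostnames are hypothetical placeholders invented for this example; substitute the endpoints shown for your own cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuroraEndpoints:
    cluster: str  # connects to the primary DB instance (reads and writes)
    reader: str   # connects to one of the available Aurora replicas (reads only)


def endpoint_for(endpoints, read_only):
    """Pick the reader endpoint for read-only work, else the cluster endpoint."""
    return endpoints.reader if read_only else endpoints.cluster


# Placeholder hostnames, not real endpoints.
EXAMPLE = AuroraEndpoints(
    cluster="example.cluster-placeholder.us-east-1.rds.amazonaws.com",
    reader="example.cluster-ro-placeholder.us-east-1.rds.amazonaws.com",
)
```

Here `endpoint_for(EXAMPLE, read_only=True)` returns the reader hostname, while `endpoint_for(EXAMPLE, read_only=False)` returns the cluster hostname.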
Read replica metrics
Aurora replicas work perfectly for read scaling as they are fully dedicated to read operations on cluster volumes. High lag values indicate that read operations from the replica are not serving the current data.
Write operations are managed by the primary instance. When data is written to Aurora, Aurora writes it to all data copies in the cluster volume. After the primary instance has written an update, Aurora replicas return the same data in query results with minimal replica lag, usually less than 100 milliseconds. Replica lag varies depending on the rate of DB change. A large number of write operations may cause an increase in replica lag.
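A lag check along these lines could be sketched as below. It assumes you have already collected the latest lag value in milliseconds for each replica (for example, from the CloudWatch `AuroraReplicaLag` metric). The function name and replica names are hypothetical, and the 100 ms default simply mirrors the figure mentioned above.

```python
def lagging_replicas(lag_ms_by_replica, threshold_ms=100.0):
    """Return the names of replicas whose lag exceeds the threshold, sorted."""
    return sorted(
        name for name, lag_ms in lag_ms_by_replica.items() if lag_ms > threshold_ms
    )
```

For instance, `lagging_replicas({"replica-a": 18.0, "replica-b": 240.0})` returns `["replica-b"]`, which tells you that reads served by that replica may be returning stale data.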
If the primary instance fails, an Aurora replica is promoted to primary instance to maintain high availability. If a failover occurs and no Aurora replicas are available, your DB cluster will be unavailable for however long it takes the DB instance to recover from the failure event.
The VolumeReadIOPs and VolumeWriteIOPs metrics measure the number of read and write operations on the cluster volume, reported at five-minute intervals. Under normal conditions, the value of VolumeReadIOPs should be small and stable. If you see any unusual spikes in your read I/O, investigate your DB instances to identify the cause.
Buffer cache hit ratio
This term refers to the percentage of queries served by data already stored in memory. With this metric, you can get deep insight into the amount of data being served from memory. A high buffer cache hit ratio indicates that queries don’t have to access the disk to fetch data. A low buffer cache hit ratio indicates that the queries in the DB instance are going to the disk more often than not.
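As a quick worked example of the ratio, the sketch below computes it from two counters: requests served from memory (hits) and requests that had to go to disk (misses). The function name and the counter inputs are assumptions for illustration; monitoring tools typically report the ratio directly.

```python
def buffer_cache_hit_ratio(hits, misses):
    """Return the percentage of requests served from memory.

    Returns None when there were no requests at all, since the ratio is
    undefined in that case.
    """
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total
```

With 990 requests served from memory and 10 from disk, `buffer_cache_hit_ratio(990, 10)` gives 99.0, a healthy ratio. A result closer to 50 would mean roughly every second request reaches the disk.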
Query performance and throughput
Understand DB operations by tracking query throughput. Capture a critical measure of query performance, irrespective of whether the query is served from the query cache, by measuring DDL throughput and latency for all DDL requests. Avoid performance bottlenecks by setting up alerts when sudden changes in the query volume occur.
If you’re just getting started with AWS cloud monitoring or RDS monitoring, the above metrics will help you see the bigger picture of your applications. To learn how Applications Manager can help you with AWS monitoring, get a 30-day free trial.