This post was written by Eunice Aguilar and Francisco Rodera from REA Group.
Enterprises that need to share and access large amounts of data across multiple domains and services need to build a cloud infrastructure that scales as needs change. REA Group, a digital business that specializes in real estate property, solved this problem using Amazon Managed Streaming for Apache Kafka (Amazon MSK) and a data streaming platform called Hydro.
REA Group’s team of more than 3,000 people is guided by our purpose: to change the way the world experiences property. We help people with all aspects of their property experience—not just buying, selling, and renting—through the richest content, data and insights, valuation estimates, and home financing solutions. We deliver unparalleled value to our customers, Australia’s real estate agents, by providing access to the largest and most engaged audience of property seekers.
To achieve this, the different technical products within the company regularly need to move data across domains and services efficiently and reliably.
Within the Data Platform team, we have built a data streaming platform called Hydro to provide this capability across the whole organization. Hydro is powered by Amazon MSK and other tools with which teams can move, transform, and publish data at low latency using event-driven architectures. This type of structure is foundational at REA for building microservices and timely data processing for real-time and batch use cases like time-sensitive outbound messaging, personalization, and machine learning (ML).
In this post, we share our approach to MSK cluster capacity planning.
The problem
Hydro manages a large-scale Amazon MSK infrastructure by providing configuration abstractions, allowing users to focus on delivering value to REA without the cognitive overhead of infrastructure management. As the use of Hydro grows within REA, it’s crucial to perform capacity planning to meet user demands while maintaining optimal performance and cost-efficiency.
Hydro uses provisioned MSK clusters in development and production environments. In each environment, Hydro manages a single MSK cluster that hosts multiple tenants with differing workload requirements. Proper capacity planning makes sure the clusters can handle high traffic and provide all users with the desired level of service.
Real-time streaming is a relatively new technology at REA. Many users aren’t yet familiar with Apache Kafka, and accurately assessing their workload requirements can be challenging. As the custodians of the Hydro platform, it’s our responsibility to find a way to perform capacity planning to proactively assess the impact of the user workloads on our clusters.
Goals
Capacity planning involves determining the appropriate size and configuration of the cluster based on current and projected workloads, as well as considering factors such as data replication, network bandwidth, and storage capacity.
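To make the kind of model involved concrete, here is a simplified sketch (not Hydro's actual model) of how per-broker network load can be estimated from aggregate producer throughput, replication factor, and the number of consumer groups:

```python
def broker_network_load(ingress_mbps: float, brokers: int,
                        replication: int, consumer_groups: int) -> dict:
    """Approximate per-broker network traffic for a Kafka cluster.

    Assumes producer traffic is spread evenly across brokers and every
    consumer group reads the full stream -- a deliberate simplification.
    """
    per_broker_in = ingress_mbps / brokers
    return {
        # each broker receives its share of producer traffic plus
        # replicated partitions from the other leaders
        "in_mbps": per_broker_in * replication,
        # each broker sends replicas to followers and serves consumers
        "out_mbps": per_broker_in * ((replication - 1) + consumer_groups),
    }

# 300 MB/s ingress over 3 brokers with replication factor 3 and 2 consumer
# groups: each broker handles 300 MB/s in and 400 MB/s out
load = broker_network_load(ingress_mbps=300, brokers=3,
                           replication=3, consumer_groups=2)
```

A model like this makes explicit why replication and consumer fan-out, not just raw producer traffic, drive the aggregate Amazon EC2 and Amazon EBS throughput a cluster consumes.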
Without proper capacity planning, Hydro clusters can become overwhelmed by high traffic and fail to provide users with the desired level of service. Therefore, it’s very important to us to invest time and resources into capacity planning to make sure Hydro clusters can deliver the performance and availability that modern applications require.
The capacity planning approach we follow for Hydro covers three main areas:
- The models used for the calculation of current and estimated future capacity needs, including the attributes used as variables in them
- The models used to assess the approximate expected capacity required for a new Hydro workload joining the platform
- The tooling available to operators and custodians to assess the platform's historical and current capacity consumption and, based on that, the available headroom
The following diagram shows the interaction of capacity usage and the precalculated maximum usage.
Although we don’t have this capability yet, the goal is to take this approach one step further in the future and predict the approximate resource depletion time, as shown in the following diagram.
To make sure our digital operations are resilient and efficient, we must maintain a comprehensive observability of our current capacity usage. This detailed oversight allows us not only to understand the performance limits of our existing infrastructure, but also to identify potential bottlenecks before they impact our services and users.
By proactively setting and monitoring well-understood thresholds, we can receive timely alerts and take necessary scaling actions. This approach makes sure our infrastructure can meet demand spikes without compromising on performance, ultimately supporting a seamless user experience and maintaining the integrity of our system.
Solution overview
The MSK clusters in Hydro are configured with the `PER_TOPIC_PER_BROKER` level of monitoring, which provides metrics at the broker and topic levels. These metrics help us determine the attributes of the cluster usage effectively.
However, displaying an excessive number of metrics on our monitoring dashboards would reduce clarity and slow down insights into the cluster. It’s more valuable to choose the most relevant metrics for capacity planning than to display numerous metrics.
Cluster usage attributes
Based on the Amazon MSK best practices guidelines, we have identified several key attributes to assess the health of the MSK cluster. These attributes include the following:
- In/out throughput
- CPU usage
- Disk space usage
- Memory usage
- Producer and consumer latency
- Producer and consumer throttling
For more information on right-sizing your clusters, see Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost, Best practices for Standard brokers, Monitor CPU usage, Monitor disk space, and Monitor Apache Kafka memory.
The following table contains the detailed list of all the attributes we use for MSK cluster capacity planning in Hydro.
| Attribute Name | Attribute Type | Units | Comments |
| --- | --- | --- | --- |
| Bytes in | Throughput | Bytes per second | Relies on the aggregate Amazon EC2 network, Amazon EBS network, and Amazon EBS storage throughput |
| Bytes out | Throughput | Bytes per second | Relies on the aggregate Amazon EC2 network, Amazon EBS network, and Amazon EBS storage throughput |
| Consumer latency | Latency | Milliseconds | High or unacceptable sustained latency values usually indicate user experience degradation before reaching actual resource (for example, CPU and memory) depletion |
| CPU usage | Capacity limits | % CPU user + CPU system | Should stay under 60% |
| Disk space usage | Persistent storage | Bytes | Should stay under 85% of provisioned storage |
| Memory usage | Capacity limits | % Memory in use | Should stay under 60% |
| Producer latency | Latency | Milliseconds | High or unacceptable sustained latency values usually indicate user experience degradation before reaching actual capacity limits or actual resource (for example, CPU or memory) depletion |
| Throttling | Capacity limits | Milliseconds, bytes, or messages | High or unacceptable sustained throttling values indicate capacity limits are being reached before actual resource (for example, CPU or memory) depletion |
By monitoring these attributes, we can quickly evaluate the performance of the clusters as we add more workloads to the platform. We then match these attributes to the relevant MSK metrics available.
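As an illustration of that matching step, a mapping like the following connects each attribute to candidate CloudWatch metrics. This is a sketch: the metric names are drawn from the Amazon MSK metric reference and should be verified against the monitoring level enabled on your cluster.

```python
# Illustrative mapping from capacity attributes to Amazon MSK CloudWatch
# metrics. Verify each name against the MSK monitoring documentation for
# the monitoring level configured on the cluster.
ATTRIBUTE_METRICS = {
    "bytes_in": ["BytesInPerSec"],
    "bytes_out": ["BytesOutPerSec"],
    "cpu_usage": ["CpuUser", "CpuSystem"],
    "disk_space_usage": ["KafkaDataLogsDiskUsed"],
    "memory_usage": ["HeapMemoryAfterGC"],
    "producer_latency": ["ProduceTotalTimeMsMean"],
    "consumer_latency": ["FetchConsumerTotalTimeMsMean"],
    "throttling": ["ProduceThrottleTime", "FetchThrottleTime"],
}
```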
Cluster capacity limits
During the initial capacity planning, our MSK clusters weren’t receiving enough traffic to give us a clear idea of their capacity limits. To address this, we used the AWS performance testing framework for Apache Kafka to evaluate the theoretical performance limits. We conducted performance and capacity tests on test MSK clusters that had the same configurations as our development and production clusters. These test scenarios gave us a more comprehensive understanding of the cluster’s performance. The following figure shows an example of a test cluster’s performance metrics.
To perform the tests within a specific time frame and budget, we focused on the test scenarios that could efficiently measure the cluster’s capacity. For instance, we conducted tests that involved sending high-throughput traffic to the cluster and creating topics with many partitions.
After every test, we collected the metrics of the test cluster and extracted the maximum values of the key cluster usage attributes. We then consolidated the results and determined the most appropriate limits of each attribute. The following screenshot shows an example of the exported test cluster’s performance metrics.
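The consolidation step can be sketched as follows (illustrative code, not Hydro's actual tooling): for each cluster usage attribute, keep the maximum value observed across all test runs.

```python
from collections import defaultdict

def consolidate_limits(test_runs: list[dict]) -> dict:
    """Take the maximum observed value of each attribute across test runs."""
    limits = defaultdict(float)
    for run in test_runs:
        for attribute, value in run.items():
            limits[attribute] = max(limits[attribute], value)
    return dict(limits)

# two hypothetical test runs; the consolidated limit is the max per attribute
runs = [
    {"bytes_in_mbps": 250.0, "cpu_pct": 48.0},
    {"bytes_in_mbps": 310.0, "cpu_pct": 55.0},
]
limits = consolidate_limits(runs)
```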
Capacity monitoring dashboards
As part of our platform management process, we conduct monthly operational reviews to maintain optimal performance. This involves analyzing an automated operational report that covers all the systems on the platform. During the review, we evaluate the service level objectives (SLOs) based on select service level indicators (SLIs) and assess the monitoring alerts triggered from the previous month. By doing so, we can identify any issues and take corrective actions.
To assist us in conducting the operational reviews and to provide us with an overview of the cluster’s usage, we developed a capacity monitoring dashboard, as shown in the following screenshot, for each environment. We built the dashboard as infrastructure as code (IaC) using the AWS Cloud Development Kit (AWS CDK). The dashboard is generated and managed automatically as a component of the platform infrastructure, along with the MSK cluster.
By defining the maximum capacity limits of the MSK cluster in a configuration file, the limits are automatically loaded into the capacity dashboard as annotations in the Amazon CloudWatch graph widgets. The capacity limits annotations are clearly visible and provide us with a view of the cluster’s capacity headroom based on usage.
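A minimal sketch of this mechanism, assuming a hypothetical JSON limits file (Hydro's real configuration format may differ), renders each limit as a CloudWatch horizontal annotation, the shape used in graph widget definitions:

```python
import json

# Hypothetical limits configuration; the real file format may differ.
limits_json = """
{
  "cpu_pct": 60,
  "disk_pct": 85,
  "memory_pct": 60
}
"""

def to_annotations(limits: dict) -> list[dict]:
    """Render capacity limits as CloudWatch horizontal annotations,
    suitable for attaching to a graph widget."""
    return [{"label": f"{name} limit", "value": value}
            for name, value in limits.items()]

annotations = to_annotations(json.loads(limits_json))
```

Keeping the limits in one configuration file means the dashboard annotations and any derived alerting stay in sync with a single source of truth.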
We determined the capacity limits for throughput, latency, and throttling through the performance testing. Capacity limits of the other metrics, such as CPU, disk space, and memory, are based on the Amazon MSK best practices guidelines.
During the operational reviews, we proactively assess the capacity monitoring dashboards to determine if more capacity needs to be added to the cluster. This approach allows us to identify and address potential performance issues before they have a significant impact on user workloads. It’s a preventative measure rather than a reactive response to a performance degradation.
Preemptive CloudWatch alarms
We have implemented preemptive CloudWatch alarms in addition to the capacity monitoring dashboards. These alarms are configured to alert us before a specific capacity metric reaches its threshold, notifying us when the sustained value reaches 80% of the capacity limit. This method of monitoring enables us to take immediate action instead of waiting for our monthly review cadence.
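The alarm thresholds can be derived mechanically from the same capacity limits. The following sketch shows the idea; the field names are illustrative, loosely following the parameters of CloudWatch's PutMetricAlarm API:

```python
def alarm_definition(metric: str, limit: float, fraction: float = 0.8) -> dict:
    """Build a preemptive alarm that fires at a fraction (default 80%)
    of the capacity limit. Field names are illustrative."""
    return {
        "metric": metric,
        "threshold": limit * fraction,
        # require several consecutive breaching periods so the alarm
        # reflects sustained usage rather than a short spike
        "evaluation_periods": 5,
        "comparison": "GreaterThanThreshold",
    }

# a 60% CPU capacity limit yields a preemptive alarm threshold of 48%
cpu_alarm = alarm_definition("CpuUser+CpuSystem", limit=60)
```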
Value added by our capacity planning approach
As operators of the Hydro platform, our approach to capacity planning has provided a consistent way to assess how far we are from the theoretical capacity limits of all our clusters, regardless of their configuration. Our capacity monitoring dashboards are a key observability instrument that we review on a regular basis; they’re also useful while troubleshooting performance issues. They help us quickly tell if capacity constraints could be a potential root cause of any ongoing issues. This means that we can use our current capacity planning approach and tooling both proactively and reactively, depending on the situation and need.
Another benefit of this approach is that we calculate the theoretical maximum usage values that a cluster with a specific configuration can withstand on a separate cluster, without impacting any actual users of the platform. We spin up short-lived MSK clusters through our AWS CDK based automation and perform capacity tests on them. We do this often to assess what impact, if any, changes to the cluster’s configuration have on the known capacity limits. If these newly calculated limits differ from the previously known ones, our feedback loop uses them to automatically update our capacity dashboards and alarms in CloudWatch.
Future evolution
Hydro is a platform that is constantly improving with the introduction of new features, such as the ability to conveniently create Kafka client applications. To meet the increasing demand, it’s essential to stay ahead of capacity planning. Although the approach discussed here has served us well so far, it’s by no means the final stage, and there are capabilities we need to extend and areas we need to improve.
Multi-cluster architecture
To support critical workloads, we’re considering a multi-cluster architecture on Amazon MSK, which would also affect our capacity planning. In the future, we plan to profile workloads based on metadata, cross-check them with capacity metrics, and place them in the appropriate MSK cluster. In addition to the existing provisioned MSK clusters, we will evaluate how the Amazon MSK Serverless cluster type can complement our platform architecture.
Usage trends
We have added CloudWatch anomaly detection graphs to our capacity monitoring dashboards to track any unusual trends. However, because the CloudWatch anomaly detection algorithm only evaluates up to 2 weeks of metric data, we will reassess its usefulness as we onboard more workloads. Aside from identifying usage trends, we will explore options to implement an algorithm with predictive capabilities to detect when MSK cluster resources degrade and deplete.
Conclusion
Initial capacity planning lays a solid foundation for future improvements and provides a safe onboarding process for workloads. To achieve optimal performance of our platform, we must make sure that our capacity planning strategy evolves in line with the platform’s growth. As a result, we maintain a close collaboration with AWS to continually develop additional features that meet our business needs and are in sync with the Amazon MSK roadmap. This makes sure we stay ahead of the curve and can deliver the best possible experience to our users.
We recommend that all Amazon MSK users start planning their capacity rather than miss out on maximizing their clusters’ potential. Implementing the strategies listed in this post is a great first step and will lead to smoother operations and significant savings in the long run.
About the Authors
Eunice Aguilar is a Staff Data Engineer at REA. She has worked in software engineering in various industries throughout the years and most recently in property data. She’s also an advocate for women interested in transitioning into tech, as well as for the well-versed, from whom she takes inspiration.
Francisco Rodera is a Staff Systems Engineer at REA. He has extensive experience building and operating large-scale distributed systems. His interests are automation, observability, and applying SRE practices to business-critical services and platforms.
Khizer Naeem is a Technical Account Manager at AWS. He specializes in Efficient Compute and has a deep passion for Linux and open-source technologies, which he leverages to help enterprise customers modernize and optimize their cloud workloads.