Hello, and welcome to capacity planning.
In this module, we will look at some of
the considerations and focus when looking at capacity planning,
as well as looking at some of the calculations used when estimating capacity.
So, considerations and focus,
a short sentence but with broad scope.
Know your business, this is probably the most important aspect of capacity planning.
The business, are you aligned with how the business wants to use the platform?
This can mean being aware in advance of sales drives and marketing pushes,
as well as any seasonal variation.
It seems obvious, but it's often the stumbling point.
SLAs and infrastructure, how resilient do you need to be?
What are your SLAs around downtime?
How much failure do you need to be able to tolerate without breaching SLAs?
In multi-region installations, should each region be able to
handle 100 percent of traffic in the event of a region failing?
Factors such as this play a large part in
deciding how much hardware needs to be provisioned.
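As a rough, hypothetical illustration of how this decision turns into numbers, the sketch below estimates the capacity each region needs when the surviving regions must absorb the traffic of a failed one. The function name and the example figures are illustrative assumptions, not values from this module.

```python
def per_region_capacity_tps(peak_total_tps, regions, regions_tolerated_down=1):
    """Estimate the TPS each region must be sized for so the platform
    can lose `regions_tolerated_down` regions without breaching SLAs.
    Illustrative helper; adjust to your own SLA model."""
    surviving = regions - regions_tolerated_down
    if surviving < 1:
        raise ValueError("at least one region must survive")
    return peak_total_tps / surviving

# Two regions at 3000 TPS peak: each region must carry 100% of traffic.
print(per_region_capacity_tps(3000, regions=2))  # 3000.0
# Three regions: the two survivors split the load, 1500 TPS each.
print(per_region_capacity_tps(3000, regions=3))  # 1500.0
```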
Your infrastructure itself will determine whether or not you need to
provision in advance or scale elastically as demand grows.
Understand traffic patterns and how they affect scaling.
You can safely assume that as traffic increases,
you will eventually need to increase numbers of servers in the underlying infrastructure.
Not all components scale at the same rate.
In the first example, A on the left,
we have an equal number of routers and
message processors and a standard Cassandra cluster.
This is fairly typical.
Routers can serve traffic to either message processor.
We probably have an excess of capacity in the router layer,
but we do have redundancy in all layers.
In the middle example B,
we've scaled out the message processor layer.
Routers are capable of sending traffic to multiple message processors.
A good rule of thumb is one router for every four message processors.
In practice, we would never use a single router in front of
our message processors as this would give us a single point of failure.
At only four message processors,
we can safely assume that the default size Cassandra ring will support any load.
In example C, we've doubled the number of routers and message processors.
At this point, you will probably also be looking to expand the Cassandra ring.
As mentioned in other modules,
this is normally done either by doubling
the ring size or increasing it by a multiple of the replication factor.
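To make these rules of thumb concrete, here is a minimal sketch in Python (an illustration only, not an official sizing tool) that applies the one-router-per-four-message-processors guideline, keeps at least two routers for redundancy, and grows a Cassandra ring either by doubling it or by adding a multiple of the replication factor.

```python
import math

def routers_needed(message_processors, ratio=4, minimum=2):
    """Rule of thumb: one router per four message processors,
    but never fewer than two routers (no single point of failure)."""
    return max(minimum, math.ceil(message_processors / ratio))

def next_ring_size(current_nodes, replication_factor=3, double=True):
    """Expand a Cassandra ring by doubling it, or by adding
    a multiple of the replication factor."""
    return current_nodes * 2 if double else current_nodes + replication_factor

print(routers_needed(4))                 # 2
print(routers_needed(12))                # 3
print(next_ring_size(3))                 # 6 (doubled)
print(next_ring_size(6, double=False))   # 9 (6 + replication factor of 3)
```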
Plan for X, use everything you know about your business to plan ahead.
Where possible, think about adopting the cloud.
Again, where possible, use elastic scaling capability.
Understand your business strategy and market growth
and your company's expectations and targets for the platform.
And, finally, if you're using traditional infrastructure,
an upfront investment in capacity is often far
more cost-effective than reacting to accelerated change.
Next, know your proxies.
Not all proxies are created equal.
Bundle complexity may lead to your estimated capacity actually being reduced.
Code quality also plays a large part here.
The number and type of policies that are being applied within
an API proxy will drive execution time on the gateway.
We see some API proxies with only a handful of policies being applied.
Others may have over 100.
Obviously, these have different processing requirements.
Any policy that requires reads and writes to and from Cassandra will increase latency,
and these include things like quotas,
key lookups, et cetera.
Heavy transformations, SSL termination or
any other CPU intensive operation will also increase latency.
Execution time and I/O wait time have a direct impact on
the ability of a router or a message processor to execute concurrent calls.
All these factors come into play and must be
considered when going through a capacity planning exercise.
It's essential that operations teams get to know each kind of API proxy
being deployed and work closely with development and QA teams.
In this way, you improve your ability to plan
future releases and ensure the platform is always optimized effectively.
Know the critical path.
By understanding your proxies,
you can understand the critical path along which they execute.
Routers, message processors, and, for some policies, Cassandra
all sit along this critical path.
Remember that everything fails, everything.
Embrace failure. We're increasingly moving
away from a world in which single points of failure exist.
The Edge platform has been specifically
designed in such a way that with careful planning,
you can ensure that failures will not impact runtime.
When planning for failure,
consider also what happens to capacity when things do go wrong.
Again, when planning, remember that API traffic,
analytics, and developer portal components can and should scale independently.
By understanding this, you can focus your investment,
resources, and processes for the most effective outcome.
Finally for this section,
we can look at hardware characteristics and requirements.
Given a medium complexity API proxy,
we would normally say that a single message processor
can handle around 1000 transactions per second.
Of course, this number will change depending on factors
such as the complexity of the proxy,
other supporting components, responsiveness of backend systems, et cetera.
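Assuming the rough figure of 1,000 TPS per message processor holds for your proxies, a back-of-the-envelope estimate of the message processor count could look like the sketch below. The 30 percent headroom is an illustrative assumption, not a figure from this module.

```python
import math

def message_processors_needed(peak_tps, tps_per_mp=1000, headroom=0.3, minimum=2):
    """Estimate message processors for a peak load, assuming a
    medium-complexity proxy (~1000 TPS per MP) and leaving spare
    headroom so losing one MP does not immediately breach capacity."""
    required = peak_tps / (tps_per_mp * (1 - headroom))
    return max(minimum, math.ceil(required))

print(message_processors_needed(2500))   # 4
print(message_processors_needed(800))    # 2 (minimum for redundancy)
```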
For Cassandra, as well as components such as Qpid and PostgreSQL,
we have fairly high-performance requirements.
All these components make heavy use of disk and
thus benefit from high performing disk subsystems.
In the case of Cassandra,
the random read and write nature makes SSDs an ideal choice.
Additionally, Cassandra's compaction process can substantially
increase the amount of disk space consumed while the process is in progress.
Next, let's look at some of the calculations we can
use when capacity planning around Cassandra and PostgreSQL.
When calculating usable disk space for Cassandra,
there are a number of things to consider.
Firstly, there's the estimation of data volume based on factors like
API traffic and usage of Edge policies such as OAuth,
Cache, KVM, et cetera.
There are also, however,
operational processes that are executed automatically by Cassandra behind the scenes.
On-screen, you can see the calculations to estimate usable space per Cassandra node.
The important line is the third line:
usable space is equal to
the formatted space multiplied by a value between 0.5 and 0.8.
The multiplier accounts for the space needed by compaction,
hint file handling, snapshots, et cetera.
This means that if you wanted to store 100 gigabytes of data on a node,
you would actually need to provision a disk of around 250 gigabytes.
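Expressed as a small sketch, the per-node calculation looks like this. The 10 percent filesystem formatting overhead and the conservative 0.5 multiplier are illustrative assumptions rather than values from the slide.

```python
def usable_space_gb(formatted_space_gb, multiplier=0.5):
    """Usable space = formatted space x multiplier (0.5 to 0.8); the rest
    is reserved for compaction, hint file handling, snapshots, etc."""
    assert 0.5 <= multiplier <= 0.8, "multiplier should stay between 0.5 and 0.8"
    return formatted_space_gb * multiplier

def disk_needed_gb(data_to_store_gb, multiplier=0.5, formatting_overhead=0.1):
    """Work backwards: raw disk needed to hold a given amount of data,
    allowing for filesystem formatting overhead."""
    return data_to_store_gb / multiplier / (1 - formatting_overhead)

# A 250 GB disk formats to roughly 225 GB, giving ~112 GB of usable space.
print(round(usable_space_gb(250 * 0.9)))   # 112
# Storing 100 GB needs ~222 GB of raw disk; rounding up towards 250 GB,
# as in the module, simply leaves extra margin.
print(round(disk_needed_gb(100)))          # 222
```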
We also need to calculate capacity for the overall ring.
When considering the ring,
we need to account for the replication factor.
On a 3-node ring, since the replication factor for Cassandra in Edge is 3,
if we have 500 gigabytes of usable disk per node,
this does not mean that we have 1500 gigabytes of usable space.
Each data element will be copied to each node of the ring,
meaning our usable space remains at 500 gigabytes.
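The ring-level arithmetic can be sketched the same way: with a replication factor of 3, every data element is stored three times, so usable ring capacity is total node capacity divided by the replication factor rather than the simple sum.

```python
def ring_usable_capacity_gb(nodes, usable_per_node_gb, replication_factor=3):
    """Usable capacity of a Cassandra ring once replication is accounted
    for: each data element is stored replication_factor times."""
    return nodes * usable_per_node_gb / replication_factor

# 3-node ring, RF 3, 500 GB usable per node: still only 500 GB of unique data.
print(ring_usable_capacity_gb(3, 500))   # 500.0
# Doubling the ring to 6 nodes doubles the unique data it can hold.
print(ring_usable_capacity_gb(6, 500))   # 1000.0
```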
And, finally, PostgreSQL capacity planning.
Analytics is a core component of Edge,
and the PostgreSQL machines that underpin it require careful planning.
On the screen, you can see the requirements for PostgreSQL.
As the number of transactions increases,
the CPU, memory, and IOPS requirements increase correspondingly.
The PostgreSQL machines will be processing not
only the incoming write requests for new analytics events,
but also performing the aggregation of raw data and reporting.
When we consider the actual storage requirements,
we need to look at how much data Edge creates for each analytics event.
Each analytics event is around two and a half K of
data plus any custom values you may be storing.
The calculation on screen can be used to determine
how much storage is required based on your retention requirements.
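Assuming roughly 2.5 KB per analytics event plus any custom values, the retention calculation could be sketched as below. The function and parameter names are illustrative and not the exact formula shown on screen.

```python
def analytics_storage_gb(avg_tps, retention_days,
                         bytes_per_event=2.5 * 1024, custom_bytes_per_event=0):
    """Estimate PostgreSQL storage for raw analytics events:
    events per day x size per event x retention period."""
    events_per_day = avg_tps * 60 * 60 * 24
    total_bytes = events_per_day * (bytes_per_event + custom_bytes_per_event) * retention_days
    return total_bytes / (1024 ** 3)

# Example: a sustained 300 TPS with 90 days of retention.
print(round(analytics_storage_gb(300, 90)))   # ~5562 GB of raw event data
```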
This concludes capacity planning.
For more information, you can visit docs.apigee.com.
To get involved in our community,
please go to community.apigee.com. Thank you.