
Cloning a Database Cluster with Scylla Manager Backup


Cloning a database cluster is probably the most common use of backup data. This process can be very useful in case of a catastrophic event — say you were running in a single DC and it burnt down overnight. (For the record, we always encourage you to follow our high availability and disaster recovery best practices to avoid such catastrophic failures. For distributed topologies we have you covered via built-in multi-datacenter replication and Scylla’s fundamental high availability design.) When you have to restore your system from scratch, that process requires cloning your existing data from a backup onto your new database cluster. Beyond disaster recovery, cloning a cluster is very handy if you want to migrate a cluster to different hardware, or if you want to create a copy of your production system for analytical or testing purposes. This blog post describes how to clone a database cluster with Scylla Manager 2.4.

The latest release of Scylla Manager, 2.4, adds a new Scylla Manager Agent download-files command. It replaces vendor-specific tools, such as the AWS CLI or gcloud CLI, for accessing and downloading remote files. With many features specific to Scylla Manager, it is a “Swiss army knife” data restoration tool.

The Scylla Manager Agent download-files command allows you to:

  • List clusters and nodes in a backup location, example:

scylla-manager-agent download-files -L <backup-location> --list-nodes

  • List the node’s snapshots with filtering by keyspace / table glob patterns, example:

scylla-manager-agent download-files -L <backup-location> --list-snapshots -K 'my_ks*'

  • Download backup files to Scylla upload directory, example:

scylla-manager-agent download-files -L <backup-location> -T <snapshot-tag> -d /var/lib/scylla/data/

In addition to that, it can do the following (a combined example appears after this list):

  • Download to table upload directories, or to a keyspace/table directory structure suitable for sstableloader (flag --mode)
  • Remove existing SSTables prior to download (flag --clear-tables)
  • Limit download bandwidth (flag --rate-limit)
  • Validate disk space and data directory ownership prior to download
  • Print out the execution plan (flag --dry-run)
  • Print out the manifest JSON (flag --dump-manifest)
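
For example, a dry run that previews downloading a snapshot into the table upload directories, clearing existing SSTables and capping bandwidth, might look like the sketch below. The backup location and snapshot tag are placeholders, and the --mode value and --rate-limit format shown are assumptions — check scylla-manager-agent download-files --help for the exact syntax.

scylla-manager-agent download-files -L <backup-location> -T <snapshot-tag> -d /var/lib/scylla/data/ \
    --mode upload --clear-tables --rate-limit 100 --dry-run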

Restore Automation

Cloning a cluster from a Scylla Manager backup is automated using the Ansible playbook available in the Scylla Manager repository. The download-files command works with any backup created with Scylla Manager. The restore playbook, however, works only with backups created with Scylla Manager 2.3 or newer, because it requires token information stored in the backup files.

With your backups in a backup location, to clone a cluster you will need to:

  • Create a new cluster with the same number of nodes as the cluster you want to clone. If you do not know the exact number of nodes, you can determine it during the process.
  • Install Scylla Manager Agent on all the nodes (the Scylla Manager server is not mandatory).
  • Grant all the nodes access to the backup location.
  • Check out the playbook locally.

The playbook requires the following parameters:

  • backup_location – the location parameter used in Scylla Manager when scheduling a backup of a cluster.
  • snapshot_tag – the Scylla Manager snapshot tag you want to restore
  • host_id – mapping from the clone cluster node IP to the source cluster host ID

The parameter values should be put into a vars.yaml file; an example is worked through in the steps below.

Example

I created a 3-node cluster and filled each node with approximately 350 GiB of data (RF=2). Then I ran a backup with Scylla Manager and deleted the cluster. Later I decided that I wanted the cluster back, so I created a new cluster of 3 nodes based on i3.xlarge machines.

Step 1: Getting the Playbook

The first thing to do is to clone the Scylla Manager repository from GitHub (“git clone git@github.com:scylladb/scylla-manager.git”) and change to the restore playbook directory (“cd scylla-manager/ansible/restore”). All restore parameters should be put into the vars.yaml file; we can copy vars.yaml.example to vars.yaml to get a template.
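
Put together, the setup commands from this step look like this (exactly as quoted above; the cp command creates the template you will edit in the next step):

git clone git@github.com:scylladb/scylla-manager.git
cd scylla-manager/ansible/restore
cp vars.yaml.example vars.yaml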

Step 2: Setting the Playbook Parameters

For each node in the freshly created cluster we assign the ID of the source node it will clone. We do that by specifying the host_id mapping in the vars.yaml file. If the source cluster is still running, you can use the “Host ID” values from the “sctool status” or “nodetool status” command output. Below is a sample “sctool status” output.

If the cluster is deleted, we can SSH into one of the new nodes and list all backed-up nodes in the backup location.
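
Using the --list-nodes flag shown earlier, that looks something like this (run from one of the new nodes; the IP and backup location are placeholders):

ssh <new-node-ip>
scylla-manager-agent download-files -L <backup-location> --list-nodes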

Based on that information we can fill in the host_id mapping.

When we have the node IDs, we can list the snapshot tags for each node.

We can now set the snapshot_tag parameter to snapshot_tag: sm_20210624122942UTC.

If the source cluster is running under Scylla Manager, it’s easier to run the “sctool backup list” command to get a listing of available snapshots.

Lastly we specify the backup location, exactly as it was configured in Scylla Manager. The resulting vars.yaml looks like the listing below.
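
Here is an illustrative version for this 3-node example — the backup location, IPs, and host IDs are placeholders (the real host IDs come from the --list-nodes output), and only the snapshot tag is the one selected above:

cat > vars.yaml <<EOF
backup_location: <backup-location>
snapshot_tag: sm_20210624122942UTC
host_id:
  <new-node-1-ip>: <source-node-1-host-id>
  <new-node-2-ip>: <source-node-2-host-id>
  <new-node-3-ip>: <source-node-3-host-id>
EOF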

The IPs of the nodes in the new cluster must be put into an Ansible inventory and saved as hosts:
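
A minimal inventory for the 3-node example can be created like this (IPs are placeholders):

cat > hosts <<EOF
<new-node-1-ip>
<new-node-2-ip>
<new-node-3-ip>
EOF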

Step 3: Check the Restore Plan

Before jumping into the restoration right away, it may be useful to see the execution plan for a node first.

With --dry-run you can see how other flags like --mode or --clear-tables would affect the restoration process.

Step 4: Press Play

It may be handy to configure the default user and private key in ansible.cfg.

When done, “press play” (run the ansible-playbook) and get a coffee:
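
A sketch of both steps is below. The SSH user, key path, and playbook filename are placeholders and assumptions — use the values and the playbook file shipped in the ansible/restore directory:

cat > ansible.cfg <<EOF
[defaults]
inventory = hosts
remote_user = <ssh-user>
private_key_file = <path-to-private-key>
EOF

# the playbook filename is an assumption — run the playbook file from ansible/restore
ansible-playbook -e @vars.yaml restore.yaml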

The restoration took about 15 minutes on a 10Gb network. The download saturated at approximately 416MB/s.

CQL Shell Works Just Fine.

After completing the restore, it’s recommended to run a repair with Scylla Manager.

Next Steps

If you want to try this out yourself, you can get hold of Scylla Manager either as a Scylla Open Source user (for up to five nodes) or as a Scylla Enterprise customer (for any sized cluster). You can get started by heading to our Download Center, and then checking out the Ansible playbook in our Scylla Manager GitHub repository.

If you have any questions, we encourage you to join our community on Slack.

CHECK OUT THE ANSIBLE PLAYBOOK FOR SCYLLA MANAGER



Project Circe June Update


Summer’s here! Which means that we’re getting ready for our Scylla University LIVE Summer School session. We hope to meet you all there. Meanwhile, behind the scenes we’ve been diligently working to deliver new software across our product set — the database itself, drivers (Rust 0.2), our k8s operator, Spark Migrator, and the list goes on.

Project Circe aims to make Scylla, already a kickass database, even better. With that goal in mind, here’s a look at our progress for the month of June.

Scylla Open Source 4.5 Coming Soon!

We’re on the verge of releasing Scylla Open Source 4.5 (following RC2, which went out in early June). Let’s have a look at the new features and capabilities you can look forward to in the coming release.

Load and Stream SSTables

This feature extends nodetool refresh to allow loading arbitrary SSTables. It will help make restorations and migrations much easier. You can take an SSTable from a cluster and place it on any node in the new cluster. When you trigger the load and stream process, it will distribute and stream the data across the nodes in the new cluster. Previously, one had to carefully place the SSTable within every node that owned key ranges that belong to it. Today, this feature does the job for you.

For example, you could take SSTables created on a cluster of 9 small nodes, then load and stream them across a cluster of 5 large nodes. Best of all, there’s no need to run nodetool cleanup afterwards to remove unused data.
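
As a sketch, after copying backed-up SSTables into a table’s upload directory on any single node, load and stream is triggered through nodetool refresh. The upload path layout and the flag name below are assumptions based on the standard Scylla directory structure — consult the 4.5 documentation for the exact syntax:

# copy the SSTables into the upload directory of the target table on one node
cp /path/to/sstables/* /var/lib/scylla/data/<keyspace>/<table>-<uuid>/upload/
# trigger load and stream; the data is distributed to the owning nodes across the cluster
nodetool refresh <keyspace> <table> --load-and-stream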

Project Alternator

We’re making improvements to our DynamoDB-compatible API in a number of ways:

  • The sstableloader utility will work with Alternator tables, beginning with 4.5.
  • Cross-Origin Resource Sharing (CORS) will allow client browsers to access the database via JavaScript, avoiding a middle tier.
  • You will be able to limit maximum concurrency, with queries exceeding that concurrency returning a RequestLimitExceeded error.
  • Nested attribute paths will allow the modification of just an object’s attributes instead of the entire object.
  • Slow query logging will allow you to find queries that exceed a threshold and log them to system_traces.node_slow_log (see the example after this list).
  • Support for attribute paths in ConditionExpression, FilterExpression, and ProjectionExpression.
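
For instance, once slow query logging is enabled, the logged requests can be inspected from that table; a minimal sketch, assuming cqlsh is available on a node (the LIMIT is illustrative):

cqlsh -e "SELECT * FROM system_traces.node_slow_log LIMIT 10;"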

Raft

Raft implementation in Scylla is a core deliverable of Project Circe. While the changes made to the Scylla infrastructure will have no visible effect yet, we’re adding the major building blocks upon which a number of future capabilities will be delivered.

  • Schema Tables on Shard 0 — To date, Scylla has stored the database schema in a set of tables, sharded across all cores like ordinary user tables. With Scylla 4.5, this schema data will be maintained by shard 0 alone. This is the first step toward letting Raft manage it.
  • Log Data — Raft will now be able to store its log data in a system table, implemented in a modular fashion.
  • Joint Consensus — Now merged, this provides the ability to change a Raft group from one set of nodes to another, which is a requisite for cluster topology changes and data migrations to different nodes.
  • Additional changes to the Raft implementation provide support for non-voting nodes, per-server timers, and leader step-down.

Change Data Capture (CDC)

We are thrilled that our users are eagerly looking for ways to leverage the new CDC capabilities in Scylla. (Have a look at our recent webinar with Confluent on how to build event streaming architectures using CDC with Kafka.) This month we optimized CDC enablement on large Scylla clusters with many partitions and streams: first, by limiting the number of streams (though this incurs some loss of efficiency), and second, by adopting a new format that uses partitions and clustering rows.

CDC is also an official part of the Enterprise 2021 release, and in July it will be fully integrated with Scylla Cloud.

Other June Releases

Beyond Scylla Open Source, we also provided a new update to our Scylla Enterprise 2021 release, as well as updates to our supporting applications and utilities.

Velocity of Software Delivery

We often hear from Scylla users that the velocity of software delivery matters to them when deciding what infrastructure components to implement in their ecosystems. Already in the first half of this year we delivered Scylla Open Source 4.3 and 4.4, plus early in the second half we will deliver 4.5. This strong, steady release cadence allows us to add new capabilities while also allowing us to fix bugs at a rapid and regular clip.

Meanwhile, Scylla Enterprise, offered as a separate deliverable, allows us to perform even greater testing for the resiliency and maturity needed for production-readiness.

If the frequency of software delivery is also a major concern of yours, here’s an interesting way to compare our team’s output to a couple of other well known open source big data projects. All information here is for the month of June 2021 (30 May to 30 June, to be precise):

scylladb/scylla (500k lines of code)

  • 28 authors pushed
  • 383 commits for the month
  • 1,487 files were changed

apache/spark (2.1m lines of code)

  • 84 authors pushed
  • 393 commits for the month
  • 1,441 files were changed

apache/kafka (892k lines of code)

  • 52 authors pushed
  • 122 commits for the month
  • 737 files were changed

apache/cassandra (1m lines of code)

  • 20 authors pushed
  • 49 commits for the month
  • 140 files were changed

To break the progress in our code base down to some salient real-world examples, we recommend checking out our CTO Avi Kivity’s series entitled “Last week in scylla.git master,” which includes these more interesting changes over the past month:

  • June 06 — featuring a new process for making Docker images
  • June 13 — which enables off-strategy compaction for bootstrap and replace operations
  • June 20 — changes to how range tombstones are internally represented
  • June 27 — making the bootstrap process more robust

Sign Up for Scylla University LIVE!

We look forward to seeing you at Scylla University LIVE for our Summer Session. This is an event you won’t want to miss. Besides the tracks about Scylla operations and development, we’re also going to have sessions devoted to hooking up Scylla to the rest of your big data architecture, including integrating it with Apache Spark and Apache Kafka.

Scylla University Summer School

You can read more about the Scylla University LIVE agenda, as well as other new developments at Scylla University here. But meanwhile, don’t forget to reserve your seat in our live, online classes coming up July 28th and 29th. Until we next meet, enjoy your summer!

REGISTER FOR THE SCYLLA UNIVERSITY LIVE SUMMER SESSION


DynamoDB Autoscaling Dissected: When a Calculator Beats a Robot 


This post aims to help you select the most cost-effective and operationally simple configuration for DynamoDB tables.

TL;DR: Choosing the Right Mode

Making sense of the multitude of scaling options available for DynamoDB can be quite confusing, but running a short checklist with a calculator can go a long way to help.

  1. Follow the flowchart below to decide which mode to use.
  2. If you have historical data on your database load (or an estimate of the load pattern), create a histogram or a percentile curve of the load (aggregated over hours used) – this is the easiest way to determine how many reserved units to pre-purchase. As a rule of thumb, purchase reservations for units used more than 32% of the time when accounting for partial usage, and more than 46% of the time when not (a quick way to compute this from hourly usage data is sketched after this list).
  3. When in doubt, opt for static provisioning, unless your top priority is avoiding running out of capacity — even at extreme cost.
  4. Configure scaling limits (both upper and lower) for provisioned autoscaling. You want to avoid running out of capacity during outages, and extreme costs in case of rogue overload (DDoS, anyone?).
  5. Remember that there is no upper limit on DynamoDB on-demand billing other than the table’s scaling limit (which you may have requested raising for performance reasons). Make sure to configure billing alerts and respond quickly when they fire.
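
As a minimal sketch of item 2: given a file with one consumed-capacity value per hour (the file name and the candidate reservation level are hypothetical), the fraction of hours at or above that level can be computed with a one-liner and compared against the 32%/46% thresholds:

# hours.csv: one consumed capacity value per line, one line per hour (hypothetical input)
awk -v level=50000 '$1 >= level { n++ } END { printf "%.1f%% of hours at or above %d units\n", 100 * n / NR, level }' hours.csv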

The Long Version

Before we dive in, it’s useful to be reminded of DynamoDB’s different service models and their scaling characteristics: DynamoDB tables can be configured as either “provisioned capacity” or “on demand,” and there’s a cooldown period of 24 hours before you can change the mode again.

On Demand

In this mode, DynamoDB tables are billed by cumulative query count. It doesn’t matter (from a billing perspective) what momentary throughput you have, only how many times you’ve queried the table over the month. You don’t have to worry about what throughput you might need, or plan ahead — or at least, that’s the promise. Unfortunately, DynamoDB scaling isn’t magic, and if your load rises too fast, it will be throttled until enough capacity is added. As you are only paying for the queries you actually issued, and not for capacity, your bill will be quite low if you only have sporadic load; however, if you have substantial sustained throughput, the bill will be very high. For example, a sustained 10k reads/sec would cost you around $6,500/month (compared with $1,000 provisioned).

For this reason, AWS recommends using on demand for sporadic load or small serverless applications. Even for sporadic load or bursts, on demand might be unusable, as it takes upwards of 2.5 hours to reach 40k rps from 0. The solution recommended by AWS to “warm up” the table is to switch to provisioned mode with enough provisioned capacity for your planned load and then switch back to on demand (but if you know in advance what load you need to handle and when, why would you use on demand in the first place?). One point to remember is that on-demand tables are never scaled back down — which is very helpful for sporadic peaks.

Beyond its aggressive scaling policies, on demand employs other optimizations in the structure of its storage which make it more suitable for bursty loads and dense multitenancy (you didn’t think AWS dedicates machines to you, did you?), and which help AWS compensate for the huge cost of overprovisioning.

Provisioned Capacity

In provisioned mode, AWS bills you by table capacity — although instead of specifying capacity in units of CPUs and disks, AWS makes life easier for you and lets you specify your load in the more developer-natural form of throughput: reads and writes per second. This is already a tremendous achievement, but it unfortunately still requires you to do some capacity planning; this isn’t too bad in practice, as capacity can be added and removed pretty quickly — in the span of hours.

Reservations

As with EC2, you have the option to reserve provisioned capacity — pay an upfront partial payment to get a reduced price when using that capacity. The only catch, of course, is that you need to decide up front how much you want to reserve. The cost savings can be as high as 56% if you can commit to the reservation period (one or three years).

A DynamoDB table can be either on-demand or provisioned, and while you can switch back and forth between the two they cannot be active simultaneously. To use reservations, your table must be in provisioned mode.

Autoscaling Demystified

The function of an autoscaling controller is to try to keep some quantity within a band of specified margins. Here, utilization is the ratio between the capacity consumed by the external workload and the available capacity (provisioned by the user or by AWS’s internal on-demand controller):

Utilization = consumed capacity / available capacity

For example, if 80,000 capacity units are provisioned and the workload consumes 60,000 of them, utilization is 75%.

To do that, the controller adds or removes capacity to reach the target utilization, based on the past data it has seen. However, adjusting capacity takes time, and the load might continue changing during that time; the controller is always risking adding too much capacity or too little.

Another way to look at this is that the controller needs to “predict the future,” and its prediction will be either too aggressive or too meek. Looking at the problem this way, it’s clear that the further into the future the controller needs to predict, the greater the errors can be. In practice, this means that controllers need to be tuned to handle only a certain range of changes; a controller that handles rapid, large changes will not handle slow changes or rapid small changes well. It also means that changes faster than the system response time (the time it takes the controller to add capacity) cannot be handled by the controller at all.

In the case of load spikes that rise sharply within seconds or minutes, all databases must handle the spike using the already provisioned capacity — so a certain degree of overprovisioning must always be kept, possibly by a large amount if the anticipated bursts are rapid and large. DynamoDB offers two basic scaling controllers: the aggressive on-demand mode and the tamer provisioned autoscaling.

On-Demand Scaling

Under the hood, DynamoDB on-demand automatically scales to double the capacity of the largest traffic peak in the last 30-minute window — anything above that and you get throttled. In other words, it always provisions 2x: if you had a peak of 50,000 qps, DynamoDB will autoscale to support 100,000 qps — and it never scales back down. This type of exponential scaling algorithm is very aggressive and suitable for workloads that change fairly quickly. However, the price of that aggressiveness is massive overprovisioning — the steady state can be close to 0% average utilization since it operates on peaks, which is very expensive. Although the pricing model of on demand is pay-per-query and not per capacity, which makes 0% utilization essentially free for the user, the high overprovisioning cost is shoved into the per-query price, making on demand extremely expensive at scale or under sustained throughput.

Provisioned Autoscaling

This is AWS’ latest offering, which uses their Application Auto Scaling service to adjust the provisioned capacity of DynamoDB tables. The controller uses a simple algorithm that allocates capacity to keep a certain utilization percentage. It responds fairly quickly to capacity increases, but it is very pessimistic when reducing capacity and waits relatively long (15 minutes) before adjusting it downward — a reasonable tradeoff of availability over cost. Unlike on-demand mode, it operates on 1-minute average load rather than peaks over a 30-minute window — which means it both responds faster and does not overallocate as much. You can get aggressive scaling behavior similar to on-demand by simply setting the utilization target to 0.5.

The Limits of Autoscaling

Autoscaling isn’t magic — it’s simply a robot managing capacity for you. But the robot has two fundamental problems: it can only add capacity so fast, and its knowledge of traffic patterns is very limited. This means that autoscaling cannot deal with unexpected, large changes in capacity, nor can it be optimal when it comes to cost savings.

As a great example of that, let’s look at this AWS DynamoDB autoscaling blog post. In it, AWS used provisioned autoscaling to bring the cost of a DynamoDB workload from $1,024,920 (static provisioning) down to $708,867 (variable capacity). While a 30% cost reduction isn’t something that should be discounted, it also shows the limits of autoscaling: even with a slowly changing workload that is relatively easy to handle automatically, they only saved 30%. Had they used statically provisioned capacity with a one-year reservation for the entire workload, the cost would have gone down to $473,232 — a 53.8% cost reduction!

Even when combining autoscaling with reserved capacity — reserving units that are used more than 46% of the time (which is when reservation becomes cheaper) — AWS was only able to bring the cost down to $460,327, a negligible saving compared to completely reserved capacity — and they needed to know the traffic pattern in advance. Is a cost saving of 2.7% worth moving your system from static capacity to dynamic, risking potential throttling when autoscaling fails to respond fast enough?

Given that both scenarios require capacity planning and a good estimate of the workload, and given that autoscaling is not without peril, I would argue “no” — especially as a completely static three-year reservation would bring the cost down further, to $451,604.

Configuration                                          Capacity                          Cost
DynamoDB statically provisioned                        WCUs: 2,000,000; RCUs: 800,000    $1,024,920
Auto scaling                                           Variable capacity                 $708,867
Blended auto scaling and one-year reserved capacity    Variable capacity                 $460,327
One-year reserved capacity                             WCUs: 2,000,000; RCUs: 800,000    $473,232
Three-year reserved capacity                           WCUs: 2,000,000; RCUs: 800,000    $451,604

The Perils of Autoscaling

We’ve discussed above how automatic controllers are limited to a certain range of behaviours and must be tuned to err either on the side of caution (and cost) or risk overloading the system — this is common knowledge in the industry. What is not discussed as much is the sometimes catastrophic results of automatic controllers facing situations that are completely out of scope of their local and limited knowledge: namely, their behavior during system breakdown and malfunctions.

When Slack published a post mortem for their January 4th outage they described how autoscaling shut down critical parts of their infrastructure in response to network issues, further escalating incidents and causing a complete outage. This is not as rare as you think and has happened numerous times to many respectable companies — which is why autoscaling is often limited to a predetermined range of capacity regardless of the actual load, further limiting potential cost savings.

Ask yourself, are you willing to risk autoscaling erroneously downscaling your database in the middle of an outage? This isn’t so far fetched given the load anomalies that happen frequently during incidents. It’s important to note that this isn’t an argument for avoiding autoscaling altogether, only to use it cautiously and where appropriate.

Summary: the Cost of Variance

A recurring theme in performance and capacity engineering is the cost of variance. At scale we can capitalize on averaging effects by spreading out variance between many different parties — which is basically how AWS is able to reduce the cost of infrastructure to its clients, shifting capacity between them as needed. But the higher and more correlated the variance, the less averaging effect we have, and the cost of variance can only be masked to some degree. AWS knows this, and offers its customers significant discounts for reducing the variance and uncertainty of capacity utilization by purchasing reserved capacity.

In AWS’ own example, autoscaling failed to bring significant cost savings compared to capacity planning and upfront reservations — but that does not mean autoscaling is completely useless. Indeed there are traffic patterns and situations in which autoscaling can be beneficial — if only to allow operators to sleep well at night when unexpected loads occur.

So when is autoscaling useful? If the load changes too quickly and violently, autoscaling will be too slow to respond. If the change in load has low amplitude, the savings from adjusting capacity are insignificant.

Autoscaling is most useful when:

  1. Load changes have high amplitude
  2. The rate of change is in the magnitude of hours
  3. The load peak is narrow relative to the baseline

Even if your workload is within those parameters, it is still necessary to do proper capacity planning in order to cap and limit the capacity that autoscaling manages, or else you are risking a system run amok.

Autoscaling is no substitute for capacity planning and in many cases will cost more than a properly tuned and sized system. After all, if automation alone could solve our scaling problems, why are we still working so hard?

Want to Do Your Capacity Planning with Style?

If you have found this article interesting, the next step is to read our article about Capacity Planning with Style. Learn how to get the most out of our Scylla Cloud Calculator, which you can use to compare how provisioning and pricing work across various cloud database-as-a-service (DBaaS) offerings, from Scylla Cloud to DynamoDB to DataStax Astra or Amazon Keyspaces.

LEARN ABOUT CAPACITY PLANNING WITH STYLE

TRY THE SCYLLA CLOUD SIZING AND PRICING CALCULATOR


Scylla Rust Driver Update and Benchmarks


Scylla Rust Driver was born during Scylla’s internal developer hackathon. The effort did not stop after the hackathon though — the development continued and Scylla Rust Driver is now released as 0.2.0, officially available on the Rust community’s package registry — crates.io. It also already accepted its first contributions from the open-source community, has a comprehensive docs page, and much more! We also ran comparative benchmarks against other drivers to confirm that our driver is (more than) satisfactory in terms of performance.

Quick Start

Our docs page contains a quick start guide designed for people who would like to start using our driver. You’re welcome to try it out!

New Features

After the hackathon, Scylla Rust Driver was in a workable but very limited state. It was capable of sending requests to the correct nodes and shards, but many features expected of a full-fledged driver were missing. The situation has changed drastically throughout the year, with the Scylla Rust Driver gaining many important features:

  1. Authentication Support
    It’s possible to connect to clusters which require username+password authentication.
  2. TLS Support
    Certificates and keys can be configured to establish secure connections with the cluster.
  3. Configurable Load Balancing Algorithms
    Multiple options for load balancing are now available, including DC-aware round robin and token-aware load balancing. Learn more.
  4. Configurable Retry Policies
    Depending on the failure type, our driver may apply various retry policies to increase the chance of success for requests. Learn more.
  5. Speculative Execution
    In certain situations it’s beneficial in terms of performance to speculatively resend a request to another node, before any reply arrives. Such behavior can now be configured per-session and comes in two flavors: based on constant delay or latency percentiles. Learn more.
  6. Tracing Support
    Query tracing is extremely useful for investigating performance issues and bottlenecks. The Scylla Rust Driver supports retrieving tracing information from specific requests. Learn more.
  7. Internal Logging
    For those interested in how the driver works internally, its logs are exposed via the tracing crate. Happy debugging!

Benchmarks

The performance of our driver was tested against multiple other drivers, including:

  • cpp-driver: the C++ driver (in single- and multithreaded modes)
  • cassandra-cpp: a Rust driver which uses the C++ driver bindings underneath
  • gocql: the Go driver
  • cdrs-tokio: native Rust driver, forked from cdrs and also based on the Tokio framework

Source code of all the benchmarks can be browsed here: https://github.com/cvybhu/rust-driver-benchmarks

Testing Configuration

All tests were performed on a 3-node cluster of Scylla 4.4 running on i3.4xlarge AWS instances, with a powerful c5.9xlarge loader instance on which the drivers were run. That means all drivers were given a chance to utilize 36 vCPUs and 72 GiB of RAM, which allows reaching quite high concurrency.

Test Cases

Each driver was compared against all combinations of the following configurations:

  • 3 workload types: read-only, write-only, mixed 50/50
  • data size of 1M, 10M and 100M rows
  • concurrency varying from 512 to 8192

Checking so many combinations was necessary to observe how well the drivers scale along with increasing concurrency, how well they can handle large amounts of data and whether the workload type influences their performance.

Results

We were happy to observe that our Scylla Rust Driver is very competitive in terms of both performance and scalability — it was generally just as fast as, or even slightly faster than, the multithreaded C++ driver (cpp-multi), and it left the other tested drivers far behind. All the results can be browsed here, and a small sample is presented below:

Chart 1: Results for 1 million rows, mixed read/write workload and various concurrency values (lower is better)

Chart 2: Results for 100 million rows, mixed workload and concurrency of 8192 (lower is better)

Summary

Scylla Rust Driver turned out to scale very well with the client’s concurrency, which is what we hoped for — one of the earliest design decisions was to make the driver fully asynchronous so it would thrive in high-concurrency environments.

Academic Paper

The development of Scylla Rust Driver was not just yet another open-source initiative — it was also an academic project co-organized with the University of Warsaw. A team of four talented students was engaged in implementing missing features and performing benchmarks. All details and results can be found in their thesis. Have a good read!

READ THE SCYLLA RUST DRIVER THESIS

DOWNLOAD THE SCYLLA RUST DRIVER


Getting Ready for Scylla University LIVE Summer School


The upcoming Scylla University LIVE event is right around the corner. In this blog post, I’ll share more details about the different talks and how you can prepare for the event to help you get the most out of your experience.

SIGN UP FOR SCYLLA UNIVERSITY LIVE

A reminder, the Scylla University LIVE Summer School is a FREE, half-day, instructor-led training event, with training sessions from our top engineers and architects. It will include sessions that cover the basics and how to get started with Scylla and more advanced topics and new features. Following the sessions, we will host a roundtable discussion where you’ll have the opportunity to talk with Scylla experts and network with other users.

For the first time, we’ll host the live sessions in two different time zones to better support our global community of users. Our July 28th training is scheduled for a time convenient in Europe and Asia, while July 29th will be the same sessions scheduled for users in North and South America.

Detailed Agenda and How to Prepare

Here are the different event sessions and the recommended material you can use to prepare.

The sessions are split into two tracks: an Essentials track and an Advanced Topics & Integrations track.
Getting Started with Scylla

This session covers the Scylla architecture, its effects, what happens in a Scylla cluster on a read and a write, partitioning of data in Scylla, different concepts and components in Scylla, and the basics of Scylla data modeling. The session includes hands-on labs.

Suggested learning material:

Working with Kafka and Scylla

This session will cover using our connectors to connect your Scylla cluster with Apache Kafka. It will explore how to configure real-time ingestion of data from Kafka to Scylla. Attendees will also learn how Kafka can consume the data changes happening in a Scylla cluster. The session includes a live demonstration of configuring and running the connectors.

Suggested learning material:

Advanced Data Modeling

This session covers advanced data modeling as well as other topics. The goal of the talk is to help the audience better understand Scylla and write better applications. We will cover the following topics: materialized views and secondary indexes, advanced data types, lightweight transactions, and tips and best practices.

Suggested learning material:

Spark and Scylla: How do Spark and Scylla work together?

This session covers topics such as: an overview of Spark and the Scylla Spark Connector, how to connect Scylla and Spark, how to approach full scans of your data, effective loading and working with big data stored in Scylla, an explanation of how your data is balanced between workers and executors, how to properly dimension a Spark cluster for your data set or use case, and a deep dive into the mechanics behind data processing in Spark.

Suggested learning material:

Building Well-Architected Applications on Scylla Cloud

This talk covers architecture patterns for correctly building distributed applications on top of Scylla. Using a demo, the talk will touch on: idempotent writes, properly handling failed writes, retries, read-after-write and consistency, client-side timestamps and when to use them, and achieving high consistency and high availability with a read/write split.

Suggested learning material:

Improve Your Application Using Scylla Monitoring

Scylla exports thousands of different metrics that provide a complete picture of how well your hardware is set up and used, the current and historical state of the cluster, and how well your app is written. In this session, we will cover the essentials of using the Scylla Monitoring Stack to navigate this data and explain how to gain insights into the areas listed above.

Suggested learning material:

Swag and Certification

Participants that complete the training will have access to more free, online, self-paced learning material such as our hands-on labs on Scylla University.

Additionally, those that complete the training will be able to get a certification and some cool swag!

SIGN UP FOR SCYLLA UNIVERSITY LIVE


Say Hello to Scylla Cloud BYOA


Like countless other organizations, you are probably already running various cloud services on AWS. You’re running all kinds of compute instances, and using a bunch of other AWS services besides databases. If your company is big enough, you have a CFO who is looking at a single AWS bill; they really don’t want a cloud-based service to add another markup on top of the current spend. You’d like to simplify and have all of your AWS spending tallied up in one place. Plus, within your own AWS Account you likely have pre-negotiated discounts.

With Scylla Cloud BYOA (Bring Your Own Account), we provide a fully managed NoSQL database-as-a-service (DBaaS) that runs in your AWS account. We do it all — the provisioning, updates, the backups, the monitoring. In fact, Scylla Cloud is the only fully managed NoSQL database that offers this service. You pay only the subscription fees for Scylla; all of your infrastructure expenses are paid directly to AWS, through your existing accounts.

All this gives you a fully managed NoSQL DBaaS that is not only performant and highly available, but also CFO friendly. Ultimately, the BYOA configuration makes the math easier for everyone: As a DBaaS provider, ScyllaDB doesn’t have to include the infrastructure costs in our pricing. As a DBaaS user, you receive any discounts you’ve already negotiated with AWS.

Companies have built out their brands and services running on Scylla Cloud. Not only for the convenience of the BYOA feature, but for its fundamental performance, ease-of-use, affordability, availability and scalability characteristics. According to the Disney+ Hotstar team, “the major driving factor for us was the low latency.” Jason Mills, engineering manager at GumGum noted, “Other cloud database options simply weren’t fast enough to meet our SLAs.”

We’ve worked hard to make BYOA easy for you to set up. You can get started quickly, without even entering a credit card. We’ve built a simple setup wizard to spin up a Scylla Cloud managed cluster within your AWS account. The wizard enables you to allocate cloud resources for your new DBaaS in your own AWS accounts — as opposed to allocating it in a Scylla account. Everything is consolidated in one place to make it simple to manage your Scylla instances alongside your other infrastructure.

Deeper Benefits of BYOA

Beyond consolidating your infrastructure, there are other benefits of running your DBaaS under our BYOA scenario. First, dedicated servers provide more security and in general are easier to govern. With regulatory directives like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act, it is critical to know where your data is stored. With a typical DBaaS, you are never completely sure that your data is being stored in the geographic region that you expect. With Scylla’s BYOA, you’re in complete control.

Furthermore, if your company is in a highly regulated industry, like healthcare or financial services, then you likely have technology and procedures for regulatory compliance already in place. By running Scylla in a BYOA deployment, your compliance tools and processes continue running as they do today, without the need for audits of new technology and deployment topologies.

High level architecture of Scylla Cloud

Simple Setup

Delving a bit more into setup, you can see that we have tried to streamline the process as much as possible. Once you have signed up for Scylla Cloud service, confirm that your AWS account has the correct account limits, in terms of instances, VPCs, Elastic IPs, CloudFormation Stacks, and so on. Note that you will need to grant Scylla Cloud permission to create and manage the requisite resources in your account.

Then, the wizard walks you through the following steps:

  • Add your AWS account details and start the cluster creation process
  • Define a boundary policy for Scylla Cloud on your AWS account
  • Create a Scylla Cloud policy for your AWS account
  • Create a Scylla Cloud role and give it specific privileges
  • Create the cluster and set it to run Scylla Cloud from your AWS account

Setting up permissions for Scylla Cloud backups in your AWS policies.

It’s then a matter of adding a few AWS account details. Once that’s done, you can navigate to My Clusters and click Add New Cluster. In the deployment drop-down, you can choose to deploy the cluster in your own AWS account. It is as simple as that!

To wrap up, Scylla’s new BYOA feature makes it the only fully managed NoSQL DBaaS that runs on your own cloud infrastructure. Your team will save lots of time as the ScyllaDB team of experts takes care of the toil associated with the operational overhead of running a database in the cloud. You’ll also save a lot of money and make the bean counters in your organization happy as well.

You can read more about how to set up your Scylla Cloud with your own Amazon account with step-by-step instructions here.

FAQs

Is Scylla BYOA a single sign-on feature?
No! The “account” in the Bring Your Own Account feature refers to your cloud provider account — in this case, AWS.

How will ScyllaDB manage my cluster with BYOA?
We need only the limited privileges defined in your policy that allow us to create and manage Scylla Cloud resources directly in your account. We recommend you create a sub-account for Scylla BYOA for better isolation of resources.

GET STARTED ON SCYLLA CLOUD


>30 Developers to Share Insights on Application Performance at P99 CONF


P99 CONF speakers

We’re always excited to talk about our database. But we are also eager to speak to the underlying challenges, techniques, and movements around high performance architectures. After all, our founding team invented the KVM hypervisor. We think about low-level performance optimizations literally all the time.

So we reached out to leading technologists and asked them to present their key achievements on topics ranging from the kernel to kernel bypass, cool emerging programming languages, and heavyweight frameworks.

That’s why we’re excited to unveil the full roster of 30+ speakers for P99 CONF, a new, cross-industry virtual event for engineers and by engineers, taking place October 6-7. P99 speakers are leading developers of top projects — from Rust, Java, golang, Kubernetes, and the Linux kernel to industry-leading database and streaming projects.

SEE OUR FEATURED SPEAKERS HERE

These speakers represent a wide range of companies that have faced and overcome latency issues at scale, including Netflix, Twitter, Percona, Datadog, Red Hat, Dynatrace, and, of course, ScyllaDB. (We’re sponsoring the event, but like all the other speakers we’re keeping our talks strictly technical.)

Since P99 CONF is about technology instead of products, open source solutions will be in the spotlight. In place of the usual thinly veiled product pitches and entry-level overviews, the event is tuned to provide insights and techniques that developers can mobilize immediately, on any project.

Conference topics will focus relentlessly on the technical and practical aspects of high-performance architectures, from the OS (kernel, eBPF, io_uring) and CPUs (Arm, Intel, OpenRISC) to middleware and programming languages (Go, Rust, the JVM, DPDK), along with databases and observability methods.

P99 speaker Glauber Costa, a staff engineer at Datadog, invited developers to attend P99 CONF “to hear from colleagues in the trenches about innovative approaches they won’t find anywhere else.” Glauber sees latency as one of the defining challenges of modern application design, but one that unfortunately gets little attention until a project is suffering in production. We are looking forward to Glauber’s presentation at P99 CONF, which will cover designing for optimal performance in Rust.

Overall, the conference sessions will explore creative solutions to the complex challenges of real-time applications, with deep dives on topics including:

  • Development: Techniques for programming languages and operating systems
  • Architecture: High-performance distributed systems, design patterns and frameworks
  • Performance: Capacity planning, benchmarking and performance testing
  • DevOps: Observability and optimization to meet SLAs
  • Use Cases: Low-latency applications in production and lessons learned

To ensure attendees receive a full range of perspectives on the topic, the roster includes chief scientists and researchers, architects, staff engineers, performance engineers, and straight up geeks.

To give a few examples, Waldek Kozaczuk, an OSv Committer, will speak on running stateless and serverless apps in the cloud. Brian Martin, software engineer at Twitter, will discuss high-performance refactoring in Rust. Yarden Shafir, software engineer at Crowdstrike, will talk about optimizing Windows I/O. Tejas Chopra, senior software engineer at Netflix, will present on object compaction in the cloud.

There’s something for everybody — as long as “everybody” happens to be a developer. 🙂

Register now to save your seat for two half days of keynotes, technical deep dives, and lively conversations on all things P99!

P99 CONF is free for all developers. Please read the P99 CONF Code of Conduct. We value the participation of each member and want all attendees to have an enjoyable and fulfilling experience.

We’ll announce the full agenda for P99 CONF soon. Follow us on Twitter @p99conf for updates.


Which Will Be the Best Wide Column Store?


This past week Cassandra 4.0 was finally released as GA, six years after the previous major release.

Initially developed as an open source alternative to Amazon DynamoDB and Google Cloud Bigtable, Cassandra has had a major impact on our industry. To the surprise of none, it eventually became one of the 10 most popular databases — quite an achievement! We at ScyllaDB were inspired by Cassandra seven years ago when we first decided to reimplement it in C++ in a close-to-hardware design while keeping its symmetric, scale-out architecture.

Cassandra’s impact continues to echo through our industry. Even in recent years, long after Cassandra’s creation, cloud providers like Azure and AWS have added the CQL (Cassandra Query Language) API with varying degrees of compatibility; they’ve even added a managed Cassandra option.

However, even die-hard fans of Cassandra recognize that the project has slowed dramatically. This slow-down is visible in the trends for Cassandra’s DB-Engines.com ranking (see image below), its GitHub stars, and its active commits.

One of the key improvements in Cassandra 4.0 is stabilization. We can attest that the late RC versions and GA are much better than the beta. In fact, the project reports more than 1,000 bug fixes. We see 476 closed issues year-to-date. (By comparison Scylla has closed 611 in the same period while not focusing solely on stability.)

Despite its success with a wide base of users, and despite having a foundation structure for its leadership, Cassandra’s contributors come primarily from Apple and DataStax. That is, its wide user base does not translate to core development, and the number of active contributors is lower than in other projects — Kafka, Spark or Scylla. Yes, we’re speaking from a somewhat biased perspective, but these points are quite factual — even HBase has higher monthly code commits. When both of the major contributing companies are focusing on their own products, the amount of fresh code that reaches the project is low. This is not an opinion; it is a fact when you check the codebase — despite being six years in the making, the Cassandra 4.0 release doesn’t include many new features. Netty, virtual tables and zero-copy streaming for leveled compaction aren’t that impressive. The major change is the use of a new JVM and a much better garbage collection algorithm — improving performance and latency by more than 25% over Cassandra 3.11. (Actually, we believe it helps even more than that, and we’ll soon publish a detailed report.)

We congratulate Apache Cassandra on a new major release, which is definitely more stable than any previous Cassandra release. In addition, the new JVM with the ZGC and Shenandoah collectors is a big improvement. But will this pace be sufficient to compete with the Amazons of the world?

The Dominant Wide Column Database of the Future

We at ScyllaDB have been challenging Cassandra for a long while. We firmly believe that we’re in the ideal position to become the best wide column store. Our database is known for its performance, and we’ll follow up with a Cassandra 4.0 benchmark to illustrate our performance advantages. Yet there are far more reasons why one would choose Scylla:

1. Open Source and Community
Scylla is an open-source-first company. That means that any new change — from features to bug fixes — goes first into the Scylla open source branches and only later gets backported to enterprise/cloud. We’re completely dedicated to open source, and we have championed industry-wide open source projects such as KVM, the Linux kernel, OSv and many others.

Scylla has many active contributors who commit at a rate of 100 commits/month. The vast majority of them come from ScyllaDB as this core project is highly complicated and our use of C++20 along with shard-per-core make the technical bar extremely high. We may not be a non-profit foundation community project but we tick all the other boxes (many of which are more important). We believe in open source and continue to dedicate lifetime(s) for it. We are also believers in traditional open source licenses and do not jump on SSPL-like trends.

Seastar, our open source core engine, receives many more external contributions and is powering Red Hat’s Ceph, Vectorized’s streaming platform and other outstanding projects. There are many other open source projects — Scylla Operator was initiated by our community, Scylla drivers are developed together with our community, and projects like the Python sharded driver, the Rust driver and GoCQLX are vibrant examples. There are many more, from Kafka connectors and CDC libraries to Spark migrators. And we haven’t even mentioned Alternator — our open source, DynamoDB-compatible API.

2. Current Featureset
Scylla provides a featureset that’s superior to Cassandra. Our Change Data Capture feature is the most complete in the NoSQL community and we offer client libraries and Kafka connectors. But don’t take it from us; listen to what Confluent have to say about it in this recent webinar.

Materialized Views are supported at the GA level. Design flaws led Cassandra to mark MVs as experimental, while we at Scylla fixed most of the issues, and we’re about to make a major architectural change to make them flawless.

Scylla is unique in its ability to provide per-service SLAs with different priorities. We have even developed a new WebAssembly user-defined function option, which will allow running code compiled from almost any language inside Scylla itself.

3. Scalability
Cassandra has always been the industry’s north star in terms of the number of nodes per cluster. Scylla adopted the same design and made it better, making it possible to scale to hundreds and even thousands of nodes. But the real difference for Scylla is scaling up — Scylla scales up to 256 CPUs and more. Moreover, our nodes can host 60TB of data while streaming or decommissioning at hardware speed. It’s not rare to see disk writes of 12GB/s (bytes!) and 50Gbps networking. Scylla can scale the number of keyspaces to hundreds and beyond, and our single-partition record is 200GB and growing.

4. API
Scylla fully supports the CQL API, along with an Amazon DynamoDB-compatible API. We even received a community-led Redis API (the basic K/V portion). As the core is scalable and robust, it is easy to add compatible APIs. Beyond ease of migration, the Scylla team analyzes each API and expands the core capabilities accordingly. DynamoDB Streams encouraged us to develop our CDC approach with pre/post images, but with an innovative internal-table interface. DynamoDB’s leader election was one of the key drivers for adding the Raft protocol.

5. Roadmap
Databases are a 50-year-old domain, and even NoSQL itself isn’t new. However, the amount of innovation, use cases, challenges, and compute environments has never been this large. Our hands are full with work, and we constantly have no choice but to pass on exciting project ideas.

Yet the Scylla team has decided to COMPLETELY TRANSFORM our fundamental assumptions. After we implemented LWT using Paxos and added read-modify-write verbs to match DynamoDB using LWT, we realized that it was time to move to a better consensus protocol. This is also a good opportunity to address other oversights of Scylla/Cassandra: make schema changes transactional, make topology changes consistent (and thus double a cluster in a single operation), and provide full consistency at the price of eventual consistency.

Our Raft initiative was announced last January and is now part of the Circe project. Raft will enable long-term improvements such as synchronous materialized views and can completely eliminate repair (as the replicas are always in-sync).

6. Performance
We will publish a detailed benchmark comparison between Cassandra 4.0 and Scylla in the next couple of weeks. We tested throughput, latency and, more importantly, the speed of maintenance operations — from streaming to new nodes, decommission, failover and more. We’ve demonstrated leadership vs DynamoDB, Cassandra, Bigtable and others, but we’ll save all that for another day.

Cassandra isn’t going to go away. Instead, it will slowly lose traction as its development pace continues to slow down. Databases are hard to replace (here’s proof: Microsoft Access is still ranked high on DB-Engines), but Cassandra won’t be top of mind for new projects, either.

To be clear, we are not interested in the death of Cassandra. This is not a zero-sum game. If the ecosystem evolves and more CQL-based tools are generated, the whole segment benefits. More tools can be created on top of CQL, from JanusGraph to KairosDB, Kong and more. File formats can improve, drivers flourish and so forth. We are happy to see our own tools being used broadly, even by competitors — such as our Rust driver, Spark migrator and Gemini quality assurance tool.

All this to say, we are obviously very bullish on our future. We encourage you to join thousands of developers from such leading brands as Discord, Disney+, Bloomberg, Palo Alto networks, Instacart and many more who have chosen Scylla as their NoSQL database.

You’ll find that the best reason of all to try Scylla is on our downloads page.



ScyllaDB Brings Scylla Cloud to Google Cloud


Today, we’re announcing the general availability of Scylla Cloud on Google Cloud. Scylla Cloud is our resilient, highly performant, fully managed NoSQL database-as-a-service (DBaaS). Since its release, Scylla has become the go-to database for companies that need a database built from the ground up for modern cloud environments. In 2019 we introduced Scylla Cloud, our fully managed DBaaS. Initially available on AWS, users discovered that Scylla Cloud made it easier for them to operate and scale their NoSQL workloads, since it alleviated them of administrative burdens. Today’s announcement gives users the flexibility to run Scylla Cloud on the public cloud of their choice.

For those not already familiar with Scylla, it is a wide-column NoSQL database API-compatible with both Apache Cassandra CQL and DynamoDB. While there are many offerings of Cassandra-compatible databases in the industry (of which we believe we are the best of breed), we are the first and currently the only company in the industry to offer a DynamoDB-compatible managed database on a public cloud other than AWS. (Learn more about our Alternator API below.)

Already today 82% of our customers run Scylla on public clouds, and we believe this announcement will accelerate Scylla’s adoption within the Google Cloud community. Built with a close-to-the-hardware, shared-nothing design, Scylla Cloud empowers organizations to build and operate real-time applications at global scale — all for a fraction of the cost of other DBaaS options.

Scylla Cloud is now available in 20 geographical regions served by Google Cloud, from key US regions (Virginia, Ohio, California, and Oregon), to locations in Asia, Europe, South America, the Middle East, and Australia. Scylla Cloud will be deployed to the n2-highmem series of servers, known for their fast, locally-attached SSD storage.

The SADA Connection

To make this happen, Scylla partnered with SADA Systems, winner of Google Cloud’s partner of the year award for three years running. With over 5,000 Google Cloud customers accounting for half a billion dollars in spend, SADA’s mission is to harness the power of Google Cloud to help you activate what you need, when you need it.

Miles Ward, SADA Systems’ CTO, pointed out a key benefit of Scylla Cloud on Google Cloud: “With Scylla Cloud running on Google Cloud, your data resides on the same infrastructure as other Google Cloud services and applications. Developers can spin up clusters in minutes and instantly gain access to the high throughput and predictable low-latency performance of Scylla Cloud.”

Companies using Scylla on Google Cloud Today

A number of Scylla customers already run Scylla Enterprise on Google Cloud. One such customer is Zeotap. Zeotap’s Customer Intelligence Platform hosts one of the world’s largest identity graphs, consisting of 20 billion nodes, 8 billion edges, and 3.6 billion identities. Based in Berlin, Zeotap has the additional consideration of the European Union’s General Data Privacy Regulation (GDPR); the service must provide adequate security and control over regional data sharing.

By running Scylla on Google Cloud, Zeotap reduced their data processing SLA from 10 hours (sometimes even more than a day!) to as little as 2 hours. They slashed the job failure rate from 20% per day to 2% per day. For more on Zeotap’s use of Scylla, read this blog post from last year.

Investing.com also runs Scylla on Google Cloud. With 12 million monthly unique visitors and 700,000 daily mobile visitors, latency affects many users. By moving to Scylla on Google Cloud, Investing.com experienced improved latency and better hardware utilization, which allowed the team to shrink their database cluster size by half.

Of course, what’s changing today is that we’re offering these benefits as a fully managed service on Google Cloud. The day-to-day management and monitoring of your clusters will be handled by the Scylla Cloud operations and support team.

The Benefits of Scylla Cloud on Google Cloud

Scylla Cloud running on Google Cloud delivers the full range of capabilities and benefits available on other cloud platforms:

  • Scale Up and Out: Scylla’s performance grows linearly with larger compute instances and additional cores. This translates to fewer nodes to provision and significantly lower cost to use Scylla Cloud compared to other NoSQL DBaaS options.
  • Resilient & Highly Available: Scylla Cloud automatically replicates data across multiple availability zones within a region, totally eliminating single points of failure. Scylla Cloud customers can add replicas and expand clusters across data centers as needed.
  • Security: Among managed DBaaS offerings, Scylla Cloud uniquely provides single-tenant hardened security, with encrypted data at rest, data in transit, encrypted backups, and key management. Additionally, Scylla Cloud is SOC2 Type II certified.
  • Hot Fixes: Updates are applied transparently to a running system, ensuring that the latest features and security updates are always installed.
  • Virtual Private Cloud (VPC) Peering: Applications can connect securely to the Scylla Cloud environment, and use private IPv4 or IPv6 addresses to avoid routing traffic over the Internet.
  • Automated Backups: Scylla Cloud offers automated backups directly to Google Cloud Storage.
  • Automated Monitoring: The Scylla Cloud engineering team also monitors your clusters 24x7x365 to ensure your database conforms to your SLAs. Metrics in Prometheus format can also be provided for consolidated monitoring by customers.

Beyond Vendor Lock-in: Running DynamoDB Workloads on Google Cloud

As stated above, one of Scylla’s unique features is its support for DynamoDB workloads. Scylla exposes a compatibility API, known as Scylla Alternator, which enables you to agnostically run your DynamoDB workloads anywhere — on any public cloud, in private clouds, or even on-premises. We have a number of customers already using this API in production.

By releasing Scylla Cloud on Google Cloud, we now support running DynamoDB workloads in a managed service on Google Cloud. For companies concerned about cloud vendor lock-in, this provides you the flexibility you’ve been looking for.

To help support your transition, our Spark-based Scylla Migrator makes switching between DynamoDB and Scylla Cloud on Google Cloud a far more manageable task.

If you have questions about whether to run Scylla Cloud using the CQL interface or its DynamoDB-compatible Alternator interface, you might find this article handy in enumerating the differences, or contact us to ask more in-depth questions.

Pricing

We strive to provide as many pricing options as possible to suit your team’s billing structure. You can structure your plan in the following ways:

  • Annual reserved pricing, billed upfront
  • Annual reserved pricing, billed monthly
  • On-demand hourly

We’ve also expanded our cloud pricing calculator to support both Google Cloud and AWS. The calculator helps you to work through scenarios and determine how Scylla Cloud fits into your budget. You specify your desired on demand or reserved pricing model, your estimated peak reads and writes per second, and the total data you need to store, and we’ll tell you the number and types of servers you’ll need to deploy, with an estimated monthly price.

Our handy pricing calculator lets you model your costs based on workloads and the size of your data set. 

Global distribution, massive scale, and high performance are table stakes for modern business. Where most other NoSQL databases fall short, Scylla provides scale along with the industry’s best price/performance.

GET STARTED WITH THE SCYLLA CLOUD PRICING CALCULATOR

The post ScyllaDB Brings Scylla Cloud to Google Cloud appeared first on ScyllaDB.

Overheard at Scylla University LIVE Summer Session


Last week we hosted the Scylla University LIVE Summer Session virtual training. It was held on two consecutive days, one more convenient for EMEA timezones and another for the Americas.

Attendance was high. We saw a 27% increase in the number of participants from our previous LIVE event held in April, plus we were very pleased to see that both days had roughly equal numbers of attendees.

Our attendees came from a broad range of industry leading companies such as Amazon, Ericsson, Disney, Expedia, Apple, Fujitsu, Nubank, Rackspace, Palo Alto Networks, RSA, Salesforce and many others.

We don’t have an on-demand version of the event. However, you can find the slides from the different sessions, quiz questions, and some of the labs on Scylla University. Here is the link for the Essentials track course, and here is the one for the Advanced topics course. For those of you that are not familiar with Scylla University, it’s our online, free, self-paced Scylla training center. To get started all you have to do is create a user account.

Users that complete a course will get a certificate of completion and some bragging rights.

You had Questions? We had Answers!

Participants asked lots of questions in the different sessions and in the Expert Panel. Here are some of the more interesting ones:

Q: Are Materialized Views stable enough for production or are they still in testing?
A: Scylla’s Materialized Views feature has been available since 2019. It is stable, used in production, and has reached General Availability (GA) status. Learn more here.
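As a quick illustration (the keyspace and table names below are ours, not from any particular deployment), a view that lets you look up users by email could be defined like this:

CREATE KEYSPACE IF NOT EXISTS app
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE app.users (
    id uuid PRIMARY KEY,
    email text,
    name text
);

-- The view's primary key must include every base-table key column,
-- and each view key column needs an IS NOT NULL restriction.
CREATE MATERIALIZED VIEW app.users_by_email AS
    SELECT id, email, name
    FROM app.users
    WHERE email IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (email, id);

-- Query the view like any other table:
SELECT id, name FROM app.users_by_email WHERE email = 'user@example.com';

Scylla keeps the view in sync with the base table automatically on every write.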

Q: What are the benefits of migrating from Cassandra to ScyllaDB?
A: In addition to better performance, the cluster will behave in a more predictable and consistent way. Scylla is implemented in C++, so you don’t have to worry about Garbage Collection (GC); it also tunes itself automatically and is much easier to maintain. In addition, Scylla has some unique features, or features that are implemented in a better way than in Cassandra. For instance, our Change Data Capture (CDC) feature makes it far easier to consume change data right out of a CQL-readable table. Scylla’s Materialized Views are production-ready, while Cassandra’s are not, and Scylla’s Lightweight Transactions (LWT) are implemented more efficiently and perform better than Cassandra’s. You can read more about the important differences between Scylla and Cassandra here.
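For example, enabling CDC on the hypothetical app.users table from the previous answer is a one-line schema change, and the changes then become queryable through an ordinary CQL table:

ALTER TABLE app.users WITH cdc = {'enabled': true};

-- Scylla exposes the change log as a regular, CQL-readable table:
SELECT * FROM app.users_scylla_cdc_log LIMIT 10;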

Q: How can I migrate from Cassandra to Scylla?
A: There are different ways to migrate: one is the SSTable Loader tool (sstableloader), another is the Scylla Spark Migrator. You can read more about them in the documentation. Note that these migrations can be performed without any downtime.
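As a rough sketch of the sstableloader route (the node IPs and paths below are placeholders), you point the tool at a directory laid out as keyspace/table containing the exported SSTables, plus one or more nodes of the target Scylla cluster:

# Placeholders: adjust the node IPs and the path to your exported SSTables.
sstableloader -d 10.0.0.1,10.0.0.2 /var/backups/my_keyspace/my_table/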

Q: What are the differences between Scylla Cloud and Scylla Enterprise?
A: Scylla Cloud is a fully-managed database as a service (DBaaS). Under the hood it runs Scylla Enterprise. However we take care of things like software updates, backups, repairs, day-to-day monitoring, and security hardening so you can concentrate on your application development.

Q: When would it make sense to use the Scylla DynamoDB compatible API (Project Alternator) as opposed to using CQL?
A: If you’re starting from scratch, it would make more sense to use CQL. If you have an existing project that uses DynamoDB and you’re looking for better performance, reduced costs and no vendor lock-in, it would make more sense to use the Alternator project. You can read more about the difference between these two APIs here.

Essentials Track

  • In the Getting Started with Scylla session, I covered basic topics such as an intro to Scylla. Before diving into the theory I ran users through our Quick Wins lab which shows how easy and fast it is to start a Scylla cluster and perform some basic queries. I also gave an overview of the design goals for Scylla: High Availability, High Scalability, High Performance, Low Maintenance and being API-level compliant with Apache Cassandra and DynamoDB. Keep in mind that in Scylla, high availability is given preference over consistency. I explained concepts such as Node, Keyspace, Consistency Level, Replication Factor, Token Ranges, Cluster, and more. I then moved on to talk about Data Modeling, the importance of primary key selection and what users should focus on when creating a data model in Scylla (or Cassandra for that matter).
  • Tzach Livyatan, Scylla’s VP of Product, talked about Advanced Data Modeling. Tzach continued where I left off and dove into choosing a partition key, using some examples. He then talked about Materialized Views and Secondary Indexes and when to use each. Next, Tzach talked about Counters, which are a Conflict-free Replicated Data Type (CRDT): concurrent updates converge to a stable value. Counters support increment and decrement and are implemented as a set of triplets (node ID, vector clock, value). He also covered Sets, Lists, and Maps, giving an example of each type of collection. Then he talked about User-Defined Types (UDTs) and presented some code showing how they can be used. Tzach also discussed Time To Live (TTL) and how and when to use it (a short CQL sketch of counters and TTL follows this list). Finally, Tzach talked about Lightweight Transactions and when and how to use them.
  • Avishai Ish Shalom, Developer Advocate, gave the last talk in the Essentials track, titled Building Well-Architected Apps in Scylla Cloud. Avishai covered consistency in Scylla, what happens when writes fail, and the options for handling such failures. He then talked about different architecture patterns such as the read/write split, write dam, and Harvest and Yield, before moving on to some examples.
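As promised above, here is a minimal CQL sketch of a counter column and of TTL; the demo keyspace and table names are ours, purely for illustration:

CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Counter column: supports increment and decrement only.
CREATE TABLE demo.page_views (page text PRIMARY KEY, views counter);
UPDATE demo.page_views SET views = views + 1 WHERE page = '/home';

-- TTL: this row expires automatically after 24 hours (86400 seconds).
CREATE TABLE demo.sessions (id uuid PRIMARY KEY, username text);
INSERT INTO demo.sessions (id, username) VALUES (uuid(), 'alice') USING TTL 86400;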

Advanced Track

  • Piotr Grabowski, Software Developer, talked about Working with Kafka and Scylla. He started by giving an overview of Apache Kafka, an open-source distributed event streaming system, and presented a hands-on demo. Kafka allows users to ingest data from a multitude of different systems, such as databases, services or other applications. It can then store the data for future reads, process and transform the incoming streams in real-time, and allow downstream applications to consume the stored data stream. As an example of the latter, Piotr went over the Scylla Sink Connector, which reads messages from a Kafka topic and inserts them into Scylla. The connector supports different data formats (Avro, JSON). Piotr then covered the Scylla CDC Source Connector, giving a quick overview of Change Data Capture (CDC) and how to use it.
  • Lubos Kosco, Software Engineer, covered utilizing Spark and Scylla together. Apache Spark is a unified analytics engine for large-scale data processing. It allows for writing data analytics applications quickly in Java, Scala, Python, R, and SQL. The Scylla Spark Connector allows for integration between Scylla and Spark. Lubos showed an example of using it to connect to Scylla, perform queries and process data. Lastly Lubos talked about the Scylla Migrator, showing an example of how to use it with a sample app.
  • Amnon Heiman, Software Developer, gave a talk about Improving Applications Using Scylla Monitoring, the go-to tool for understanding what’s going on in your cluster. After an overview, Amnon covered common pitfalls and how to detect them in the different monitoring dashboards, performance issues, alerts, and how to debug problems.

Attendee Poll Results

While attendees were learning about Scylla, we wanted to learn about you. A few last observations we’d like to share from the event come from our attendee polls. Let’s look at the results.

How are you interested in deploying Scylla?

The vast majority of our attendees — over two thirds — were interested in deploying Scylla Open Source. Of the remaining third, the ratio of users who want to run their own instances of Scylla Enterprise outnumbered those who want a fully-managed DBaaS option by a factor of two-to-one.

How much data do you have under management in your transactional database systems?

What’s interesting in terms of data size is that the largest group of attendees — nearly half — support workloads at the low-terabytes scale. Together with sub-terabyte workloads, that accounts for 80% of all attendees. However, about 20% are at the 50 terabyte or larger scale; about half of those — one attendee in ten overall — are operating workloads with more than 100 terabytes of data under management.

What is your level of experience with Scylla?

As expected for an online training opportunity, the majority of attendees — about 70% — were new to Scylla. About half of all attendees knew some sort of NoSQL; only for a fifth of attendees was this their very first experience with NoSQL. About 30% of our attendees were already using Scylla and sought even deeper insights.

Enroll in Scylla University

Thank you to everyone who attended our Scylla University LIVE Summer Session and made the event so great! If you didn’t have a chance to attend, never fear! Many of these courses are available right now in a free self-paced form in Scylla University. Create an account and get started today!

REGISTER NOW FOR SCYLLA UNIVERSITY

The post Overheard at Scylla University LIVE Summer Session appeared first on ScyllaDB.

Step-by-Step Guide to Getting Started with Scylla Cloud


More and more teams are choosing our Scylla Cloud as their database-as-a-service. To make this transition even easier, in this post we’ll go step-by-step into what it takes to get your Scylla Cloud cluster up and running quickly.

First Steps

  1. If you haven’t already, sign up for a new Scylla Cloud account
  2. Recommended: Set up Two-Factor Authentication (2FA) for your Scylla Cloud user account.
  3. Optional: Take advantage of our free trial offering and immediately create your first cluster.
  4. Recommended: After setting up your free trial, check out our Scylla University Scylla Cloud lab session that walks you through all the steps to create a simple demo cluster. Then connect to your new cluster and execute basic commands to insert and select sample data (a minimal example follows this list).
  5. Check out the other free Scylla University courses.
  6. Bookmark the Scylla Cloud documentation site.
  7. Join our community on Slack!
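The lab covers this in detail, but to give a flavor of those basic commands, a first cqlsh session typically boils down to something like the following; the keyspace, table, and values here are just illustrative, and for a real multi-datacenter Scylla Cloud cluster you would normally use NetworkTopologyStrategy with your datacenter name instead of SimpleStrategy:

CREATE KEYSPACE IF NOT EXISTS mykeyspace
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS mykeyspace.songs (
    id uuid PRIMARY KEY,
    title text,
    artist text
);

INSERT INTO mykeyspace.songs (id, title, artist)
    VALUES (uuid(), 'Golden Brown', 'The Stranglers');

SELECT * FROM mykeyspace.songs;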

Select Your Cloud Provider

Scylla Cloud is available on both AWS and Google Cloud public clouds, so you can choose your preferred cloud provider to run your cluster. We chose compute instances with an optimal CPU/RAM ratio and local NVMe storage to guarantee predictable performance, high throughput, and low latencies for your applications.

  • On the AWS cloud platform, we support high-performance storage optimized i3 and i3en instances.
  • On Google Cloud, we are utilizing n2-highmem machines.

Based on your selection, the Scylla Cloud wizard will automatically update the list of geographic regions where you can deploy your cluster and instance types with the estimated cost of cloud compute resources per hour per node.

BYOA as an Option

Scylla Cloud users on AWS have the option to provision Scylla Cloud EC2 resources directly into their own AWS accounts. We call it “Bring Your Own Account” or BYOA. This option allows you to manage all your AWS resources under one account. It also enables you to take advantage of any pre-negotiated AWS rates or cloud credits and apply them towards Scylla Cloud compute resources. Plus, this may help you satisfy strict compliance requirements where sensitive data must remain within your own accounts.

Upon selection of your account as a destination for Scylla Cloud deployment, Scylla Cloud provides a wizard that will walk you through all the steps to create a cloud policy and an IAM role for Scylla Cloud. See our documentation for more details. If you have any questions, you can always open a support request through the Scylla Cloud user interface.

 

Capacity Planning

After you have settled on your choice of cloud provider and deployment options for Scylla resources, it’s time to choose the right instance types to satisfy your workloads. We offer a capacity planning calculator for your convenience to help you properly size your cluster based on a few properties:

  • read and write throughput
  • average item size
  • projected data set size

The calculator takes those inputs and provides you with a suggested cluster specification, plus a cost estimate for on-demand and reserved capacity. Please note that the calculator cannot factor in more complex types of workloads or advanced features, such as compute or throughput intensive operations. If you have any questions about planning for your own workloads, please reach out to our Solution Architects team, using our Contact Us page.

To learn more about capacity planning for Scylla Cloud, please read this blog post.

Deployment

Now use the cluster specifications to select instance types within your desired geographic region, the number of nodes you need, and the replication factor you want for your data. Scylla Cloud will automatically provision your cluster across multiple availability zones to ensure the high availability and resilience of your cluster.

You can learn more about Scylla’s high availability design here.

Don’t forget to name your cluster! We also take security seriously, so you must provide a list of IP addresses that are allowed to communicate with your cluster on Scylla Cloud.

We highly recommend you avoid routing your traffic over the open internet and instead directly connect your Virtual Private Cloud (VPC) to Scylla Cloud. Read how you can enable VPC peering on your cluster.

Launch Time!

When you are satisfied with your selection, simply click the “Launch Cluster” button to provision cloud resources. Now sit back and relax; our automated processes will do all the work.

Connecting to Your Cluster

After your VPC peering is set up, provisioning is done, and your cluster is ready for use, you can immediately connect to your cluster. Scylla Cloud provides you with default credentials and detailed instructions on how to connect to your cluster with different clients and drivers.

Don’t have a driver? Now’s the time to download one. We have links to our CQL drivers right inside the Scylla Cloud user interface.

While we generate a very strong randomized password for you, you can take this opportunity to change the default password of the "scylla" user for security purposes!
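For instance (with a placeholder IP address and passwords; your cluster's actual connection details are shown in the Scylla Cloud UI), connecting with cqlsh and rotating the password looks roughly like this:

# Connect using the credentials shown in the Scylla Cloud UI (placeholder IP below):
cqlsh 203.0.113.10 9042 -u scylla -p 'the-generated-password'

-- Then, inside cqlsh, set a new password for the default user:
ALTER ROLE scylla WITH PASSWORD = 'a-new-strong-password';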

Monitoring Your Cluster

Scylla Cloud comes integrated with Scylla Monitoring Stack for greater visibility into your cluster’s health and performance. While our team of engineers is taking care of all cluster management and health monitoring tasks, we also give you a real-time dashboard that provides a transparent view of your cluster health, lets you explore CQL metrics, and proactively identifies potential issues in the Advisor section.

Testing the Limits

While we can claim Scylla is the most performant NoSQL database, it’s best to see the objective evidence for yourself. How about putting your cluster to a stress test and going from zero to 2M OPS in just 5 minutes?

Note that depending on your own cluster’s specification, you may see more or fewer operations per second compared to the test setup.

Alternator

It’s important to note that this guide was written with Scylla’s CQL interface in mind. However, if you are using our Amazon DynamoDB-compatible API, known as Project Alternator, you can check out the documentation here.

Conclusion

Thank you again for trusting your organization’s data and daily operations to Scylla Cloud. We strive to improve the product every day, so if there is a feature you don’t see, or have questions we haven’t already answered in this guide, please let us know!

Feel free to ask a question on our Slack channel.

Or if your question is more of a confidential nature, contact us privately.

GET STARTED ON SCYLLA CLOUD


The post Step-by-Step Guide to Getting Started with Scylla Cloud appeared first on ScyllaDB.

Cassandra 4.0 vs. Cassandra 3.11: Comparing Performance


This is part one of a two-part blog series on the relative performance of the recently released Apache Cassandra 4.0. In this post we’ll compare Cassandra 4.0 versus Cassandra 3.11. In part two of this series, we’ll compare both of these Cassandra releases with the performance of Scylla Open Source 4.4.

Apache Cassandra 3.0 was originally released in November of 2015. Its last minor release, Cassandra 3.11, was introduced in June of 2017. Since then users have awaited a major upgrade to this popular wide column NoSQL database. On July 27, 2021, Apache Cassandra 4.0 was finally released. For the open source NoSQL community, this long-awaited upgrade is a significant milestone. Kudos to everyone involved in its development and testing!


Cassandra has consistently been ranked amongst the most popular databases in the world, as per the DB-engines.com ranking, often sitting in the top 10.

TL;DR Cassandra 4.0 vs Cassandra 3.11 Results

As the emphasis of the Cassandra 4.0 release was on stability, the key performance gain comes from a major upgrade of the JVM (OpenJDK 8 → OpenJDK 16) and the use of ZGC instead of G1GC. As you can quickly observe, latencies under maximum throughput improved drastically! You can read more about the new Java garbage collectors (and their various performance test results) in this article.

P99 latencies at one half (50%) of maximum throughput of Cassandra 4.0. Cassandra 4.0 reduced these long-tail latencies between 80% – 99% over Cassandra 3.11.

Maximum throughput for Cassandra 4.0 vs. Cassandra 3.11, measured in 10k ops increments, before latencies become unacceptably high. While many cases produced no significant gains for Cassandra 4.0, some access patterns saw Cassandra 4.0 capable of 25% – 33% greater throughput over Cassandra 3.11.

In our test setup, which we will describe in greater detail below, Cassandra 4.0 showed a 25% improvement for a write-only disk-intensive workload and 33% improvements for cases of read-only with either a low or high cache hit rate. Otherwise max throughput between the two Cassandra releases was relatively similar.

This doesn’t tell the full story, as most workloads aren’t run at maximum utilization, and tail latency at maximum utilization is usually poor. In our tests, we marked the throughput achievable under an SLA of less than 10 ms P90 and P99 latency. At this service level, Cassandra 4.0, powered by the new JVM/GC, can deliver twice the throughput of Cassandra 3.11.

Outside of sheer performance, we tested a wide range of administrative operations, from adding nodes, doubling a cluster, node removal, and compaction, all of them under emulated production load. Cassandra 4.0 improves these admin operation times up to 42%.

For other use cases, Cassandra 4.0’s throughput improvements may be slight or negligible.

Test Setup

We wanted to use relatively typical current generation servers on AWS so that others could replicate our tests, and reflect a real-world setup.

Cassandra 4.0/3.11 nodes: 3x i3.4xlarge EC2 instances (16 vCPUs and 122 GiB memory each), 2x 1.9 TB NVMe in RAID0 per node, network up to 10 Gbps
Loaders: 3x c5n.9xlarge EC2 instances (36 vCPUs and 96 GiB memory each), storage not important for a loader (EBS-only), 50 Gbps network

We set up our cluster on Amazon EC2, in a single Availability Zone within us-east-2. Database cluster servers were initialized with clean machine images (AMIs), running Cassandra 4.0 (which we’ll refer to as “C*4” below) and Cassandra 3.11 (“C*3”) on Ubuntu 20.04.

Apart from the cluster, three loader machines were employed to run cassandra-stress in order to insert data and, later, to provide background load during the administrative operations.

Once up and running, the databases were loaded by cassandra-stress with 3 TB of random data organized into the default schema. At the replication factor of 3, this means approximately 1 TB of data per node. The exact disk occupancy would, of course, depend on running compactions and the size of other related files (commitlogs, etc.). Based on the size of the payload, this translated to ~3.43 billion partitions. Then we flushed the data and waited until the compactions finished, so that we could start the actual benchmarking.
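For reference, the loading step amounts to a cassandra-stress invocation along the lines below; the node IPs and the throttle value are placeholders, since (as noted in the appendix) thread and throttle parameters were tuned separately for each database:

# Load ~3.43 billion partitions into the default cassandra-stress schema (placeholders: IPs, throttle).
cassandra-stress write n=3430000000 cl=QUORUM \
  -schema "replication(strategy=SimpleStrategy,replication_factor=3)" "compaction(strategy=SizeTieredCompactionStrategy)" \
  -mode native cql3 \
  -rate "threads=500 throttle=100000/s" \
  -node 10.0.0.1,10.0.0.2,10.0.0.3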

Limitations of Our Testing

It’s important to note that this basic performance analysis does not cover all factors in deciding whether to stay put on Cassandra 3.x, upgrade to Cassandra 4.0, or to migrate to a new solution. Users may be wondering if the new features of Cassandra 4.0 are compelling enough. Plus there are issues of risk aversion based on stability and maturity for any new software release — for example, the ZGC garbage collector we used currently employs Java 16, which is supported by Cassandra, but not considered production-ready; newer JVMs are not officially supported by Cassandra yet.

Throughputs and Latencies

The actual benchmarking is a series of simple invocations of cassandra-stress with CL=QUORUM. For 30 minutes we keep firing 10,000 requests per second and monitor the latencies. Then we increase the request rate by another 10,000 for another 30 minutes, and so on (in 20,000 increments at higher throughputs). The procedure repeats until the DB is no longer capable of withstanding the traffic, i.e. until cassandra-stress cannot achieve the desired throughput or until the 90-percentile latencies exceed 1 second.

Note: This approach means that throughput numbers are presented with 10k/s granularity (in some cases 20k/s).
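Concretely, each step of that ladder is an invocation roughly like the one below; this sketch shows a mixed 40,000 ops/s step split evenly across the three loaders, following the fixed={rate // loadgenerator_count}/s pattern from the appendix, and the node IPs are placeholders:

# One 30-minute step of the load ladder, run on each of the three loaders (placeholders: IPs, rate).
cassandra-stress mixed 'ratio(write=1,read=1)' duration=30m cl=QUORUM \
  -pop 'dist=UNIFORM(1..3430000000)' \
  -mode native cql3 \
  -rate "threads=500 fixed=13333/s" \
  -node 10.0.0.1,10.0.0.2,10.0.0.3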

We tested Cassandra 4.0 and 3.11 with the following distributions of data:

  • “Real-life” (Gaussian) distribution, with sensible cache-hit ratios of 30-60%
  • Uniform distribution, with a close-to-zero cache hit ratio, which we’ll call “disk-intensive”
  • “In-memory” distribution, expected to yield almost 100% cache hits, which we’ll call “memory-intensive”

Within these scenarios we ran the following workloads:

  • 100% writes
  • 100% reads
  • 50% writes and 50% reads

“Real-life” (Gaussian) Distribution

In this scenario we issue queries that touch partitions randomly drawn from a narrow Gaussian distribution. We make an Ansatz about the bell curve: we assume that its six-sigma spans the RAM of the cluster (corrected for the replication factor). The purpose of this experiment is to model a realistic workload, with a substantial cache hit ratio but less than 100%, because most of our users observe the figures of 60-90%. We expect Cassandra to perform well in this scenario because its key cache is dense, i.e. it efficiently stores data in RAM, though it relies on SSTables stored in the OS page cache which can be heavyweight to look up.

Mixed Workload – 50% reads and 50% writes

The 90- and 99-percentile latencies of UPDATE queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload consists of 50% reads and 50% writes, randomly targeting a “realistic” Gaussian distribution. C*3 quickly becomes nonoperational, C*4 is a little better but doesn’t achieve greater than 40k/ops.

Metric Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs. Cassandra 3.11
Maximum throughput 40k/s 30k/s 1.33x
Maximum throughput with 90% latency <10ms 30k/s 10k/s 3x
Maximum throughput with 99% latency <10ms 30k/s

The 90- and 99-percentile latencies of SELECT queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload consists of 50% reads and 50% writes, randomly targeting a “realistic” Gaussian distribution. C*3 quickly becomes nonoperational, C*4 is a little better but doesn’t achieve greater than 40k/ops.

Metric Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs. Cassandra 3.11
Maximum throughput 40k/s 40k/s 1x
Maximum throughput with 90% latency < 10ms 30k/s 10k/s 3x
Maximum throughput with 99% latency < 10ms 10k/s

Uniform Distribution (low cache hit ratio)

In this scenario we issue queries that touch random partitions of the entire dataset. In our setup this should result in negligible cache hit rates, i.e. that of a few %.

Writes Workload – Only Writes

The 90- and 99-percentile latencies of UPDATE queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed, i.e. every partition in the 1 TB dataset has an equal chance of being updated. C*3 quickly becomes nonoperational, C*4 is a little better, achieving up to 50k/ops.

Metric Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs Cassandra 3.11
Maximum throughput 50k/s 40k/s 1.25x
Maximum throughput with 90% latency < 10 ms 40k/s 20k/s 2x
Maximum throughput with 99% latency < 10 ms 30k/s

Reads Workload – Only Reads

The 90- and 99-percentile latencies of SELECT queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed, i.e. every partition in the 1 TB dataset has an equal chance of being selected. C*4 serves 90% of queries in a <10 ms time until the load reaches 40k ops. Please note that almost all reads are served from disk.

Metric Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs Cassandra 3.11
Maximum throughput 40k/s 30k/s 1.25x
Maximum throughput with 90% latency < 10 ms 40k/s 30k/s 1.25x
Maximum throughput with 99% latency < 10 ms 20k/s

Mixed Workload – 50% reads and 50% writes

The 90- and 99-percentile latencies of UPDATE queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed, i.e. every partition in the 1 TB dataset has an equal chance of being selected/updated. Both C*4 and C*3 throughputs up to 40k ops, but the contrast was significant: C*4’s P90s were nearly single-digit, while C*3s P90s were over 500 ms, and its P99s were longer than a second.

Metric Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs Cassandra 3.11
Maximum throughput 40k/s 40k/s 1x
Maximum throughput with 90% latency < 10 ms 40k/s 20k/s 2x
Maximum throughput with 99% latency < 10 ms 30k/s

The 90- and 99-percentile latencies of SELECT queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed, i.e. every partition in the 1 TB dataset has an equal chance of being selected/updated. C*3 can barely maintain sub-second P90s at 40k ops, and not P99s. C*4 almost achieved single-digit latencies in the P90 range, and had P99s in the low hundreds of milliseconds.

Metric Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs Cassandra 3.11
Maximum throughput 40k/s 40k/s 1x
Maximum throughput with 90% latency < 10 ms 30k/s 20k/s 1.5x
Maximum throughput with 99% latency < 10 ms 20k/s

In-Memory Distribution (high cache hit ratio)

In this scenario we issue queries touching random partitions from a small subset of the dataset, specifically: one that fits into RAM. To be sure that our subset resides in cache and thus no disk IO is triggered, we choose it to be… safely small, at an arbitrarily picked value of 60 GB. The goal here is to evaluate both DBs at the other extreme end: where they both serve as pure in-memory datastores.

Writes Workload – Only Writes

The 90- and 99-percentile latencies of UPDATE queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed over 60 GB of data, so that every partition resides in cache and has an equal chance of being updated. Both versions of Cassandra quickly become nonoperational beyond 40k ops, though C*4 maintains single-digit latencies up to that threshold. C*3 can only maintain single-digit P90 latencies at half that throughput — 20k ops.

Metric Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs Cassandra 3.11
Maximum throughput 40k/s 40k/s 1x
Maximum throughput with 90% latency < 10 ms 40k/s 20k/s 2x
Maximum throughput with 99% latency < 10 ms 40k/s

Reads Workload – Only Reads

The 90- and 99-percentile latencies of SELECT queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed over 60 GB of data, so that every partition resides in cache and has an equal chance of being selected. C*4 can achieve 80k ops before becoming functionally non-performant, whereas C*3 can only achieve 60k ops. C*4 can also maintain single digit millisecond latencies for P99s up to 40k ops, whereas C*3 quickly exceeds that latency threshold even at 20k ops.

Metric Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs Cassandra 3.11
Maximum throughput 80k/s 60k/s 1.33x
Maximum throughput with 90% latency < 10 ms 60k/s 40k/s 1.5x
Maximum throughput with 99% latency < 10 ms 40k/s

Mixed Workload – 50% reads and 50% writes

The 90- and 99-percentile latencies of UPDATE queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed over 60 GB of data, so that every partition resides in cache and has an equal chance of being selected/updated. C*4 can maintain single-digit long-tail latencies up to 40k ops. C*3 can only maintain single-digit P90 latencies at half that rate (20k ops) and quickly rises into hundreds of milliseconds for P90/P99 latencies at 40k ops. Both C*4 and C*3 fail to achieve reasonable latencies beyond those ranges.

Metric Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs Cassandra 3.11
Maximum throughput 40k/s 40k/s 1x
Maximum throughput with 90% latency < 10 ms 40k/s 20k/s 2x
Maximum throughput with 99% latency < 10 ms 40k/s

The 90- and 99-percentile latencies of SELECT queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed over 60 GB of data, so that every partition resides in cache and has an equal chance of being selected/updated. C*4 and C*3 can only maintain single-digit millisecond long-tail latencies at 20k ops throughput (and C*3 only for P90; its P99s are already in the hundreds of milliseconds even at 20k ops). C*4 can achieve single digit P90 latencies at 40k ops, but P99 latencies rise into double-digit milliseconds.

Metric Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs Cassandra 3.11
Maximum throughput 40k/s 40k/s 1x
Maximum throughput with 90% latency < 10 ms 40k/s 20k/s 2x
Maximum throughput with 99% latency < 10 ms 20k/s

Administrative Operations

Beyond the speed of raw performance, users have day-to-day administrative operations they need to perform: including adding a node to a growing cluster, or replacing a node that has died. The following tests benchmarked performance around these administrative tasks.

Adding Nodes

The timeline of adding 3 nodes to an already existing 3-node cluster (ending up with six i3.4xlarge machines), doubling the size of the cluster. Cassandra 4 exhibited a 12% speed improvement over Cassandra 3.

One New Node

In this benchmark, we measured how long it took to add a new node to the cluster. The reported times are the intervals between starting a Cassandra node and having it fully finished bootstrapping (CQL port open).
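One way to spot that moment (with a placeholder IP for the new node) is to watch for the node reaching the UN (Up/Normal) state and for its native-transport port to open:

# Node state as seen by the cluster; the new node shows UN once bootstrap completes:
nodetool status

# Check that the CQL (native transport) port is open on the new node:
nc -zv 10.0.0.4 9042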

Cassandra 4.0 is equipped with a new feature, Zero Copy Streaming (ZCS), which allows efficient streaming of entire SSTables. An SSTable is eligible for ZCS if all of its partitions need to be transferred, which can be the case when LeveledCompactionStrategy (LCS) is enabled. To demonstrate this feature, we ran the next benchmarks with both the usual SizeTieredCompactionStrategy (STCS) and LCS, since the former cannot benefit from Zero Copy Streaming.
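In these benchmarks the compaction strategy is set at load time through cassandra-stress’s -schema option (see the appendix). Equivalently, an existing table can be switched; cassandra-stress writes into keyspace1.standard1 by default, so the LCS variant corresponds to a schema along these lines:

-- Switch the default cassandra-stress table to Leveled Compaction Strategy:
ALTER TABLE keyspace1.standard1
    WITH compaction = {'class': 'LeveledCompactionStrategy'};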

The time needed to add a node to an already existing 3-node cluster (ending up with 4 i3.4xlarge machines). The cluster is initially loaded with 1 TB of data at RF=3. C*4 showed a 15% speed improvement over C*3 when using STCS, and was 30% faster than C*3 when using LCS.

Strategy | Cassandra 4.0 | Cassandra 3.11 | Cassandra 4.0 vs. Cassandra 3.11
STCS | 1 hour 47 minutes 1 second | 2 hours 6 minutes | 15% faster
LCS | 1 hour 39 minutes 45 seconds | 2 hours 23 minutes 10 seconds | 30% faster

Doubling the Cluster Size

In this benchmark, we measured how long it took to double the cluster node count: we go from 3 nodes to 6 nodes. Three new nodes were added sequentially, i.e. waiting for the previous one to fully bootstrap before starting the next one. The reported time spans from the instant the startup of the first new node is initiated, all the way until the bootstrap of the third new node finishes.

The time needed to add 3 nodes to an already existing 3-node cluster of i3.4xlarge machines, preloaded with 1 TB of data at RF=3. C*4 was 12% faster than C*3 using STCS, and 21% faster when using LCS.

Strategy | Cassandra 4.0 | Cassandra 3.11 | Cassandra 4.0 vs. Cassandra 3.11
STCS | 3 hours 58 minutes 21 seconds | 4 hours 30 minutes 7 seconds | 12% faster
LCS | 3 hours 44 minutes 6 seconds | 4 hours 44 minutes 46 seconds | 21% faster

Replace Node

In this benchmark, we measured how long it took to replace a single node. One of the nodes is brought down and another one is started in its place.

The time needed to replace a node in a 3-node cluster of i3.4xlarge machines, preloaded with 1 TB of data at RF=3. Cassandra 4.0 noted significant improvement over Cassandra 3.11.

Strategy | Cassandra 4.0 | Cassandra 3.11 | Cassandra 4.0 vs. Cassandra 3.11
STCS | 3 hours 28 minutes 46 seconds | 4 hours 35 minutes 56 seconds | 24% faster
LCS | 3 hours 19 minutes 17 seconds | 5 hours 4 minutes 9 seconds | 34% faster

Summary

Cassandra 4.0 is undeniably better than Cassandra 3.11. It improved latencies under almost all conditions, and could often sustain noticeably improved throughputs. As well, it sped up the process of streaming, which is very useful in administrative operations.

Key findings:

  • Cassandra 4.0 has better P99 latency than Cassandra 3.11 by up to 100x!
  • Cassandra 4.0 throughputs can be up to 33% greater than Cassandra 3.11’s, but more importantly, under an SLA of < 10 ms P99 latency, Cassandra 4.0 can deliver 2x to 3x the throughput of Cassandra 3.11.
  • Cassandra 4.0 streams data for admin operations up to 34% faster than Cassandra 3.11

Stay Tuned

Stay tuned for Part 2 of our benchmarking analysis, in which we will compare the performance of Apache Cassandra, both 3.11 and 4.0, against Scylla Open Source 4.4.

Appendix

Cassandra 3.11 configuration

JVM settings (JVM version: OpenJDK 8):
-Xms48G
-Xmx48G
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70
-XX:ParallelGCThreads=16
cassandra.yaml (only settings changed from the default configuration are listed below):

disk_access_mode: mmap_index_only
row_cache_size_in_mb: 10240
concurrent_writes: 128
file_cache_size_in_mb: 2048
buffer_pool_use_heap_if_exhausted: true
disk_optimization_strategy: ssd
memtable_flush_writers: 4
trickle_fsync: true
concurrent_compactors: 16
compaction_throughput_mb_per_sec: 960
stream_throughput_outbound_megabits_per_sec: 7000

Cassandra 4.0 configuration

JVM settings (JVM version: OpenJDK 16):

-Xms70G
-Xmx70G
-XX:ConcGCThreads=16
-XX:+UseZGC

-XX:ParallelGCThreads=16
-XX:+UseTransparentHugePages
-verbose:gc
-Djdk.attach.allowAttachSelf=true
-Dio.netty.tryReflectionSetAccessible=true

cassandra.yaml (only settings changed from the default configuration are listed below):

disk_access_mode: mmap_index_only
row_cache_size_in_mb: 10240
concurrent_writes: 128
file_cache_size_in_mb: 2048
buffer_pool_use_heap_if_exhausted: true
disk_optimization_strategy: ssd
memtable_flush_writers: 4
trickle_fsync: true
concurrent_compactors: 16
compaction_throughput_mb_per_sec: 960
stream_throughput_outbound_megabits_per_sec: 7000

Cassandra-stress parameters

  • Background loads were executed in a loop (so duration=5m is not a problem).
  • REPLICATION_FACTOR is 3.
  • COMPACTION_STRATEGY is SizeTieredCompactionStrategy unless stated otherwise.
  • loadgenerator_count is the number of generator machines (3 for these benchmarks).
  • DURATION_MINUTES is 10 for in-memory benchmarks.
Inserting data:
write cl=QUORUM
-schema "replication(strategy=SimpleStrategy,replication_factor={REPLICATION_FACTOR})" "compaction(strategy={COMPACTION_STRATEGY})"
-mode native cql3
(threads and throttle parameters were chosen for each DB separately, to ensure 3TB were inserted quickly, yet also to provide headroom for minor compactions and avoid timeouts/large latencies)

Cache warmup in Gaussian latency / throughput:
mixed ratio(write=0,read=1)
duration=180m
cl=QUORUM -pop dist=GAUSSIAN(1..{ROW_COUNT},{GAUSS_CENTER},{GAUSS_SIGMA})
-mode native cql3
-rate "threads=500 throttle=35000/s"
-node {cluster_string}

Latency / throughput – Gaussian:
duration={DURATION_MINUTES}m
cl=QUORUM
-pop dist=GAUSSIAN(1..{ROW_COUNT},{GAUSS_CENTER},{GAUSS_SIGMA})
-mode native cql3
-rate "threads=500 fixed={rate // loadgenerator_count}/s"

Latency / throughput – uniform / in-memory:
duration={DURATION_MINUTES}m
cl=QUORUM
-pop dist=UNIFORM(1..{ROW_COUNT})
-mode native cql3
-rate "threads=500 fixed={rate // loadgenerator_count}/s"

 

The post Cassandra 4.0 vs. Cassandra 3.11: Comparing Performance appeared first on ScyllaDB.

Apache Cassandra 4.0 vs. Scylla 4.4: Comparing Performance


This is part two of a two-part blog series on the relative performance of the recently released Apache Cassandra 4.0. In part one, we compared Cassandra 4.0’s improvements over Cassandra 3.11. In part two we will compare Cassandra 4.0 and 3.11 with the performance of Scylla Open Source 4.4.

On July 27, 2021, after almost six years of work, the engineers behind Apache Cassandra bumped its major revision number from 3 to 4. Over almost the same period of time, Scylla emerged from its earliest beta (October 2015), proceeded through four major releases, and is currently at minor release 4.4.

In the fast-paced world of big data, many other advances have occurred: there are new JVMs, new system kernels, new hardware, new libraries and even new algorithms. Progress in all those areas presented Cassandra with some unprecedented opportunities to achieve new levels of performance. Similarly, Scylla did not stand still over this period, as we consistently improved our NoSQL database engine with new features and optimizations.

Let’s compare the performance of the latest release of Scylla Open Source 4.4 against Cassandra 4.0 and Cassandra 3.11. We measured the latencies and throughputs at different loads, as well as the speed of common administrative operations like adding/replacing a node or running major compactions.

TL;DR Scylla Open Source 4.4 vs. Cassandra 4.0 Results

The detailed results and the fully optimized setup instructions are shared below. We compared two deployment options in the AWS EC2 environment:

  1. The first is an apples-to-apples comparison of 3-node clusters.
  2. The second is a larger-scale setup where we used node sizes optimal for each database. Scylla can utilize very large nodes, so we compared a setup of 4 i3.metal machines (288 vCPUs in total) vs. 40 (!) i3.4xlarge Cassandra machines (640 vCPUs in total — almost 2.5x Scylla’s resources).

Key findings:

  • Cassandra 4.0 has better P99 latency than Cassandra 3.11 by 100x!
  • Cassandra 4.0 speeds up admin operations by up to 34% compared to Cassandra 3.11
  • Scylla has 2x-5x better throughput than Cassandra 4.0 on the same 3-node cluster
  • Scylla has 3x-8x better throughput than Cassandra 4.0 on the same 3-node cluster while P99 <10ms
  • Scylla adds a node 3x faster than Cassandra 4.0
  • Scylla replaces a node 4x faster than Cassandra 4.0
  • Scylla doubles a 3-node cluster capacity 2.5x faster than Cassandra 4.0
  • A 40 TB cluster is 2.5x cheaper with Scylla while providing 42% more throughput under P99 latency of 10 ms
  • Scylla adds 25% capacity to a 40 TB optimized cluster 11x faster than Cassandra 4.0.
  • Scylla finishes compaction 32x faster than Cassandra 4.0
  • Cassandra 4.0 can achieve a better latency with 40 i3.4xlarge nodes than 4 i3.metal Scylla nodes when the throughput is low and the cluster is being underutilized. Explanation below.

A peek into the results: the 90- and 99-percentile latencies of UPDATE queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Both Cassandras quickly become functionally nonoperational, serving requests with tail latencies that exceed 1 second.

A peek into the results: the 99-percentile (P99) latencies in different scenarios, as measured on 3 x i3.4xlarge machines (48 vCPUs in total) under load that puts Cassandra 4.0 at halfway to saturation. Scylla excels at response times: Cassandra 4.0 P99 latencies are anywhere between 80% to 2,200% greater than Scylla 4.4.

A peek into the results: the maximum throughput (measured in operations per second) achieved on 3 x i3.4xlarge machines (48 vCPUs). Scylla leads the pack, processing from 2x to 5x more requests than either of the Cassandras.

A peek into the results: the time taken by replacing a 1 TB node, measured under Size-Tiered Compaction Strategy (STCS) and Leveled Compaction Strategy (LCS). By default (STCS) Scylla is almost 4x faster than Cassandra 4.0.

A peek into the results: latencies of SELECT query, as measured on 40 TB cluster on uneven hardware — 4 nodes (288 vCPUs) for Scylla and 40 nodes (640 vCPUs) for Cassandra.

Limitations of Our Testing

It’s important to note that this basic performance analysis does not cover all factors in deciding whether to stay put on Cassandra 3.x, upgrade to Cassandra 4.0, or to migrate to Scylla Open Source 4.4. Users may be wondering if the new features of Cassandra 4.0 are compelling enough, or how changes between implemented features compare between Cassandra and Scylla. For instance, you can read more about the difference in CDC implementations here, and how Scylla’s Lightweight Transactions (LWT) differ from Cassandra’s here. Apart from comparison of basic administrative tasks like adding one or more nodes which is covered below, benchmarking implementation of specific features is beyond the scope of consideration.

Plus there are issues of risk aversion based on stability and maturity for any new software release — for example, the ZGC garbage collector we used currently employs Java 16, which is supported by Cassandra, but not considered production-ready; newer JVMs are not officially supported by Cassandra yet.

Cluster of Three i3.4xlarge Nodes

3-Node Test Setup

The purpose of this test was to compare the performance of Scylla vs. both versions of Cassandra on the exact same hardware. We wanted to use relatively typical current generation servers on AWS so that others could replicate our tests, and reflect a real-world setup.

Cassandra/Scylla nodes: 3x i3.4xlarge EC2 instances, 16 vCPUs each (48 total), 122 GiB RAM each (366 GiB total), 2x 1.9 TB NVMe in RAID0 per node (3.8 TB per node), network up to 10 Gbps
Loaders: 3x c5n.9xlarge EC2 instances, 36 vCPUs each (108 total), 96 GiB RAM each (288 GiB total), storage not important for a loader (EBS-only), 50 Gbps network

We set up our cluster on Amazon EC2, in a single Availability Zone within us-east-2. Database cluster servers were initialized with clean machine images (AMIs), running CentOS 7.9 with Scylla Open Source 4.4 and Ubuntu 20.04 with Cassandra 4.0 or Cassandra 3.11 (which we’ll refer to as “C*4” and “C*3”, respectively).

Apart from the cluster, three loader machines were employed to run cassandra-stress in order to insert data and, later, to provide background load during the administrative operations.

Once up and running, the databases were loaded by cassandra-stress with random data organized into the default schema at RF=3. The loading continues until the cluster’s total disk usage reaches approx. 3 TB (or 1 TB per node). The exact disk occupancy would, of course, depend on running compactions and the size of other related files (commitlogs, etc.). Based on the size of the payload, this translated to ~3.43 billion partitions. Then we flushed the data and waited until the compactions finished, so that we could start the actual benchmarking.

Throughput and Latencies

The actual benchmarking is a series of simple invocations of cassandra-stress with CL=QUORUM. For 30 minutes we keep firing 10,000 requests per second and monitor the latencies. Then we increase the request rate by another 10,000 for another 30 minutes, and so on (in 20,000 increments at higher throughputs). The procedure repeats until the DB is no longer capable of withstanding the traffic, i.e. until cassandra-stress cannot achieve the desired throughput or until the 90-percentile latencies exceed 1 second.

Note: This approach means that throughput numbers are presented with 10k/s granularity (in some cases 20k/s).

We have tested our databases with the following distributions of data:

  1. “Real-life” (Gaussian) distribution, with sensible cache-hit ratios of 30-60%
  2. Uniform distribution, with a close-to-zero cache hit ratio
  3. “In-memory” distribution, expected to yield almost 100% cache hits

Within these scenarios we ran the following workloads:

  • 100% writes
  • 100% reads
  • 50% writes and 50% reads

“Real-life” (Gaussian) Distribution

In this scenario we issue queries that touch partitions randomly drawn from a narrow Gaussian distribution. We make an Ansatz about the bell curve: we assume that its six-sigma spans the RAM of the cluster (corrected for the replication factor). The purpose of this experiment is to model a realistic workload, with a substantial cache hit ratio but less than 100%, because most of our users observe the figures of 60-90%. We can expect Cassandra to perform well in this scenario because its key cache is denser than Scylla’s, i.e. it efficiently stores data in RAM, though it relies on SSTables stored in the OS page cache which can be heavyweight to look up. By comparison, Scylla uses a row-based cache mechanism. This Gaussian distribution test should indicate which uses the more efficient caching mechanism for reads.

Mixed Workload – 50% reads and 50% writes

The 90- and 99-percentile latencies of UPDATE queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload consists of 50% reads and 50% writes, randomly targeting a “realistic” Gaussian distribution. C*3 quickly becomes nonoperational, C*4 is a little better, meanwhile Scylla maintains low and consistent write latencies in the entire range.

Metric Scylla 4.4.3 Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs. Cassandra 3.11 Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput 80k/s 40k/s 30k/s 1.33x 2x
Maximum throughput with 90% latency <10ms 80k/s 30k/s 10k/s 3x 2.66x
Maximum throughput with 99% latency <10ms 80k/s 30k/s 2.66x

The 90- and 99-percentile latencies of SELECT queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload consists of 50% reads and 50% writes, randomly targeting a “realistic” Gaussian distribution. C*3 quickly becomes nonoperational, C*4 is a little better; meanwhile Scylla maintains low and consistent response times in the entire range.

Metric Scylla 4.4.3 Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs. Cassandra 3.11 Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput 90k/s 40k/s 40k/s 1x 2.25x
Maximum throughput with 90% latency <10ms 80k/s 30k/s 10k/s 3x 2.66x
Maximum throughput with 99% latency <10ms 70k/s 10k/s 7x

Uniform Distribution (disk-intensive, low cache hit ratio)

In this scenario we issue queries that touch random partitions of the entire dataset. In our setup this should result in high disk traffic and/or negligible cache hit rates, i.e. that of a few %.

Writes Workload – Only Writes

The 90- and 99-percentile latencies of UPDATE queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed, i.e. every partition in the 1 TB dataset has an equal chance of being updated. C*3 quickly becomes nonoperational, C*4 is a little better; meanwhile Scylla maintains low and consistent write latencies up until 170,000-180,000 ops/s.

Metric Scylla 4.4.3 Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs. Cassandra 3.11 Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput 180k/s 50k/s 40k/s 1.25x 3.6x
Maximum throughput with 90% latency <10ms 180k/s 40k/s 20k/s 2x 3.5x
Maximum throughput with 99% latency <10ms 170k/s 30k/s 5.66x

Reads Workload – Only Reads

The 90- and 99-percentile latencies of SELECT queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed, i.e. every partition in the 1 TB dataset has an equal chance of being selected. Scylla serves 90% of queries in under 5 ms until the load reaches 70,000 ops/s. Please note that almost all reads are served from disk.

Metric Scylla 4.4.3 Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs. Cassandra 3.11 Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput 80k/s 40k/s 30k/s 1.25x 2x
Maximum throughput with 90% latency <10ms 70k/s 40k/s 30k/s 1.25x 1.75x
Maximum throughput with 99% latency <10ms 60k/s 20k/s 3x

Mixed Workload – 50% reads and 50% writes

The 90- and 99-percentile latencies of UPDATE queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed, i.e. every partition in the 1 TB dataset has an equal chance of being selected/updated. At 80,000 ops/s Scylla maintains the latencies of 99% of queries in a single-figure regime (in milliseconds).

Metric Scylla 4.4.3 Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs. Cassandra 3.11 Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput 90k/s 40k/s 40k/s 1x 2.25x
Maximum throughput with 90% latency <10ms 80k/s 40k/s 20k/s 2x 2x
Maximum throughput with 99% latency <10ms 80k/s 30k/s 2.66x

The 90- and 99-percentile latencies of SELECT queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed, i.e. every partition in the 1 TB dataset has an equal chance of being selected/updated. Under such conditions Scylla can handle over 2x more traffic and offers highly predictable response times.

Metric Scylla 4.4.3 Cassandra 4.0 Cassandra 3.11 Cassandra 4.0 vs. Cassandra 3.11 Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput 90k/s 40k/s 40k/s 1x 2.25x
Maximum throughput with 90% latency <10ms 80k/s 30k/s 20k/s 1.5x 2.66x
Maximum throughput with 99% latency <10ms 60k/s 20k/s 3x

Uniform Distribution (memory-intensive, high cache hit ratio)

In this scenario we issue queries touching random partitions from a small subset of the dataset, specifically: one that fits into RAM. To be sure that our subset resides in cache and thus no disk IO is triggered, we choose it to be safely small, at an arbitrarily picked value of 60 GB. The goal here is to evaluate both databases at the other extreme end: where they both serve as pure in-memory datastores.

Writes Workload – Only Writes

The 90- and 99-percentile latencies of UPDATE queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed over 60 GB of data, so that every partition resides in cache and has an equal chance of being updated. Cassandras instantly become nonoperational; Scylla withstands over 5x higher load and maintains low and consistent write latencies over the entire range.

Metric | Scylla 4.4.3 | Cassandra 4.0 | Cassandra 3.11 | Cassandra 4.0 vs. Cassandra 3.11 | Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput | 200k/s | 40k/s | 40k/s | 1x | 5x
Maximum throughput with 90% latency <10ms | 200k/s | 40k/s | 20k/s | 2x | 5x
Maximum throughput with 99% latency <10ms | 200k/s | 40k/s | n/a | n/a | 5x

Reads Workload – Only Reads

The 90- and 99-percentile latencies of SELECT queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed over 60 GB of data, so that every partition resides in cache and has an equal chance of being selected. Scylla withstands over 3x higher load than C*4 and 4x greater than C*3.

Metric | Scylla 4.4.3 | Cassandra 4.0 | Cassandra 3.11 | Cassandra 4.0 vs. Cassandra 3.11 | Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput | 300k/s | 80k/s | 60k/s | 1.33x | 3.75x
Maximum throughput with 90% latency <10ms | 260k/s | 60k/s | 40k/s | 1.5x | 4.33x
Maximum throughput with 99% latency <10ms | 240k/s | 40k/s | n/a | n/a | 6x

Mixed Workload – 50% reads and 50% writes

The 90- and 99-percentile latencies of UPDATE queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed over 60 GB of data, so that every partition resides in cache and has an equal chance of being selected/updated. Scylla withstands over 3x higher load than any of the Cassandras.

Metric | Scylla 4.4.3 | Cassandra 4.0 | Cassandra 3.11 | Cassandra 4.0 vs. Cassandra 3.11 | Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput | 180k/s | 40k/s | 40k/s | 1x | 4.5x
Maximum throughput with 90% latency <10ms | 160k/s | 40k/s | 20k/s | 2x | 4x
Maximum throughput with 99% latency <10ms | 160k/s | 40k/s | n/a | n/a | 4x

The 90- and 99-percentile latencies of SELECT queries, as measured on three i3.4xlarge machines (48 vCPUs in total) in a range of load rates. Workload is uniformly distributed over 60 GB of data, so that every partition resides in cache and has an equal chance of being selected/updated. Scylla withstands over 3x higher load than any of the Cassandras.

Metric | Scylla 4.4.3 | Cassandra 4.0 | Cassandra 3.11 | Cassandra 4.0 vs. Cassandra 3.11 | Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput | 180k/s | 40k/s | 40k/s | 1x | 4.5x
Maximum throughput with 90% latency <10ms | 160k/s | 40k/s | 20k/s | 2x | 4x
Maximum throughput with 99% latency <10ms | 160k/s | 20k/s | n/a | n/a | 8x

Adding Nodes

The timeline of adding 3 nodes to an already existing 3-node cluster (ending up with six i3.4xlarge machines). Total time for Scylla 4.4 to double the cluster size was 94 minutes 57 seconds. For Cassandra 4.0, it took 238 minutes 21 seconds (just shy of 4 hours); Cassandra 3.11 took 270 minutes (4.5 hours). While Cassandra 4.0 noted a 12% improvement over Cassandra 3.11, Scylla completes the entire operation even before either version of Cassandra bootstraps its first new node.

One New Node

In this benchmark, we measured how long it takes to add a new node to the cluster. The reported times are the intervals between starting a Scylla/Cassandra node and having it fully finished bootstrapping (CQL port open).

Cassandra 4.0 is equipped with a new feature — Zero Copy Streaming — which allows for efficient streaming of entire SSTables. An SSTable is eligible for ZCS if all of its partitions need to be transferred, which can be the case when LeveledCompactionStrategy (LCS) is enabled. To demonstrate this feature, we ran the next benchmarks with the usual SizeTieredCompactionStrategy (STCS) as well as with LCS.

The time needed to add a node to an already existing 3-node cluster (ending up with 4 i3.4xlarge machines). Cluster is initially loaded with 1 TB of data at RF=3. Cassandra 4.0 showed an improvement over Cassandra 3.11, but Scylla still wins by a huge margin.

Strategy | Scylla 4.4.3 | Cassandra 4.0 | Cassandra 3.11
STCS | 36 minutes 56 seconds | 1 hour 47 minutes 1 second | 2 hours 6 minutes
LCS | 44 minutes 11 seconds | 1 hour 39 minutes 45 seconds | 2 hours 23 minutes 10 seconds

Strategy | Cassandra 4.0 vs Cassandra 3.11 | Scylla 4.4.3 vs Cassandra 4.0
STCS | -15% | -65%
LCS | -30% | -55%

Doubling Cluster Size

In this benchmark, we measured how long it takes to double the cluster node count, going from 3 nodes to 6 nodes. Three new nodes are added sequentially, i.e. waiting for the previous one to fully bootstrap before starting the next one. The reported time spans from the instant the startup of the first new node is initiated, all the way until the bootstrap of the third new node finishes.

The time needed to add 3 nodes to an already existing 3-node cluster of i3.4xlarge machines, preloaded with 1 TB of data at RF=3. Cassandra 4.0 performed moderately better than Cassandra 3.11, but Scylla still leads the pack.

Strategy | Scylla 4.4.3 | Cassandra 4.0 | Cassandra 3.11
STCS | 1 hour 34 minutes 57 seconds | 3 hours 58 minutes 21 seconds | 4 hours 30 minutes 7 seconds
LCS | 2 hours 2 minutes 37 seconds | 3 hours 44 minutes 6 seconds | 4 hours 44 minutes 46 seconds

Strategy | Cassandra 4.0 vs Cassandra 3.11 | Scylla 4.4.3 vs Cassandra 4.0
STCS | -11% | -60%
LCS | -21% | -45%

Replace node

In this benchmark, we measured how long it took to replace a single node. One of the nodes is brought down and another one is started in its place. Throughout this process the cluster is being agitated by a mixed R/W background load of 25,000 ops at CL=QUORUM.

The time needed to replace a node in a 3-node cluster of i3.4xlarge machines, preloaded with 1 TB of data at RF=3. Cassandra 4.0 noted an improvement over Cassandra 3.11, but Scylla is still the clear winner, taking around an hour to do what took Cassandra 4.0 over 3 hours to accomplish.

Strategy | Scylla 4.4.3 | Cassandra 4.0 | Cassandra 3.11
STCS | 54 minutes 19 seconds | 3 hours 28 minutes 46 seconds | 4 hours 35 minutes 56 seconds
LCS | 1 hour 9 minutes 18 seconds | 3 hours 19 minutes 17 seconds | 5 hours 4 minutes 9 seconds

Strategy | Cassandra 4.0 vs Cassandra 3.11 | Scylla 4.4.3 vs Cassandra 4.0
STCS | -24% | -73%
LCS | -34% | -65%

Major Compaction

In this benchmark, we measured how long it takes to perform a major compaction on a single node loaded with roughly 1 TB of data. Thanks to its sharded architecture, Scylla can perform the major compaction on each shard concurrently, while Cassandra is single-thread bound. The result of major compaction is the same in both Scylla and Cassandra: a read is served by a single SSTable. In a later section of this blog post we also measure the speed of a major compaction in the case where there are many small Cassandra nodes (which get higher parallelism). We observed worse major compaction performance in Cassandra 4.0.0 with the default num_tokens: 16 parameter.

Major compaction of 1 TB of data at RF=1 on i3.4xlarge machine. Scylla demonstrates the power of sharded architecture by compacting on all cores concurrently. In our case Scylla is up to 60x faster and this figure should continue to scale linearly with the number of cores.

Scylla 4.4.3 | Cassandra 4.0 | Cassandra 3.11
Major Compaction (num_tokens: 16) | num_tokens: 16 not recommended | 21 hours, 47 minutes, 34 seconds (78,454 seconds) | 24 hours, 50 minutes, 42 seconds (89,442 seconds)
Major Compaction (num_tokens: 256) | 36 minutes, 8 seconds (2,168 seconds) | 37 hours, 56 minutes, 32 seconds (136,592 seconds) | 23 hours, 48 minutes, 56 seconds (85,736 seconds)

“4 vs. 40” Benchmark

Now let us compare both databases installed on different hardware: Scylla gets four powerful 72-core servers, while Cassandra gets the same i3.4xlarge servers as before, just… forty of them. Why would anyone ever consider such a test? After all, we’re comparing some 4 machines to 40 very different machines. In terms of CPU count, RAM volume or cluster topology, the two setups are like apples and oranges, no?

Not really.

Due to its sharded architecture and custom memory management Scylla can utilize really big hunks of hardware. And by that we mean the-biggest-one-can-get. Meanwhile, Cassandra and its JVM’s garbage collectors excel when they go heavily distributed, with many smaller nodes on the team. So, the true purpose of this test is to show that both CQL solutions can perform similarly in a pretty fair duel, yet Cassandra requires about 2.5x more hardware, for 2.5x the cost. What’s really at stake now is a 10x reduction in the administrative burden: your DBA has either 40 servers to maintain… or just 4. And, as you’ll see, the advantage can go even further than 10x.

4 vs. 40 Node Setup

We set up clusters on Amazon EC2 in a single Availability Zone of the us-east-2 region, but this time the Scylla cluster consisted of 4 i3.metal VMs, while the competing Cassandra cluster consisted of 40 i3.4xlarge VMs. Servers were initialized with clean machine images (AMIs) of Ubuntu 20.04 (Cassandra 4.0) or CentOS 7.9 (Scylla 4.4).

Apart from the cluster, fifteen loader machines were used to run cassandra-stress to insert data, and — later — to provide background load at CL=QUORUM to mess with the administrative operations.

Scylla | Cassandra | Loaders
EC2 Instance type | i3.metal | i3.4xlarge | c5n.9xlarge
Cluster size | 4 | 40 | 15
Storage (total) | 8x 1.9 TB NVMe in RAID0 (60.8 TB) | 2x 1.9 TB NVMe in RAID0 (152 TB) | Not important for a loader (EBS-only)
Network | 25 Gbps | Up to 10 Gbps | 50 Gbps
vCPUs (total) | 72 (288) | 16 (640) | 36 (540)
RAM (total) | 512 (2048) GiB | 122 (4880) GiB | 96 (1440) GiB

Once up and running, both databases were loaded with random data at RF=3 until the cluster’s total disk usage reached approximately 40 TB. This translated to 1 TB of data per Cassandra node and 10 TB of data per Scylla node. After loading was done, we flushed the data and waited until the compactions finished, so we can start the actual benchmarking.

A Scylla cluster can be 10x smaller in node count and run on a cluster 2.5x cheaper, yet maintain the equivalent performance of Cassandra 4.

Throughput and Latencies

Mixed Workload – 50% reads and 50% writes

The 90- and 99-percentile latencies of UPDATE queries, as measured on:

  • 4-node Scylla cluster (4 x i3.metal, 288 vCPUs in total)
  • 40-node Cassandra cluster (40 x i3.4xlarge, 640 vCPUs in total).

Workload is uniformly distributed, i.e. every partition in the multi-TB dataset has an equal chance of being selected/updated. Under low load Cassandra slightly outperforms Scylla. The reason is that Scylla automatically runs more compaction when it is idle, and the default scheduler tick of 0.5 ms hurts the P99 latency. There is a parameter that controls this, but we wanted to provide out-of-the-box results with zero custom tuning or configuration.

Metric | Scylla 4.4.3 | Cassandra 4.0 | Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput | 600k/s | 600k/s | 1x
Maximum throughput with 99% latency <10ms | 600k/s | 450k/s | 1.33x

The 90- and 99-percentile latencies of SELECT queries, as measured on:

  • 4-node Scylla cluster (4 x i3.metal, 288 vCPUs in total)
  • 40-node Cassandra cluster (40 x i3.4xlarge, 640 vCPUs in total).

Workload is uniformly distributed, i.e. every partition in the multi-TB dataset has an equal chance of being selected/updated. Under low load Cassandra slightly outperforms Scylla.

Metric | Scylla 4.4.3 | Cassandra 4.0 | Scylla 4.4.3 vs. Cassandra 4.0
Maximum throughput | 600k/s | 600k/s | 1x
Maximum throughput with 99% latency <10ms | 500k/s | 350k/s | 1.42x

Scaling the cluster up by 25%

In this benchmark, we increase the capacity of the cluster by 25%:

  • By adding a single Scylla node to the cluster (from 4 nodes to 5)
  • By adding 10 Cassandra nodes to the cluster (from 40 nodes to 50 nodes)

Scylla 4.4.3 | Cassandra 4.0 | Scylla 4.4 vs. Cassandra 4.0
Add 25% capacity | 1 hour, 29 minutes | 16 hours, 54 minutes | 11x faster

Major Compaction

In this benchmark we measure the throughput of a major compaction. To compensate for Cassandra having 10 times more nodes (each having 1/10th of the data), this benchmark measures throughput of a single Scylla node performing major compaction and the collective throughput of 10 Cassandra nodes performing major compactions concurrently.

Throughput of a major compaction at RF=1 (more is better). Scylla runs on a single i3.metal machine (72 vCPUs) and competes with a 10-node cluster of Cassandra 4 (10x i3.4xlarge machines; 160 vCPUs in total). Scylla can split this problem across CPU cores, which Cassandra cannot do, so – effectively – Scylla performs 32x better in this case.

Scylla 4.4.3 | Cassandra 4.0 | Scylla 4.4 vs. Cassandra 4.0
Major Compaction | 1,868 MB/s | 56 MB/s | 32x faster

Summary

On identical hardware, Scylla Open Source 4.4.3 withstood up to 5x greater traffic and in almost every tested scenario offered lower latencies than Cassandra 4.0.

We also demonstrated a specific use-case where choosing Scylla over Cassandra 4 would result in $170,000 annual savings in the hardware costs alone, not to mention the ease of administration or environmental impact.

Nonetheless, Cassandra 4 is undeniably far better than Cassandra 3.11. It improved query latencies in almost all tested scenarios and sped up all the processes that involve streaming. Even if you choose not to take advantage of Scylla for its superior performance, upgrading from Cassandra 3.11 to Cassandra 4.0 is a wise idea.

Yet if you are determined to make the effort of an upgrade, why not aim higher and get even more performance? Or at least keep the same performance and pocket the difference as savings?

While the benchmarks speak for themselves, we also hope that you don’t just take our word for Scylla’s superior performance. That’s why we provided everything that’s needed to re-run them yourself.

Beyond performance benchmarks, there are even more reasons to run Scylla: the feature set is bigger. For example, our CDC implementation is easier to manage and consume, implemented as standard CQL-readable tables. Also, Scylla’s Lightweight Transactions (LWT) are more efficient than Cassandra’s. Scylla provides observability through Scylla Monitoring Stack to watch over your clusters using Grafana and Prometheus. All of that you get with Scylla Open Source. With Scylla Enterprise on top of it, you also get unique features like our Incremental Compaction Strategy (ICS) for additional storage efficiency, workload prioritization and more.

Whether you’re a CTO, systems architect, lead engineer, SRE or DBA — your time to consider Scylla is right now and your organization is unlikely to regret it.

Supplementary Information

Here you can check out detailed results of latency/throughput benchmarks, JVM settings and cassandra.yaml from Cassandra 3 and Cassandra 4, as well as cassandra-stress invocations used to run benchmarks. Scylla used default configuration.

Cassandra 3.11 configuration

JVM settings (JVM version: OpenJDK 8):
-Xms48G
-Xmx48G
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70
-XX:ParallelGCThreads=16
cassandra.yaml (only settings changed from the default configuration are mentioned here):

disk_access_mode: mmap_index_only
row_cache_size_in_mb: 10240
concurrent_writes: 128
file_cache_size_in_mb: 2048
buffer_pool_use_heap_if_exhausted: true
disk_optimization_strategy: ssd
memtable_flush_writers: 4
trickle_fsync: true
concurrent_compactors: 16
compaction_throughput_mb_per_sec: 960
stream_throughput_outbound_megabits_per_sec: 7000

Cassandra 4.0 configuration

JVM settings (JVM version: OpenJDK 16):

-Xms70G
-Xmx70G
-XX:ConcGCThreads=16
-XX:+UseZGC

-XX:ParallelGCThreads=16
-XX:+UseTransparentHugePages
-verbose:gc
-Djdk.attach.allowAttachSelf=true
-Dio.netty.tryReflectionSetAccessible=true

cassandra.yaml (only settings changed from the default configuration are mentioned here):

disk_access_mode: mmap_index_only
row_cache_size_in_mb: 10240
concurrent_writes: 128
file_cache_size_in_mb: 2048
buffer_pool_use_heap_if_exhausted: true
disk_optimization_strategy: ssd
memtable_flush_writers: 4
trickle_fsync: true
concurrent_compactors: 16
compaction_throughput_mb_per_sec: 960
stream_throughput_outbound_megabits_per_sec: 7000

In major compaction benchmarks, the parameter compaction_throughput_mb_per_sec was set to 0 to make sure the compaction was not throttled.

Cassandra-stress parameters

Only the important facts and options are mentioned below.

  • Scylla’s Shard-aware Java driver was used.
  • Background loads were executed in a loop (so duration=5m is not a problem).
  • REPLICATION_FACTOR is 3 (except for major compaction benchmark).
  • COMPACTION_STRATEGY is SizeTieredCompactionStrategy unless stated otherwise.
  • loadgenerator_count is the number of generator machines (3 for “3 vs 3” benchmarks, 15 for “4 vs 40”).
  • BACKGROUND_LOAD_OPS is 1000 in major compaction, 25000 in other benchmarks.
  • DURATION_MINUTES is 10 for in-memory benchmarks, 30 for other benchmarks.
Inserting data:
write cl=QUORUM
-schema "replication(strategy=SimpleStrategy,replication_factor={REPLICATION_FACTOR})" "compaction(strategy={COMPACTION_STRATEGY})"
-mode native cql3 

The threads and throttle parameters were chosen for each DB separately, to ensure 3 TB were inserted quickly, yet also to provide headroom for minor compactions and avoid timeouts/large latencies. In the case of the “4 vs 40” benchmarks the additional parameter maxPending=1024 was used.

Background load for replace node:
mixed ratio(write=1,read=1)
duration=5m
cl=QUORUM
-pop dist=UNIFORM(1..{ROW_COUNT})
-mode native cql3
-rate "threads=700 throttle={BACKGROUND_LOAD_OPS // loadgenerator_count}/s"
Background load for new nodes / major compaction:
mixed ratio(write=1,read=1)
duration=5m
cl=QUORUM
-pop dist=UNIFORM(1..{ROW_COUNT})
-mode native cql3
-rate "threads=700 fixed={BACKGROUND_LOAD_OPS // loadgenerator_count}/s"
Cache warmup in Gaussian latency / throughput:
mixed ratio(write=0,read=1)
duration=180m
cl=QUORUM -pop dist=GAUSSIAN(1..{ROW_COUNT},{GAUSS_CENTER},{GAUSS_SIGMA})
-mode native cql3
-rate "threads=500 throttle=35000/s"
-node {cluster_string}
Latency / throughput – Gaussian:
duration={DURATION_MINUTES}m
cl=QUORUM
-pop dist=GAUSSIAN(1..{ROW_COUNT},{GAUSS_CENTER},{GAUSS_SIGMA})
-mode native cql3
"threads=500 fixed={rate // loadgenerator_count}/s"
Latency / throughput – uniform / in-memory:
duration={DURATION_MINUTES}m
cl=QUORUM
-pop dist=UNIFORM(1..{ROW_COUNT})
-mode native cql3
-rate "threads=500 fixed={rate // loadgenerator_count}/s" 

In case of “4 vs 40” benchmarks additional parameter maxPending=1024 was used.

 

The post Apache Cassandra 4.0 vs. Scylla 4.4: Comparing Performance appeared first on ScyllaDB.

P99 CONF Agenda Now Online


P99 CONF is the conference for low-latency high-performance distributed systems. An event by engineers for engineers, P99 CONF brings together speakers from across the tech landscape spanning all aspects, from architecture and design, to the latest techniques in operating systems and development languages, to databases and streaming architectures, to real-time operations and observability.

P99 CONF is a free online virtual event scheduled for Wednesday and Thursday, October 6th and 7th, from 8:25 AM to 1:00 PM Pacific Daylight Time (PDT).

We recently published the full agenda, and wanted to share the highlights with you. You will find sessions with speakers from companies like Twitter, Netflix, WarnerMedia, Mozilla, Confluent, Red Hat, VMWare, Splunk, Datadog, Dynatrace, ScyllaDB, Couchbase, RedisLabs and more.

REGISTER NOW FOR P99 CONF

Keynote Speakers

  • Brian Martin, Twitter Site Reliability Engineer, will talk about reimplementing Pelikan, the unified cache project, in Whoops! I Wrote it in Rust!
  • Avi Kivity, ScyllaDB CTO, will speak on Keeping Latency Low and Throughput High with Application-level Priority Management
  • Simon Ritter, Azul Deputy CTO, presents Thursday on how to Get Lower Latency and Higher Throughput for Java Applications
  • Steve Rostedt, VMware Open Source Engineer, will dig deep into ftrace in his talk on New Ways to Find Latency in Linux Using Tracing

General Sessions

Our general session tracks will run in two virtual stages, with each session running a brisk 20 minutes.

Wednesday, October 6th

  • Peter Zaitsev, Percona CEO, will describe USE, RED and Golden Signals methods in his talk Performance Analysis and Troubleshooting Methodologies for Databases.
  • Abel Gordon, Lightbit Labs Chief System Architect, will show how his team is Vanquishing Latency Outliers in the Lightbits LightOS Software Defined Storage System.
  • Karthik Ramaswamy, Splunk Senior Director of Engineering, will show how data, including logs and metrics, can be processed at scale and speed in his talk Scaling Apache Pulsar to 10 Petabytes/Day.
  • Denis Rystsov, Vectorized Staff Engineer, will discuss how you can tune for performance without discarding distributed transaction guarantees in his session Is it Faster to Go With Redpanda Transactions than Without Them?
  • Heinrich Hartmann, Zalando Site Reliability Engineer, will go into the deeper maths and technical considerations in his level-setting session, How to Measure Latency.
  • Thomas Dullien, Optimyze.cloud CEO, will uncover all the hidden places where you can recover your wasted CPU resources in his session Where Did All These Cycles Go?
  • Daniel Bristot de Oliveira, Red Hat Principal Software Engineer, will look into operating system noise in his talk OSNoise Tracer: Who is Stealing My CPU Time?
  • Marc Richards, Talawah Solutions, will show how he was able to optimize utilization in his talk Extreme HTTP Performance Tuning: 1.2M API req/s on a 4 vCPU EC2 Instance.
  • Sam Just, Red Hat Principal Software Engineer, will describe the underlying architecture of their next-generation distributed filesystem to take advantage of emerging storage technologies in his talk Seastore: Next Generation Backing Store for Ceph.
  • Orit Wasserman, OpenShift Data Foundation Architect, will talk about implementing Seastar, a highly asynchronous engine as a new foundation for the Ceph distributed storage system in her talk Crimson: Ceph for the Age of NVMe and Persistent Memory
  • Pavel Emelyanov, ScyllaDB Developer, will show how, with the latest generation of high-performance options, storage is no longer the same bottleneck it once was in his session What We Need to Unlearn about Persistent Storage
  • Glauber Costa, Datadog Staff Software Engineer, will address the performance concern on the top of developer minds in his session Rust Is Safe. But Is It Fast?
  • Jim Blandy, Mozilla Staff Software Engineer, will continue the dialogue on using Rust for I/O-bound tasks in his talk Wait, What is Rust Async Actually Good For?
  • Bryan Cantrill, Oxide Computer Company CTO, will look ahead to the coming decade of development in his session on Rust, Wright’s Law, and the Future of Low-Latency Systems
  • Felix Geisendörfer, Datadog Staff Engineer, will uncover the unique aspects of the Go runtime and interoperability with tools like Linux perf and bpftrace in his session Continuous Go Profiling & Observability

Thursday, October 7th

  • Tanel Poder will introduce a new eBPF script for Continuous Monitoring of IO Latency Reasons and Outliers
  • Bryan McCoid, Couchbase Senior Software Engineer, will speak on the latest tools in the Linux kernel in his talk High-Performance Networking Using eBPF, XDP, and io_uring
  • Yarden Shafir, CrowdStrike Software Engineer, will present on I/O Rings and You — Optimizing I/O on Windows
  • Shlomi Livne, ScyllaDB VP of Research & Development, will engage audiences on How to Meet your P99 Goal While Overcommitting Another Workload
  • Miles Ward, SADA CTO, will share his insights on building and maintaining Multi-cloud State for Kubernetes (K8s): Anthos and ScyllaDB
  • Andreas Grabner, Dynatrace DevOps Activist, will also be talking about Kubernetes in his session on Using SLOs for Continuous Performance Optimizations of Your k8s Workloads
  • Tejas Chopra, Netflix Senior Software Engineer, will describe how Netflix reaches petabyte scale in his talk on Object Compaction in Cloud for High Yield
  • Felipe Oliveira, RedisLabs Performance Engineer, will speak on Data Structures for High Resolution and Real Time Telemetry at Scale
  • Gunnar Morling, Red Hat Principal Software Engineer, will show off the capabilities of the Java JDK Flight Recorder in his session Continuous Performance Regression Testing with JfrUnit
  • Pere Urbón-Bayes, Confluent Senior Solution Architect, will show audiences how to measure, evaluate and optimize performance in his talk on Understanding Apache Kafka P99 Latency at Scale
  • Waldek Kozaczuk, WarnerMedia Principal Architect, will talk about building components for the video supply chain of CNN in his talk OSv Unikernel — Optimizing Guest OS to Run Stateless and Serverless Apps in the Cloud
  • Felipe Huici, NEC Laboratories Europe, will showcase the utility and design of unikernels in his talk Unikraft: Fast, Specialized Unikernels the Easy Way
  • Roman Shaposhnik, VP Product and Strategy, and Kathy Giori, Ecosystem Engagement Lead, both of Zededa will speak on RISC-V on Edge: Porting EVE and Alpine Linux to RISC-V
  • Konstantine Osipov, ScyllaDB Director of Engineering, will address the tradeoffs between hash or range-based sharding in his talk on Avoiding Data Hotspots at Scale

Register Now

While P99 CONF is totally free, open and online, make sure you register now to keep the dates booked on your calendar. We’re just six weeks out from the event and are looking forward to seeing you all there!

REGISTER FOR P99 CONF NOW!

The post P99 CONF Agenda Now Online appeared first on ScyllaDB.

Project Circe August Update


Project Circe is ScyllaDB’s year-long initiative to improve Scylla consistency and performance. Today we’re sharing our updates for the month of August 2021.

Toward Scylla “Safe Mode”

Scylla is a very powerful tool, with many features and options we’ve added over the years, some of which we modeled after Apache Cassandra and DynamoDB, and others that are unique to Scylla. In many cases, these options, or a combination of them, are not recommended to run in production. We don’t want to disable or remove them, as they are already in use, but we also want to move users away from them. That’s why we’re introducing Scylla Safe Mode.

Safe Mode is a collection of restrictions that make it harder for the user to use non-recommended options in production.

Some examples of Safe Mode that we added to Scylla in the last month:

More Safe Mode restrictions are planned.

Performance Tests

We are constantly improving Scylla performance, but other projects are, of course, doing the same. So it’s interesting to run updated benchmarks comparing performance and other attributes. In a recent 2-part blog series we compared the performance of Scylla with the latest release of its predecessor, Apache Cassandra.

Raft

We continue our initiative to combine strong consistency and high performance with Raft. Some of the latest updates:

  • Raft now has its own experimental flag in the configuration file: “experimental: raft”
  • The latest Raft pull request (PR) adds a Group 0 sub-service, which includes all the members of the clusters, and allows other services to update topology changes in a consistent, linearizable way.
  • This service brings us one step closer to strong consistent topology changes in Scylla.
  • Followup services will have consistent schema changes, and later data changes (transactions).

20% Projects

You might think that working on a distributed database in cutting edge C++ is already a dream job for most developers, but the Scylla dev team allocates 20% of their time to personal projects.

One such cool project is Piotr Sarna’s PR adding WebAssembly support to user-defined functions (UDF). While still in very early stages, this has already initiated an interesting discussion in the comment thread.

User-defined functions have been an experimental feature in Scylla since the 3.3 release. We originally supported Lua functions, and have now extended support to WASM. More languages can be added in the future.

Below is an example taken from the PR, of a CQL command to create a simple WASM fibonacci function:


CREATE FUNCTION ks.fibonacci (str text)
    CALLED ON NULL INPUT
    RETURNS boolean
    LANGUAGE wasm
    AS ' (module
        (func $fibonacci (param $n i32) (result i32)
            (if
                (i32.lt_s (local.get $n) (i32.const 2))
                (return (local.get $n))
            )
            (i32.add
                (call $fibonacci (i32.sub (local.get $n) (i32.const 1)))
                (call $fibonacci (i32.sub (local.get $n) (i32.const 2)))
            )
        )
        (export "fibonacci" (func $fibonacci))
    ) '

More on the great potential of UDFs in Scylla can be found in a talk by Avi.

Some Cool Additions to Git Master

These updates will be merged into upcoming Scylla releases, primarily Scylla Open Source 4.6

  • Repair-based node operations are now enabled by default for the replace node operation. Repair-based node operations use repair instead of streaming to transfer data, making it resumable and more robust (but slower). A new parameter defines which node operations use repair. (Learn more)
  • User-Defined Aggregates (UDA) have been implemented. Note UDA is based on User Defined Function (UDF) which is still an experimental feature
  • If Scylla stalls while reclaiming memory, it will now log memory-related diagnostics so it is easier to understand the root cause.
  • After adding a node, a cleanup process is run to remove data that was copied to the new node. This is a compaction process that compacts only one SSTable at a time. This fact was used to optimize cleanup. In addition, the check for whether a partition should be removed during cleanup was also improved.
  • When Scylla starts up, it checks if all SSTables conform to the compaction strategy rules, and if not, it reshapes the data to make it conformant. This helps keep reads fast. It is now possible to abort the reshape process in order to get Scylla to start more quickly.
  • Scylla uses reader objects to read sequential data. It caches those readers so they can be reused across multiple pages of the result set, eliminating the overhead of starting a new sequential read each time. However, this optimization was missed for internal paging used to implement aggregations (e.g. SUM(column)). Scylla now uses the optimization for aggregates too.
  • The row cache behavior was quadratic in certain cases where many range tombstones were present. This has been fixed.
  • The installer now offers to set up RAID 5 on the data disks in addition to RAID 0; this is useful when the disks can have read errors, such as on GCP local disks.
  • The install script now supports supervisord in addition to systemd. This was brought in from the container image, where systemd is not available, and is useful in some situations where root access is not available.
  • A limitation of 10,000 connections per shard has been lifted to 50,000 connections per shard, and made tunable.
  • The docker image base has been switched from CentOS 7 to Ubuntu 20.04 (like the machine images). CentOS 7 is getting old.
  • The SSTableloader tool now supports Zstd compression.
  • There is a new type of compaction: validation. A validation compaction will read all SSTables and perform some checks, but write nothing. This is useful to make sure all SSTables can be read and pass sanity checks.
  • SSTable index files are now cached, both at the page level and at an object level (index entry). This improves large partition workloads as well as intermediate size workloads where the entire SSTable index can be cached.
  • It was found that the very common single-partition query was treated as an IN query with a 1-element tuple. This caused extra work to be done (to post-process the IN query). We now specialize for this common case and avoid the extra post-processing work.

Monitoring News

Scylla Monitoring Stack continues to move forward fast.

We continue to invest in Scylla Advisor, which takes information from Scylla and OS-level metrics (via Prometheus) and logs (via Loki), combining them using policy rules to advise users on what they should look at in a production system.

For example Scylla Monitoring Stack 3.8 now warns about prepared-statements cache eviction.

Other August Releases

Just Plain Cool

A new Consistency Level Calculator helps you understand the impact of choosing different replication factors and consistency levels with Scylla.

The post Project Circe August Update appeared first on ScyllaDB.


Deploying Scylla Operator with GitOps


There are many approaches to deploying applications on Kubernetes clusters. As there isn’t a clear winner, scylla-operator tries to support a wide range of deployment methods. With Scylla Operator 1.1 we brought support for Helm charts (you can read more about how to use and customize them here), and since Scylla Operator 1.2 we are also publishing the manifests for deploying either manually or with GitOps automation.

What is GitOps?

GitOps is a set of practices to declaratively manage infrastructure and application configurations using Git as the single source of truth. (You can read more about GitOps in this Red Hat article.) A Git repository contains the entire state of the system making any changes visible and auditable.

In our case, the Kubernetes manifests represent the state of the system and are kept in a Git repository. Admins either apply the manifests with kubectl or use tooling that automatically reconciles the manifests from Git, like ArgoCD.

Deploying Scylla Operator from Kubernetes Manifests

In a GitOps flow, you’d copy over the manifests into an appropriate location in your own Git repository, but for the purposes of this demonstration, we’ll assume you have checked out the https://github.com/scylladb/scylla-operator repository at v1.4.0 tag and you are located at its root.

Prerequisites

Local Storage

ScyllaClusters use Kubernetes local persistent volumes exposing local disk as PersistentVolumes (PVs). There are many ways to set it up. All the operator cares about is a storage class and a matching PV being available. A common tool for mapping local disks into PVs is the Local Persistence Volume Static Provisioner. For more information have a look at our documentation. For testing purposes, you can use minikube that has an embedded dynamic storage provisioner.

(We are currently working on a managed setup by the operator, handling all this hassle for you.)

Cert-manager

Currently, the internal webhooks require the cert-manager to be installed. You can deploy it from the official manifests or use the snapshot in our repository:

$ kubectl apply -f ./examples/common/cert-manager.yaml

# Wait for the cert-manager to be ready.
$ kubectl wait --for condition=established crd/certificates.cert-manager.io crd/issuers.cert-manager.io
$ kubectl -n cert-manager rollout status deployment.apps/cert-manager-webhook

Deploying the ScyllaOperator

The deployment manifests are located in the ./deploy/operator and ./deploy/manager folders. For the first deployment, you’ll need to deploy the manifests in sequence as they have interdependencies, like scylla-manager needing a ScyllaCluster or establishing CRDs (propagating to all apiservers). The following instructions will get you up and running:

$ kubectl apply -f ./deploy/operator

# Wait for the operator deployment to be ready.
$ kubectl wait --for condition=established crd/scyllaclusters.scylla.scylladb.com
$ kubectl -n scylla-operator rollout status deployment.apps/scylla-operator

$ kubectl apply -f ./deploy/manager/prod

# Wait for the manager deployment to be ready.

$ kubectl -n scylla-manager rollout status deployment.apps/scylla-manager
$ kubectl -n scylla-manager rollout status deployment.apps/scylla-manager-controller
$ kubectl -n scylla-manager rollout status statefulset.apps/scylla-manager-cluster-manager-dc-manager-rack

Customization

Customization is the beauty of using Git; you can change essentially anything in a manifest and keep your changes in a patch commit. Well, changing everything is probably not the best idea if you want to keep a supported deployment, but changing things like the log level or adjusting the ScyllaCluster resources for the scylla-manager makes sense.

Summary

We are constantly trying to make deploying Scylla Operator easier, and the manifests allow you to do that without any extra tooling, using just kubectl and git, or by hooking them into your GitOps automation.

For those using Operator Lifecycle Manager (OLM), we are also planning to ship an OLM bundle and publish it on the operatorhub.io so stay tuned.

LEARN MORE ABOUT SCYLLA OPERATOR

DOWNLOAD SCYLLA OPERATOR

The post Deploying Scylla Operator with GitOps appeared first on ScyllaDB.

Linear Algebra in Scylla


So, we all know that Scylla is a pretty versatile and fast database with lots of potential in real-time applications, especially ones involving time series. We’ve seen Scylla doing the heavy lifting in the telecom industry, advertising, nuclear research, IoT or banking; this time, however, let’s try something out of the ordinary. How about using Scylla for… numerical calculations? Even better – for distributed numerical calculations? Computations on large matrices require enormous amounts of memory and processing power, and Scylla allows us to mitigate both these issues – matrices can be stored in a distributed cluster, and the computations can be run out-of-place by several DB clients at once. Basically, by treating the Scylla cluster as RAM, we can operate on matrices that wouldn’t even fit on a hard disk!

But Why?

Linear algebra is a fundamental area of mathematics that many fields of science and engineering are based on. Its numerous applications include computer graphics, machine learning, ranking in search engines, physical simulations, and a lot more. Fast solutions for linear algebra operations are in demand and, since combining this with NoSQL is an uncharted territory, we had decided to start a research project and implement a library that could perform such computations, using Scylla.

However, we didn’t want to design the interface from scratch but, instead, base it on an existing one – no need to reinvent the wheel. Our choice was BLAS, due to its wide adoption. Ideally, our library would be a drop-in replacement for existing codes that use OpenBLAS, cuBLAS, GotoBLAS, etc.

But what is “BLAS”? This acronym stands for “Basic Linear Algebra Subprograms” and means a specification of low-level routines performing the most common linear algebraic operations — things like dot product, matrix-vector and matrix-matrix multiplication, linear combinations and so on. Almost every numerical code links against some implementation of BLAS at some point.

The crown jewel of BLAS is the general matrix-matrix multiplication (gemm) function, which we chose to start with. Also, because the biggest real-world problems seem to be sparse, we decided to focus on sparse algebra (that is: our matrices will mostly consist of zeroes). And — of course! — we want to parallelize the computations as much as possible. These constraints have a huge impact on the choice of representation of matrices.

Representing Matrices in Scylla

In order to find the best approach, we tried to emulate several popular representations of matrices typically used in RAM-based calculations/computations. Note that in our initial implementations, all matrix data was to be stored in a single database table.

Most formats were difficult to reasonably adapt for a relational database due to their reliance on indexed arrays or iteration. The most promising and the most natural of all was the dictionary of keys.

Dictionary of Keys

Dictionary of keys (or DOK in short) is a straightforward and memory-efficient representation of a sparse matrix where all its non-zero values are mapped to their pairs of coordinates.

CREATE TABLE zpp.dok_matrix (
    matrix_id int,
    pos_y bigint,
    pos_x bigint,
    val double,
    PRIMARY KEY (matrix_id, pos_y, pos_x)
);

NB. A primary key of ((matrix_id, pos_y), pos_x) could be a slightly better choice from the clustering point of view but at the time we were only starting to grasp the concept.

The Final Choice

Ultimately, we decided to represent the matrices as:

CREATE TABLE zpp.matrix_{id} (
    block_row bigint,
    block_column bigint,
    row bigint,
    column  bigint,
    value   {type},
    PRIMARY KEY ((block_row, block_column), row, column)
);

The first visible change is our choice to represent each matrix as a separate table. The decision was mostly ideological as we simply preferred to think of each matrix as a fully separate entity. This choice can be reverted in the future.

In this representation, matrices are split into blocks. Row and column-based operations are possible with this schema but time-consuming. Zeroes are again not tracked in the database (i.e. no values less than a certain epsilon should be inserted into the tables).

Alright, but what are blocks?

Matrix Blocks and Block Matrices

Instead of fetching a single value at a time, we’ve decided to fetch blocks of values, a block being a square chunk of values that has nice properties when it comes to calculations.

Definition:

Let us assume there is a global integer n, which we will refer to as the “block size”. A block of coordinates (x, y) in matrix A = [ a_{ij} ] is defined as a set of those a_{ij} which satisfy both:

(x - 1) * n < i <= x * n

(y - 1) * n < j <= y * n

Such a block can be visualized as a square-shaped piece of a matrix. Blocks are non-overlapping and form a regular grid on the matrix. Mind that the rightmost and bottom blocks of a matrix whose size is indivisible by n may be shaped irregularly. To keep things simple, for the rest of the article we will assume that all our matrices are divisible into nxn-sized blocks.

NB. It can easily be shown that this assumption is valid, i.e. a matrix can be artificially padded with (untracked) zeroes to obtain a desired size and all or almost all results will remain the same.
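
To make the mapping concrete, here is a minimal sketch in Python (the function name is ours, purely for illustration) of how a 1-based cell coordinate maps to its block under the definition above:

import math

def block_of(i: int, j: int, n: int) -> tuple:
    """Return (block_row, block_column) for the 1-based cell (i, j) with block size n.

    Follows the definition above: (x - 1) * n < i <= x * n is equivalent to x = ceil(i / n).
    """
    return math.ceil(i / n), math.ceil(j / n)

# Example: with n = 500, cell (501, 1000) lands in block (2, 2).
assert block_of(501, 1000, 500) == (2, 2)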

A similar approach has been used to represent vectors of values.

It turns out that besides being good for partitioning a matrix, blocks can play an important role in computations, as we show in the next few paragraphs.

Achieving Concurrency

For the sake of clarity, let us recall the definitions of parallelism and concurrency, as we are going to use these terms extensively throughout the following chapters.

Concurrency: A condition that exists when at least two threads are making progress. A more generalized form of parallelism that can include time-slicing as a form of virtual parallelism.
Parallelism: A condition that arises when at least two threads are executing simultaneously.

Source: Defining Multithreading Terms (Oracle)

In the operations of matrix addition and matrix multiplication, as well as many others, each cell of the result matrix can be computed independently, as by definition it depends only on the original matrices. This means that the computation can obviously be run concurrently.

(Matrix multiplication. Source: Wikipedia)

Remember that we are dealing with large, sparse matrices? Sparse – meaning that our matrices are going to be filled mostly with zeroes. Naive iteration is out of the question!

It turns out that bulk operations on blocks give the same results as direct operations on values (mathematically, we could say that matrices of values and matrices of blocks are isomorphic). This way we can benefit from our partitioning strategy, downloading, computing and uploading entire partitions at once, rather than single values, and performing computation only on relevant values. Now that’s just what we need!
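
As a rough, in-memory illustration of that isomorphism (a Python/NumPy sketch with names of our own choosing, not the library’s actual code), multiplying two matrices kept as dictionaries of non-zero blocks only ever touches block pairs that can produce a non-zero result:

from collections import defaultdict
import numpy as np

def multiply_blockwise(a_blocks, b_blocks, n):
    """a_blocks, b_blocks: dicts mapping (block_row, block_column) -> n x n NumPy arrays.
    Missing blocks are treated as all-zero. Returns the product in the same form."""
    # Index B's blocks by their block_row so they can be matched against A's block_column.
    b_by_row = defaultdict(list)
    for (k, j), b_blk in b_blocks.items():
        b_by_row[k].append((j, b_blk))

    c_blocks = {}
    for (i, k), a_blk in a_blocks.items():
        for j, b_blk in b_by_row.get(k, []):
            acc = c_blocks.setdefault((i, j), np.zeros((n, n)))
            acc += a_blk @ b_blk  # every (i, j) result block is independent of the others
    return c_blocks

Each (i, j) block of the result depends only on one block-row of A and one block-column of B, which is exactly what lets separate workers compute separate result blocks.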

Parallelism

Let’s get back to matrix multiplication. We already know how to perform the computation concurrently: we need to use some blocks of given matrices (or vectors) and compute every block of the result.

So how do we run the computations in parallel?

Essentially, we introduced three key components:

  • a scheduler
  • Scylla-based task queues
  • workers

To put the idea simply: we not only use Scylla to store the structures used in computations, but also “tasks”, representing computations to be performed, e.g. “compute block (i, j) of matrix C, where C = AB; A and B are matrices”. A worker retrieves one task at a time, deletes it from the database, performs a computation and uploads the result. We decided to keep the task queue in Scylla solely for simplicity: otherwise we would need a separate component like Kafka or Redis.
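
A worker’s main loop can therefore be sketched roughly like this (Python with the Scylla/DataStax Python driver; the zpp.task_queue table, its columns and the compute_block() helper are hypothetical stand-ins, since the real project is a C++ library with its own schema):

from cassandra.cluster import Cluster  # Scylla speaks the CQL protocol

def run_worker(compute_block):
    session = Cluster(["127.0.0.1"]).connect("zpp")
    while True:
        # Grab any pending task; a real implementation also has to guard against two
        # workers claiming the same task (e.g. with LWT or per-worker task buckets).
        task = session.execute("SELECT id, matrix_c, block_row, block_column "
                               "FROM task_queue LIMIT 1").one()
        if task is None:
            break  # queue drained, nothing left to compute
        session.execute("DELETE FROM task_queue WHERE id = %s", (task.id,))
        # Fetch the relevant blocks of A and B, multiply them, get (row, column, value) triples.
        insert = session.prepare(
            "INSERT INTO matrix_%d (block_row, block_column, row, column, value) "
            "VALUES (?, ?, ?, ?, ?)" % task.matrix_c)
        for row, column, value in compute_block(session, task):
            session.execute(insert, (task.block_row, task.block_column, row, column, value))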

The scheduler is a class exposing the library’s public interface. It splits the mathematical operations into tasks, sends them to the queue in Scylla and monitors the progress signaled by the workers. Ordering multiplication of two matrices A and B is as simple as calling an appropriate function from the scheduler.

As you may have guessed already, workers are the key to parallelisation – there can be arbitrarily many of them and they’re independent from each other. The more workers you can get to run and connect to Scylla, the better. Note that they do not even have to operate all on the same machine!

A Step Further

With the parallelism in hand we had little to no problem implementing most of BLAS operations, as almost every BLAS routine can be split into a set of smaller, largely independent subproblems. Of all our implementations, the equation system solver (for triangular matrices) was the most difficult one.

Normally, such systems can easily be solved iteratively: each value x_i of the result can be computed from the coefficients a_{i1}, …, a_{ii}, the previously computed values x_1, …, x_{i-1} and b_i as

x_i = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j \right)

This method doesn’t allow for concurrency in the way we described before.

Conveniently, mathematicians have designed different methods of solving equation systems, which can be run in parallel. One of them is the so-called Jacobi method, which is essentially a series of matrix-vector multiplications repeated until an acceptable result is obtained.
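
For reference, the componentwise form of the Jacobi iteration (standard textbook notation, not anything specific to our library) is:

x_i^{(k+1)} = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k)} \Bigr), \qquad i = 1, \dots, n

Every sweep is essentially one matrix-vector multiplication with the off-diagonal part of A followed by a cheap elementwise step, which is exactly the kind of operation the block-based, worker-driven machinery above already handles.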

One downside of this approach is that in some cases this method may never yield an acceptable result; fortunately for us, this shouldn’t happen too often for inputs representing typical real-life computational problems. The other is that the results we obtain are only (close) approximations of the exact solutions. We think it’s a fair price for the benefit of scalability.

Benchmarks

To measure the efficiency of our library, we performed benchmarks on the AWS cloud.

Our test platform consisted of:

  1. Instances for workers – 2 instances of c5.9xlarge:
    • 2 × (36 CPUs, 72GiB of RAM)
    • Running 70 workers in total.
  2. Instances for Scylla – 3 instances of i3.4xlarge:
    • 3 × (16CPUs, 122GiB of RAM, 2×1,900GB NVMe SSD storage)
    • Network performance of each instance up to 10 Gigabit.

We have tested matrix-matrix multiplication with a variable number of workers.

The efficiency scaled well, at least up to a certain point, where database access became the bottleneck. Keep in mind that in this test we did not change the number of Scylla’s threads.

Matrix-matrix multiplication (gemm), time vs input size. Basic gemm implementations perform O(n^3) operations, and our implementation does a good job of keeping the same time complexity.

Our main accomplishment was a multiplication of a big matrix (10^6 x 10^6, with 1% of the values not being 0) and a dense vector of length 10^6.

Such a matrix, stored naively as an array of values, would take up about 3.6TiB of space whereas in a sparse representation, as a list of <block_x, block_y, x, y, value>, it would take up about 335GiB.

The block size used in this example was 500*500. The test concluded successfully – the calculations took about 39 minutes.

Something Extra: Wrapper for cpp-driver

To develop our project we needed a driver to talk to Scylla. The official scylla-cpp-driver (https://github.com/scylladb/cpp-driver), despite its name, only exposes a C-style interface, which is really unhandy to use in modern C++ projects. In order to avoid mixing C-style and C++-style code we developed a wrapper for this driver that exposes a much more user-friendly interface, thanks to the use of the RAII idiom, templates, parameter packs, and other mechanisms.

Wrapper’s code is available at https://github.com/scylla-zpp-blas/scylla_modern_cpp_driver.

Here is a short comparison of code required to set up a Scylla session. On the left cpp-driver, on the right our wrapper.

Summary

This research project was jointly held between ScyllaDB and the University of Warsaw, Faculty of Mathematics, Informatics and Mechanics, and constituted the BSc thesis for the authors. It was a bumpy but fun ride and Scylla did not disappoint us: we built something so scalable that, although still being merely a tidbit, could bring new quality to the world of sparse computational algebra.

The code and other results of our work are available for access at the GitHub repository:

https://github.com/scylla-zpp-blas/linear-algebra

No matter whether you do machine learning, quantum chemistry, finite element method, principal component analysis or you are just curious — feel free to try our BLAS in your own project!

CHECK OUT THE SCYLLA BLAS PROJECT ON GITHUB

The post Linear Algebra in Scylla appeared first on ScyllaDB.

OlaCabs’ Journey with Scylla


OlaCabs is a mobility platform, providing ride sharing services spanning 250+ cities across India, Australia, New Zealand and the UK. OlaCabs is just one part of the overall Ola brand, which also includes OlaFoods, OlaPayment, and an electric vehicle startup, Ola Electric. Founded in 2010, OlaCabs is redefining mobility. Whether your delivery vehicle or ride sharing platform is a motorized bike, auto-rickshaw, metered taxi or cab, OlaCabs’ platform supports over 2.5 million driver-partners and hundreds of millions of consumers.

The variety of vehicles Ola supports reflects the diversity of localized transportation options.

At Scylla Summit 2021 we had the privilege of hosting Anil Yadav, Engineering Manager at Ola Cabs, who spoke about how Ola Cabs was using Scylla’s NoSQL database to power their applications.

OlaCabs began using Scylla in 2016, when it was trying to solve the problem of spiky intraday traffic in its ride hailing services. Since then they have developed multiple applications that interface with it, including Machine Learning (ML) for analytics and financial systems.

The team at OlaCabs determined early that they did not require the ACID properties of an RDBMS, and instead needed a high-availability oriented system to meet demanding high throughput, low-latency, bursty traffic.

OlaCabs’ architecture combines Apache Kafka for data streaming from files stored in AWS S3, Apache Spark for machine learning and data pipelining, and Scylla to act as their real-time operational data store.

OlaCabs needs to coordinate data between the demand side (passengers requesting rides) and the supply side (drivers and vehicles), matching up as well as possible which drivers and vehicles are best suited for each route. It also makes these real-time decisions based on learning from the prior history of traffic patterns, and even from the behavior of their loyal customers and the experiences of their drivers.

Anil shared some of the key takeaways from his experiences running Scylla in production.

You can watch the full video on demand from Scylla Summit 2021 to hear more about the context of these tips below. And you can learn more about OlaCabs’ journey with Scylla by reading their case study of how they grew their business as one of the earliest adopters of Scylla.

If you’d like to learn how you can grow your own business using Scylla as a foundational element of your data architecture, feel free to contact us or join our growing community on Slack.

The post OlaCabs’ Journey with Scylla appeared first on ScyllaDB.

What We’ve Learned after 6 Years of IO Scheduling


This post is by P99 CONF speaker Pavel Emelyanov, a developer at NoSQL database company ScyllaDB. To hear more from Pavel and many more latency-minded developers, register for P99 CONF today.

Why Scheduling at All?

Scheduling requests of any kind always serves one purpose — gaining control over the priorities of those requests. In a priority-less system one doesn’t need to schedule; just putting whatever arrives into the queue and waiting until it finishes is enough. When serving IO requests in Scylla we cannot afford to just throw those requests at the disk and wait for them to complete. In Scylla different types of IO flows have different priorities. For example, reading from disk to respond to a user query is likely a “synchronous” operation in the sense that a client really waits for it to happen, even though the CPU is most likely busy with something else. In this case, if there’s some IO running at the time the query request comes in, Scylla must do its best to let the query request get served in a timely manner, even if this means submitting it into the disk ahead of something else. Generally speaking, we can say that OLTP workloads are synchronous in the aforementioned sense and are thus latency-sensitive. This is somewhat the opposite of OLAP workloads, which can tolerate higher latency as long as they get sufficient throughput.

Seastar’s IO Scheduler

Scylla implements its IO scheduler as a part of the Seastar framework library. When submitted, an IO request is passed into the seastar IO scheduler, where it finds its place in one of several queues and eventually gets dispatched into the disk.

Well, not exactly into the disk. Scylla does its IO over files that reside on a filesystem. So when we say “a request is sent to the disk” we really mean that the request is sent into the Linux kernel AIO, then it gets into the filesystem, then to the Linux kernel IO scheduler and only then to the disk.

The scheduler’s goal is to make sure that requests are served in a timely manner according to the assigned priorities. To maintain fairness between priorities, the IO scheduler maintains a set of request queues. When a request arrives, the target queue is selected based on the request priority, and later, when dispatching, the scheduler uses a virtual-runtime-like algorithm to balance between queues (read — priorities), but this topic is beyond the scope of this blog post.

The critical parameter of the scheduler is called “latency goal.” This is the time period after which the disk is guaranteed to have processed all the requests submitted so far, so the new request, if it arrives, can be dispatched right at once and, in turn, be completed not later than after the “latency goal” time elapses. To make this work the scheduler tries to predict how much data can be put into the disk so that it manages to complete them all within the latency goal. Note that meeting the latency goal does not mean that requests are not queued somewhere after dispatch. In fact, modern disks are so fast that the scheduler does dispatch more requests than the disk can handle without queuing. Still the total execution time (including the time spent in the internal queue) is small enough not to violate the latency goal.

The above prediction is based on the disk model that’s wired into the scheduler’s brains, and the model uses a set of measurable disk characteristics. Modeling the disk is hard and a 100% precise model is impossible, since disks, as we’ve learned, are always very surprising.

The Foundations of IO

Most likely when choosing a disk one would be looking at its four parameters — read/write IOPS and read/write throughput (in Gbps). Comparing these numbers to one another is a popular way of claiming one disk is better than the other, and in most cases real disk behavior meets the user expectations based on these numbers. Applying Little’s Law here makes it clear that the “latency goal” can be achieved at a certain level of concurrency (i.e. the number of requests put into the disk altogether), and all the scheduler needs to do its job is to stop dispatching at some level of in-disk concurrency.
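
As a back-of-the-envelope worked instance of that argument (our notation, not the scheduler’s exact formula), Little’s Law bounds how much data may be in flight without breaking the latency goal:

N = \lambda \cdot W \quad \Longrightarrow \quad \text{bytes in flight} \;\lesssim\; B \cdot T_{\text{goal}} = 1\,\text{GB/s} \times 0.5\,\text{ms} = 500\,\text{kB}

Here B is the disk bandwidth and T_goal is the latency goal; the 500 kB figure reappears in the per-shard math below.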

Actually, it may happen that the latency goal is violated once even a single request is dispatched. In that case the scheduler would have to stop dispatching before it submits even this single request, which in turn means that no IO would ever happen. Fortunately this can be observed only on vintage spinning disks that may impose milliseconds-scale overhead per request. Scylla can work with these disks too, but the user’s latency expectation must be greatly relaxed.

Share Almost Nothing

Let’s get back for a while to the “feed the disk with as many requests as it can process in ‘latency goal’ time” concept and throw some numbers into the game. The latency goal is a value of a millisecond’s magnitude; the default goal is 0.5 ms. An average disk doing 1 GB/s is capable of processing 500 kB during this time frame. Given a system of 20 shards, each gets 25 kB to dispatch in one tick. This value is in fact quite low: partially because Scylla would need too many small requests to keep the disk busy, which would be noticeable overhead, but mainly because disks often require much larger requests to work at their maximum bandwidth. For example, the NVMe disks that are used by AWS instances might need 64k requests to get to the peak bandwidth. Using 25k requests will give you ~80% of the bandwidth even when exploiting high concurrency.

This simple math shows that Seastar’s “shared nothing” approach doesn’t work well when it comes to disks, so shards must communicate when dispatching requests. In the old days Scylla came with the concept of IO coordinator shards; later this was changed to the IO-groups.

Why iotune?

When deciding whether or not to dispatch a request, the scheduler always asks itself: if I submit the next request, will it push the in-disk concurrency high enough to break the latency goal contract? Answering this question, in turn, depends on the disk model that sits in the scheduler's brain. This model can be built in two ways: in advance (offline) or on the fly (or a combination of the two).

Doing it on the fly is quite challenging. A disk, surprising as it may be, is not deterministic, and its performance characteristics change while it works. Even such a simple number as "bandwidth" doesn't have a definite fixed value, even if we allow for statistical error in our measurement. The same disk can show different read speeds depending on whether it's in so-called burst mode or under sustained load, and whether the IO is read, write or mixed; it's also heavily affected by the disk usage history, the air temperature in the server room and tons of other factors. Trying to estimate this model at runtime can be extremely difficult.

Instead, Scylla measures disk performance in advance with the help of a tool called iotune. This tool literally measures a bunch of the disk's parameters and saves the result in a file we call "IO properties." The numbers are then loaded by Seastar on start and fed into the IO scheduler configuration. The scheduler thus has a 4-dimensional "capacity" space at hand and is allowed to operate inside a sub-area of it. The area is defined by 4 limits, one on each axis, and the scheduler must make sure it doesn't leave this area, in a mathematical sense, when submitting requests. But really these 4 limits are not enough. Not only does the scheduler need a more elaborate configuration of the mentioned "safe area," it must also handle request lengths carefully.
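As a rough illustration of the "operate inside a sub-area" idea, here is a deliberately simplified sketch. The numbers and the per-request accounting scheme are assumptions made for illustration; the real Seastar scheduler normalizes and combines these axes in a more elaborate way, as the rest of this post explains.

```python
# Simplified "safe area" check: the four measured limits (read/write
# IOPS and read/write bandwidth) bound what may be dispatched during
# one latency-goal tick; each request "costs" one IOP plus its length
# on the corresponding axes.

LATENCY_GOAL = 0.0005   # seconds

limits = {                                  # illustrative iotune-style numbers
    "read_iops": 500_000, "read_bw": 1_000_000_000,
    "write_iops": 250_000, "write_bw": 500_000_000,
}

budget = {k: v * LATENCY_GOAL for k, v in limits.items()}   # per-tick budgets

def try_dispatch(spent, kind, length):
    """Dispatch a request only if it keeps us inside the safe area."""
    iops_axis, bw_axis = f"{kind}_iops", f"{kind}_bw"
    if (spent[iops_axis] + 1 > budget[iops_axis] or
            spent[bw_axis] + length > budget[bw_axis]):
        return False                        # would break the latency goal
    spent[iops_axis] += 1
    spent[bw_axis] += length
    return True

spent = {k: 0 for k in limits}
print(try_dispatch(spent, "read", 128 * 1024))   # True: fits the budget
```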

Pure Workloads

First, let's see how disks behave when fed with what we call "pure" loads, i.e. only reads or only writes. If one divides the disk's maximum bandwidth by its maximum IOPS rate, the obtained number is some request size. If the disk is heavily loaded with requests smaller than that size, it will be saturated by IOPS and its bandwidth will be underutilized. If the requests are larger than that threshold, the disk will be saturated by bandwidth and its IOPS capacity will be underutilized. But are all "large" requests good enough to utilize the disk's full bandwidth? Our experiments show that some disks report notably different bandwidth values when using, for example, 64kB requests vs 512kB requests (of course, the larger the request size, the higher the bandwidth). So to get the maximum bandwidth from the disk one needs to use larger requests, and vice versa: with smaller requests one will never get the peak bandwidth from the disk, even if the IOPS limit is still not hit. Fortunately, there's an upper limit on the request size above which the throughput no longer grows. We call this limit the "saturation length."
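A quick sketch of the "crossover" request size mentioned at the start of the paragraph, the size at which a pure workload stops being IOPS-bound and becomes bandwidth-bound (the numbers are again illustrative, not a measurement of a specific disk):

```python
read_bandwidth = 2.5 * 1024**3   # 2.5 GiB/s
read_iops = 500_000

crossover = read_bandwidth / read_iops
print(f"{crossover / 1024:.1f} KiB")   # ~5.2 KiB for these numbers

# Requests well below this size saturate IOPS first; requests above it
# saturate bandwidth first. The "saturation length" is a separate,
# larger threshold: the request size beyond which throughput stops
# growing at all.
```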

This observation has two consequences. First, the saturation length can be measured by iotune and, if so, it is later advertised by the scheduler as the IO size that subsystems should use if they want to obtain the maximum throughput from the disk. The SSTables management code uses buffers of that length to read and write SSTables.

This advertised request size, however, shouldn't be too big. It must still be smaller than the largest request with which the disk still meets the latency goal. These two requirements (large enough to saturate the bandwidth, small enough to meet the latency goal) may be "out of sync," i.e. the latter limit may be lower than the former. We've met such disks; with them the user has to choose between latency and throughput. Otherwise, both can be enjoyed (provided other circumstances are favorable).

The second consequence is that if the scheduler sees medium-sized requests coming in, it must dispatch less data than it would if the requests were larger. This is because the effective disk bandwidth would be below the peak and, accordingly, the latency goal requirement wouldn't be met. Newer Seastar models this behavior with a staircase function, which turns out to be both a good approximation and cheap in terms of configuration parameters to maintain.
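A minimal sketch of the staircase idea follows; the breakpoints and fractions below are made up for illustration, while Seastar derives its own steps from iotune measurements:

```python
PEAK_BANDWIDTH = 1_000_000_000   # bytes/s, illustrative

# (max request length, fraction of peak bandwidth achievable)
STAIRCASE = [
    (4 * 1024, 0.25),
    (16 * 1024, 0.50),
    (64 * 1024, 0.80),
    (float("inf"), 1.00),   # at and beyond the saturation length
]

def effective_bandwidth(request_len: int) -> float:
    """Approximate achievable bandwidth for a given request length."""
    for max_len, fraction in STAIRCASE:
        if request_len <= max_len:
            return PEAK_BANDWIDTH * fraction
    return PEAK_BANDWIDTH

# Smaller requests mean lower effective bandwidth, so the scheduler
# must dispatch fewer bytes per latency-goal tick to stay on target.
print(effective_bandwidth(8 * 1024), effective_bandwidth(512 * 1024))
```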

Mixed Workloads

The next dimension of complexity comes with what we call "mixed workloads," when the disk has to execute both reads and writes at the same time. In this case both the total throughput and the IOPS will differ from what one would get by calculating a linear ratio between the inputs. This difference is twofold.

First, read flows and write flows degrade in different manners. Let's take a disk that can run 1GB/s of reads or 500MB/s of writes. It's no surprise that disks write slower than they read. Now let's try to saturate the disk with two equal unbounded read and write flows. What output bandwidth would we see? The linear ratio makes us think that each flow would get its half, i.e. reads would show 500MB/s and writes would get 250MB/s. In reality the result differs between disk models, and the common case seems to be that writes always feel much better than reads. For example, we may see an equal rate of 200MB/s for both flows, which is 80% of the expected share for writes and only 40% for reads. Or, in the worst (or maybe the best) case, writes can keep working at peak bandwidth while reads have to be content with whatever remains.

Second, this inhibition greatly depends on the request sizes used. For example, when a saturated read flow is disturbed by one write request at a time, the read throughput may become 2 times lower for small writes or 10 times lower for large ones. This observation imposes yet another limitation on the maximum IO length that the scheduler advertises to the system: when configured, the scheduler additionally limits the maximum write request length so that it has a chance to dispatch a mixed workload and still stay within the latency goal.
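One possible way to respect such a cap, shown purely as a hedged sketch (the cap value is an assumption, and this is not necessarily how Seastar implements it internally), is to split oversized writes into chunks no longer than the advertised maximum so that a concurrent read flow isn't starved past the latency goal:

```python
MAX_WRITE_LEN = 64 * 1024   # advertised maximum write request length (assumed)

def split_write(offset: int, length: int):
    """Yield (offset, length) chunks that respect the advertised cap."""
    while length > 0:
        chunk = min(length, MAX_WRITE_LEN)
        yield offset, chunk
        offset += chunk
        length -= chunk

# A 1 MiB write becomes sixteen 64 kB dispatches instead of one huge one.
print(len(list(split_write(0, 1024 * 1024))))   # 16
```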

Unstable Workloads

Digging deeper, we'll see that there are actually twice as many speed numbers for a disk. Each speed characteristic can in fact be measured in two modes: bursted or sustained. EBS disks are even explicitly documented to work this way. This surprise is often the first thing a disk benchmark reveals: the advertised disk throughput is usually the "bursted" one, i.e. the peak bandwidth the disk dies would show when measured in 100% refined circumstances. But once the workload lasts longer than a few seconds or becomes "random," background activity starts inside the disk and the resulting speed drops. So when benchmarking a disk it's often said that one must clearly distinguish between short and sustained workloads and mention which one was used in the test.

iotune, by the way, measures the sustained parameters, mostly because Scylla doesn't expect to exploit burstable mode, and partially because this "burst" is hard to pin down.

View the Full Agenda

To see the full breadth of speakers at P99 CONF, check out the published agenda, and register now to reserve your seat. P99 CONF will be held October 6 and 7, 2021, as a free online virtual conference. Register today to attend!


The post What We’ve Learned after 6 Years of IO Scheduling appeared first on ScyllaDB.

AWS Graviton2: Arm Brings Better Price-Performance than Intel


Since the last time we took a look at Scylla’s performance on Arm, its expansion into the desktop and server space has continued: Apple introduced its M1 CPUs, Oracle Cloud added Ampere Altra-based instances to its offerings, and AWS expanded its selection of Graviton2-based machines. So now’s a perfect time to test Arm again — this time with SSDs.

Summary

We compared Scylla’s performance on m5d (x86) and m6gd (Arm) instances of AWS. We found that Arm instances provide 15%-25% better price-performance, both for CPU-bound and disk-bound workloads, with similar latencies.

Compared machines

For the comparison we picked m5d.8xlarge and m6gd.8xlarge instances, because they are directly comparable: other than the CPU, they have very similar specs:

                         Intel (x86-based) server       Graviton2 (Arm-based) server
Instance type            m5d.8xlarge                    m6gd.8xlarge
CPUs                     Intel Xeon Platinum 8175M      AWS Graviton2
                         (16 cores / 32 threads)        (32 cores)
RAM                      128 GB                         128 GB
Storage (NVMe SSD)       2 x 600 GB (1,200 GB total)    1 x 1,900 GB
Network bandwidth        10 Gbps                        10 Gbps
Price/hour (us-east-1)   $1.808                         $1.4464

For m5d, both disks were used as one via a software RAID0.

Scylla’s setup scripts benchmarked the following stats for the disks:

m5d.8xlarge:
  read_iops: 514 k/s
  read_bandwidth: 2.502 GiB/s
  write_iops: 252 k/s
  write_bandwidth: 1.066 GiB/s
m6gd.8xlarge:
  read_iops: 445 k/s
  read_bandwidth: 2.532 GiB/s
  write_iops: 196 k/s
  write_bandwidth: 1.063 GiB/s

Their throughput was almost identical, while the two disks on m5d provided moderately higher IOPS.

Note: “m” class (General Purpose) instances are not the typical host for Scylla. Usually, “I3” or “I3en” (Storage Optimized) instances, which offer a lower cost per GB of disk, would be chosen. However there are no Arm-based “i”-series instances available yet.

While this blog was being created, AWS released a new series of x86-based instances, the m6i, which boasts up to 15% improved price-performance over m5. However, they don't yet have SSD-based variants, so they would not be a recommended platform for a low-latency persistent database system like Scylla.

Benchmark Details

For both instance types, the setup consisted of a single Scylla node, 5 client nodes (c5.4xlarge) and 1 Scylla Monitoring node (c5.large), all in the same AWS availability zone.

Scylla was launched on 30 shards with 2GiB of RAM per shard. The remaining CPUs (2 cores for m6gd, 1 core with 2 threads for m5d) were dedicated to networking. Other than that, Scylla AMI’s default configuration was used.

The benchmark test we used was cassandra-stress, with a Scylla shard-aware driver. This version of cassandra-stress is distributed with Scylla. The default schema of cassandra-stress was used.

The benchmark consisted of seven phases:

  1. Populating the cluster with freshly generated data.
  2. Writes (updates) randomly distributed over the whole dataset.
  3. Reads randomly distributed over the whole dataset.
  4. Writes and reads (mixed in 1:1 proportions) randomly distributed over the whole dataset.
  5. Writes randomly distributed over a small subset of the dataset.
  6. Reads randomly distributed over a small subset of the dataset.
  7. Writes and reads (mixed in 1:1 proportions) randomly distributed over a small subset of the dataset.

Phase 1 ran for as long as needed to populate the dataset, phase 2 ran for 30 minutes, and the other phases ran for 15 minutes each. There was a break after each write phase to wait for compactions to finish.

The size of the dataset was chosen to be a few times larger than RAM, while the “small subset” was chosen small enough to fit entirely in RAM. Thus phases 1-4 test the more expensive (disk-touching) code paths, while phases 5-7 stress the cheaper (in-memory) paths.

Note that phases 3 and 4 are heavily bottlenecked by disk IOPS, not by the CPU, so their throughput is not very interesting for the general ARM vs x86 discussion. Phase 2, however, is bottlenecked by the CPU. Scylla is a log-structured merge-tree database, so even “random writes” are fairly cheap IO-wise.

Results

Phase              m5d.8xlarge (kop/s)   m6gd.8xlarge (kop/s)   Difference in throughput   Difference in throughput/price
1. Population      563                   643                    +14.21%                    +42.76%
2. Disk writes     644                   668                    +3.73%                     +29.66%
3. Disk reads      149                   148                    -0.67%                     +24.16%
4. Disk mixed      205                   176                    -14.15%                    +7.32%
5. Memory writes   1059                  1058                   -0.09%                     +24.88%
6. Memory reads    1046                  929                    -8.92%                     +13.85%
7. Memory mixed    803                   789                    -1.74%                     +22.82%

We see roughly even performance across both servers, but because of the different pricing, this results in a ~20% price-performance advantage overall for m6gd. Curiously, random mixed read-writes are significantly slower on m6gd, even though both random writes and random reads are on par with m5d.
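For reference, here is how the throughput/price column can be reproduced from the numbers above (a small sketch; the prices are the on-demand us-east-1 figures from the comparison table):

```python
m5d_price, m6gd_price = 1.808, 1.4464   # $ per hour
m5d_kops, m6gd_kops = 563, 643          # population phase throughput, kop/s

m5d_perf_per_dollar = m5d_kops / m5d_price      # ~311 kop/s per $/hour
m6gd_perf_per_dollar = m6gd_kops / m6gd_price   # ~445 kop/s per $/hour

diff = m6gd_perf_per_dollar / m5d_perf_per_dollar - 1
print(f"{diff:.2%}")   # 42.76%, matching the first row of the table
```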

Service Times

The service times below, in milliseconds, were measured on a separate run, where cassandra-stress was fixed at 75% of the maximum throughput.

                               m5d p90   m5d p99   m5d p999   m5d p9999   m6gd p90   m6gd p99   m6gd p999   m6gd p9999
Disk writes                    1.38      2.26      9.2        19.5        1.24       2.04       9.5         20.1
Disk reads                     1.2       2.67      4.28       19.2        1.34       2.32       4.53        19.1
Disk mixed (write latency)     2.87      3.9       6.48       20.4        2.78       3.58       7.97        32.6
Disk mixed (read latency)      0.883     1.18      4.33       13.9        0.806      1.11       4.05        13.7
Memory writes                  1.43      2.61      11.7       23.7        1.23       2.8        13.2        25.7
Memory reads                   1.09      2.48      11.6       24.6        0.995      2.32       11          24.4
Memory mixed (read latency)    1.01      2.22      10.4       24          1.12       2.42       10.8        24.2
Memory mixed (write latency)   0.995     2.19      10.3       23.8        1.1        2.37       10.7        24.1

No abnormalities. The values for both machines are within 10% of each other, with the one notable exception of the 99.99% quantile of "disk mixed" write latency.

Conclusion

AWS' Graviton2 Arm-based servers are already on par with x86 or better, especially for price-performance, and continue to improve generation after generation.

By the way, Scylla doesn’t have official releases for Arm instances yet, but stay tuned. Their day is inevitably drawing near. If you have any questions about running Scylla in your own environment, please contact us via our website, or reach out to us via our Slack community.

The post AWS Graviton2: Arm Brings Better Price-Performance than Intel appeared first on ScyllaDB.
