Notes on Snowflake’s new paper: Building an Elastic Query Engine on Disaggregated Storage

Wangda Tan
5 min read · Jan 3, 2021

This is Snowflake’s new paper, published at NSDI ’20; here’s the link: https://www.usenix.org/conference/nsdi20/presentation/vuppalapati

Highlights from my PoV

Resource disaggregation (storage, memory, compute) is the main theme in recent computer architecture development. By disaggregating compute, storage, and memory, we can scale each independently without overprovisioning.

Recent advances adopted by cloud vendors and warehouse-scale computing, such as fast networking, make partial storage disaggregation possible. Snowflake leverages reliable and cheap (but not fast) persistent remote storage (S3), and builds fast local ephemeral storage to cache input and intermediate data in order to meet its performance goals.

I’m also interested in several topics related to multi-tenancy discussed in this paper. Snowflake heavily leverages pre-warmed hosts to achieve low latency when adding new virtual warehouses or scaling up existing ones. However, because cloud vendors moved from a per-hour billing model to a per-second billing model, the pre-warmed node pool is becoming less attractive from an economic point of view (surprise, surprise!).

Leveraging serverless platforms (such as AWS Lambda, Fargate, etc.) is a possible solution, but current serverless platforms lack performance and security isolation, which blocks their adoption. The paper even considers building such a platform in-house!

This paper also highlights another pressing demand for resource disaggregation: local SSDs and main memory.

Even though progress in fast networking makes network-attached storage (such as AWS EBS) possible, high latency is still a problem, and bandwidth/latency remain problems for accessing remote memory (even with expensive InfiniBand).

Without disaggregating local SSDs and main memory from compute resources (CPUs), we will always overprovision DRAM and local SSDs to pair with local compute.

Overall, I like this paper a lot; it is one of the best papers I read in 2020. The many metrics on queries and read/write patterns in the paper make for interesting reading.

(What follows are my notes on each chapter of the paper; the highlights above are the TL;DR.)

1. Introduction

Problems with existing solutions

  • Shared-nothing architectures usually provision more resources than needed.
  • When adding nodes, the cost of reshuffling data is high.

For many of Snowflake’s queries, it is hard to predict hardware usage up front; lots of queries are driven by data sources the customer doesn’t control (such as web service logs).

2. Design

The design concepts of SNOW:

  • Use local ephemeral storage (local SSDs + memory) to cache intermediate data and meet latency requirements; existing cloud blob storage provides stronger semantics than needed.
  • Intermediate data sizes vary by multiple orders of magnitude across queries.
  • Elastic local ephemeral storage is co-located with compute nodes.

Other notes:

  • Uses pre-warmed EC2 instances shared across tenants.

4. Storage Architecture

  • The ephemeral storage system uses memory and local SSDs.
  • Spill extra intermediate data to S3 instead of to other compute nodes: the system doesn’t need to track which compute node holds the data, and it avoids handling OOM and out-of-disk errors (see the sketch after this list).
  • Future direction: for performance-critical queries, we don’t want intermediate data to spill to S3 at all.
  • But it is very hard to match compute with local storage, because demand is unpredictable and intermediate data volume varies a lot, so decoupling compute from ephemeral storage is a possible future direction.
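To make the spill policy concrete, here is a minimal Python sketch, assuming a hypothetical `LocalEphemeralStore` and an `s3_put` placeholder (neither is Snowflake’s actual API): writes land in local ephemeral storage while there is room, and overflow goes straight to S3 rather than to a peer node.

```python
# Hypothetical sketch of the spill policy: intermediate data goes to local
# ephemeral storage first and overflows to S3, never to another compute node.

class LocalEphemeralStore:
    """Stand-in for the memory + local-SSD tier on one compute node."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.files = {}

    def try_put(self, key, data):
        """Store locally if there is room; return False when full."""
        if self.used_bytes + len(data) > self.capacity_bytes:
            return False
        self.files[key] = data
        self.used_bytes += len(data)
        return True

def s3_put(key, data):
    # Placeholder for a real S3 upload (e.g., boto3 put_object).
    print(f"spilled {len(data)} bytes to s3://bucket/{key}")

def write_intermediate(store, key, data):
    # Spilling to S3 means nobody has to track which peer holds the data,
    # and there is no OOM / out-of-disk coordination between compute nodes.
    if not store.try_put(key, data):
        s3_put(key, data)

store = LocalEphemeralStore(capacity_bytes=1024)
write_intermediate(store, "q1/part-0", b"x" * 512)   # fits locally
write_intermediate(store, "q1/part-1", b"x" * 2048)  # spills to S3
```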

The local ephemeral storage system acts like a write-through cache: it layers memory, local SSDs, and S3, so it can cache input files and intermediate files at the same time.
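Here is a toy version of that write-through behavior, as I understand the description. The two LRU tiers and the in-memory `s3` dict are stand-ins for real DRAM, SSD files, and S3 calls, not Snowflake’s implementation:

```python
from collections import OrderedDict

class WriteThroughCache:
    """Toy three-tier cache: memory -> local SSD -> S3 (durable)."""

    def __init__(self, mem_capacity, ssd_capacity):
        self.mem = OrderedDict()   # small, fast tier (LRU order)
        self.ssd = OrderedDict()   # larger, slower tier (LRU order)
        self.s3 = {}               # durable backing store
        self.mem_capacity = mem_capacity
        self.ssd_capacity = ssd_capacity

    def put(self, key, value):
        # Write-through: the write hits S3 synchronously, so the cache
        # never holds the only copy of persistent data.
        self.s3[key] = value
        self._cache(key, value)

    def get(self, key):
        if key in self.mem:                  # memory hit
            self.mem.move_to_end(key)
            return self.mem[key]
        if key in self.ssd:                  # SSD hit: promote to memory
            value = self.ssd.pop(key)
            self._cache(key, value)
            return value
        value = self.s3[key]                 # miss: fetch from S3 and cache
        self._cache(key, value)
        return value

    def _cache(self, key, value):
        self.mem[key] = value
        self.mem.move_to_end(key)
        if len(self.mem) > self.mem_capacity:
            old_key, old_value = self.mem.popitem(last=False)
            self.ssd[old_key] = old_value    # demote memory victim to SSD
            if len(self.ssd) > self.ssd_capacity:
                # SSD victim is safe to drop: S3 still has the data.
                self.ssd.popitem(last=False)

cache = WriteThroughCache(mem_capacity=2, ssd_capacity=4)
cache.put("part-0", b"...")  # lands in memory and, durably, in S3
```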

Plain consistent hashing is not sufficient because it requires reshuffling data, so SNOW implements lazy consistent hashing (see the next section).

6. Resource Elasticity

Lazy consistent hashing

  • When a new node is added, it has no local cache. As work stealing (opportunistic execution) happens, new files get cached on the new node, so future work automatically moves to it; no reshuffling is needed during the process (a toy sketch follows).
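Here is a toy illustration of the laziness (work stealing omitted for brevity); `Ring`, `Node`, and `s3_fetch` are my own names, not Snowflake’s API. Placement follows a consistent-hash ring, but adding a node moves no data; the new node just misses and fills from S3 on first access:

```python
import bisect
import hashlib

def h(value):
    # Hash a string onto the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def s3_fetch(filename):
    return f"<contents of {filename}>"   # placeholder for a real S3 read

class Node:
    def __init__(self, name):
        self.name = name
        self.cache = {}

    def read(self, filename):
        if filename not in self.cache:
            self.cache[filename] = s3_fetch(filename)  # lazy fill on miss
        return self.cache[filename]

class Ring:
    def __init__(self):
        self.points = []   # sorted hash points
        self.nodes = {}    # hash point -> node

    def add_node(self, node):
        bisect.insort(self.points, h(node.name))
        self.nodes[h(node.name)] = node    # note: no data is reshuffled

    def owner(self, filename):
        i = bisect.bisect(self.points, h(filename)) % len(self.points)
        return self.nodes[self.points[i]]

ring = Ring()
ring.add_node(Node("node-1"))
ring.add_node(Node("node-2"))
files = [f"part-{i}" for i in range(8)]
for f in files:
    ring.owner(f).read(f)                  # warm the initial caches

ring.add_node(Node("node-3"))              # scale up: nothing moves eagerly
for f in files:
    node = ring.owner(f)                   # some files now map to node-3
    node.read(f)                           # ...and get cached on first access
```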

Future directions:

  • Intra-query elasticity is preferred; existing SNOW can only rescale when a new query is executed.
  • There’s a per-second billing mechanism SNOW wants to adopt, so existing serverless platforms (AWS Lambda, etc.) are attractive. But they offer no good isolation (for security and performance); see this paper: https://www.usenix.org/conference/atc18/presentation/wang-liang. And for SNOW’s use case, serverless would also require solving remote ephemeral storage access and multi-tenant resource sharing.

7. Multi-tenancy

There are two problems SNOW is facing:

  1. SNOW previously relied on EC2’s per-hour billing model, but cloud providers have since moved to per-second billing, which makes the pre-warmed pool a costly way to meet customers’ SLA demands (a rough cost illustration follows this list).
  2. Individual virtual warehouses’ utilization is low, so co-locating multiple warehouses on the same VM instances, so that spare memory/CPU can be used by other warehouses, makes sense.
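A back-of-the-envelope illustration of the billing point, with entirely made-up numbers: under per-hour billing, a node released mid-hour is paid for anyway, so parking it in the warm pool is nearly free at the margin; under per-second billing, every idle pool-second is metered.

```python
# All numbers are hypothetical, purely to illustrate the cost shift.
price_per_hour = 1.0    # $/hour for one VM
pool_size = 100         # pre-warmed pool size
idle_fraction = 0.7     # fraction of pool time spent idle, awaiting tenants

# Per-second billing: idle pool time becomes a direct, metered cost.
idle_cost_per_day = pool_size * idle_fraction * 24 * price_per_hour
print(f"per-second billing: ~${idle_cost_per_day:.0f}/day of pure idle cost")

# Per-hour billing: idle remainders of already-paid hours cost ~$0 extra,
# so the warm pool could be stocked with them almost for free.
```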

SNOW is considering a more aggressive direction: putting different tenants on the same set of VMs. But doing so has the following challenges:

  • How to design a shared ephemeral storage system, using both memory and SSDs, that supports fine-grained elasticity without sacrificing isolation across tenants (a toy sketch follows this list).
  • When a new node is added to the cluster, invalidating caches (under lazy consistent hashing) could impact the performance of all tenants, which is not ideal.
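To make the isolation side of the first challenge concrete, here is my own toy sketch (not the paper’s design) of a shared cache with static per-tenant byte quotas; eviction only reclaims space from the tenant that is over quota, so one tenant’s working set cannot evict another’s. The hard part the paper points at is making such quotas elastic without losing this isolation.

```python
from collections import OrderedDict, defaultdict

class SharedTenantCache:
    """Shared memory/SSD cache with a static per-tenant byte quota."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes           # same quota for each tenant
        self.entries = defaultdict(OrderedDict)  # tenant -> LRU {key: bytes}
        self.used = defaultdict(int)             # tenant -> bytes used

    def put(self, tenant, key, data):
        lru = self.entries[tenant]
        if key in lru:
            self.used[tenant] -= len(lru.pop(key))
        lru[key] = data
        self.used[tenant] += len(data)
        # Evict only this tenant's LRU entries until it is under quota.
        while self.used[tenant] > self.quota_bytes:
            _, victim = lru.popitem(last=False)
            self.used[tenant] -= len(victim)

    def get(self, tenant, key):
        lru = self.entries[tenant]
        if key in lru:
            lru.move_to_end(key)
            return lru[key]
        return None                              # miss: would fall back to S3

cache = SharedTenantCache(quota_bytes=1024)
cache.put("tenant-a", "f1", b"x" * 800)
cache.put("tenant-b", "f2", b"y" * 800)          # does not touch tenant-a
cache.put("tenant-a", "f3", b"x" * 800)          # evicts tenant-a's own f1
assert cache.get("tenant-a", "f1") is None       # evicted within its quota
assert cache.get("tenant-b", "f2") is not None   # tenant-b stayed isolated
```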
