Is It Realistic To Run ANY Big Data (Like Database, ETL) Workload on GPU Now?

Wangda Tan
Jan 9, 2021 · 6 min read


Today I will share two papers/presentations:

[1] (Clemens Lutz, et al.) Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnections. (SIGMOD ’20).

[2] (Robert Evans & Jason Lowe) Deep Dive into GPU Support in Apache Spark 3.x. (Spark + AI Summit 2020).

As GPUs become more and more popular, it is natural to consider using their powerful compute horsepower to run big data workloads. This is a question I did not have an answer to for a long time (for ML workloads, especially DNNs, the consensus is to run them on GPU, so there is no debate there). That is why I read these papers and wrote this blog post.

By big data workloads, I mean OLAP use cases such as SQL analytics on engines like Spark and Flink. Compared to (traditional) ML workloads, big data workloads have much higher volumes (tens to hundreds of TBs of data, or even more), which are hard to fit in main memory. Data parallelism is the common way to run such a workload across distributed nodes; lots of shuffles are needed, and the intermediate data is typically much larger than the input/output data.

A GPU has many more cores (thousands of CUDA cores in an Nvidia V100 vs. dozens of threads in the latest Intel CPUs) and much higher bandwidth to GPU memory than a CPU has to main memory. This makes it very suitable for tasks that can be processed by a vector processor.
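To give a toy sense of what "vector processor friendly" means here, the following is a minimal sketch (assuming CuPy on a CUDA-capable machine; the column sizes and the price/discount expression are made-up examples, not taken from either paper):

```python
import numpy as np
import cupy as cp  # GPU array library; assumes a CUDA-capable GPU is present

# A SQL-style projection such as price * (1 - discount) applies the same
# arithmetic independently to every row, so it maps naturally onto
# thousands of GPU cores.
price = cp.asarray(np.random.rand(10_000_000).astype(np.float32))
discount = cp.asarray(np.random.rand(10_000_000).astype(np.float32))

net = price * (1 - discount)   # a single vectorized kernel over all rows
print(float(net.sum()))
```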

Seymour Cray (the father of supercomputing) once joked:

“If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?”

Ironically, decades later, the computer industry has moved in the opposite direction: use many, many (tiny) cores to plow a massive field!

What makes GPUs weak at big data processing?

  • Low GPU memory capacity (e.g., 32 GB on a V100 GPU).
  • Copying from CPU memory to GPU memory is slow and inefficient (limited by the GPU-CPU interconnect).
  • Even though the GPU can handle particular tasks fast, Amdahl’s law applies: the fraction of the program that cannot be parallelized on the GPU determines how much total speedup we can get (see the sketch after this list).
  • High barrier to coding for GPUs (CUDA programming requires special knowledge about how a GPU works).
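To make the Amdahl’s law point concrete, here is a minimal back-of-envelope sketch; the 60% GPU-friendly fraction and the 10x kernel speedup are made-up illustrative numbers, not measurements from [1] or [2]:

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only `parallel_fraction` of the runtime
    is accelerated by `parallel_speedup`x (Amdahl's law)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / parallel_speedup)

# Suppose the GPU makes 60% of a query 10x faster, while the other 40%
# (parsing, shuffle, disk/network I/O) stays on the CPU at the same speed.
print(amdahl_speedup(0.6, 10))   # ~2.17x overall, far below the 10x kernel speedup
```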

First, let’s look at the GPU interconnect.

I was aware that the GPU-to-CPU interconnect could cause performance issues. Let’s see whether the latest hardware solves the problem.

[1] mentioned:

As you can see, GPUs used the PCI-e 3.0 connection for a decade. With the recent upgrade to NVLink 2.0 (see (a) for the differences between NVLink 2.0 and PCI-e 3.0), the GPU-CPU interconnect bandwidth is now much closer to the CPU’s own main-memory bandwidth: the GPU reaches 63 GiB/s vs. the Xeon’s 81 GiB/s for sequential access, and 2.8 GiB/s vs. 2.7 GiB/s for random access.

Even though accessing main memory from the GPU has high latency (434 ns vs. 68 ns), if we can “batch” GPU-to-CPU memory accesses, the latency won’t slow down the compute too much.
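To see why batching helps, here is a simplified model (a sketch, not code from [1]) that charges one fixed latency per transfer and then streams at the link’s sequential bandwidth, using the NVLink 2.0 numbers above; the transfer sizes are arbitrary:

```python
LATENCY_S = 434e-9         # GPU -> CPU main memory latency quoted from [1]
BANDWIDTH = 63 * 2**30     # NVLink 2.0 sequential bandwidth in bytes/s, from [1]

def effective_throughput_gib(transfer_bytes):
    """Effective throughput if each transfer pays one latency, then streams."""
    seconds = LATENCY_S + transfer_bytes / BANDWIDTH
    return transfer_bytes / seconds / 2**30

for size in (64, 4 * 2**10, 1 * 2**20):   # 64 B, 4 KiB, 1 MiB
    print(f"{size:>8} bytes -> {effective_throughput_gib(size):6.2f} GiB/s")
# Tiny 64 B accesses stay latency-bound (well under 1 GiB/s), while 1 MiB
# batches get close to the 63 GiB/s link bandwidth.
```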

Second, how to efficiently access CPU memory from the GPU

Since the GPU has limited memory on board, we cannot cache much data in GPU memory, so we still need to access CPU memory frequently.

Accessing CPU memory from the GPU used to be challenging. [1] lists many different approaches:

Previously, accessing a large volume of CPU memory from the GPU required “pinning” the memory on the CPU side first, then copying it over to the GPU. The latest “coherence” approach allows the GPU to access CPU memory directly, and it is shown to be the best performer in [1]:
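For a rough feel of the traditional pin-then-copy path (this is not the coherence approach from [1], which relies on NVLink hardware address translation), here is a minimal sketch using CuPy; the array size is arbitrary:

```python
import numpy as np
import cupy as cp

def pinned_empty(n, dtype=np.float32):
    """Allocate a host ndarray backed by page-locked ("pinned") memory,
    which the GPU's DMA engine can transfer from directly."""
    nbytes = n * np.dtype(dtype).itemsize
    mem = cp.cuda.alloc_pinned_memory(nbytes)
    return np.frombuffer(mem, dtype=dtype, count=n)

host_buf = pinned_empty(1_000_000)
host_buf[:] = np.random.rand(1_000_000)   # fill the pinned staging buffer
device_buf = cp.asarray(host_buf)         # explicit host -> device copy over the interconnect
print(float(device_buf.sum()))
```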

But is it easy to get the best performance result?

[1] uses a no-partitioning hash join as the benchmark. To get the best result, it has to very carefully allocate memory on the CPU socket closest to the GPU (a multi-socket machine has several NUMA zones, each with a different latency to the GPU), as mentioned in the paper:

My understanding of the implementation in paper [1] looks like this:

If careful allocation of memory (across NUMA zones) is required to get a meaningful performance boost over the CPU, it is less practical in real life. Workloads on the public cloud typically run in a multi-tenant environment: cloud VMs share physical hosts, and the applications processing big data share the resources provided by the VM. Resource management platforms like YARN and K8s provide support for NUMA zones (see YARN-5764 and https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/), but making the resource management system aware of GPU-to-CPU memory affinity requires more work.
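As a small Linux-only sketch of why this is fiddly, the snippet below looks up which NUMA node a GPU’s PCI device hangs off by reading sysfs; the PCI address is a hypothetical example, and actually binding a process’s memory to that node would still need numactl/libnuma plus scheduler support:

```python
from pathlib import Path

def gpu_numa_node(pci_addr: str) -> int:
    """Return the NUMA node a PCI device (e.g. a GPU) is attached to,
    using the standard Linux sysfs attribute; -1 means unknown."""
    return int(Path(f"/sys/bus/pci/devices/{pci_addr}/numa_node").read_text().strip())

# Hypothetical GPU PCI address; find yours with `nvidia-smi -q` or `lspci`.
print(gpu_numa_node("0000:3b:00.0"))
```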

When the GPU does random memory access, performance degrades even further. The same graph in [1] shows:

Even with NVLink 2.0, random access throughput is only 4.4% of sequential access throughput (2.8 GiB/s vs. 63 GiB/s).

If an application cannot guarantee access to consecutive memory regions, performance will be awful. It’s better to let the CPU handle such computation rather than the GPU.

Third, how hard is it for users to leverage GPUs to speed up big data applications?

If users have to learn CUDA and rewrite their code to leverage the GPU, adoption will be very slow. The good news is that computation frameworks like Spark hide the complexity internally, so users can easily try whether it works without changing their programs.

[2] mentioned that, with the RAPIDS library integration, part of a Spark computation (enabled by SPARK-24615, SPARK-27396, etc.) can be scheduled onto the GPU when one is available. Users don’t need to understand the complexity behind the scenes:
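To give a flavor of “no program changes,” here is a minimal PySpark sketch based on my reading of the RAPIDS Accelerator docs rather than on [2] itself; the resource amounts are placeholders, and a real cluster also needs the RAPIDS/cuDF jars and a GPU discovery script configured:

```python
from pyspark.sql import SparkSession

# The SQL/DataFrame code stays the same; only the session config changes.
spark = (
    SparkSession.builder
    .appName("rapids-gpu-demo")
    # Load the RAPIDS Accelerator plugin so supported operators run on the GPU.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # GPU-aware scheduling from SPARK-24615: one GPU per executor and per task.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .getOrCreate()
)

df = spark.range(100_000_000)
df.selectExpr("sum(id) as total").show()   # this aggregation can be offloaded to the GPU
```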

Importantly, [2] also discussed whether GPU acceleration is a silver bullet (I like this slide a lot):

So whether the GPU can help increase performance depends on:

a. Whether any other bottleneck is hit (disk, network bandwidth, etc.).

b. Whether the data is suitable for GPU processing, which means:

  • Not too big (i.e., not larger than GPU memory).
  • Not too small, because of the high latency for the GPU to read from main memory and GPU memory; processing small data on the GPU isn’t worth it.
  • Suitable for GPU computation (vectorizable).

My takeaways

  • GPUs are good for speeding up a good number of workloads.
  • But it is still too early to claim that GPUs are suitable for general big data applications.
  • With framework support (like Spark) and cloud IaaS, it is easy for users to try and see whether a particular workload runs faster.
  • Over time, more and more applications will be able to use GPUs to speed up.
