Scaling Big Data Processing in the Cloud: Proven Practices with Spark & Ray

Tags:
cloud-data-processing, large-scale-data, spark, ray, cloud-architecture, data-engineering, python-dataprocessing, devops-cloud, kubernetes, distributed-systems

Introduction
Sharing
- 2.1 Environment Sharing Across Machines
- 2.2 Account System Sharing
Elasticity
- 3.1 Cost & Resource Strategy
- 3.2 Code for Scale
- 3.3 Debugging at Scale
Isolation
- 4.1 Development Stage
- 4.2 Runtime Stage
Conclusion

Introduction

As cloud infrastructure continues to mature, new companies often move their systems to the cloud to quickly achieve business goals. Developing on the cloud is quite different from developing directly on physical machines. The cloud emphasizes sharing and elasticity, and as scale grows, isolation becomes increasingly important. These changes also push us to adjust our development practices.

For large-scale data processing in the cloud, most of my experience comes from Spark and Ray, primarily using Python. From these tech stacks, I’d like to share some practical methods that have worked reasonably well.

When using Ray for large-scale data processing in the cloud, a basic approach is: build the minimal parallelizable unit, run functional and performance tests, and then scale using ray.data (e.g., map, map_batches). Spark works slightly differently; compared to Ray, Spark is less flexible but offers better abstractions. It allows you to think about data operations from the dataset perspective, and automatically expands and handles faults based on partitioning and parallelism.

Author: 木鸟杂记 — https://www.qtmuniao.com/2025/06/14/data-processing-on-cloud/ — Please credit the source when sharing.

Sharing can be divided into two parts: shared environments between multiple development machines, and shared environments between development and production. Why do we need shared environments across machines? First, we often need to migrate development machines for various reasons. Second, CPU machines and GPU machines are usually separate. Third, coworkers or dev/prod environments occasionally need to share files.

If you’ve ever switched development machines, you’ve likely faced the pain of rebuilding an environment that you painstakingly set up before. Some people script this process, but even with automation, reinstalling still takes time and is far from “plug-and-play”.

Just like Java’s slogan “write once, run anywhere,” can we configure our development environment once and use it anywhere? Here’s a crude practice I’ve tried.

To enable multi-machine sharing, you can request a large POSIX-compatible multi-writer shared disk from your cloud provider. Then place all user home directories on that disk. When provisioning a new dev machine, just mount the disk and symlink your user directory into /home. This solves data sharing.

Next: can account systems also be shared? If accounts differ across machines, you would need to assign permissions for the same directory to different accounts. The real problem is conflicting UIDs/GIDs across machines. To truly achieve seamless multi-machine user directory sharing, the account system must also be unified.

On Linux, user info ultimately boils down to two files:

/etc/passwd   — basic user information  
/etc/group    — group information

If all machines share these two files, they share the same account system. But then: where are these files generated? What if someone alters them maliciously?

For the first issue, similar to distributed systems, you can designate one machine as the “master” responsible for all user creation. Other machines act as “followers”, mounting and overriding their local user files with the master’s versions. This guarantees unique, non-conflicting UIDs/GIDs.

For the second issue, group admins should control account creation. But since developers still need sudo privileges to install local tools, these files can’t be fully locked down. Ultimately, this requires trust and conventions.

With this setup, we achieve a rough version of “configure once, switch machines freely.”

Some practical concerns remain—for example, using tools like conda to install all dependencies into your home directory rather than system disks. We’ll discuss this in the next section on isolation.

Elasticity

Elasticity is one of the cloud’s biggest selling points: scale on demand, pay as you go. But reality is not so ideal.

Cost & Resource Strategy

Elastic resources are significantly more expensive than reserved instances. Most users therefore buy enough reserved capacity to cover typical workloads, and rely on elastic resources only for bursts.

If your workload is predictable, this saves money. If not, elasticity only solves “availability,” not “cost.” Heavy or constant usage requires you to precisely calculate and manage usage.

When machine counts grow, we usually pool resources using containers and orchestrate workloads with Kubernetes. Thanks to its openness, most compute frameworks now have operators—like KubeRay and Spark on Kubernetes—making deployment easy.

Resource pooling introduces a classic problem: scheduling. When scheduling tasks, we usually consider urgency and resource demand. Here are a few scenarios:

Preemption
If resources are exhausted and a high-priority urgent task arrives, it must cut in line. With preemptive scheduling, it can even take resources from lower-priority tasks. A common manual workaround is killing some low-priority pods to free capacity. Distributed systems rely heavily on retrying tasks on new machines.
Deadlock
Yes, schedulers can deadlock. Suppose 100 CPUs remain. Task A needs 80 CPUs, task B needs 60. If A gets 70 and B gets 30, neither can start. Gang scheduling (e.g., Volcano) ensures a task receives all required resources at once, eliminating this situation.
Elastic scaling
On Kubernetes, many frameworks support job-level elasticity: more resources → faster execution, fewer resources → slower execution. Spark does this well. KubeRay’s elasticity is currently… lacking. But Spark’s strong elasticity also means a single large job can easily consume the entire cluster if priorities aren’t set.

For cloud-based large task scheduling, a scheduler supporting gang scheduling, priorities, and high throughput (e.g., Spark launching tens of thousands of pods) is essential.

Code for Scale

Scaling code is not free. Although the cost decreases over time—MapReduce to Spark/Flink to Ray—Ray still introduces challenges. Ray is essentially a distributed computer based on a master-worker architecture. It pools memory into an object store and abstracts CPU/GPU resources with fractional allocation.

Ray supports extremely fine-grained parallelism and is very flexible—ideal for the large-model era. But flexibility comes at the cost of missing higher-level distributed system abstractions, requiring you to implement them:

Logical scheduling
Ray schedules using quantized labels. The gap between logical requirements and physical reality must be tuned manually. For example, workers claiming 10 GB logically but needing far more physically can easily OOM a node.
Producer–consumer mismatch
Ray data processing requires careful tuning. If upstream produces too quickly, memory blows up; if downstream is slow, resources idle.
Non-standard data entries
Rare bad records can crash long-running jobs. Without proper error handling or checkpointing, the cost is painful. Ray supports silent errors, but that only shifts the mess to downstream consumers.

Debugging at Scale

Errors are common in large-scale parallel systems, but reproduction is difficult due to huge data, complex environments, and long runtimes. To debug efficiently, you need observability:

Logging
Output logs at all key steps. Divide your pipeline into “blocks” to quickly locate failing code segments.
Metrics
Data pipelines have multiple stages; OOMs often come from buildup at a particular stage. Collecting metrics helps locate performance bottlenecks. Metrics apply to:
- application-level code
- Ray actor resource usage over time
- node-level CPU/memory/network
- storage system throughput (e.g., object store bucket/prefix IO)

Once you gather logs and metrics, correlate them with failure timestamps. Simple cases can be diagnosed immediately; more complex ones may require extracting a bad record and creating a minimal reproducible example, ideally turned into a unit test.

Performance debugging has three common causes:

Poor performance of minimal execution unit
Resource contention during scaling
Producer–consumer speed mismatch

These are typical distributed system issues. Mature frameworks solve them with backpressure but sacrifice flexibility. Ray’s design means you often need to handle these issues manually.

If slowdown or hangs occur, log into machines directly. Use Linux tools to inspect system resources, use py-spy to see what code is doing, and identify where the system is stuck.

Isolation

Isolation can be considered from two lifecycle stages: development and runtime.

Development Stage

As mentioned earlier, multiple developers may share one dev machine, creating a need for isolation. The natural solution is Linux accounts. Some teams take shortcuts and use root everywhere. This inevitably leads to conflicts as people install tools and dependencies. It’s better not to use root, and instead install everything with tools like conda inside your home directory. This makes migration easier and avoids interfering with others.

Even for a single user, environment conflicts arise easily—especially with Python. If you install everything in your home directory, multiple projects will conflict inevitably. Even multiple branches of the same project may conflict. So virtualenv or conda is necessary.

Runtime Stage

Thanks to containers, isolating environments for runtime is extremely easy today. With containerization and dependency tools (e.g., poetry, uv), we no longer fear dev–prod dependency inconsistencies.

Specifically:

Use the same pyproject.toml to manage dependencies
During development, install dependencies via poetry/uv
In production, use the same poetry/uv setup inside the Dockerfile

Of course, this only covers Python dependencies; OS-level libraries (e.g., CUDA) still require consistency. But the principle is the same:

Use the Dockerfile as the single source of truth for all dependencies.
Both development and production should rely on it.

More aggressively, you can even spin up a container from the same Dockerfile during development, achieving perfect dev–prod parity。

Another way to think about it：

During development，images must be writable
In production，images must be read-only

So dependency management will differ slightly between the two scenarios.

Conclusion

Developing in the cloud changes certain paradigms and introduces new practices. This article briefly outlines some common problems and experiences around large-scale data processing in the cloud. If it brings you any insight, then it has served its purpose. Due to limited experience and scope, some aspects may not be fully covered—discussion and additions are welcome.

Scaling Big Data Processing in the Cloud: Proven Practices with Spark & Ray

Table of Contents

Introduction

Elasticity

Cost & Resource Strategy

Code for Scale

Debugging at Scale

Isolation

Development Stage

Runtime Stage

Conclusion

Comments

More from this blog

3 basic sorting algorithm

Large-Scale Data Systems (1): DataSet

Step by step, dissecting the most challenging parts of the database transaction — Isolation

LevelDB Data Structures Serials I: Skip List

Command Palette

Table of Contents

Introduction

Sharing

Environment Sharing Across Machines

Account System Sharing

Elasticity

Cost & Resource Strategy

Code for Scale

Debugging at Scale

Isolation

Development Stage

Runtime Stage

Conclusion

Comments

More from this blog