Scaling ML Training on a Budget: Ray and Spot Nodes in Action

Yashwardhan Modi, Amit Khanna, Srikanth Sundarrajan
Posted on April 23, 2025

Introduction

When it comes to machine learning (ML) training, cost and performance often go head-to-head. You want to train large models quickly, but you don’t want to break the bank. That’s where running Ray jobs on Spot Instances can shine: Spot Nodes are usually far cheaper, sometimes up to a 90% discount compared to regular On-Demand VMs. The tradeoff? Spot evictions during training force recomputation and, in the worst case, can trap the job in a perpetual cycle of redoing lost work.

The primary challenge is capturing those cost savings while ensuring your training job isn’t stalled by too many Spot evictions.

Issues with Training on Spot Nodes

Ray is built around a cluster architecture consisting of a head node (which orchestrates task scheduling and cluster state) and worker nodes (which execute the distributed tasks). In the context of ML training, Ray provides abstractions such as Ray Train that handle data parallelism by distributing tasks (batches of data) across multiple worker nodes. One of its core synchronization methods, particularly in many training loops, is all-reduce, where gradients from each worker need to be combined and redistributed so that every node has the latest model parameters before proceeding to the next batch or epoch.
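As a point of reference, a data-parallel Ray Train job in TensorFlow typically looks something like the minimal sketch below. This is only a skeleton, not our production job: the model, the data loading, and the worker count are placeholders.

```python
# Minimal data-parallel training skeleton with Ray Train (a sketch, not our exact job).
# Assumes ray[train] and TensorFlow are installed; model, data, and worker count are placeholders.
import tensorflow as tf

from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer


def train_func(config):
    # Ray Train sets up TF_CONFIG for each worker, so MultiWorkerMirroredStrategy
    # performs the all-reduce of gradients across the Ray worker nodes.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="adam", loss="mse")
    # ... build this worker's data shard and call model.fit(...) here ...


trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
)
result = trainer.fit()
```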

1. Worker Availability:

Because of Ray’s all-reduce mechanism, every active worker node is required for synchronization at the end of each training step. If a Spot node is evicted (terminated by the cloud provider on short notice), training stalls until Ray can either replace that node or reschedule the workload. This can lead to significant idle time while the cluster is waiting to return to a fully operational state.

2. Increased Overhead:

When a Spot node disappears, Ray needs to spin up a new worker on an available node. This process includes scheduling overhead, data reloading, and possibly re-initializing model weights from a checkpoint. Each interruption adds extra steps to the training loop, causing delays and sometimes repeated computation.

3. Potential Deadlocks:

If Spot evictions happen frequently, there could be extended periods when Ray can’t gather all workers simultaneously for the all-reduce step. In worst-case scenarios, the job remains stuck, repeatedly attempting to restore lost workers, leading to deadlock-like conditions where progress is stalled indefinitely.

Fallback Strategy in Ray jobs

Given the unpredictability of Spot Instance availability, relying solely on them can lead to deadlocks in distributed training scenarios. Implementing a fallback mechanism ensures that if Spot Instances become unavailable or are interrupted too frequently, the workload can transition to more stable, albeit more expensive, regular (On-Demand) instances, maintaining the continuity of the training process. Ray’s fallback strategy makes sure that if Spot Instance failures occur too often, your training doesn’t stay stuck forever. We define a max_failures value in the run’s failure configuration (Ray’s FailureConfig), which caps how many worker pods can fail, due to evictions or any other reason, before halting the Spot-based run. Once this threshold is reached, the training job stops and automatically restarts on On-Demand nodes from the latest checkpoint, so you don’t lose progress. The value of max_failures can be set to -1 if you want to stay on Spot Instances indefinitely, no matter how many evictions occur.
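As a rough sketch of how this retry budget is wired up (the parameter values are illustrative, and train_func is the per-worker loop from the earlier sketch):

```python
# Sketch: cap Spot-related retries with Ray's FailureConfig (values are illustrative).
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

run_config = RunConfig(
    name="spot-training",
    # Tolerate up to 5 worker failures (e.g., Spot evictions) before the run is halted.
    # Set max_failures=-1 to keep retrying on Spot indefinitely.
    failure_config=FailureConfig(max_failures=5),
)

trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,  # per-worker loop from the earlier sketch
    scaling_config=ScalingConfig(num_workers=4),
    run_config=run_config,
)
```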

[Figure: ml-scaling-diagram]

Checkpointing for fault tolerance

Checkpointing is a fundamental building block of any fault-tolerant ML training pipeline, especially when using unreliable infrastructure like Spot instances. It ensures that progress is not lost when failures occur. In distributed training, a single worker node failure (e.g., due to Spot eviction) can cause the job to crash. Without checkpointing, you’d need to start from scratch, wasting compute time and cost. Checkpointing allows the job to resume from the last saved state, preserving training progress even across interruptions and resource switches.

When max_failures is hit and the job transitions from Spot to On-Demand, Ray restores the job from the latest checkpoint, continuing seamlessly. This avoids repeating epochs or recomputing gradients, ensuring both efficiency and consistency in results.

In TensorFlow, checkpointing is typically handled using the tf.train.Checkpoint API. This integrates smoothly with Ray, which can track and persist these checkpoints if you report them using train.report() inside your training loop.
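A hedged sketch of how the two fit together inside the per-worker loop, assuming the ray.train Checkpoint API; the directory layout, metric names, and epoch count are illustrative:

```python
# Sketch: save a tf.train.Checkpoint each epoch and report it to Ray via train.report().
import os
import tempfile

import tensorflow as tf
from ray import train
from ray.train import Checkpoint


def train_func(config):
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam()
    ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)

    # If the run was restored (or given resume_from_checkpoint), pick up where we left off.
    restored = train.get_checkpoint()
    if restored:
        with restored.as_directory() as restore_dir:
            ckpt.restore(os.path.join(restore_dir, "ckpt")).expect_partial()

    for epoch in range(config.get("epochs", 10)):
        # ... run one epoch of training on this worker's data shard ...
        loss = 0.0  # placeholder metric

        with tempfile.TemporaryDirectory() as tmpdir:
            ckpt.write(os.path.join(tmpdir, "ckpt"))
            # Ray persists the directory (e.g., to the configured storage_path)
            # and tracks it alongside the reported metrics.
            train.report(
                {"epoch": epoch, "loss": loss},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )
```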

Keep In Mind

1. Save frequently, ideally after every N batches or epochs.

2. Store checkpoints in shared, durable storage (e.g., S3, Blob, NFS) so they survive node failures (see the sketch after this list).

3. Ensure consistency: save all training-related state, including the model, optimizer, learning rate scheduler, etc.
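For the durable-storage point above, a minimal sketch of pointing the run at shared storage (the bucket URI and retention count are placeholders):

```python
# Sketch: persist checkpoints to durable shared storage so they survive node loss.
from ray.train import CheckpointConfig, RunConfig

run_config = RunConfig(
    name="spot-training",
    # Any shared location works: s3://..., gs://..., or an NFS mount path.
    storage_path="s3://my-ml-bucket/ray-runs",  # placeholder bucket
    checkpoint_config=CheckpointConfig(num_to_keep=3),  # keep only the last few checkpoints
)
```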

Implementing the Spot-to-On-Demand Fallback

In distributed machine learning (ML) training, especially when utilizing spot instances, ensuring fault tolerance is crucial due to the potential for unexpected interruptions. While frameworks like Ray and TensorFlow offer foundational tools to address these challenges, integrating their functionalities requires a customized approach to create an effective fallback mechanism.

When to fallback

1. Eviction/max_failures: Fall back to On-Demand nodes once enough failures or evictions have been observed on Spot nodes.

2. Timeout due to unavailability: If Spot nodes cannot be procured at all to start the job, or cannot be re-procured after an eviction, configure a timeout to avoid an infinite wait.

Understanding the Components: 

1. Ray’s max_failures Parameter:

Ray provides the max_failures setting within its FailureConfig, allowing users to define the number of permissible recovery attempts for a training run. Upon reaching this threshold, Ray will halt the job. 

2. Checkpointing with TensorFlow:

TensorFlow facilitates the saving of a model’s state through its checkpointing capabilities. By periodically capturing the model’s parameters, training can resume from the last saved state after an interruption.

3. Ray’s Restore Functionality:

Ray’s Tuner.restore method enables the resumption of a previously failed run from its last checkpoint. This function is essential for continuing training without starting over. 
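A minimal sketch of the restore call (the experiment path is a placeholder, and trainer is the same trainable object used for the original run):

```python
# Sketch: resume a previously failed run from its last checkpoint via Tuner.restore.
from ray import tune

tuner = tune.Tuner.restore(
    path="s3://my-ml-bucket/ray-runs/spot-training",  # experiment path (placeholder)
    trainable=trainer,        # the same trainer / trainable used for the original run
    restart_errored=True,     # restart trials that previously errored (e.g., after evictions)
)
results = tuner.fit()
```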

Integrating Components for a Fallback Mechanism:

By combining Ray’s failure handling and restoration features with TensorFlow’s checkpointing, a robust fallback mechanism can be developed:

  • Failure Monitoring: Set the max_failures parameter to define the allowable number of failures. When this limit is reached, Ray will terminate the job.
  • Checkpoint Utilization: Implement TensorFlow’s checkpointing to save the model’s state at regular intervals, ensuring that training progress is preserved.

  • Job Restoration with Modified Configuration: Upon job termination due to reaching the failure threshold, use Ray’s Tuner.restore function to restart the job from the last checkpoint. Modify the job’s configuration to switch from spot instances to on-demand instances, thereby enhancing stability.

This integrated approach allows for the seamless transition between different instance types, ensuring that training processes are resilient to spot instance interruptions and can continue with minimal disruption.
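Putting the pieces together, one possible end-to-end sketch is shown below. It assumes the cluster advertises custom Ray resources named "spot" and "on_demand" on the corresponding node groups (a labeling convention chosen for this example, not a Ray built-in), that STORAGE points at shared storage reachable from both node pools, and that in practice train_func would be the checkpointing loop from the earlier sketch.

```python
# Sketch: run on Spot first; if the retry budget is exhausted, resume the same job
# on On-Demand nodes from the latest checkpoint. The "spot"/"on_demand" resource
# labels and the checkpoint-lookup helper are assumptions for this example.
import glob
import os

from ray import train
from ray.train import Checkpoint, FailureConfig, RunConfig, ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

STORAGE = "/mnt/shared/ray-runs"  # shared storage path (placeholder)


def train_func(config):
    # Placeholder: in practice, use the checkpointing loop from the earlier sketch.
    train.report({"loss": 0.0})


def find_latest_checkpoint(storage, run_name):
    """Pick the newest checkpoint_* directory written by a previous run (simplified)."""
    dirs = sorted(glob.glob(os.path.join(storage, run_name, "*", "checkpoint_*")))
    return Checkpoint.from_directory(dirs[-1]) if dirs else None


def build_trainer(resource_label, max_failures, resume_from=None):
    return TensorflowTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(
            num_workers=4,
            resources_per_worker={resource_label: 1},  # pin workers to spot/on-demand nodes
        ),
        run_config=RunConfig(
            name=f"training-{resource_label}",
            storage_path=STORAGE,
            failure_config=FailureConfig(max_failures=max_failures),
        ),
        resume_from_checkpoint=resume_from,
    )


try:
    result = build_trainer("spot", max_failures=5).fit()
except Exception:
    # The Spot attempt exhausted its retry budget (or could not recover); fall back
    # to On-Demand nodes and resume from the last checkpoint the Spot run produced.
    latest = find_latest_checkpoint(STORAGE, "training-spot")
    result = build_trainer("on_demand", max_failures=0, resume_from=latest).fit()
```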

Impact of Spot Instances on Total Runtime

Setting a higher max_failures value allows the training job to tolerate more interruptions, which is beneficial when utilizing spot instances prone to preemption. However, each interruption necessitates a recovery process, including restarting tasks and reloading data, which extends the overall runtime. While this approach maximizes the use of cost-effective spot instances, it may lead to longer training durations due to the cumulative effect of handling multiple failures.

Assumptions:

To establish a consistent baseline for analysis, we consider the following assumptions:

Symbol | Meaning
-------|--------
T | Total job time (in hours) if run entirely on On-Demand nodes
α | Fraction of the job that runs on Spot nodes before fallback
M | max_failures: number of Spot interruptions tolerated before fallback
Δt | Overhead (in hours) added per Spot failure (restart, reschedule, etc.)
C_O | On-Demand cost per hour
C_S | Spot cost per hour = 0.1 × C_O (i.e., a 90% discount)
S | Final cost savings percentage
Runtime Breakdown

We decompose the total runtime into:


  • Spot time:

    T_S = α * T

  • On-Demand time (after fallback):

    T_O = (1 - α) * T + M * Δt

  • Total runtime:

    T_total = T_S + T_O = T + M * Δt

This shows how a higher max_failures allows more progress on Spot but increases the total runtime due to accumulated restart overhead.

Cost Formulation 

The total cost of the training job when using Spot and fallback is:

C_fallback = T_S * C_S + T_O * C_O

The baseline cost if the entire job ran on On-Demand:

C_old = T * C_O

Final Cost Formula

Assuming C_S = 0.1 * C_O, and approximating the Spot/On-Demand split over the total runtime (i.e., T_S ≈ α * T_total and T_O ≈ (1 - α) * T_total, which spreads the failure overhead across both phases), the cost becomes:

C_new = α * (0.1 * C_O) * T_total + (1 - α) * C_O * T_total

C_new = (T + M * Δt) * C_O * (1 - 0.9 * α)

This cleanly expresses cost as a function of three critical design variables:

  • α: cost decreases as more of the job is completed on Spot
  • M: cost grows with how many Spot interruptions you are willing to tolerate
  • Δt: cost grows with how much time each failure loses

Cost Saving based on fallback percent 

C_new = T_total * C_O * (1 - 0.9 * α)

C_old = T * C_O

Hence, the fallback run saves money whenever the following inequality holds:

T_total * (1 - 0.9 * α) < T

Runtime flexibility

Job on Spot (%) | Runtime Flexibility (without cost increase) | Insight
----------------|---------------------------------------------|--------
30% | Up to ~36% longer runtime | Cost-efficient even with moderate Spot usage
40% | Up to ~56% longer runtime | High tolerance to delay, solid savings window
50% | Up to ~81% longer runtime | Break-even shifts significantly in your favor
60% | Up to ~117% longer runtime | Massive buffer, ideal for cost-optimized training jobs
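These figures follow directly from the inequality above: at break-even, T_total / T = 1 / (1 - 0.9 * α). A quick Python sketch reproduces the exact values (the table rounds them):

```python
# Sketch: break-even runtime increase (vs. pure On-Demand) for a given Spot fraction.
def max_runtime_increase(alpha: float) -> float:
    """Largest relative runtime increase that still breaks even, given Spot fraction alpha."""
    return 1.0 / (1.0 - 0.9 * alpha) - 1.0

for alpha in (0.3, 0.4, 0.5, 0.6):
    print(f"alpha={alpha:.0%}: up to {max_runtime_increase(alpha):.1%} longer runtime")
# alpha=30%: up to 37.0% longer runtime
# alpha=40%: up to 56.2% longer runtime
# alpha=50%: up to 81.8% longer runtime
# alpha=60%: up to 117.4% longer runtime
```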

Example and savings achieved

Let’s assume:

  • T = 5 hours
  • α = 0.6 (i.e., 60% of the runtime on Spot)
  • M = 5
  • Δt = 0.167 hours (10 minutes per failure)
  • M * Δt ≈ 0.84 hours, rounded up to 1 hour of overhead for simplicity
  • T_total = 6 hours

    C_new = T_total * C_O * (1 - 0.9 * α)

    C_new = 6 * C_O * 0.46 = 2.76 * C_O

    Savings = 5 * C_O - 2.76 * C_O = 2.24 * C_O, i.e., 44.8% savings

So, even with 5 interruptions adding roughly 1 hour of overhead (a 20% runtime increase), you still save around 45% of the cost compared to running the entire job on On-Demand instances.
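The same numbers drop out of a direct evaluation of the cost formula; the sketch below treats C_O as one cost unit per hour and rounds Δt so that M * Δt is exactly the 1-hour overhead used in the example:

```python
# Sketch: plug the example numbers into C_new = T_total * C_O * (1 - 0.9 * alpha).
T, alpha, M, dt, C_O = 5.0, 0.6, 5, 0.2, 1.0   # dt rounded so M * dt = 1 hour of overhead
T_total = T + M * dt                            # 6.0 hours
C_new = T_total * C_O * (1 - 0.9 * alpha)       # 6.0 * 0.46 = 2.76
savings = (T * C_O - C_new) / (T * C_O)         # (5 - 2.76) / 5 = 0.448
print(f"C_new = {C_new:.2f} x C_O, savings = {savings:.1%}")  # savings = 44.8%
```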

Interpretation and Practical Use 

Below are a few things to keep in mind while using this approach:

Parameter Increase | Effect
-------------------|-------
α ↑ | More Spot usage → higher savings
M ↑ | More tolerated failures → more savings potential, but higher runtime overhead
Δt ↑ | Each failure costs more time → savings drop
T ↑ | Larger jobs absorb failure overhead better (savings drop less)

This formula can be used as a design tool:

  • Choose max_failures by estimating your Spot eviction rate and acceptable runtime increase.
  • Estimate potential savings before deploying.

Multi-level fallback 

The above technique is not limited to Spot-to-On-Demand fallback; it extends to a general fallback scenario. You can define an ordered list of machine types and fall back through them sequentially based on thresholds, timeouts, and other criteria. The candidates can include multiple Spot pools, On-Demand instances, cross-region capacity, and so on, to ensure the reliability of the job.
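Extending the earlier fallback sketch, the chain can be expressed as an ordered list of resource targets tried in sequence. The tier labels and retry budgets below are illustrative, and build_trainer / find_latest_checkpoint are the assumed helpers from that sketch:

```python
# Sketch: sequential multi-level fallback over an ordered list of resource tiers.
FALLBACK_CHAIN = [
    ("spot", 5),             # try Spot first, tolerating up to 5 failures
    ("spot_other_zone", 5),  # e.g., Spot capacity in another zone/region (cluster-specific label)
    ("on_demand", 0),        # final, stable tier
]

checkpoint, result = None, None
for label, max_failures in FALLBACK_CHAIN:
    try:
        result = build_trainer(label, max_failures, resume_from=checkpoint).fit()
        break  # this tier finished the job
    except Exception:
        # Carry the newest checkpoint forward to the next tier in the chain.
        checkpoint = find_latest_checkpoint(STORAGE, f"training-{label}") or checkpoint
```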

Conclusions

  • A higher max_failures allows more Spot usage, which reduces cost, but also adds latency through accumulated overhead (M * Δt).
  • The trade-off is multiplicative: savings grow with Spot usage (α) while runtime overhead accumulates with failures (M * Δt), and the two compound in the cost formula.
  • The optimal setting for max_failures depends on your tolerance for runtime increase and your SLA requirements.

In Production Use 

  • In our use case, we found that setting max_failures between 5–10 gives a good trade-off, allowing meaningful cost reductions without severely affecting runtime.
  • For extremely cost-sensitive training, consider pushing this higher (e.g., max_failures = 20) but only if checkpointing is robust and job runtime is not critical.

Key Takeaways 

  • Fallback isn’t just a safety net—it’s a strategic design tool.

    With controlled thresholds (max_failures) and strategic checkpointing, fallback allows aggressive use of Spot resources without jeopardizing job completion.

  • There’s no one-size-fits-all for max_failures.

    Optimal settings depend on Spot market stability, model training duration, and checkpoint frequency. Fine-tuning this parameter ensures maximum savings without sacrificing reliability.

  • Even moderate Spot usage (40–60%) can deliver significant savings (30–50%).

    These savings come with minimal added runtime overhead, provided Spot interruptions are managed effectively.

  • Combining Ray’s distributed training with Spot-to-On-Demand fallback mechanisms significantly improves the resilience and cost efficiency of training jobs, allowing seamless continuity even amid frequent Spot interruptions.
  • The formula Cost = (T + M * Δt) * C_O * (1 - 0.9 * α) provides a clear method for quantifying and optimizing the balance between cost, runtime overhead, and Spot usage according to your team’s specific needs.

In closing, the fallback mechanism we’ve discussed is more than a workaround—it’s a scalable pattern for building resilient, cost-optimized ML pipelines in cloud-native environments.

If you’re training large models regularly, this design could mean the difference between sustainable experimentation and budget overruns.

References 

Handling Failures and Node Preemption — Ray 2.44.1
ray.tune.Tuner.restore — Ray 2.44.1