Flink Checkpoints — Best Practices (By FlinkPOD)

VerticalServe Blogs
5 min read · May 28, 2024


Apache Flink is a powerful stream processing framework that provides robust fault tolerance capabilities through checkpointing. Checkpointing allows Flink to recover from failures by periodically saving the state of the application. In this blog post, we will explore best practices for configuring checkpointing in Flink, including checkpoint intervals, latency, timeouts, monitoring, size management, and the choice between aligned and non-aligned checkpoints. We will also discuss the importance of exactly-once semantics and the role of two-phase commits in ensuring data consistency.

Checkpoint Interval

The checkpoint interval defines how often a checkpoint is triggered. A shorter interval means less data has to be reprocessed after a failure, providing better fault tolerance at the cost of higher checkpointing overhead.

Best Practice: Configure the checkpoint interval based on the criticality of your application and the acceptable overhead. A common starting point is 10 seconds.

env.enableCheckpointing(10000); // 10 seconds
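
If checkpoints occasionally take longer than the interval, it also helps to enforce a minimum pause between checkpoints and to cap the number of concurrent checkpoints. A minimal sketch using the CheckpointConfig API (the 5-second pause is an illustrative value, not a recommendation from this post):

env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000); // at least 5 seconds between the end of one checkpoint and the start of the next
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1); // never run more than one checkpoint at a time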

Latency

Checkpointing can introduce latency because operators must receive checkpoint barriers on all input channels and snapshot their state. Minimizing this impact is crucial for maintaining low-latency processing.

Best Practice: Use a state backend that snapshots state asynchronously, so records keep flowing while the snapshot is written to storage. (Note that CheckpointingMode only offers EXACTLY_ONCE and AT_LEAST_ONCE; asynchronous snapshots are a property of the state backend, not of the checkpointing mode.)

env.setStateBackend(new FsStateBackend("file:///checkpoints", true)); // second argument enables asynchronous snapshots

Checkpoint Timeouts

Checkpoint timeouts ensure that checkpoints do not hang indefinitely, which can lead to resource exhaustion.

Best Practice: Set a reasonable checkpoint timeout to balance between allowing enough time for completion and avoiding resource blockage. A common value is 1 minute.

env.getCheckpointConfig().setCheckpointTimeout(60000); // 1 minute

Checkpoint Monitoring

Monitoring checkpoints is essential to detect issues like frequent failures or long completion times.

Best Practice: Use Flink’s metrics and logging to monitor checkpoint performance. Set up alerts for failures and prolonged checkpoint durations.

env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
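
Checkpoint statistics (counts, durations, sizes, alignment times) are also exposed in the Flink Web UI and via the REST API. A minimal sketch that polls the REST endpoint for a job's checkpoint statistics; the host, port, and job id placeholder are illustrative assumptions for a default local setup:

String jobId = "<your-job-id>"; // placeholder: the job id shown in the Web UI or CLI
java.net.URL url = new java.net.URL("http://localhost:8081/jobs/" + jobId + "/checkpoints");
try (java.io.BufferedReader reader = new java.io.BufferedReader(new java.io.InputStreamReader(url.openStream()))) {
    reader.lines().forEach(System.out::println); // JSON with checkpoint counts, durations, and sizes
}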

Checkpoint Size

Managing checkpoint size is crucial for performance and resource utilization. Large states can lead to longer checkpointing times and increased storage costs.

Best Practice: Optimize state size by using a state backend that supports incremental checkpoints (RocksDB) and by avoiding unnecessary state.

env.setStateBackend(new RocksDBStateBackend("file:///checkpoints", true)); // second argument enables incremental checkpoints

Aligned vs. Non-Aligned Checkpoints

Aligned Checkpoints: The default. Each operator waits until it has received the checkpoint barrier on all of its input channels before snapshotting, which keeps snapshots simple and small but can add latency under backpressure, because records arriving behind the barrier are buffered during alignment.

Non-Aligned (Unaligned) Checkpoints: Barriers may overtake buffered in-flight records, and those in-flight records are stored as part of the checkpoint. This keeps checkpoints fast even under heavy backpressure, at the cost of larger checkpoints and more I/O. Both modes provide exactly-once state consistency.

Best Practice: Keep aligned checkpoints for simplicity and smaller checkpoints. Opt for unaligned checkpoints when backpressure causes long alignment times or checkpoint timeouts.

env.getCheckpointConfig().enableUnalignedCheckpoints();
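
On Flink 1.13 and later (an assumption about your version), the two modes can also be combined: a checkpoint starts out aligned and automatically switches to unaligned once alignment exceeds a configurable timeout. A sketch:

import java.time.Duration;

env.getCheckpointConfig().enableUnalignedCheckpoints();
env.getCheckpointConfig().setAlignedCheckpointTimeout(Duration.ofSeconds(30)); // fall back to unaligned after 30 seconds of alignment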

Large State Handling

Managing large state in Apache Flink can be challenging, especially when dealing with checkpoints. Checkpoints are crucial for fault tolerance as they enable Flink to recover from failures by saving the state of the application periodically. However, large state sizes can lead to issues such as increased checkpointing times, higher memory usage, and potential performance degradation. Here are some best practices to efficiently handle very large state with checkpoints in Apache Flink:

1. Use Incremental Checkpoints

Incremental Checkpoints save only the changes made to the state since the last checkpoint, rather than saving the entire state. This significantly reduces the amount of data written during each checkpoint, which can lead to faster checkpointing and reduced storage requirements.

How to Enable:

env.setStateBackend(new RocksDBStateBackend("file:///checkpoints", true)); // second argument enables incremental checkpoints
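
On newer Flink releases (1.13+, an assumption about your version), RocksDBStateBackend is superseded by EmbeddedRocksDBStateBackend, with checkpoint storage configured separately. A sketch:

env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true enables incremental checkpoints
env.getCheckpointConfig().setCheckpointStorage("file:///checkpoints");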

2. Choose the Right State Backend

Flink provides different state backends for managing state. The RocksDB state backend is particularly well-suited for large states as it stores state on disk, allowing it to handle much larger state sizes compared to in-memory state backends.

Recommended State Backend:

env.setStateBackend(new RocksDBStateBackend("file:///checkpoints"));

3. Adjust Checkpointing Interval

Setting an appropriate checkpoint interval is crucial. Too frequent checkpoints can cause high overhead, while too infrequent checkpoints can lead to longer recovery times.

Best Practice:

  • Start with a checkpoint interval of 10–15 minutes for jobs with very large state and adjust based on the application’s performance and requirements.

env.enableCheckpointing(600000); // 10 minutes

4. Optimize State Size

Minimize the state size by keeping only necessary information. Regularly clean up expired or unnecessary state data to reduce the overall state size.

Example:

  • Use state TTL (Time-to-Live) to automatically clean up old state entries, as shown below.

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.minutes(10))
    .cleanupFullSnapshot()
    .build();
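
The TTL configuration only takes effect once it is attached to a state descriptor. A minimal sketch (the "eventCount" state name is illustrative):

ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>("eventCount", Long.class);
descriptor.enableTimeToLive(ttlConfig); // expired entries are removed according to ttlConfig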

5. Use Asynchronous Checkpointing

Asynchronous checkpointing allows Flink to take checkpoints without blocking the processing of records. This reduces the latency impact on the stream processing.

Enable Asynchronous Snapshots (the RocksDB state backend always snapshots asynchronously; the filesystem backend takes a flag in its constructor):

env.setStateBackend(new FsStateBackend("file:///checkpoints", true)); // second argument enables asynchronous snapshots

6. Monitor and Tune Checkpointing Performance

Regularly monitor checkpointing performance and tune related parameters. Use Flink’s built-in metrics and external tools like Prometheus and Grafana for monitoring.

Key Metrics to Monitor:

  • Checkpoint duration
  • Checkpoint size
  • Checkpoint alignment time

Example:

env.getConfig().setLatencyTrackingInterval(1000); // Track latency every second

7. Parallelize State Access

Ensure that state access is parallelized so the load is distributed across multiple task managers. This reduces the pressure on individual nodes and improves overall performance.

Example:

  • Use keyed state to distribute state access across multiple parallel instances.
stream.keyBy(value -> value.getKey());
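
For illustration, a keyed function in which each key gets its own state instance, so state is naturally spread across the parallel subtasks; the MyEvent type and the counting logic are assumptions:

stream
    .keyBy(value -> value.getKey())
    .map(new RichMapFunction<MyEvent, Long>() {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            // Per-key running count; each subtask only holds state for its share of the keys.
            count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public Long map(MyEvent value) throws Exception {
            Long current = count.value();
            long updated = (current == null ? 0L : current) + 1;
            count.update(updated);
            return updated;
        }
    });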

8. Use Externalized Checkpoints

Externalized (retained) checkpoints are kept on durable storage such as HDFS or S3 even after the job is cancelled, so you can restart the job from its last checkpoint rather than only from a savepoint.

Enable Externalized Checkpoints:

env.getCheckpointConfig().enableExternalizedCheckpoints(
CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
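
Note that enableExternalizedCheckpoints is deprecated on newer Flink versions; there the equivalent setting (an assumption about your Flink version, roughly 1.15+) is:

env.getCheckpointConfig().setExternalizedCheckpointCleanup(
    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);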

9. Scale Resources Appropriately

Ensure that your cluster has sufficient resources (CPU, memory, and storage) to handle the large state. Scaling the resources appropriately can help in managing large state and improving checkpoint performance.

Best Practice:

  • Monitor resource usage and scale up or down based on the application’s needs.

10. Perform Regular State Backups

Regularly back up the state to ensure that you have a recovery point in case of catastrophic failures. Use Flink’s savepoint mechanism for creating state backups.

Create Savepoints:

flink savepoint :jobId :savepointDir

Exactly-Once Semantics

Exactly-once semantics ensure that each record is processed exactly once, which is crucial for applications requiring high data accuracy.

Best Practice: Use Flink’s support for exactly-once processing by enabling it in the checkpointing configuration.

env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
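
If occasional duplicates are acceptable and latency matters more, Flink also offers at-least-once checkpointing, which skips barrier alignment:

env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE); // records may be replayed after recovery, but are never lost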

Two-Phase Commit Protocol

For transactional data sources and sinks, the two-phase commit protocol ensures that changes are atomically committed, preventing data loss or duplication.

Best Practice: Implement the two-phase commit protocol in your sources and sinks to ensure atomicity.

public class TwoPhaseCommitSink extends TwoPhaseCommitSinkFunction<String, MyTransaction, Void> {
// Implement necessary methods for two-phase commit
}
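
For illustration, a fuller sketch of the hooks involved. MyTransaction and its add method are hypothetical placeholders for a transaction handle on your external system; the serializers passed to the superclass are one possible choice:

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

public class TwoPhaseCommitSink
        extends TwoPhaseCommitSinkFunction<String, MyTransaction, Void> {

    public TwoPhaseCommitSink() {
        // The base class needs serializers for the transaction handle and the (unused) context.
        super(new KryoSerializer<>(MyTransaction.class, new ExecutionConfig()), VoidSerializer.INSTANCE);
    }

    @Override
    protected MyTransaction beginTransaction() {
        return new MyTransaction(); // open a transaction on the external system
    }

    @Override
    protected void invoke(MyTransaction transaction, String value, Context context) {
        transaction.add(value); // write into the open transaction; not yet visible to readers
    }

    @Override
    protected void preCommit(MyTransaction transaction) {
        // Phase one: flush pending writes as part of the checkpoint.
    }

    @Override
    protected void commit(MyTransaction transaction) {
        // Phase two: make the writes visible once the checkpoint has completed.
    }

    @Override
    protected void abort(MyTransaction transaction) {
        // Roll back the transaction after a failure.
    }
}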

Code Example: Setting Up Checkpointing

Here’s a complete example that demonstrates setting up checkpointing with best practices:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingExample {

    public static void main(String[] args) throws Exception {
        // Set up the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Configure checkpointing
        env.enableCheckpointing(10000); // trigger a checkpoint every 10 seconds
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setCheckpointTimeout(60000); // fail a checkpoint that takes longer than 1 minute
        env.getCheckpointConfig().enableUnalignedCheckpoints();
        env.setStateBackend(new FsStateBackend("file:///checkpoints", true)); // asynchronous snapshots

        // Add source, transformations, and sink here

        // Execute program
        env.execute("Flink Checkpointing Example");
    }
}

Conclusion

Properly configuring checkpointing in Apache Flink is critical for ensuring fault tolerance and maintaining performance. By following these best practices, you can optimize your Flink applications to handle failures gracefully while minimizing overhead. Remember to monitor your checkpoints, adjust configurations based on your application’s needs, and leverage Flink’s robust features to achieve reliable stream processing.

About — Flink POD — Flink Committers & Experts

FlinkPOD — https://www.FlinkPOD.com

FlinkPOD is a specialized consulting team of VerticalServe, helping clients with Flink Architecture, Production Health Checks, Implementations etc.

VerticalServe Inc — Niche Cloud, Data & AI/ML premier consulting company, partnered with Google Cloud, Confluent, AWS, and Azure. 50+ customers and many success stories.

Website: http://www.VerticalServe.com

Contact: contact@verticalserve.com
