why Apache Spark
why Apache Spark
Apache Spark has been reported as one of the most valuable tech skills to learn by indeed.com.
Demand for Spark, Python and Big data hes exploded exponentially over the last decade.
MLOps
Model creation should be scalable, collaborative and reproducible.
The principles, tools and techniques that make models scalable, collaborative and reproducible are known as MLOps.
but why ?
Apache Kafka for Stream
Timestamps
Timestamp derive the behavior of Kafka Streams.
Timestamps are a critical component of Kafka. The Kafka message format has a dedicated timestamp field.
You can set timestamp by Producer or it could be set by brocker.
Time event
- A Producer (including Kafka Streams library) automatically sets this timestamp field if user does not.
- This is current time of Producer environment whet the event is created.
- A Broker (Apache Kafka Server) - set processing timestamp or ingestion-time.
Time concept
- Time moves forward in Kafka by these timestamps
- For windowing operations this means the timestamps govern the opening and closing of windows
- Howl long a window remains open depends on timestamps only!
- Kafka Stream has concept of Stream time
Stream Time
- Largest timestamps seen so far
- Only moves forward, never backward
- if an out-of-order event arrives, stream-time stays where it is
Event with event-time < stream-time are considered as Out-of-order
- For windowed operations, this means the event-timestamps is less than the current stream-time, but within the window time.
- Out-of-order records are accepted and processed
Late input
- The grace period, a per-window setting, defines a cut-off for out-of-order events.
- Any out-of-order events that arrive after the grace period are considered (too) late, and thus are ignored and not processed.
- The delay of an event is determined by stream-time - event-timestamp