This division stems from whether incremental training is possible using the stream of incoming data:
Data is generated in an unbounded fashion - it arises from the continuous operation of systems. Today (even during this class!) you have generated a lot of data on your phone. Won't you generate more during the next class, or tomorrow? Batch processing splits the data into chunks covering a fixed length of time and starts processing at a user-specified time.
In the previous class, we learned what a single event is. We also defined a continuous stream of events.
Let us take a few such events and arrange them along consecutive points in time.
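As a minimal illustration (the payloads and timestamps below are made up for this sketch), a stream can be modeled in Python as a sequence of records, each carrying its own event timestamp:

```python
from datetime import datetime, timedelta

# A hypothetical stream of events, each with its own event-time timestamp.
# In a real system these would arrive continuously; here we just list a few.
base = datetime(2024, 3, 1, 10, 0, 0)
events = [
    {"event_time": base + timedelta(seconds=s), "payload": f"click-{i}"}
    for i, s in enumerate([0, 3, 7, 12, 20])
]

# Arrange (sort) the events along consecutive points in time.
for event in sorted(events, key=lambda e: e["event_time"]):
    print(event["event_time"].isoformat(), event["payload"])
```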
You have already dealt with many systems that handle data streams. These are, for example:
a company is an organization that generates and responds to a continuous stream of data.
In batch processing, the source of data (but also the result of processing) is a **file**. It is written once and can then be read (or processed) by multiple processes (tasks). The file name identifies a particular set of records.
In the case of a stream, an event is generated exactly once by a so-called producer (also known as a sender or supplier). The resulting event can then be processed by many so-called consumers (recipients). Streaming events are grouped into so-called topics.
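To make the producer/consumer/topic vocabulary concrete, here is a minimal in-memory sketch (a toy, not a real broker such as Kafka; all names are invented for illustration): one producer appends events to a topic, and several independent consumers read the same events, each from its own position:

```python
class Topic:
    """A toy topic: an append-only list of events shared by all consumers."""
    def __init__(self):
        self.events = []

    def publish(self, event):          # called by the producer
        self.events.append(event)

    def read_from(self, offset):       # each consumer tracks its own offset
        return self.events[offset:]

topic = Topic()

# The producer generates each event exactly once.
for i in range(3):
    topic.publish({"id": i, "value": i * 10})

# Two independent consumers can process the same events.
consumer_a_offset = 0
consumer_b_offset = 0
print("consumer A:", topic.read_from(consumer_a_offset))
print("consumer B:", topic.read_from(consumer_b_offset))
```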
In the case of batch processing, we process historical data, and the time at which processing starts has nothing to do with the time at which the analyzed events occurred.
For streaming data, we distinguish two notions of time:
In an ideal situation, the processing time is equal to the event time, so all points lie on the diagonal.
In reality, data processing always takes place with a certain delay, which is represented by the points lying below the line of the ideal situation (below the diagonal).
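The gap between the two notions of time can be measured directly. The sketch below (with made-up timestamps) computes, for each event, the lag between its processing time and its event time; in the ideal situation every lag would be zero:

```python
from datetime import datetime, timedelta

base = datetime(2024, 3, 1, 12, 0, 0)

# (event_time, processing_time) pairs; processing always happens a bit later.
observations = [
    (base,                         base + timedelta(seconds=2)),
    (base + timedelta(seconds=5),  base + timedelta(seconds=6)),
    (base + timedelta(seconds=10), base + timedelta(seconds=45)),  # e.g. after losing connectivity
]

for event_time, processing_time in observations:
    lag = processing_time - event_time
    print(f"event at {event_time.time()} processed after {lag.total_seconds():.0f}s")
```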
In stream processing applications, the difference between the time when an event occurs and the time when it is processed turns out to be important. The most common causes of delay are data transmission over the network or a temporary loss of connectivity between the device and the network. A simple example is driving a car through a tunnel while tracking your position with a GPS application.
Of course, you can count the number of such missed events and trigger an alarm if there are too many rejects. The second (and probably more commonly used) method is the so-called *watermarking*.
The real-time event processing can be represented as a step function, shown in the figure:
As can be seen, not all events contribute to the analysis and processing. Processing with an additional allowance for late events (watermarking) can be represented as a process covering all events above the dashed line. The extra time allows additional events to be processed, but there may still be points that are not taken into account.
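A minimal sketch of the idea (the allowed lateness, alarm threshold, and timestamps are invented for illustration): the watermark trails the largest event time seen so far by a fixed allowed lateness; events older than the watermark are dropped, and we can count such rejects and raise an alarm when there are too many, as mentioned above:

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(seconds=10)   # extra time granted by the watermark
ALARM_THRESHOLD = 1                        # alert after this many dropped events

base = datetime(2024, 3, 1, 12, 0, 0)
# Events arrive out of order; each number is seconds after `base`.
arrivals = [0, 5, 3, 20, 8, 25, 2]

max_event_time = None
dropped = 0

for s in arrivals:
    event_time = base + timedelta(seconds=s)
    if max_event_time is None or event_time > max_event_time:
        max_event_time = event_time
    watermark = max_event_time - ALLOWED_LATENESS
    if event_time < watermark:
        dropped += 1                       # too late even with the extra time
        print(f"dropped event at +{s}s (watermark at {watermark.time()})")
    else:
        print(f"processed event at +{s}s")

if dropped > ALARM_THRESHOLD:
    print(f"ALARM: {dropped} late events rejected")
```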
The situations presented in the graphs clearly show why the concept of time is an important factor and requires a precise definition as early as the stage of specifying business requirements. Assigning timestamps to data (events) is a difficult task.
A **tumbling window** is a fixed-length window. Its characteristic feature is that each event belongs to exactly one window.
A **sliding window** contains all the events that occur within a certain interval of one another.
A **hopping window** also has a fixed length, but allows one window to overlap another. It is typically used to smooth the data.
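The three window types boil down to simple timestamp arithmetic. Below is a minimal Python sketch (the window length, hop interval, and reference epoch are arbitrary choices for illustration) that assigns an event to its single tumbling window, lists all the overlapping hopping windows it falls into, and checks the sliding-window condition for a pair of events:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # window length (arbitrary choice)
HOP = timedelta(minutes=1)      # hop interval between overlapping windows
EPOCH = datetime(2024, 1, 1)    # arbitrary reference point for aligning windows

def tumbling_window(ts):
    """Each event belongs to exactly one fixed-length, non-overlapping window."""
    start = EPOCH + ((ts - EPOCH) // WINDOW) * WINDOW
    return (start, start + WINDOW)

def hopping_windows(ts):
    """Fixed-length windows that overlap: one event may fall into several of them."""
    start = EPOCH + ((ts - EPOCH) // HOP) * HOP   # latest window starting at or before ts
    windows = []
    while start + WINDOW > ts and start >= EPOCH:
        windows.append((start, start + WINDOW))
        start -= HOP
    return windows

def same_sliding_window(ts_a, ts_b):
    """Sliding window: two events belong together if they occur within WINDOW of each other."""
    return abs(ts_a - ts_b) < WINDOW

event = datetime(2024, 3, 1, 10, 7, 30)
print("tumbling:", tumbling_window(event))
print("hopping: ", hopping_windows(event))
print("sliding: ", same_sliding_window(event, event + timedelta(minutes=3)))
```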