Real-Time Analytics Lectures for SGH students

Lecture 3 - Time in Stream

Types of machine learning - classic division

  1. Supervised learning
    • classification
    • regression (e.g., linear regression)
  2. Unsupervised learning
  3. Reinforcement learning

Types of machine learning - a different division

A division based on whether the model can be trained incrementally using a stream of incoming data:

  1. Batch learning - the system is trained on all stored data at once, which takes a lot of time and resources. This is often referred to as off-line processing: the system is first trained and only then deployed into the production cycle, where it is no longer trained. The model is replaced only after a new model has been trained on all the data - *this process is easy to automate*.
  2. Incremental learning, also called on-line learning - the model is trained on sequentially arriving (new) data. What should we do when the incoming data is no longer correct? (anomaly detection). A minimal sketch contrasting the two approaches follows this list.
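
The sketch below contrasts batch and incremental training. It assumes scikit-learn is available; SGDClassifier is used only because it exposes partial_fit - any incremental estimator would serve, and the synthetic data is purely illustrative.

```python
# A sketch contrasting batch (off-line) and incremental (on-line) training.
# Assumes scikit-learn; the synthetic data is purely illustrative.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Batch learning: one training run over ALL stored data, then deploy;
# the deployed model is not trained any further.
batch_model = SGDClassifier(random_state=0)
batch_model.fit(X, y)

# Incremental learning: the model is updated chunk by chunk as new data
# arrives from the stream; it never needs the full dataset at once.
online_model = SGDClassifier(random_state=0)
for start in range(0, len(X), 100):
    X_chunk, y_chunk = X[start:start + 100], y[start:start + 100]
    online_model.partial_fit(X_chunk, y_chunk, classes=np.array([0, 1]))
```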

The actual data generation process

Data is generated in an unbounded form - it appears as a result of the continuous operation of systems. Today (even during these classes!) you have generated a lot of data on your phone. Will you not generate more during the next class, or tomorrow? Batch processing splits the data into chunks of a fixed length of time and starts processing at a user-specified time.
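
As a toy illustration, the sketch below splits timestamped records into fixed-length (here: hourly) chunks, the way a scheduled batch job would; the records and chunk length are hypothetical.

```python
# A sketch of splitting timestamped records into fixed-length chunks,
# as a scheduled batch job would. Records and chunk length are hypothetical.
from collections import defaultdict
from datetime import datetime, timedelta

records = [
    (datetime(2023, 3, 1, 10, 15), "click"),
    (datetime(2023, 3, 1, 10, 45), "view"),
    (datetime(2023, 3, 1, 11, 5), "click"),
]

chunk_length = timedelta(hours=1)
chunks = defaultdict(list)
for ts, payload in records:
    # Align each timestamp to the start of its fixed-length chunk.
    chunk_start = datetime.min + ((ts - datetime.min) // chunk_length) * chunk_length
    chunks[chunk_start].append(payload)

# A batch job started at a user-specified time processes completed chunks.
for chunk_start in sorted(chunks):
    print(chunk_start, chunks[chunk_start])
```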

In the previous class, we learned what a single event is. We also defined a continuous stream of events.

Let us take a few such events and organize them into consecutive time points.

You have already dealt with many systems that handle data streams. These are, for example:

  • data warehouses
  • device operation monitoring systems (IoT)
  • transaction systems
  • website analytical systems
  • on-line advertising
  • social media
  • logging systems
  • ….

A company is an organization that generates and responds to a continuous stream of data.

In batch processing, the source (but also the result) of data processing is a **file**. It is written once and can then be read (or operated on) by multiple processes (tasks). The file name is the element that identifies a set of records.

In the case of a stream, an event is generated exactly once by the so-called producer (also known as a sender or supplier). The resulting event can be processed by many so-called consumers (recipients). Streamed events are grouped into so-called topics.
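
The sketch below illustrates this producer/consumer model with the kafka-python client; it assumes a Kafka broker running at localhost:9092, and the topic name clickstream is hypothetical.

```python
# A sketch of the producer/consumer model with the kafka-python client.
# Assumes a Kafka broker at localhost:9092; the topic "clickstream" is
# hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: the event is generated exactly once and appended to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user": 42, "action": "click"})
producer.flush()

# Consumer: many independent consumers can read the same events from
# the topic; the event itself is produced only once.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```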

Time in real-time data analysis

In the case of batch processing, we process historical data, and the start time of the processing has nothing to do with the time at which the analyzed events occurred.

For streaming data, we have two time concepts:

  1. event time - the time at which the event actually happened.
  2. processing time - the time at which the system processes the event (see the sketch below).
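
A minimal sketch of the two notions (the event timestamp is hypothetical):

```python
# A sketch distinguishing event time from processing time.
# The event timestamp is hypothetical.
from datetime import datetime, timezone

event_time = datetime(2023, 3, 1, 12, 0, 0, tzinfo=timezone.utc)  # when it happened
processing_time = datetime.now(timezone.utc)  # when the system handles it

# Ideally the delay is zero; in practice it is always positive.
print("processing delay:", processing_time - event_time)
```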

In an ideal situation, every event is processed at the very moment it occurs: on a plot of processing time against event time, all points lie on the diagonal.

In reality, data processing always takes place with a certain delay, represented by points appearing below the line for the ideal situation (below the diagonal).

In stream processing applications, the difference between the time an event occurs and the time it is processed turns out to matter. The most common causes of delay are data transmission over the network or a loss of connectivity between the device and the network. A simple example is driving a car through a tunnel while tracking your position with a GPS application.

Of course, you can count the number of such missed events and trigger an alarm if there are too many rejects. The second (and probably more common) method is a correction mechanism known as *watermarking*.

Real-time event processing can be represented as a step function, shown in the figure:

As can be seen, not all events contribute to the analysis and processing. Running the processing with extra time allowed for late-arriving events (watermarking) can be depicted as a process covering all events above the dashed line. The extra time allows additional events to be processed, but there may still be points that are not taken into account.
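
A minimal sketch of the idea behind watermarking: the watermark trails the largest event time seen so far by a fixed allowed lateness, and events that arrive behind the watermark are rejected (and can be counted, e.g., to trigger an alarm). The five-minute lateness is a hypothetical setting.

```python
# A sketch of watermarking: the watermark trails the largest event time
# seen so far by ALLOWED_LATENESS; events behind it are rejected.
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=5)  # hypothetical setting
max_event_time = datetime.min
late_events = 0  # counting rejects lets us trigger an alarm if needed

def process(event_time: datetime) -> bool:
    """Return True if the event is accepted, False if it arrived too late."""
    global max_event_time, late_events
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS
    if event_time < watermark:
        late_events += 1  # too late even with the extra allowed time
        return False
    return True

# The 11:01 event arrives after the watermark has advanced to 11:07,
# so it is rejected despite the extra time.
for ts in [datetime(2023, 3, 1, 11, 0),
           datetime(2023, 3, 1, 11, 12),
           datetime(2023, 3, 1, 11, 1)]:
    print(ts, "accepted" if process(ts) else "rejected")
```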

The situations presented in the graphs clearly show why the concept of time is an important factor that requires precise definition already at the stage of defining business needs. Timestamping data (events) is a difficult task.

Time windows

A tumbling window is a fixed-length window. Its characteristic feature is that each event belongs to exactly one window.

A sliding window contains all events that occur within a certain interval of one another.

A hopping window has a fixed length but allows one window to overlap another. It is typically used to smooth data.
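
A minimal sketch of window assignment for tumbling and hopping windows (the window length and hop interval are hypothetical; a sliding window is usually evaluated per pair of events rather than as fixed buckets, so it is omitted here):

```python
# A sketch of assigning an event timestamp to time windows.
# WINDOW and HOP are hypothetical settings (seconds).
WINDOW = 60
HOP = 15

def tumbling_window(ts: int) -> tuple[int, int]:
    """Each event belongs to exactly one fixed-length window."""
    start = ts // WINDOW * WINDOW
    return (start, start + WINDOW)

def hopping_windows(ts: int) -> list[tuple[int, int]]:
    """Fixed-length windows start every HOP seconds and overlap,
    so a single event may fall into several windows."""
    first_start = (ts - WINDOW) // HOP * HOP + HOP
    return [(s, s + WINDOW)
            for s in range(first_start, ts + 1, HOP)
            if s <= ts < s + WINDOW]

print(tumbling_window(130))   # (120, 180)
print(hopping_windows(130))   # [(75, 135), (90, 150), (105, 165), (120, 180)]
```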