Expectations vs Reality
When to make a business decision?
Extract, Transform, Load (ETL) is a basic pattern for data processing, well known in data warehousing. It is all about extracting data from a source, transforming the data (applying business rules), and at the end writing/loading everything to a target (Hadoop, a relational database, a data warehouse, etc.)
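A minimal sketch of the ETL pattern in Python. The CSV source, the business rule, and the SQLite target are illustrative assumptions, not a reference to any particular system:

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a CSV source (assumed header: name,amount)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: apply business rules, here: drop non-positive amounts
    # and normalize names to upper case
    return [
        {"name": r["name"].upper(), "amount": float(r["amount"])}
        for r in rows
        if float(r["amount"]) > 0
    ]

def load(rows, db_path="warehouse.db"):
    # Load: write the cleaned rows into the target (here: SQLite)
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:name, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))  # E -> T -> L
```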
A Big Data system can be a part of (a source for) data warehouses (e.g. a Data Lake or an Enterprise Data Hub)
But Data Warehouses are not Big Data Systems!
The ELT process is similar to ETL and involves the same stages, but the order in which they are performed is different. Extract data from one or many sources and load it into a destination system, for example a "data lake". After that, you can transform your data on demand, in a more dynamic fashion.
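The same flow reordered as ELT: the raw rows are loaded untouched, and the transformation happens later, on demand, as a query. Again a sketch, with SQLite standing in for the "data lake":

```python
import csv
import sqlite3

con = sqlite3.connect("lake.db")

# Extract + Load: land the raw data as-is, no business rules applied yet
con.execute("CREATE TABLE IF NOT EXISTS raw_sales (name TEXT, amount TEXT)")
with open("sales.csv", newline="") as f:
    con.executemany("INSERT INTO raw_sales VALUES (:name, :amount)",
                    csv.DictReader(f))
con.commit()

# Transform: done later, on demand, directly against the raw data
query = """
    SELECT upper(name), SUM(CAST(amount AS REAL))
    FROM raw_sales
    WHERE CAST(amount AS REAL) > 0
    GROUP BY upper(name)
"""
for name, total in con.execute(query):
    print(name, total)
```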
Use Case:
ELT is an emerging trend:
Difference between ETL and ELT
ELT is an evolution of ETL! It is driven by the increasing demand for access to raw data.
ETL still has its place for many applications:
Examples of raw data sources:
Data extraction techniques include:
Transformation can involve various operations, such as:
Schema-on-write is the conventional ETL approach:
Schema-on-read applies to the modern ELT approach:
Transformed data can mean lost information (the raw data is no longer available).
Find a simple MapReduce algorithm in any programming language and run it.
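For reference, a minimal word-count MapReduce in plain Python (a single-process sketch of the pattern, not a distributed implementation):

```python
from itertools import groupby

documents = ["big data", "data stream", "big stream"]

# Map: emit a (word, 1) pair for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: bring pairs with the same key together
mapped.sort(key=lambda kv: kv[0])

# Reduce: sum the counts for each word
counts = {
    word: sum(count for _, count in pairs)
    for word, pairs in groupby(mapped, key=lambda kv: kv[0])
}
print(counts)  # {'big': 2, 'data': 2, 'stream': 2}
```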
How to improve?
Check your knowledge of streaming data.
Definition - An event is anything we can observe at a certain moment in time (as in pure physics). In the case of data, an event is understood as an immutable record in the data stream, encoded as JSON, XML, CSV or binary.
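For example, an event could be encoded as a JSON record like this (the field names are illustrative):

```python
import json
from datetime import datetime, timezone

event = {
    "type": "page_view",                                  # what happened
    "user": "u-123",                                      # who triggered it
    "timestamp": datetime.now(timezone.utc).isoformat(),  # when it happened
}
print(json.dumps(event))  # an immutable record, ready for the stream
```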
Which of the following statements do not represent events?
Definition - A continuous stream of events is an infinite set of individual events arranged in time, e.g. logs from a device.
Systems generating continuous streams of events:
Streaming analytics allows you to quickly catch trends and stay ahead of competitors that use only batch processing, e.g. reacting to sudden events on the stock market or in social networks in order to prevent or minimize losses.
An enterprise is an organization that generates and responds to a continuous stream of events.
Definition - A data stream is data created incrementally over time, generated either from static data (a database, lines read from a file) or dynamically (logs, sensors, functions).
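A data stream can be sketched in Python as a generator: the consumer receives data incrementally and never sees a "whole" dataset. Both variants from the definition are shown; the sensor is simulated:

```python
import random
import time

def file_stream(path):
    # Stream generated from static data: lines read from a file
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def sensor_stream():
    # Dynamically generated stream: one simulated reading per second, forever
    while True:
        yield {"sensor": "s1", "temp": round(random.uniform(18.0, 25.0), 1)}
        time.sleep(1)

for reading in sensor_stream():
    print(reading)
    break  # the stream is unbounded; stop after one event for this demo
```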
Streaming analytics is also called event stream processing: processing large amounts of data as they are generated.
Events are generated as a direct result of an action.
Regardless of the technology used, all data is created as a continuous stream of events (user actions on the website, system logs, measurements from sensors).
Find a process that creates an entire data table at once.
An application that processes a stream of events should be able to process and save an event while at the same time accessing other data, so that it can process the event (perform any transformation on it) and save the result as local state. This state can be kept in many places, e.g. program variables, local files, or an external database. One of the best-known applications of this type is Apache Kafka, which can be combined with e.g. Apache Spark or Apache Flink.
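A minimal sketch of such stateful processing, assuming the kafka-python package, a broker at localhost:9092, and an existing topic named events. The local state here is an in-memory dict; in production it could live in a local file or an embedded database:

```python
import json
from collections import defaultdict
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "events",                            # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_deserializer=json.loads,
)

state = defaultdict(int)  # local state: number of events seen per user

for msg in consumer:
    event = msg.value
    state[event["user"]] += 1  # process the event and update local state
    print(event["user"], "->", state[event["user"]])
```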
Find information on how log-processing systems work.
Stream processing of this kind is used in three classes of applications:
Real implementations most often use several classes in one application.
All event-driven applications process data as stateful stream processing, handling events with a certain set of (mostly business) logic. These applications can trigger actions in response to the analyzed events, e.g. send an alarm as a text message or email, or save and forward the relevant event (e.g. a fraud case) to an output application (consumer).
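Continuing the sketch above, the reaction logic might look like this. The fraud rule and the send_alert function are hypothetical placeholders:

```python
def send_alert(event):
    # Placeholder: a real system would send an SMS/email here, or forward
    # the event to an output topic for a downstream consumer
    print("ALERT:", event)

def process(event, state):
    # Business logic: flag a user producing suspiciously many events
    state[event["user"]] += 1
    if state[event["user"]] > 100:  # hypothetical fraud threshold
        send_alert(event)
```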
Standard applications of this type are:
Applications of this type are treated as a natural evolution of microservices. They communicate with each other using REST requests (be sure to check where you use them!) and exchange simple data, typically in the form of JSON.
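For illustration, one microservice calling another over REST with a JSON payload, using the requests library. The URL and the payload are made up:

```python
import requests  # pip install requests

# A synchronous REST call: the caller blocks until the response arrives
response = requests.post(
    "http://orders-service.local/api/orders",  # hypothetical endpoint
    json={"item": "book", "quantity": 2},      # payload serialized as JSON
    timeout=5,
)
response.raise_for_status()
print(response.json())
```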
What is REST?
Log-based communication (usually asynchronous) distinguishes the sending application (producer) from the consuming application (consumer).
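The producer side of log-based communication, again assuming kafka-python and a local broker. Note that the send is asynchronous: it appends to the log and returns immediately, and the consumer reads the log at its own pace:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Asynchronous: send() appends the event to the log and returns a future;
# the producer never waits for any consumer
producer.send("events", {"type": "page_view", "user": "u-123"})
producer.flush()  # block only until the broker has accepted the message
```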
What is asynchronous communication?
The data generated by companies is most often stored in many different formats and data structures, which results from the use of various systems, e.g. relational and non-relational databases, event logs, file systems (e.g. Hadoop), in-memory storage, etc. Recording the same data in different systems increases its availability and ease of processing.
However, this approach requires that these systems are always synchronized with each other (contain the same data).
The traditional approach to data synchronization is to run ETL jobs periodically. However, these very often fail to meet performance and quality requirements (processing time). Alternatively, you can use the information distributed in the event logs. In this situation, data aggregation or normalization can be performed as each event is processed.
We call these types of applications, which achieve very high performance (low latency), data pipelines.
In general, these are applications that read large amounts of data from various input applications and process everything in a very short time (e.g. Apache Kafka, Apache Flink).
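The shape of such a data pipeline can be sketched with chained Python generators: each stage consumes events from the previous one and passes them on immediately. The stages are illustrative:

```python
def parse(lines):
    # Stage 1: decode raw input records
    for line in lines:
        yield line.split(",")

def normalize(records):
    # Stage 2: normalization done per event, not in bulk
    for rec in records:
        yield {"name": rec[0].strip().upper(), "value": float(rec[1])}

def sink(events):
    # Stage 3: write each event onward as soon as it arrives
    for event in events:
        print("->", event)

raw = ["alice, 1.5", "bob, 2.0"]  # stand-in for an input application
sink(normalize(parse(raw)))
```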
ETL jobs periodically import data into the database, where it is processed by scheduled queries. Regardless of the data warehouse architecture (or components such as Hadoop), this processing is 'batch'. Analytical (OLAP) processes are performed in the warehouse, which is topped up as described above. Although periodic data loading (i.e. the entire ETL process) is something of an 'art', it introduces considerable delays into the analytical process. Depending on the data upload schedule, the next data point may appear after an hour or even several days. Even with continuous loading of the warehouse (a data pipeline application), there is still a significant delay before such a database can be queried analytically. While such delays were acceptable in the past, today's applications must collect, process and analyze data in real time (personalizing user behavior on websites or dynamically changing conditions in mobile games).
Instead of waiting for periodic triggers, a streaming application continuously ingests event streams and updates its results to include the latest events with low latency. Streaming applications store their results in an external datastore that supports efficient updates (a high-performance database or a key-value store). The results of a streaming application, refreshed live, can feed various kinds of charts (displayed live).
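A sketch of that update loop, assuming the redis-py package and a local Redis server as the external key-value store; the keys and the event format are illustrative:

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def update(event):
    # Update the stored result incrementally for the latest event,
    # instead of recomputing everything on a periodic trigger
    r.incrbyfloat(f"temp_sum:{event['sensor']}", event["temp"])
    r.incr(f"temp_count:{event['sensor']}")

# for event in sensor_stream():  # e.g. the generator from the earlier sketch
#     update(event)              # results stay queryable live, at low latency
```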
Streaming Analytics Applications are used in: