Expectations vs Reality
When to make a business decision?
Extract, Transform, Load (ETL) is a basic pattern for data processing, well known in data warehousing. It is all about extracting data from a source, transforming the data (applying business rules), and at the end writing/loading everything to a target (Hadoop, a relational database, a data warehouse, etc.)
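A minimal sketch of the ETL pattern in Python. The CSV source, the business rule, and the SQLite target are illustrative assumptions, not a reference to any particular system:

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a CSV source (assumed header: name,amount)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: apply business rules, here: drop non-positive amounts
    # and normalize names to upper case
    return [
        {"name": r["name"].upper(), "amount": float(r["amount"])}
        for r in rows
        if float(r["amount"]) > 0
    ]

def load(rows, db_path="warehouse.db"):
    # Load: write the cleaned rows into the target (here: SQLite)
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:name, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))  # E -> T -> L
```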
A Big Data system can be a part of (a source for) data warehouses (e.g. a Data Lake or an Enterprise Data Hub)
But Data Warehouses are not Big Data Systems!
The ELT process is similar to ETL and involves the same stages, but the order in which they are performed is different. Extract data from one or many sources and load it into a destination system, for example a "data lake". After that, you can transform your data on demand, in a more dynamic fashion.
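The same flow reordered as ELT: the raw rows are loaded untouched, and the transformation happens later, on demand, as a query. Again a sketch, with SQLite standing in for the "data lake":

```python
import csv
import sqlite3

con = sqlite3.connect("lake.db")

# Extract + Load: land the raw data as-is, no business rules applied yet
con.execute("CREATE TABLE IF NOT EXISTS raw_sales (name TEXT, amount TEXT)")
with open("sales.csv", newline="") as f:
    con.executemany("INSERT INTO raw_sales VALUES (:name, :amount)",
                    csv.DictReader(f))
con.commit()

# Transform: done later, on demand, directly against the raw data
query = """
    SELECT upper(name), SUM(CAST(amount AS REAL))
    FROM raw_sales
    WHERE CAST(amount AS REAL) > 0
    GROUP BY upper(name)
"""
for name, total in con.execute(query):
    print(name, total)
```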
Use Case:
ELT is an emerging trend:
Difference between ETL and ELT
ELT is an evolution of ETL! It is driven by the increasing demand for access to raw data.
ETL still has its place for many applications:
Examples of raw data sources:
Data extraction techniques include:
Transformation can involve various operations, such as:
Schema-on-write is the conventional ETL approach:
Schema-on-read applies to the modern ELT approach:
Transformed data can mean lost information (the raw data is no longer available).
Find a simple MapReduce algorithm in any programming language and run it.
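For reference, a minimal word-count MapReduce in plain Python (a single-process sketch of the pattern, not a distributed implementation):

```python
from itertools import groupby

documents = ["big data", "data stream", "big stream"]

# Map: emit a (word, 1) pair for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: bring pairs with the same key together
mapped.sort(key=lambda kv: kv[0])

# Reduce: sum the counts for each word
counts = {
    word: sum(count for _, count in pairs)
    for word, pairs in groupby(mapped, key=lambda kv: kv[0])
}
print(counts)  # {'big': 2, 'data': 2, 'stream': 2}
```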
How to improve?
Check your knowledge of streaming data.
Definition - An event is anything we can observe at a certain moment in time (as in pure physics). In the case of data, an event is understood as an immutable record in the data stream, encoded as JSON, XML, CSV or binary.
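For example, an event could be encoded as a JSON record like this (the field names are illustrative):

```python
import json
from datetime import datetime, timezone

event = {
    "type": "page_view",                                  # what happened
    "user": "u-123",                                      # who triggered it
    "timestamp": datetime.now(timezone.utc).isoformat(),  # when it happened
}
print(json.dumps(event))  # an immutable record, ready for the stream
```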
Which of the following statements do not represent events?
Definition - A continuous stream of events is an infinite set of individual events arranged in time, e.g. logs from a device.
Systems generating continuous streams of events:
Streaming analytics allows you to quickly catch trends and stay ahead of competitors that use only batch processing, e.g. reacting to sudden events on the stock market or in social networks in order to prevent or minimize losses.
An enterprise is an organization that generates and responds to a continuous stream of events.
Definition - A data stream is data created incrementally over time, generated either from static data (a database, lines read from a file) or dynamically (logs, sensors, functions).
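A data stream can be sketched in Python as a generator: the consumer receives data incrementally and never sees a "whole" dataset. Both variants from the definition are shown; the sensor is simulated:

```python
import random
import time

def file_stream(path):
    # Stream generated from static data: lines read from a file
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def sensor_stream():
    # Dynamically generated stream: one simulated reading per second, forever
    while True:
        yield {"sensor": "s1", "temp": round(random.uniform(18.0, 25.0), 1)}
        time.sleep(1)

for reading in sensor_stream():
    print(reading)
    break  # the stream is unbounded; stop after one event for this demo
```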
Streaming analytics is also called event stream processing: processing large amounts of data as they are generated.
Events are generated as a direct result of an action.
Regardless of the technology used, all data is created as a continuous stream of events (user actions on the website, system logs, measurements from sensors).
Find a process that creates an entire data table at once.
An application that processes a stream of events should be able to process and save an event while at the same time accessing other data, so that it can process the event (perform any transformation on it) and save the result as local state. This state can be kept in many places, e.g. program variables, local files, or an external database. One of the best-known applications of this type is Apache Kafka, which can be combined with e.g. Apache Spark or Apache Flink.
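A minimal sketch of such stateful processing, assuming the kafka-python package, a broker at localhost:9092, and an existing topic named events. The local state here is an in-memory dict; in production it could live in a local file or an embedded database:

```python
import json
from collections import defaultdict
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "events",                            # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_deserializer=json.loads,
)

state = defaultdict(int)  # local state: number of events seen per user

for msg in consumer:
    event = msg.value
    state[event["user"]] += 1  # process the event and update local state
    print(event["user"], "->", state[event["user"]])
```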
Find information on how log-processing systems work.
Stream processing of this kind is used in three classes of applications:
Real implementations most often use several classes in one application.
All event-driven applications process data as stateful stream processing, handling events with a certain set of (mostly business) logic. These applications can trigger actions in response to the analyzed events, e.g. send an alarm as a text message or email, or save and forward the relevant event (e.g. a fraud case) to an output application (consumer).
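Continuing the sketch above, the reaction logic might look like this. The fraud rule and the send_alert function are hypothetical placeholders:

```python
def send_alert(event):
    # Placeholder: a real system would send an SMS/email here, or forward
    # the event to an output topic for a downstream consumer
    print("ALERT:", event)

def process(event, state):
    # Business logic: flag a user producing suspiciously many events
    state[event["user"]] += 1
    if state[event["user"]] > 100:  # hypothetical fraud threshold
        send_alert(event)
```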
Standard applications of this type are:
Applications of this type are treated as a natural evolution of microservices. They communicate with each other using REST requests (be sure to check where you use them!) and exchange simple data, typically in the form of JSON.
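For illustration, one microservice calling another over REST with a JSON payload, using the requests library. The URL and the payload are made up:

```python
import requests  # pip install requests

# A synchronous REST call: the caller blocks until the response arrives
response = requests.post(
    "http://orders-service.local/api/orders",  # hypothetical endpoint
    json={"item": "book", "quantity": 2},      # payload serialized as JSON
    timeout=5,
)
response.raise_for_status()
print(response.json())
```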
What is REST?
Log-based communication (usually asynchronous) distinguishes the sending application (producer) from the consuming application (consumer).
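The producer side of log-based communication, again assuming kafka-python and a local broker. Note that the send is asynchronous: it appends to the log and returns immediately, and the consumer reads the log at its own pace:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Asynchronous: send() appends the event to the log and returns a future;
# the producer never waits for any consumer
producer.send("events", {"type": "page_view", "user": "u-123"})
producer.flush()  # block only until the broker has accepted the message
```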
What is asynchronous communication?
The data generated by companies is most often stored in many different formats and data structures, which results from the use of various systems, e.g. relational and non-relational databases, event logs, file systems (e.g. Hadoop), in-memory storage, etc. Recording the same data in different systems increases its availability and ease of processing.
However, this approach requires that these systems are always synchronized with each other (contain the same data).
The traditional approach to data synchronization is to run ETL jobs periodically. However, these very often fail to meet performance and quality requirements (processing time). Alternatively, you can use the information distributed in the event logs. In this situation, data aggregation or normalization can be performed as each event is processed.
We call these types of applications, which achieve very high performance (low latency), data pipelines.
In general, these are applications that read large amounts of data from various input applications and process everything in a very short time (e.g. Apache Kafka, Apache Flink).
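The shape of such a data pipeline can be sketched with chained Python generators: each stage consumes events from the previous one and passes them on immediately. The stages are illustrative:

```python
def parse(lines):
    # Stage 1: decode raw input records
    for line in lines:
        yield line.split(",")

def normalize(records):
    # Stage 2: normalization done per event, not in bulk
    for rec in records:
        yield {"name": rec[0].strip().upper(), "value": float(rec[1])}

def sink(events):
    # Stage 3: write each event onward as soon as it arrives
    for event in events:
        print("->", event)

raw = ["alice, 1.5", "bob, 2.0"]  # stand-in for an input application
sink(normalize(parse(raw)))
```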
ETL jobs periodically import data into the database, where it is processed by scheduled queries. Regardless of the data warehouse architecture (or components such as Hadoop), this processing is 'batch'. Analytical (OLAP) processes are performed in the warehouse, which is topped up as described above. Although periodic data loading (i.e. the entire ETL process) is something of an 'art', it introduces considerable delays into the analytical process. Depending on the data upload schedule, the next data point may appear after an hour or even several days. Even with continuous loading of the warehouse (a data pipeline application), there is still a significant delay before such a database can be queried analytically. While such delays were acceptable in the past, today's applications must collect, process and analyze data in real time (personalizing user behavior on websites or dynamically changing conditions in mobile games).
Instead of waiting for periodic triggers, a streaming application continuously ingests event streams and updates its results to include the latest events with low latency. Streaming applications store their results in an external datastore that supports efficient updates (a high-performance database or a key-value store). The results of a streaming application, refreshed live, can feed various kinds of charts (displayed live).
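A sketch of that update loop, assuming the redis-py package and a local Redis server as the external key-value store; the keys and the event format are illustrative:

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def update(event):
    # Update the stored result incrementally for the latest event,
    # instead of recomputing everything on a periodic trigger
    r.incrbyfloat(f"temp_sum:{event['sensor']}", event["temp"])
    r.incr(f"temp_count:{event['sensor']}")

# for event in sensor_stream():  # e.g. the generator from the earlier sketch
#     update(event)              # results stay queryable live, at low latency
```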
Streaming Analytics Applications are used in: