Data Processing and Storage Methods
The course focuses on how large-scale data processing systems are built and on the existing tools in this area. Its objective is to explain the internal design of existing solutions, how they work, and where they are applicable; to highlight their strengths and weaknesses; and to teach practical skills for analysing large volumes of data.
The course covers technologies such as HDFS, Hadoop MapReduce, HBase, Cassandra, Spark, Kafka, Spark Streaming, and Storm. The order of presentation follows the history of data processing technology. We begin with HDFS and MapReduce, identifying the key architectural decisions behind these systems and the limits of their applicability (including lessons from 10 years of production use). Next we turn to data storage, examining the different trade-offs made when building key-value systems, using HBase and Cassandra as examples. From there we gradually move on to Spark, a modern cluster computing system. We will learn the fundamental differences between batch and stream processing, studying Kafka, a low-latency data delivery bus, alongside Storm and Spark Streaming, two stream computing systems. In addition, we will look at supporting technologies (ZooKeeper, Hive) that often make application development easier.
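To make the batch-versus-stream distinction concrete, here is a minimal word-count sketch in Spark. It is not taken from the course materials, and the input path, host, and port are placeholder assumptions.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WordCountBatchVsStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // Batch: the input is a bounded dataset; the job reads it once,
        // produces a complete answer, and finishes.
        sc.textFile("input.txt")                // placeholder input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .take(10)
          .foreach(println)

        // Streaming: the input is unbounded; Spark Streaming slices it into
        // micro-batches (5-second intervals here) and keeps emitting updated
        // counts until the application is stopped.
        val ssc = new StreamingContext(sc, Seconds(5))
        ssc.socketTextStream("localhost", 9999) // e.g. fed by `nc -lk 9999`
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

The batch half computes one final result over a fixed file, while the streaming half recomputes counts for each new five-second slice of data; in a production pipeline, Kafka would typically replace the raw socket as the delivery bus.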
The practical part of the course consists of several assignments united by a single cross-cutting model business problem. Your main objective will be to build a data processing pipeline for a model social network: computing audience statistics, collecting and storing user profiles, and serving analysts' ad-hoc queries. The grade for the practical part is based on the correctness and robustness of your solution.
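As an illustration of the kind of step such a pipeline might contain, here is a small Spark SQL sketch that computes daily unique visitors per page. The event schema (userId, pageId, ts) and the input path are hypothetical, not the actual assignment data.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    // One plausible audience-statistics step: given a log of page-view
    // events, compute the number of distinct daily visitors per page.
    object AudienceStats {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("AudienceStats")
          .master("local[*]")
          .getOrCreate()

        // Hypothetical schema: userId, pageId, ts (ISO-8601 timestamp).
        val events = spark.read
          .option("header", "true")
          .csv("events.csv")                  // placeholder input path

        events
          .withColumn("day", to_date(col("ts")))
          .groupBy(col("pageId"), col("day"))
          .agg(countDistinct(col("userId")).as("unique_visitors"))
          .orderBy(col("day"), col("pageId"))
          .show()

        spark.stop()
      }
    }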