Big Data Hadoop Developer

Course number: CGIBDHD40 - 5 Days (weekdays or on-demand)

Hadoop is an open source Apache project for storing and processing Big Data. It stores data in a distributed, fault-tolerant manner on commodity hardware using HDFS (Hadoop Distributed File System). Hadoop tools then process the data stored in HDFS in parallel.

As organizations have realized the benefits of Big Data analytics, demand for Big Data & Hadoop professionals has grown. Companies are looking for Big Data & Hadoop experts who know the Hadoop Ecosystem and the best practices for HDFS, MapReduce, Spark, HBase, Hive, Pig, Oozie, Sqoop & Flume.

This Hadoop training is designed to help you become a certified Big Data practitioner through extensive hands-on work with the Hadoop Ecosystem. The Hadoop developer certification training is a stepping stone in your Big Data journey, and you will get the opportunity to work on various Big Data projects.

Mastering Hadoop and related tools: The course provides you with an in-depth understanding of the Hadoop framework, including HDFS, YARN, and MapReduce. You will learn to use Pig, Hive, and Impala to process and analyze large datasets stored in HDFS, and to use Sqoop and Flume for data ingestion.

Mastering real-time data processing using Spark: You will learn to do functional programming in Spark, implement Spark applications, understand parallel processing in Spark, and use Spark RDD optimization techniques. You will also learn various interactive algorithms in Spark and use Spark SQL for creating, transforming, and querying data forms.

As part of the course, you will be required to execute real-life, industry-based projects using CloudLab. The projects cover the domains of Banking, Telecommunication, Social Media, Insurance, and E-commerce. This Big Data course also prepares you for the Cloudera CCA175 certification.

Objectives

Big Data Hadoop Certification Training is designed by industry experts to help you become a Certified Big Data Practitioner. The Big Data Hadoop course offers:

In-depth knowledge of Big Data and Hadoop including HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator) & MapReduce
Comprehensive knowledge of the tools in the Hadoop Ecosystem, such as Pig, Hive, Sqoop, Flume, Oozie, and HBase
The capability to ingest data into HDFS using Sqoop & Flume, and to analyze the large datasets stored in HDFS
Exposure to many real-world, industry-based projects, which will be executed in Edureka’s CloudLab
Projects which are diverse in nature covering various data sets from multiple domains such as banking, telecommunication, social media, insurance, and e-commerce
Rigorous involvement of a Hadoop expert throughout the Big Data Hadoop Training to learn industry standards and best practices

Prerequisites

There are no prerequisites for this course. Prior knowledge of Core Java and SQL is helpful but not mandatory.

Target Audience

The market for Big Data analytics is growing across the world, and this strong growth translates into a great opportunity for IT professionals. Hiring managers are looking for certified Big Data Hadoop professionals. Our Big Data & Hadoop Certification Training helps you take advantage of this opportunity and accelerate your career. The course can be pursued by current professionals as well as those who are new to the industry. It is best suited for:

Software Developers
Project Managers
Software Architects
ETL and Data Warehousing Professionals
Data Engineers
Data Analysts & Business Intelligence Professionals
DBAs and DB professionals
Senior IT Professionals
Testing professionals
Mainframe professionals
Graduates looking to build a career in the Big Data Field

Certification

Big Data Hadoop Developer by Cloudera

Exam

Cloudera CCA175 - Big Data

Accreditation

After completing the class, students can sit for the Cloudera CCA175 - Big Data exam.
Students will also receive a “Certificate of Completion” from COMNet Group Inc.

Course Outline

Lesson 1: Understanding Big Data and Hadoop

In this lesson, you will learn what Big Data is, the limitations of traditional solutions for Big Data problems, how Hadoop solves those problems, the Hadoop Ecosystem, Hadoop Architecture, HDFS, the anatomy of a file read and write, and how MapReduce works.

Topics:

Introduction to Big Data & Big Data Challenges
Limitations & Solutions of Big Data Architecture
Hadoop & its Features
Hadoop Ecosystem
Hadoop 2.x Core Components
Hadoop Storage: HDFS (Hadoop Distributed File System)
Hadoop Processing: MapReduce Framework
Different Hadoop Distributions

Lesson 2: Hadoop Architecture and HDFS

In this lesson, you will learn Hadoop Cluster Architecture, the important configuration files of a Hadoop Cluster, Data Loading Techniques using Sqoop & Flume, and how to set up Single Node and Multi-Node Hadoop Clusters.

Topics:

Hadoop 2.x Cluster Architecture
Federation and High Availability Architecture
Hadoop Cluster Modes
Common Hadoop Shell Commands
Hadoop 2.x Configuration Files
Single Node Cluster & Multi-Node Cluster set up
Basic Hadoop Administration

Lesson 3: Hadoop MapReduce Framework

In this lesson, you will gain a comprehensive understanding of the Hadoop MapReduce framework and how MapReduce works on data stored in HDFS. You will also learn advanced MapReduce concepts such as Input Splits, Combiner & Partitioner.

Topics:

Traditional way vs MapReduce way
Why MapReduce
YARN Components
YARN Architecture
YARN MapReduce Application Execution Flow
YARN Workflow
Anatomy of MapReduce Program
Input Splits, Relation between Input Splits and HDFS Blocks
MapReduce: Combiner & Partitioner
Demo of Health Care Dataset
Demo of Weather Dataset
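
To make the roles of input splits, the Combiner, and the Partitioner concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and does not use the Hadoop API. The function names (map_phase, combine, partition, run_job), the toy hash, and the sample splits are all assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(split):
    """Mapper: emit (word, 1) for every word in one input split."""
    for line in split:
        for word in line.lower().split():
            yield word, 1

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output locally, before the shuffle."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def partition(key, num_reducers):
    """Partitioner: pick the reducer for a key (deterministic toy hash, an assumption)."""
    return sum(key.encode("utf-8")) % num_reducers

def run_job(splits, num_reducers=2):
    # Shuffle: route each combined pair to its reducer's bucket, grouped by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in combine(map_phase(split)):
            buckets[partition(key, num_reducers)][key].append(value)
    # Reduce: sum the grouped values for every key in each bucket.
    return [{key: sum(values) for key, values in bucket.items()} for bucket in buckets]

# Two hypothetical input splits; each one is processed by its own mapper.
splits = [["big data big"], ["data hadoop", "big hadoop"]]
results = run_job(splits)
totals = {key: count for part in results for key, count in part.items()}
print(totals)
```

Because the partitioner depends only on the key, every occurrence of a word reaches the same reducer, so each word appears in exactly one reducer's output.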

Lesson 4: Advanced Hadoop MapReduce

In this lesson, you will learn advanced MapReduce concepts such as Counters, Distributed Cache, MRUnit, Reduce Join, Custom Input Format, Sequence Input Format, and XML parsing.

Topics:

Counters
Distributed Cache
MRUnit
Reduce Join
Custom Input Format
Sequence Input Format
XML file Parsing using MapReduce
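
The idea behind a Reduce Join (records from two sources are tagged in the map phase, grouped by the join key, and combined in the reduce phase) can be sketched in plain Python, with a counter tracking what happened along the way. This is a conceptual sketch rather than Hadoop code. The customer and order records, the tag letters, and the counter names are hypothetical.

```python
from collections import Counter, defaultdict

def reduce_side_join(customers, orders):
    """Inner-join customers and orders on customer id, in the style of a reduce-side join."""
    counters = Counter()
    grouped = defaultdict(list)

    # Map phase: tag every record with its source so the reducer can tell them apart.
    for cust_id, name in customers:
        grouped[cust_id].append(("C", name))
    for cust_id, amount in orders:
        grouped[cust_id].append(("O", amount))

    # Reduce phase: for each join key, pair customer records with order records.
    joined = []
    for cust_id, tagged in grouped.items():
        names = [value for tag, value in tagged if tag == "C"]
        amounts = [value for tag, value in tagged if tag == "O"]
        if not names:
            # Analogous to a job counter: count orders that have no matching customer.
            counters["orders_without_customer"] += len(amounts)
            continue
        for name in names:
            for amount in amounts:
                joined.append((cust_id, name, amount))
                counters["joined_records"] += 1
    return joined, counters

# Hypothetical sample records, invented for this sketch.
customers = [(1, "Asha"), (2, "Ben")]
orders = [(1, 250), (1, 100), (3, 75)]
joined, counters = reduce_side_join(customers, orders)
print(joined)
print(dict(counters))
```

A customer without orders produces no output rows here, which is the behavior of an inner join.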

Lesson 5: Apache Pig

In this lesson, you will learn Apache Pig, the types of use cases where Pig can be used, the tight coupling between Pig and MapReduce, Pig Latin scripting, Pig running modes, Pig UDF, Pig Streaming & testing Pig scripts. You will also work on a healthcare dataset.

Topics:

Introduction to Apache Pig
MapReduce vs Pig
Pig Components & Pig Execution
Pig Data Types & Data Models in Pig
Pig Latin Programs
Shell and Utility Commands
Pig UDF & Pig Streaming
Aviation use case in Pig
Pig Demo of Healthcare Dataset

Lesson 6: Apache Hive

This lesson will help you understand Hive concepts, Hive data types, loading and querying data in Hive, running Hive scripts, and Hive UDF.

Topics:

Introduction to Apache Hive
Hive vs Pig
Hive Architecture and Components
Hive Metastore
Limitations of Hive
Comparison with Traditional Database
Hive Data Types and Data Models
Hive Partition
Hive Bucketing
Hive Tables (Managed Tables and External Tables)
Importing Data
Querying Data & Managing Outputs
Hive Script & Hive UDF
Retail use case in Hive
Hive Demo on Healthcare Dataset

Lesson 7: Advanced Apache Hive and HBase

In this lesson, you will understand advanced Apache Hive concepts such as UDF, Dynamic Partitioning, Hive indexes and views, and optimizations in Hive. You will also acquire in-depth knowledge of Apache HBase, HBase Architecture, HBase running modes, and HBase components.

Topics:

Hive QL: Joining Tables, Dynamic Partitioning
Custom MapReduce Scripts
Hive Indexes and views
Hive Query Optimizers
Hive Thrift Server
Hive UDF
Apache HBase: Introduction to NoSQL Databases and HBase
HBase vs RDBMS
HBase Components
HBase Architecture
HBase Run Modes
HBase Configuration
HBase Cluster Deployment

Lesson 8: Advanced Apache HBase

This lesson will cover advanced Apache HBase concepts. We will see demos on HBase Bulk Loading & HBase Filters. You will also learn what ZooKeeper is, how it helps in monitoring a cluster, and why HBase uses ZooKeeper.

Topics:

HBase Data Model
HBase Client API
Hive Data Loading Techniques
Apache ZooKeeper Introduction
ZooKeeper Data Model
ZooKeeper Service
HBase Bulk Loading
Getting and Inserting Data
HBase Filters

Lesson 9: Processing Distributed Data with Apache Spark

In this lesson, you will learn about Apache Spark, SparkContext & the Spark Ecosystem. You will learn how to work with Resilient Distributed Datasets (RDD) in Apache Spark. You will run applications on a Spark Cluster & compare the performance of MapReduce and Spark.

Topics:

What is Spark
Spark Ecosystem
Spark Components
What is Scala
Why Scala
SparkContext
Spark RDD

Lesson 10: Oozie and Hadoop Project

In this lesson, you will learn how multiple Hadoop ecosystem components work together to solve Big Data problems. This lesson will also cover a Flume & Sqoop demo, the Apache Oozie Workflow Scheduler for Hadoop jobs, and Hadoop Talend integration.

Topics:

Oozie
Oozie Components
Oozie Workflow
Scheduling Jobs with Oozie Scheduler
Demo of Oozie Workflow
Oozie Coordinator
Oozie Commands
Oozie Web Console
Oozie for MapReduce
Combining flow of MapReduce Jobs
Hive in Oozie
Hadoop Project Demo
Hadoop Talend Integration

Lesson 11: Certification Project

1) Analysis of an Online Book Store

Find out the frequency of books published each year. (Hint: Sample dataset will be provided)
Find out in which year the maximum number of books were published.
Find out how many books were published based on ranking in the year.

Sample Dataset Description

The Book-Crossing dataset consists of 3 tables that will be provided to you.
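
As a rough illustration of the first two tasks, the following plain-Python sketch counts books per publication year and picks the year with the most books. The column names (isbn, title, year), the semicolon delimiter, and the sample rows are assumptions made up for this sketch; the real layout of the provided tables may differ, and in the course the same logic would be written with Hadoop tools rather than in a single process.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for the books table; not taken from the real dataset.
SAMPLE = """isbn;title;year
0001;Title A;1999
0002;Title B;2001
0003;Title C;2001
0004;Title D;not-a-year
"""

def books_per_year(lines):
    """Count books per publication year, skipping rows whose year is not numeric."""
    counts = Counter()
    for row in csv.DictReader(lines, delimiter=";"):
        year = row["year"].strip()
        if year.isdigit():
            counts[int(year)] += 1
    return counts

counts = books_per_year(io.StringIO(SAMPLE))
# Year with the maximum number of published books in the sample.
busiest_year, busiest_count = counts.most_common(1)[0]
print(dict(counts), busiest_year, busiest_count)
```

Skipping malformed years rather than failing on them is a design choice for this sketch; a real job might instead record such rows in a counter for later inspection.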

2) Airlines Analysis

Find the list of Airports operating in the country of India.
Find the list of Airlines having zero stops.
Find the list of Airlines operating with code share.
Which country (or) territory has the highest number of Airports?
Find the list of Active Airlines in the United States.

Sample Dataset Description

In this use case, there are 3 data sets: Final airlines, routes.dat, and airports_mod.dat.

Available Formats

Live Online
