Apache Hadoop: Manipulation and Transformation of Data Performance Training Course

Course Code

ApHadm1

Duration

21 hours (usually 3 days including breaks)

Requirements

Attendees do not need any specific prior skills. The training focuses on end-user skills for both administering Apache Hadoop and manipulating data in it.

Overview


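This course is intended for developers, architects, data scientists, or anyone who needs to access data intensively or on a regular basis.

The main focus of the course is data manipulation and transformation.

Among the tools of the Hadoop ecosystem, the course covers Pig and Hive, both of which are widely used for data transformation and manipulation.

The training also addresses performance metrics and performance optimization.

The course is entirely hands-on and is punctuated by presentations of the theoretical aspects.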

Course Outline

1.1 Hadoop Concepts

1.1.1 HDFS

  • The Design of HDFS
  • Command line interface
  • Hadoop File System

1.1.2 Clusters

  • Anatomy of a cluster
  • Master node / Slave node
  • NameNode / DataNode

1.2 Data Manipulation

1.2.1 MapReduce in Detail

  • Map phase
  • Reduce phase
  • Shuffle
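
As a rough illustration of how the three phases listed above fit together, the following minimal sketch simulates a word count in plain Python. It runs in memory on one machine and does not use any Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (word, 1) pair per word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped}


def word_count(records):
    """Chain the three phases; illustrative only, not a distributed job."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

In a real cluster the map and reduce functions run on different nodes and the framework performs the shuffle between them. Here the shuffle is an ordinary dictionary grouping.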

1.2.2 Analytics with MapReduce

  • Group-By with MapReduce
  • Frequency distributions and sorting with MapReduce
  • Plotting results (gnuplot)
  • Histograms with MapReduce
  • Scatter plots with MapReduce
  • Parsing complex datasets
  • Counting with MapReduce and Combiners
  • Build reports

1.2.3 Data Cleansing

  • Document Cleaning
  • Fuzzy string search
  • Record linkage / data deduplication
  • Transform and sort event dates
  • Validate source reliability
  • Trim outliers

1.2.4 Extracting and Transforming Data

  • Transforming logs
  • Using Apache Pig to filter
  • Using Apache Pig to sort
  • Using Apache Pig to sessionize

1.2.5 Advanced Joins

  • Joining data in the Mapper using MapReduce
  • Joining data using Apache Pig replicated join
  • Joining sorted data using Apache Pig merge join
  • Joining skewed data using Apache Pig skewed join
  • Using a map-side join in Apache Hive
  • Using optimized full outer joins in Apache Hive
  • Joining data using an external key value store
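
The idea shared by the mapper-side and replicated joins listed above is that the smaller dataset is loaded fully into memory while the larger one is streamed past it, so no reduce step is needed. The sketch below shows that idea in plain Python. It is not Pig or Hive syntax, and the function name `replicated_join` and the sample rows are assumptions made for illustration.

```python
def replicated_join(large_rows, small_rows):
    """Inner join on the first field of each row.

    small_rows is assumed to fit in memory, as the replicated side
    of a map-side join must. large_rows is only iterated once.
    """
    # Build an in-memory lookup from the small side (one key may have many values).
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    # Stream the large side and emit one joined row per matching small-side value.
    for key, value in large_rows:
        for match in lookup.get(key, []):
            yield key, value, match


if __name__ == "__main__":
    events = [(1, "click"), (2, "view"), (3, "click")]  # large side (hypothetical)
    users = [(1, "alice"), (2, "bob")]                  # small side (hypothetical)
    for row in replicated_join(events, users):
        print(row)
```

Rows of the large side with no match in the lookup are dropped, which is inner-join behaviour. An outer join would emit them with a placeholder instead.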

1.3 Performance Diagnosis and Optimization Techniques

  • Map
    • Investigating spikes in input data
    • Identifying map-side data skew problems
    • Map task throughput
    • Small files
    • Unsplittable files
  • Reduce
    • Too few or too many reducers
    • Reduce-side data skew problems
    • Reduce task throughput
    • Slow shuffle and sort
  • Competing jobs and scheduler throttling
  • Stack dumps & unoptimized code
  • Hardware failures
  • CPU contention
  • Tasks
    • Extracting and visualizing task execution times
    • Profiling your map and reduce tasks
  • Avoid the reducer
  • Filter and project
  • Using the combiner
  • Fast sorting with comparators
  • Collecting skewed data
  • Reduce skew mitigation
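
One common way to mitigate reduce-side skew is key salting: a suffix is appended to each key so that a single hot key is spread across several reducers, and a second pass merges the partial results. The sketch below simulates this in plain Python under stated assumptions. The function name `salted_count`, the `num_salts` parameter, and the round-robin salt (used here instead of a random one so the result is reproducible) are all choices made for this example, not part of any Hadoop API.

```python
from collections import Counter
from itertools import count


def salted_count(keys, num_salts=4):
    """Count keys in two stages, spreading each key over num_salts buckets."""
    # Stage 1: pair every key with a rotating salt and count per (key, salt).
    # Each (key, salt) pair stands in for the load on one reducer.
    position = count()
    stage1 = Counter((key, next(position) % num_salts) for key in keys)

    # Stage 2: strip the salt and merge the partial counts per original key.
    totals = Counter()
    for (key, _salt), partial in stage1.items():
        totals[key] += partial
    return stage1, totals


if __name__ == "__main__":
    keys = ["hot"] * 8 + ["cold"]
    partials, totals = salted_count(keys, num_salts=4)
    print(max(partials.values()), dict(totals))
```

Without salting, one reducer would receive all eight "hot" records. With four salts the largest partial group is smaller, at the cost of a second aggregation pass.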
