Data Science for Big Data Analytics 교육 과정

Course Code

dsbda

Duration

35 hours (usually 5 days including breaks)

Overview

빅 데이터는 너무 방대하고 복잡한 데이터 세트로 전통적인 데이터 처리 응용 프로그램 소프트웨어가 처리하기에 부적합합니다. 큰 데이터 문제로는 데이터 캡처, 데이터 저장, 데이터 분석, 검색, 공유, 전송, 시각화, 쿼리, 업데이트 및 정보 프라이버시가 있습니다.

Machine Translated

Course Outline

Introduction to Data Science for Big Data Analytics

  • Data Science Overview
  • Big Data Overview
  • Data Structures
  • Drivers and complexities of Big Data
  • Big Data ecosystem and a new approach to analytics
  • Key technologies in Big Data
  • Data Mining process and problems
    • Association Pattern Mining
    • Data Clustering
    • Outlier Detection
    • Data Classification

Introduction to Data Analytics lifecycle

  • Discovery
  • Data preparation
  • Model planning
  • Model building
  • Presentation/Communication of results
  • Operationalization
  • Exercise: Case study

From this point most of the training time (80%) will be spent on examples and exercises in R and related big data technology.

Getting started with R

  • Installing R and Rstudio
  • Features of R language
  • Objects in R
  • Data in R
  • Data manipulation
  • Big data issues
  • Exercises

Getting started with Hadoop

  • Installing Hadoop
  • Understanding Hadoop modes
  • HDFS
  • MapReduce architecture
  • Hadoop related projects overview
  • Writing programs in Hadoop MapReduce
  • Exercises

Integrating R and Hadoop with RHadoop

  • Components of RHadoop
  • Installing RHadoop and connecting with Hadoop
  • The architecture of RHadoop
  • Hadoop streaming with R
  • Data analytics problem solving with RHadoop
  • Exercises

Pre-processing and preparing data

  • Data preparation steps
  • Feature extraction
  • Data cleaning
  • Data integration and transformation
  • Data reduction – sampling, feature subset selection,
  • Dimensionality reduction
  • Discretization and binning
  • Exercises and Case study

Exploratory data analytic methods in R

  • Descriptive statistics
  • Exploratory data analysis
  • Visualization – preliminary steps
  • Visualizing single variable
  • Examining multiple variables
  • Statistical methods for evaluation
  • Hypothesis testing
  • Exercises and Case study

Data Visualizations

  • Basic visualizations in R
  • Packages for data visualization ggplot2, lattice, plotly, lattice
  • Formatting plots in R
  • Advanced graphs
  • Exercises

Regression (Estimating future values)

  • Linear regression
  • Use cases
  • Model description
  • Diagnostics
  • Problems with linear regression
  • Shrinkage methods, ridge regression, the lasso
  • Generalizations and nonlinearity
  • Regression splines
  • Local polynomial regression
  • Generalized additive models
  • Regression with RHadoop
  • Exercises and Case study

Classification

  • The classification related problems
  • Bayesian refresher
  • Naïve Bayes
  • Logistic regression
  • K-nearest neighbors
  • Decision trees algorithm
  • Neural networks
  • Support vector machines
  • Diagnostics of classifiers
  • Comparison of classification methods
  • Scalable classification algorithms
  • Exercises and Case study

Assessing model performance and selection

  • Bias, Variance and model complexity
  • Accuracy vs Interpretability
  • Evaluating classifiers
  • Measures of model/algorithm performance
  • Hold-out method of validation
  • Cross-validation
  • Tuning machine learning algorithms with caret package
  • Visualizing model performance with Profit ROC and Lift curves

Ensemble Methods

  • Bagging
  • Random Forests
  • Boosting
  • Gradient boosting
  • Exercises and Case study

Support vector machines for classification and regression

  • Maximal Margin classifiers
    • Support vector classifiers
    • Support vector machines
    • SVM’s for classification problems
    • SVM’s for regression problems
  • Exercises and Case study

Identifying unknown groupings within a data set

  • Feature Selection for Clustering
  • Representative based algorithms: k-means, k-medoids
  • Hierarchical algorithms: agglomerative and divisive methods
  • Probabilistic base algorithms: EM
  • Density based algorithms: DBSCAN, DENCLUE
  • Cluster validation
  • Advanced clustering concepts
  • Clustering with RHadoop
  • Exercises and Case study

Discovering connections with Link Analysis

  • Link analysis concepts
  • Metrics for analyzing networks
  • The Pagerank algorithm
  • Hyperlink-Induced Topic Search
  • Link Prediction
  • Exercises and Case study

Association Pattern Mining

  • Frequent Pattern Mining Model
  • Scalability issues in frequent pattern mining
  • Brute Force algorithms
  • Apriori algorithm
  • The FP growth approach
  • Evaluation of Candidate Rules
  • Applications of Association Rules
  • Validation and Testing
  • Diagnostics
  • Association rules with R and Hadoop
  • Exercises and Case study

Constructing recommendation engines

  • Understanding recommender systems
  • Data mining techniques used in recommender systems
  • Recommender systems with recommenderlab package
  • Evaluating the recommender systems
  • Recommendations with RHadoop
  • Exercise: Building recommendation engine

Text analysis

  • Text analysis steps
  • Collecting raw text
  • Bag of words
  • Term Frequency –Inverse Document Frequency
  • Determining Sentiments
  • Exercises and Case study

회원 평가

★★★★★
★★★★★

고객 회사

is growing fast!

We are looking to expand our presence in South Korea!

As a Business Development Manager you will:

  • expand business in South Korea
  • recruit local talent (sales, agents, trainers, consultants)
  • recruit local trainers and consultants

We offer:

  • Artificial Intelligence and Big Data systems to support your local operation
  • high-tech automation
  • continuously upgraded course catalogue and content
  • good fun in international team

If you are interested in running a high-tech, high-quality training and consulting business.

Apply now!