Apache Spark

1. Big Data Fundamentals

  • Emergence of Big Data
  • Big Data use cases
  • History

2. Hadoop Basics

  • HDFS
  • MapReduce
  • Hive

3. Spark Features

  • Advantages of Spark over MapReduce
  • In-memory caching
  • Language support: Scala, Java, Python and R
  • Documentation

4. Setting up a Spark Cluster & Hadoop 2.x with YARN

  • Installation
  • Cluster managers and their types: Standalone, YARN, Mesos and EC2

5. Spark Configuration and Performance Optimization

  • SparkConf properties
  • Components of execution
  • Logs
  • Parallelism
  • Serialization
  • Memory management
  • Hardware provisioning

6. Data Formats and Their Sources

  • File formats: JSON, XML, CSV, Text, Parquet and Sequence
  • Sources: Hive, HBase, HDFS, S3, Local and Database

7. Spark Core API (RDD)

  • SparkContext and SparkConf
  • Basic RDD Operations (Transformation and Actions)
  • Lazy evaluation
  • DAG
  • Pair RDD, joins, union, zip, cartesian, sorting and aggregate
  • RDD partitions
  • Caching and Persistence
  • Distributed variables: Accumulator and Broadcast
  • Task and Scheduler
  • Cluster manager types; Master, Worker node, Driver and Executor concepts
  • Job submission: REPL and Spark-submit

8. Spark SQL

  • SQLContext and HiveContext
  • DataFrame and Row operations
  • Caching
  • Loading data from: Hive, JSON and Parquet

9. Spark Streaming

  • Streaming concepts
  • DStream
  • Multi-Batch operations
  • State Operations
  • Sliding Window Operations
  • Parallelism
  • Advanced Data Sources
  • Checkpointing