PySpark
- Core module
- SQL module
Spark Core module index
Part – 1: Fundamentals
- A brief introductory discussion
- Big Data
- "Too much data": a discussion
- Challenges with Big Data
- Who initially solved the Big Data challenges?
- Hadoop creator profile
- What can be done with Hadoop?
- Advantages
- Limitations of Hadoop
- How Spark overcomes those limitations
- Spark creator profile
Part – 2: Introduction
- What is Apache Spark?
- Purpose of Spark
- What language is Spark written in?
- Can Spark integrate with Hadoop?
- What kinds of files does Spark support?
- Does Spark depend on Hadoop?
- Can I install Spark on Windows?
- History of Spark
- Where does Spark shine?
- Spark is Fast
- Spark features
- Data Processing terminology
- Spark - Before and After
- Why Spark is called a unified stack
Part – 3: Spark modules and terminology
- Apache Spark Components or modules
- Core
- SQL
- Streaming
- MLlib
- GraphX
- SparkR
- Cluster Managers
- Storage Layers for Spark
- Spark Execution Model
- Spark Terminology table
- Spark follows…
- Driver program
- Executors
- SparkContext
- How many SparkContext objects can be created per application?
- Stopping the SparkContext object
- SparkContext responsibilities
- How it works in Spark 1.x
- The solution in Spark 2.x (see the sketch below)
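A minimal sketch contrasting the two entry points, assuming a local master for illustration (the app names are placeholders): Spark 1.x applications create a single SparkContext directly, while Spark 2.x bundles it into a SparkSession.

    # Spark 1.x style: create one SparkContext per application
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("demo-1x").setMaster("local[*]")
    sc = SparkContext(conf=conf)        # only one active SparkContext is allowed
    print(sc.parallelize([1, 2, 3]).count())
    sc.stop()                           # stop it before creating another

    # Spark 2.x solution: SparkSession bundles SparkContext, SQLContext, etc.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo-2x").master("local[*]").getOrCreate()
    sc2 = spark.sparkContext            # the underlying SparkContext is still available
    spark.stop()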
- Understanding Spark Cluster Architecture
- Anatomy of Spark Application
- Components
- Py4J
- Spark clusters
- Spark clusters: Standalone cluster
- Spark on YARN
- YARN - client mode
- YARN - cluster mode (master URLs and deploy modes are sketched below)
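A minimal sketch of how the master URL selects the cluster manager; the standalone host/port and the script name below are placeholders.

    from pyspark.sql import SparkSession

    # Local mode: driver and executors run on this machine
    spark = SparkSession.builder.master("local[4]").appName("demo").getOrCreate()
    spark.stop()

    # Standalone cluster: point at the Spark master (host/port are placeholders)
    #   .master("spark://master-host:7077")
    # YARN: the master is simply "yarn"; client vs cluster mode is chosen at submit time:
    #   spark-submit --master yarn --deploy-mode client  app.py
    #   spark-submit --master yarn --deploy-mode cluster app.py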
Part – 4: RDD
- Importance of RDD
- Partitions in RDD
- Creating RDD
- Caching
- Persistence
- Fault-Recovery Mechanism
- If RAM is insufficient to store an RDD, where is it stored?
- RDD features
- Spark RDD Operations
- Transformations
- Types of Transformations
- Narrow Transformations
- Wide Transformations
- Actions
- Limitations of RDD
- RDD Operations
- Transformations & Actions
- Programs
- Coalesce and Repartition (see the sketch at the end of this part)
- Internals of Job execution in Spark
- Official PySpark website URL for more examples
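A minimal sketch pulling together the RDD topics above: creating a partitioned RDD, narrow and wide transformations, an action, persistence with disk spill, and coalesce/repartition. The names and partition counts are illustrative.

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[*]", "rdd-demo")

    rdd = sc.parallelize(range(10), 4)              # creating an RDD with 4 partitions

    squares = rdd.map(lambda x: x * x)              # narrow transformation (no shuffle)
    pairs = squares.map(lambda x: (x % 3, x))
    sums = pairs.reduceByKey(lambda a, b: a + b)    # wide transformation (shuffle)

    sums.persist(StorageLevel.MEMORY_AND_DISK)      # spills to disk if RAM is insufficient
    print(sums.collect())                           # action: triggers actual job execution

    fewer = sums.coalesce(2)                        # shrink partition count, avoids a full shuffle
    more = sums.repartition(8)                      # full shuffle to rebalance partitions
    print(fewer.getNumPartitions(), more.getNumPartitions())

    sc.stop()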
Spark SQL module index
- Spark SQL introduction
- How can we write Spark SQL programs?
- Spark SQL features
- Integrated
- Unified data access
- Performance optimization
- DataFrame
- Introduction to DataFrame
- Creating a DataFrame by loading a CSV file (see the sketch at the end of this index)
- What file formats does DataFrame support?
- DataFrame characteristics
- Programming languages to create a DataFrame
- Why DataFrame?
- Custom memory management
- Optimized execution plan
- Spark SQL execution plan
- Spark SQL terminology
- Spark SQL programs
- Official PySpark website URL for more examples
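A minimal sketch of the DataFrame and Spark SQL topics above; the CSV path and column names are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    # Load a CSV file into a DataFrame, inferring column types from the data
    df = spark.read.csv("people.csv", header=True, inferSchema=True)
    df.printSchema()

    # DataFrame API
    df.select("name", "age").filter(df.age > 30).show()

    # Or register a temporary view and use plain SQL
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()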