PySpark
- Core module
- SQL module
Spark Core module index
Part – 1: Fundamentals
- A brief introductory discussion
- Big Data
- "Too much data": a discussion
- Challenges with Big Data
- Who initially solved the Big Data challenges?
- Hadoop creator profile
- What can be done with Hadoop?
- Advantages
- Limitations of Hadoop
- How Spark overcomes those limitations
- Spark creator profile
Part – 2: Introduction
- What is Apache Spark?
- Purpose of Spark
- What language is Spark written in?
- Can Spark integrate with Hadoop?
- What kinds of files does Spark support?
- Does Spark depend on Hadoop?
- Can I install Spark on Windows?
- History of Spark
- Where does Spark shine?
- Spark is Fast
- Spark features
- Data Processing terminology
- Spark - Before and After
- Why Spark is called a unified stack
Part – 3: Spark modules and terminology
- Apache Spark Components or modules
- Core
- SQL
- Streaming
- MLlib
- GraphX
- SparkR
- Cluster Managers
- Storage Layers for Spark
- Spark Execution Model
- Spark Terminology table
- Spark follows…
- Driver program
- Executors
- SparkContext
- How many SparkContext objects can be created per application?
- Stopping the SparkContext object
- SparkContext responsibilities
- How it works in Spark 1.x
- The solution in Spark 2.x (see the sketch below)
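A minimal sketch contrasting the two entry points, assuming a local master for illustration (the app names are placeholders): Spark 1.x applications create a single SparkContext directly, while Spark 2.x bundles it into a SparkSession.

    # Spark 1.x style: create one SparkContext per application
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("demo-1x").setMaster("local[*]")
    sc = SparkContext(conf=conf)        # only one active SparkContext is allowed
    print(sc.parallelize([1, 2, 3]).count())
    sc.stop()                           # stop it before creating another

    # Spark 2.x solution: SparkSession bundles SparkContext, SQLContext, etc.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo-2x").master("local[*]").getOrCreate()
    sc2 = spark.sparkContext            # the underlying SparkContext is still available
    spark.stop()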
- Understanding Spark Cluster Architecture
- Anatomy of Spark Application
- Components
- Py4J
- Spark clusters
- Spark clusters: Standalone cluster
- Spark on YARN
- YARN - client mode
- YARN - cluster mode (master URLs and deploy modes are sketched below)
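A minimal sketch of how the master URL selects the cluster manager; the standalone host/port and the script name below are placeholders.

    from pyspark.sql import SparkSession

    # Local mode: driver and executors run on this machine
    spark = SparkSession.builder.master("local[4]").appName("demo").getOrCreate()
    spark.stop()

    # Standalone cluster: point at the Spark master (host/port are placeholders)
    #   .master("spark://master-host:7077")
    # YARN: the master is simply "yarn"; client vs cluster mode is chosen at submit time:
    #   spark-submit --master yarn --deploy-mode client  app.py
    #   spark-submit --master yarn --deploy-mode cluster app.py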
Part – 4: RDD
- Importance of RDD
- Partitions in RDD
- Creating RDD
- Caching
- Persistence
- Fault-Recovery Mechanism
- If RAM is insufficient to store an RDD, where is it stored?
- RDD features
- Spark RDD Operations
- Transformations
- Types of Transformations
- Narrow Transformations
- Wide Transformations
- Actions
- Limitations of RDD
- RDD Operations
- Transformations & Actions
- Programs
- Coalesce and Repartition (see the sketch at the end of this part)
- Internals of Job execution in Spark
- Official PySpark website URL for more examples
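A minimal sketch pulling together the RDD topics above: creating a partitioned RDD, narrow and wide transformations, an action, persistence with disk spill, and coalesce/repartition. The names and partition counts are illustrative.

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[*]", "rdd-demo")

    rdd = sc.parallelize(range(10), 4)              # creating an RDD with 4 partitions

    squares = rdd.map(lambda x: x * x)              # narrow transformation (no shuffle)
    pairs = squares.map(lambda x: (x % 3, x))
    sums = pairs.reduceByKey(lambda a, b: a + b)    # wide transformation (shuffle)

    sums.persist(StorageLevel.MEMORY_AND_DISK)      # spills to disk if RAM is insufficient
    print(sums.collect())                           # action: triggers actual job execution

    fewer = sums.coalesce(2)                        # shrink partition count, avoids a full shuffle
    more = sums.repartition(8)                      # full shuffle to rebalance partitions
    print(fewer.getNumPartitions(), more.getNumPartitions())

    sc.stop()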
Spark SQL module index
- Spark SQL introduction
- How can we write Spark SQL programs?
- Spark SQL features
- Integrated
- Unified data access
- Performance optimization
- DataFrame
- Introduction to DataFrame
- Creating a DataFrame by loading a CSV file (see the sketch at the end of this index)
- What file formats does DataFrame support?
- DataFrame characteristics
- Programming languages to create a DataFrame
- Why DataFrame?
- Custom memory management
- Optimized execution plan
- Spark SQL execution plan
- Spark SQL terminology
- Spark SQL programs
- Official PySpark website URL for more examples
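A minimal sketch of the DataFrame and Spark SQL topics above; the CSV path and column names are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    # Load a CSV file into a DataFrame, inferring column types from the data
    df = spark.read.csv("people.csv", header=True, inferSchema=True)
    df.printSchema()

    # DataFrame API
    df.select("name", "age").filter(df.age > 30).show()

    # Or register a temporary view and use plain SQL
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()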