Applied data science using Pyspark : learn the end-to-end predictive model-building cycle / Ramcharan Kakarla, Sundar Krishnan, Sridhar Alla.

Discover the capabilities of PySpark and its application in the realm of data science. This comprehensive guide with hand-picked examples of daily use cases will walk you through the end-to-end predictive model-building cycle with the latest techniques and tricks of the trade. Applied Data Science U...

Bibliographic Details
Main Author:	Kakarla, Ramcharan
Other Authors:	Krishnan, Sundar; Alla, Sridhar
Format:	eBook
Language:	English
Published:	Berkeley, CA : Apress, 2021.
Subjects:	Big data; Machine learning; Python (Computer program language); Parallel processing (Electronic computers); Computer software

Table of Contents:

Intro
Table of Contents
About the Authors
About the Technical Reviewer
Acknowledgments
Foreword 1
Foreword 2
Foreword 3
Introduction
Chapter 1: Setting Up the PySpark Environment
Local Installation using Anaconda
Step 1: Install Anaconda
Step 2: Conda Environment Creation
Step 3: Download and Unpack Apache Spark
Step 4: Install Java 8 or Later
Step 5: Mac & Linux Users
Step 6: Windows Users
Step 7: Run PySpark
Step 8: Jupyter Notebook Extension
Docker-based Installation
Why Do We Need to Use Docker?
What Is Docker?
Create a Simple Docker Image
Download PySpark Docker
Step-by-Step Approach to Understanding the Docker PySpark run Command
Databricks Community Edition
Create Databricks Account
Create a New Cluster
Create Notebooks
How Do You Import Data Files into the Databricks Environment?
Basic Operations
Upload Data
Access Data
Calculate Pi
Summary
Chapter 2: PySpark Basics
PySpark Background
PySpark Resilient Distributed Datasets (RDDs) and DataFrames
Data Manipulations
Reading Data from a File
Reading Data from Hive Table
Reading Metadata
Counting Records
Subset Columns and View a Glimpse of the Data
Missing Values
One-Way Frequencies
Sorting and Filtering One-Way Frequencies
Casting Variables
Descriptive Statistics
Unique/Distinct Values and Counts
Filtering
Creating New Columns
Deleting and Renaming Columns
Summary
Chapter 3: Utility Functions and Visualizations
Additional Data Manipulations
String Functions
Registering DataFrames
Window Functions
Other Useful Functions
Collect List
Sampling
Caching and Persisting
Saving Data
Pandas Support
Joins
Dropping Duplicates
Data Visualizations
Introduction to Machine Learning
Summary
Chapter 4: Variable Selection
Exploratory Data Analysis
Cardinality
Missing Values
Missing at Random (MAR)
Missing Completely at Random (MCAR)
Missing Not at Random (MNAR)
Code 1: Cardinality Check
Code 2: Missing Values Check
Step 1: Identify Variable Types
Step 2: Apply StringIndexer to Character Columns
Step 3: Assemble Features
Built-in Variable Selection Process: Without Target
Principal Component Analysis
Mechanics
Singular Value Decomposition
Built-in Variable Selection Process: With Target
ChiSq Selector
Model-based Feature Selection
Custom-built Variable Selection Process
Information Value Using Weight of Evidence
Monotonic Binning Using Spearman Correlation
How Do You Calculate the Spearman Correlation by Hand?
How Is Spearman Correlation Used to Create Monotonic Bins for Continuous Variables?
Custom Transformers
Main Concepts in Pipelines
Voting-based Selection
Summary
Chapter 5: Supervised Learning Algorithms
Basics
Regression
Classification
Loss Functions
Optimizers
