Machine Learning with Spark - Second Edition.

Bibliographic Details
Main Author: Dua, Rajdeep
Other Authors: Ghotra, Manpreet Singh; Pentreath, Nick
Format: eBook
Language: English
Published: Birmingham : Packt Publishing, 2016.
Edition: 2nd ed.
Subjects: Spark (Electronic resource : Apache Software Foundation); Machine learning
Online Access: https://ebookcentral.proquest.com/lib/holycrosscollege-ebooks/detail.action?docID=4853045

MARC

LEADER 00000cam a2200000Mu 4500
001 ocn990674354
003 OCoLC
005 20240909213021.0
006 m o d
007 cr |n|---|||||
008 170624s2016 enk o 000 0 eng d
040 |a EBLCP  |b eng  |e pn  |c EBLCP  |d MERUC  |d CHVBK  |d OCLCQ  |d OCLCO  |d OCLCF  |d OCLCQ  |d LVT  |d OCLCQ  |d LOA  |d OCLCO  |d K6U  |d OCLCQ  |d OCLCO  |d OCLCL 
020 |a 9781785886423 
020 |a 1785886428 
035 |a (OCoLC)990674354 
050 4 |a QA76.9.D343  |b .D83 2017 
049 |a HCDD 
100 1 |a Dua, Rajdeep. 
245 1 0 |a Machine Learning with Spark - Second Edition. 
250 |a 2nd ed. 
260 |a Birmingham :  |b Packt Publishing,  |c 2016. 
300 |a 1 online resource (523 pages) 
336 |a text  |b txt  |2 rdacontent 
337 |a computer  |b c  |2 rdamedia 
338 |a online resource  |b cr  |2 rdacarrier 
588 0 |a Print version record. 
505 0 |a Cover -- Credits -- About the Authors -- About the Reviewer -- www.PacktPub.com -- Customer Feedback -- Table of Contents -- Preface -- Chapter 1: Getting Up and Running with Spark -- Installing and setting up Spark locally -- Spark clusters -- The Spark programming model -- SparkContext and SparkConf -- SparkSession -- The Spark shell -- Resilient Distributed Datasets -- Creating RDDs -- Spark operations -- Caching RDDs -- Broadcast variables and accumulators -- SchemaRDD -- Spark data frame -- The first step to a Spark program in Scala -- The first step to a Spark program in Java -- The first step to a Spark program in Python -- The first step to a Spark program in R -- SparkR DataFrames -- Getting Spark running on Amazon EC2 -- Launching an EC2 Spark cluster -- Configuring and running Spark on Amazon Elastic Map Reduce -- UI in Spark -- Supported machine learning algorithms by Spark -- Benefits of using Spark ML as compared to existing libraries -- Spark Cluster on Google Compute Engine -- DataProc -- Hadoop and Spark Versions -- Creating a Cluster -- Submitting a Job -- Summary -- Chapter 2: Math for Machine Learning -- Linear algebra -- Setting up the Scala environment in Intellij -- Setting up the Scala environment on the Command Line -- Fields -- Real numbers -- Complex numbers -- Vectors -- Vector spaces -- Vector types -- Vectors in Breeze -- Vectors in Spark -- Vector operations -- Hyperplanes -- Vectors in machine learning -- Matrix -- Types of matrices -- Matrix in Spark -- Distributed matrix in Spark -- Matrix operations -- Determinant -- Eigenvalues and eigenvectors -- Singular value decomposition -- Matrices in machine learning -- Functions -- Function types -- Functional composition -- Hypothesis -- Gradient descent -- Prior, likelihood, and posterior -- Calculus -- Differential calculus -- Integral calculus. 
505 8 |a Lagranges multipliers -- Plotting -- Summary -- Chapter 3: Designing a Machine Learning System -- What is Machine Learning? -- Introducing MovieStream -- Business use cases for a machine learning system -- Personalization -- Targeted marketing and customer segmentation -- Predictive modeling and analytics -- Types of machine learning models -- The components of a data-driven machine learning system -- Data ingestion and storage -- Data cleansing and transformation -- Model training and testing loop -- Model deployment and integration -- Model monitoring and feedback -- Batch versus real time -- Data Pipeline in Apache Spark -- An architecture for a machine learning system -- Spark MLlib -- Performance improvements in Spark ML over Spark MLlib -- Comparing algorithms supported by MLlib -- Classification -- Clustering -- Regression -- MLlib supported methods and developer APIs -- Spark Integration -- MLlib vision -- MLlib versions compared -- Spark 1.6 to 2.0 -- Summary -- Chapter 4: Obtaining, Processing, and Preparing Data with Spark -- Accessing publicly available datasets -- The MovieLens 100k dataset -- Exploring and visualizing your data -- Exploring the user dataset -- Count by occupation -- Movie dataset -- Exploring the rating dataset -- Rating count bar chart -- Distribution of number ratings -- Processing and transforming your data -- Filling in bad or missing data -- Extracting useful features from your data -- Numerical features -- Categorical features -- Derived features -- Transforming timestamps into categorical features -- Extract time of Day -- Extract time of day -- Text features -- Simple text feature extraction -- Sparse Vectors from Titles -- Normalizing features -- Using ML for feature normalization -- Using packages for feature extraction -- TFID -- IDF -- Word2Vector -- Skip-gram model -- Standard scalar -- Summary. 
505 8 |a Chapter 5: Building a Recommendation Engine with Spark -- Types of recommendation models -- Content-based filtering -- Collaborative filtering -- Matrix factorization -- Explicit matrix factorization -- Implicit Matrix Factorization -- Basic model for Matrix Factorization -- Alternating least squares -- Extracting the right features from your data -- Extracting features from the MovieLens 100k dataset -- Training the recommendation model -- Training a model on the MovieLens 100k dataset -- Training a model using Implicit feedback data -- Using the recommendation model -- ALS Model recommendations -- User recommendations -- Generating movie recommendations from the MovieLens 100k dataset -- Inspecting the recommendations -- Item recommendations -- Generating similar movies for the MovieLens 100k dataset -- Inspecting the similar items -- Evaluating the performance of recommendation models -- ALS Model Evaluation -- Mean Squared Error -- Mean Average Precision at K -- Using MLlib's built-in evaluation functions -- RMSE and MSE -- MAP -- FP-Growth algorithm -- FP-Growth Basic Sample -- FP-Growth Applied to Movie Lens Data -- Summary -- Chapter 6: Building a Classification Model with Spark -- Types of classification models -- Linear models -- Logistic regression -- Multinomial logistic regression -- Visualizing the StumbleUpon dataset -- Extracting features from the Kaggle/StumbleUpon evergreen classification dataset -- StumbleUponExecutor -- Linear support vector machines -- The naive Bayes model -- Decision trees -- Ensembles of trees -- Random Forests -- Gradient-Boosted Trees -- Multilayer perceptron classifier -- Extracting the right features from your data -- Training classification models -- Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset -- Using classification models. 
505 8 |a Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset -- Evaluating the performance of classification models -- Accuracy and prediction error -- Precision and recall -- ROC curve and AUC -- Improving model performance and tuning parameters -- Feature standardization -- Additional features -- Using the correct form of data -- Tuning model parameters -- Linear models -- Iterations -- Step size -- Regularization -- Decision trees -- Tuning tree depth and impurity -- The naive Bayes model -- Cross-validation -- Summary -- Chapter 7: Building a Regression Model with Spark -- Types of regression models -- Least squares regression -- Decision trees for regression -- Evaluating the performance of regression models -- Mean Squared Error and Root Mean Squared Error -- Mean Absolute Error -- Root Mean Squared Log Error -- The R-squared coefficient -- Extracting the right features from your data -- Extracting features from the bike sharing dataset -- Training and using regression models -- BikeSharingExecutor -- Training a regression model on the bike sharing dataset -- Linear regression -- Generalized linear regression -- Decision tree regression -- Ensembles of trees -- Random forest regression -- Gradient boosted tree regression -- Improving model performance and tuning parameters -- Transforming the target variable -- Impact of training on log-transformed targets -- Tuning model parameters -- Creating training and testing sets to evaluate parameters -- Splitting data for Decision tree -- The impact of parameter settings for linear models -- Iterations -- Step size -- L2 regularization -- L1 regularization -- Intercept -- The impact of parameter settings for the decision tree -- Tree depth -- Maximum bins -- The impact of parameter settings for the Gradient Boosted Trees -- Iterations -- MaxBins -- Summary. 
505 8 |a Chapter 8: Building a Clustering Model with Spark -- Types of clustering models -- k-means clustering -- Initialization methods -- Mixture models -- Hierarchical clustering -- Extracting the right features from your data -- Extracting features from the MovieLens dataset -- K-means -- training a clustering model -- Training a clustering model on the MovieLens dataset -- K-means -- interpreting cluster predictions on the MovieLens dataset -- Interpreting the movie clusters -- Interpreting the movie clusters -- K-means -- evaluating the performance of clustering models -- Internal evaluation metrics -- External evaluation metrics -- Computing performance metrics on the MovieLens dataset -- Effect of iterations on WSSSE -- Bisecting KMeans -- Bisecting K-means -- training a clustering model -- WSSSE and iterations -- Gaussian Mixture Model -- Clustering using GMM -- Plotting the user and item data with GMM clustering -- GMM -- effect of iterations on cluster boundaries -- Summary -- Chapter 9: Dimensionality Reduction with Spark -- Types of dimensionality reduction -- Principal components analysis -- Singular value decomposition -- Relationship with matrix factorization -- Clustering as dimensionality reduction -- Extracting the right features from your data -- Extracting features from the LFW dataset -- Exploring the face data -- Visualizing the face data -- Extracting facial images as vectors -- Loading images -- Converting to grayscale and resizing the images -- Extracting feature vectors -- Normalization -- Training a dimensionality reduction model -- Running PCA on the LFW dataset -- Visualizing the Eigenfaces -- Interpreting the Eigenfaces -- Using a dimensionality reduction model -- Projecting data using PCA on the LFW dataset -- The relationship between PCA and SVD -- Evaluating dimensionality reduction models. 
630 0 0 |a Spark (Electronic resource : Apache Software Foundation) 
630 0 7 |a Spark (Electronic resource : Apache Software Foundation)  |2 fast 
650 0 |a Machine learning. 
650 7 |a Machine learning  |2 fast 
700 1 |a Ghotra, Manpreet Singh. 
700 1 |a Pentreath, Nick. 
758 |i has work:  |a Machine Learning with Spark - Second Edition (Text)  |1 https://id.oclc.org/worldcat/entity/E39PCYTyG3Gm8QRPqQCBFw8kCP  |4 https://id.oclc.org/worldcat/ontology/hasWork 
776 0 8 |i Print version:  |a Dua, Rajdeep.  |t Machine Learning with Spark - Second Edition.  |d Birmingham : Packt Publishing, ©2016 
856 4 0 |u https://ebookcentral.proquest.com/lib/holycrosscollege-ebooks/detail.action?docID=4853045  |y Click for online access 
903 |a EBC-AC 
994 |a 92  |b HCD