Book Information

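High Performance Spark (Reprint Edition)
  • Authors: Holden Karau, Rachel Warren
  • Publisher: Nanjing: Southeast University Press
  • ISBN: 9787564175184
  • Publication year: 2018
  • Listed page count: 344
  • File size: 42MB
  • File page count: 360
  • Subject: Data processing software (English)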

Table of Contents

1. Introduction to High Performance Spark 1

What Is Spark and Why Performance Matters 1

What You Can Expect to Get from This Book 2

Spark Versions 3

Why Scala? 3

To Be a Spark Expert You Have to Learn a Little Scala Anyway 3

The Spark Scala API Is Easier to Use Than the Java API 4

Scala Is More Performant Than Python 4

Why Not Scala? 4

Learning Scala 5

Conclusion 6

2. How Spark Works 7

How Spark Fits into the Big Data Ecosystem 8

Spark Components 8

Spark Model of Parallel Computing: RDDs 10

Lazy Evaluation 11

In-Memory Persistence and Memory Management 13

Immutability and the RDD Interface 14

Types of RDDs 16

Functions on RDDs: Transformations Versus Actions 17

Wide Versus Narrow Dependencies 17

Spark Job Scheduling 19

Resource Allocation Across Applications 20

The Spark Application 20

The Anatomy of a Spark Job 22

The DAG 22

Jobs 23

Stages 23

Tasks 24

Conclusion 26

3. DataFrames, Datasets, and Spark SQL 27

Getting Started with the SparkSession (or HiveContext or SQLContext) 28

Spark SQL Dependencies 30

Managing Spark Dependencies 31

Avoiding Hive JARs 32

Basics of Schemas 33

DataFrame API 36

Transformations 36

Multi-DataFrame Transformations 48

Plain Old SQL Queries and Interacting with Hive Data 49

Data Representation in DataFrames and Datasets 49

Tungsten 50

Data Loading and Saving Functions 51

DataFrameWriter and DataFrameReader 51

Formats 52

Save Modes 61

Partitions (Discovery and Writing) 62

Datasets 62

Interoperability with RDDs, DataFrames, and Local Collections 63

Compile-Time Strong Typing 64

Easier Functional (RDD “like”) Transformations 65

Relational Transformations 65

Multi-Dataset Relational Transformations 65

Grouped Operations on Datasets 66

Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs) 67

Query Optimizer 69

Logical and Physical Plans 69

Code Generation 70

Large Query Plans and Iterative Algorithms 70

Debugging Spark SQL Queries 71

JDBC/ODBC Server 71

Conclusion 72

4. Joins (SQL and Core) 75

Core Spark Joins 75

Choosing a Join Type 77

Choosing an Execution Plan 78

Spark SQL Joins 81

DataFrame Joins 82

Dataset Joins 85

Conclusion 86

5. Effective Transformations 87

Narrow Versus Wide Transformations 88

Implications for Performance 90

Implications for Fault Tolerance 91

The Special Case of coalesce 92

What Type of RDD Does Your Transformation Return? 92

Minimizing Object Creation 94

Reusing Existing Objects 94

Using Smaller Data Structures 97

Iterator-to-Iterator Transformations with mapPartitions 100

What Is an Iterator-to-Iterator Transformation? 101

Space and Time Advantages 102

An Example 103

Set Operations 106

Reducing Setup Overhead 107

Shared Variables 108

Broadcast Variables 108

Accumulators 109

Reusing RDDs 114

Cases for Reuse 114

Deciding if Recompute Is Inexpensive Enough 117

Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files 118

Alluxio (nee Tachyon) 122

LRU Caching 123

Noisy Cluster Considerations 124

Interaction with Accumulators 125

Conclusion 126

6. Working with Key/Value Data 127

The Goldilocks Example 129

Goldilocks Version 0: Iterative Solution 130

How to Use PairRDDFunctions and OrderedRDDFunctions 132

Actions on Key/Value Pairs 133

What’s So Dangerous About the groupByKey Function 134

Goldilocks Version 1: groupByKey Solution 134

Choosing an Aggregation Operation 138

Dictionary of Aggregation Operations with Performance Considerations 138

Multiple RDD Operations 141

Co-Grouping 141

Partitioners and Key/Value Data 142

Using the Spark Partitioner Object 144

Hash Partitioning 144

Range Partitioning 144

Custom Partitioning 145

Preserving Partitioning Information Across Transformations 146

Leveraging Co-Located and Co-Partitioned RDDs 146

Dictionary of Mapping and Partitioning Functions PairRDDFunctions 148

Dictionary of OrderedRDDOperations 149

Sorting by Two Keys with SortByKey 151

Secondary Sort and repartitionAndSortWithinPartitions 151

Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function 152

How Not to Sort by Two Orderings 155

Goldilocks Version 2: Secondary Sort 156

A Different Approach to Goldilocks 159

Goldilocks Version 3: Sort on Cell Values 164

Straggler Detection and Unbalanced Data 165

Back to Goldilocks (Again) 167

Goldilocks Version 4: Reduce to Distinct on Each Partition 167

Conclusion 173

7. Going Beyond Scala 175

Beyond Scala within the JVM 176

Beyond Scala, and Beyond the JVM 180

How PySpark Works 181

How SparkR Works 189

Spark.jl (Julia Spark) 191

How Eclair JS Works 192

Spark on the Common Language Runtime (CLR)—C# and Friends 193

Calling Other Languages from Spark 193

Using Pipe and Friends 193

JNI 195

Java Native Access(JNA) 198

Underneath Everything Is FORTRAN 199

Getting to the GPU 200

The Future 201

Conclusion 201

8. Testing and Validation 203

Unit Testing 203

General Spark Unit Testing 204

Mocking RDDs 208

Getting Test Data 210

Generating Large Datasets 210

Sampling 211

Property Checking with ScalaCheck 213

Computing RDD Difference 213

Integration Testing 216

Choosing Your Integration Testing Environment 216

Verifying Performance 217

Spark Counters for Verifying Performance 217

Projects for Verifying Performance 218

Job Validation 219

Conclusion 220

9. Spark MLlib and ML 221

Choosing Between Spark MLlib and Spark ML 221

Working with MLlib 222

Getting Started with MLlib (Organization and Imports) 222

MLlib Feature Encoding and Data Preparation 223

Feature Scaling and Selection 228

MLlib Model Training 228

Predicting 229

Serving and Persistence 230

Model Evaluation 232

Working with Spark ML 233

Spark ML Organization and Imports 233

Pipeline Stages 234

Explain Params 235

Data Encoding 236

Data Cleaning 239

Spark ML Models 239

Putting It All Together in a Pipeline 240

Training a Pipeline 241

Accessing Individual Stages 241

Data Persistence and Spark ML 242

Extending Spark ML Pipelines with Your Own Algorithms 244

Model and Pipeline Persistence and Serving with Spark ML 252

General Serving Considerations 252

Conclusion 253

10. Spark Components and Packages 255

Stream Processing with Spark 257

Sources and Sinks 257

Batch Intervals 259

Data Checkpoint Intervals 260

Considerations for DStreams 261

Considerations for Structured Streaming 262

High Availability Mode (or Handling Driver Failure or Checkpointing) 270

GraphX 271

Using Community Packages and Libraries 271

Creating a Spark Package 273

Conclusion 274

A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist 275

Index 325