Book Introduction
Programming Massively Parallel Processors (大规模并行处理器程序设计): PDF e-book download
- Author: Kirk, D. (US)
- Publisher: Tsinghua University Press, Beijing
- ISBN: 9787302229735
- Publication year: 2010
- Listed page count: 258
- File size: 59 MB
- File page count: 278
- Subject terms: parallel programming; program design; higher education; textbook; English
PDF Download
Download Notes
The downloaded file is a RAR archive; extract it with decompression software to obtain the PDF. All resources on this site are packaged as BitTorrent seeds, so a dedicated BitTorrent client is required. We recommend Free Download Manager (FDM), which is free, ad-free, and cross-platform; clients such as BitComet, qBittorrent, or uTorrent also work. Thunder (Xunlei) is not recommended for now, since this is not a popular resource on that network; once the resource becomes popular, Thunder will work as well.
(The file page count should exceed the listed page count, except for multi-volume e-books.)
Note: all archives on this site require an extraction password.
Table of Contents
CHAPTER 1 INTRODUCTION 1
1.1 GPUs as Parallel Computers 2
1.2 Architecture of a Modern GPU 8
1.3 Why More Speed or Parallelism? 10
1.4 Parallel Programming Languages and Models 13
1.5 Overarching Goals 15
1.6 Organization of the Book 16
CHAPTER 2 HISTORY OF GPU COMPUTING 21
2.1 Evolution of Graphics Pipelines 21
2.1.1 The Era of Fixed-Function Graphics Pipelines 22
2.1.2 Evolution of Programmable Real-Time Graphics 26
2.1.3 Unified Graphics and Computing Processors 29
2.1.4 GPGPU: An Intermediate Step 31
2.2 GPU Computing 32
2.2.1 Scalable GPUs 33
2.2.2 Recent Developments 34
2.3 Future Trends 34
CHAPTER 3 INTRODUCTION TO CUDA 39
3.1 Data Parallelism 39
3.2 CUDA Program Structure 41
3.3 A Matrix-Matrix Multiplication Example 42
3.4 Device Memories and Data Transfer 46
3.5 Kernel Functions and Threading 51
3.6 Summary 56
3.6.1 Function declarations 56
3.6.2 Kernel launch 56
3.6.3 Predefined variables 56
3.6.4 Runtime API 57
CHAPTER 4 CUDA THREADS 59
4.1 CUDA Thread Organization 59
4.2 Using blockIdx and threadIdx 64
4.3 Synchronization and Transparent Scalability 68
4.4 Thread Assignment 70
4.5 Thread Scheduling and Latency Tolerance 71
4.6 Summary 74
4.7 Exercises 74
CHAPTER 5 CUDA™ MEMORIES 77
5.1 Importance of Memory Access Efficiency 78
5.2 CUDA Device Memory Types 79
5.3 A Strategy for Reducing Global Memory Traffic 83
5.4 Memory as a Limiting Factor to Parallelism 90
5.5 Summary 92
5.6 Exercises 93
CHAPTER 6 PERFORMANCE CONSIDERATIONS 95
6.1 More on Thread Execution 96
6.2 Global Memory Bandwidth 103
6.3 Dynamic Partitioning of SM Resources 111
6.4 Data Prefetching 113
6.5 Instruction Mix 115
6.6 Thread Granularity 116
6.7 Measured Performance and Summary 118
6.8 Exercises 120
CHAPTER 7 FLOATING POINT CONSIDERATIONS 125
7.1 Floating-Point Format 126
7.1.1 Normalized Representation of M 126
7.1.2 Excess Encoding of E 127
7.2 Representable Numbers 129
7.3 Special Bit Patterns and Precision 134
7.4 Arithmetic Accuracy and Rounding 135
7.5 Algorithm Considerations 136
7.6 Summary 138
7.7 Exercises 138
CHAPTER 8 APPLICATION CASE STUDY: ADVANCED MRI RECONSTRUCTION 141
8.1 Application Background 142
8.2 Iterative Reconstruction 144
8.3 Computing FHd 148
Step 1. Determine the Kernel Parallelism Structure 149
Step 2. Getting Around the Memory Bandwidth Limitation 156
Step 3. Using Hardware Trigonometry Functions 163
Step 4. Experimental Performance Tuning 166
8.4 Final Evaluation 167
8.5 Exercises 170
CHAPTER 9 APPLICATION CASE STUDY: MOLECULAR VISUALIZATION AND ANALYSIS 173
9.1 Application Background 174
9.2 A Simple Kernel Implementation 176
9.3 Instruction Execution Efficiency 180
9.4 Memory Coalescing 182
9.5 Additional Performance Comparisons 185
9.6 Using Multiple GPUs 187
9.7 Exercises 188
CHAPTER 10 PARALLEL PROGRAMMING AND COMPUTATIONAL THINKING 191
10.1 Goals of Parallel Programming 192
10.2 Problem Decomposition 193
10.3 Algorithm Selection 196
10.4 Computational Thinking 202
10.5 Exercises 204
CHAPTER 11 A BRIEF INTRODUCTION TO OPENCL™ 205
11.1 Background 205
11.2 Data Parallelism Model 207
11.3 Device Architecture 209
11.4 Kernel Functions 211
11.5 Device Management and Kernel Launch 212
11.6 Electrostatic Potential Map in OpenCL 214
11.7 Summary 219
11.8 Exercises 220
CHAPTER 12 CONCLUSION AND FUTURE OUTLOOK 221
12.1 Goals Revisited 221
12.2 Memory Architecture Evolution 223
12.2.1 Large Virtual and Physical Address Spaces 223
12.2.2 Unified Device Memory Space 224
12.2.3 Configurable Caching and Scratch Pad 225
12.2.4 Enhanced Atomic Operations 226
12.2.5 Enhanced Global Memory Access 226
12.3 Kernel Execution Control Evolution 227
12.3.1 Function Calls within Kernel Functions 227
12.3.2 Exception Handling in Kernel Functions 227
12.3.3 Simultaneous Execution of Multiple Kernels 228
12.3.4 Interruptible Kernels 228
12.4 Core Performance 229
12.4.1 Double-Precision Speed 229
12.4.2 Better Control Flow Efficiency 229
12.5 Programming Environment 230
12.6 A Bright Outlook 230
APPENDIX A MATRIX MULTIPLICATION HOST-ONLY VERSION SOURCE CODE 233
A.1 matrixmul.cu 233
A.2 matrixmul_gold.cpp 237
A.3 matrixmul.h 238
A.4 assist.h 239
A.5 Expected Output 243
APPENDIX B GPU COMPUTE CAPABILITIES 245
B.1 GPU Compute Capability Tables 245
B.2 Memory Coalescing Variations 246
Index 251
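The heart of this book is its CUDA material: program structure and data transfer (Chapter 3), kernel launch (Section 3.6.2), and thread organization via blockIdx and threadIdx (Chapter 4). As a small taste of that material, here is a minimal vector-add sketch in CUDA. This is an illustrative example, not code from the book; the kernel name, array size, and block size are arbitrary choices.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel function (Ch. 3.5): each thread computes one output element,
// using the predefined variables blockIdx, blockDim, threadIdx (Ch. 4.2).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard against out-of-range threads
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Device memory allocation and host-to-device transfer (Ch. 3.4).
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Kernel launch configuration (Ch. 3.6.2, Ch. 4.1): enough blocks
    // of 256 threads to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Device-to-host transfer of the result.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);   // a[10] + b[10] = 10 + 20 = 30

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Compiled with `nvcc`, this pattern (allocate, copy in, launch, copy out, free) is the host-side skeleton the book builds on from Chapter 3 onward.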