Book Information

网站运维工程 (Site Reliability Engineering), Reprint Edition

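Edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy
Publisher: Nanjing: Southeast University Press
ISBN: 9787564172961
Publication year: 2018
Stated page count: 528
File size: 75 MB
File page count: 554
Subject headings: Website construction - English; Websites - Maintenance - English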

Table of Contents

Part Ⅰ．Introduction 3

1．Introduction 3

The Sysadmin Approach to Service Management 3

Google’s Approach to Service Management：Site Reliability Engineering 5

Tenets of SRE 7

The End of the Beginning 12

2．The Production Environment at Google, from the Viewpoint of an SRE 13

Hardware 13

System Software That “Organizes” the Hardware 15

Other System Software 18

Our Software Infrastructure 19

Our Development Environment 19

Shakespeare：A Sample Service 20

Part Ⅱ．Principles 25

3．Embracing Risk 25

Managing Risk 25

Measuring Service Risk 26

Risk Tolerance of Services 28

Motivation for Error Budgets 33

4．Service Level Objectives 37

Service Level Terminology 37

Indicators in Practice 40

Objectives in Practice 43

Agreements in Practice 47

5．Eliminating Toil 49

Toil Defined 49

Why Less Toil Is Better 51

What Qualifies as Engineering? 52

Is Toil Always Bad? 52

Conclusion 54

6．Monitoring Distributed Systems 55

Definitions 55

Why Monitor? 56

Setting Reasonable Expectations for Monitoring 57

Symptoms Versus Causes 58

Black-Box Versus White-Box 59

The Four Golden Signals 60

Worrying About Your Tail (or, Instrumentation and Performance) 61

Choosing an Appropriate Resolution for Measurements 62

As Simple as Possible, No Simpler 62

Tying These Principles Together 63

Monitoring for the Long Term 64

Conclusion 66

7．The Evolution of Automation at Google 67

The Value of Automation 67

The Value for Google SRE 70

The Use Cases for Automation 70

Automate Yourself Out of a Job：Automate ALL the Things! 73

Soothing the Pain：Applying Automation to Cluster Turnups 75

Borg：Birth of the Warehouse-Scale Computer 81

Reliability Is the Fundamental Feature 83

Recommendations 84

8．Release Engineering 87

The Role of a Release Engineer 87

Philosophy 88

Continuous Build and Deployment 90

Configuration Management 93

Conclusions 95

9．Simplicity 97

System Stability Versus Agility 97

The Virtue of Boring 98

I Won’t Give Up My Code! 98

The “Negative Lines of Code” Metric 99

Minimal APIs 99

Modularity 100

Release Simplicity 100

A Simple Conclusion 101

Part Ⅲ．Practices 107

10．Practical Alerting from Time-Series Data 107

The Rise of Borgmon 108

Instrumentation of Applications 109

Collection of Exported Data 110

Storage in the Time-Series Arena 111

Rule Evaluation 114

Alerting 118

Sharding the Monitoring Topology 119

Black-Box Monitoring 120

Maintaining the Configuration 121

Ten Years On... 122

11．Being On-Call 125

Introduction 125

Life of an On-Call Engineer 126

Balanced On-Call 127

Feeling Safe 128

Avoiding Inappropriate Operational Load 130

Conclusions 132

12．Effective Troubleshooting 133

Theory 134

In Practice 136

Negative Results Are Magic 144

Case Study 146

Making Troubleshooting Easier 150

Conclusion 150

13．Emergency Response 151

What to Do When Systems Break 151

Test-Induced Emergency 152

Change-Induced Emergency 153

Process-Induced Emergency 155

All Problems Have Solutions 158

Learn from the Past. Don’t Repeat It. 158

Conclusion 159

14．Managing Incidents 161

Unmanaged Incidents 161

The Anatomy of an Unmanaged Incident 162

Elements of Incident Management Process 163

A Managed Incident 165

When to Declare an Incident 166

In Summary 166

15．Postmortem Culture：Learning from Failure 169

Google’s Postmortem Philosophy 169

Collaborate and Share Knowledge 171

Introducing a Postmortem Culture 172

Conclusion and Ongoing Improvements 175

16．Tracking Outages 177

Escalator 178

Outalator 178

17．Testing for Reliability 183

Types of Software Testing 185

Creating a Test and Build Environment 190

Testing at Scale 192

Conclusion 204

18．Software Engineering in SRE 205

Why Is Software Engineering Within SRE Important? 205

Auxon Case Study：Project Background and Problem Space 207

Intent-Based Capacity Planning 209

Fostering Software Engineering in SRE 218

Conclusions 222

19．Load Balancing at the Frontend 223

Power Isn’t the Answer 223

Load Balancing Using DNS 224

Load Balancing at the Virtual IP Address 227

20．Load Balancing in the Datacenter 231

The Ideal Case 232

Identifying Bad Tasks：Flow Control and Lame Ducks 233

Limiting the Connections Pool with Subsetting 235

Load Balancing Policies 240

21．Handling Overload 247

The Pitfalls of “Queries per Second” 248

Per-Customer Limits 248

Client-Side Throttling 249

Criticality 251

Utilization Signals 253

Handling Overload Errors 253

Load from Connections 257

Conclusions 258

22．Addressing Cascading Failures 259

Causes of Cascading Failures and Designing to Avoid Them 260

Preventing Server Overload 265

Slow Startup and Cold Caching 274

Triggering Conditions for Cascading Failures 276

Testing for Cascading Failures 278

Immediate Steps to Address Cascading Failures 280

Closing Remarks 283

23．Managing Critical State：Distributed Consensus for Reliability 285

Motivating the Use of Consensus：Distributed Systems Coordination Failure 288

How Distributed Consensus Works 289

System Architecture Patterns for Distributed Consensus 291

Distributed Consensus Performance 296

Deploying Distributed Consensus-Based Systems 304

Monitoring Distributed Consensus Systems 312

Conclusion 313

24．Distributed Periodic Scheduling with Cron 315

Cron 315

Cron Jobs and Idempotency 316

Cron at Large Scale 317

Building Cron at Google 319

Summary 326

25．Data Processing Pipelines 327

Origin of the Pipeline Design Pattern 327

Initial Effect of Big Data on the Simple Pipeline Pattern 328

Challenges with the Periodic Pipeline Pattern 328

Trouble Caused By Uneven Work Distribution 328

Drawbacks of Periodic Pipelines in Distributed Environments 329

Introduction to Google Workflow 333

Stages of Execution in Workflow 335

Ensuring Business Continuity 337

Summary and Concluding Remarks 338

26．Data Integrity：What You Read Is What You Wrote 339

Data Integrity’s Strict Requirements 340

Google SRE Objectives in Maintaining Data Integrity and Availability 344

How Google SRE Faces the Challenges of Data Integrity 349

Case Studies 360

General Principles of SRE as Applied to Data Integrity 367

Conclusion 368

27．Reliable Product Launches at Scale 369

Launch Coordination Engineering 370

Setting Up a Launch Process 372

Developing a Launch Checklist 375

Selected Techniques for Reliable Launches 380

Development of LCE 384

Conclusion 387

Part Ⅳ．Management 391

28．Accelerating SREs to On-Call and Beyond 391

You’ve Hired Your Next SRE(s), Now What? 391

Initial Learning Experiences：The Case for Structure Over Chaos 394

Creating Stellar Reverse Engineers and Improvisational Thinkers 397

Five Practices for Aspiring On-Callers 400

On-Call and Beyond：Rites of Passage, and Practicing Continuing Education 406

Closing Thoughts 406

29．Dealing with Interrupts 407

Managing Operational Load 408

Factors in Determining How Interrupts Are Handled 408

Imperfect Machines 409

30．Embedding an SRE to Recover from Operational Overload 417

Phase 1：Learn the Service and Get Context 418

Phase 2：Sharing Context 420

Phase 3：Driving Change 421

Conclusion 423

31．Communication and Collaboration in SRE 425

Communications：Production Meetings 426

Collaboration within SRE 430

Case Study of Collaboration in SRE：Viceroy 432

Collaboration Outside SRE 437

Case Study：Migrating DFP to F1 437

Conclusion 440

32．The Evolving SRE Engagement Model 441

SRE Engagement：What, How, and Why 441

The PRR Model 442

The SRE Engagement Model 443

Production Readiness Reviews：Simple PRR Model 444

Evolving the Simple PRR Model：Early Engagement 448

Evolving Services Development：Frameworks and SRE Platform 451

Conclusion 456

Part Ⅴ．Conclusions 459

33．Lessons Learned from Other Industries 459

Meet Our Industry Veterans 460

Preparedness and Disaster Testing 462

Postmortem Culture 465

Automating Away Repetitive Work and Operational Overhead 467

Structured and Rational Decision Making 469

Conclusions 470

34．Conclusion 473

A．Availability Table 477

B．A Collection of Best Practices for Production Services 479

C．Example Incident State Document 485

D．Example Postmortem 487

E．Launch Coordination Checklist 493

F．Example Production Meeting Minutes 497

Bibliography 501

Index 513