Oppenheimer et al. study the failures of three large Internet service organizations: one providing mostly online services, one primarily hosting content, and one serving mostly reads. Their study shows that human operator error and custom software maintenance are the leading causes of significant service failures for two of the three organizations. This is not because operator faults occur more frequently, but because they are less isolated and much more difficult to mask. For the third organization, network faults are the most significant cause of service failures due to the lower degree of redundancy in network components.

The strength of this paper is that it provides a new vantage point for mitigating large-scale service failures. In most papers, service failure mitigation techniques are strictly technology-oriented. Oppenheimer et al., however, suggest that current fault-tolerant technologies are "good enough" and that attention should shift to reducing human operator faults. The paper notes that a human operator's environment is often too complex and error-prone. The authors propose that fundamental advances in an operator's environment (intelligent consistency-checking tools, better understanding of system architecture, etc.) will go a long way toward mitigating service failures.

Although the paper makes good arguments, it lacks significant concrete evidence for them. Two out of three organizations is not enough to draw general conclusions; the paper reads more like a case study. Furthermore, the results rely on logs and written documentation that the authors admit are not always written correctly. Overall, the study leans on considerable speculation and subjective analysis. Still, the idea that human faults are the "weakest link" in a complex large-scale system is not unreasonable. This is easily overlooked when designing complex systems, and the paper does a good job of bringing it to light.