- S. Keshav, "How to Read a Paper", David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
Virtual Time and Global State in Distributed Systems
*The core papers are: 1, 2, 4, 7, 9, 10
- L. Lamport, "Time, Clocks and the Ordering of Events in a Distributed System", Communications of the ACM, 1978
- K. M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems", ACM Transactions on Computer Systems, 1985
- D. Jefferson, "Distributed Simulation and the Time Warp Operating System", ACM Symposium on Operating Systems Principles, 1987
- F. Mattern, "Virtual Time and Global States of Distributed Systems", Proc. Workshop on Parallel and Distributed Algorithms, 1989
- C. Fetzer and F. Cristian, "An optimal internal clock synchronization algorithm", COMPASS 1995
- F. Cristian and C. Fetzer, "Fault-tolerant external clock synchronization", ICDCS 1995
- A. Kshemkalyani, M. Raynal and M. Singhal, "An introduction to snapshot algorithms in distributed computing"
- C. J. Fidge, Timestamps in message-passing systems that preserve the partial ordering, Australian Computer Sci. Comm. 10 (1) (February 1988) 56-66.
- C. J. Fidge, Fundamentals of distributed system observation, IEEE Software 13 (6) (November 1996) 77-83.
- Raynal M. and Singhal M., Logical time: Capturing causality in distributed systems, Computer, vol. 29, pp. 49-56, 1996.
- C. J. Fidge, A limitation of vector timestamps for reconstructing distributed computations, Information Processing Letters, Elsevier Science, 1998, 87-91.
- Mukesh Singhal and Ajay Kshemkalyani, An efficient implementation of vector clocks, Elsevier Science Publishers.
- Facebook's Cassandra uses synchronized clocks for its 'Last Write Wins' policy for conflict resolution
- Spanner: Google’s Globally-Distributed Database estimates worst-case clock drift.
- LinkedIn's Project Voldemort uses vector clocks for versioning, conflict resolution, and repairing replicas.
- Schwarz, R. and Mattern, F. "Detecting causal relationships in distributed computations", Distributed Computing, 1994.
- Sundararaman, B., Buy, U. and Kshemkalyani, A.D. "Clock Synchronization for Wireless Sensor Networks: A Survey", Ad hoc networks 3, no. 3 (2005).
Distributed Operating Systems
- Remote Procedure Calls and Distributed Shared Memory
- A. Birrell, and B. Nelson, "Implementing remote procedure calls", ACM Transactions on Computer Systems, 1984
- P. G. Soares, "On remote procedure call", Proc. of the 1992 conference of the Centre for Advanced Studies on Collaborative research, 1992
- A. L. Ananda, B. H. Tay and E. K. Koh, "A survey of asynchronous remote procedure calls", SIGOPS Operating Systems Review, 1992
- A lecture on RPC, "http://www.cs.cf.ac.uk/Dave/C/node33.html"
- Mutual Exclusion
- G. Ricart and A. Agrawala, "An optimal algorithm for mutual exclusion in computer networks", Communications of the ACM, 1981
- L. Lamport, "The Mutual Exclusion Problem", Part I and Part II, Journal of the ACM, 1986
- L. Lamport, "A Fast Mutual Exclusion Algorithm", ACM Transactions on Computer Systems, 1987
- K. Raymond, "A Tree Based Algorithm for Distributed Mutual Exclusion", ACM Transactions on Computer Systems, 1989
- Leader Election
- H. Garcia-Molina, "Elections in a Distributed Computing System"
- Distributed Deadlocks
- A. K. Elmagarmid, "A survey of distributed deadlock detection algorithms", ACM SIGMOD, 1986
- M. Singhal, "Deadlock detection in distributed systems", IEEE Computer, 1989
- Distributed File Systems
- M. Satyanarayanan, "A Survey of Distributed File Systems", Annual Review of Computer Science, 1989
- B. Noble and M. Satyanarayanan, "An Empirical Study of a Highly Available File System", ACM Sigmetrics, 1994
- M. Spasojevic and M. Satyanarayanan, "An Empirical Study of a Wide-Area Distributed File System", ACM Transactions on Computer Systems, 1996
- J. Kubiatowicz, "OceanStore: An Architecture for Global-Scale Persistent Storage", ACM ASPLOS 2000
- S. Ghemawat, H. Gobioff and S.-T. Leung, "The Google File System", ACM SOSP, 2003
- Process Migration
- J. M. Smith, "A survey of process migration mechanisms", ACM SIGOPS Operating Systems Review, 1988
- A. Barak, O. Laden and Y. Yarom, "The NOW MOSIX and its preemptive process migration scheme", 1995.
- Processing and Load Balancing
- M. H. Willebeek-LeMair, A. P. Reeves, "Strategies for Dynamic Load Balancing on Highly Parallel Computers", IEEE Transactions on Parallel and Distributed Systems, 1993
- N. Venkatasubramanian, S. Ramanathan, "Load Management in Distributed Video Servers", ICDCS 1997
- V. Cardellini, M. Colajanni, "Dynamic Load Balancing on Web-server Systems", IEEE Internet Computing, 1999
- T. Schnekenburger, "Load Balancing in CORBA: A Survey, Response to the Aggregated Computing RFI".
- Distributed Operating Systems
- W. J. Bolosky, R. P. Draves, R. P. Fitzgerald, C. W. Fraser, M. B. Jones, T. B. Knoblock and R. Rashid, "Operating System Directions for the Next Millennium", Proc. of the 6th Workshop on Hot Topics in Operating Systems, 1997
- M. Rozier, V. Abrossimov, F. Armand et al., "Overview of the Chorus Distributed Operating System"
- Andrew S. Tanenbaum, M. Frans Kaashoek, Robert van Renesse, Henri E. Bal, "The Amoeba Distributed Operating System - A Status Report"
- Case Studies
- Distributed Computing Frameworks: DCE, "http://www.opengroup.org/dce/"
- Object-based Middleware: CORBA specification, www.omg.org
- Java
- Jini: "Architectural Overview", Sun Microsystems
- Java RMI: "Java RMI Tutorial"
- EJB: "Enterprise JavaBeans Technology", Sun Developer Network
- J2EE: "Overview", Sun Developer Network
- Service Oriented Architectures
- Web services: "Part of the lectures" by M. Fisher
- .NET: "The .NET Framework"
- SOAP: "Specification"
Messaging and Group Communication in Distributed Systems
- G. Banavar et al., "A Case for Message Oriented Middleware"
- D. Dolev and D. Malkhi, "The Transis Approach to High Availability Cluster Communication". Other interesting reading: documentation and papers about Transis are also available at "http://www.cs.huji.ac.il/labs/transis/"
- Y. Amir, et al, "Group Communication as an Infrastructure for Distributed System Management", Proc. of the 3rd Workshop on Services in Distributed and Networked Environments, 1996
- Y. Amir, et al, "The Spread Wide Area Group Communication System".
- R. van Renesse, K. P. Birman, and S. Maffeis, "Horus: A Flexible Group Communication System", Communications of the ACM, 1996
- S. Banerjee, B. Bhattacharjee and C. Kommareddy, "Scalable Application Layer Multicast", ACM SIGCOMM 2002
- Y. Amir, C. Nita-Rotaru, J. Stanton, G. Tsudik, "Secure Spread: An Integrated Architecture for Secure Group Communication", IEEE Transactions on Dependable and Secure Computing, 2005
- M. Deshpande, B. Xing, I. Lazaridis, B. Hore, N. Venkatasubramanian and S. Mehrotra, "CREW: A Gossip-based Flash-Dissemination System", ICDCS 2006
- K. Kim, N. Venkatasubramanian and S. Mehrotra, "FaReCast: Fast, Reliable Application Layer Multicast for Flash Dissemination", ACM Middleware 2010
- P. Th. Eugster et al., "The Many Faces of Publish/Subscribe"
Fault Tolerance and Reliability
- Consensus
- M. J. Fischer, N. A. Lynch, and M. S. Paterson, "Impossibility of Distributed Consensus with One Faulty Process", Journal of ACM, 1985
- D. Dolev, C. Dwork, L. Stockmeyer, "On the Minimal Synchronism Needed for Distributed Consensus", Journal of ACM, 1987
- Failure Detectors
- T. D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems", Journal of ACM, 1996
- T. D. Chandra, V. Hadzilacos and S. Toueg, "The Weakest Failure Detector for Solving Consensus", Journal of ACM, 1996
- M. K. Aguilera, W. Chen, and S. Toueg, "Heartbeat: A Timeout-Free Failure Detector for Quiescent Reliable Communication", Cornell, 1997
- Replication
- H. S. Sandhu and S. Zhou, "Cluster-based file replication in large-scale distributed systems", ACM SIGMETRICS, 1992
- J. Gray, P. Helland, P. O'Neil and D. Shasha, "The dangers of replication and a solution", ACM SIGMOD, 1996
- Logging
- A. P. Sistla and J. L. Welch, "Efficient distributed recovery using message logging", ACM SIGOPS, 1989