The directory-based cache coherence protocol for the DASH multiprocessor

Citation: Daniel Lenoski, James Pierce Laudon, Kourosh Gharachorloo, Anoop Gupta, John L. Hennessy (1990/05) The directory-based cache coherence protocol for the DASH multiprocessor. ACM SIGARCH Computer Architecture News
DOI: 10.1145/325096.325132
Download: https://dl.acm.org/doi/abs/10.1145/325096.325132
Tagged: Computer Science, Computer Architecture

Summary

Uniprocessors are inherently limited in the problem sizes they can handle, so the future lies in multiprocessor systems. These, however, distribute memory across many nodes, and cached copies of shared data must be kept consistent. DASH does this with a directory-based coherence protocol built on point-to-point messages rather than broadcast.

Theoretical and Practical Relevance

Directory-based cache coherence is essential to modern high-performance computers (supercomputers), although directory structures have since evolved to become more sophisticated. The larger Stanford DASH project proved hugely influential in academia and industry, demonstrating the viability of a cache-coherent NUMA machine and of release consistency.


Problem

  1. Uniprocessors have inherent limits on the size of problem a single one can handle, but they are easy to replicate into a multiprocessor.
  2. Multiprocessors communicate either through shared memory (with cache coherence) or through message passing.
    • Shared memory: Easier to program (no need for explicit data partitioning, especially under dynamic load distribution), but hard to scale up.
    • Message passing: Scales up better, but puts the burden of partitioning data and orchestrating communication on the programmer.
  3. How do we implement scalable shared memory?

Background

  • Memory consistency model: What values can a load return when there have been stores from multiple processors? (A two-thread sketch follows this list.)
    • Stronger models permit fewer values for each load: easier to program, harder to optimize.
  • Case study on the DASH multiprocessor (a data-structure sketch also follows this list):
    • processing node := a processor and its private cache.
    • cluster := N processing nodes (shared memory through snooping-based cache coherence), memory, a directory, and a remote access cache.
    • system := M clusters (shared memory through directory-based cache coherence) connected with a high-bandwidth, low-latency interconnect.
  • Cache coherence: either snoopy or directory-based.
    • DASH chooses a directory, because snooping depends on broadcast and so scales poorly.
    • Correctness: the protocol should guarantee the specified memory consistency model, be deadlock-free, and detect processor failures.
    • Performance metrics: latency and bandwidth (throughput).
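
To make the consistency-model bullet concrete, here is a minimal two-thread sketch in C11 (our choice of language, not the paper's). Under sequential consistency the outcome r1 == 0 and r2 == 0 is impossible; weaker models, including release consistency in the absence of fences, permit it.

  #include <stdatomic.h>
  #include <pthread.h>
  #include <stdio.h>

  atomic_int x = 0, y = 0;
  int r1, r2;

  /* Relaxed atomics model a weak memory system with no fences. */
  void *t1(void *arg) {
      (void)arg;
      atomic_store_explicit(&x, 1, memory_order_relaxed);
      r1 = atomic_load_explicit(&y, memory_order_relaxed);
      return NULL;
  }

  void *t2(void *arg) {
      (void)arg;
      atomic_store_explicit(&y, 1, memory_order_relaxed);
      r2 = atomic_load_explicit(&x, memory_order_relaxed);
      return NULL;
  }

  int main(void) {
      pthread_t a, b;
      pthread_create(&a, NULL, t1, NULL);
      pthread_create(&b, NULL, t2, NULL);
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      printf("r1=%d r2=%d\n", r1, r2); /* sequential consistency forbids r1=0 r2=0 */
      return 0;
  }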
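
A hypothetical data-structure view of the node/cluster/system hierarchy may also help; every name and size below is illustrative, not taken from the paper.

  #define NODES_PER_CLUSTER 4    /* the paper's N (illustrative value) */
  #define NUM_CLUSTERS      16   /* the paper's M (illustrative value) */

  struct directory;              /* directory state for blocks homed in a cluster */

  struct processing_node {       /* a processor and its private cache */
      int  processor_id;
      char private_cache[64 * 1024];
  };

  struct cluster {               /* kept coherent internally by bus snooping */
      struct processing_node nodes[NODES_PER_CLUSTER];
      char   memory[1 << 20];
      struct directory *dir;
      char   remote_access_cache[64 * 1024];
  };

  struct dash_system {           /* clusters kept coherent by the directory protocol */
      struct cluster clusters[NUM_CLUSTERS];
  };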

Solution

  • Release consistency (not novel): ordinary accesses may be pipelined; their completion can be delayed until the next release (a.k.a. fence). (A usage sketch follows this list.)
    • DASH provides a full fence (all pending accesses must complete) and a write fence (all pending writes must complete).
  • DASH cache coherence protocol: an invalidation-based, ownership-based protocol between clusters, layered over a snooping-based MESI protocol within each cluster. This paper is primarily about the inter-cluster protocol. (A directory sketch follows this list.)
    • States: uncached-remote (no remote cluster caches the block), shared-remote (one or more remote clusters hold clean copies), and dirty-remote (exactly one remote cluster holds a modified copy).
    • Directory: a bit vector with one presence bit per cluster, plus state bits.
      • Future work needs to reduce this space for the system to become scalable: for instance, group clusters into regions and keep one bit per region, or use a sparse directory that stores entries only for blocks that are actually cached remotely.
    • No formal verification, which is mostly a product of the time; modern model checkers make this feasible.
    • Consequently, some possibility of livelock remains.
  • No performance/scalability evaluation. This is probably also a product of the times.
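
As a hedged sketch of how the two fences might be used around a producer/consumer handoff: dash_write_fence() and dash_full_fence() below are hypothetical stand-ins for the hardware operations, not a real API, and the stubs exist only so the example compiles.

  /* Hypothetical stand-ins for DASH's fences (assumed names; stubs for illustration). */
  void dash_write_fence(void) { /* hardware: stall until all pending writes complete */ }
  void dash_full_fence(void)  { /* hardware: stall until all pending accesses complete */ }

  int shared_data;
  volatile int flag = 0;

  void producer(void) {
      shared_data = 42;    /* ordinary writes may still be in flight... */
      dash_write_fence();  /* ...so force them to complete before the release */
      flag = 1;            /* release: the consumer may now observe shared_data */
  }

  void consumer(void) {
      while (flag == 0)
          ;                /* acquire: spin until the release is visible */
      dash_full_fence();   /* conservative: order the acquire before the read */
      int v = shared_data;
      (void)v;
  }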
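
And an illustrative sketch of a directory entry and the home cluster's handling of a remote read, following the three states above. The message helpers, cluster count, and encoding are assumptions for illustration, not the paper's implementation.

  #include <stdint.h>
  #include <stdio.h>

  enum dir_state { UNCACHED_REMOTE, SHARED_REMOTE, DIRTY_REMOTE };

  struct dir_entry {
      enum dir_state state;
      uint16_t presence;   /* one presence bit per cluster (16 clusters here);
                              a per-region bit vector or a sparse directory
                              would shrink this, per the future-work bullet */
  };

  /* Hypothetical message helpers (stubs so the sketch is self-contained). */
  static void reply_with_data(int requester) {
      printf("home -> cluster %d: data\n", requester);
  }
  static void forward_to_owner(int owner, int requester) {
      printf("home -> cluster %d: forward read for cluster %d\n", owner, requester);
  }

  /* Home cluster receives a read request from a remote cluster. */
  void handle_remote_read(struct dir_entry *e, int requester) {
      switch (e->state) {
      case UNCACHED_REMOTE:
      case SHARED_REMOTE:
          reply_with_data(requester);  /* home memory holds a clean copy */
          e->presence |= (uint16_t)(1u << requester);
          e->state = SHARED_REMOTE;
          break;
      case DIRTY_REMOTE: {
          int owner = 0;               /* exactly one presence bit is set */
          while (((e->presence >> owner) & 1u) == 0)
              owner++;
          /* Point-to-point forwarding: the owning cluster replies directly
             to the requester and writes the dirty block back to home. */
          forward_to_owner(owner, requester);
          e->presence |= (uint16_t)(1u << requester);
          e->state = SHARED_REMOTE;
          break;
      }
      }
  }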