SWEL: Hardware Cache Coherence Protocols to Map Shared Data onto Shared Caches

Seth H. Pugsley, Josef B. Spjut, David W. Nellans, Rajeev Balasubramonian
School of Computing, University of Utah
{pugsley, sjosef, dnellans, ...}

ABSTRACT

Snooping and directory-based coherence protocols have become the de facto standard in chip multi-processors, but neither design is without drawbacks. Snooping protocols are not scalable, while directory protocols incur directory storage overhead and frequent indirection, and are more prone to design bugs. In this paper, we propose a novel coherence protocol that greatly reduces the number of coherence operations and falls back on a simple broadcast-based snooping protocol when infrequent coherence is required. This new protocol is based on the premise that most blocks are either private to a core or read-only, and hence do not require coherence. This will be especially true for future large-scale multi-core machines that will be used to execute message-passing workloads in the HPC domain, or multiple virtual machines for servers. In such systems, it is expected that a very small fraction of blocks will be both shared and frequently written, hence the need to optimize coherence protocols for a new common case.

In our new protocol, dubbed SWEL (after its protocol states: Shared, Written, Exclusivity Level), the L1 cache attempts to store only private or read-only blocks, while shared and written blocks must reside at the shared L2 level. These determinations are made at runtime without software assistance. While accesses to blocks banished from the L1 become more expensive, SWEL can improve throughput because directory indirection is removed for many common write-sharing patterns. Compared to a MESI-based directory implementation, we see up to 15% increased performance, a maximum degradation of 2%, and an average performance increase of 2.5% using SWEL and its derivatives. Other advantages of this strategy are reduced protocol complexity (achieved by reducing transient states) and significantly lower storage overhead than traditional directory protocols.

Categories and Subject Descriptors

B.3.2 [Memory Structures]: Design Styles: Cache Memories

This work was supported in part by NSF grants CCF , CCF , CCF , NSF CAREER award CCF , SRC grant , Intel, HP, and the University of Utah.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PACT '10, September 11-15, 2010, Vienna, Austria. Copyright 2010 ACM /10/09...$

General Terms

Design, Performance, Experimentation

Keywords

Cache Coherence, Shared Memory

1. INTRODUCTION

It is expected that multi-core processors will continue to support cache coherence in the future. Cache coherence protocols have been well studied in the multi-socket multiprocessor era [9], and several snooping-based and directory-based protocols have emerged as clear favorites. Many of these existing solutions are being directly employed in modern multi-core machines. However, we believe that there are several differences between the traditional workloads that execute on modern multiprocessors and the workloads that will be designed for future many-core machines.
Expensive multi-socket systems with hardware cache coherence were designed with specific shared-memory applications in mind, i.e., they were not intended as general-purpose desktop machines that execute a myriad of single- and multi-threaded applications. Many of the applications running on today's multi-core machines are still single-threaded and do not explicitly rely on cache coherence. Further, future many-cores in servers and datacenters will likely execute multiple VMs (each possibly executing a multi-programmed workload), with no data sharing between VMs, again obviating the need for cache coherence. We are not claiming that there will be zero data sharing and zero multi-threaded applications on future multi-cores; we are simply claiming that the percentage of cycles attributed to shared-memory multi-threaded execution (that truly needs cache coherence) will be much lower in the many-core era than it ever was with traditional hardware cache-coherent multiprocessors. If the new common-case workload in many-core machines does not need cache coherence, a large investment (in terms of die area and design effort) in the cache coherence protocol cannot be justified.

Continuing this line of reasoning, we note that many-core processors will also be used in the high performance computing (HPC) domain, where highly optimized legacy MPI applications are the common case. Data sharing in these programs is done by passing messages, not directly through shared memory. However, on multi-core platforms these messages are passed through shared-memory buffers, and their use shows a strong producer-consumer sharing pattern. State-of-the-art directory-based cache coherence protocols, which are currently employed in large-scale multi-cores, are highly inefficient when handling producer-consumer sharing. This is because of the indirection introduced by the directory: the producer requires three serialized messages to complete its operation, and the consumer also requires three serialized messages. Thus, directory protocols are likely to be highly inefficient for both the on-chip and off-chip sharing patterns that are becoming common in large-scale multi-cores.

Given the dramatic shift in workloads, there is a need to reconsider the design of coherence protocols; new coherence protocols must be designed to optimize for a new common case. First, we must optimize for the producer-consumer sharing pattern. Second, if most blocks will only be touched by a single core, the storage overhead of traditional directories that track large sharing vectors is over-provisioned and should be eliminated. Finally, we need simpler protocols that can lower design and verification effort when scaling out.

In this paper, we propose a novel hardware cache coherence protocol that tries to achieve the above goals. Our protocol (named SWEL after the protocol states) is based on the premise that a large fraction of blocks are either private to a core or are shared by multiple cores in read-only mode. Such blocks do not require any cache coherence. Blocks must be both shared and written for coherence operations to be necessary. Key examples of such blocks are those used in producer-consumer relationships. We recognize that blocks of this type are best placed in the nearest shared cache in the memory hierarchy, eliminating the need for constant, expensive coherence invalidations and updates between local private caches.
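To make the indirection cost concrete, the sketch below counts the serialized messages in a single producer-consumer handoff under a directory protocol and under a SWEL-style arrangement in which the block already resides in the shared L2. The directory breakdown follows the three-messages-per-operation description above; the shared-L2 breakdown, the function names, and the per-hop latency are illustrative assumptions of ours, not figures from the paper.

```python
# Counting serialized messages in one producer-consumer handoff. The directory
# breakdown follows the three-message-per-operation description in the text;
# the shared-L2 breakdown and the per-hop latency are illustrative assumptions.

HOP_LATENCY_CYCLES = 20  # hypothetical on-chip network hop cost

def directory_handoff():
    """One producer write plus one consumer read through a directory."""
    producer_write = [
        "producer L1 -> directory: write/upgrade request",
        "directory   -> consumer L1: invalidate",
        "consumer L1 -> directory/producer: acknowledgment",
    ]
    consumer_read = [
        "consumer L1 -> directory: read request",
        "directory   -> producer L1: forward request",
        "producer L1 -> consumer L1: data",
    ]
    return producer_write + consumer_read

def shared_l2_handoff():
    """Same handoff once the block has been relegated to the shared L2."""
    return [
        "producer L1 -> L2 bank: write-through",
        "consumer L1 -> L2 bank: read request",
        "L2 bank     -> consumer L1: data",
    ]

for name, msgs in (("directory", directory_handoff()), ("shared L2", shared_l2_handoff())):
    print(f"{name:9s}: {len(msgs)} messages, ~{len(msgs) * HOP_LATENCY_CYCLES} cycles if fully serialized")
```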
By eliminating the traditional coherence invalidate/update pattern, we can avoid implementing a costly sharer-tracking coherence mechanism. Instead, we propose a simpler mechanism that categorizes blocks into one of only two categories (private or read-only vs. shared and written). Traditional directory overheads are now replaced with the book-keeping state required to achieve this categorization. This new protocol therefore has lower storage overhead and fewer transient states. The protocol does have some disadvantages, borne out of the fact that some blocks are relegated to a slower, shared cache level (L2 in this paper) and are therefore more expensive to access. Our results show that on average a SWEL-based protocol can outperform a traditional MESI directory-based protocol by 2.5% on multi-threaded workloads from PARSEC, SPLASH-2, and NAS. This improvement is accompanied by lower storage overhead and design effort.

In Section 2 of this paper, we discuss the background of coherence protocols, their purposes, and the motivation for the features SWEL offers. Section 3 outlines the SWEL protocol, and we discuss its operation, its drawbacks, and how some of those drawbacks can be addressed. In Section 4 we discuss the theoretical differences between the performance of MESI and SWEL, and the circumstances under which each should exhibit its best and worst performance. Sections 5 and 6 deal with our simulation methodology and results. In Section 7 we discuss related work, and in Section 8 we wrap up our discussion of the SWEL protocol.

2. BACKGROUND & MOTIVATION

2.1 Data Sharing in Multi-threaded Workloads

All cache coherence protocols operate on the basic assumption that all data may be shared at any time, and measures need to be taken at every step to ensure that correctness is enforced when this sharing occurs. Traditional coherence systems over-provision for the event that all data is shared by every processor at the same time, which is an extremely unlikely scenario. While private data will never require coherence operations, shared data may or may not require coherence support. Shared data can be broken down into two classes: read-only and read-write. Shared, read-only blocks are not complicated to handle efficiently, as simple replication of the data is sufficient. Shared data that is also written, however, must be handled with care to guarantee that correctness and consistency are maintained between cores.

Figure 1 shows the sharing profile of several 16-threaded applications from the NAS Parallel Benchmarks [4], PARSEC [5], and SPLASH-2 [22] suites by (a) location and (b) references.

[Figure 1: Motivational data: memory sharing profile of 16-core/thread workloads. (a) Sharing profile by memory locations. (b) Sharing profile by memory references.]

Breaking down sharing by locations and by references provides two different views of how sharing occurs. Figure 1a indicates that very little data is actually shared by two or more cores; on average, 77.0% of all memory locations are touched by only a single processor. Figure 1b, however, shows that in terms of memory references, private data locations are accessed very infrequently (only 12.9% of all accesses). This implies that the vast majority of accesses in workload execution reference a very small fraction of the total memory locations. While the majority of accesses are to locations that are shared, very few of those locations (7.5% on average) are both shared and written, the fundamental property on which we base this work.
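The classification underlying this profile is simple: a location is private if only one core touches it, shared read-only if multiple cores touch it but none writes it, and shared read-write otherwise. The sketch below applies that rule to a toy access trace; the trace format and function name are our own illustration, not the paper's measurement infrastructure.

```python
from collections import defaultdict

# Bin each memory location as private, shared read-only, or shared read-write
# from a trace of (core, address, is_write) records. Illustrative only.

def classify_locations(trace):
    readers_writers = defaultdict(lambda: (set(), set()))  # addr -> (readers, writers)
    for core, addr, is_write in trace:
        readers, writers = readers_writers[addr]
        (writers if is_write else readers).add(core)
    profile = {"private": 0, "shared_read_only": 0, "shared_read_write": 0}
    for readers, writers in readers_writers.values():
        sharers = readers | writers
        if len(sharers) <= 1:
            profile["private"] += 1
        elif not writers:
            profile["shared_read_only"] += 1
        else:
            profile["shared_read_write"] += 1
    return profile

# Address 0x10 is written by core 0 and read by core 1, so it is the only
# location here that is both shared and written.
trace = [(0, 0x10, True), (1, 0x10, False), (0, 0x20, False), (1, 0x30, False)]
print(classify_locations(trace))  # {'private': 2, 'shared_read_only': 0, 'shared_read_write': 1}
```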
Because shared/written data is a fundamental synchronization overhead that limits application scalability, we expect future workloads to try to minimize these accesses even further.

2.2 Coherence Protocols

For this study, we assume an on-chip cache hierarchy where each core has private L1 instruction and data caches and a shared L2 cache. The L2 cache is physically distributed on-chip such that each processor tile includes a bank of the L2 cache. We assume an S-NUCA [16] style L2 cache, where the address is enough to identify the bank that contains the relevant data. Our focus is the implementation of hardware coherence among the many private L1s and the shared L2, though our proposal could easily be extended to handle a multi-level hierarchy.

Coherence is typically implemented with snooping or directory-based protocols [9]. Bus-based snooping protocols are generally simpler to design, but are not scalable because of the shared bus. Directory-based protocols will likely have to be employed for many-core architectures of the future. As a baseline throughout this study, we employ a MESI directory-based, invalidation-based coherence protocol. The salient features of this protocol are as follows:

Directory storage: Each cache block in L2 must maintain a directory to keep track of how the block is being shared by the L1s. In an unoptimized design, each L2 cache block maintains a bit per L1 cache in the directory. Each bit denotes whether the corresponding L1 has a copy of this cache block. The directory includes additional bits per cache block to denote the state of the cache block. Thus, the directory grows linearly with the number of cores (or private L1s). This storage overhead can be reduced by maintaining a bit per group of cores [9, 17]. If a bit is set in the directory, it means that one of the cores in that group has a copy of the block. Therefore, when invalidating a cache block, the message must be broadcast to all the cores in a group marked as a sharer. This trades off some directory storage overhead for a greater number of on-chip network messages. It must also be noted that each L1 cache block requires two bits to track one of the four MESI coherence states.

Indirection: All L1 misses must contact the directory in L2 before being serviced. When performing a write, the directory must often be contacted, and it must send out invalidations to other sharers. The write can proceed only after acknowledgments are received from all sharers. Thus, multiple messages must be exchanged on the network before the coherence operation is deemed complete. Similarly, when an L1 requests a block that has been written by another L1, the directory is first contacted and the request is forwarded to the L1 that has the only valid copy. Thus, many common coherence operations rely on the directory to serve as a liaison between L1 caches. Unlike a snooping-based protocol, the involved L1 caches cannot always communicate directly with each other. This indirection introduced by the directory can slow down common communication patterns. A primary example is the producer-consumer sharing pattern, where one core writes into a cache block and another core attempts to read the same cache block. As described above, each operation requires three serialized messages on the network.
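The storage argument can be made concrete with a quick calculation: a full bit-vector directory needs one sharer bit per L1 for every L2 block (plus a few state bits), a coarse vector needs one bit per group of cores, while SWEL (described in Section 3) needs a fixed 3 bits per L2 block and 1 bit per L1 block. The core counts, group size, and the 2 directory state bits in this sketch are illustrative assumptions of ours.

```python
# Per-L2-block tracking state: full bit-vector vs. coarse-vector directory vs.
# SWEL. Core counts, group size, and state_bits are illustrative choices.

def directory_bits_per_l2_block(num_cores, group_size=1, state_bits=2):
    """Full bit-vector (group_size=1) or coarse-vector sharer tracking."""
    sharer_bits = (num_cores + group_size - 1) // group_size
    return sharer_bits + state_bits

SWEL_L2_BITS = 3  # Shared, Written, Exclusivity Level
SWEL_L1_BITS = 1  # Exclusivity Level (the "EL token")

for cores in (16, 64, 256):
    full = directory_bits_per_l2_block(cores)
    coarse = directory_bits_per_l2_block(cores, group_size=4)
    print(f"{cores:4d} cores: full vector {full} bits/block, coarse(4) {coarse} bits/block, "
          f"SWEL {SWEL_L2_BITS} bits/block (+{SWEL_L1_BITS} per L1 block)")
```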
Complexity: Directory-based coherence protocols are often error-prone, and entire research communities are tackling their efficient design with formal verification. Since it is typically assumed that the network provides no ordering guarantees, a number of corner cases can emerge when handling a coherence operation. This is further complicated by the existence of transient coherence states in the directory.

In this work, we attempt to alleviate the above three negative attributes of directory-based coherence protocols.

3. SWEL PROTOCOL AND OPTIMIZATIONS

We first describe the basic workings of the SWEL protocol from a high level. The protocol design is intended to overcome three major deficiencies of a baseline directory-based protocol: the directory storage overhead, the need for indirection, and the protocol complexity. The basic premise is this: (i) many blocks do not need coherence and can be freely placed in L1 caches; (ii) blocks that would need coherence if placed in L1 are only placed in L2. Given this premise, it appears that the coherence protocol is all but eliminated. This is only partially true, as other book-keeping is now required to identify which of the above two categories a block falls into.

If a cache block is either private or read-only, then that block can be safely cached in the L1 without ever needing to worry about coherence. If the block is both shared (not private) and written (not read-only), then it must never exist in the L1 and must exist at the lowest common level of the cache hierarchy, where all cores have equal access to it without fear of ever requiring additional coherence operations. If a block is mis-categorized as read-only or as private, then it must be invalidated and evicted from all L1 caches and must reside permanently in the lowest common level of cache.

Consider from a high level how a block undergoes various transitions over its lifetime. When a block is initially read, it is brought into both L1 and L2, because at this early stage the block appears to be a private block. Some minor book-keeping is required in the L2 to keep track of the fact that only one core has ever read this block and that there have been no writes to it. If other cores read this block or if the block is ever written, then the book-keeping state is updated. When the state for a block is shared + written, the block is marked as uncacheable in L1 and an invalidation is broadcast to all private caches. All subsequent accesses to this block are serviced by the lowest common level of cache, which in our experiments and descriptions is the L2.

3.1 SWEL States and Transitions

3.1.1 States

We now explain the details of the protocol and the new states it introduces. Every L2 cache block has 3 bits of state associated with it, and every L1 cache block has 1 bit of state. The first state bit in the L2, Shared (S), keeps track of whether the block has been touched by multiple cores. The second state bit, Written (W), keeps track of whether the block has been written to. The third state bit, Exclusivity Level (EL), which is also the one state bit in the L1, denotes which cache has exclusive access to this block. The exclusivity level bit may only be set in one cache in the entire system, be it the L2 or one of the L1s; we therefore also refer to it as the EL token. The storage requirement for SWEL (3 bits per L2 block and 1 bit per L1 block) does not depend on the number of sharers or L1 caches (unlike the MESI directory protocol); it is based only on the number of cache blocks.
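As a concrete illustration of this book-keeping, the sketch below models the S, W, and EL bits for a single block and the relegation decision for shared + written data. It is our own simplified functional model under the assumptions stated in the text (a write-through L1, an inclusive hierarchy), not the authors' implementation; class and field names are ours, and the broadcast invalidation is reduced to clearing the L1 copies the model happens to track.

```python
# Simplified functional model of SWEL book-keeping for one cache block.
# Each L2 block carries Shared (S), Written (W), and Exclusivity Level (EL)
# bits; each L1 copy carries a single EL bit (the "EL token"). Illustrative.

class L2Block:
    def __init__(self):
        self.S = False        # touched by more than one core
        self.W = False        # ever written (sticky until L2 eviction)
        self.EL_at_L2 = True  # EL token currently held by the L2
        self.l1_el_holder = None   # which L1 (if any) holds the EL token
        self.l1_copies = set()     # L1s caching the block (stand-in for broadcast)
        self.l1_uncacheable = False

    def access(self, core, is_write):
        requester_has_el = (self.l1_el_holder == core)
        if is_write:
            self.W = True  # write-through message sets the sticky W bit
        if not requester_has_el and not self.EL_at_L2:
            self.S = True  # neither requester nor L2 holds EL: block is shared
        if self.EL_at_L2 and not self.l1_uncacheable:
            # first L1 to touch the block takes the EL token from the L2
            self.EL_at_L2, self.l1_el_holder = False, core
        if self.S and self.W and not self.l1_uncacheable:
            # shared + written: relegate to L2 and invalidate all L1 copies
            self.l1_uncacheable = True
            self.l1_copies.clear()
            self.l1_el_holder, self.EL_at_L2 = None, True
        elif not self.l1_uncacheable:
            self.l1_copies.add(core)

# Producer-consumer example: core 0 writes, core 1 reads -> relegated to L2.
blk = L2Block()
blk.access(core=0, is_write=True)   # private read/write, EL token at core 0
blk.access(core=1, is_write=False)  # second core: S set; S + W -> L2 only
print(blk.S, blk.W, blk.l1_uncacheable)  # True True True
```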
These 4 bits are used to represent 5 distinct states in the collapsed state diagram shown in Figure 2(a). We next walk through the various events in detail. For now, we will assume that the L1 cache is write-through and that the L1-L2 hierarchy is inclusive.

[Figure 2: Collapsed state machine diagrams for (a) the SWEL protocol and (b) the MESI protocol.]

3.1.2 Initial Access

When a data block is first read by a CPU, the block is brought into the L2 and the corresponding L1, matching the Private Read state in the diagram. The EL state bit in the L1 is set to denote that the block is exclusive to that L1. The block in L2 is set as non-shared, non-written, and not exclusive to the L2. If this block is evicted from the L1 while in this state, it sends a message back to the L2 giving up its EL bit, matching the L2 Only state in the diagram. The state in the L2 is now non-shared, non-written, and exclusive to the L2. If read again by a CPU, the block will re-enter the Private Read state.

3.1.3 Writes

When a block is first written by a CPU, assuming it was either in the L2 Only state or the Private Read state, it will enter the Private R/W state. If this initial write resulted in an L2 miss, then the block will enter this state directly. A message is sent as part of the write-through policy to the L2 so that it may update its state to set the W bit. The W bit in the L2 is sticky, and will not change until the block is evicted from the L2. This message also contains a bit that is set if the writing CPU has the EL token for that block, so that the L2 knows not to transition to a shared state. From the Private R/W state, an eviction of this block will send a message to the L2 giving back its EL bit, and the block will go back to the L2 Only state. Private data spends all of its time in these three states: L2 Only, Private Read, and Private R/W.

3.1.4 Determining the Shared State

If a cache reads or writes a block and neither that cache nor the L2 holds the corresponding EL token, then the L2 must set its S bit and enter either the Shared Read or Shared Written state. Once a block has its S bit set in the L2, that bit can never be cleared unless the block is evicted from the L2. Since this is a sticky state, the EL token ceases to hold much meaning, so it is unimportant for the EL token to be reclaimed from the L1 that holds it at this time. Additional shared readers beyond the first do not have any further impact on the L2's state.

3.1.5 L1 Invalidation and L2 Relegation

Shared Read blocks can still be cached in the L1s, but Shared R/W blocks must never be. When a block first enters the Shared R/W st