Random-access memory (RAM) comes in two varieties: static and dynamic. Static RAM (SRAM) is faster and significantly more expensive than dynamic RAM (DRAM). SRAM is used for cache memories, both on and off the CPU chip. DRAM is used for the main memory plus the frame buffer of a graphics system.
SRAM stores each bit in a bistable memory cell. The cell can stay indefinitely in either of two different voltage configurations, or states. An SRAM memory cell will retain its value indefinitely, as long as it is kept powered.
DRAM stores each bit as charge on a capacitor. A DRAM memory cell is very sensitive to any disturbance. The memory system must periodically refresh every bit of memory by reading it out and then rewriting it.
The trade-off is that SRAM cells use more transistors than DRAM cells, and thus have lower densities, are more expensive, and consume more power.
Each DRAM chip is connected to some circuitry, known as the memory controller, that can transfer w bits at a time to and from each DRAM chip. To read the contents of supercell (i, j), the memory controller sends the row address i to the DRAM, followed by the column address j. The DRAM responds by sending the contents of supercell (i, j) back to the controller. The row address i is called a RAS (row access strobe) request. The column address j is called a CAS (column access strobe) request. Notice that the RAS and CAS requests share the same DRAM address pins.
DRAM chips are packaged in memory modules that plug into expansion slots on the main system board (motherboard). Common packages include the 168-pin dual inline memory module (DIMM), which transfers data to and from the memory controller in 64-bit chunks, and the 72-pin single inline memory module (SIMM), which transfers data in 32-bit chunks. The example module stores a total of 64 MB (megabytes) using eight 64-Mbit 8M × 8 DRAM chips, numbered 0 to 7. Each supercell stores 1 byte of main memory, so the eight chips together supply one 64-bit word. To retrieve the word at memory address A, the memory controller converts A to a supercell address (i, j) and sends it to the memory module, which then broadcasts i and j to each DRAM. Main memory can be aggregated by connecting multiple memory modules to the memory controller. In this case, when the controller receives an address A, the controller selects the module k that contains A, converts A to its (i, j) form, and sends (i, j) to module k.
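The address conversion described above can be sketched in C. The 64 MB module size and the eight byte-wide chips come from the text. The 4096 × 2048 supercell geometry, the assumption that modules cover consecutive address ranges, the choice of which chip holds which byte, and the names `dram_location` and `locate` are all hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Values from the text: a 64 MB module built from eight 8M x 8 chips. */
#define MODULE_BYTES   (64u * 1024u * 1024u)
#define CHIPS_PER_MOD  8u

/* Assumed chip geometry (hypothetical): 4096 rows x 2048 columns = 8M supercells. */
#define COLS_PER_ROW   2048u

typedef struct {
    uint32_t module;  /* k: which module holds the address */
    uint32_t row;     /* i: sent first, as the RAS request */
    uint32_t col;     /* j: sent second, as the CAS request */
    uint32_t chip;    /* which chip supplies this particular byte (assumed order) */
} dram_location;

/* Hypothetical helper: map a byte address A to module k and supercell (i, j). */
dram_location locate(uint64_t addr)
{
    dram_location loc;
    uint64_t offset = addr % MODULE_BYTES;        /* byte offset inside the module */
    uint64_t supercell = offset / CHIPS_PER_MOD;  /* same (i, j) on all eight chips */

    loc.module = (uint32_t)(addr / MODULE_BYTES);
    loc.chip   = (uint32_t)(offset % CHIPS_PER_MOD);
    loc.row    = (uint32_t)(supercell / COLS_PER_ROW);
    loc.col    = (uint32_t)(supercell % COLS_PER_ROW);
    return loc;
}
```

Under these assumptions, all eight bytes of an aligned 64-bit word share the same (i, j) and differ only in the chip that supplies them, which is why the module can broadcast a single (i, j) pair to every chip.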
DRAMs and SRAMs are volatile in the sense that they lose their information if the supply voltage is turned off. Nonvolatile memories, on the other hand, retain their information even when they are powered off. They are referred to collectively as read-only memories (ROMs), even though some types of ROMs can be written to as well as read. Flash memory is a type of nonvolatile memory, based on EEPROMs (electrically erasable programmable ROMs), that has become an important storage technology.
Programs stored in ROM devices are often referred to as firmware. When a computer system is powered up, it runs firmware stored in a ROM, for example, a PC’s BIOS (basic input/output system) routines.
Data flows back and forth between the processor and the DRAM main memory over shared electrical conduits called buses. Each transfer of data between the CPU and memory is accomplished with a series of steps called a bus transaction. A read transaction transfers data from the main memory to the CPU. A write transaction transfers data from the CPU to the main memory. A bus is a collection of parallel wires that carry address, data, and control signals.
Input/output (I/O) devices such as graphics cards, monitors, mice, keyboards, and disks are connected to the CPU and main memory using an I/O bus such as Intel’s Peripheral Component Interconnect (PCI) bus.
After the disk controller receives the read command from the CPU, it translates the logical block number to a sector address, reads the contents of the sector, and transfers the contents directly to main memory, without any intervention from the CPU (Figure 6.12(b)). This process, whereby a device performs a read or write bus transaction on its own, without any involvement of the CPU, is known as direct memory access (DMA). The transfer of data is known as a DMA transfer.
A page can be written only after the entire block to which it belongs has been erased (typically this means that all bits in the block are set to 1). However, once a block is erased, each page in the block can be written once with no further erasing.
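The erase-before-write rule can be modeled with a short C sketch. The page and block sizes, the `flash_block` type, and the function names are hypothetical; real devices use much larger pages and blocks, and this model ignores wear and timing entirely.

```c
#include <stdbool.h>
#include <string.h>

#define PAGES_PER_BLOCK 4   /* hypothetical; real blocks hold many more pages */
#define PAGE_BYTES      8   /* hypothetical; real pages are far larger */

typedef struct {
    unsigned char data[PAGES_PER_BLOCK][PAGE_BYTES];
    bool written[PAGES_PER_BLOCK];  /* written since the last erase? */
} flash_block;

/* Erasing works on the whole block and sets every bit to 1. */
void block_erase(flash_block *b)
{
    memset(b->data, 0xFF, sizeof b->data);
    memset(b->written, 0, sizeof b->written);
}

/* Writing a page succeeds once per erase; a second write is rejected. */
bool page_write(flash_block *b, int page, const unsigned char *src)
{
    if (page < 0 || page >= PAGES_PER_BLOCK || b->written[page])
        return false;
    memcpy(b->data[page], src, PAGE_BYTES);
    b->written[page] = true;
    return true;
}
```

In this model, updating a page that has already been written requires erasing its whole block first, which is the constraint the paragraph above describes.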
Well-written computer programs tend to exhibit good locality. That is, they tend to reference data items that are near other recently referenced data items, or that were recently referenced themselves. This tendency is known as the principle of locality.
A function that visits each element of a vector sequentially is said to have a stride-1 reference pattern (with respect to the element size). Stride-1 reference patterns are sometimes called sequential reference patterns. Visiting every kth element of a contiguous vector is called a stride-k reference pattern.
Data is always copied back and forth between level k and level k + 1 in block-sized transfer units. While the block size is fixed between any particular pair of adjacent levels in the hierarchy, other pairs of levels can have different block sizes. Devices lower in the hierarchy (further from the CPU) have longer access times, and thus tend to use larger block sizes in order to amortize these longer access times.
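As a minimal illustration of the two reference patterns, the following C sketch sums a vector with stride 1 and with stride k. The function names are hypothetical.

```c
#include <stddef.h>

/* Stride-1 (sequential) pattern: visits v[0], v[1], v[2], ... */
long sum_stride1(const int *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Stride-k pattern: visits v[0], v[k], v[2k], ... */
long sum_stridek(const int *v, size_t n, size_t k)
{
    long sum = 0;
    if (k == 0)
        return 0;  /* a stride of 0 would never advance */
    for (size_t i = 0; i < n; i += k)
        sum += v[i];
    return sum;
}
```

As k grows, consecutive references land farther apart in memory, so spatial locality generally decreases.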
This process of overwriting an existing block is known as replacing or evicting the block. The block that is evicted is sometimes referred to as a victim block. The decision about which block to replace is governed by the cache’s replacement policy.
If the cache at level k is empty, then any access of any data object will miss. An empty cache is sometimes referred to as a cold cache, and misses of this kind are called compulsory misses or cold misses.
Hardware caches typically implement a more restricted placement policy that restricts a particular block at level k + 1 to a small subset (sometimes a singleton) of the blocks at level k. For example, we might decide that a block i at level k + 1 must be placed in block (i mod n) at level k. Restrictive placement policies of this kind lead to a type of miss known as a conflict miss, in which the cache is large enough to hold the referenced data objects, but because they map to the same cache block, the cache keeps missing.
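The (i mod n) placement rule and the conflict misses it can cause are easy to simulate. The sketch below is a toy model, not a description of any real cache: `placement_slot`, `count_misses`, and the `MAX_SLOTS` limit are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 64  /* arbitrary limit for this sketch */

/* Block i at level k+1 may be placed only in slot (i mod n) at level k. */
unsigned placement_slot(unsigned block, unsigned n)
{
    return block % n;
}

/* Count misses for a sequence of block references in a cache of n slots. */
size_t count_misses(const unsigned *refs, size_t nrefs, unsigned n)
{
    unsigned slot_block[MAX_SLOTS];
    int slot_valid[MAX_SLOTS] = {0};
    size_t misses = 0;

    assert(n > 0 && n <= MAX_SLOTS);
    for (size_t r = 0; r < nrefs; r++) {
        unsigned s = placement_slot(refs[r], n);
        if (!slot_valid[s] || slot_block[s] != refs[r]) {
            misses++;                /* cold miss or conflict miss */
            slot_block[s] = refs[r]; /* replace whatever was there */
            slot_valid[s] = 1;
        }
    }
    return misses;
}
```

With n = 4, alternating references to blocks 0 and 8 miss every time because both map to slot 0, even though three other slots stay empty. Alternating between blocks 0 and 1 misses only twice.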
Programs often run as a sequence of phases (e.g., loops) where each phase accesses some reasonably constant set of cache blocks. This set of blocks is called the working set of the phase. When the size of the working set exceeds the size of the cache, the cache will experience what are known as capacity misses. In other words, the cache is just too small to handle this particular working set.
A cache with exactly one line per set (E = 1) is known as a direct-mapped cache.
Thrashing describes any situation where a cache is repeatedly loading and evicting the same sets of cache blocks. One easy solution is to put B bytes of padding at the end of each array. For example, instead of declaring x with exactly the number of floats it needs, we declare it with B bytes of extra, unused elements at the end. With the padding at the end of x, x[i] and y[i] now map to different sets, which eliminates the thrashing conflict misses.
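The effect of the padding can be checked with a small model of a direct-mapped cache. Everything specific here is an assumption made for illustration: a block size B of 16 bytes, two sets, 4-byte floats, x starting at address 0 with y placed immediately after it, and the example lengths of 8 and 12 floats used in the note below. The function names are hypothetical.

```c
#include <stddef.h>

#define BLOCK_BYTES 16u  /* B: assumed block size */
#define NUM_SETS     2u  /* S: assumed number of sets, one line each */

/* Set index for a byte address in the assumed direct-mapped cache. */
unsigned set_index(size_t addr)
{
    return (unsigned)((addr / BLOCK_BYTES) % NUM_SETS);
}

/*
 * Count indices i in [0, n) for which x[i] and y[i] fall in the same set,
 * assuming x starts at address 0 and y starts x_len floats after it.
 */
unsigned same_set_pairs(size_t x_len, size_t n)
{
    unsigned count = 0;
    for (size_t i = 0; i < n; i++) {
        size_t x_addr = i * sizeof(float);
        size_t y_addr = (x_len + i) * sizeof(float);
        if (set_index(x_addr) == set_index(y_addr))
            count++;
    }
    return count;
}
```

Under these assumptions, `same_set_pairs(8, 8)` reports that all eight pairs collide when x holds exactly 8 floats, while `same_set_pairs(12, 8)` reports no collisions once x is padded by 16 bytes (4 floats).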
A set associative cache relaxes this constraint so each set holds more than one cache line. A cache with 1 < E < C/B is often called an E-way set associative cache. The cache must search each line in the set, looking for a valid line whose tag matches the tag in the address. A fully associative cache consists of a single set (i.e., E = C/B) that contains all of the cache lines.
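The line-matching step in a set associative cache can be sketched as follows. The parameters (B = 16, S = 4, E = 2, giving C = 128 bytes) and the type and function names are assumptions for illustration. The sketch tracks only valid bits and tags, not the cached data.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    B = 16,  /* block size in bytes (assumed) */
    S = 4,   /* number of sets (assumed) */
    E = 2    /* lines per set (assumed): a 2-way set associative cache */
};

typedef struct {
    bool valid;
    uint64_t tag;
} cache_line;

typedef struct {
    cache_line sets[S][E];
} cache;

/* Search every line of the selected set for a valid line with a matching tag. */
bool cache_hit(const cache *c, uint64_t addr)
{
    uint64_t block = addr / B;  /* drop the block offset bits */
    uint64_t set = block % S;   /* set index */
    uint64_t tag = block / S;   /* remaining high-order bits */

    for (unsigned e = 0; e < E; e++) {
        if (c->sets[set][e].valid && c->sets[set][e].tag == tag)
            return true;
    }
    return false;
}
```

Setting E to 1 turns this into the direct-mapped case, and setting S to 1 makes it fully associative, since every line then belongs to the single set that is searched.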
Suppose we write a word w that is already cached. The simplest approach, known as write-through, is to immediately write w’s cache block to the next lower level. While simple, write-through has the disadvantage of causing bus traffic with every write. Another approach, known as write-back, defers the update as long as possible by writing the updated block to the next lower level only when it is evicted from the cache by the replacement algorithm. The cache must maintain an additional dirty bit for each cache line that indicates whether or not the cache block has been modified.
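The difference in traffic between the two policies can be seen in a toy model of a single cache line. This sketch assumes that a write to an uncached block first brings that block into the line (write-allocate), and it ignores reads entirely. The `line_sim` type and function names are hypothetical.

```c
#include <stdbool.h>

typedef struct {
    bool valid;
    bool dirty;             /* used only by write-back */
    unsigned block;         /* which block the line currently holds */
    unsigned lower_writes;  /* writes sent to the next lower level */
} line_sim;

/* Write-through: every write also goes to the next lower level. */
void write_through(line_sim *l, unsigned block)
{
    l->valid = true;
    l->block = block;
    l->lower_writes++;
}

/* Write-back: mark the line dirty; write a block down only when it is evicted. */
void write_back(line_sim *l, unsigned block)
{
    if (l->valid && l->block != block && l->dirty)
        l->lower_writes++;  /* the evicted block was modified */
    l->valid = true;
    l->block = block;
    l->dirty = true;
}
```

Ten consecutive writes to the same block cost ten lower-level writes under write-through and none under write-back until a different block evicts the dirty line.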
Here is the basic approach we use to try to ensure that our code is cache friendly.
1. Make the common case go fast. Programs often spend most of their time in a few core functions. These functions often spend most of their time in a few loops. So focus on the inner loops of the core functions and ignore the rest.
2. Minimize the number of cache misses in each inner loop. All other things being equal, such as the total number of loads and stores, loops with better miss rates will run faster.
By measuring read throughput across a range of working set sizes and strides, we can recover a fascinating two-dimensional function of read throughput versus temporal and spatial locality. This function is called a memory mountain.
- Focus your attention on the inner loops, where the bulk of the computations and memory accesses occur.
- Try to maximize the spatial locality in your programs by reading data objects sequentially, with stride 1, in the order they are stored in memory.
- Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory.
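These recommendations can be illustrated with two ways of summing a two-dimensional array. C stores arrays in row-major order, so the first version reads memory with stride 1 while the second jumps a full row between references. The array dimensions and function names are arbitrary choices for this sketch.

```c
#define ROWS 4
#define COLS 8

/* Stride-1: the inner loop walks along a row, matching the row-major layout. */
int sum_rows(int a[ROWS][COLS])
{
    int sum = 0;  /* reused on every iteration: good temporal locality */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Stride-COLS: the inner loop jumps a full row between references. */
int sum_cols(int a[ROWS][COLS])
{
    int sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions return the same result. On arrays too large to fit in the cache, the row-wise version would be expected to incur fewer misses because each fetched block is fully used before the loop moves on.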