Skip to content
CSEE 4340 Final Review

L15 · Virtual Memory

L15 · Virtual Memory

Topic: memory · 32 pages

CSEE 4340 Lecture 15 — Virtual Memory

CSEE 4340 Lecture 15 — Virtual Memory (slide 1)
Slide 1

Announcements

Announcements (slide 2)
Slide 2
Show slide text

Announcements

  • Project #4
    • Due: May 1, 11:59 PM Eastern Time
  • Office hours
    • End this week
  • Final project presentation
    • May 4, 2026
    • 8:40 to 9:50 AM
    • 8 minutes per group with 2 minutes for questions & answers
    • Schedule and example slides on Courseworks

https://courseworks2.columbia.edu/courses/237081

Readings

Readings (slide 3)
Slide 3
Show slide text

Readings

For today:

Recap: Prefetching

Recap: Prefetching (slide 4)
Slide 4

What is Prefetching?

What is Prefetching? (slide 5)
Slide 5
Show slide text

What is Prefetching?

  • Fetch memory ahead of time
  • Targets compulsory, capacity, & coherence misses

Big challenges:

  1. knowing “what” to fetch
    • Fetching useless info wastes valuable resources
  2. “when” to fetch it
    • Fetching too early clutters storage
    • Fetching too late defeats the purpose of “pre”-fetching

This single-slide recap sets the stage by reminding us that hardware prefetching is a speculative miss-hiding technique whose effectiveness hinges on two orthogonal questions: what address to fetch and when to issue the request. Both can fail. Fetching the wrong block wastes memory bandwidth and can pollute the cache by evicting useful lines (a useless prefetch is worse than no prefetch). Fetching at the wrong time also fails: a prefetch that arrives after the demand reference is too late to hide the miss, while a prefetch that arrives too early can be evicted before its consumer issues. The slide notes that prefetching attacks the three main miss categories from L13 — compulsory (first reference), capacity (working set exceeds cache), and coherence (another core invalidated the line) — but cannot help conflict misses, which are an associativity problem rather than a temporal one. With this recap finished, the lecture pivots from the data side of the cache hierarchy to the address side: how does a process even produce the addresses the cache sees?

Improving Cache Performance: Summary

Improving Cache Performance: Summary (slide 6)
Slide 6
Show slide text

Improving Cache Performance: Summary

Miss rate

  • large block size
  • higher associativity
  • victim caches
  • skewed-/pseudo-associativity
  • hardware/software prefetching
  • compiler optimizations

Miss penalty

  • give priority to read misses over writes/writebacks
  • subblock placement
  • early restart and critical word first
  • non-blocking caches
  • multi-level caches

Hit time (difficult?)

  • small and simple caches
  • avoiding translation during L1 indexing (later)
  • pipelining writes for fast write hits
  • subblock placement for fast write hits in write through caches

This is a wrap-up of the cache-optimization toolbox accumulated across L12–L14, organized by which term of AMAT = hit time+miss rate×miss penalty\text{hit time} + \text{miss rate} \times \text{miss penalty} each technique attacks. Reducing miss rate uses larger blocks (better spatial locality up to a point), higher associativity (fewer conflict misses), victim and pseudo-associative caches, prefetching, and compiler-level layout/loop transformations. Reducing miss penalty overlaps misses with useful work: prioritize critical reads over background writebacks, use sub-block valid bits, restart the pipeline as soon as the requested word arrives (early restart / critical-word-first), allow misses to proceed in parallel (non-blocking lockup-free caches), and add levels (L2/L3). Reducing hit time is the hardest because L1 sits on the critical path: keep the L1 small and direct-mapped-ish, avoid the address-translation step in the indexing path (the bullet “later” — that’s exactly what the rest of this lecture builds toward with virtually-indexed caches), and pipeline writes through a store buffer.

Virtual Memory

Virtual Memory (slide 7)
Slide 7

2 Parts to Modern VM

2 Parts to Modern VM (slide 8)
Slide 8
Show slide text

2 Parts to Modern VM

VM provides each process with the illusion of a large, private, uniform memory

Part A: Protection

  • each process sees a large, contiguous memory segment without holes
  • each process’s memory space is private, i.e. protected from access by other processes

Part B: Demand Paging

  • capacity of secondary memory (swap space on disk)
  • at the speed of primary memory (DRAM)

Based on a common HW mechanism: address translation

  • user process operates on “virtual” or “effective” addresses
  • HW translates from virtual to physical on each reference
    • controls which physical locations can be named by a process
    • allows dynamic relocation of physical backing store (DRAM vs. HD)
  • VM HW and memory management policies controlled by the OS

This is the foundational slide of the lecture. Virtual memory does two distinct jobs that just happen to share one mechanism. Part A — protection gives every process an isolated address space: each program sees a flat, contiguous range starting at 00, with no holes punched out for the kernel or for other processes, and no ability to read or write another process’s memory. Part B — demand paging uses disk as a backing store so the working set in DRAM can be smaller than the program’s memory footprint; the hardware faults when a referenced page is not resident, the OS fetches it from swap, and execution resumes transparently. Both jobs are implemented with a single primitive: hardware address translation on every load and store. The user program emits virtual (a.k.a. effective) addresses; the MMU rewrites each one to a physical address before it reaches DRAM. Two consequences fall out: (1) the OS controls which physical frames a process can name (protection), and (2) the OS can relocate the physical backing for a virtual page at will — including moving it to disk — without the user program noticing (demand paging). The OS owns the policy; hardware enforces it.

Evolution of Protection Mechanisms

Evolution of Protection Mechanisms (slide 9)
Slide 9
Show slide text

Evolution of Protection Mechanisms

Earliest machines had no concept of protection and address translation

  • no need — single process, single user
  • automatically “private and uniform” (but not very large)
  • programs operated on physical addresses directly
    • no multitasking protection, no dynamic relocation (at least not very easily)

Before virtual memory existed, a single program owned the entire physical address space, so privacy and uniformity were free — there was nobody else around to violate them. The downside is that the address space was bounded by however much DRAM was physically installed (“not very large”), and migrating from one physical layout to another required relinking or relocating the binary. Once timesharing arrived, two new requirements showed up that physical addressing cannot satisfy: (a) protection between concurrently resident processes, and (b) the ability to move a process’s pages around in physical memory after it has started running. The next slides walk through three increasingly sophisticated answers — base/bound registers, segmentation, and finally paging — that each addressed part of the problem before paged virtual memory subsumed all of them.

Base and bound registers

Base and bound registers (slide 10)
Slide 10
Show slide text

Base and bound registers

In a multi-tasking system

Each process is given a non-overlapping, contiguous physical memory region, everything belonging to a process must fit in that region

When a process is swapped in, OS sets base to the start of the process’s memory region and bound to the end of the region

HW translation and protection check (on each memory reference)

PA=EA+basePA = EA + base

provided (PA<bound)(PA < bound), else violations

\Rightarrow Each process sees a private and uniform address space (0..max)(0..max)

privileged control registers

Base and Bound point to the active process’s region in physical memory; another process’s region sits below.

Bound can also be formulated as a range

Base-and-bound registers are the simplest hardware mechanism that delivers a private address space. The OS gives each process one contiguous slab of physical DRAM, then loads two privileged registers on context switch: base (the slab’s starting physical address) and bound (its end). On every memory reference, hardware adds base to the effective address — that’s the entire translation — and traps if the result exceeds bound. From the program’s point of view it has a uniform [0,max][0, \text{max}] address space; the OS relocates that range anywhere in DRAM by editing two registers. The mechanism is fast and cheap, but it has two crippling limitations that paging will eventually fix. First, a process’s entire memory image must fit in a single contiguous physical region — there is no way to split heap from stack across a fragmented free pool. Second, growing a process requires either pre-reserving slack space inside its region or copying the whole image to a larger free slot. These limitations motivate the segmented and then the paged designs in the next slides.

Segmented Address Space

Segmented Address Space (slide 11)
Slide 11
Show slide text

Segmented Address Space

segment == a base and bound pair

segmented addressing gives each process multiple segments

  • initially, separate code and data segments
    • 2 sets of base-and-bound reg’s for inst and data fetch
    • allowed sharing code segments
  • became more and more elaborate: code, data, stack, etc.
  • also (ab)used as a way for an ISA with a small EA space to address a larger physical memory space

Diagram: [SEG#EA][\text{SEG}\#\,|\,\text{EA}] \to segment table indexed by SEG# returns (base,bound)(\text{base},\text{bound}) pair, then +,<+,< produces PA\text{PA} and an okay? signal.

segment tables must be

  1. privileged data structures and
  2. private/unique to each process

Segmented addressing generalizes base-and-bound from one slab per process to many. The effective address now carries an explicit segment selector that indexes a per-process segment table; the indexed entry is itself a (base, bound) pair, and the rest of the EA acts as the offset within that segment. Two early benefits: code and data could live in different physical regions (often with different protection — read-only for code), and several processes could share a single physical code segment by having its entry appear in each of their segment tables. The mechanism became more elaborate over time (separate segments for code, initialized data, stack, BSS) and was sometimes abused — Intel’s 8086 used segments mainly to extend a 16-bit effective-address space to a 20-bit physical-address space rather than for protection. Like base-and-bound, the segment table must be a privileged structure private to each process; the OS reloads its base register on context switch. The next slide explains why segmentation is still not enough.

Paged Address Space

Paged Address Space (slide 12)
Slide 12
Show slide text

Paged Address Space

Segmented addressing creates fragmentation problems,

  • a system may have plenty of unallocated memory locations
  • they are useless if they do not form a contiguous region of a sufficient size

In a Paged Memory System:

PA space is divided into fixed size segments (e.g. 4kbyte), more commonly known as “page frames”

EA is interpreted as page number and page offset

Diagram: [Page No.Page Offset][\text{Page No.}\,|\,\text{Page Offset}] \to page table indexed by page number returns page frame base & okay?, then ++ with offset produces PA.

page tables must be

  1. privileged data structures and
  2. private/unique to each process

Variable-sized segments suffer from external fragmentation: after segments come and go, free memory ends up scattered as small holes that cannot satisfy a large allocation even though their total size would. Paging fixes this by chopping physical memory into fixed-size frames (e.g. 4 KB) and chopping each virtual address space into matching-size pages. Any virtual page can map to any physical frame; the page table for the process records the mapping. Because every frame is the same size, free physical memory never fragments — the OS just keeps a free list of frames and grabs any one when a new page is allocated. The effective address is split into a high Page No. field that indexes the page table and a low Page Offset field that is concatenated unchanged onto the looked-up page-frame base to form the physical address (the diagram draws this as + but it is logically a concatenation when page sizes are powers of two). The page table is once again a privileged, per-process structure pointed to by a hardware base register the OS reloads on context switch.

Demand Paging

Demand Paging (slide 13)
Slide 13
Show slide text

Demand Paging

Main memory and Disk as automatically managed levels in the memory hierarchies

analogous to cache vs. main memory

Drastically different size and time scales

\Rightarrow very different design decisions

Early attempts

  • von Neumann already described manual memory hierarchies
  • Brookner’s interpretive coding, 1960
    • a software interpreter that managed paging between a 40kb main memory and a 640Kb drum
  • Atlas, 1962
    • hardware demand paging between a 32-page (512 word/page) main memory and 192 pages on drums
    • user program believes it has 192 pages

Demand paging reuses the cache idea — keep a hot working set in fast memory, evict cold lines to slow memory — but applies it between DRAM and disk instead of between SRAM and DRAM. The mechanism is the same (resident vs. non-resident, replacement on miss, write-back of dirty data) but the constants are wildly different. DRAM is megabytes-to-gigabytes; disk is gigabytes-to-terabytes. A cache miss costs hundreds of cycles; a page fault costs millions. That gap forces different design points: page sizes are kilobytes (not 64 bytes), miss handling is in software (the OS handles a page fault), and replacement policies can afford to be much more sophisticated because the per-decision time budget is so large. The historical bullets put the idea in context: von Neumann described manual memory hierarchies in the original IAS reports, the 1960 Brookner system implemented paging in software via an interpreter, and the 1962 Manchester Atlas was the first machine with hardware demand paging — its 32 resident frames gave the user the illusion of 192 pages, exactly the modern model.

Demand Paging vs. Caching: 2016

Demand Paging vs. Caching: 2016 (slide 14)
Slide 14
Show slide text

Demand Paging vs. Caching: 2016

CacheDemand Paging
capacity64KBMB / 1GB1TB?? / ??
block size16~128 Byte4K to 64K Byte
hit time1~3 cyc50–150 cyc
miss penalty10~300 cycles1M to 10M cycles
miss rate0.1~10%0.00001~0.001%
hit handlinghwhw
miss handlinghwsw

The numerical comparison drives home why demand paging needs a fundamentally different design from a hardware cache, even though both are “keep the hot stuff close” hierarchies. The capacity gap is roughly 1000× (MB vs GB) and the latency gap is even larger: DRAM hit takes 50–150 cycles while a page fault to disk costs 10610^610710^7 cycles — five orders of magnitude. That ratio dictates several design choices listed elsewhere in the lecture: (1) blocks are much bigger (4–64 KB pages vs 64 B cache lines) because amortizing a multi-million-cycle miss over a small block is wasteful; (2) the miss rate must be made tiny (10510^{-5}) because the miss penalty is so catastrophic; (3) miss handling is done in software (the OS page-fault handler) because the millions of cycles available let software inspect, log, and decide replacement with no measurable cost. Hits, in contrast, are handled by hardware on both sides — paying 10610^6 cycles per hit would defeat the whole hierarchy.

Page-Based Virtual Memory

Page-Based Virtual Memory (slide 15)
Slide 15
Show slide text

Page-Based Virtual Memory

Virtual address (64-bit) split into Virtual Page Number (52-bit) and Page Offset (12-bit).

The VPN feeds a decoder into a Translation memory (page table); each entry is ~8 bytes.

Lookup yields the Physical Page Number, which is concatenated with the page offset to form the Physical address (40-bit).

A decoder on the physical address selects among Main memory pages (1–10 GBytes) backed by disk (10–100 GBytes).

Where to hold this translation memory and how much translation memory do we need?

This is the canonical page-based virtual memory picture with concrete bit widths. A 64-bit virtual address splits into a 52-bit virtual page number (VPN) and a 12-bit page offset — implying a 212=4 KB2^{12}=4\text{ KB} page size. The VPN indexes the page table (each entry is ~8 bytes — a 64-bit physical address plus permission and status bits) to look up the 28-bit physical page number (PPN). The PPN is concatenated with the 12-bit offset to form the 40-bit physical address that addresses the 240=1 TB2^{40}=1\text{ TB} physical address space. The page table itself lives in main memory and is backed by disk. The slide closes with the question that motivates the next several slides: a flat 2522^{52}-entry page table with 8 B entries would be 2552^{55} bytes — vastly more than physical memory! So the table must be stored and organized cleverly. Hierarchical and inverted page tables, plus the TLB for caching translations, are the answer.

Page table organization

Page table organization (slide 16)
Slide 16

Hierarchical Page Table

Hierarchical Page Table (slide 17)
Slide 17
Show slide text

Hierarchical Page Table

Effective address split: [p1(10 bit)p2(10 bit)P.O.(12 bit)][\text{p1}\,(10\text{ bit})\,|\,\text{p2}\,(10\text{ bit})\,|\,\text{P.O.}\,(12\text{ bit})]

A privileged register points to the Base of the Page Table of the page table (root). p1 indexes the root table, returning either a pointer to a second-level Page Table of the page table node or a marker (page in swap disk / page in main memory / page does not exist).

p2 then indexes the second-level pages of the page table to return the data-page descriptor; combined with the page offset dd it locates the data pages.

Storage of overhead of translation should be proportional to the size of physical memory and not the virtual address space.

Hierarchical (multi-level / radix) page tables solve the size problem on the previous slide by storing the page table itself paged. The VPN is broken into multiple fields — here p1 and p2, each 10 bits — which index successive levels of the radix tree. A single privileged register holds the root, exactly like CR3 on x86. Most virtual address space is unused in any real process (a 64-bit process rarely uses the full 2522^{52} pages), so most root entries point to nothing — “page does not exist” — and the corresponding subtree consumes zero physical memory. Subtrees that do exist may themselves be paged out: an entry can mean “this page-table page is in main memory,” “this page-table page is on swap,” or “this entry is unused.” The key property is captured in the italicized invariant on the slide: translation overhead is now proportional to the resident working set, not the (much larger) virtual address space. The cost is an extra memory reference per level on each TLB miss, which is exactly why the TLB matters so much.

Inverted or Hashed Page Tables

Inverted or Hashed Page Tables (slide 18)
Slide 18
Show slide text

Inverted or Hashed Page Tables

(PID, VPN)hashTable Offset\text{(PID, VPN)} \to \text{hash} \to \text{Table Offset}, then +Base of Table=PA of IPTE+\,\text{Base of Table} = \text{PA of IPTE}, which indexes the Inverted Page Table whose entries hold VPN,PID,PTE\langle \text{VPN}, \text{PID}, \text{PTE} \rangle. The Inverted Page Table sits in physical memory.

  • Size of Inverted Page table only needs to be proportional to the size of the physical memory
  • Each VPN can only be mapped to a small set of entries according to a hash function
  • To translate a VPN, check all allowed table entries for matching VPN and PID

An inverted page table flips the indexing direction. Instead of one entry per virtual page, it has one entry per physical frame, holding the (VPN, PID, PTE) tuple of whichever process currently owns that frame. Translation hashes the (PID, VPN) pair into a small set of candidate entries; hardware (or the TLB-miss handler) walks those entries and matches on VPN and PID. The structural benefit is exactly the goal stated on slide 17: table size grows with physical memory, not with virtual address space — even a process with a sparse, huge virtual layout occupies at most one entry per resident frame. The drawback is sharing: when two processes map the same physical frame, the table can hold only one entry per frame, so shared pages need extra mechanism (chains, separate sharing tables). Inverted page tables were used by PowerPC and IA-64; modern x86 and ARM use hierarchical tables instead.

Virtual-to-Physical Translation

Virtual-to-Physical Translation (slide 19)
Slide 19

Translation Look-aside Buffer (TLB)

Translation Look-aside Buffer (TLB) (slide 20)
Slide 20
Show slide text

Translation Look-aside Buffer (TLB)

Essentially a cache of recent address translations avoids going to the page table on every reference

  • indexed by lower bits of VPN (virtual page #)
  • tag = unused bits of VPN + process ID
  • data = a page-table entry, i.e. PPN (physical page #) and access permission
  • status = valid, dirty
  • the usual cache design choices (placement, replacement policy, multi-level, etc.) apply here too

Diagram: Virtual address [VPNPage offset][\text{VPN}\,|\,\text{Page offset}]. The lower bits of the VPN form the Index into the TLB; the upper bits form the Tag that is compared against the stored tag. A hit returns the Physical page no., which is concatenated with the Page offset to form the Physical address.

What should be the relative sizes of ITLB and I-cache?

The Translation Lookaside Buffer is just a regular cache whose lines hold page-table entries instead of memory data. Each entry maps a VPN+PID to a PPN plus permission/status bits (valid, dirty, R/W/X). On every load, store, and instruction fetch, the MMU looks up the VPN in the TLB. A hit is essentially free (≤ 1 cycle) and produces the PPN immediately. A miss triggers a page-table walk — done by hardware (x86, ARM) or by software (early MIPS, Alpha) — that reloads the missing translation. Because it is just a cache, every cache-design choice from L12–L13 reappears: associativity, replacement policy, multi-level (L1 ITLB + L1 DTLB + L2 TLB on modern x86), inclusion, and so on. The closing question — relative sizes of ITLB vs. I-cache — is a pointed reminder that one TLB entry covers an entire 4 KB page, so a small TLB (e.g., 64 entries × 4 KB = 256 KB of code reach) typically covers many times more bytes than the I-cache itself.

Virtual to Physical Address Translation

Virtual to Physical Address Translation (slide 21)
Slide 21
Show slide text

Virtual to Physical Address Translation

Flow on each Effective Address:

  1. TLB Lookup (1\le 1 pclk)
  2. On hit \to Protection Check (1\le 1 pclk)
    • permitted \to Physical Address To Cache
    • denied \to Protection Fault
  3. On miss \to Page Table Walk (100’s pclk by HW or SW)
    • succeed \to Update TLB \to Protection Check (flow continues)
    • fail \to Page Fault: OS Table Walk (10000’s pclk), which then services the fault and updates the TLB

This flow chart is the single most important diagram in the lecture. Three latency tiers stack vertically: a TLB hit is nearly free (1\le 1 processor clock); a TLB miss that resolves with a page-table walk in hardware costs hundreds of clocks; a true page fault (the page is not even in DRAM) costs tens of thousands of clocks because the OS must read the page back from disk. After the translation succeeds, a protection check runs against the PTE’s permission bits — denial raises a protection fault rather than a page fault, because the translation existed but the access was illegal (e.g., a store to a read-only page or a user access to a kernel page). This is the diagram you should be able to redraw on an exam: TLB hit/miss → walk succeed/fail → protection permitted/denied, with the three latency labels pinned to the right edges.

hit

miss

succeed

fail

permitted

denied

Effective Address

TLB Lookup
≤ 1 pclk

Protection Check
≤ 1 pclk

Page Table Walk
100s pclk
HW or SW

Update TLB

Page Fault
OS Table Walk
10000s pclk

Physical Address
to Cache

Protection Fault

Address-translation flow. Fast path (blue) — TLB hit followed by protection check — is bounded by ≤ 1 pclk per stage. TLB-miss page-table walk (orange) is hundreds of pclks. A true page fault (pink) escalates to the OS and costs tens of thousands of pclks before the TLB can be refilled.

Cache Placement and Address Translation

Cache Placement and Address Translation (slide 22)
Slide 22
Show slide text

Cache Placement and Address Translation

Physical Cache (Most Systems)

CPUVAMMUPAPhysical CachePhysical Memory\text{CPU} \xrightarrow{VA} \text{MMU} \xrightarrow{PA} \text{Physical Cache} \to \text{Physical Memory}

fetch critical path runs through CPU, MMU, and Physical Cache.

→ longer hit time

Virtual Cache (SPARC2’s)

CPUVAVirtual CacheVAMMUPAPhysical Memory\text{CPU} \xrightarrow{VA} \text{Virtual Cache} \xrightarrow{VA} \text{MMU} \xrightarrow{PA} \text{Physical Memory}

fetch critical path runs through CPU and Virtual Cache before reaching the MMU.

→ aliasing problem → cold start after context switch

Virtual caches are not popular anymore because MMU and CPU can be integrated on one chip

Where we place the MMU relative to the cache picks the central tradeoff of this lecture. A physical cache translates first, then looks up — the cache sees only physical addresses, so context switches and synonyms are non-issues, but the TLB now sits on the L1 hit critical path and lengthens the cycle. A virtual cache indexes and tags entirely in the virtual address space, so the TLB is consulted only on a miss; this is fast, but two problems appear. First, aliasing/synonyms — two virtual addresses that map to the same physical address can occupy two different cache lines, causing visibility bugs after writes. Second, cold start on context switch — virtual tags are meaningless to the new process, so the cache must be flushed (or process-tagged) on every context switch. The slide notes that virtual caches were used in machines like the SPARC2 but are no longer popular, partly because tightly integrating the MMU and CPU on the same die has shrunk the TLB latency enough that physical caches no longer cost a cycle. The next several slides develop a hybrid — virtually-indexed, physically-tagged — that keeps the speed of virtual indexing without the synonym problem.

Physically Indexed Cache

Physically Indexed Cache (slide 23)
Slide 23
Show slide text

Physically Indexed Cache

Virtual Address (n=v+gn = v + g bits): [TagIndexPage Offset (PO)][\text{Tag}\,|\,\text{Index}\,|\,\text{Page Offset (PO)}] where the VPN field is vv bits (vkv - k tag, kk index) and the page offset is gg bits.

The full VPN (vv bits) goes through the TLB to produce the Physical Page No. (PPN), pp bits.

Physical Address (m=p+gm = p + g bits): [PPNPO][\text{PPN}\,|\,\text{PO}] is then re-split as [Tag(t)Index(i)BO(b)][\text{Tag}\,(t)\,|\,\text{Index}\,(i)\,|\,\text{BO}\,(b)] for the D-cache, which returns Data.

A Physically-Indexed, Physically-Tagged (PIPT) cache is the simplest and safest design: the TLB translates the virtual address to a physical one first, and the cache uses bits from the physical address for both the index and the tag. Because the tag is physical, there are no synonyms: a given physical block has exactly one home set and one tag, no matter how many virtual addresses point at it. Because the index is physical, the index field can be as wide as you like and the cache can be as large as you like without aliasing constraints. The price is that the L1 hit latency includes the TLB lookup serially: TLB → cache index → tag compare → data, which is the longer critical path the previous slide warned about. The next slide shows how to overlap the two.

Virtually Indexed Cache — Parallel Access to TLB and Cache

Virtually Indexed Cache — Parallel Access to TLB and Cache (slide 24)
Slide 24
Show slide text

Virtually Indexed Cache

Parallel Access to TLB and Cache arrays

Left path: VPN [TagIndex][\text{Tag}\,|\,\text{Index}] + Page Offset \to TLB produces PPN(p)\text{PPN}\,(p).

Right path: in parallel, the lower bits of the virtual address — taken entirely from the Page Offset — provide the cache Index (i)(i) and BO (b)(b) to the D-cache, which returns the candidate Data and a stored PPN tag.

The TLB-supplied PPN is then compared with the cache’s stored PPN tag (=)(=) to produce Hit/Miss.

How large can a virtually indexed cache get?

This is the Virtually-Indexed, Physically-Tagged (VIPT) design that almost every modern L1 uses. Trick: take the index bits entirely from the page offset. The page offset is identical in the virtual and physical address (translation only changes the page-number bits), so indexing with offset bits is equivalent to physical indexing — no aliasing — yet it requires no TLB lookup and can run in parallel with the TLB. The TLB and cache arrays both fire on the same cycle; when the TLB returns the PPN it is compared against the physically-stored tag to confirm hit/miss. The effect is a PIPT-correctness, virtual-cache-speed L1. The closing question — how large can a virtually indexed cache get? — asks for the constraint: index bits must fit inside the page offset, so cache sizepage size×associativity\text{cache size} \le \text{page size} \times \text{associativity}. With 4 KB pages and 8-way associativity the maximum is 32 KB, which is exactly the L1 size used by Intel and AMD for two decades.

Large Virtually Indexed Cache

Large Virtually Indexed Cache (slide 25)
Slide 25
Show slide text

Large Virtually Indexed Cache

Same structure as the parallel VIPT cache, but the cache index now extends above the page offset — the highlighted slice aa comes from the VPN portion of the virtual address rather than from the page offset.

Left: VPN [TagIndex][\text{Tag}\,|\,\text{Index}] \to TLB \to PPN (p)(p).

Right: D-cache index (i)(i) now includes the bits aa that overlap into the VPN; the cache returns Data and a stored PPN, which is compared to the TLB-supplied PPN at == for Hit/Miss.

If two VPNs differ in aa, but both map to the same PPN then there is an aliasing problem.

Once the VIPT cache size grows beyond page size×associativity\text{page size} \times \text{associativity}, the index field has to extend into the VPN bits — labelled aa on the diagram. Those extra index bits can differ between two virtual addresses that translate to the same physical address (the OS chose different VPNs for the same PPN), so the same physical block may now sit in two different cache sets. That is the aliasing/synonym problem reborn: a write to one virtual alias does not update the other alias, and a later read of the second alias will see stale data. The next two slides develop the synonym story explicitly and the slides after them survey four solutions, including the MIPS R10K’s L2-mediated approach.

Virtual Address Synonyms

Virtual Address Synonyms (slide 26)
Slide 26
Show slide text

Virtual Address Synonyms

Two Virtual pages that map to the same physical page

  • within the same virtual address space
  • across address spaces

Diagram: VA1 and VA2 (red) both arrow to the same PA (blue).

Using VA bits as IDX, PA data may reside in different sets in cache!!

Synonyms (also called aliases) are pairs of virtual addresses that translate to the same physical address. They arise naturally in two contexts: (1) within one process, when the process maps the same file or shared-memory region at two virtual offsets, and (2) across processes, when two processes share a code page or a shared library by mapping the same physical frame into both their address spaces. Either way, when the cache index is partly derived from the VPN (as in the large-VIPT design), VA1 and VA2 can land in different cache sets — meaning the same physical line has two cache homes. A write to one alias updates one set; a subsequent read of the other alias misses the updated set and returns stale data. The bold red sentence on the slide captures the bug succinctly: physical data can reside in different sets when the index uses virtual bits. This is the aliasing problem the next slides will solve.

Synonym (or Aliasing)

Synonym (or Aliasing) (slide 27)
Slide 27
Show slide text

Synonym (or Aliasing)

When VPN bits are used in indexing, two virtual addresses that map to the same physical address can end up sitting in two cache lines.

In other words, two copies of the same physical memory location may exist in the cache

\Rightarrow modification to one copy won’t be visible in the other

Diagram: large-VIPT cache. The slice aa of the VPN bleeds into the index ii. The cache returns Data and a PPN that is compared with the PPN supplied by the TLB to determine Hit/Miss.

If the two VPNs do not differ in aa then there is no aliasing problem

This restates the aliasing hazard introduced two slides earlier and pinpoints the exact bit field that causes it: the slice aa where the cache index reaches above the page offset into the VPN. If two synonyms happen to share the same value of aa — accidentally or by OS arrangement — they fall into the same cache set and aliasing disappears, because the cache’s tag comparison (using the TLB-supplied PPN) deduplicates them. If they differ in aa, both copies can live in the cache simultaneously and writes to one will not propagate to the other. The italicized closing phrase telegraphs page coloring (one of the solutions on the next slide): if the OS allocates pages so that virtual and physical addresses agree in the index bits (VA[a]=PA[a]\text{VA}[a] = \text{PA}[a]), the synonym condition cannot arise and the hazard is eliminated by software policy.

Synonym Solutions

Synonym Solutions (slide 28)
Slide 28
Show slide text

Synonym Solutions

  • Limit cache size to page size times associativity
    • get index from page offset
  • Search all sets in parallel
    • 64K 4-way cache, 4K pages, search 4 sets (16 entries)
    • Slow!
  • Restrict page placement in OS
    • make sure index(VA)=index(PA)\text{index}(VA) = \text{index}(PA)
  • Eliminate by OS convention
    • single virtual space
    • restrictive sharing model

Four cures for the synonym problem, in roughly increasing implementation effort:

  1. Cap the cache. Keep cache size \le page size ×\times associativity so the entire index fits in the page offset. The clean VIPT design from slide 24. Easy, but caps L1 at e.g. 32 KB for 4 KB pages and 8-way.
  2. Search every possible alias set. For a 64 KB / 4-way / 4 KB-page cache, two index bits come from the VPN, so an alias might live in any of 22=42^2 = 4 sets — search all 16 entries on every access. Correct but slow.
  3. Page coloring (OS policy). When the OS allocates a physical page, it ensures the new PPN agrees with the VPN on the index bits (index(VA)=index(PA)\text{index}(VA) = \text{index}(PA)). Aliases then fall in the same set and the hardware tag-compare catches them. Common in MIPS and SPARC.
  4. Outlaw aliases. Use a single global virtual address space, or forbid arbitrary mmap-style sharing — viable only in research / niche systems.

The MIPS R10K on the next slide combines (1)+(3)+L2-mediated detection.

MIPS R10K Synonym Solution

MIPS R10K Synonym Solution (slide 29)
Slide 29
Show slide text

MIPS R10K Synonym Solution

32KB 2-Way Virtually-Indexed L1

  • needs 10 bits of index and 4 bits of block offset
  • page offset is only 12-bits \Rightarrow 2 bits of index are VPN[1:0]

Direct-Mapped Physical L2

  • L2 is Inclusive of L1
  • VPN[1:0] is appended to the “tag” of L2

Given two virtual addresses VAVA and VBVB that differs in aa and both map to the same physical address PAPA

  • Suppose VAVA is accessed first so blocks are allocated in L1&L2
  • What happens when VBVB is referenced?
    1. VBVB indexes to a different block in L1 and misses
    2. VBVB translates to PAPA and goes to the same block as VAVA in L2
    3. Tag comparison fails (VA[1:0]VB[1:0]VA[1:0] \ne VB[1:0])
    4. L2 detects that a synonym is cached in L1 \Rightarrow VAVA‘s entry in L1 is ejected before VBVB is allowed to be refilled in L1

The MIPS R10K’s solution is a beautiful piece of L1/L2 cooperation. Its 32 KB 2-way L1 needs 10 index bits and 4 offset bits, so 10+4=1410 + 4 = 14 bits of address are below the L1 tag — but the page offset is only 12 bits, so two of the L1 index bits (VPN[1:0]) come from the VPN. That makes synonyms possible. The R10K augments the physically-indexed, direct-mapped L2 to remember which VPN bits the L1 used: VPN[1:0] is appended to the L2 tag, so the L2 effectively records which alias is currently cached in L1. Inclusion guarantees every L1 line is also in L2. The trace on the slide walks through what happens on a synonym: VB misses in L1 (because its VPN[1:0] differs from VA’s), translates to the same PA in L2, and the L2 tag compare fails on the appended VPN bits. That mismatch is the cue: L2 knows a synonym is currently resident in L1, so it forces L1 to evict VA’s line before refilling VB. At any moment, only one of VA/VB lives in L1 — invariant restored. This is option (3) from the previous slide implemented in hardware rather than in OS policy. The slide is a good example of how aliasing is solved by making L2 the global serialization point for translations.

Virtually Indexed Physically Tagged Cache

Virtually Indexed Physically Tagged Cache (slide 30)
Slide 30
Show slide text

Virtually Indexed Physically Tagged Cache

  • L1 caches: 32 KB, 8-way set-associative
  • Number of bits to represent a byte inside 32KB
    • 15 bits
    • With (23=8)(2^3 = 8)-way associativity, 153=1215 - 3 = 12 bit
  • Page sizes: 4KB, 2MB, 1GB
  • Number of bits to represent page offset
    • At least 12 bits
  • Virtual index is always the part of the page offset

A worked example that justifies the standard modern L1 sizing. A 32 KB cache addresses 215=327682^{15} = 32768 bytes, so 15 address bits live below the cache tag. With 8-way associativity the way selector consumes log28=3\log_2 8 = 3 of those bits, leaving 153=1215 - 3 = 12 bits for the cache index plus block offset. The smallest supported page size is 4 KB = 2122^{12} B, so the page offset is exactly 12 bits — meaning the entire 12-bit cache index+offset region fits inside the page offset. Therefore the virtually-indexed cache index is a subset of the page offset, which is identical in virtual and physical addresses, so synonyms cannot occur. Larger pages (2 MB, 1 GB) only enlarge the page offset and trivially preserve the inclusion. This is exactly why 32 KB / 8-way L1 has been the canonical Intel and AMD L1 size for two decades — it is the largest VIPT design that is alias-free given the standard 4 KB minimum page.

Readings

Readings (slide 31)
Slide 31
Show slide text

Readings

For today:

Announcements

Announcements (slide 32)
Slide 32
Show slide text

Announcements

  • Project #4
    • Due: May 1, 11:59 PM Eastern Time
  • Office hours
    • End this week
  • Final project presentation
    • May 4, 2026
    • 8:40 to 9:50 AM
    • 8 minutes per group with 2 minutes for questions & answers
    • Schedule and example slides on Courseworks

https://courseworks2.columbia.edu/courses/237081