Decoding Swizzle<B,M,S>: A Visual Guide to Bank-Conflict-Free Shared Memory Access
by Vijay Krishnamoorthy
Overview
Efficient use of shared memory (SMEM) is critical for achieving peak performance in GPU kernels. One of the most common pitfalls, bank conflicts, can silently degrade performance by forcing serialized memory accesses. Swizzling is a powerful technique that reorganizes data layout to eliminate these conflicts. This article builds a mental model for understanding swizzling, culminating in a thorough explanation of CuTe’s canonical Swizzle<BBits, MBase, SShift> representation.
1. The Problem: Bank Conflicts
How Shared Memory Banks Work
GPU shared memory is organized into banks — interleaved partitions that can each service one memory access per cycle. Think of banks as parallel lanes: when different threads access different banks, all accesses proceed simultaneously. On modern NVIDIA GPUs:
- SMEM is divided into 32 banks
- Each bank is 4 bytes (1 DWORD) wide
- One SMEM “row” spans all 32 banks = 128 bytes
SMEM Row (128 bytes):
Bank: 0 1 2 3 4 ... 30 31
[4B] [4B] [4B] [4B] [4B] ... [4B] [4B]
For example, when 32 threads in a warp access 32 different banks simultaneously, all accesses complete in a single transaction. This is an ideal scenario.
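The bank arithmetic above can be written down directly. Here is a minimal sketch, assuming the 32-bank, 4-byte geometry listed above; the names `bank_of` and `smem_row_of` are ours, chosen purely for illustration:

```cpp
#include <cstdint>

// Assumed SMEM geometry from the list above: 32 banks, 4 bytes each.
constexpr std::uint32_t kBankBytes = 4;
constexpr std::uint32_t kNumBanks  = 32;

// Bank that services a given SMEM byte address.
constexpr std::uint32_t bank_of(std::uint32_t byte_addr) {
    return (byte_addr / kBankBytes) % kNumBanks;
}

// Which 128-byte SMEM row the address falls in.
constexpr std::uint32_t smem_row_of(std::uint32_t byte_addr) {
    return byte_addr / (kBankBytes * kNumBanks);
}

static_assert(bank_of(0) == 0, "first word is in bank 0");
static_assert(bank_of(124) == 31, "last word of the row is in bank 31");
static_assert(bank_of(128) == 0, "the next row wraps back to bank 0");
```

Addresses 0 and 128 both map to bank 0 even though they are different words, which is exactly the situation the next section is about.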
When Conflicts Occur
A bank conflict occurs when multiple threads access different addresses within the same bank. The hardware must serialize these accesses, degrading performance.
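Consider a matrix stored in row-major order where each element is 4 bytes. To keep the picture small, the diagram uses a simplified model with only 8 banks, so every 8-element row wraps back to Bank 0; with the real 32 banks the same picture arises whenever the row stride is a multiple of 128 bytes: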
Logical Layout (8x8 matrix, 4 bytes per element):
Col 0 Col 1 Col 2 Col 3 Col 4 Col 5 Col 6 Col 7
+-------+-------+-------+-------+-------+-------+-------+-------+
Row 0 | B0 | B1 | B2 | B3 | B4 | B5 | B6 | B7 |
Row 1 | B0 | B1 | B2 | B3 | B4 | B5 | B6 | B7 |
Row 2 | B0 | B1 | B2 | B3 | B4 | B5 | B6 | B7 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
Row 7 | B0 | B1 | B2 | B3 | B4 | B5 | B6 | B7 |
+-------+-------+-------+-------+-------+-------+-------+-------+
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | |
Bank 0 Bank 1 Bank 2 Bank 3 Bank 4 Bank 5 Bank 6 Bank 7
Row-major access (reading Row 0): Each thread reads a different column → different banks → no conflict.
Column-major access (reading Col 0): Every thread reads from Bank 0 → 8-way bank conflict!
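To put a number on the column case, we can count how many threads land in the same bank. The sketch below assumes one 4-byte read per thread and the 32-bank geometry from earlier; `column_conflict_degree` is a hypothetical helper, not a CUDA API. Note that the 8-way figure corresponds to a row stride that is a multiple of 128 bytes, so that every row starts at Bank 0; a tightly packed 8x8 tile with a 32-byte stride comes out as a 2-way conflict under the same model.

```cpp
constexpr int kNumBanks  = 32;  // banks per SMEM row
constexpr int kBankBytes = 4;   // bytes per bank

// Worst-case serialization when `rows` threads each read one 4-byte element
// from the same column of a row-major matrix with the given row stride.
// Thread t reads byte address t * row_stride_bytes + col * 4.
constexpr int column_conflict_degree(int rows, int row_stride_bytes, int col) {
    int hits[kNumBanks] = {};   // number of threads landing in each bank
    int worst = 0;
    for (int t = 0; t < rows; ++t) {
        int addr = t * row_stride_bytes + col * kBankBytes;
        int bank = (addr / kBankBytes) % kNumBanks;
        ++hits[bank];
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}

static_assert(column_conflict_degree(8, 128, 0) == 8, "128-byte stride: 8-way conflict");
static_assert(column_conflict_degree(8, 32, 0) == 2, "32-byte stride: 2-way conflict");
```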
2. The Solution: Swizzling
Swizzling reorganizes how data is stored in physical memory so that column-wise access patterns also hit different banks. The key insight is:
Apply a row-dependent transformation to the column index when storing data, so that elements in the same logical column end up in different physical banks.
The XOR Trick
The most elegant swizzling scheme uses bitwise XOR. For each element at logical position (row, col):
physical_col = logical_col XOR row
Let’s see this in action for an 8x8 tile:
Logical Layout: Physical Layout (after XOR swizzle):
Col: 0 1 2 3 4 5 6 7 Col: 0 1 2 3 4 5 6 7
+----------------+ +----------------+
R0 | 0 1 2 3 4 5 6 7 | R0 | 0 1 2 3 4 5 6 7 | (XOR 0 = no change)
R1 | 0 1 2 3 4 5 6 7 | R1 | 1 0 3 2 5 4 7 6 | (XOR 1)
R2 | 0 1 2 3 4 5 6 7 | R2 | 2 3 0 1 6 7 4 5 | (XOR 2)
R3 | 0 1 2 3 4 5 6 7 | R3 | 3 2 1 0 7 6 5 4 | (XOR 3)
R4 | 0 1 2 3 4 5 6 7 | R4 | 4 5 6 7 0 1 2 3 | (XOR 4)
R5 | 0 1 2 3 4 5 6 7 | R5 | 5 4 7 6 1 0 3 2 | (XOR 5)
R6 | 0 1 2 3 4 5 6 7 | R6 | 6 7 4 5 2 3 0 1 | (XOR 6)
R7 | 0 1 2 3 4 5 6 7 | R7 | 7 6 5 4 3 2 1 0 | (XOR 7)
+----------------+ +----------------+
(Numbers show which logical column's data is stored at each physical position)
Now look at any physical column in the swizzled layout—say, physical column 0:
- Row 0: logical col 0
- Row 1: logical col 1
- Row 2: logical col 2
- Row 3: logical col 3
- Row 4: logical col 4
- Row 5: logical col 5
- Row 6: logical col 6
- Row 7: logical col 7
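Every row stores a different logical column in physical column 0. Because XOR is its own inverse, the mapping also runs the other way: logical column 0 of row r sits in physical column r. So when we read logical column 0 (rows 0-7), each thread accesses a different physical column → different banks → no conflict.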
Why XOR Works Mathematically
The XOR operation with row indices guarantees conflict-free access because:
- XOR is its own inverse: a XOR b XOR b = a
- Unique mapping per row: For any fixed logical column c, as row r varies from 0 to 2^n-1, the value c XOR r produces all values from 0 to 2^n-1 exactly once.
- Bijection: XOR with a constant is a bijection (one-to-one mapping), so no two logical columns in the same row map to the same physical column.
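These properties are easy to check by brute force for a small tile. The following is a minimal sketch with illustrative function names of our own; it checks the self-inverse property and the "every row lands in a different physical column" property for an n x n tile:

```cpp
// Physical column for a logical (row, col) under the simple XOR scheme.
constexpr unsigned xor_swizzle(unsigned row, unsigned col) {
    return col ^ row;
}

// True if, for every logical column of an n x n tile (n <= 32), the n rows
// land in n distinct physical columns inside the tile, and applying the XOR
// a second time returns the original column.
constexpr bool xor_is_conflict_free(unsigned n) {
    for (unsigned col = 0; col < n; ++col) {
        unsigned seen = 0;  // bitmask of physical columns already used
        for (unsigned row = 0; row < n; ++row) {
            unsigned phys = xor_swizzle(row, col);
            if (phys >= n) return false;                      // left the tile
            if (seen & (1u << phys)) return false;            // two rows collide
            if (xor_swizzle(row, phys) != col) return false;  // not self-inverse
            seen |= 1u << phys;
        }
    }
    return true;
}

static_assert(xor_is_conflict_free(8), "the 8x8 tile from the diagram");
static_assert(xor_swizzle(3, 5) == 6, "row 3, logical col 5 -> physical col 6");
```

The check fails for tile sizes that are not powers of two (for example n = 6), because the XOR can then produce a column index outside the tile.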
3. CuTe’s Canonical Swizzle Form
NVIDIA’s CuTe library provides a canonical representation for swizzle patterns:
Swizzle<BBits, MBase, SShift>
This compact notation encodes everything needed to describe a swizzle pattern. Before diving into each parameter, let’s understand the high-level formula:
Given a byte address A, decompose it into bit fields:
A = [ ... ] [ YYY ] [ ... ] [ ZZZ ] [ ... ]
↑ ↑ ↑ ↑ ↑
│ │ │ │ │
│ │ │ │ └── bits [0, MBase)
│ │ │ │
│ │ │ └── bits [MBase, MBase+BBits)
│ │ │
│ │ └── bits [MBase+BBits, MBase+SShift)
│ │
│ └── bits [MBase+SShift, MBase+SShift+BBits)
│
└── bits [MBase+SShift+BBits, ...)
Swizzled address A' = A with ZZZ replaced by (ZZZ XOR YYY)
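As a concrete reading of this formula, here is a small standalone sketch. It mirrors the bit-field description above and is not CuTe's source code; the free function `swizzle` is our own illustrative name.

```cpp
#include <cstdint>

// Standalone sketch of the Swizzle<BBits, MBase, SShift> formula above.
template <unsigned BBits, unsigned MBase, unsigned SShift>
constexpr std::uint32_t swizzle(std::uint32_t addr) {
    static_assert(SShift >= BBits, "YYY and ZZZ bit fields must not overlap");
    // Shifting right by SShift lines YYY (at bit MBase+SShift) up with ZZZ
    // (at bit MBase); the mask keeps only those BBits bits before the XOR.
    return addr ^ ((addr >> SShift) & (((1u << BBits) - 1u) << MBase));
}

// Address 690 = 5*128 + 3*16 + 2, so YYY = 5 and ZZZ = 3 for Swizzle<3,4,3>.
// ZZZ becomes 3 XOR 5 = 6, giving 5*128 + 6*16 + 2 = 738.
static_assert(swizzle<3, 4, 3>(690) == 738, "ZZZ replaced by ZZZ XOR YYY");
static_assert(swizzle<3, 4, 3>(738) == 690, "applying it twice is the identity");
static_assert(swizzle<0, 4, 3>(690) == 690, "BBits = 0 means no swizzle");
```

Because YYY itself is never modified, the same function both swizzles and unswizzles an address.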
Now let’s build intuition for what each parameter means.
4. Building Intuition: MBase, BBits, and SShift
While CuTe uses the ordering <BBits, MBase, SShift>, it’s pedagogically clearer to understand them in a different order: MBase → BBits → SShift. Let’s explore each.
4.1 MBase: The Swizzle Unit
MBase defines the fundamental unit of data that moves together during swizzling. The swizzle unit size is:
Swizzle Unit Size = 2^MBase bytes
Key insight: Data within a swizzle unit is never rearranged. Swizzling only determines which physical slot a swizzle unit occupies—it doesn’t shuffle bytes inside the unit.
| MBase | Swizzle Unit Size | Banks Spanned |
|---|---|---|
| 2 | 4 bytes | 1 bank |
| 3 | 8 bytes | 2 banks |
| 4 | 16 bytes | 4 banks |
| 5 | 32 bytes | 8 banks |
Why does this matter? If your access pattern naturally reads 16-byte chunks (e.g., loading 4 fp32 values or 8 fp16 values per thread), you want MBase=4. There’s no benefit to swizzling at a finer granularity—it would just add complexity.
4.2 BBits: The Swizzle Tile Dimensions
BBits defines the swizzle tile as a square grid:
Swizzle Tile = 2^BBits rows × 2^BBits columns (of swizzle units)
| BBits | Tile Shape | Swizzle Units per Tile |
|---|---|---|
| 1 | 2 × 2 | 4 |
| 2 | 4 × 4 | 16 |
| 3 | 8 × 8 | 64 |
4.3 SShift: The Row Stride
Before diving into address bits, let’s understand what SShift represents. In a row-major layout:
- Each row contains 2^SShift swizzle units
- The row stride (bytes per row) is 2^(MBase+SShift) bytes
Physically, SShift determines how wide each row is in memory. For a standard 128-byte SMEM row with 16-byte swizzle units:
- Row width = 128 bytes = 8 units × 16 bytes/unit
- SShift = log₂(8) = 3
Intuitively, SShift encodes where the row index bits start in the address. The column index occupies bits [MBase, MBase+SShift), and the row index starts at bit MBase+SShift.
A row-major address is structured as follows:
Address = row × row_stride + col × unit_size
= row × 2^(MBase+SShift) + col × 2^MBase
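Since both strides are powers of two, building and taking apart such an address is just shifting and masking. A minimal sketch, with helper names that are ours rather than anything from CuTe:

```cpp
#include <cstdint>

// Row-major byte address of (row, col, byte_in_unit) for a given MBase/SShift.
// Assumes col < 2^SShift and byte_in_unit < 2^MBase.
constexpr std::uint32_t make_address(unsigned MBase, unsigned SShift,
                                     std::uint32_t row, std::uint32_t col,
                                     std::uint32_t byte_in_unit) {
    return (row << (MBase + SShift)) | (col << MBase) | byte_in_unit;
}

// The row index starts at bit MBase + SShift.
constexpr std::uint32_t row_of(unsigned MBase, unsigned SShift, std::uint32_t addr) {
    return addr >> (MBase + SShift);
}

// The column index occupies bits [MBase, MBase + SShift).
constexpr std::uint32_t col_of(unsigned MBase, unsigned SShift, std::uint32_t addr) {
    return (addr >> MBase) & ((1u << SShift) - 1u);
}

// The byte offset inside the swizzle unit occupies bits [0, MBase).
constexpr std::uint32_t byte_of(unsigned MBase, std::uint32_t addr) {
    return addr & ((1u << MBase) - 1u);
}

// MBase = 4, SShift = 3: row 2, col 5, byte 7 -> 2*128 + 5*16 + 7 = 343.
static_assert(make_address(4, 3, 2, 5, 7) == 343, "row*128 + col*16 + byte");
static_assert(col_of(4, 3, 343) == 5, "column recovered from bits [4,7)");
```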
Address bits map to a 3-level hierarchy:
┌─ Level 1: SMEM Grid ──────────────────────────────────────────────────────┐
│ (swizzle tiles arranged in rows and columns) │
│ │
│ row_tile_idx col_tile_idx │
│ │ │ │
│ │ ┌────────────┬────────────┬──────▼─────┬────────────┐ │
│ └─────►│ Tile 0,0 │ Tile 0,1 │ Tile 0,2 │ Tile 0,3 │ ... │
│ ├────────────┼────────────┼────────────┼────────────┤ │
│ │ Tile 1,0 │ Tile 1,1 │ Tile 1,2 │ Tile 1,3 │ ... │
│ └────────────┴────────────┴──────┬─────┴────────────┘ │
│ │ │
│ ▼ │
├─ Level 2: Inside a Tile ──────────────────────────────────────────────────┤
│ (2^BBits rows × 2^BBits columns of swizzle units) │
│ │
│ row_in_tile (YYY) col_in_tile (ZZZ) │
│ │ │ │
│ │ C0 C1 C2 C3 C4▼ C5 C6 C7 │
│ │ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │
│ └─► R0 │ │ │ │ │ ● │ │ │ │ │
│ R1 │ │ │ │ │ │ │ │ │ │
│ R2 │ │ │ │ │ │ │ │ │ │
│ ... └─────┴─────┴─────┴─────┴──┬──┴─────┴─────┴─────┘ │
│ │ │
│ ★ SWIZZLE: physical_col = col_in_tile XOR row_in_tile │
│ ▼ │
├─ Level 3: Inside a Swizzle Unit ──────────────────────────────────────────┤
│ (2^MBase bytes — NEVER modified by swizzling) │
│ │
│ byte_offset ──► ┌────┬────┬────┬────┬────┬────┬─────┬──────────────┐ │
│ │ B0 │ B1 │ B2 │ B3 │ B4 │ B5 │ ... │ B(2^MBase-1) │ │
│ └────┴────┴────┴────┴────┴────┴─────┴──────────────┘ │
│ │
│ Data within a swizzle unit always stays together. │
└───────────────────────────────────────────────────────────────────────────┘
Note: col_tile_idx only exists when SShift > BBits (multiple swizzle tiles per grid row of tiles). When SShift = BBits, the entire column index fits in col_in_tile and there’s only one tile per row.
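The three levels correspond to five bit fields, which we can pull out of an address one by one. The sketch below uses the field names from the diagram; the `SmemCoord` struct and `decode` function are hypothetical helpers for illustration only.

```cpp
#include <cstdint>

// Field names follow the diagram above; the struct itself is illustrative.
struct SmemCoord {
    std::uint32_t row_tile_idx;  // Level 1: which row of tiles
    std::uint32_t row_in_tile;   // Level 2: YYY
    std::uint32_t col_tile_idx;  // Level 1: which tile within the SMEM row
    std::uint32_t col_in_tile;   // Level 2: ZZZ
    std::uint32_t byte_offset;   // Level 3: position inside the swizzle unit
};

template <unsigned BBits, unsigned MBase, unsigned SShift>
constexpr SmemCoord decode(std::uint32_t addr) {
    static_assert(SShift >= BBits, "tile must fit inside one row");
    return SmemCoord{
        addr >> (MBase + SShift + BBits),
        (addr >> (MBase + SShift)) & ((1u << BBits) - 1u),
        (addr >> (MBase + BBits)) & ((1u << (SShift - BBits)) - 1u),
        (addr >> MBase) & ((1u << BBits) - 1u),
        addr & ((1u << MBase) - 1u)};
}

// Address 857 = row 6, col 5, byte 9 with 16-byte units and 8 units per row.
// With 4x4 tiles (BBits = 2): row 6 is row 2 of tile-row 1, col 5 is col 1 of tile 1.
static_assert(decode<2, 4, 3>(857).row_tile_idx == 1, "tile row 1");
static_assert(decode<2, 4, 3>(857).row_in_tile == 2, "YYY = 2");
static_assert(decode<2, 4, 3>(857).col_tile_idx == 1, "second tile in the row");
static_assert(decode<2, 4, 3>(857).col_in_tile == 1, "ZZZ = 1");
```

When SShift equals BBits, the mask for `col_tile_idx` is zero, so the field is always 0, matching the note above.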
The swizzle operation XORs the YYY bits (row portion) into the ZZZ bits (column portion):
Swizzled Address = Original Address with ZZZ replaced by (ZZZ XOR YYY)
Why this works: The YYY bits contain the lower BBits of the row index. By XORing them into the ZZZ bits (lower BBits of column), we effectively remap which physical column stores each logical column—and the remapping is different for each row.
The mask: mask = (1 << BBits) - 1 isolates BBits bits. For BBits=3: mask = 0b111 = 7
Equivalence to row/column XOR: Because of how the address is structured, the swizzle is equivalent to:
col_in_tile = col & mask // Lower BBits of column (ZZZ)
row_in_tile = row & mask // Lower BBits of row (YYY)
physical_slot_in_tile = col_in_tile XOR row_in_tile
But remember: the actual operation is on address bits, not indices. The row bits at position [MBase+SShift, MBase+SShift+BBits) are XORed into the column bits at position [MBase, MBase+BBits).
Why square tiles? The XOR trick requires 2^BBits unique values in both YYY and ZZZ to create a bijection. With 2^BBits rows and 2^BBits column positions within a tile, we get exactly the right range of XOR operands.
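The claimed equivalence between the address-bit view and the row/column view can be checked exhaustively for a small region. This is a sketch under the row-major address structure described earlier; all three function names are our own.

```cpp
#include <cstdint>

// Address-bit view: XOR the YYY field into the ZZZ field.
constexpr std::uint32_t swizzle_addr(unsigned B, unsigned M, unsigned S,
                                     std::uint32_t addr) {
    return addr ^ ((addr >> S) & (((1u << B) - 1u) << M));
}

// Index view: XOR the low B bits of the row into the column, then rebuild
// the row-major address. The byte offset is carried through unchanged.
constexpr std::uint32_t swizzle_index(unsigned B, unsigned M, unsigned S,
                                      std::uint32_t row, std::uint32_t col,
                                      std::uint32_t byte) {
    return (row << (M + S)) | ((col ^ (row & ((1u << B) - 1u))) << M) | byte;
}

// Compare both views for every (row, col, byte) in the first `rows` rows.
constexpr bool views_agree(unsigned B, unsigned M, unsigned S, unsigned rows) {
    for (std::uint32_t r = 0; r < rows; ++r)
        for (std::uint32_t c = 0; c < (1u << S); ++c)
            for (std::uint32_t b = 0; b < (1u << M); ++b) {
                std::uint32_t addr = (r << (M + S)) | (c << M) | b;
                if (swizzle_addr(B, M, S, addr) != swizzle_index(B, M, S, r, c, b))
                    return false;
            }
    return true;
}

static_assert(views_agree(3, 4, 3, 16), "Swizzle<3,4,3>, 16 rows");
static_assert(views_agree(2, 5, 2, 16), "Swizzle<2,5,2>, 16 rows");
```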
To recap,
SShift specifies how many swizzle units fit in one logical row:
Swizzle Units per Row = 2^SShift
This parameter connects the swizzle tile to the actual SMEM layout. Typically:
SShift = log2(SMEM_row_bytes / swizzle_unit_bytes)
= log2(128 / 2^MBase)
= 7 - MBase
| MBase | Swizzle Unit | SShift (for 128B SMEM row) | Units per Row |
|---|---|---|---|
| 4 | 16 bytes | 3 | 8 |
| 5 | 32 bytes | 2 | 4 |
Critical constraint: SShift >= BBits
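The swizzle tile width (2^BBits units) must fit within one row (2^SShift units). If the tile were wider than the row, the YYY field (starting at bit MBase+SShift) would overlap the ZZZ field (ending at bit MBase+BBits), and the XOR would no longer be a clean row-into-column remapping.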
Repetition: When SShift > BBits, multiple swizzle tiles fit horizontally in one SMEM row:
Tiles per Row = 2^(SShift - BBits)
Important: Swizzling operates independently within each tile. There’s no swizzling across tile boundaries—each tile applies its own XOR pattern based on the row index within that tile.
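The tile count and the "no swizzling across tile boundaries" property can both be expressed in a few lines. A minimal sketch, assuming SShift >= BBits; the helper names are illustrative:

```cpp
#include <cstdint>

// Number of swizzle tiles side by side in one row (requires SShift >= BBits).
constexpr unsigned tiles_per_row(unsigned BBits, unsigned SShift) {
    return 1u << (SShift - BBits);
}

// True if swizzling every address in the first `rows` rows changes only the
// col_in_tile bits [M, M+B): the byte offset, the column tile index, and the
// row bits all pass through untouched.
constexpr bool swizzle_stays_in_tile(unsigned B, unsigned M, unsigned S,
                                     unsigned rows) {
    const std::uint32_t zzz_mask = ((1u << B) - 1u) << M;
    for (std::uint32_t addr = 0; addr < (rows << (M + S)); ++addr) {
        std::uint32_t swz = addr ^ ((addr >> S) & zzz_mask);
        if ((swz & ~zzz_mask) != (addr & ~zzz_mask)) return false;
    }
    return true;
}

static_assert(tiles_per_row(1, 3) == 4, "2-unit tiles: 4 per 8-unit row");
static_assert(tiles_per_row(3, 3) == 1, "8-unit tile spans the whole row");
static_assert(swizzle_stays_in_tile(2, 4, 3, 8), "data never leaves its tile");
```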
4.4 Putting It All Together
For Swizzle<BBits=3, MBase=4, SShift=3>:
- Swizzle unit: 16 bytes (MBase=4), so bits [0,4) are the byte offset within a unit
- Swizzle tile: 8×8 units = 8 rows × 128 bytes (BBits=3)
- Units per row: 8 (SShift=3)
- Tiles per row: 1 (SShift - BBits = 0)
Address bit layout:
Original Address (for row r, column c, byte offset b):
A = r × 2^(4+3) + c × 2^4 + b = r × 128 + c × 16 + b
Bit positions:
[13..10] [9 8 7] [6 5 4] [3 2 1 0]
row_tile row_in col_in byte
_idx _tile _tile _offset
(YYY) (ZZZ)
↑ bits [7,10) ↑ bits [4,7) ↑ bits [0,4)
Since SShift = BBits = 3, there is no col_tile_idx field—the entire column fits in col_in_tile. The swizzle tile spans the full SMEM row width.
The swizzle operation:
ZZZ = (Address >> MBase) & 0b111 // Extract bits [4,7) = col_in_tile
YYY = (Address >> (MBase+SShift)) & 0b111 // Extract bits [7,10) = row_in_tile
Swizzled Address = Address XOR (YYY << MBase)
= Address with ZZZ replaced by (ZZZ XOR YYY)
Equivalent index-based view:
col_in_tile = c & 0b111 // Lower 3 bits of column
row_in_tile = r & 0b111 // Lower 3 bits of row
physical_slot_in_tile = col_in_tile XOR row_in_tile
physical_col = (c & ~0b111) | physical_slot_in_tile // Preserve col tile index
The key insight: we XOR the row bits (YYY) into the column bits (ZZZ) at their respective positions in the address. The upper column bits (tile index) and all other bits pass through unchanged, ensuring swizzling is self-contained within each tile.
5. Worked Examples
Let’s walk through concrete examples using standard GPU SMEM configuration:
- Bank size: 4 bytes (1 DWORD)
- Number of banks: 32
- SMEM row: 128 bytes
We’ll examine different swizzle configurations corresponding to different “swizzle atoms” from the previous post on MMA layouts.
5.1 Example A: No Swizzle — 8×16B (1 Core Matrix)
Configuration: Swizzle<0, 4, 3>
- BBits=0: No swizzle (tile is 1×1)
- MBase=4: 16-byte units
- SShift=3: 8 units per row
Logical Layout (8 rows × 8 units of 16B each = 8 × 128B):
+--------+--------+--------+--------+--------+--------+--------+--------+
| Swiz | Swiz | Swiz | Swiz | Swiz | Swiz | Swiz | Swiz |
| Tile 0 | Tile 1 | Tile 2 | Tile 3 | Tile 4 | Tile 5 | Tile 6 | Tile 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
| Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Row 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Row 2 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Row 3 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Row 4 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Row 5 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Row 6 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Row 7 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Physical Layout: Same as logical (no swizzle applied)
Bank Mapping (each 16B unit spans 4 banks):
Banks 0-3 Banks 4-7 Banks 8-11 Banks 12-15 Banks 16-19 Banks 20-23 Banks 24-27 Banks 28-31
+----------+----------+----------+-----------+-----------+-----------+-----------+-----------+
Row 0 | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
Row 1 | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
+----------+----------+----------+-----------+-----------+-----------+-----------+-----------+
Column Access Pattern (reading logical Unit 0):
Every row maps logical Unit 0 to Banks 0-3, so threads reading the units down a column all hit the same banks: a bank conflict.
Tensor Core access Pattern for swizzle atom with no swizzle (8x16B):
No bank conflict.
5.2 Example B: 32B Swizzle Atom — 8×32B (2 Core Matrices)
Configuration: Swizzle<1, 4, 3>
- BBits=1: 2×2 tile of swizzle units
- MBase=4: 16-byte units
- SShift=3: 8 units per row
Swizzle tile: 2 rows × 2 units (32 bytes wide)
XOR values by row:
- Row 0, 2, 4, 6: XOR with 0 (row & 1 = 0)
- Row 1, 3, 5, 7: XOR with 1 (row & 1 = 1)
Physical Layout (showing logical column at each physical position):
+-----------------+-----------------+-----------------+-----------------+
| Swizzle Tile 0 | Swizzle Tile 1 | Swizzle Tile 2 | Swizzle Tile 3 |
+-----------------+-----------------+-----------------+-----------------+
| Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | XOR 0
Row 1 | 1 | 0 | 3 | 2 | 5 | 4 | 7 | 6 | XOR 1
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 2 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | XOR 0
Row 3 | 1 | 0 | 3 | 2 | 5 | 4 | 7 | 6 | XOR 1
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 4 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | XOR 0
Row 5 | 1 | 0 | 3 | 2 | 5 | 4 | 7 | 6 | XOR 1
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 6 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | XOR 0
Row 7 | 1 | 0 | 3 | 2 | 5 | 4 | 7 | 6 | XOR 1
+--------+--------+--------+--------+--------+--------+--------+--------+
Column Access Pattern (reading logical Unit 0):
- Row 0: Physical Unit 0 (Banks 0-3)
- Row 1: Physical Unit 1 (Banks 4-7)
- Row 2: Physical Unit 0 (Banks 0-3)
- Row 3: Physical Unit 1 (Banks 4-7)
- …
Tensor Core access Pattern for 2 core matrix wide swizzle atom (8x32B):
No bank conflict.
Without swizzling, the 2 core matrices would be stored across 2 SMEM rows resulting in bank conflicts when loading the 8x32B atom.
If you recall from the last post, a core matrix is 8x16B, so each row of the core matrix maps to a swizzle unit spanning 4 banks in this case.
The logical swizzle atom that is 2 core matrices wide is shown below. Rows are denoted by CMmRn (Row n in Core Matrix m).
+---------+---------+
|8x32B Swizzle Atom |
+---------+---------+
| CM0R0 | CM1R0 |
| CM0R1 | CM1R1 |
| CM0R2 | CM1R2 |
| CM0R3 | CM1R3 |
| CM0R4 | CM1R4 |
| CM0R5 | CM1R5 |
| CM0R6 | CM1R6 |
| CM0R7 | CM1R7 |
+---------+---------+
Stored without swizzling, the atom occupies 2 SMEM rows, and rows of the same core matrix collide (for example, CM0R0 and CM0R4 both land in Unit 0):
+-----------------+-----------------+-----------------+-----------------+
| Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 0 | CM0R0 | CM1R0 | CM0R1 | CM1R1 | CM0R2 | CM1R2 | CM0R3 | CM1R3 |
Row 1 | CM0R4 | CM1R4 | CM0R5 | CM1R5 | CM0R6 | CM1R6 | CM0R7 | CM1R7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Conflict-free layout with swizzling, where both core matrices can be loaded without bank conflicts between core matrix rows:
+-----------------+-----------------+-----------------+-----------------+
| Swizzle Tile 0 | Swizzle Tile 1 | Swizzle Tile 2 | Swizzle Tile 3 |
+-----------------+-----------------+-----------------+-----------------+
| Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 0 | CM0R0 | CM1R0 | CM0R1 | CM1R1 | CM0R2 | CM1R2 | CM0R3 | CM1R3 |
Row 1 | CM1R4 | CM0R4 | CM1R5 | CM0R5 | CM1R6 | CM0R6 | CM1R7 | CM0R7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
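The placement of each CMmRn cell in these tables follows mechanically from the atom's row-major order plus the XOR. Here is a sketch, assuming 16-byte units, 8 units per SMEM row, and an atom that is 2^BBits core matrices wide; `physical_unit` and `core_matrix_conflict_free` are hypothetical helpers.

```cpp
// Physical 16-byte unit (0..7) that holds row n of core matrix m, for an atom
// 2^BBits core matrices wide, stored row-major in 128-byte SMEM rows.
// With `swizzled` set, Swizzle<BBits, 4, 3> is applied on top.
constexpr unsigned physical_unit(unsigned BBits, unsigned m, unsigned n,
                                 bool swizzled) {
    unsigned linear   = (n << BBits) + m;  // row-major unit index inside the atom
    unsigned smem_row = linear / 8;        // 8 units per 128-byte SMEM row
    unsigned unit     = linear % 8;
    return swizzled ? unit ^ (smem_row & ((1u << BBits) - 1u)) : unit;
}

// True if the 8 rows of core matrix m occupy 8 distinct units, so that no two
// rows of the same core matrix share banks.
constexpr bool core_matrix_conflict_free(unsigned BBits, unsigned m,
                                         bool swizzled) {
    unsigned seen = 0;  // bitmask of units already used
    for (unsigned n = 0; n < 8; ++n) {
        unsigned bit = 1u << physical_unit(BBits, m, n, swizzled);
        if (seen & bit) return false;
        seen |= bit;
    }
    return true;
}

static_assert(physical_unit(1, 0, 4, true) == 1, "CM0R4 moves from Unit 0 to Unit 1");
static_assert(!core_matrix_conflict_free(1, 0, false), "unswizzled: CM0R0 and CM0R4 collide");
static_assert(core_matrix_conflict_free(1, 0, true), "swizzled: CM0 rows are spread out");
static_assert(core_matrix_conflict_free(1, 1, true), "swizzled: CM1 rows are spread out");
```

Under the same assumptions, the function also covers wider atoms by passing BBits = 2 or BBits = 3.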
5.3 Example C: 64B Swizzle Atom — 8×64B (4 Core Matrices)
Configuration: Swizzle<2, 4, 3>
- BBits=2: 4×4 tile of swizzle units
- MBase=4: 16-byte units
- SShift=3: 8 units per row
Swizzle tile: 4 rows × 4 units (64 bytes wide)
XOR values by row:
- Row 0, 4: XOR with 0
- Row 1, 5: XOR with 1
- Row 2, 6: XOR with 2
- Row 3, 7: XOR with 3
Physical Layout:
+-----------------------------------+-----------------------------------+
| Swizzle Tile 0 | Swizzle Tile 1 |
+-----------------+-----------------+-----------------+-----------------+
| Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | XOR 0
Row 1 | 1 | 0 | 3 | 2 | 5 | 4 | 7 | 6 | XOR 1
Row 2 | 2 | 3 | 0 | 1 | 6 | 7 | 4 | 5 | XOR 2
Row 3 | 3 | 2 | 1 | 0 | 7 | 6 | 5 | 4 | XOR 3
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 4 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | XOR 0
Row 5 | 1 | 0 | 3 | 2 | 5 | 4 | 7 | 6 | XOR 1
Row 6 | 2 | 3 | 0 | 1 | 6 | 7 | 4 | 5 | XOR 2
Row 7 | 3 | 2 | 1 | 0 | 7 | 6 | 5 | 4 | XOR 3
+--------+--------+--------+--------+--------+--------+--------+--------+
Column Access Pattern (reading logical Unit 0):
- Row 0: Physical Unit 0 (Banks 0-3)
- Row 1: Physical Unit 1 (Banks 4-7)
- Row 2: Physical Unit 2 (Banks 8-11)
- Row 3: Physical Unit 3 (Banks 12-15)
- Row 4: Physical Unit 0 (Banks 0-3)
- Row 5: Physical Unit 1 (Banks 4-7)
- Row 6: Physical Unit 2 (Banks 8-11)
- Row 7: Physical Unit 3 (Banks 12-15)
Tensor Core access Pattern for 4 core matrix wide swizzle atom (8x64B):
No bank conflict.
Without swizzling, the 4 core matrices would be stored across 4 SMEM rows resulting in bank conflicts when loading the 8x64B atom.
+-----------------+-----------------+-----------------+-----------------+
| Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 0 | CM0R0 | CM1R0 | CM2R0 | CM3R0 | CM0R1 | CM1R1 | CM2R1 | CM3R1 |
Row 1 | CM0R2 | CM1R2 | CM2R2 | CM3R2 | CM0R3 | CM1R3 | CM2R3 | CM3R3 |
Row 2 | CM0R4 | CM1R4 | CM2R4 | CM3R4 | CM0R5 | CM1R5 | CM2R5 | CM3R5 |
Row 3 | CM0R6 | CM1R6 | CM2R6 | CM3R6 | CM0R7 | CM1R7 | CM2R7 | CM3R7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Conflict-free layout with swizzling, where all 4 core matrices can be loaded without bank conflicts between core matrix rows:
+-----------------------------------+-----------------------------------+
| Swizzle Tile 0 | Swizzle Tile 1 |
+-----------------+-----------------+-----------------+-----------------+
| Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 0 | CM0R0 | CM1R0 | CM2R0 | CM3R0 | CM0R1 | CM1R1 | CM2R1 | CM3R1 |
Row 1 | CM1R2 | CM0R2 | CM3R2 | CM2R2 | CM1R3 | CM0R3 | CM3R3 | CM2R3 |
Row 2 | CM2R4 | CM3R4 | CM0R4 | CM1R4 | CM2R5 | CM3R5 | CM0R5 | CM1R5 |
Row 3 | CM3R6 | CM2R6 | CM1R6 | CM0R6 | CM3R7 | CM2R7 | CM1R7 | CM0R7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
5.4 Example D: 128B Swizzle Atom — 8×128B (8 Core Matrices)
Configuration: Swizzle<3, 4, 3>
- BBits=3: 8×8 tile of swizzle units
- MBase=4: 16-byte units
- SShift=3: 8 units per row
Swizzle tile: 8 rows × 8 units (128 bytes wide, exactly the width of one SMEM row)
XOR values by row: 0, 1, 2, 3, 4, 5, 6, 7 (all unique)
Physical Layout:
+-----------------------------------------------------------------------+
| Swizzle Tile 0 |
+-----------------+-----------------+-----------------+-----------------+
| Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | XOR 0
Row 1 | 1 | 0 | 3 | 2 | 5 | 4 | 7 | 6 | XOR 1
Row 2 | 2 | 3 | 0 | 1 | 6 | 7 | 4 | 5 | XOR 2
Row 3 | 3 | 2 | 1 | 0 | 7 | 6 | 5 | 4 | XOR 3
Row 4 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 | XOR 4
Row 5 | 5 | 4 | 7 | 6 | 1 | 0 | 3 | 2 | XOR 5
Row 6 | 6 | 7 | 4 | 5 | 2 | 3 | 0 | 1 | XOR 6
Row 7 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | XOR 7
+--------+--------+--------+--------+--------+--------+--------+--------+
Column Access Pattern (reading logical Unit 0):
- Row 0: Physical Unit 0 (Banks 0-3)
- Row 1: Physical Unit 1 (Banks 4-7)
- Row 2: Physical Unit 2 (Banks 8-11)
- Row 3: Physical Unit 3 (Banks 12-15)
- Row 4: Physical Unit 4 (Banks 16-19)
- Row 5: Physical Unit 5 (Banks 20-23)
- Row 6: Physical Unit 6 (Banks 24-27)
- Row 7: Physical Unit 7 (Banks 28-31)
All 8 rows access different banks!
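Comparing the column access lists of Examples A through D, the number of rows that share a physical unit halves each time BBits grows by one. The sketch below counts this for an 8-row read of one logical unit under Swizzle<BBits, 4, 3>. It only counts unit collisions, assuming the 8 reads are serviced together; `column_read_conflicts` is an illustrative name.

```cpp
// Largest number of rows (out of 8) whose copy of `logical_unit` ends up in
// the same physical unit, and therefore in the same 4 banks.
constexpr unsigned column_read_conflicts(unsigned BBits, unsigned logical_unit) {
    unsigned hits[8] = {};  // rows landing in each physical unit
    unsigned worst = 0;
    for (unsigned row = 0; row < 8; ++row) {
        unsigned phys = logical_unit ^ (row & ((1u << BBits) - 1u));
        ++hits[phys];
        if (hits[phys] > worst) worst = hits[phys];
    }
    return worst;
}

static_assert(column_read_conflicts(0, 0) == 8, "Example A: every row in Unit 0");
static_assert(column_read_conflicts(1, 0) == 4, "Example B: Units 0 and 1 alternate");
static_assert(column_read_conflicts(2, 0) == 2, "Example C: pattern repeats every 4 rows");
static_assert(column_read_conflicts(3, 0) == 1, "Example D: all 8 rows differ");
```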
Tensor Core access Pattern for 8 core matrix wide swizzle atom (8x128B):
No bank conflict.
Without swizzling, the 8 core matrices would be stored across 8 SMEM rows resulting in bank conflicts when loading the 8x128B atom.
+-----------------------------------------------------------------------+
| Swizzle Tile 0 |
+-----------------+-----------------+-----------------+-----------------+
| Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 0 | CM0R0 | CM1R0 | CM2R0 | CM3R0 | CM4R0 | CM5R0 | CM6R0 | CM7R0 |
Row 1 | CM0R1 | CM1R1 | CM2R1 | CM3R1 | CM4R1 | CM5R1 | CM6R1 | CM7R1 |
Row 2 | CM0R2 | CM1R2 | CM2R2 | CM3R2 | CM4R2 | CM5R2 | CM6R2 | CM7R2 |
Row 3 | CM0R3 | CM1R3 | CM2R3 | CM3R3 | CM4R3 | CM5R3 | CM6R3 | CM7R3 |
Row 4 | CM0R4 | CM1R4 | CM2R4 | CM3R4 | CM4R4 | CM5R4 | CM6R4 | CM7R4 |
Row 5 | CM0R5 | CM1R5 | CM2R5 | CM3R5 | CM4R5 | CM5R5 | CM6R5 | CM7R5 |
Row 6 | CM0R6 | CM1R6 | CM2R6 | CM3R6 | CM4R6 | CM5R6 | CM6R6 | CM7R6 |
Row 7 | CM0R7 | CM1R7 | CM2R7 | CM3R7 | CM4R7 | CM5R7 | CM6R7 | CM7R7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Conflict-free layout with swizzling, where all 8 core matrices can be loaded without bank conflicts between core matrix rows:
+-----------------------------------------------------------------------+
| Swizzle Tile 0 |
+-----------------+-----------------+-----------------+-----------------+
| Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
Row 0 | CM0R0 | CM1R0 | CM2R0 | CM3R0 | CM4R0 | CM5R0 | CM6R0 | CM7R0 |
Row 1 | CM1R1 | CM0R1 | CM3R1 | CM2R1 | CM5R1 | CM4R1 | CM7R1 | CM6R1 |
Row 2 | CM2R2 | CM3R2 | CM0R2 | CM1R2 | CM6R2 | CM7R2 | CM4R2 | CM5R2 |
Row 3 | CM3R3 | CM2R3 | CM1R3 | CM0R3 | CM7R3 | CM6R3 | CM5R3 | CM4R3 |
Row 4 | CM4R4 | CM5R4 | CM6R4 | CM7R4 | CM0R4 | CM1R4 | CM2R4 | CM3R4 |
Row 5 | CM5R5 | CM4R5 | CM7R5 | CM6R5 | CM1R5 | CM0R5 | CM3R5 | CM2R5 |
Row 6 | CM6R6 | CM7R6 | CM4R6 | CM5R6 | CM2R6 | CM3R6 | CM0R6 | CM1R6 |
Row 7 | CM7R7 | CM6R7 | CM5R7 | CM4R7 | CM3R7 | CM2R7 | CM1R7 | CM0R7 |
+--------+--------+--------+--------+--------+--------+--------+--------+
5.5 Example E: 128B Swizzle Atom with 32B Swizzle Unit — 8×128B
Configuration: Swizzle<2, 5, 2>
- BBits=2: 4×4 tile of swizzle units
- MBase=5: 32-byte units (8 banks each)
- SShift=2: 4 units per row
Swizzle tile: 4 rows × 4 units (128 bytes wide)
This configuration uses larger swizzle units (32B instead of 16B), useful when threads load 32 bytes at a time (e.g., 8×fp32 or 16×fp16).
XOR values by row:
- Row 0, 4: XOR with 0
- Row 1, 5: XOR with 1
- Row 2, 6: XOR with 2
- Row 3, 7: XOR with 3
Physical Layout:
+-------------------------------------------------------------------+
| Swizzle Tile 0 |
+----------------+----------------+----------------+----------------+
| Unit 0 (32B) | Unit 1 (32B) | Unit 2 (32B) | Unit 3 (32B) |
+----------------+----------------+----------------+----------------+
Row 0 | 0 | 1 | 2 | 3 | XOR 0
Row 1 | 1 | 0 | 3 | 2 | XOR 1
Row 2 | 2 | 3 | 0 | 1 | XOR 2
Row 3 | 3 | 2 | 1 | 0 | XOR 3
+----------------+----------------+----------------+----------------+
Row 4 | 0 | 1 | 2 | 3 | XOR 0
Row 5 | 1 | 0 | 3 | 2 | XOR 1
Row 6 | 2 | 3 | 0 | 1 | XOR 2
Row 7 | 3 | 2 | 1 | 0 | XOR 3
+----------------+----------------+----------------+----------------+
Bank Mapping (each 32B unit spans 8 banks):
SMEM Banks: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
Row 0 | L0 | L1 | L2 | L3 |
Row 1 | L1 | L0 | L3 | L2 |
Row 2 | L2 | L3 | L0 | L1 |
Row 3 | L3 | L2 | L1 | L0 |
Row 4 | L0 | L1 | L2 | L3 |
...
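The bank mapping above can be reproduced with a couple of lines of arithmetic. This is a sketch specific to Swizzle<2, 5, 2> (4 units of 8 banks each per 128-byte row); `BankRange`, `physical_unit_32B`, and `banks_for` are names we made up for illustration.

```cpp
struct BankRange {
    unsigned first;  // first bank touched
    unsigned last;   // last bank touched (inclusive)
};

// Physical 32-byte unit holding `logical_unit` in SMEM row `row`.
constexpr unsigned physical_unit_32B(unsigned row, unsigned logical_unit) {
    return logical_unit ^ (row & 3u);  // BBits = 2: XOR with row mod 4
}

// Banks touched when reading that unit (each 32-byte unit spans 8 banks).
constexpr BankRange banks_for(unsigned row, unsigned logical_unit) {
    return BankRange{physical_unit_32B(row, logical_unit) * 8u,
                     physical_unit_32B(row, logical_unit) * 8u + 7u};
}

static_assert(banks_for(0, 0).first == 0 && banks_for(0, 0).last == 7, "row 0: L0 in banks 0-7");
static_assert(banks_for(1, 0).first == 8, "row 1: L0 in banks 8-15");
static_assert(banks_for(3, 2).first == 8, "row 3: L2 in physical unit 1");
static_assert(banks_for(5, 0).first == 8, "row 5 repeats row 1");
```

Rows 0 through 3 place logical unit 0 in four different 8-bank groups, which together cover all 32 banks; rows 4 through 7 repeat the pattern.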
6. Key Takeaways
- MBase sets the granularity: choose based on your natural access width (16B for fp16×8, 32B for fp16×16, etc.)
- BBits determines conflict reduction: larger BBits = more XOR values = fewer conflicts. For 8-row access, BBits=3 eliminates conflicts completely.
- SShift connects to SMEM geometry: typically 7 - MBase for 128-byte SMEM rows.
- Swizzle tiles are independent: no coordination needed across tile boundaries. Each tile applies its own XOR pattern.
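As a rough illustration, these rules of thumb can be folded into a small helper. This is only a sketch of the heuristics in this list, assuming 128-byte SMEM rows and power-of-two inputs; `suggest_params` is hypothetical and not something CuTe provides.

```cpp
struct SwizzleParams {
    unsigned BBits;
    unsigned MBase;
    unsigned SShift;
};

// Integer log2 for powers of two (e.g. 16 -> 4).
constexpr unsigned ilog2(unsigned x) {
    return x <= 1u ? 0u : 1u + ilog2(x / 2u);
}

// access_bytes: bytes each thread reads at once (power of two, at most 128).
// rows:         rows read together down a column (power of two).
// MBase  = log2(access_bytes)
// SShift = 7 - MBase            (128-byte SMEM rows)
// BBits  = min(log2(rows), SShift), since SShift >= BBits must hold
constexpr SwizzleParams suggest_params(unsigned access_bytes, unsigned rows) {
    return SwizzleParams{
        ilog2(rows) < 7u - ilog2(access_bytes) ? ilog2(rows)
                                               : 7u - ilog2(access_bytes),
        ilog2(access_bytes),
        7u - ilog2(access_bytes)};
}

static_assert(suggest_params(16, 8).BBits == 3, "16B access, 8 rows: Swizzle<3,4,3>");
static_assert(suggest_params(32, 8).BBits == 2, "32B access, 8 rows: Swizzle<2,5,2>");
```

The second case shows the SShift >= BBits constraint at work: with 32-byte units only 4 units fit in a row, so BBits is capped at 2 even for an 8-row access, as in Example E.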
7. Swizzle Visualizer
I “claude-coded” an HTML+JS based swizzle visualizer that lets you configure the swizzle unit, swizzle tile size, and grid configuration, and then visualize swizzled layouts in action.
I also have a more generic visualizer that allows you to change bank size and number of banks per row (forewarning: I haven’t tested this variant extensively. It seems to work pretty well in my limited testing.)
[Interactive widget: GPU Shared Memory Swizzle Visualizer, which visualizes how Swizzle<BBits, MBase, SShift> remaps addresses to avoid bank conflicts. Panels: Single Swizzle Tile, Full Grid Layout, Address Inspector, SMEM Bank Organization.]
Swizzle Formula
mask = (1 << BBits) - 1
col_in_tile = col & mask // col mod 2^BBits
row_mod = row & mask // row mod 2^BBits
swizzled_slot = col_in_tile ^ row_mod
// The swizzled slot determines which bank group is accessed.
// Looking down any column within a tile: all row_mod values are different,
// so all swizzled_slot values are different -> no bank conflicts!
How to Read This Visualization
Swizzle Tile: The fundamental repeating block of the pattern (2^BBits x 2^BBits swizzle units). The swizzle pattern is defined within a tile.
Grid: The full shared memory layout, composed of one or more swizzle tiles. Tile boundaries are marked with thick black lines.
Colors: Each color represents a logical column. In the swizzled view, colors show which logical column's data is stored at each physical slot.
WITHOUT Swizzle: Looking down any column, all cells have the same color -> bank conflict when reading column-major.
WITH Swizzle: Looking down any column, colors are shuffled -> no bank conflicts, data spread across different logical columns.
To verify: Use "Highlight Slot" to see which logical columns are accessed when reading a physical slot across rows.
Conclusion
I hope this post helped you decode the canonical swizzle notation in CuTe. Please feel free to reach out if you spot any inaccuracies.