vj-krish

Notes on GPUs, kernels and ML systems.

22 February 2026

Decoding Swizzle<B,M,S>: A Visual Guide to Bank-Conflict-Free Shared Memory Access

by Vijay Krishnamoorthy

Overview

Efficient use of shared memory (SMEM) is critical for achieving peak performance in GPU kernels. One of the most common pitfalls, bank conflicts, can silently degrade performance by forcing serialized memory accesses. Swizzling is a powerful technique that reorganizes data layout to eliminate these conflicts. This article builds a mental model for understanding swizzling, culminating in a thorough explanation of CuTe’s canonical Swizzle<BBits, MBase, SShift> representation.

1. The Problem: Bank Conflicts

How Shared Memory Banks Work

GPU shared memory is organized into banks, interleaved partitions that can each service one memory access per cycle. Think of banks as parallel lanes: when different threads access different banks, all accesses proceed simultaneously. On modern NVIDIA GPUs there are 32 banks, each 4 bytes wide, so consecutive 4-byte words map to consecutive banks and one 128-byte row spans all 32:

SMEM Row (128 bytes):
Bank:  0    1    2    3    4   ...  30   31
     [4B] [4B] [4B] [4B] [4B] ... [4B] [4B]

For example, when 32 threads in a warp access 32 different banks simultaneously, all accesses complete in a single transaction. This is an ideal scenario.

When Conflicts Occur

A bank conflict occurs when multiple threads access different addresses within the same bank. The hardware must serialize these accesses, degrading performance.

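Consider a matrix stored in row-major order where each element is 4 bytes. To keep the picture small, the diagram pretends SMEM has only 8 banks, so each 8-element row spans every bank exactly once: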

Logical Layout (8x8 matrix, 4 bytes per element):

          Col 0   Col 1   Col 2   Col 3   Col 4   Col 5   Col 6   Col 7
        +-------+-------+-------+-------+-------+-------+-------+-------+
Row 0   | B0    | B1    | B2    | B3    | B4    | B5    | B6    | B7    |
Row 1   | B0    | B1    | B2    | B3    | B4    | B5    | B6    | B7    |
Row 2   | B0    | B1    | B2    | B3    | B4    | B5    | B6    | B7    |
...     | ...   | ...   | ...   | ...   | ...   | ...   | ...   | ...   |
Row 7   | B0    | B1    | B2    | B3    | B4    | B5    | B6    | B7    |
        +-------+-------+-------+-------+-------+-------+-------+-------+
          ^       ^       ^       ^       ^       ^       ^       ^
          |       |       |       |       |       |       |       |
        Bank 0  Bank 1  Bank 2  Bank 3  Bank 4  Bank 5  Bank 6  Bank 7

Row-wise access (reading Row 0): Each thread reads a different column → different banks → no conflict.

Column-wise access (reading Col 0): Every thread reads from Bank 0 → 8-way bank conflict!
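
To make the bank arithmetic concrete, here is a minimal Python sketch. The helper name `conflict_degree` and its parameters are my own illustrative choices, not part of any library. It assumes the simple model described above (32 banks of 4 bytes, one address per thread) and counts how many distinct words land in the busiest bank.

```python
from collections import defaultdict

def conflict_degree(byte_addrs, num_banks=32, bank_bytes=4):
    """Worst-case number of distinct words that map to the same bank."""
    words_per_bank = defaultdict(set)
    for addr in byte_addrs:
        word = addr // bank_bytes                 # 4-byte word index
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())

# 32x32 row-major matrix of 4-byte elements: element (r, c) lives at (r*32 + c)*4.
row_access = [(0 * 32 + c) * 4 for c in range(32)]   # 32 threads read one row
col_access = [(r * 32 + 0) * 4 for r in range(32)]   # 32 threads read one column

print(conflict_degree(row_access))  # 1  -> conflict-free
print(conflict_degree(col_access))  # 32 -> fully serialized
```

Calling it with `num_banks=8` on the 8x8 example reproduces the 8-way conflict for the column read.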

2. The Solution: Swizzling

Swizzling reorganizes how data is stored in physical memory so that column-wise access patterns also hit different banks. The key insight is:

Apply a row-dependent transformation to the column index when storing data, so that elements in the same logical column end up in different physical banks.

The XOR Trick

The most elegant swizzling scheme uses bitwise XOR. For each element at logical position (row, col):

physical_col = logical_col XOR row

Let’s see this in action for an 8x8 tile:

Logical Layout:              Physical Layout (after XOR swizzle):

Col: 0 1 2 3 4 5 6 7         Col: 0 1 2 3 4 5 6 7
   +----------------+           +----------------+
R0 | 0 1 2 3 4 5 6 7 |       R0 | 0 1 2 3 4 5 6 7 |  (XOR 0 = no change)
R1 | 0 1 2 3 4 5 6 7 |       R1 | 1 0 3 2 5 4 7 6 |  (XOR 1)
R2 | 0 1 2 3 4 5 6 7 |       R2 | 2 3 0 1 6 7 4 5 |  (XOR 2)
R3 | 0 1 2 3 4 5 6 7 |       R3 | 3 2 1 0 7 6 5 4 |  (XOR 3)
R4 | 0 1 2 3 4 5 6 7 |       R4 | 4 5 6 7 0 1 2 3 |  (XOR 4)
R5 | 0 1 2 3 4 5 6 7 |       R5 | 5 4 7 6 1 0 3 2 |  (XOR 5)
R6 | 0 1 2 3 4 5 6 7 |       R6 | 6 7 4 5 2 3 0 1 |  (XOR 6)
R7 | 0 1 2 3 4 5 6 7 |       R7 | 7 6 5 4 3 2 1 0 |  (XOR 7)
   +----------------+           +----------------+

(Numbers show which logical column's data is stored at each physical position)

Now look at any physical column in the swizzled layout, say physical column 0. Reading down rows 0-7, it holds logical columns 0, 1, 2, 3, 4, 5, 6, 7.

Every row contains a different logical column! Conversely, when we read logical column 0 (rows 0-7), each thread accesses a different physical column → different banks → no conflict.

Why XOR Works Mathematically

The XOR operation with row indices guarantees conflict-free access because:

  1. XOR is its own inverse: a XOR b XOR b = a
  2. Unique mapping per row: For any fixed logical column c, as row r varies from 0 to 2^n-1, the value c XOR r produces all values from 0 to 2^n-1 exactly once.
  3. Bijection: XOR with a constant is a bijection (one-to-one mapping), so no two logical columns in the same row map to the same physical column.
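
The three properties are easy to check by brute force. The snippet below is a small sketch for n = 3 (an 8x8 tile). The variable names are illustrative, and `layout` rebuilds the physical layout table shown earlier.

```python
n = 3
size = 1 << n  # 8 rows, 8 columns

# Property 1: XOR is its own inverse.
assert all((a ^ b) ^ b == a for a in range(size) for b in range(size))

# Property 2: a fixed logical column lands in a different physical column in every row.
for c in range(size):
    assert sorted(c ^ r for r in range(size)) == list(range(size))

# Property 3: within one row, logical -> physical column is a bijection.
for r in range(size):
    assert sorted(c ^ r for c in range(size)) == list(range(size))

# layout[r][p] is the logical column stored at physical column p of row r.
# Since physical = logical ^ r, we also have logical = physical ^ r.
layout = [[p ^ r for p in range(size)] for r in range(size)]
print(layout[5])  # [5, 4, 7, 6, 1, 0, 3, 2]
```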

3. CuTe’s Canonical Swizzle Form

NVIDIA’s CuTe library provides a canonical representation for swizzle patterns:

Swizzle<BBits, MBase, SShift>

This compact notation encodes everything needed to describe a swizzle pattern. Before diving into each parameter, let’s understand the high-level formula:

Given a byte address A, decompose it into bit fields:
A = [ ... ] [ YYY ] [ ... ] [ ZZZ ] [ ... ]
       ↑       ↑       ↑       ↑       ↑
       │       │       │       │       │
       │       │       │       │       └── bits [0, MBase)
       │       │       │       │
       │       │       │       └── bits [MBase, MBase+BBits)
       │       │       │
       │       │       └── bits [MBase+BBits, MBase+SShift)
       │       │
       │       └── bits [MBase+SShift, MBase+SShift+BBits)
       │
       └── bits [MBase+SShift+BBits, ...)

Swizzled address A' = A with ZZZ replaced by (ZZZ XOR YYY)
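
As a sketch, the formula translates into a couple of lines of Python. This mirrors the description in this post rather than CuTe's actual source. The function name `swizzle` is illustrative, and the code assumes a non-negative SShift with SShift >= BBits.

```python
def swizzle(addr, bbits, mbase, sshift):
    """XOR the YYY field into the ZZZ field of addr (assumes sshift >= bbits >= 0)."""
    mask = (1 << bbits) - 1
    yyy = (addr >> (mbase + sshift)) & mask   # bits [MBase+SShift, MBase+SShift+BBits)
    return addr ^ (yyy << mbase)              # lands on bits [MBase, MBase+BBits)

# Swizzle<3, 4, 3>: row 5, column 2 of a 128-byte-wide row-major layout.
addr = 5 * 128 + 2 * 16        # 672
print(swizzle(addr, 3, 4, 3))  # 752 = 5*128 + (2 ^ 5)*16
```

Because YYY itself is untouched, applying the function twice returns the original address, and the low MBase bits never change.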

Now let’s build intuition for what each parameter means.

4. Building Intuition: MBase, BBits, and SShift

While CuTe uses the ordering <BBits, MBase, SShift>, it’s pedagogically clearer to understand them in a different order: MBase → BBits → SShift. Let’s explore each.

4.1 MBase: The Swizzle Unit

MBase defines the fundamental unit of data that moves together during swizzling. The swizzle unit size is:

Swizzle Unit Size = 2^MBase bytes

Key insight: Data within a swizzle unit is never rearranged. Swizzling only determines which physical slot a swizzle unit occupies—it doesn’t shuffle bytes inside the unit.

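+-------+-------------------+---------------+
| MBase | Swizzle Unit Size | Banks Spanned |
+-------+-------------------+---------------+
|   2   | 4 bytes           | 1 bank        |
|   3   | 8 bytes           | 2 banks       |
|   4   | 16 bytes          | 4 banks       |
|   5   | 32 bytes          | 8 banks       |
+-------+-------------------+---------------+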

Why does this matter? If your access pattern naturally reads 16-byte chunks (e.g., loading 4 fp32 values or 8 fp16 values per thread), you want MBase=4. There’s no benefit to swizzling at a finer granularity—it would just add complexity.

4.2 BBits: The Swizzle Tile Dimensions

BBits defines the swizzle tile as a square grid:

Swizzle Tile = 2^BBits rows × 2^BBits columns (of swizzle units)
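
+-------+------------+------------------------+
| BBits | Tile Shape | Swizzle Units per Tile |
+-------+------------+------------------------+
|   1   | 2 × 2      | 4                      |
|   2   | 4 × 4      | 16                     |
|   3   | 8 × 8      | 64                     |
+-------+------------+------------------------+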

4.3 SShift: The Row Stride

Before diving into address bits, let’s understand what SShift represents. In a row-major layout, each row holds 2^SShift swizzle units.

Physically, SShift determines how wide each row is in memory. For a standard 128-byte SMEM row with 16-byte swizzle units, a row holds 128 / 16 = 8 units, so SShift = 3.

Intuitively, SShift encodes where the row index bits start in the address. The column index occupies bits [MBase, MBase+SShift), and the row index starts at bit MBase+SShift.

A row-major address is structured as follows:

Address = row × row_stride + col × unit_size
        = row × 2^(MBase+SShift) + col × 2^MBase

Address bits map to a 3-level hierarchy:

┌─ Level 1: SMEM Grid ──────────────────────────────────────────────────────┐
│  (swizzle tiles arranged in rows and columns)                             │
│                                                                           │
│  row_tile_idx                            col_tile_idx                     │
│       │                                       │                           │
│       │      ┌────────────┬────────────┬──────▼─────┬────────────┐        │
│       └─────►│  Tile 0,0  │  Tile 0,1  │  Tile 0,2  │  Tile 0,3  │ ...    │
│              ├────────────┼────────────┼────────────┼────────────┤        │
│              │  Tile 1,0  │  Tile 1,1  │  Tile 1,2  │  Tile 1,3  │ ...    │
│              └────────────┴────────────┴──────┬─────┴────────────┘        │
│                                               │                           │
│                                               ▼                           │
├─ Level 2: Inside a Tile ──────────────────────────────────────────────────┤
│  (2^BBits rows × 2^BBits columns of swizzle units)                        │
│                                                                           │
│  row_in_tile (YYY)                col_in_tile (ZZZ)                       │
│       │                                  │                                │
│       │         C0    C1    C2    C3   C4▼    C5    C6    C7              │
│       │      ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐            │
│       └─► R0 │     │     │     │     │  ●  │     │     │     │            │
│           R1 │     │     │     │     │     │     │     │     │            │
│           R2 │     │     │     │     │     │     │     │     │            │
│          ... └─────┴─────┴─────┴─────┴──┬──┴─────┴─────┴─────┘            │
│                                         │                                 │
│  ★ SWIZZLE: physical_col = col_in_tile XOR row_in_tile                    │
│                                         ▼                                 │
├─ Level 3: Inside a Swizzle Unit ──────────────────────────────────────────┤
│  (2^MBase bytes — NEVER modified by swizzling)                            │
│                                                                           │
│  byte_offset ──►  ┌────┬────┬────┬────┬────┬────┬─────┬──────────────┐    │
│                   │ B0 │ B1 │ B2 │ B3 │ B4 │ B5 │ ... │ B(2^MBase-1) │    │
│                   └────┴────┴────┴────┴────┴────┴─────┴──────────────┘    │
│                                                                           │
│  Data within a swizzle unit always stays together.                        │
└───────────────────────────────────────────────────────────────────────────┘

Note: col_tile_idx only exists when SShift > BBits (multiple swizzle tiles per grid row of tiles). When SShift = BBits, the entire column index fits in col_in_tile and there’s only one tile per row.
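
Here is a small sketch of how an address splits into those fields. The function `decompose` and its dictionary keys are illustrative names that follow the labels in the diagram, and the code assumes a plain row-major layout starting at address 0.

```python
def decompose(addr, bbits, mbase, sshift):
    """Split a byte address into the fields of the 3-level hierarchy."""
    unit_mask = (1 << mbase) - 1
    tile_mask = (1 << bbits) - 1
    col = (addr >> mbase) & ((1 << sshift) - 1)   # full column index, in swizzle units
    row = addr >> (mbase + sshift)                # full row index
    return {
        "row_tile_idx": row >> bbits,
        "row_in_tile": row & tile_mask,    # YYY
        "col_tile_idx": col >> bbits,      # always 0 when sshift == bbits
        "col_in_tile": col & tile_mask,    # ZZZ
        "byte_offset": addr & unit_mask,
    }

# Swizzle<2, 4, 3>: 16-byte units, 8 units per row, 4x4 tiles.
# Row 6, column 5, byte 3 -> address 6*128 + 5*16 + 3 = 851.
fields = decompose(6 * 128 + 5 * 16 + 3, bbits=2, mbase=4, sshift=3)
print(fields)
# {'row_tile_idx': 1, 'row_in_tile': 2, 'col_tile_idx': 1, 'col_in_tile': 1, 'byte_offset': 3}
```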

The swizzle operation XORs the YYY bits (row portion) into the ZZZ bits (column portion):

Swizzled Address = Original Address with ZZZ replaced by (ZZZ XOR YYY)

Why this works: The YYY bits contain the lower BBits of the row index. By XORing them into the ZZZ bits (lower BBits of column), we effectively remap which physical column stores each logical column—and the remapping is different for each row.

The mask: mask = (1 << BBits) - 1 isolates BBits bits. For BBits=3: mask = 0b111 = 7

Equivalence to row/column XOR: Because of how the address is structured, the swizzle is equivalent to:

col_in_tile = col & mask           // Lower BBits of column (ZZZ)
row_in_tile = row & mask           // Lower BBits of row (YYY)
physical_slot_in_tile = col_in_tile XOR row_in_tile

But remember: the actual operation is on address bits, not indices. The row bits at position [MBase+SShift, MBase+SShift+BBits) are XORed into the column bits at position [MBase, MBase+BBits).

Why square tiles? The XOR trick requires 2^BBits unique values in both YYY and ZZZ to create a bijection. With 2^BBits rows and 2^BBits column positions within a tile, we get exactly the right range of XOR operands.

To recap, SShift specifies how many swizzle units fit in one logical row:

Swizzle Units per Row = 2^SShift

This parameter connects the swizzle tile to the actual SMEM layout. Typically:

SShift = log2(SMEM_row_bytes / swizzle_unit_bytes)
       = log2(128 / 2^MBase)
       = 7 - MBase
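
+-------+--------------+----------------------------+---------------+
| MBase | Swizzle Unit | SShift (for 128B SMEM row) | Units per Row |
+-------+--------------+----------------------------+---------------+
|   4   | 16 bytes     | 3                          | 8             |
|   5   | 32 bytes     | 2                          | 4             |
+-------+--------------+----------------------------+---------------+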

Critical constraint: SShift >= BBits

The swizzle tile width (2^BBits units) must fit within one row (2^SShift units). If the tile were wider than the row, the ZZZ and YYY bit fields would overlap in the address and the swizzle pattern would break.

Repetition: When SShift > BBits, multiple swizzle tiles fit horizontally in one SMEM row:

Tiles per Row = 2^(SShift - BBits)

Important: Swizzling operates independently within each tile. There’s no swizzling across tile boundaries—each tile applies its own XOR pattern based on the row index within that tile.
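
The relationships in this section can be collected into one helper. This is a sketch with an illustrative name (`swizzle_geometry`) and illustrative dictionary keys. It assumes a non-negative SShift and takes the 128-byte SMEM row as a default parameter.

```python
def swizzle_geometry(bbits, mbase, sshift, smem_row_bytes=128):
    """Derive the sizes implied by Swizzle<bbits, mbase, sshift>."""
    if sshift < bbits:
        raise ValueError("SShift must be >= BBits")
    unit_bytes = 1 << mbase
    units_per_row = 1 << sshift
    return {
        "unit_bytes": unit_bytes,
        "units_per_row": units_per_row,
        "row_bytes": unit_bytes * units_per_row,
        "tile_shape": (1 << bbits, 1 << bbits),      # rows x columns, in swizzle units
        "tiles_per_row": 1 << (sshift - bbits),
        "matches_smem_row": unit_bytes * units_per_row == smem_row_bytes,
    }

for params in [(1, 4, 3), (2, 4, 3), (3, 4, 3), (2, 5, 2)]:
    print(params, swizzle_geometry(*params))
```

For these four configurations the tiles per row come out as 4, 2, 1 and 1.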

4.4 Putting It All Together

For Swizzle<BBits=3, MBase=4, SShift=3>:

Address bit layout:

Original Address (for row r, column c, byte offset b):
A = r × 2^(4+3) + c × 2^4 + b = r × 128 + c × 16 + b

Bit positions:
  [13..10]  [9  8  7]  [6  5  4]  [3  2  1  0]
  row_tile  row_in     col_in      byte
   _idx      _tile      _tile      _offset
             (YYY)      (ZZZ)

             ↑ bits [7,10)  ↑ bits [4,7)   ↑ bits [0,4)

Since SShift = BBits = 3, there is no col_tile_idx field—the entire column fits in col_in_tile. The swizzle tile spans the full SMEM row width.

The swizzle operation:

ZZZ = (Address >> MBase) & 0b111          // Extract bits [4,7) = col_in_tile
YYY = (Address >> (MBase+SShift)) & 0b111 // Extract bits [7,10) = row_in_tile

Swizzled Address = Address XOR (YYY << MBase)
                 = Address with ZZZ replaced by (ZZZ XOR YYY)

Equivalent index-based view:

col_in_tile = c & 0b111                   // Lower 3 bits of column
row_in_tile = r & 0b111                   // Lower 3 bits of row
physical_slot_in_tile = col_in_tile XOR row_in_tile
physical_col = (c & ~0b111) | physical_slot_in_tile  // Preserve col tile index

The key insight: we XOR the row bits (YYY) into the column bits (ZZZ) at their respective positions in the address. The upper column bits (tile index) and all other bits pass through unchanged, ensuring swizzling is self-contained within each tile.
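
To convince yourself that the address-bit view and the index-based view agree, you can compare them exhaustively. The sketch below uses illustrative function names, and the check over 16 rows is an arbitrary choice.

```python
def swizzle(addr, bbits, mbase, sshift):
    """Address-bit view: XOR YYY into ZZZ."""
    mask = (1 << bbits) - 1
    yyy = (addr >> (mbase + sshift)) & mask
    return addr ^ (yyy << mbase)

def swizzle_by_index(r, c, b, bbits, mbase, sshift):
    """Index view: XOR the low bits of row into the low bits of column."""
    mask = (1 << bbits) - 1
    slot = (c & mask) ^ (r & mask)
    physical_col = (c & ~mask) | slot          # preserve the column tile index
    return (r << (mbase + sshift)) + (physical_col << mbase) + b

def views_agree(bbits, mbase, sshift, rows=16):
    for r in range(rows):
        for c in range(1 << sshift):
            for b in range(1 << mbase):
                addr = (r << (mbase + sshift)) + (c << mbase) + b
                if swizzle(addr, bbits, mbase, sshift) != swizzle_by_index(r, c, b, bbits, mbase, sshift):
                    return False
    return True

print(views_agree(3, 4, 3))  # True
print(views_agree(1, 4, 3))  # True (here col_tile_idx is non-trivial)
```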

5. Worked Examples

Let’s walk through concrete examples using the standard GPU SMEM configuration: 32 banks of 4 bytes each, i.e. 128 bytes per SMEM row.

We’ll examine different swizzle configurations corresponding to different “swizzle atoms” from the previous post on MMA layouts.

5.1 Example A: No Swizzle — 8×16B (1 Core Matrix)

Configuration: Swizzle<0, 4, 3>

Logical Layout (8 rows × 8 units of 16B each = 8 × 128B):

        +--------+--------+--------+--------+--------+--------+--------+--------+
        | Swiz   | Swiz   | Swiz   | Swiz   | Swiz   | Swiz   | Swiz   | Swiz   |
        | Tile 0 | Tile 1 | Tile 2 | Tile 3 | Tile 4 | Tile 5 | Tile 6 | Tile 7 |
        +--------+--------+--------+--------+--------+--------+--------+--------+
        | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 0   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    |
Row 1   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    |
Row 2   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    |
Row 3   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    |
Row 4   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    |
Row 5   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    |
Row 6   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    |
Row 7   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    |
        +--------+--------+--------+--------+--------+--------+--------+--------+

Physical Layout: Same as logical (no swizzle applied)

Bank Mapping (each 16B unit spans 4 banks):

          Banks 0-3  Banks 4-7  Banks 8-11 Banks 12-15 Banks 16-19 Banks 20-23 Banks 24-27 Banks 28-31
        +----------+----------+----------+-----------+-----------+-----------+-----------+-----------+
Row 0   |  Unit 0  |  Unit 1  |  Unit 2  |  Unit 3   |  Unit 4   |  Unit 5   |  Unit 6   |  Unit 7   |
Row 1   |  Unit 0  |  Unit 1  |  Unit 2  |  Unit 3   |  Unit 4   |  Unit 5   |  Unit 6   |  Unit 7   |
...     |   ...    |   ...    |   ...    |   ...     |   ...     |   ...     |   ...     |   ...     |
        +----------+----------+----------+-----------+-----------+-----------+-----------+-----------+

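Column access pattern (reading logical Unit 0): every row keeps Unit 0 in banks 0-3, so the 8 threads reading down the column hit the same banks and the access is serialized (8-way bank conflict).

Tensor Core access pattern for the swizzle atom with no swizzle (8x16B): no bank conflict. The eight 16-byte rows of a single core matrix are stored back to back, so they fill the eight distinct units of one 128-byte SMEM row.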

5.2 Example B: 32B Swizzle Atom — 8×32B (2 Core Matrices)

Configuration: Swizzle<1, 4, 3>

Swizzle tile: 2 rows × 2 units (32 bytes wide)

XOR values by row: 0, 1, 0, 1, 0, 1, 0, 1 (row mod 2)

Physical Layout (showing logical column at each physical position):

        +-----------------+-----------------+-----------------+-----------------+
        | Swizzle Tile 0  | Swizzle Tile 1  | Swizzle Tile 2  | Swizzle Tile 3  |
        +-----------------+-----------------+-----------------+-----------------+
        | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 0   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    | XOR 0
Row 1   |   1    |   0    |   3    |   2    |   5    |   4    |   7    |   6    | XOR 1
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 2   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    | XOR 0
Row 3   |   1    |   0    |   3    |   2    |   5    |   4    |   7    |   6    | XOR 1
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 4   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    | XOR 0
Row 5   |   1    |   0    |   3    |   2    |   5    |   4    |   7    |   6    | XOR 1
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 6   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    | XOR 0
Row 7   |   1    |   0    |   3    |   2    |   5    |   4    |   7    |   6    | XOR 1
        +--------+--------+--------+--------+--------+--------+--------+--------+

Column access pattern (reading logical Unit 0): rows 0-7 hit physical units 0, 1, 0, 1, 0, 1, 0, 1, so the 8-way conflict only drops to 4-way.

Tensor Core access pattern for the 2-core-matrix-wide swizzle atom (8x32B): no bank conflict.

Without swizzling, the 2 core matrices would be stored across 2 SMEM rows resulting in bank conflicts when loading the 8x32B atom.

If you recall from the last post, a core matrix is 8x16B, so each row of the core matrix maps to a swizzle unit spanning 4 banks in this case.

Logical Swizzle atom that’s 2 core matrices wide looks as shown below. Rows are denoted by CMmRn (Row n in Core Matrix m).

+---------+---------+
|8x32B Swizzle Atom |
+---------+---------+
|  CM0R0  |  CM1R0  |
|  CM0R1  |  CM1R1  |
|  CM0R2  |  CM1R2  |
|  CM0R3  |  CM1R3  |
|  CM0R4  |  CM1R4  |
|  CM0R5  |  CM1R5  |
|  CM0R6  |  CM1R6  |
|  CM0R7  |  CM1R7  |
+---------+---------+

Stored row-major without swizzling, the atom spans 2 SMEM rows, and rows of the same core matrix collide (for example, CM0R0 and CM0R4 both sit in Unit 0):

        +-----------------+-----------------+-----------------+-----------------+
        | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 0   |  CM0R0 |  CM1R0 |  CM0R1 |  CM1R1 |  CM0R2 |  CM1R2 |  CM0R3 | CM1R3  |
Row 1   |  CM0R4 |  CM1R4 |  CM0R5 |  CM1R5 |  CM0R6 |  CM1R6 |  CM0R7 | CM1R7  |
        +--------+--------+--------+--------+--------+--------+--------+--------+

Conflict-free layout with swizzling, where both core matrices can be loaded without bank conflicts between core matrix rows:

        +-----------------+-----------------+-----------------+-----------------+
        |  Swizzle Tile 0 |  Swizzle Tile 1 |  Swizzle Tile 2 |  Swizzle Tile 3 |
        +-----------------+-----------------+-----------------+-----------------+
        | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 0   |  CM0R0 |  CM1R0 |  CM0R1 |  CM1R1 |  CM0R2 |  CM1R2 |  CM0R3 | CM1R3  |
Row 1   |  CM1R4 |  CM0R4 |  CM1R5 |  CM0R5 |  CM1R6 |  CM0R6 |  CM1R7 | CM0R7  |
        +--------+--------+--------+--------+--------+--------+--------+--------+

5.3 Example C: 64B Swizzle Atom — 8×64B (4 Core Matrices)

Configuration: Swizzle<2, 4, 3>

Swizzle tile: 4 rows × 4 units (64 bytes wide)

XOR values by row: 0, 1, 2, 3, 0, 1, 2, 3 (row mod 4)

Physical Layout:

        +-----------------------------------+-----------------------------------+
        |          Swizzle Tile 0           |          Swizzle Tile 1           |
        +-----------------+-----------------+-----------------+-----------------+
        | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 0   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    | XOR 0
Row 1   |   1    |   0    |   3    |   2    |   5    |   4    |   7    |   6    | XOR 1
Row 2   |   2    |   3    |   0    |   1    |   6    |   7    |   4    |   5    | XOR 2
Row 3   |   3    |   2    |   1    |   0    |   7    |   6    |   5    |   4    | XOR 3
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 4   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    | XOR 0
Row 5   |   1    |   0    |   3    |   2    |   5    |   4    |   7    |   6    | XOR 1
Row 6   |   2    |   3    |   0    |   1    |   6    |   7    |   4    |   5    | XOR 2
Row 7   |   3    |   2    |   1    |   0    |   7    |   6    |   5    |   4    | XOR 3
        +--------+--------+--------+--------+--------+--------+--------+--------+

Column access pattern (reading logical Unit 0): rows 0-7 hit physical units 0, 1, 2, 3, 0, 1, 2, 3, so the 8-way conflict drops to 2-way.

Tensor Core access pattern for the 4-core-matrix-wide swizzle atom (8x64B): no bank conflict.

Without swizzling, the 4 core matrices would be stored across 4 SMEM rows resulting in bank conflicts when loading the 8x64B atom.

        +-----------------+-----------------+-----------------+-----------------+
        | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 0   |  CM0R0 |  CM1R0 |  CM2R0 |  CM3R0 |  CM0R1 |  CM1R1 |  CM2R1 | CM3R1  |
Row 1   |  CM0R2 |  CM1R2 |  CM2R2 |  CM3R2 |  CM0R3 |  CM1R3 |  CM2R3 | CM3R3  |
Row 2   |  CM0R4 |  CM1R4 |  CM2R4 |  CM3R4 |  CM0R5 |  CM1R5 |  CM2R5 | CM3R5  |
Row 3   |  CM0R6 |  CM1R6 |  CM2R6 |  CM3R6 |  CM0R7 |  CM1R7 |  CM2R7 | CM3R7  |
        +--------+--------+--------+--------+--------+--------+--------+--------+

Conflict-free layout with swizzling, where all 4 core matrices can be loaded without bank conflicts between core matrix rows:

        +-----------------------------------+-----------------------------------+
        |          Swizzle Tile 0           |          Swizzle Tile 1           |
        +-----------------+-----------------+-----------------+-----------------+
        | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 0   |  CM0R0 |  CM1R0 |  CM2R0 |  CM3R0 |  CM0R1 |  CM1R1 |  CM2R1 | CM3R1  |
Row 1   |  CM1R2 |  CM0R2 |  CM3R2 |  CM2R2 |  CM1R3 |  CM0R3 |  CM3R3 | CM2R3  |
Row 2   |  CM2R4 |  CM3R4 |  CM0R4 |  CM1R4 |  CM2R5 |  CM3R5 |  CM0R5 | CM1R5  |
Row 3   |  CM3R6 |  CM2R6 |  CM1R6 |  CM0R6 |  CM3R7 |  CM2R7 |  CM1R7 | CM0R7  |
        +--------+--------+--------+--------+--------+--------+--------+--------+

5.4 Example D: 128B Swizzle Atom — 8×128B (8 Core Matrices)

Configuration: Swizzle<3, 4, 3>

Swizzle tile: 8 rows × 8 units (128 bytes wide) — exactly one SMEM row!

XOR values by row: 0, 1, 2, 3, 4, 5, 6, 7 (all unique)

Physical Layout:

        +-----------------------------------------------------------------------+
        |                            Swizzle Tile 0                             |
        +-----------------+-----------------+-----------------+-----------------+
        | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 0   |   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    | XOR 0
Row 1   |   1    |   0    |   3    |   2    |   5    |   4    |   7    |   6    | XOR 1
Row 2   |   2    |   3    |   0    |   1    |   6    |   7    |   4    |   5    | XOR 2
Row 3   |   3    |   2    |   1    |   0    |   7    |   6    |   5    |   4    | XOR 3
Row 4   |   4    |   5    |   6    |   7    |   0    |   1    |   2    |   3    | XOR 4
Row 5   |   5    |   4    |   7    |   6    |   1    |   0    |   3    |   2    | XOR 5
Row 6   |   6    |   7    |   4    |   5    |   2    |   3    |   0    |   1    | XOR 6
Row 7   |   7    |   6    |   5    |   4    |   3    |   2    |   1    |   0    | XOR 7
        +--------+--------+--------+--------+--------+--------+--------+--------+

Column access pattern (reading logical Unit 0): rows 0-7 hit physical units 0 through 7, so all 8 rows access different banks!

Tensor Core access pattern for the 8-core-matrix-wide swizzle atom (8x128B): no bank conflict.

Without swizzling, the 8 core matrices would be stored across 8 SMEM rows resulting in bank conflicts when loading the 8x128B atom.

        +-----------------------------------------------------------------------+
        |                            Swizzle Tile 0                             |
        +-----------------+-----------------+-----------------+-----------------+
        | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 0   |  CM0R0 |  CM1R0 |  CM2R0 |  CM3R0 |  CM4R0 |  CM5R0 |  CM6R0 | CM7R0  |
Row 1   |  CM0R1 |  CM1R1 |  CM2R1 |  CM3R1 |  CM4R1 |  CM5R1 |  CM6R1 | CM7R1  |
Row 2   |  CM0R2 |  CM1R2 |  CM2R2 |  CM3R2 |  CM4R2 |  CM5R2 |  CM6R2 | CM7R2  |
Row 3   |  CM0R3 |  CM1R3 |  CM2R3 |  CM3R3 |  CM4R3 |  CM5R3 |  CM6R3 | CM7R3  |
Row 4   |  CM0R4 |  CM1R4 |  CM2R4 |  CM3R4 |  CM4R4 |  CM5R4 |  CM6R4 | CM7R4  |
Row 5   |  CM0R5 |  CM1R5 |  CM2R5 |  CM3R5 |  CM4R5 |  CM5R5 |  CM6R5 | CM7R5  |
Row 6   |  CM0R6 |  CM1R6 |  CM2R6 |  CM3R6 |  CM4R6 |  CM5R6 |  CM6R6 | CM7R6  |
Row 7   |  CM0R7 |  CM1R7 |  CM2R7 |  CM3R7 |  CM4R7 |  CM5R7 |  CM6R7 | CM7R7  |
        +--------+--------+--------+--------+--------+--------+--------+--------+

Conflict-free layout with swizzling, where all 8 core matrices can be loaded without bank conflicts between core matrix rows:

        +-----------------------------------------------------------------------+
        |                            Swizzle Tile 0                             |
        +-----------------+-----------------+-----------------+-----------------+
        | Unit 0 | Unit 1 | Unit 2 | Unit 3 | Unit 4 | Unit 5 | Unit 6 | Unit 7 |
        +--------+--------+--------+--------+--------+--------+--------+--------+
Row 0   |  CM0R0 |  CM1R0 |  CM2R0 |  CM3R0 |  CM4R0 |  CM5R0 |  CM6R0 | CM7R0  |
Row 1   |  CM1R1 |  CM0R1 |  CM3R1 |  CM2R1 |  CM5R1 |  CM4R1 |  CM7R1 | CM6R1  |
Row 2   |  CM2R2 |  CM3R2 |  CM0R2 |  CM1R2 |  CM6R2 |  CM7R2 |  CM4R2 | CM5R2  |
Row 3   |  CM3R3 |  CM2R3 |  CM1R3 |  CM0R3 |  CM7R3 |  CM6R3 |  CM5R3 | CM4R3  |
Row 4   |  CM4R4 |  CM5R4 |  CM6R4 |  CM7R4 |  CM0R4 |  CM1R4 |  CM2R4 | CM3R4  |
Row 5   |  CM5R5 |  CM4R5 |  CM7R5 |  CM6R5 |  CM1R5 |  CM0R5 |  CM3R5 | CM2R5  |
Row 6   |  CM6R6 |  CM7R6 |  CM4R6 |  CM5R6 |  CM2R6 |  CM3R6 |  CM0R6 | CM1R6  |
Row 7   |  CM7R7 |  CM6R7 |  CM5R7 |  CM4R7 |  CM3R7 |  CM2R7 |  CM1R7 | CM0R7  |
        +--------+--------+--------+--------+--------+--------+--------+--------+
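
The core-matrix tables in Examples B through D can be regenerated with a short script. This is a sketch under stated assumptions: the atom is stored row-major starting at a 128-byte-aligned offset 0, the number of core matrices is 1, 2, 4 or 8, and the swizzle is Swizzle<log2(num_cms), 4, 3>. The function names are illustrative.

```python
def atom_layout(num_cms, swizzled=True):
    """Place an 8 x (16*num_cms)-byte atom into 128-byte SMEM rows of eight 16B units."""
    bbits = num_cms.bit_length() - 1          # log2(num_cms) for num_cms in {1, 2, 4, 8}
    mask = (1 << bbits) - 1
    rows = {}
    for n in range(8):                        # row n of each core matrix
        for m in range(num_cms):              # core matrix m
            addr = (n * num_cms + m) * 16     # row-major byte offset inside the atom
            if swizzled:
                addr ^= ((addr >> 7) & mask) << 4   # Swizzle<bbits, 4, 3>
            rows.setdefault(addr >> 7, {})[(addr >> 4) & 7] = f"CM{m}R{n}"
    return [[rows[r][u] for u in range(8)] for r in sorted(rows)]

def units_hit(layout, m):
    """Physical unit slots touched when loading all 8 rows of core matrix m."""
    return sorted(u for row in layout for u, name in enumerate(row)
                  if name.startswith(f"CM{m}R"))

for row in atom_layout(2):
    print(" ".join(row))
# CM0R0 CM1R0 CM0R1 CM1R1 CM0R2 CM1R2 CM0R3 CM1R3
# CM1R4 CM0R4 CM1R5 CM0R5 CM1R6 CM0R6 CM1R7 CM0R7

print(units_hit(atom_layout(2, swizzled=False), 0))  # [0, 0, 2, 2, 4, 4, 6, 6]
print(units_hit(atom_layout(2, swizzled=True), 0))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

With swizzling, every core matrix in the 2-, 4- and 8-wide atoms touches all eight unit slots exactly once, which is what makes the load conflict-free in this model.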

5.5 Example E: 128B Swizzle Atom with 32B Swizzle Unit — 8×128B

Configuration: Swizzle<2, 5, 2>

Swizzle tile: 4 rows × 4 units (128 bytes wide)

This configuration uses larger swizzle units (32B instead of 16B), useful when threads load 32 bytes at a time (e.g., 8×fp32 or 16×fp16).

XOR values by row: 0, 1, 2, 3, 0, 1, 2, 3 (row mod 4)

Physical Layout:

        +-------------------------------------------------------------------+
        |                          Swizzle Tile 0                           |
        +----------------+----------------+----------------+----------------+
        |  Unit 0 (32B)  |  Unit 1 (32B)  |  Unit 2 (32B)  |  Unit 3 (32B)  |
        +----------------+----------------+----------------+----------------+
Row 0   |       0        |       1        |       2        |       3        | XOR 0
Row 1   |       1        |       0        |       3        |       2        | XOR 1
Row 2   |       2        |       3        |       0        |       1        | XOR 2
Row 3   |       3        |       2        |       1        |       0        | XOR 3
        +----------------+----------------+----------------+----------------+
Row 4   |       0        |       1        |       2        |       3        | XOR 0
Row 5   |       1        |       0        |       3        |       2        | XOR 1
Row 6   |       2        |       3        |       0        |       1        | XOR 2
Row 7   |       3        |       2        |       1        |       0        | XOR 3
        +----------------+----------------+----------------+----------------+

Bank Mapping (each 32B unit spans 8 banks):

SMEM Banks:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31
            +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
Row 0       |              L0               |              L1               |              L2               |              L3               |
Row 1       |              L1               |              L0               |              L3               |              L2               |
Row 2       |              L2               |              L3               |              L0               |              L1               |
Row 3       |              L3               |              L2               |              L1               |              L0               |
Row 4       |              L0               |              L1               |              L2               |              L3               |
...
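
To compare the five configurations side by side, the sketch below counts how many of the 8 rows place logical Unit 0 in the same physical unit slot. It is a simplified model: it treats "same unit slot" as "same group of banks", assumes each logical row is one 128-byte SMEM row, and ignores the architecture-specific rules for how wide accesses are split into phases. Function names are illustrative.

```python
from collections import Counter

def swizzle(addr, bbits, mbase, sshift):
    mask = (1 << bbits) - 1
    yyy = (addr >> (mbase + sshift)) & mask
    return addr ^ (yyy << mbase)

def column_read_conflict(bbits, mbase, sshift, logical_col=0, rows=8):
    """Max number of rows whose copy of logical_col shares a physical unit slot."""
    units_per_row = 1 << sshift
    slots = []
    for r in range(rows):
        addr = (r << (mbase + sshift)) + (logical_col << mbase)
        phys = swizzle(addr, bbits, mbase, sshift)
        slots.append((phys >> mbase) % units_per_row)
    return max(Counter(slots).values())

for cfg in [(0, 4, 3), (1, 4, 3), (2, 4, 3), (3, 4, 3), (2, 5, 2)]:
    print(cfg, column_read_conflict(*cfg))
# (0, 4, 3) 8
# (1, 4, 3) 4
# (2, 4, 3) 2
# (3, 4, 3) 1
# (2, 5, 2) 2
```

Under this model, only Swizzle<3, 4, 3> makes an 8-row column read fully conflict-free. The smaller swizzles are designed around the narrower Tensor Core atoms shown above.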

6. Key Takeaways

  1. MBase sets the granularity—choose based on your natural access width (16B for fp16×8, 32B for fp16×16, etc.)

  2. BBits determines conflict reduction—larger BBits = more XOR values = fewer conflicts. For 8-row access, BBits=3 eliminates conflicts completely.

  3. SShift connects to SMEM geometry—typically 7 - MBase for 128-byte SMEM rows.

  4. Swizzle tiles are independent—no coordination needed across tile boundaries. Each tile applies its own XOR pattern.

7. Swizzle Visualizer

I “claude-coded” an HTML+JS based swizzle visualizer that lets you configure the swizzle unit, swizzle tile size and grid configuration and then visualize swizzled layouts in action.

I also have a more generic visualizer that allows you to change bank size and number of banks per row (forewarning: I haven’t tested this variant extensively. It seems to work pretty well in my limited testing.)

GPU Shared Memory Swizzle Visualizer

Visualize how Swizzle<BBits, MBase, SShift> remaps addresses to avoid bank conflicts


Swizzle Formula

// For swz unit at logical position (row, col):

mask = (1 << BBits) - 1
col_in_tile = col & mask // col mod 2^BBits
row_mod = row & mask // row mod 2^BBits
swizzled_slot = col_in_tile ^ row_mod

// The swizzled slot determines which bank group is accessed.
// Looking down any column within a tile: all row_mod values are different,
// so all swizzled_slot values are different -> no bank conflicts!

How to Read This Visualization

Swizzle Tile: The repeating block of the swizzle pattern (2^BBits x 2^BBits swizzle units). The swizzle pattern is defined within a tile.

Grid: The full shared memory layout, composed of one or more swizzle tiles. Tile boundaries are marked with thick black lines.

Colors: Each color represents a logical column. In the swizzled view, colors show which logical column's data is stored at each physical slot.

WITHOUT Swizzle: Looking down any column, all cells have the same color -> bank conflict when reading column-wise.

WITH Swizzle: Looking down any column within a tile, every cell has a different color -> each logical column is spread across different physical slots, so column-wise reads are conflict-free.

To verify: Use "Highlight Slot" to see which logical columns are accessed when reading a physical slot across rows.

Conclusion

I hope this post helped you decode the canonical swizzle notation in CuTe. Please feel free to reach out if you spot any inaccuracies.


tags: GPU - SMEM - Swizzling