This is Part 1 of the Understanding Memory: Essential Commodity of the AI Era series.

Introduction

Hello, I’m Seungbin Shin, an RTL Designer at HyperAccel.

Recently, “memory and semiconductor stocks” have been making headlines in the stock market. Lately, a new term — “HBF” — has started popping up, and people are asking “What even is that?” and “So should I buy it?”

So what exactly is High Bandwidth Flash (HBF)?

To understand HBF, you first need to be able to answer one question:

“Why do computers have so many different types of memory?”

Inside our computers, there are memories with confusing names — SRAM, DRAM, HBM, SSD — organized into multiple layers. They all “store data,” so why not just use one type?

As you work through this question, you’ll naturally see why HBF emerged and where it fits in.

The content of this post is based on my personal study and experience. If you find any errors, please let me know in the comments.


Why So Many Types of Memory — The Speed/Capacity/Cost Trilemma

The short answer: fast memory is expensive and small, while cheap memory is slow and large.

This isn’t simply because technology hasn’t advanced enough. The physical structure of the memory cell itself determines this trade-off. Build a cell with many transistors and it’s fast but takes up a lot of area; simplify the cell and it’s slower but can store far more data in the same space.

Computers work around this physical limitation by stacking different types of memory in a hierarchy. Faster, smaller memory sits closer to the processor, while slower, larger memory sits further away.

Now let’s look at why each layer of memory has the characteristics it does, starting with the cell structure.

Memory hierarchy pyramid — faster, more expensive, and smaller toward the top


SRAM — The Fastest, and the Most Expensive

6T SRAM cell schematic — cross-coupled inverters and pass gates

Cell Structure: 6 Transistors per Bit

Static Random Access Memory (SRAM) uses 6 transistors to store a single bit. This is called a 6T SRAM cell.

Four of the six transistors form two inverters whose outputs are cross-coupled to each other. This structure creates a latch that maintains its state as long as power is supplied. The remaining two transistors serve as pass gates for reading and writing to the cell.

Why It’s Fast

The reason SRAM is fast is straightforward.

Because the latch maintains its state on its own, no refresh is needed. This is where the name “Static” comes from. A read operation simply transfers the voltage state stored in the latch to the bitline, resulting in access latency of under 1 ns.

Why It’s Expensive

The problem is size. With 6 transistors needed per bit, each cell occupies a large area. This means fewer bits can be produced on the same silicon wafer, which translates to a very high cost per bit.

This makes using SRAM in large capacities physically and economically impractical. It’s deployed only in amounts of a few MB to tens of MB as cache (L1/L2/L3) inside CPUs and GPUs, or as on-chip memory in AI accelerators.

Fast but expensive, and small in capacity. That’s why it sits right next to the processor, holding only the most frequently used data.


DRAM — Cheap and Large, but Constantly Refreshing

1T1C DRAM cell schematic — one transistor and one capacitor

Cell Structure: 1 Transistor + 1 Capacitor

The cell structure of Dynamic Random Access Memory (DRAM) is dramatically different from SRAM. It stores a single bit using just two components: one transistor and one capacitor. This is called the 1T1C structure.

If the capacitor is charged, it’s a 1; if discharged, it’s a 0. The transistor acts as a switch to access the capacitor.

Why It’s Cheap

With only two components per cell, the cell area is 1/4 to 1/6 that of SRAM. Far more bits can be produced on the same wafer, significantly lowering the cost per bit. Thanks to this density advantage, DRAM is used as main memory in capacities of several GB to tens of GB.

Why It’s Slower (Relatively)

Over time, the charge stored in the capacitor gradually leaks away. Before the data is lost, the charge must be periodically replenished through a refresh operation. The name “Dynamic” comes from exactly this characteristic.

The read operation is also more complex than SRAM. The capacitor’s charge must be shared with the bitline (charge sharing) and then amplified by a sense amplifier. Since the read process depletes the charge, the data must be rewritten (restored) immediately after reading.

These additional steps result in access latency of 10–100 ns, more than 10x slower than SRAM.

DDR: The Interface That Speeds Up DRAM

While DRAM cells themselves are slow, improving the interface standard for exchanging data with the outside world can increase effective transfer speeds. That standard is Double Data Rate (DDR).

Before DDR, SDRAM transferred data only on the rising edge of the clock. DDR, as its name suggests, transfers data on both the rising and falling edges, doubling the throughput at the same clock speed.

Subsequent generations increased transfer speeds by deepening the prefetch depth. Prefetch is a technique that retrieves multiple beats from a single internal memory access, making the external transfer rate faster than the internal clock. DDR4 prefetches 8 bits at a time, while DDR5 prefetches 16 bits. As a result, DDR5 delivers a maximum bandwidth of approximately 50 GB/s per module.

But there is a fundamental ceiling to this improvement.


HBM — Stacking DRAM Up

Breaking the Pin Limit Through Packaging

The data bus width of DDR is 64 bits as of DDR5. Because there’s a physical limit to the number of pins connecting the processor to memory, no matter how much you increase the clock, there’s a ceiling on how much data can be transferred at once.

For workloads like AI that need to read massive amounts of data simultaneously, this bottleneck is critical. High Bandwidth Memory (HBM) solves this problem not by changing the cell structure, but through packaging.

The core idea is simple: stack DRAM dies vertically and place them right next to the processor on the same substrate.

HBM stack cross-section — DRAM dies stacked with TSV and placed alongside GPU on interposer

TSV: Vertical Wiring Through Silicon

Conventional DRAM modules connect to the processor through traces on a PCB. HBM is different. Tiny holes are drilled through the silicon die itself and filled with copper wiring, vertically connecting dies above and below. This is the Through-Silicon Via (TSV) — a through-silicon electrode.

Thanks to TSV, multiple DRAM dies can be stacked into a single unit, and because each die has independent data paths, the bus width can be dramatically widened. As of HBM4, the data bus width is 2,048 bits32x that of DDR5.

Interposer: Placing It Right Next to the Processor

HBM stacks are placed alongside a GPU or AI accelerator on a silicon interposer, an intermediate substrate. The fine wiring inside the interposer connects HBM to the processor at a distance of just a few millimeters, a dramatic reduction compared to conventional DDR modules traveling tens of centimeters through a PCB.

Shorter wiring = lower latency + lower power consumption + higher signal integrity. This combination is what makes HBM’s bandwidth possible.

HBM by the Numbers

MetricDDR5HBM3EHBM4
Bus Width64-bit1024-bit2048-bit
Bandwidth per Stack~50 GB/s~1.2 TB/s~2 TB/s+
Capacity per Stack-24–36 GB36–48 GB
Distance to Processor~tens of cm (PCB)~mm (interposer)~mm (interposer)

In April 2025, Joint Electron Device Engineering Council (JEDEC), the semiconductor standards body, officially published the HBM4 standard, and mass production began in early 2026. SK hynix unveiled the world’s first 16-layer, 48 GB HBM4 stack, and NVIDIA’s next-generation GPU, Vera Rubin, targets a total of 384 GB and approximately 22 TB/s of memory bandwidth with eight HBM4 stacks.

HBM’s Limitation: Fast, but Still Not Enough Capacity

While bandwidth and capacity have both grown significantly through HBM4, the fundamental constraints remain.

The cell itself is still DRAM (1T1C), so the limits on bit density haven’t changed. Combined with the challenges of TSV processing, interposer fabrication, yield management during die stacking, and heat dissipation, the cost per unit of capacity is very high.

Even with a full complement of eight HBM4 stacks, the maximum is 384 GB. Yet the parameter sizes of the latest Large Language Models (LLMs) range from hundreds of GB to several TB. The rate at which models are growing outpaces HBM’s ability to scale capacity.


NAND Flash — Cheap and Large, but Far Too Slow

Cell Structure: Trapping Charge to Store Bits

NAND Flash stores data using a fundamentally different principle from SRAM or DRAM.

At the heart of a NAND cell is a conductor surrounded by an insulating layer called a floating gate. When a high voltage is applied, electrons are injected into the floating gate through quantum tunneling, and the insulating layer prevents the electrons from escaping, preserving data even when power is off. This is why NAND is a non-volatile memory.

Why It’s Cheap and Large: Extreme Density

NAND’s density advantage comes from two factors.

First, multi-level cells. By finely distinguishing the amount of charge stored in a single cell, multiple bits can be packed in. Triple-Level Cell (TLC) stores 3 bits per cell, and Quad-Level Cell (QLC) stores 4 bits per cell.

The difference becomes even more dramatic when comparing areal bit density. SRAM achieves roughly 0.04 Gb/mm², DRAM about 0.2–0.3 Gb/mm², while the latest 3D NAND reaches approximately 5–15 Gb/mm². That means over 100x more data than SRAM and over 20x more than DRAM in the same silicon area.

Second, 3D stacking. Modern NAND stacks cells vertically rather than laying them flat, reaching hundreds of layers. Samsung’s V-NAND and SanDisk’s BiCS NAND are representative examples, with the latest products exceeding 200 layers.

Together, these two factors allow NAND to deliver TB-scale capacity at a very low cost per bit.

Why It’s Slow: Microseconds Just to Read

NAND’s read operation involves measuring the threshold voltage of the transistor, which varies depending on the amount of charge in the floating gate. In multi-level cells, multiple voltage levels must be precisely distinguished, adding time.

As a result, NAND’s random read latency is approximately 50–100 µs. Compared to DRAM’s tens of nanoseconds and SRAM’s sub-nanosecond latency, that’s over 1,000x slower.

Writes are even slower, and overwriting data requires erasing an entire block first. On top of that, there’s a physical limit on the number of erase/write cycles (write endurance).

Current Role in AI Servers

In today’s AI servers, NAND (SSD) serves as model storage and checkpoint storage. It holds trained model weights and loads them into HBM when needed — essentially a “warehouse.”

But it’s far too slow to directly feed data during computation. The GPU ends up waiting for data longer than it takes to actually compute.

Ample capacity and low cost, but critically insufficient speed. NAND has been trapped within this limitation.


The Gap — Between HBM and SSD

Let’s put the memory hierarchy we’ve covered so far into a single table.

SRAMDRAM (DDR5)HBM4???NAND (SSD)
Cell Structure6T1T1C1T1C (TSV stacked)Floating gate
Access Latency~1 ns~10–100 ns~10–100 ns~50–100 µs
Bandwidth (stack/module)~TB/s (on-chip)~50 GB/s~2 TB/s~7 GB/s
CapacityMB–tens of MBGB–tens of GB36–48 GBTB-scale
Cost per BitVery highMediumHighVery low
VolatilityVolatileVolatileVolatileNon-volatile

One thing immediately stands out.

HBM4 maxes out at 48 GB per stack, or 384 GB with eight stacks. SSD offers TB-scale capacity at low cost, but bandwidth is only around 7 GB/s.

HBM has great bandwidth but limited capacity; SSD has ample capacity but is far too slow.

Granted, HBM’s capacity is relatively large at this point, but it falls short of what people expect.

Between the two, there is no memory that provides “TB-scale capacity at TB/s-level bandwidth.”

With LLM parameters growing from hundreds of GB to several TB, what if we could fill this gap?


HBF — Dressing NAND in HBM’s Clothes

HBF architecture — NAND dies stacked with TSV, placed alongside accelerator on interposer

Core Idea: Familiar Cells, New Packaging

High Bandwidth Flash (HBF) is a new memory tier that keeps the NAND cell as-is while applying packaging technology proven in HBM.

Earlier, we saw how HBM revolutionized bandwidth without changing the DRAM cell — purely through TSV stacking and interposer placement. HBF applies exactly the same strategy to NAND.

NAND dies are vertically stacked with TSV and placed on an interposer right next to a GPU or AI accelerator. Just as HBM broke through DDR’s pin-count limitation with packaging, HBF breaks through SSD’s bandwidth limitation with packaging.

CBA Architecture: Optimizing NAND for High Bandwidth

The technical core of HBF is the CMOS Bonding Array (CBA) architecture developed by SanDisk.

Conventional NAND accesses a single large memory array sequentially. CBA divides this into thousands of independent storage sub-arrays. Each sub-array has its own read/write channel and operates in parallel simultaneously, extracting bandwidth levels that a single NAND die could never achieve on its own.

This is combined with SanDisk’s BiCS NAND (3D vertical stacking NAND) technology. BiCS NAND dies stacked hundreds of layers high are connected via TSV, and CBA drives the sub-arrays across these dies simultaneously.

HBF by the Numbers

HBM4HBF Gen 1HBF Gen 2HBF Gen 3
Read Bandwidth~2 TB/s1.6 TB/s>2 TB/s>3.2 TB/s
Stack Capacity36–48 GB512 GB1 TB1.5 TB
Access Latency~10–100 ns~10 µs--
Cell Structure1T1C DRAMFloating gate NANDFloating gate NANDFloating gate NAND

There’s a number worth noting here.

HBF Gen 1’s read bandwidth of 1.6 TB/s approaches that of HBM4. Yet its capacity of 512 GB is over 10x that of an HBM4 stack (48 GB). Providing 8–16x the capacity of HBM at similar bandwidth and similar cost is HBF’s core value proposition.

Compatibility with HBM Controllers

Another strength of HBF is that its physical footprint and electrical interface (PHY level) are compatible with HBM.

Pinout, package size, power profile, and stack height are nearly identical to HBM4, so the existing HBM interposer and packaging infrastructure can be reused as-is.

That said, it’s important to be clear that HBF is not a drop-in replacement for HBM. Because NAND differs from DRAM in access granularity (page vs. word), erase cycles, wear-leveling, and other characteristics, minimal protocol changes are required on the host controller side. The open standard being pushed by SanDisk and SK hynix is also aimed at “the same electrical interface with minimal protocol changes.”

In other words, there is no need to design a new memory controller from scratch, but existing HBM controllers cannot be used as-is either. Even so, the adoption barrier for accelerator designers remains substantially lower than designing a memory controller from zero.

HBF’s Limitations: Not a Silver Bullet

HBF is a strong candidate for filling the gap, but its limitations are clear.

Latency: At approximately 10 µs, it’s roughly 100x slower than HBM’s tens to hundreds of nanoseconds. Because the NAND cell’s read mechanism is inherently slower than DRAM, packaging alone cannot fully close this gap.

Write speed and endurance: NAND’s characteristically slow writes and limited erase/write cycles remain. This makes HBF unsuitable for AI training, which requires frequent writes.

Conversely, for AI inference workloads — where pre-loaded model weights are read repeatedly — these limitations are much less of a concern. This is exactly why HBF is attracting attention as “the memory for the AI inference era.”


Conclusion & Next Episode Preview

In this post, we explored why memory is divided into so many types and how each memory’s cell structure determines the speed/capacity/cost trilemma.

Here’s the summary:

  • SRAM: 6T cell — the fastest, but the most expensive and smallest
  • DRAM: 1T1C cell — density advantage makes it the main memory, but bandwidth is limited
  • HBM: DRAM cells with TSV + interposer packaging — revolutionized bandwidth, but capacity scaling is difficult
  • NAND: Extreme density and low cost, but too slow to directly participate in computation
  • HBF: HBM packaging applied to NAND cells — a new tier filling the gap between HBM and SSD

But being able to fill the gap alone doesn’t complete the story.

HBF has a weakness: its latency of around 10 µs is roughly 100x longer than HBM’s. So how can this weakness be overcome, and how can HBF actually be used in real LLM workloads?

In the next installment, we’ll explore how HBF can be leveraged in LLMs, focusing on SK hynix’s proposed approach to overcoming HBF’s long latency.

Part 2 will be written by Jaewon Lim from HyperAccel. Stay tuned!


P.S.

I’m designing RTL at HyperAccel for the launch of an LLM acceleration ASIC chip. Many of us are putting our heads together to find more efficient ways to maximize performance from the limited resource of memory bandwidth. I hope this series helps us understand the flow of memory technology together and watch the changes ahead.

HyperAccel is a company that covers HW, SW, and AI, bringing together outstanding talent across all domains. If you want to learn broadly and deeply while growing together, please apply to HyperAccel anytime!

Careers: https://hyperaccel.career.greetinghr.com/ko/guide

Reference