Technology Apr 25, 2026 · 10 min read

Your Struct is Wasting Memory and You Don't Know It

DEV Community · by Dharrsan Amarnath

We write structs by listing fields in whatever order feels readable. Name, then age, then score. It compiles. It runs. The compiler silently bloats it, misaligns it, or both, and you ship it without ever checking.

Here are three structs holding the exact same six fields:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct Good {
    double balance;
    uint64_t transaction_id;
    int32_t account_type;
    int16_t region_code;
    char status;
    char currency[4];
};

struct Bad {
    char status;
    double balance;
    int16_t region_code;
    uint64_t transaction_id;
    char currency[4];
    int32_t account_type;
};

struct __attribute__((packed)) PackedBad {
    char status;
    double balance;
    int16_t region_code;
    uint64_t transaction_id;
    char currency[4];
    int32_t account_type;
};

int main() {
    printf("Good:      %zu bytes\n", sizeof(struct Good));
    printf("Bad:       %zu bytes\n", sizeof(struct Bad));
    printf("PackedBad: %zu bytes\n", sizeof(struct PackedBad));
    return 0;
}
On a typical x86-64 build, this prints:

Good:      32 bytes
Bad:       40 bytes
PackedBad: 27 bytes

Same fields. 27, 32, and 40 bytes. The difference is not the data. It is the order and whether you let the compiler do its job.

What Happens When You Read One Byte

Before touching any struct, you need to understand how the CPU actually talks to RAM. There are three buses connecting them.

The address bus carries the memory address the CPU wants to read. It is 48 to 52 physical wires on a modern system. The CPU puts a number on these wires and RAM listens.

The data bus carries the actual bytes back. It is 64 bits wide, so 8 bytes travel in parallel per transfer. But the CPU does not stop at 8 bytes. It keeps bursting transfers across the data bus until it has filled a full cache line.

A cache line is 64 bytes. That is the only unit of communication between RAM and your L1/L2 cache. The CPU never fetches 1 byte. It never fetches 8 bytes. It always fetches 64 bytes. When you read a single char, the CPU puts that char's address on the address bus, pulls the entire 64-byte block containing it across the data bus, stores it in cache, and then gives you your one byte out of it.

Every cache line starts at an address that is a multiple of 64. Cache line 0 covers 0x0000 to 0x003F (0 to 63). Cache line 1 covers 0x0040 to 0x007F (64 to 127). Cache line 2 covers 0x0080 to 0x00BF (128 to 191). The boundaries are fixed and always at multiples of 64.

This is the rule everything else in this post follows.

Natural Alignment and Why It Matters

Every data type has an alignment requirement equal to its own size. A double (8 bytes) must start at an address divisible by 8. A uint32_t (4 bytes) must start at an address divisible by 4. A char (1 byte) can go anywhere.

When a field sits at a naturally aligned address the CPU reads it in one bus transaction. It fits cleanly inside one cache line fetch.

When a field is misaligned it can straddle a cache line boundary. Say a double starts at 0x003C (60). It is 8 bytes, so it occupies 0x003C to 0x0043 (60 to 67). Cache line 0 ends at 0x003F (63). Cache line 1 starts at 0x0040 (64). Your double is split across both. The CPU issues an address request for cache line 0, waits for the data bus to deliver, then issues a second address request for cache line 1, waits again, and stitches both halves together in hardware. Two full round trips to memory for one field read.

Now think about why address mod data_size == 0 prevents this. Cache line boundaries sit at multiples of 64. A naturally aligned double sits at a multiple of 8. The worst case is a double at 0x0038 (56), occupying bytes 56 to 63. It ends exactly at the cache line boundary, never crossing it. This works because 64 is itself a multiple of 8. A field aligned to its own size mathematically cannot straddle a boundary that is also a multiple of that same size. So address mod data_size == 0 is not a style convention. It is the condition that guarantees your field lives inside exactly one cache line, fetched in exactly one bus transaction, with no possibility of being split.

The compiler inserts padding between fields to maintain this guarantee. Bad field ordering forces it to insert a lot of padding. And packed removes all of it.

Good : 32 bytes, nothing wasted

offset        field            size
0x0000 (0)    balance          8 bytes
0x0008 (8)    transaction_id   8 bytes
0x0010 (16)   account_type     4 bytes
0x0014 (20)   region_code      2 bytes
0x0016 (22)   status           1 byte
0x0017 (23)   currency         4 bytes
0x001B (27)   tail padding     5 bytes

balance at 0x0000 (0). 0 mod 8 = 0. Aligned.
transaction_id at 0x0008 (8). 8 mod 8 = 0. Aligned.
account_type at 0x0010 (16). 16 mod 4 = 0. Aligned.
region_code at 0x0014 (20). 20 mod 2 = 0. Aligned.
status at 0x0016 (22). Char, goes anywhere.
currency at 0x0017 (23). Char array, goes anywhere.

Every field starts exactly where the previous one ended. Zero internal padding.

The 5 bytes at the end are tail padding. In an array the second element must start at an address divisible by 8, the largest field alignment. Without tail padding the second element begins at 0x001B (27) and its balance field lands there too. 27 mod 8 = 3. Misaligned. So the compiler rounds 27 up to 32. The second element starts at 0x0020 (32). 32 mod 8 = 0. Clean.

Zero bytes wasted internally. The tail padding is structural and unavoidable.

Bad : 40 bytes, 13 bytes of dead space

offset        field            size
0x0000 (0)    status           1 byte
0x0001 (1)    padding          7 bytes
0x0008 (8)    balance          8 bytes
0x0010 (16)   region_code      2 bytes
0x0012 (18)   padding          6 bytes
0x0018 (24)   transaction_id   8 bytes
0x0020 (32)   currency         4 bytes
0x0024 (36)   account_type     4 bytes

status at 0x0000 (0), one byte. The next field is balance, a double that needs a multiple of 8. After byte 1, the nearest multiple of 8 is 0x0008 (8). The compiler inserts 7 bytes of padding between them that store nothing and do nothing.

Then region_code lands at 0x0010 (16), two bytes, ending at 0x0011 (17). The field after it is transaction_id, which needs a multiple of 8. The nearest is 0x0018 (24). Six more bytes gone.

13 bytes wasted purely from putting char status first. In an array of a million of these structs that is 13MB of RAM holding nothing. The struct is 25% larger than it needs to be, meaning fewer elements fit per cache line and more trips to RAM on every access pattern.

PackedBad : 27 bytes, zero padding, four misaligned fields

offset        field            size
0x0000 (0)    status           1 byte
0x0001 (1)    balance          8 bytes
0x0009 (9)    region_code      2 bytes
0x000B (11)   transaction_id   8 bytes
0x0013 (19)   currency         4 bytes
0x0017 (23)   account_type     4 bytes

__attribute__((packed)) removes all padding. Fields sit back to back. 27 bytes. But look at where each field actually lands:

balance at 0x0001 (1). 1 mod 8 = 1. Not 0. Misaligned.
region_code at 0x0009 (9). 9 mod 2 = 1. Not 0. Misaligned.
transaction_id at 0x000B (11). 11 mod 8 = 3. Not 0. Misaligned.
account_type at 0x0017 (23). 23 mod 4 = 3. Not 0. Misaligned.

Four fields, zero aligned. In an array, whether a given element straddles a cache line boundary depends on its index. You can check with:

(index x struct_size) mod 64 + struct_size > 64

For element 2 of PackedBad: (2 x 27) mod 64 + 27 = 54 + 27 = 81. Since 81 > 64, element 2 straddles. Its 27 bytes run from 0x0036 (54) to 0x0050 (80), crossing the cache line boundary at 0x0040 (64). The CPU issues an address request for cache line 0 (0x0000 to 0x003F), waits for the data bus, issues a request for cache line 1 (0x0040 to 0x007F), waits again, and stitches both halves. Two full round trips for one struct read. You saved 13 bytes on paper and doubled your memory traffic in practice.

Tearing

The straddle is slow. In single-threaded code it is just slower. In multithreaded code it is also wrong.

The CPU guarantees a memory access is atomic, meaning indivisible and instantaneous from every other thread's perspective, only when:

address mod data_size == 0

That condition guarantees the field sits inside one cache line and the CPU fetches it in one bus transaction. One transaction means no window for another thread to slip in.

When balance sits at 0x0001 (1) in PackedBad, 1 mod 8 = 1. The condition fails. The CPU fetches the first portion of balance in one bus transaction, then the second portion in a separate bus transaction. There is a real time gap between them.

If another thread writes to that same balance field inside that gap, the reading thread gets the first half from before the write and the second half from after it. A value assembled from two different points in time. A number that was never logically written anywhere in your program.

No segfault. No assertion. No log line. The field silently reads as garbage. In a monitoring system this corrupts your metrics. In a financial system this is a balance that never existed reaching your business logic.

Good and Bad are both padded by the compiler so every field satisfies address mod data_size == 0. Tearing cannot happen. PackedBad has four fields that fail this condition in every element.

All Three Side by Side

Struct      Size      Internal padding   Misaligned fields
Good        32 bytes  0 bytes            0
Bad         40 bytes  13 bytes           0
PackedBad   27 bytes  0 bytes            4

Bad pays in memory. PackedBad pays in correctness. Good pays nothing.

The Fix Is Just Field Order

Order fields from largest alignment requirement to smallest:

struct Good {
    double balance;          // 8 bytes
    uint64_t transaction_id; // 8 bytes
    int32_t account_type;    // 4 bytes
    int16_t region_code;     // 2 bytes
    char status;             // 1 byte
    char currency[4];        // 1 byte alignment
};

The compiler has nothing to pad because each field naturally follows the previous one without any gap. No attributes, no pragmas. Just ordering.

Verify with sizeof. Inspect individual field positions with the standard offsetof(struct Foo, field) macro from <stddef.h> (GCC also provides __builtin_offsetof) when something looks off.

When packed Is Actually Correct

__attribute__((packed)) has one valid use: serializing data onto a network socket or disk, where you control both ends and the CPU never does arithmetic directly on the packed bytes.

You pack the struct, write the raw bytes to the wire, and on the receiving end you copy into a properly aligned struct before reading any field. The packed struct is a transport container, not a data structure your code operates on. The moment you read fields out of a packed struct in a running program you pay the straddle penalty on every access and you are one concurrent write away from tearing.

False Sharing

You fix your field order. You remove packed. Everything is aligned. You go multithreaded and all cores pin at 100% while throughput collapses.

struct Good is 32 bytes. Two of them fit inside one 64-byte cache line. Say your array starts at 0x1000 (4096). arr[0] lives at 0x1000 to 0x101F (4096 to 4127). arr[1] lives at 0x1020 to 0x103F (4128 to 4159). Both sit inside the single cache line spanning 0x1000 to 0x103F (4096 to 4159).

Thread 1 writes to arr[0]. Thread 2 writes to arr[1]. Different structs. No shared fields. No mutex involved. But both live in the same 64-byte cache line.

Every time Thread 1 writes to arr[0], the CPU's MESI cache coherency protocol broadcasts an invalidation across the ring bus to every other core: the cache line at 0x1000 was modified, your copies are stale, drop them. Thread 2 has its L1 cache entry for arr[1] ripped away even though nobody touched arr[1]. It takes an L1 miss, goes out to L3, fetches the 64-byte line again, modifies arr[1], and now Thread 1 gets invalidated. Back and forth. The cores spend the vast majority of their time passing one cache line across the ring bus and almost no time doing actual work.

The fix is to give each struct its own cache line:

struct __attribute__((aligned(64))) NodeMetrics {
    double balance;
    uint64_t transaction_id;
    int32_t account_type;
    int16_t region_code;
    char status;
    char currency[4];
};

Now arr[0] owns 0x1000 to 0x103F (4096 to 4159) entirely. arr[1] owns 0x1040 to 0x107F (4160 to 4223) entirely. Thread 1 and Thread 2 never touch the same cache line and the coherency protocol never fires between them. You waste 32 bytes per struct. You get linear scaling across every core.

Takeaway

Order fields largest to smallest. Verify with sizeof. Check offsets with __builtin_offsetof when something feels off. Use packed only for wire or disk formats where you control both ends. Pad to 64 bytes with aligned(64) only when multiple threads write to adjacent elements of an array.

Source

This article was originally published by DEV Community and written by Dharrsan Amarnath.
