The reason for `MOVDQA` | The Depressed Milkman

While perusing the Intel Software Developer’s Manuals, I finally found the answer to a question that had been nagging me for a few months: why does the vectorized move instruction MOVDQA (in all its variants) still exist?

Here’s a little background on the situation. The MOVDQA and MOVDQU instructions were introduced in SSE2 for moving SIMD vectors of integer type between XMM registers and between registers and memory. Both instructions have the same effect, except that when used with a memory operand, MOVDQA requires the address to be 16-byte aligned (it raises a general-protection fault otherwise), while MOVDQU accepts any alignment. At the time, the rationale for having two instructions was clear: unaligned accesses were significantly slower than aligned ones, so the two cases were handled by distinct instructions.
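To make the distinction concrete, here is a minimal sketch in C using the SSE2 intrinsics from `<emmintrin.h>`. The function names `sum4_aligned` and `sum4_unaligned` (and the helper `hsum4`) are my own, purely for illustration. `_mm_load_si128` usually compiles to MOVDQA and `_mm_loadu_si128` to MOVDQU, but that is an assumption about typical compiler output: a compiler may fold the load into another instruction or, with AVX enabled, emit the VEX-encoded forms instead.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: add up the four 32-bit lanes of a vector. */
static int32_t hsum4(__m128i v) {
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Aligned load (typically MOVDQA). The address must be 16-byte
   aligned; a misaligned address would fault at run time. */
int32_t sum4_aligned(const int32_t *p) {
    assert(((uintptr_t)p & 15) == 0);
    return hsum4(_mm_load_si128((const __m128i *)p));
}

/* Unaligned load (typically MOVDQU). Any address is accepted. */
int32_t sum4_unaligned(const int32_t *p) {
    return hsum4(_mm_loadu_si128((const __m128i *)p));
}
```

Calling `sum4_unaligned` on a pointer that is offset by one element from a 16-byte boundary works fine, whereas passing the same pointer to `sum4_aligned` trips the assertion (or, with assertions disabled, would be expected to fault on the load itself).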

On recent Intel CPUs, MOVDQU performs the same as MOVDQA when the access happens to be aligned, to the point where compilers began defaulting to MOVDQU even when the memory accesses are guaranteed to be aligned. MOVDQU thus became capable enough that MOVDQA no longer appeared to be necessary. However, SIMD extensions following this change (such as AVX) continued to introduce new forms of MOVDQA, as if Intel had never noticed that it wasn’t being used.

For a long time, I thought that MOVDQA did not need to exist, and that Intel had simply made a mistake in keeping it going (I figured that it may have been needed for backward compatibility of some sort). However, a different part of the Software Developer’s Manuals (specifically, volume 3, the System Programming Guide) finally revealed an actual reason why MOVDQA should exist, and how its behavior differs from that of MOVDQU. It’s a use-case that is not often considered, particularly when vectorization is involved: atomics and cache coherency.

Here is an excerpt from volume 3A, section 8.1.1 (“Guaranteed Atomic Operations”):

Processors that enumerate support for Intel AVX (…) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:
MOVAPD, MOVAPS, and MOVDQA.
VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).
(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)

It turns out that MOVDQA and friends have a genuine use that MOVDQU cannot fulfill: performing 16-byte atomic loads and stores. This is an obscure enough use-case that I had never heard of it before and have never seen a compiler produce it, but I suppose that it could be useful for hand-written synchronization assembly code.
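As a rough idea of what such code might look like without dropping to assembly, here is a sketch in C using the aligned SSE2 load and store intrinsics. The type `pair128` and the functions `store128` and `load128` are hypothetical names of my own. The sketch rests on several assumptions: that the compiler emits a single MOVDQA (or VEX.128 VMOVDQA) for each intrinsic, that the processor enumerates AVX so the guarantee quoted above applies, and that the address is 16-byte aligned. The C language itself promises no atomicity here, so real synchronization code would also need to consider compiler reordering and memory ordering.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical 16-byte payload, e.g. a pointer plus a version tag. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} pair128;

/* Write both halves with one aligned 16-byte store (typically MOVDQA).
   Assumption: atomic only on processors that enumerate AVX. */
void store128(void *dst, pair128 v) {
    assert(((uintptr_t)dst & 15) == 0);
    __m128i x = _mm_set_epi64x((long long)v.hi, (long long)v.lo);
    _mm_store_si128((__m128i *)dst, x);
}

/* Read both halves with one aligned 16-byte load (typically MOVDQA). */
pair128 load128(const void *src) {
    assert(((uintptr_t)src & 15) == 0);
    __m128i x = _mm_load_si128((const __m128i *)src);
    pair128 out;
    /* Copying into a local is not the shared access, so an
       unaligned store is fine here. */
    _mm_storeu_si128((__m128i *)&out, x);
    return out;
}
```

The point of the design is that a reader using `load128` should never observe `lo` from one write paired with `hi` from another, which two separate 8-byte accesses could not promise.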

As such, lesson learnt: Intel isn’t clueless about what they’re up to. However, I’d love to see a genuine instance of MOVDQA being used for atomics. Perhaps it wouldn’t be too hard to scan through all the executables on my computer…