While perusing the Intel Software Developer’s Manuals, I finally found the answer to a question that had been nagging me for a few months: why does the vectorized move instruction MOVDQA (in all its variants) still exist?
Here’s a little background on the situation. The MOVDQA and MOVDQU instructions were introduced in SSE2 for moving SIMD vectors of integer data between XMM registers, and between registers and memory. Both instructions have the same effect, except that when a memory operand is involved, MOVDQA requires the access to be 16-byte aligned (and faults otherwise), while MOVDQU tolerates unaligned addresses. At the time, the rationale for having two instructions was clear: unaligned accesses were significantly slower than aligned ones, so it made sense to handle them as distinct cases.
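To make the distinction concrete, here is a minimal sketch of my own (not taken from any Intel material) using the C intrinsics that map to the two instructions: the aligned load corresponds to MOVDQA and faults on a misaligned address, while the unaligned load corresponds to MOVDQU and accepts any address.

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdlib.h>

    int main(void)
    {
        /* 32 bytes, 16-byte aligned. */
        unsigned char *buf = aligned_alloc(16, 32);

        /* _mm_load_si128 corresponds to MOVDQA: the address must be
         * 16-byte aligned, otherwise the CPU raises a fault. */
        __m128i a = _mm_load_si128((const __m128i *)buf);

        /* _mm_loadu_si128 corresponds to MOVDQU: any address is fine,
         * e.g. one byte into the buffer. */
        __m128i u = _mm_loadu_si128((const __m128i *)(buf + 1));

        /* Stores follow the same aligned/unaligned split. */
        _mm_store_si128((__m128i *)buf, _mm_add_epi8(a, u));

        free(buf);
        return 0;
    }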
On recent Intel CPUs, MOVDQU has the same performance as MOVDQA when the accesses happen to be aligned, so much so that compilers began defaulting to MOVDQU even when the memory accesses are guaranteed to be aligned. In effect, MOVDQU caught up to the point where MOVDQA no longer appeared to be necessary. However, the SIMD extensions that followed this change (such as AVX) continued to introduce new forms of MOVDQA, as if Intel had never noticed that it wasn’t being used.
For a long time, I thought that MOVDQA did not need to exist, and that Intel had simply made a mistake in keeping it around (I figured it might have been retained for backward compatibility of some sort). However, a different part of the Software Developer’s Manuals (specifically, volume 3, the System Programming Guide) finally revealed an actual reason for MOVDQA to exist, and a way in which its behavior genuinely differs from that of MOVDQU. It’s a use-case that is not often considered, particularly when vectorization is involved: atomics and cache coherency.
Here is an excerpt from volume 3A, section 8.1.1 (“Guaranteed Atomic Operations”):
Processors that enumerate support for Intel AVX (…) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:
- MOVAPD, MOVAPS, and MOVDQA.
- VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
- VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)
It turns out that MOVDQA and friends have a genuine use that MOVDQU cannot fulfill: performing 16-byte atomic loads and stores. This use-case is obscure enough that I had never heard of it before, and I have never seen a compiler produce it, but I suppose it could be useful for hand-written synchronization code in assembly.
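For illustration, here is a rough sketch of my own of what such hand-written code might look like, using GCC/Clang inline assembly; the helper names are hypothetical. It assumes the processor reports AVX support (so the guarantee quoted above applies) and that the target location is 16-byte aligned, and it only provides atomicity of the individual 16-byte access, not any additional memory ordering.

    #include <emmintrin.h>

    /* Hypothetical 16-byte atomic load/store helpers relying on the AVX
     * guarantee quoted above: MOVDQA with a 16-byte-aligned memory
     * operand is carried out as a single atomic access. */
    static inline __m128i atomic_load_16(const __m128i *addr)
    {
        __m128i value;
        __asm__ volatile("movdqa %1, %0"
                         : "=x"(value)
                         : "m"(*addr)
                         : "memory");   /* also acts as a compiler barrier */
        return value;
    }

    static inline void atomic_store_16(__m128i *addr, __m128i value)
    {
        __asm__ volatile("movdqa %1, %0"
                         : "=m"(*addr)
                         : "x"(value)
                         : "memory");   /* also acts as a compiler barrier */
    }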
So, lesson learnt: Intel isn’t clueless about what they’re up to. Still, I’d love to see a genuine instance of MOVDQA being used for atomics. Perhaps it wouldn’t be too hard to scan through all the executables on my computer…