Why doesn't clang use memory-destination x86 instructions when I compile with optimization disabled?
I wrote this simple assembly code, ran it and looked at the memory location using GDB:
.text
.global _main
_main:
pushq %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
popq %rbp
ret
It's adding 5 to 6 directly in memory and according to GDB it worked. So this is performing math operations directly in memory instead of CPU registers.
Now writing the same thing in C and compiling it to assembly turns out like this:
... # clang output
xorl %eax, %eax
movl $0, -4(%rbp)
movl $5, -8(%rbp)
movl -8(%rbp), %ecx # load a
addl $6, %ecx # a += 6
movl %ecx, -8(%rbp) # store a
....
It's moving them to a register before adding them together.
So why don't we add directly in memory?
Is it slower? If so, then why is adding directly in memory even allowed, why didn't the assembler complain about my assembly code in the beginning?
Edit:
Here is the C code for the second assembly block, I have disabled optimization when compiling.
#include <iostream>
int main(){
int a = 5;
a+=6;
return 0;
}
c gcc assembly clang compiler-optimization
New contributor
|
show 5 more comments
I wrote this simple assembly code, ran it and looked at the memory location using GDB:
.text
.global _main
_main:
pushq %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
popq %rbp
ret
It's adding 5 to 6 directly in memory and according to GDB it worked. So this is performing math operations directly in memory instead of CPU registers.
Now writing the same thing in C and compiling it to assembly turns out like this:
... # clang output
xorl %eax, %eax
movl $0, -4(%rbp)
movl $5, -8(%rbp)
movl -8(%rbp), %ecx # load a
addl $6, %ecx # a += 6
movl %ecx, -8(%rbp) # store a
....
It's moving them to a register before adding them together.
So why don't we add directly in memory?
Is it slower? If so, then why is adding directly in memory even allowed, why didn't the assembler complain about my assembly code in the beginning?
Edit:
Here is the C code for the second assembly block, I have disabled optimization when compiling.
#include <iostream>
int main(){
int a = 5;
a+=6;
return 0;
}
c gcc assembly clang compiler-optimization
New contributor
1
Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU
– Ctx
5 hours ago
6
The code from C appears to be unoptimised so it has extra loads and stores. Compile with-O3
and see what happens.
– Michael Petch
5 hours ago
4
@Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)
– Ctx
5 hours ago
1
@Sam It is not an architecture register, but an unnamed hardware register
– Ctx
5 hours ago
2
I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.
– Michael Petch
5 hours ago
|
show 5 more comments
I wrote this simple assembly code, ran it and looked at the memory location using GDB:
.text
.global _main
_main:
pushq %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
popq %rbp
ret
It's adding 5 to 6 directly in memory and according to GDB it worked. So this is performing math operations directly in memory instead of CPU registers.
Now writing the same thing in C and compiling it to assembly turns out like this:
... # clang output
xorl %eax, %eax
movl $0, -4(%rbp)
movl $5, -8(%rbp)
movl -8(%rbp), %ecx # load a
addl $6, %ecx # a += 6
movl %ecx, -8(%rbp) # store a
....
It's moving them to a register before adding them together.
So why don't we add directly in memory?
Is it slower? If so, then why is adding directly in memory even allowed, why didn't the assembler complain about my assembly code in the beginning?
Edit:
Here is the C code for the second assembly block, I have disabled optimization when compiling.
#include <iostream>
int main(){
int a = 5;
a+=6;
return 0;
}
c gcc assembly clang compiler-optimization
New contributor
I wrote this simple assembly code, ran it and looked at the memory location using GDB:
.text
.global _main
_main:
pushq %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
popq %rbp
ret
It's adding 5 to 6 directly in memory and according to GDB it worked. So this is performing math operations directly in memory instead of CPU registers.
Now writing the same thing in C and compiling it to assembly turns out like this:
... # clang output
xorl %eax, %eax
movl $0, -4(%rbp)
movl $5, -8(%rbp)
movl -8(%rbp), %ecx # load a
addl $6, %ecx # a += 6
movl %ecx, -8(%rbp) # store a
....
It's moving them to a register before adding them together.
So why don't we add directly in memory?
Is it slower? If so, then why is adding directly in memory even allowed, why didn't the assembler complain about my assembly code in the beginning?
Edit:
Here is the C code for the second assembly block, I have disabled optimization when compiling.
#include <iostream>
int main(){
int a = 5;
a+=6;
return 0;
}
c gcc assembly clang compiler-optimization
c gcc assembly clang compiler-optimization
New contributor
New contributor
edited 25 mins ago
Peter Cordes
123k17186313
123k17186313
New contributor
asked 5 hours ago
SamSam
412
412
New contributor
New contributor
1
Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU
– Ctx
5 hours ago
6
The code from C appears to be unoptimised so it has extra loads and stores. Compile with-O3
and see what happens.
– Michael Petch
5 hours ago
4
@Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)
– Ctx
5 hours ago
1
@Sam It is not an architecture register, but an unnamed hardware register
– Ctx
5 hours ago
2
I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.
– Michael Petch
5 hours ago
|
show 5 more comments
1
Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU
– Ctx
5 hours ago
6
The code from C appears to be unoptimised so it has extra loads and stores. Compile with-O3
and see what happens.
– Michael Petch
5 hours ago
4
@Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)
– Ctx
5 hours ago
1
@Sam It is not an architecture register, but an unnamed hardware register
– Ctx
5 hours ago
2
I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.
– Michael Petch
5 hours ago
1
1
Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU
– Ctx
5 hours ago
Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU
– Ctx
5 hours ago
6
6
The code from C appears to be unoptimised so it has extra loads and stores. Compile with
-O3
and see what happens.– Michael Petch
5 hours ago
The code from C appears to be unoptimised so it has extra loads and stores. Compile with
-O3
and see what happens.– Michael Petch
5 hours ago
4
4
@Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)
– Ctx
5 hours ago
@Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)
– Ctx
5 hours ago
1
1
@Sam It is not an architecture register, but an unnamed hardware register
– Ctx
5 hours ago
@Sam It is not an architecture register, but an unnamed hardware register
– Ctx
5 hours ago
2
2
I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.
– Michael Petch
5 hours ago
I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.
– Michael Petch
5 hours ago
|
show 5 more comments
1 Answer
1
active
oldest
votes
You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.
Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp)
would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.
(Update: just noticed your compiler output included xor %eax,%eax
, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0
doesn't look for the xor-zeroing peephole optimization at -O0
, using mov $0, %eax
.)
Fun fact: gcc -O0
will actually use addl $6, -4(%rbp)
in your main
.
You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add
instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.
A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6
to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock
prefix to make it appear atomic.
There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)
But as it stands now, add $6, -4(%rbp)
decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.
Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state.
So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.
We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb
(which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.
Original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked. Maybe even Pentium.
is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?
In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)
Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter
instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop
instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)
x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.
You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter
or loop
, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop
are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32
, and even inc
/loop
as a 3-byte alternative to 4-byte test ecx,ecx
/jnz
). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.
Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.
Avoiding slowdowns requires looking at more than 1 instruction at once
Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.
e.g. this code will cause a partial-register stall on Intel P6-family CPUs:
mov ah, 1
add eax, 123
Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz
in an adc
loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs
If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.
But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:
- Assembly - How to score a CPU instruction by latency and throughput
- How many CPU cycles are needed for each assembly instruction?
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
GCC/clang -O0
's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).
But even within the block of asm for one statement, clang -O0
apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.
GCC -O0
happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax
/ret
, because a
is unused.)
main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
movl $0, %eax
popq %rbp
ret
How to see clang/LLVM using memory-destination add
I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic
for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)
void add_reg_to_mem(int *p, int b) {
*p += b;
}
# I used AT&T syntax because that's what you were using. Intel-syntax is nicer IMO
addl %esi, (%rdi)
ret
void add_imm_to_mem(int *p) {
*p += 3;
}
# gcc and clang -O3 both emit the same asm here, where there's only one good choice
addl $3, (%rdi)
ret
The gcc -O0
output is just totally braindead, e.g. reloading p
twice because it clobbers the pointer while calculating the +3
. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0
for that would probably be a lot less terrible.
# gcc8.2 -O0 output
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rax # load p
movl (%rax), %eax # load *p, clobbering p
leal 3(%rax), %edx # edx = *p + 3
movq -8(%rbp), %rax # reload p
movl %edx, (%rax) # store *p + 3
GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.
The clang -O0 output happens to be less horrible for this:
# clang -O0
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rdi # reload p
movl (%rdi), %eax # eax = *p
addl $3, %eax # eax += 3
movl %eax, (%rdi) # *p = eax
See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.
If I compiled with -m32 -mtune=pentium
, gcc -O3 would avoid memory-dst add:
The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.
# gcc8.2 -O3 -m32 -mtune=pentium
add_imm_to_mem(int*):
movl 4(%esp), %eax # load p from the stack, because of the 32-bit calling convention
movl (%eax), %edx # *p += 3 implemented as 3 separate instructions
addl $3, %edx
movl %edx, (%eax)
ret
You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.
Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem)
on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).
In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.
GCC refuses -mtune=pentium
without -m32
, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc
. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium
.
The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sam is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54391268%2fwhy-doesnt-clang-use-memory-destination-x86-instructions-when-i-compile-with-op%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.
Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp)
would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.
(Update: just noticed your compiler output included xor %eax,%eax
, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0
doesn't look for the xor-zeroing peephole optimization at -O0
, using mov $0, %eax
.)
Fun fact: gcc -O0
will actually use addl $6, -4(%rbp)
in your main
.
You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add
instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.
A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6
to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock
prefix to make it appear atomic.
There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)
But as it stands now, add $6, -4(%rbp)
decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.
Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state.
So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.
We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb
(which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.
Original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked. Maybe even Pentium.
is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?
In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)
Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter
instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop
instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)
x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.
You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter
or loop
, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop
are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32
, and even inc
/loop
as a 3-byte alternative to 4-byte test ecx,ecx
/jnz
). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.
Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.
Avoiding slowdowns requires looking at more than 1 instruction at once
Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.
e.g. this code will cause a partial-register stall on Intel P6-family CPUs:
mov ah, 1
add eax, 123
Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz
in an adc
loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs
If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.
But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:
- Assembly - How to score a CPU instruction by latency and throughput
- How many CPU cycles are needed for each assembly instruction?
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
GCC/clang -O0
's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).
But even within the block of asm for one statement, clang -O0
apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.
GCC -O0
happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax
/ret
, because a
is unused.)
main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
movl $0, %eax
popq %rbp
ret
How to see clang/LLVM using memory-destination add
I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic
for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)
void add_reg_to_mem(int *p, int b) {
*p += b;
}
# I used AT&T syntax because that's what you were using. Intel-syntax is nicer IMO
addl %esi, (%rdi)
ret
void add_imm_to_mem(int *p) {
*p += 3;
}
# gcc and clang -O3 both emit the same asm here, where there's only one good choice
addl $3, (%rdi)
ret
The gcc -O0
output is just totally braindead, e.g. reloading p
twice because it clobbers the pointer while calculating the +3
. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0
for that would probably be a lot less terrible.
# gcc8.2 -O0 output
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rax # load p
movl (%rax), %eax # load *p, clobbering p
leal 3(%rax), %edx # edx = *p + 3
movq -8(%rbp), %rax # reload p
movl %edx, (%rax) # store *p + 3
GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.
The clang -O0 output happens to be less horrible for this:
# clang -O0
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rdi # reload p
movl (%rdi), %eax # eax = *p
addl $3, %eax # eax += 3
movl %eax, (%rdi) # *p = eax
See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.
If I compiled with -m32 -mtune=pentium
, gcc -O3 would avoid memory-dst add:
The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.
# gcc8.2 -O3 -m32 -mtune=pentium
add_imm_to_mem(int*):
movl 4(%esp), %eax # load p from the stack, because of the 32-bit calling convention
movl (%eax), %edx # *p += 3 implemented as 3 separate instructions
addl $3, %edx
movl %edx, (%eax)
ret
You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.
Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem)
on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).
In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.
GCC refuses -mtune=pentium
without -m32
, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc
. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium
.
The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.
add a comment |
You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.
Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp)
would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.
(Update: just noticed your compiler output included xor %eax,%eax
, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0
doesn't look for the xor-zeroing peephole optimization at -O0
, using mov $0, %eax
.)
Fun fact: gcc -O0
will actually use addl $6, -4(%rbp)
in your main
.
You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add
instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.
A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6
to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock
prefix to make it appear atomic.
There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)
But as it stands now, add $6, -4(%rbp)
decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.
Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state.
So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.
We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb
(which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.
Original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked. Maybe even Pentium.
is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?
In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)
Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter
instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop
instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)
x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.
You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter
or loop
, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop
are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32
, and even inc
/loop
as a 3-byte alternative to 4-byte test ecx,ecx
/jnz
). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.
Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.
Avoiding slowdowns requires looking at more than 1 instruction at once
Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.
e.g. this code will cause a partial-register stall on Intel P6-family CPUs:
mov ah, 1
add eax, 123
Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz
in an adc
loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs
If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.
But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:
- Assembly - How to score a CPU instruction by latency and throughput
- How many CPU cycles are needed for each assembly instruction?
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
GCC/clang -O0
's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).
But even within the block of asm for one statement, clang -O0
apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.
GCC -O0
happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax
/ret
, because a
is unused.)
main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
movl $0, %eax
popq %rbp
ret
How to see clang/LLVM using memory-destination add
I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic
for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)
void add_reg_to_mem(int *p, int b) {
*p += b;
}
# I used AT&T syntax because that's what you were using. Intel-syntax is nicer IMO
addl %esi, (%rdi)
ret
void add_imm_to_mem(int *p) {
*p += 3;
}
# gcc and clang -O3 both emit the same asm here, where there's only one good choice
addl $3, (%rdi)
ret
The gcc -O0
output is just totally braindead, e.g. reloading p
twice because it clobbers the pointer while calculating the +3
. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0
for that would probably be a lot less terrible.
# gcc8.2 -O0 output
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rax # load p
movl (%rax), %eax # load *p, clobbering p
leal 3(%rax), %edx # edx = *p + 3
movq -8(%rbp), %rax # reload p
movl %edx, (%rax) # store *p + 3
GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.
The clang -O0 output happens to be less horrible for this:
# clang -O0
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rdi # reload p
movl (%rdi), %eax # eax = *p
addl $3, %eax # eax += 3
movl %eax, (%rdi) # *p = eax
See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.
If I compiled with -m32 -mtune=pentium
, gcc -O3 would avoid memory-dst add:
The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.
# gcc8.2 -O3 -m32 -mtune=pentium
add_imm_to_mem(int*):
movl 4(%esp), %eax # load p from the stack, because of the 32-bit calling convention
movl (%eax), %edx # *p += 3 implemented as 3 separate instructions
addl $3, %edx
movl %edx, (%eax)
ret
You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.
Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem)
on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).
In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.
GCC refuses -mtune=pentium
without -m32
, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc
. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium
.
The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.
add a comment |
You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.
Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp)
would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.
(Update: just noticed your compiler output included xor %eax,%eax
, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0
doesn't look for the xor-zeroing peephole optimization at -O0
, using mov $0, %eax
.)
Fun fact: gcc -O0
will actually use addl $6, -4(%rbp)
in your main
.
You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add
instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.
A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6
to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock
prefix to make it appear atomic.
There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)
But as it stands now, add $6, -4(%rbp)
decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.
Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state.
So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.
We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb
(which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.
Original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked. Maybe even Pentium.
is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?
In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)
Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter
instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop
instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)
x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.
You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter
or loop
, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop
are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32
, and even inc
/loop
as a 3-byte alternative to 4-byte test ecx,ecx
/jnz
). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.
Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.
Avoiding slowdowns requires looking at more than 1 instruction at once
Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.
e.g. this code will cause a partial-register stall on Intel P6-family CPUs:
mov ah, 1
add eax, 123
Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz
in an adc
loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs
If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.
But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:
- Assembly - How to score a CPU instruction by latency and throughput
- How many CPU cycles are needed for each assembly instruction?
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
GCC/clang -O0
's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).
But even within the block of asm for one statement, clang -O0
apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.
GCC -O0
happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax
/ret
, because a
is unused.)
main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
movl $0, %eax
popq %rbp
ret
How to see clang/LLVM using memory-destination add
I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic
for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)
void add_reg_to_mem(int *p, int b) {
*p += b;
}
# I used AT&T syntax because that's what you were using. Intel-syntax is nicer IMO
addl %esi, (%rdi)
ret
void add_imm_to_mem(int *p) {
*p += 3;
}
# gcc and clang -O3 both emit the same asm here, where there's only one good choice
addl $3, (%rdi)
ret
The gcc -O0
output is just totally braindead, e.g. reloading p
twice because it clobbers the pointer while calculating the +3
. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0
for that would probably be a lot less terrible.
# gcc8.2 -O0 output
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rax # load p
movl (%rax), %eax # load *p, clobbering p
leal 3(%rax), %edx # edx = *p + 3
movq -8(%rbp), %rax # reload p
movl %edx, (%rax) # store *p + 3
GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.
The clang -O0 output happens to be less horrible for this:
# clang -O0
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rdi # reload p
movl (%rdi), %eax # eax = *p
addl $3, %eax # eax += 3
movl %eax, (%rdi) # *p = eax
See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.
If I compiled with -m32 -mtune=pentium
, gcc -O3 would avoid memory-dst add:
The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.
# gcc8.2 -O3 -m32 -mtune=pentium
add_imm_to_mem(int*):
movl 4(%esp), %eax # load p from the stack, because of the 32-bit calling convention
movl (%eax), %edx # *p += 3 implemented as 3 separate instructions
addl $3, %edx
movl %edx, (%eax)
ret
You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.
Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem)
on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).
In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.
GCC refuses -mtune=pentium
without -m32
, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc
. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium
.
The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.
You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.
Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp)
would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.
(Update: just noticed your compiler output included xor %eax,%eax
, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0
doesn't look for the xor-zeroing peephole optimization at -O0
, using mov $0, %eax
.)
Fun fact: gcc -O0
will actually use addl $6, -4(%rbp)
in your main
.
You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add
instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.
A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6
to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock
prefix to make it appear atomic.
There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)
But as it stands now, add $6, -4(%rbp)
decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.
Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state.
So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.
We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb
(which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.
Original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked. Maybe even Pentium.
is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?
In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)
Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter
instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop
instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)
x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.
You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter
or loop
, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop
are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32
, and even inc
/loop
as a 3-byte alternative to 4-byte test ecx,ecx
/jnz
). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.
Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.
Avoiding slowdowns requires looking at more than 1 instruction at once
Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.
e.g. this code will cause a partial-register stall on Intel P6-family CPUs:
mov ah, 1
add eax, 123
Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz
in an adc
loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs
If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.
But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:
- Assembly - How to score a CPU instruction by latency and throughput
- How many CPU cycles are needed for each assembly instruction?
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
GCC/clang -O0
's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).
But even within the block of asm for one statement, clang -O0
apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.
GCC -O0
happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax
/ret
, because a
is unused.)
main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
addl $6, -4(%rbp)
movl $0, %eax
popq %rbp
ret
How to see clang/LLVM using memory-destination add
I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic
for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)
void add_reg_to_mem(int *p, int b) {
*p += b;
}
# I used AT&T syntax because that's what you were using. Intel-syntax is nicer IMO
addl %esi, (%rdi)
ret
void add_imm_to_mem(int *p) {
*p += 3;
}
# gcc and clang -O3 both emit the same asm here, where there's only one good choice
addl $3, (%rdi)
ret
The gcc -O0
output is just totally braindead, e.g. reloading p
twice because it clobbers the pointer while calculating the +3
. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0
for that would probably be a lot less terrible.
# gcc8.2 -O0 output
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rax # load p
movl (%rax), %eax # load *p, clobbering p
leal 3(%rax), %edx # edx = *p + 3
movq -8(%rbp), %rax # reload p
movl %edx, (%rax) # store *p + 3
GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.
The clang -O0 output happens to be less horrible for this:
# clang -O0
... after making a stack frame and spilling `p` from RDI to -8(%rbp)
movq -8(%rbp), %rdi # reload p
movl (%rdi), %eax # eax = *p
addl $3, %eax # eax += 3
movl %eax, (%rdi) # *p = eax
See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.
If I compiled with -m32 -mtune=pentium
, gcc -O3 would avoid memory-dst add:
The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.
# gcc8.2 -O3 -m32 -mtune=pentium
add_imm_to_mem(int*):
movl 4(%esp), %eax # load p from the stack, because of the 32-bit calling convention
movl (%eax), %edx # *p += 3 implemented as 3 separate instructions
addl $3, %edx
movl %edx, (%eax)
ret
You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.
Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem)
on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).
In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.
GCC refuses -mtune=pentium
without -m32
, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc
. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium
.
The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.
edited 22 mins ago
answered 4 hours ago
Peter CordesPeter Cordes
123k17186313
123k17186313
add a comment |
add a comment |
Sam is a new contributor. Be nice, and check out our Code of Conduct.
Sam is a new contributor. Be nice, and check out our Code of Conduct.
Sam is a new contributor. Be nice, and check out our Code of Conduct.
Sam is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54391268%2fwhy-doesnt-clang-use-memory-destination-x86-instructions-when-i-compile-with-op%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Most architectures simply don't have an operation for adding directly in memory. Implicitly, the operands always have to be transferred into cpu registers to be added by some kind of ALU
– Ctx
5 hours ago
6
The code from C appears to be unoptimised so it has extra loads and stores. Compile with
-O3
and see what happens.– Michael Petch
5 hours ago
4
@Sam What I mean is: it is not really added "directly in memory", the target operand has still to be fetched from the memory (or caches) into a CPU register before adding. This is done implicitly. I just added this because especially the title suggests, that the memory (RAM) could perform arithmetic operations, which is not true on any platform I know ;)
– Ctx
5 hours ago
1
@Sam It is not an architecture register, but an unnamed hardware register
– Ctx
5 hours ago
2
I recommend tossing the add in a function and adding two parameters and examine the code: godbolt.org/z/ZmySpq . Godbolt is a useful tool to look at generated code online.
– Michael Petch
5 hours ago