## Of tails.

In a previous post, I explored the effect of pruning on a recursive function, namely, the Collatz function. Richard (see comment) asked “does your compiler know about tail recursion?”. Well, I didn’t know for sure. Let’s find out.

Let’s start with the Collatz function, the function that initiated the question. A naïve, recursive, implementation gives something like this:

```unsigned collatz(unsigned x)
{
if (x==1)
return 1;
else
if (is_even(x))
return collatz(x/2);
else
return collatz(3*x+1);
}
```

Where is_even is a simple function that tests whether or not a number is… even. Inlined, it does not affect performance but improves readability.

We might expect that the compiler would see that it can be flattened into a while loop, but let’s see what it actually does. As usual, the compiler is g++, this time v. 4.8.2. The dumps are obtained using objdump -D -Mintel a.out. Invoking g++ with no extra arguments (implicitly using -O0), we get the following code:

```0000000000400841 <_Z7collatzj>:
400841:       55                      push   rbp
400842:       48 89 e5                mov    rbp,rsp
400845:       48 83 ec 10             sub    rsp,0x10
400849:       89 7d fc                mov    DWORD PTR [rbp-0x4],edi
40084c:       83 7d fc 01             cmp    DWORD PTR [rbp-0x4],0x1
400850:       75 07                   jne    400859 <_Z7collatzj+0x18>
400852:       b8 01 00 00 00          mov    eax,0x1
400857:       eb 2f                   jmp    400888 <_Z7collatzj+0x47>
400859:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
40085c:       89 c7                   mov    edi,eax
40085e:       e8 ca ff ff ff          call   40082d <_Z7is_evenj>
400863:       84 c0                   test   al,al
400865:       74 0e                   je     400875 <_Z7collatzj+0x34>
400867:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
40086a:       d1 e8                   shr    eax,1
40086c:       89 c7                   mov    edi,eax
40086e:       e8 ce ff ff ff          call   400841 <_Z7collatzj>
400873:       eb 13                   jmp    400888 <_Z7collatzj+0x47>
400875:       8b 55 fc                mov    edx,DWORD PTR [rbp-0x4]
400878:       89 d0                   mov    eax,edx
40087e:       83 c0 01                add    eax,0x1
400881:       89 c7                   mov    edi,eax
400883:       e8 b9 ff ff ff          call   400841 <_Z7collatzj>
400888:       c9                      leave
400889:       c3                      ret
```

This code literally translate the C++ code into assembly, without inlining nor any other fancy optimization: it is a line-by-line assembly equivalent of the original code. (Much like most early compiler would do. Turbo Pascal, for exemple, as I recall, was blindingly fast, but also quite literal in its code generation.)

Invoking g++ with -O3, we get the following code:

```  400930:       eb 16                   jmp    400948 <_Z7collatzj+0x18>
400932:       66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
400938:       89 fa                   mov    edx,edi
40093a:       8d 44 7f 01             lea    eax,[rdi+rdi*2+0x1]
40093e:       d1 ea                   shr    edx,1
400940:       83 e7 01                and    edi,0x1
400943:       0f 44 c2                cmove  eax,edx
400946:       89 c7                   mov    edi,eax
400948:       83 ff 01                cmp    edi,0x1
40094b:       75 eb                   jne    400938 <_Z7collatzj+0x8>
40094d:       b8 01 00 00 00          mov    eax,0x1
400952:       c3                      ret
```

This time, the compiler did figure out the recursion and flattened the recursion into a while loop. Can we do it ourselves? Let’s check if we can do better.

```unsigned collatz_untailed(unsigned x)
{
while (x!=1)
if (is_even(x))
x/=2;
else
x=3*x+1;
return x;
}
```

This is a rework of the code to re-write the parameter x and avoid recursive calls. Let’s see what g++ with -O3 generates.

```0000000000400930 <_Z16collatz_untailedj>:
400930:       eb 16                   jmp    400948 <_Z16collatz_untailedj+0x18>
400932:       66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
400938:       89 fa                   mov    edx,edi
40093a:       8d 44 7f 01             lea    eax,[rdi+rdi*2+0x1]
40093e:       d1 ea                   shr    edx,1
400940:       83 e7 01                and    edi,0x1
400943:       0f 44 c2                cmove  eax,edx
400946:       89 c7                   mov    edi,eax
400948:       83 ff 01                cmp    edi,0x1
40094b:       75 eb                   jne    400938 <_Z16collatz_untailedj+0x8>
40094d:       b8 01 00 00 00          mov    eax,0x1
400952:       c3                      ret
```

Well, it seems that exactly the compiler understood of the original function very well, enough to generate a recursion-free version. Even the generated code is identical.

*
* *

Let’s try another one that’s usually presented as a recursive function: the greatest common divisor, or gcd. In every textbook ever, you either find the original version, due to Euclid and based on successive subtraction, or this version, using modulo:

```unsigned gcd(unsigned a, unsigned b)
{
if (b)
return gcd(b, a % b);
else
return a;
}
```

The version based on modulo is quite advantageous because it will make a number of calls (or iterations) proportional to the logarithm of the smallest of the two numbers. With g++ (with no optimizations):

```00000000004008b5 <_Z3gcdjj>:
4008b5:       55                      push   rbp
4008b6:       48 89 e5                mov    rbp,rsp
4008b9:       48 83 ec 10             sub    rsp,0x10
4008bd:       89 7d fc                mov    DWORD PTR [rbp-0x4],edi
4008c0:       89 75 f8                mov    DWORD PTR [rbp-0x8],esi
4008c3:       83 7d f8 00             cmp    DWORD PTR [rbp-0x8],0x0
4008c7:       74 19                   je     4008e2 <_Z3gcdjj+0x2d>
4008c9:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
4008cc:       ba 00 00 00 00          mov    edx,0x0
4008d1:       f7 75 f8                div    DWORD PTR [rbp-0x8]
4008d4:       8b 45 f8                mov    eax,DWORD PTR [rbp-0x8]
4008d7:       89 d6                   mov    esi,edx
4008d9:       89 c7                   mov    edi,eax
4008db:       e8 d5 ff ff ff          call   4008b5 <_Z3gcdjj>
4008e0:       eb 03                   jmp    4008e5 <_Z3gcdjj+0x30>
4008e2:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
4008e5:       c9                      leave
4008e6:       c3                      ret
```

Now with -O3:

```0000000000400a70 <_Z3gcdjj>:
400a70:       85 f6                   test   esi,esi
400a72:       89 f8                   mov    eax,edi
400a74:       75 0c                   jne    400a82 <_Z3gcdjj+0x12>
400a76:       eb 14                   jmp    400a8c <_Z3gcdjj+0x1c>
400a78:       0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
400a7f:       00
400a80:       89 d6                   mov    esi,edx
400a82:       31 d2                   xor    edx,edx
400a84:       f7 f6                   div    esi
400a86:       89 f0                   mov    eax,esi
400a88:       85 d2                   test   edx,edx
400a8a:       75 f4                   jne    400a80 <_Z3gcdjj+0x10>
400a8c:       f3 c3                   repz ret
```

If we eliminate recursion, we would get something like:

```unsigned gcd_untailed(unsigned a, unsigned b)
{
while (b)
{
unsigned t=a % b;
a=b;
b=t;
}
return a;
}
```

Which, with g++ -O3, generates…

```0000000000400b60 <_Z12gcd_untailedjj>:
400b60:       85 f6                   test   esi,esi
400b62:       89 f8                   mov    eax,edi
400b64:       75 0c                   jne    400b72 <_Z12gcd_untailedjj+0x12>
400b66:       eb 14                   jmp    400b7c <_Z12gcd_untailedjj+0x1c>
400b68:       0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
400b6f:       00
400b70:       89 d6                   mov    esi,edx
400b72:       31 d2                   xor    edx,edx
400b74:       f7 f6                   div    esi
400b76:       89 f0                   mov    eax,esi
400b78:       85 d2                   test   edx,edx
400b7a:       75 f4                   jne    400b70 <_Z12gcd_untailedjj+0x10>
400b7c:       f3 c3                   repz ret
```

…the exact same code.

*
* *

OK, let’s consider one last example. The simplest of all, certainly, the prototypical textbook example: the factorial.

```unsigned factorial(unsigned n)
{
if (n==0)
return 1;
else
return n*factorial(n-1);
}
```

(Let’s say we ignore the overflows. Let’s pretend it’s the factorial, modulo the machine-size integers.) With no optimizations, g++ generates the following code:

```000000000040088a <_Z9factorialj>:
40088a:       55                      push   rbp
40088b:       48 89 e5                mov    rbp,rsp
40088e:       48 83 ec 10             sub    rsp,0x10
400892:       89 7d fc                mov    DWORD PTR [rbp-0x4],edi
400895:       83 7d fc 00             cmp    DWORD PTR [rbp-0x4],0x0
400899:       75 07                   jne    4008a2 <_Z9factorialj+0x18>
40089b:       b8 01 00 00 00          mov    eax,0x1
4008a0:       eb 11                   jmp    4008b3 <_Z9factorialj+0x29>
4008a2:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
4008a5:       83 e8 01                sub    eax,0x1
4008a8:       89 c7                   mov    edi,eax
4008aa:       e8 db ff ff ff          call   40088a <_Z9factorialj>
4008af:       0f af 45 fc             imul   eax,DWORD PTR [rbp-0x4]
4008b3:       c9                      leave
4008b4:       c3                      ret
```

Now, invoking the compiler with all optimizations (-O3):

```0000000000400920 <_Z9factorialj>:
400920:       85 ff                   test   edi,edi
400922:       0f 84 58 01 00 00       je     400a80 <_Z9factorialj+0x160>
400928:       89 fa                   mov    edx,edi
40092a:       c1 ea 02                shr    edx,0x2
40092d:       8d 0c 95 00 00 00 00    lea    ecx,[rdx*4+0x0]
400934:       85 c9                   test   ecx,ecx
400936:       0f 84 54 01 00 00       je     400a90 <_Z9factorialj+0x170>
40093c:       83 ff 09                cmp    edi,0x9
40093f:       0f 86 4b 01 00 00       jbe    400a90 <_Z9factorialj+0x170>
400945:       8d 47 ff                lea    eax,[rdi-0x1]
400948:       89 44 24 ec             mov    DWORD PTR [rsp-0x14],eax
40094c:       8d 47 fe                lea    eax,[rdi-0x2]
40094f:       66 0f 6e 74 24 ec       movd   xmm6,DWORD PTR [rsp-0x14]
400955:       89 44 24 f0             mov    DWORD PTR [rsp-0x10],eax
400959:       8d 47 fd                lea    eax,[rdi-0x3]
40095c:       66 0f 6e 44 24 f0       movd   xmm0,DWORD PTR [rsp-0x10]
400962:       89 7c 24 f0             mov    DWORD PTR [rsp-0x10],edi
400966:       89 44 24 f4             mov    DWORD PTR [rsp-0xc],eax
40096a:       66 0f 6e 4c 24 f0       movd   xmm1,DWORD PTR [rsp-0x10]
400970:       31 c0                   xor    eax,eax
400972:       66 0f 6e 64 24 f4       movd   xmm4,DWORD PTR [rsp-0xc]
400978:       66 0f 62 ce             punpckldq xmm1,xmm6
40097c:       66 0f 62 c4             punpckldq xmm0,xmm4
400980:       66 0f 6f 25 d8 01 00    movdqa xmm4,XMMWORD PTR [rip+0x1d8]        # 400b60 <_IO_stdin_used+0x20>
400987:       00
400988:       66 0f 6c c8             punpcklqdq xmm1,xmm0
40098c:       66 0f 6f 05 bc 01 00    movdqa xmm0,XMMWORD PTR [rip+0x1bc]        # 400b50 <_IO_stdin_used+0x10>
400993:       00
400994:       eb 0e                   jmp    4009a4 <_Z9factorialj+0x84>
400996:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
40099d:       00 00 00
4009a0:       66 0f 6f cb             movdqa xmm1,xmm3
4009a4:       66 0f 6f e9             movdqa xmm5,xmm1
4009a8:       83 c0 01                add    eax,0x1
4009ab:       66 0f 6f d9             movdqa xmm3,xmm1
4009af:       66 0f 73 d1 20          psrlq  xmm1,0x20
4009b4:       39 c2                   cmp    edx,eax
4009b6:       66 0f f4 e8             pmuludq xmm5,xmm0
4009ba:       66 0f 73 d0 20          psrlq  xmm0,0x20
4009bf:       66 0f f4 c8             pmuludq xmm1,xmm0
4009c3:       66 0f 70 c5 08          pshufd xmm0,xmm5,0x8
4009c8:       66 0f fe dc             paddd  xmm3,xmm4
4009cc:       66 0f 70 c9 08          pshufd xmm1,xmm1,0x8
4009d1:       66 0f 62 c1             punpckldq xmm0,xmm1
4009d5:       77 c9                   ja     4009a0 <_Z9factorialj+0x80>
4009d7:       66 0f 6f f8             movdqa xmm7,xmm0
4009db:       89 fa                   mov    edx,edi
4009dd:       29 ca                   sub    edx,ecx
4009df:       39 f9                   cmp    ecx,edi
4009e1:       66 0f 73 df 08          psrldq xmm7,0x8
4009e6:       66 0f 6f d7             movdqa xmm2,xmm7
4009ea:       66 0f 6f f7             movdqa xmm6,xmm7
4009ee:       66 0f 73 d2 20          psrlq  xmm2,0x20
4009f3:       66 0f f4 f0             pmuludq xmm6,xmm0
4009f7:       66 0f 73 d0 20          psrlq  xmm0,0x20
4009fc:       66 0f 70 ce 08          pshufd xmm1,xmm6,0x8
400a01:       66 0f f4 c2             pmuludq xmm0,xmm2
400a05:       66 0f 70 c0 08          pshufd xmm0,xmm0,0x8
400a0a:       66 0f 62 c8             punpckldq xmm1,xmm0
400a0e:       66 0f 6f f9             movdqa xmm7,xmm1
400a12:       66 0f 73 df 04          psrldq xmm7,0x4
400a17:       66 0f f4 cf             pmuludq xmm1,xmm7
400a1b:       66 0f 7e 4c 24 ec       movd   DWORD PTR [rsp-0x14],xmm1
400a21:       8b 44 24 ec             mov    eax,DWORD PTR [rsp-0x14]
400a25:       74 61                   je     400a88 <_Z9factorialj+0x168>
400a27:       89 d1                   mov    ecx,edx
400a29:       0f af c2                imul   eax,edx
400a2c:       83 e9 01                sub    ecx,0x1
400a2f:       74 57                   je     400a88 <_Z9factorialj+0x168>
400a31:       0f af c1                imul   eax,ecx
400a34:       89 d1                   mov    ecx,edx
400a36:       83 e9 02                sub    ecx,0x2
400a39:       74 4d                   je     400a88 <_Z9factorialj+0x168>
400a3b:       0f af c1                imul   eax,ecx
400a3e:       89 d1                   mov    ecx,edx
400a40:       83 e9 03                sub    ecx,0x3
400a43:       74 43                   je     400a88 <_Z9factorialj+0x168>
400a45:       0f af c1                imul   eax,ecx
400a48:       89 d1                   mov    ecx,edx
400a4a:       83 e9 04                sub    ecx,0x4
400a4d:       74 39                   je     400a88 <_Z9factorialj+0x168>
400a4f:       0f af c1                imul   eax,ecx
400a52:       89 d1                   mov    ecx,edx
400a54:       83 e9 05                sub    ecx,0x5
400a57:       74 2f                   je     400a88 <_Z9factorialj+0x168>
400a59:       0f af c1                imul   eax,ecx
400a5c:       89 d1                   mov    ecx,edx
400a5e:       83 e9 06                sub    ecx,0x6
400a61:       74 25                   je     400a88 <_Z9factorialj+0x168>
400a63:       0f af c1                imul   eax,ecx
400a66:       89 d1                   mov    ecx,edx
400a68:       83 e9 07                sub    ecx,0x7
400a6b:       74 1b                   je     400a88 <_Z9factorialj+0x168>
400a6d:       0f af c1                imul   eax,ecx
400a70:       83 ea 08                sub    edx,0x8
400a73:       89 c1                   mov    ecx,eax
400a75:       0f af ca                imul   ecx,edx
400a78:       85 d2                   test   edx,edx
400a7a:       0f 45 c1                cmovne eax,ecx
400a7d:       c3                      ret
400a7e:       66 90                   xchg   ax,ax
400a80:       b8 01 00 00 00          mov    eax,0x1
400a85:       0f 1f 00                nop    DWORD PTR [rax]
400a88:       f3 c3                   repz ret
400a8a:       66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
400a90:       89 fa                   mov    edx,edi
400a92:       b8 01 00 00 00          mov    eax,0x1
400a97:       eb 8e                   jmp    400a27 <_Z9factorialj+0x107>
400a99:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
```

wat.

I have no idea what the compiler tries to do here. It unrolls loops and vectorizes the code? Wow. That’s… something. OK, not let’s see what we could’ve done ourselves:

```unsigned factorial_untailed(unsigned n)
{
unsigned p=1;
for (unsigned u=2;u<=n;p*=u,u++);
return p;
}
```

Which, now, generates the expected, simple, loop-based code:

```0000000000400a80 <_Z18factorial_untailedj>:
400a80:       83 ff 01                cmp    edi,0x1
400a83:       76 17                   jbe    400a9c <_Z18factorial_untailedj+0x1c>
400a85:       ba 02 00 00 00          mov    edx,0x2
400a8a:       b8 01 00 00 00          mov    eax,0x1
400a8f:       90                      nop
400a90:       0f af c2                imul   eax,edx
400a93:       83 c2 01                add    edx,0x1
400a96:       39 d7                   cmp    edi,edx
400a98:       73 f6                   jae    400a90 <_Z18factorial_untailedj+0x10>
400a9a:       f3 c3                   repz ret
400a9c:       b8 01 00 00 00          mov    eax,0x1
400aa1:       c3                      ret
```

*
* *

Tail-recursion is (very) important in some programming languages. In fact, for some, recursion is the only way of implementing loops, and is the natural construct for iterations. For these languages, it makes plenty of sense that the compiler knows and understands a lot about recursion and tail-recursion. However, while tail-recursion isn’t forbidden in C or C++, it is not really a common design-pattern for functions. We favor iteration, and are sometimes even explicitly taught (although wrongly) that recursion is evil. But while tail-recursion is not really a basic design-pattern for either C nor C++, the g++ compiler still seems to know a lot about it. It figured out all functions (albeit it quite went overboard with one). It seems to be doing such a good job that it may be safe to trust it with bits of program written as recursive functions. But not always. If the target code is a bottleneck: profile, time, look at disassembly, and decide whether you want to spend time writing an iterative version.

### 6 Responses to Of tails.

1. Fanael says:

> all optimizations (-O3)

Actually, there are some optimizations not enabled by -O3, such as -funroll-loops.

> wat

Funny. On my machine, turning off vectorization (-fno-tree-vectorize) results in much more sane

.seh_proc _Z9factorialj
_Z9factorialj:
.LFB0:
.seh_endprologue
movl \$1, %eax
testl %ecx, %ecx
je .L4
.p2align 4,,10
.L3:
imull %ecx, %eax
subl \$1, %ecx
jne .L3
rep ret
.L4:
rep ret
.seh_endproc

I wonder what the hell the vectorizer was thinking.

> It seems to be doing such a good job that it may be safe to trust it with bits of program written as recursive functions.

Only when optimizing (specifically, when -foptimize-sibling-calls is on), so if you have separate debug (-O0 -g etc.) and release (-O3) builds, you can have the debug build overflowing the stack and release working fine. Not particularly nasty to sort out, but it’s worth keeping in mind.

• Yes, I know, -O3 enable most optimizations.

And, well, I don’t know why vectorizing went crazy on this function. I didn’t check for speed. Maybe it’s counterintuitively much faster?

• Fanael says:

…it turns out it is faster. Not bu much, by faster nonetheless. On my machine, calculating the modulo-2³² factorial of 1000000000 takes 630ms for the vectorized version and 830ms for the naïve version. Surprising, to say the least.

• Fanael says:

The vectorized version is down to 350ms if I enable AVX2.

Also, if the for loop in the iterative version is changed to “for (;n!=0;p*=n,–n);”, it gets vectorized too.

• And what does the assembly-level code look like? Does the compiler go nuts on instructions, or is it somewhat sane?

• Fanael says:

Nuts, the assembly is very similar to the vectorized factorial in the post.