Of tails.

In a previous post, I explored the effect of pruning on a recursive function, namely, the Collatz function. Richard (see comment) asked “does your compiler know about tail recursion?”. Well, I didn’t know for sure. Let’s find out.

tail-recursion-small

Let’s start with the Collatz function, the function that initiated the question. A naïve, recursive, implementation gives something like this:

unsigned collatz(unsigned x)
 {
  if (x==1)
   return 1;
  else
   if (is_even(x))
    return collatz(x/2);
   else
    return collatz(3*x+1);
 }

Where is_even is a simple function that tests whether or not a number is… even. Inlined, it does not affect performance but improves readability.

We might expect that the compiler would see that it can be flattened into a while loop, but let’s see what it actually does. As usual, the compiler is g++, this time v. 4.8.2. The dumps are obtained using objdump -D -Mintel a.out. Invoking g++ with no extra arguments (implicitly using -O0), we get the following code:

0000000000400841 <_Z7collatzj>:
  400841:       55                      push   rbp
  400842:       48 89 e5                mov    rbp,rsp
  400845:       48 83 ec 10             sub    rsp,0x10
  400849:       89 7d fc                mov    DWORD PTR [rbp-0x4],edi
  40084c:       83 7d fc 01             cmp    DWORD PTR [rbp-0x4],0x1
  400850:       75 07                   jne    400859 <_Z7collatzj+0x18>
  400852:       b8 01 00 00 00          mov    eax,0x1
  400857:       eb 2f                   jmp    400888 <_Z7collatzj+0x47>
  400859:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
  40085c:       89 c7                   mov    edi,eax
  40085e:       e8 ca ff ff ff          call   40082d <_Z7is_evenj>
  400863:       84 c0                   test   al,al
  400865:       74 0e                   je     400875 <_Z7collatzj+0x34>
  400867:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
  40086a:       d1 e8                   shr    eax,1
  40086c:       89 c7                   mov    edi,eax
  40086e:       e8 ce ff ff ff          call   400841 <_Z7collatzj>
  400873:       eb 13                   jmp    400888 <_Z7collatzj+0x47>
  400875:       8b 55 fc                mov    edx,DWORD PTR [rbp-0x4]
  400878:       89 d0                   mov    eax,edx
  40087a:       01 c0                   add    eax,eax
  40087c:       01 d0                   add    eax,edx
  40087e:       83 c0 01                add    eax,0x1
  400881:       89 c7                   mov    edi,eax
  400883:       e8 b9 ff ff ff          call   400841 <_Z7collatzj>
  400888:       c9                      leave
  400889:       c3                      ret

This code literally translate the C++ code into assembly, without inlining nor any other fancy optimization: it is a line-by-line assembly equivalent of the original code. (Much like most early compiler would do. Turbo Pascal, for exemple, as I recall, was blindingly fast, but also quite literal in its code generation.)

Invoking g++ with -O3, we get the following code:

  400930:       eb 16                   jmp    400948 <_Z7collatzj+0x18>
  400932:       66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
  400938:       89 fa                   mov    edx,edi
  40093a:       8d 44 7f 01             lea    eax,[rdi+rdi*2+0x1]
  40093e:       d1 ea                   shr    edx,1
  400940:       83 e7 01                and    edi,0x1
  400943:       0f 44 c2                cmove  eax,edx
  400946:       89 c7                   mov    edi,eax
  400948:       83 ff 01                cmp    edi,0x1
  40094b:       75 eb                   jne    400938 <_Z7collatzj+0x8>
  40094d:       b8 01 00 00 00          mov    eax,0x1
  400952:       c3                      ret

This time, the compiler did figure out the recursion and flattened the recursion into a while loop. Can we do it ourselves? Let’s check if we can do better.

unsigned collatz_untailed(unsigned x)
 {
  while (x!=1)
   if (is_even(x))
    x/=2;
   else
    x=3*x+1;
  return x;
 }

This is a rework of the code to re-write the parameter x and avoid recursive calls. Let’s see what g++ with -O3 generates.

0000000000400930 <_Z16collatz_untailedj>:
  400930:       eb 16                   jmp    400948 <_Z16collatz_untailedj+0x18>
  400932:       66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
  400938:       89 fa                   mov    edx,edi
  40093a:       8d 44 7f 01             lea    eax,[rdi+rdi*2+0x1]
  40093e:       d1 ea                   shr    edx,1
  400940:       83 e7 01                and    edi,0x1
  400943:       0f 44 c2                cmove  eax,edx
  400946:       89 c7                   mov    edi,eax
  400948:       83 ff 01                cmp    edi,0x1
  40094b:       75 eb                   jne    400938 <_Z16collatz_untailedj+0x8>
  40094d:       b8 01 00 00 00          mov    eax,0x1
  400952:       c3                      ret

Well, it seems that exactly the compiler understood of the original function very well, enough to generate a recursion-free version. Even the generated code is identical.

*
* *

Let’s try another one that’s usually presented as a recursive function: the greatest common divisor, or gcd. In every textbook ever, you either find the original version, due to Euclid and based on successive subtraction, or this version, using modulo:

unsigned gcd(unsigned a, unsigned b)
 {
  if (b)
   return gcd(b, a % b);
  else
   return a;
 }

The version based on modulo is quite advantageous because it will make a number of calls (or iterations) proportional to the logarithm of the smallest of the two numbers. With g++ (with no optimizations):

00000000004008b5 <_Z3gcdjj>:
  4008b5:       55                      push   rbp
  4008b6:       48 89 e5                mov    rbp,rsp
  4008b9:       48 83 ec 10             sub    rsp,0x10
  4008bd:       89 7d fc                mov    DWORD PTR [rbp-0x4],edi
  4008c0:       89 75 f8                mov    DWORD PTR [rbp-0x8],esi
  4008c3:       83 7d f8 00             cmp    DWORD PTR [rbp-0x8],0x0
  4008c7:       74 19                   je     4008e2 <_Z3gcdjj+0x2d>
  4008c9:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
  4008cc:       ba 00 00 00 00          mov    edx,0x0
  4008d1:       f7 75 f8                div    DWORD PTR [rbp-0x8]
  4008d4:       8b 45 f8                mov    eax,DWORD PTR [rbp-0x8]
  4008d7:       89 d6                   mov    esi,edx
  4008d9:       89 c7                   mov    edi,eax
  4008db:       e8 d5 ff ff ff          call   4008b5 <_Z3gcdjj>
  4008e0:       eb 03                   jmp    4008e5 <_Z3gcdjj+0x30>
  4008e2:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
  4008e5:       c9                      leave
  4008e6:       c3                      ret

Now with -O3:

0000000000400a70 <_Z3gcdjj>:
  400a70:       85 f6                   test   esi,esi
  400a72:       89 f8                   mov    eax,edi
  400a74:       75 0c                   jne    400a82 <_Z3gcdjj+0x12>
  400a76:       eb 14                   jmp    400a8c <_Z3gcdjj+0x1c>
  400a78:       0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
  400a7f:       00
  400a80:       89 d6                   mov    esi,edx
  400a82:       31 d2                   xor    edx,edx
  400a84:       f7 f6                   div    esi
  400a86:       89 f0                   mov    eax,esi
  400a88:       85 d2                   test   edx,edx
  400a8a:       75 f4                   jne    400a80 <_Z3gcdjj+0x10>
  400a8c:       f3 c3                   repz ret

If we eliminate recursion, we would get something like:

unsigned gcd_untailed(unsigned a, unsigned b)
 {
  while (b)
   {
    unsigned t=a % b;
    a=b;
    b=t;
   }
  return a;
 }

Which, with g++ -O3, generates…

0000000000400b60 <_Z12gcd_untailedjj>:
  400b60:       85 f6                   test   esi,esi
  400b62:       89 f8                   mov    eax,edi
  400b64:       75 0c                   jne    400b72 <_Z12gcd_untailedjj+0x12>
  400b66:       eb 14                   jmp    400b7c <_Z12gcd_untailedjj+0x1c>
  400b68:       0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
  400b6f:       00
  400b70:       89 d6                   mov    esi,edx
  400b72:       31 d2                   xor    edx,edx
  400b74:       f7 f6                   div    esi
  400b76:       89 f0                   mov    eax,esi
  400b78:       85 d2                   test   edx,edx
  400b7a:       75 f4                   jne    400b70 <_Z12gcd_untailedjj+0x10>
  400b7c:       f3 c3                   repz ret

…the exact same code.

*
* *

OK, let’s consider one last example. The simplest of all, certainly, the prototypical textbook example: the factorial.

unsigned factorial(unsigned n)
 {
  if (n==0)
   return 1;
  else
   return n*factorial(n-1);
 }

(Let’s say we ignore the overflows. Let’s pretend it’s the factorial, modulo the machine-size integers.) With no optimizations, g++ generates the following code:

000000000040088a <_Z9factorialj>:
  40088a:       55                      push   rbp
  40088b:       48 89 e5                mov    rbp,rsp
  40088e:       48 83 ec 10             sub    rsp,0x10
  400892:       89 7d fc                mov    DWORD PTR [rbp-0x4],edi
  400895:       83 7d fc 00             cmp    DWORD PTR [rbp-0x4],0x0
  400899:       75 07                   jne    4008a2 <_Z9factorialj+0x18>
  40089b:       b8 01 00 00 00          mov    eax,0x1
  4008a0:       eb 11                   jmp    4008b3 <_Z9factorialj+0x29>
  4008a2:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
  4008a5:       83 e8 01                sub    eax,0x1
  4008a8:       89 c7                   mov    edi,eax
  4008aa:       e8 db ff ff ff          call   40088a <_Z9factorialj>
  4008af:       0f af 45 fc             imul   eax,DWORD PTR [rbp-0x4]
  4008b3:       c9                      leave
  4008b4:       c3                      ret

Now, invoking the compiler with all optimizations (-O3):

0000000000400920 <_Z9factorialj>:
  400920:       85 ff                   test   edi,edi
  400922:       0f 84 58 01 00 00       je     400a80 <_Z9factorialj+0x160>
  400928:       89 fa                   mov    edx,edi
  40092a:       c1 ea 02                shr    edx,0x2
  40092d:       8d 0c 95 00 00 00 00    lea    ecx,[rdx*4+0x0]
  400934:       85 c9                   test   ecx,ecx
  400936:       0f 84 54 01 00 00       je     400a90 <_Z9factorialj+0x170>
  40093c:       83 ff 09                cmp    edi,0x9
  40093f:       0f 86 4b 01 00 00       jbe    400a90 <_Z9factorialj+0x170>
  400945:       8d 47 ff                lea    eax,[rdi-0x1]
  400948:       89 44 24 ec             mov    DWORD PTR [rsp-0x14],eax
  40094c:       8d 47 fe                lea    eax,[rdi-0x2]
  40094f:       66 0f 6e 74 24 ec       movd   xmm6,DWORD PTR [rsp-0x14]
  400955:       89 44 24 f0             mov    DWORD PTR [rsp-0x10],eax
  400959:       8d 47 fd                lea    eax,[rdi-0x3]
  40095c:       66 0f 6e 44 24 f0       movd   xmm0,DWORD PTR [rsp-0x10]
  400962:       89 7c 24 f0             mov    DWORD PTR [rsp-0x10],edi
  400966:       89 44 24 f4             mov    DWORD PTR [rsp-0xc],eax
  40096a:       66 0f 6e 4c 24 f0       movd   xmm1,DWORD PTR [rsp-0x10]
  400970:       31 c0                   xor    eax,eax
  400972:       66 0f 6e 64 24 f4       movd   xmm4,DWORD PTR [rsp-0xc]
  400978:       66 0f 62 ce             punpckldq xmm1,xmm6
  40097c:       66 0f 62 c4             punpckldq xmm0,xmm4
  400980:       66 0f 6f 25 d8 01 00    movdqa xmm4,XMMWORD PTR [rip+0x1d8]        # 400b60 <_IO_stdin_used+0x20>
  400987:       00
  400988:       66 0f 6c c8             punpcklqdq xmm1,xmm0
  40098c:       66 0f 6f 05 bc 01 00    movdqa xmm0,XMMWORD PTR [rip+0x1bc]        # 400b50 <_IO_stdin_used+0x10>
  400993:       00
  400994:       eb 0e                   jmp    4009a4 <_Z9factorialj+0x84>
  400996:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  40099d:       00 00 00
  4009a0:       66 0f 6f cb             movdqa xmm1,xmm3
  4009a4:       66 0f 6f e9             movdqa xmm5,xmm1
  4009a8:       83 c0 01                add    eax,0x1
  4009ab:       66 0f 6f d9             movdqa xmm3,xmm1
  4009af:       66 0f 73 d1 20          psrlq  xmm1,0x20
  4009b4:       39 c2                   cmp    edx,eax
  4009b6:       66 0f f4 e8             pmuludq xmm5,xmm0
  4009ba:       66 0f 73 d0 20          psrlq  xmm0,0x20
  4009bf:       66 0f f4 c8             pmuludq xmm1,xmm0
  4009c3:       66 0f 70 c5 08          pshufd xmm0,xmm5,0x8
  4009c8:       66 0f fe dc             paddd  xmm3,xmm4
  4009cc:       66 0f 70 c9 08          pshufd xmm1,xmm1,0x8
  4009d1:       66 0f 62 c1             punpckldq xmm0,xmm1
  4009d5:       77 c9                   ja     4009a0 <_Z9factorialj+0x80>
  4009d7:       66 0f 6f f8             movdqa xmm7,xmm0
  4009db:       89 fa                   mov    edx,edi
  4009dd:       29 ca                   sub    edx,ecx
  4009df:       39 f9                   cmp    ecx,edi
  4009e1:       66 0f 73 df 08          psrldq xmm7,0x8
  4009e6:       66 0f 6f d7             movdqa xmm2,xmm7
  4009ea:       66 0f 6f f7             movdqa xmm6,xmm7
  4009ee:       66 0f 73 d2 20          psrlq  xmm2,0x20
  4009f3:       66 0f f4 f0             pmuludq xmm6,xmm0
  4009f7:       66 0f 73 d0 20          psrlq  xmm0,0x20
  4009fc:       66 0f 70 ce 08          pshufd xmm1,xmm6,0x8
  400a01:       66 0f f4 c2             pmuludq xmm0,xmm2
  400a05:       66 0f 70 c0 08          pshufd xmm0,xmm0,0x8
  400a0a:       66 0f 62 c8             punpckldq xmm1,xmm0
  400a0e:       66 0f 6f f9             movdqa xmm7,xmm1
  400a12:       66 0f 73 df 04          psrldq xmm7,0x4
  400a17:       66 0f f4 cf             pmuludq xmm1,xmm7
  400a1b:       66 0f 7e 4c 24 ec       movd   DWORD PTR [rsp-0x14],xmm1
  400a21:       8b 44 24 ec             mov    eax,DWORD PTR [rsp-0x14]
  400a25:       74 61                   je     400a88 <_Z9factorialj+0x168>
  400a27:       89 d1                   mov    ecx,edx
  400a29:       0f af c2                imul   eax,edx
  400a2c:       83 e9 01                sub    ecx,0x1
  400a2f:       74 57                   je     400a88 <_Z9factorialj+0x168>
  400a31:       0f af c1                imul   eax,ecx
  400a34:       89 d1                   mov    ecx,edx
  400a36:       83 e9 02                sub    ecx,0x2
  400a39:       74 4d                   je     400a88 <_Z9factorialj+0x168>
  400a3b:       0f af c1                imul   eax,ecx
  400a3e:       89 d1                   mov    ecx,edx
  400a40:       83 e9 03                sub    ecx,0x3
  400a43:       74 43                   je     400a88 <_Z9factorialj+0x168>
  400a45:       0f af c1                imul   eax,ecx
  400a48:       89 d1                   mov    ecx,edx
  400a4a:       83 e9 04                sub    ecx,0x4
  400a4d:       74 39                   je     400a88 <_Z9factorialj+0x168>
  400a4f:       0f af c1                imul   eax,ecx
  400a52:       89 d1                   mov    ecx,edx
  400a54:       83 e9 05                sub    ecx,0x5
  400a57:       74 2f                   je     400a88 <_Z9factorialj+0x168>
  400a59:       0f af c1                imul   eax,ecx
  400a5c:       89 d1                   mov    ecx,edx
  400a5e:       83 e9 06                sub    ecx,0x6
  400a61:       74 25                   je     400a88 <_Z9factorialj+0x168>
  400a63:       0f af c1                imul   eax,ecx
  400a66:       89 d1                   mov    ecx,edx
  400a68:       83 e9 07                sub    ecx,0x7
  400a6b:       74 1b                   je     400a88 <_Z9factorialj+0x168>
  400a6d:       0f af c1                imul   eax,ecx
  400a70:       83 ea 08                sub    edx,0x8
  400a73:       89 c1                   mov    ecx,eax
  400a75:       0f af ca                imul   ecx,edx
  400a78:       85 d2                   test   edx,edx
  400a7a:       0f 45 c1                cmovne eax,ecx
  400a7d:       c3                      ret
  400a7e:       66 90                   xchg   ax,ax
  400a80:       b8 01 00 00 00          mov    eax,0x1
  400a85:       0f 1f 00                nop    DWORD PTR [rax]
  400a88:       f3 c3                   repz ret
  400a8a:       66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
  400a90:       89 fa                   mov    edx,edi
  400a92:       b8 01 00 00 00          mov    eax,0x1
  400a97:       eb 8e                   jmp    400a27 <_Z9factorialj+0x107>
  400a99:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]

wat.

I have no idea what the compiler tries to do here. It unrolls loops and vectorizes the code? Wow. That’s… something. OK, not let’s see what we could’ve done ourselves:

unsigned factorial_untailed(unsigned n)
 {
  unsigned p=1;
  for (unsigned u=2;u<=n;p*=u,u++);
  return p;
 }

Which, now, generates the expected, simple, loop-based code:

0000000000400a80 <_Z18factorial_untailedj>:
  400a80:       83 ff 01                cmp    edi,0x1
  400a83:       76 17                   jbe    400a9c <_Z18factorial_untailedj+0x1c>
  400a85:       ba 02 00 00 00          mov    edx,0x2
  400a8a:       b8 01 00 00 00          mov    eax,0x1
  400a8f:       90                      nop
  400a90:       0f af c2                imul   eax,edx
  400a93:       83 c2 01                add    edx,0x1
  400a96:       39 d7                   cmp    edi,edx
  400a98:       73 f6                   jae    400a90 <_Z18factorial_untailedj+0x10>
  400a9a:       f3 c3                   repz ret
  400a9c:       b8 01 00 00 00          mov    eax,0x1
  400aa1:       c3                      ret

*
* *

Tail-recursion is (very) important in some programming languages. In fact, for some, recursion is the only way of implementing loops, and is the natural construct for iterations. For these languages, it makes plenty of sense that the compiler knows and understands a lot about recursion and tail-recursion. However, while tail-recursion isn’t forbidden in C or C++, it is not really a common design-pattern for functions. We favor iteration, and are sometimes even explicitly taught (although wrongly) that recursion is evil. But while tail-recursion is not really a basic design-pattern for either C nor C++, the g++ compiler still seems to know a lot about it. It figured out all functions (albeit it quite went overboard with one). It seems to be doing such a good job that it may be safe to trust it with bits of program written as recursive functions. But not always. If the target code is a bottleneck: profile, time, look at disassembly, and decide whether you want to spend time writing an iterative version.

6 Responses to Of tails.

  1. Fanael says:

    > all optimizations (-O3)

    Actually, there are some optimizations not enabled by -O3, such as -funroll-loops.

    > wat

    Funny. On my machine, turning off vectorization (-fno-tree-vectorize) results in much more sane

    .seh_proc _Z9factorialj
    _Z9factorialj:
    .LFB0:
    .seh_endprologue
    movl $1, %eax
    testl %ecx, %ecx
    je .L4
    .p2align 4,,10
    .L3:
    imull %ecx, %eax
    subl $1, %ecx
    jne .L3
    rep ret
    .L4:
    rep ret
    .seh_endproc

    I wonder what the hell the vectorizer was thinking.

    > It seems to be doing such a good job that it may be safe to trust it with bits of program written as recursive functions.

    Only when optimizing (specifically, when -foptimize-sibling-calls is on), so if you have separate debug (-O0 -g etc.) and release (-O3) builds, you can have the debug build overflowing the stack and release working fine. Not particularly nasty to sort out, but it’s worth keeping in mind.

    • Yes, I know, -O3 enable most optimizations.

      And, well, I don’t know why vectorizing went crazy on this function. I didn’t check for speed. Maybe it’s counterintuitively much faster?

      • Fanael says:

        …it turns out it is faster. Not bu much, by faster nonetheless. On my machine, calculating the modulo-2³² factorial of 1000000000 takes 630ms for the vectorized version and 830ms for the naïve version. Surprising, to say the least.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: