C# vs C++, Arrays

Comparison Research

It seems to be a settled case that C++ is fast, and .NET is slow. Yet, there is really not so much analysis out there to actually support that. With increasing time demands and ever competitive demands for more features, more quickly, I felt it useful to know just how much I would be giving up in performance if I just ditched C++ and started writing everying in C#. The results thus far may surprise some.

A good place to start in such an investigation is arrays. Arrays are one of the most fundamental structures of computing, so fundamental that processors assume that they are array crunching engines and optimize for that.

Making a program "cache friendly" means organizing data as densely as possible, and in order, and that means a array. For x86, the old ESI/RSI, EDI/RDI registers came to mean source and destination index into arrays. And, the SSE instructions are a great tool for dealing with arrays. Finally, strings, of text, are usually implemented as a specialization of an array.

Note that in all of the examples of disassembly, full optimization was on, and on both sides.

The C++ Native Array

C++ has a well deserved reputation for being a speedy array processor, and it should. In that language, an array is a pointer. The much maligned pointer is just an indirect memory reference in assembly language. Let's have a look at a simple loop in C++:


void populateWithPointerLoop(int *arr, int length)
{
    int *endarr = arr + length;
    int i = 0;
    while (arr < endarr)
    {
        *arr++ = i * i;
        i++;
    }
}

In our particular test program, the body of the loop assembles into this:

000000013F9310B1  mov         eax,ecx 
000000013F9310B3  imul        eax,ecx 
000000013F9310B6  mov         dword ptr [rbx],eax 
000000013F9310B8  add         rbx,4 
000000013F9310BC  inc         ecx  
000000013F9310BE  cmp         rbx,rdx 
000000013F9310C1  jb          main+0B1h (13F9310B1h) 

For this roundup, that's about as good as the compilers do. This is pretty quick but it isn't perfect. The point is, though, that vanilla pointer and array loops in C and C++ are going to lead to reasonably efficient machine code. Some of the touches of the compiler are interesting. Our test calling sequence just calls each function in turn. The compiler not only was smart enough to inline all the function calls, it also converted a parameter to a function into an immediate mode cmp instruction. That's pretty nice.

C# Pointer Array

C# has a reputation for being somewhat sluggish. For array crunching, this is entirely deserved. C# is rather wordy for array processing. The fastest code that I could come up with, casually, is to use unsafe code and a C# pointer. Yes, C# does have pointers, but they are more like a married man at a friend's bachelor party then they are a single man at a bar. They have so many restrictions you almost wonder what the point is. The source looks like this:


unsafe static void populateWithPointerForLoop(int[] arr)
{
    fixed (int* c = &arr[0])
    {
        int l = arr.Length;
        for (int i = 0; i < l; i++)
        {
            c[i] = i * i;
        }
    }
}

What appears to be the loop of the generated product from above looks like:


0000008c  movsxd      rdx,dword ptr [rsp+2Ch] 
00000091  mov         ecx,dword ptr [rsp+2Ch] 
00000095  imul        ecx,dword ptr [rsp+2Ch] 
0000009a  mov         rax,qword ptr [rsp+20h] 
0000009f  mov         dword ptr [rax+rdx*4],ecx 
000000a2  mov         eax,dword ptr [rsp+2Ch] 
000000a6  add         eax,1 
000000a9  mov         dword ptr [rsp+2Ch],eax 
000000ad  mov         eax,dword ptr [rsp+28h] 
000000b1  cmp         dword ptr [rsp+2Ch],eax 
000000b5  jl          000000000000008C 

It's more involved, for sure. There's more instructions. As a rule for this sort of thing, that means slower code. Its worth noting that the setup was far more complicated than the C / C++ comparisons. Of course, that is not the way one normally uses arrays in C#.

C# Basic Arrays

Our simple loop, with stock arrays, in C#, would look like this:


static void populateWithForLoopInvariant(int[] arr)
{
    int l = arr.Length;
    for (int i = 0; i < l; i++)
    {
        arr[i] = i * i;
        i++;
    }
}

Given a piece of code like that, the compiler dutifully generates this:


00000045  mov         eax,dword ptr [rsp+24h] 
00000049  imul        eax,dword ptr [rsp+24h] 
0000004e  mov         dword ptr [rsp+28h],eax 
00000052  movsxd      rcx,dword ptr [rsp+24h] 
00000057  mov         rax,qword ptr [rsp+50h] 
0000005c  mov         rax,qword ptr [rax+8] 
00000060  mov         qword ptr [rsp+30h],rcx 
00000065  cmp         qword ptr [rsp+30h],rax 
0000006a  jae         0000000000000078 
0000006c  mov         rax,qword ptr [rsp+30h] 
00000071  mov         qword ptr [rsp+30h],rax 
00000076  jmp         000000000000007D 
00000078  call        FFFFFFFFF96583F0 
0000007d  mov         rdx,qword ptr [rsp+50h] 
00000082  mov         rcx,qword ptr [rsp+30h] 
00000087  mov         eax,dword ptr [rsp+28h] 
0000008b  mov         dword ptr [rdx+rcx*4+10h],eax 
0000008f  mov         eax,dword ptr [rsp+24h] 
00000093  add         eax,1 
00000096  mov         dword ptr [rsp+24h],eax 

Clearly there's more involved in this approach. This is seemingly not as efficient as the C or C++ code I showed earlier, but, is it -that- much less efficient? The function call in the middle of the loop is troubling, but it appears like it is calling something that throws an exception if the array bounds is exceeeded. It's certainly wordier than any of the C++ implementations so far, but then again, the C++ example doesn't check the array bounds. STL does though. Let's have a look at that.

C++ STL vector

Pure vectors are the fastest way to go, but many C++ developers choose to use the safer STL for everything now, rather than the basic arrays. STL is C++'s standard library efficiency is but one of STL's goals. Let's look at our C++ simple STL program:


void populateWithStdVector( std::vector& v, int length )
{
    for (int i = 0; i < length; i++)
    {
        v[i] = i * i;
    }
}

and then its dissassembly

000000013FB810D0  mov         rax,qword ptr [rsp+48h] 
000000013FB810D5  mov         rcx,qword ptr [rsp+40h] 
000000013FB810DA  sub         rax,rcx 
000000013FB810DD  sar         rax,2 
000000013FB810E1  cmp         rdi,rax 
000000013FB810E4  jb          main+0F1h (13FB810F1h) 
000000013FB810E6  call        qword ptr [__imp__invalid_parameter_noinfo (13FB82138h)] 
000000013FB810EC  mov         rcx,qword ptr [rsp+40h] 
000000013FB810F1  mov         eax,ebx 
000000013FB810F3  imul        eax,ebx 
000000013FB810F6  mov         dword ptr [rcx+rdi*4],eax 
000000013FB810F9  inc         ebx  
000000013FB810FB  inc         rdi  
000000013FB810FE  cmp         ebx,64h 
000000013FB81101  jl          main+0D0h (13FB810D0h) 

STL clearly bloats up C++ a bit. It makes us wonder. If you are writing in C++ but are willing to trade performance to get better libraries, why not just use a language like C#? More than a few C++ STL fans would get better results with C# collections.

Benchmark: C# array vs C++ STL vector vs C++ native array

Benchmarks Defined

I created three benchmarks.

  1. array set - array set is a timed derivative of the basic array assignment code we have been discussing.
  2. big streamish - squares the source array into a dest array. The array sizes are large enough to make memory bandwidth an issue.
  3. little streamish - like big streamish, but with more trials and smaller arrays in an attempt to reduce the effects of memory bandwidth on the timing.

Results


Discussion

The results here are interesting. Memory bandwidth outweighed code generation in the big streamish benchmark. All three systems performed nearly identically, with STL trailing the pack. With that obstacle removed, a native array approach performed best overall across all three benchmarks. Surprisingly, C#'s array implementation did not trail too far behind C++'s native array

Like so many other writers, I admit that my benchmarking doesn't tell the full story of the capabilities of any product. In particular, I will probably revise this article to include comparison's with GCC as I suspect GCC at this point probably produces better 64 bit code for simple array crunching than does Visual C++.

Conclusions

I would say that if you are embarking on the creation of a new scientific or financial computing product, using Microsoft tools on Windows, and your gut is telling you to use STL on Visual C++ because you want the convenience and the safety, you should probably have a look at C# instead. C# is safer and faster than STL on Visual C++. The only way that you can get the kind of performance that Visual C++ is famous for is to live dangerously, and take a C approach to arrays, not a C++ one. Even then, you will be surprised at how often C#'s array implementation comes close to C++'s native arrays for performance.

One thing to notice about C# is that I had arranged the timing checks to not include the time to compile the code. It looks like we can make any C# benchmark worse by moving the timing loop outside of a code block. C# will compiles on demand but in our case even that took but a few milliseconds.

Be careful about using these benchmarks to draw conclusions about other compilers on other platforms. Do not say that you will use Mono over GNU C++ on Linux because of this page, but you should have a look at the STL vector source code you are using, and understand what you are getting into. Know your C++ compiler and STL implementation, and do a sanity check against a language that is easier to program. If you can get the performance you need in an application language rather than a systems one, certainly take the convenience of the application language and get your solution deployed more quickly.

System

These benchmarks were written and executed on the release candidate for Windows 7, 64 bit, using Visual Studio 2008 SP1, and .net framework 3.5 SP1. The hardware was 2P AMD Opteron 270 processor box, with 2GB of RAM, on a Tyan S2885 motherboard. The wstream benchmark for copy, stream, and triad for this machine are about 1440mb/s. It's a dinosaur folks, and I look forward to replicating these results on a Nehalem architecture machine, once I can obtain one.