Class constructor performance foibles?

So, while working on PT 9.1, the SystemCPU class was added:

class SystemCPU {
	val Vendor = "";
	
	val MMX = false;
	val SSE = false;
	val SSE2 = false;
	val SSE3 = false;
	val SSSE3 = false;
	val SSE41 = false;
	val SSE42 = false;
}

The class isn’t complete yet: it is missing some fields, including the core count and the brand name of the CPU. But even when it is finished, it is going to be a simple class, since it has no functionality. The System class will instantiate one of these for you and fill it in with the appropriate information, so you have a platform-independent way of figuring out the capabilities of your CPU, the number of cores and what-not.

So this is, and is meant to be, a simple class. It has a String field and several Bool fields. As it turns out, the current implementation of String zeroes all of its fields, and false is also 0, so all a constructor for SystemCPU must do is set everything to 0. Basically, do a memset(0). Since we have the C++ backend, one can always output pretty readable equivalent C++ code, and here are the constructors for String and SystemCPU:

String::String() {
	data = 0;
	length = 0;
	capacity = 0;
}

SystemCPU::SystemCPU(): Vendor(_Void) {
	new (&this->Vendor) ::String();
	MMX = false;
	SSE = false;
	SSE2 = false;
	SSE3 = false;
	SSSE3 = false;
	SSE41 = false;
	SSE42 = false;
}

The Vendor(_Void) and new (&this->Vendor) ::String() are a bit weird. The sequence Vendor(_Void) guarantees that the C++ SystemCPU constructor will leave Vendor alone (the optimizer recognizes the Vendor(_Void) construct due to a special constructor in String and generates a NOP for that). With the placement new, new (&this->Vendor) ::String(), we manually call the constructor for String and then initialize the rest of the fields. And all fields get initialized to 0.
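For readers who want to try the pattern in isolation, here is a minimal C++ sketch of a do-nothing tag constructor followed by a placement new. The VoidTag type, the kVoid constant, the field layout of String and the trimmed-down list of flags are assumptions made for this example; they are not the actual generated code.

```cpp
#include <cassert>   // for callers that want assert(allFieldsZero())
#include <cstddef>
#include <new>

// Hypothetical stand-in for the _Void marker used by the generated code.
struct VoidTag {};
constexpr VoidTag kVoid{};

struct String {
    char*       data;
    std::size_t length;
    std::size_t capacity;

    // Regular default constructor: zero every field.
    String() : data(nullptr), length(0), capacity(0) {}

    // Tag constructor: deliberately leaves all fields uninitialized.
    explicit String(VoidTag) {}
};

struct SystemCPU {
    String Vendor;
    bool   MMX;
    bool   SSE;
    bool   SSE2;

    // Vendor(kVoid) selects the do-nothing constructor, so the member is
    // not initialized twice; the placement new then runs the real one.
    SystemCPU() : Vendor(kVoid) {
        new (&this->Vendor) String();
        MMX  = false;
        SSE  = false;
        SSE2 = false;
    }
};

// Builds one instance and reports whether every field ended up zeroed.
inline bool allFieldsZero() {
    SystemCPU cpu;
    return cpu.Vendor.data == nullptr && cpu.Vendor.length == 0 &&
           cpu.Vendor.capacity == 0 && !cpu.MMX && !cpu.SSE && !cpu.SSE2;
}
```

Calling allFieldsZero() should return true: the tag constructor leaves Vendor untouched and the placement new then fills it in, which is the same end state the ordinary default constructor would produce.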

The question is: should we optimize this initialization? When working with the C++ backend, we always need to decide whether to do an optimization in the front-end or leave it to the backend. Leaving it to the backend can result in more readable code. But does the backend do this? Let’s benchmark!

using sys.core.StopWatch;
using sys.core.SystemCPU;

class BenchSystemCPU {
	const TIMES = 100000;
	
	static def test_Abs(const in: [c]SystemCPU) {
		for (val i = 0; i < TIMES; i++) {
			for (val j = 0p; j < in.Length; j++)
				in[j]{};
		}
	}

	static val buffa: [c 100000]SystemCPU = void;
	
	def @main() {
		{
			val sw = StopWatch{};
			test_Abs(buffa);
			System.Out << test_Abs.Name << " finished in " << sw.Elapsed() / 1000 << " sec.\n";
		}
	}
}

This simple little benchmark fills a statically allocated vector of 100000 SystemCPU instances 100000 times. Never mind that the program avoids leaking memory only by the sheer coincidence that everything is initialized to zero. To be leak-free in general, every manual constructor call must be paired with a manual destructor call, but calling the destructor would detract from the goal of the benchmark, since the destructor is not free. And even if the program did leak, we are benchmarking the time it takes to call the constructor, so it would not matter.
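The point about pairing manual constructor and destructor calls can be illustrated with a small C++ sketch. Tracked and reconstruct are hypothetical names, and the live counter merely stands in for a real resource such as a heap buffer; SystemCPU gets away without the pairing only because a zeroed String owns nothing.

```cpp
#include <cassert>   // for callers that want to assert on the result
#include <new>

// Hypothetical resource-owning type; the counter stands in for allocations.
struct Tracked {
    static int live;
    Tracked()  { ++live; }
    ~Tracked() { --live; }
};
int Tracked::live = 0;

// Constructs an object in the same raw storage `times` times and returns how
// many objects were never destroyed. With pairDestructor == false, every
// earlier object is simply overwritten, which is where a real class would leak.
inline int reconstruct(int times, bool pairDestructor) {
    Tracked::live = 0;
    alignas(Tracked) unsigned char storage[sizeof(Tracked)];
    for (int i = 0; i < times; ++i) {
        Tracked* p = new (storage) Tracked();
        if (pairDestructor)
            p->~Tracked();  // manual destructor call matching the manual constructor
    }
    return Tracked::live;
}
```

With the destructor paired, reconstruct(5, true) returns 0; without it, reconstruct(5, false) returns 5, one "leaked" object per unpaired constructor call.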

So here are the results:

  • 37.1421495178437 seconds on MSC14, 32-bit, release mode.
  • 35.1300343545186 seconds on TDM, 32-bit, release mode.

Fair enough! But what if I am wrong about the _Void bit and the placement new? What if these two constructs completely confuse both compilers? Well, we can easily rewrite the two constructors to be much more standard:

String::String(): data(0), length(0), capacity(0) {
}

SystemCPU::SystemCPU(): Vendor(), MMX(false), SSE(false), SSE2(false), SSE3(false),
						SSSE3(false), SSE41(false), SSE42(false) {
}

Now all fields are initialized in the initializer list and I also followed the declaration order. Let’s see the benchmarks:

  • 37.1537800874641 seconds on MSC14, 32-bit, release mode.
  • 35.1234764643324 seconds on TDM, 32-bit, release mode.

The results are the same, barring a bit of normal run-to-run variance, so the constructs are not confusing the compilers.

But we are 600 words in and you may be asking the following question: what is this pointless benchmarking all about? Well, I’ll do one final tweak and replace the initializer list with a memset(0). It is a bit hacky and not as pretty or maintainable, but one would expect the compiler to do this behind the scenes anyway, and if we get the same numbers, then that is evidence enough that the memset hack should not be used. Here are the modified constructors:

String::String() {
	memset(this, 0, sizeof(String));
}

SystemCPU::SystemCPU(): Vendor(_Void) {
	memset(this, 0, sizeof(SystemCPU));
}
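One caveat about the memset version, offered as general C++ background rather than something measured here: overwriting an object's bytes is only sanctioned by the language for trivially copyable types, and reading all-zero bytes back as false or as a null pointer is an assumption that happens to hold on mainstream platforms. A minimal sketch of a compile-time guard follows; zeroFill and CpuFlags are hypothetical names, not part of the generated code.

```cpp
#include <cassert>   // for callers that want assert(flagsAreClear())
#include <cstring>
#include <type_traits>

// Hypothetical helper: zero an object with memset, but refuse to compile
// for types where a raw byte overwrite is not sanctioned by the language.
template <typename T>
void zeroFill(T& obj) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "zeroFill is only meant for trivially copyable types");
    std::memset(static_cast<void*>(&obj), 0, sizeof(T));
}

// Plain aggregate holding the feature flags from the article.
struct CpuFlags {
    bool MMX;
    bool SSE;
    bool SSE2;
    bool SSE3;
    bool SSSE3;
    bool SSE41;
    bool SSE42;
};

// Sets every flag, zero-fills the object, and reports whether all flags
// read back as false (assumes all-zero bytes represent false).
inline bool flagsAreClear() {
    CpuFlags f = {true, true, true, true, true, true, true};
    zeroFill(f);
    return !(f.MMX || f.SSE || f.SSE2 || f.SSE3 || f.SSSE3 || f.SSE41 || f.SSE42);
}
```

A class with a user-provided destructor or copy constructor would trip the static_assert, which is a reasonable signal that the hack needs a closer look for that type.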

And the results:

  • 8.30797196732857 seconds on MSC14, 32-bit, release mode.
  • 16.5283915465819 seconds on TDM, 32-bit, release mode.

This is insane! The MSC version is roughly 4.5 times faster and the TDM version is roughly 2.1 times faster with this hackjob. To investigate why the two compilers benefit so differently, I would need to go into the assembly and into memset and see what is going on, but that is not the point.

The point is: are C++ compilers not equipped with an optimization pass to handle this? Because if they are not, adding such a pass to the front-end would be a huge win.
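Anyone who wants to check this against their own C++ compiler without the rest of the toolchain could start from a standalone sketch like the one below. It is not the benchmark used for the numbers above: FieldInit, MemsetInit and timeConstruct are hypothetical names, the String fields are flattened into the class as an assumption about its layout, and an optimizer is free to drop stores it can prove dead, so the generated assembly still has to be inspected before trusting any timing.

```cpp
#include <cassert>   // for callers that want to assert on the timings
#include <chrono>
#include <cstddef>
#include <cstring>
#include <memory>
#include <new>

// Variant 1: every field initialized individually, in declaration order.
struct FieldInit {
    void*       data;
    std::size_t length;
    std::size_t capacity;
    bool MMX, SSE, SSE2, SSE3, SSSE3, SSE41, SSE42;

    FieldInit()
        : data(nullptr), length(0), capacity(0), MMX(false), SSE(false),
          SSE2(false), SSE3(false), SSSE3(false), SSE41(false), SSE42(false) {}
};

// Variant 2: same layout, zeroed with a single memset.
struct MemsetInit {
    void*       data;
    std::size_t length;
    std::size_t capacity;
    bool MMX, SSE, SSE2, SSE3, SSSE3, SSE41, SSE42;

    MemsetInit() { std::memset(static_cast<void*>(this), 0, sizeof(*this)); }
};

static_assert(sizeof(FieldInit) == sizeof(MemsetInit), "layouts should match");

// Constructs `count` objects in a raw buffer, `times` times over, and
// returns the elapsed wall-clock time in seconds.
template <typename T>
double timeConstruct(std::size_t count, int times) {
    std::unique_ptr<unsigned char[]> raw(new unsigned char[count * sizeof(T)]);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < times; ++i)
        for (std::size_t j = 0; j < count; ++j)
            new (raw.get() + j * sizeof(T)) T();  // placement new, no destructor
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Comparing timeConstruct<FieldInit>(100000, 100000) with timeConstruct<MemsetInit>(100000, 100000) in a release build mirrors the sizes used in the post; smaller values are enough to confirm that the harness runs.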

This issue needs further investigation and I’ll be back with a follow-up after I dig through some ASM. Hopefully it is not just some mistake on my part!