Class constructor performance foibles?

So, while working on PT 9.1, the SystemCPU class was added:

class SystemCPU {
	val Vendor = "";
	
	val MMX = false;
	val SSE = false;
	val SSE2 = false;
	val SSE3 = false;
	val SSSE3 = false;
	val SSE41 = false;
	val SSE42 = false;
}

The class isn’t complete yet since it is missing some options, including core count and the brand name of the CPU. But even when it is finished, it is going to be a simple class, since it has no functionality. The System class will instantiate one of these for you and fill it with the appropriate information, so you have a platform-independent way of figuring out the capabilities of your CPU and things like the number of cores and what-not.

So this is, and is meant to be, a simple class. It has a String field and several Bool fields. As it turns out, the current implementation of String zeroes its fields, and false is also 0, so all a constructor for SystemCPU must do is set everything to 0: basically a memset(0). Since we have the C++ backend, one can always output pretty readable equivalent C++ code, and here are the constructors for String and SystemCPU:

String::String() {
	data = 0;
	length = 0;
	capacity = 0;
}

SystemCPU::SystemCPU(): Vendor(_Void) {
	new (&this->Vendor) ::String();
	MMX = false;
	SSE = false;
	SSE2 = false;
	SSE3 = false;
	SSSE3 = false;
	SSE41 = false;
	SSE42 = false;
}

The Vendor(_Void) and new (&this->Vendor) ::String() are a bit weird. The sequence Vendor(_Void) guarantees that the C++ SystemCPU constructor will leave Vendor alone (the optimizer recognizes the Vendor(_Void) construct due to a special constructor in String and generates a NOP for that). With the placement new, new (&this->Vendor) ::String(), we manually call the constructor for String and then initialize the rest of the fields. And all fields get initialized to 0.
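To make the two constructs concrete, here is a minimal standalone C++ sketch of the same pattern. The names VoidTag, kVoid, Str and Cpu are hypothetical stand-ins, not the actual generated code or runtime types; the sketch only assumes that _Void behaves like a tag value that selects a do-nothing constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical tag type standing in for the runtime's _Void marker.
struct VoidTag {};
constexpr VoidTag kVoid{};

// Hypothetical stand-in for the generated String class.
struct Str {
	char*       data;
	std::size_t length;
	std::size_t capacity;

	// Regular constructor: zeroes every field.
	Str() {
		data = nullptr;
		length = 0;
		capacity = 0;
	}

	// Tag constructor: intentionally leaves the fields untouched,
	// so there is nothing for the compiler to emit.
	explicit Str(VoidTag) {}
};

// Hypothetical, shortened stand-in for the generated SystemCPU class.
struct Cpu {
	Str  Vendor;
	bool MMX;
	bool SSE;

	// Vendor(kVoid) skips the default initialization of the member;
	// the placement new then runs the real constructor on that storage.
	Cpu(): Vendor(kVoid) {
		new (&this->Vendor) Str();
		MMX = false;
		SSE = false;
	}
};

// Returns true when every field of a freshly constructed Cpu is zero/false.
inline bool cpu_is_zeroed() {
	Cpu c;
	return c.Vendor.data == nullptr && c.Vendor.length == 0
		&& c.Vendor.capacity == 0 && !c.MMX && !c.SSE;
}
```

Calling cpu_is_zeroed() should return true: the member is constructed exactly once, by the placement new, and the remaining fields are assigned afterwards.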

The question is: should we optimize this initialization? When working with the C++ backend, we always need to decide whether to do an optimization in the front-end or leave it to the backend. Leaving it to the backend can result in more readable code. But does the backend do this? Let’s benchmark!

using sys.core.StopWatch;
using sys.core.SystemCPU;

class BenchSystemCPU {
	const TIMES = 100000;
	
	static def test_Abs(const in: [c]SystemCPU) {
		for (val i = 0; i < TIMES; i++) {
			for (val j = 0p; j < in.Length; j++)
				in[j]{};
		}
	}

	static val buffa: [c 100000]SystemCPU = void;
	
	def @main() {
		{
			val sw = StopWatch{};
			test_Abs(buffa);
			System.Out << test_Abs.Name << " finished in " << sw.Elapsed() / 1000 << " sec.\n";
		}
	}
}

This simple little benchmark will fill up a statically allocated vector of 100000 SystemCPU instances 100000 times. Never mind that the program avoids leaking memory only by the sheer coincidence that everything is initialized to zero. In general, for it not to leak memory, each manual constructor call must be accompanied by a manual destructor call, but calling the destructor would detract from the goal of the benchmark, since the destructor is not free. And even if it were to leak, we are benchmarking the time it takes to call the constructor, so it is not important.
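The constructor/destructor pairing can be illustrated with a small C++ sketch. Owner, live_buffers and reconstruct are hypothetical names made up for this illustration; Owner is a toy string-like type that owns a heap buffer, not the real String.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Counts buffers that are currently allocated, to make leaks visible.
static int live_buffers = 0;

// Hypothetical string-like type that may own a heap buffer.
struct Owner {
	char* data;

	Owner(): data(nullptr) {}
	explicit Owner(std::size_t n): data(new char[n]) { live_buffers++; }
	~Owner() {
		if (data != nullptr) {
			delete[] data;
			live_buffers--;
		}
	}
	Owner(const Owner&) = delete;
	Owner& operator=(const Owner&) = delete;
};

// Re-constructs `slot` in place. With destroy_first == false the old
// buffer, if there is one, is simply forgotten and leaks.
inline void reconstruct(Owner* slot, bool destroy_first) {
	assert(slot != nullptr);
	if (destroy_first)
		slot->~Owner();
	new (slot) Owner();
}
```

If the slot holds an empty Owner, as every slot in the benchmark does, skipping the destructor loses nothing. If it holds a buffer, reconstruct(slot, false) overwrites the only pointer to it, which is exactly the leak the benchmark sidesteps.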

So here are the results:

  • 37.1421495178437 seconds on MSC14, 32-bit, release mode.
  • 35.1300343545186 seconds on TDM, 32-bit, release mode.

Fair enough! But what if I am wrong about the _Void bit and the placement new? What if these two constructs completely confuse both compilers? Well, we can easily rewrite the two constructors to be much more standard:

String::String(): data(0), length(0), capacity(0) {
}

SystemCPU::SystemCPU(): Vendor(), MMX(false), SSE(false), SSE2(false), SSE3(false),
						SSSE3(false), SSE41(false), SSE42(false) {
}

Now all fields are initialized in the initializer list and I also followed the declaration order. Let’s see the benchmarks:

  • 37.1537800874641 seconds on MSC14, 32-bit, release mode.
  • 35.1234764643324 seconds on TDM, 32-bit, release mode.

The results are the same, barring a bit of run-to-run variance, so the constructs are not confusing the compilers.

But we are 600 words in and you may be asking the following question: what is this pointless benchmarking all about? Well, I’ll do one final tweak and replace the initializer list with a memset(0). It is a bit hacky and not as pretty or maintainable, but one would expect the compiler to actually do this behind the scenes, and if we get the same numbers, then that is evidence enough that the memset hack should not be used. Here are the modified constructors:

String::String() {
	memset(this, 0, sizeof(String));
}

SystemCPU::SystemCPU(): Vendor(_Void) {
	memset(this, 0, sizeof(SystemCPU));
}

And the results:

  • 8.30797196732857 seconds on MSC14, 32-bit, release mode.
  • 16.5283915465819 seconds on TDM, 32-bit, release mode.

This is insane! The MSC version is roughly 4.5 times faster and the TDM version is roughly 2.1 times faster with this hackjob. To investigate why one compiler gains so much more than the other, I would need to go into the assembly and into memset and see what is going on, but that is not the point.

The point is: are C++ compilers not equipped with an optimization pass to handle this? Because if they are not, adding such a pass to the front-end would be a huge win.

This issue needs further investigation and I’ll be back with a follow-up after I dig through some ASM. Hopefully it is not just some mistake on my part!
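For anyone who wants to repeat the comparison outside of Z2, here is a minimal C++ sketch of the two variants side by side. FieldInit, MemsetInit and time_construct are hypothetical names, the layout only approximates String plus the seven flags, and the sketch assumes that an all-zero bit pattern is a null pointer, which holds on mainstream platforms. An optimizer is free to drop redundant stores in such a loop, so the numbers it produces need not match the ones above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <new>
#include <type_traits>

// Variant 1: every field is initialized individually.
struct FieldInit {
	char*       data;
	std::size_t length;
	std::size_t capacity;
	bool MMX, SSE, SSE2, SSE3, SSSE3, SSE41, SSE42;

	FieldInit(): data(nullptr), length(0), capacity(0),
		MMX(false), SSE(false), SSE2(false), SSE3(false),
		SSSE3(false), SSE41(false), SSE42(false) {}
};

// Variant 2: same layout, but the whole object is cleared in one call.
struct MemsetInit {
	char*       data;
	std::size_t length;
	std::size_t capacity;
	bool MMX, SSE, SSE2, SSE3, SSSE3, SSE41, SSE42;

	MemsetInit() {
		// The cast to void* avoids compiler warnings about raw memory
		// access to a class type.
		std::memset(static_cast<void*>(this), 0, sizeof(MemsetInit));
	}
};

// Clearing an object with memset is only reasonable for such types.
static_assert(std::is_trivially_copyable<MemsetInit>::value,
	"memset on this requires a trivially copyable type");

// Re-runs T's constructor on every slot of an existing buffer `times`
// times and returns the elapsed wall-clock time in seconds.
template <class T>
double time_construct(T* buf, std::size_t count, int times) {
	assert(buf != nullptr);
	const auto start = std::chrono::steady_clock::now();
	for (int i = 0; i < times; i++)
		for (std::size_t j = 0; j < count; j++)
			new (&buf[j]) T();
	const std::chrono::duration<double> elapsed =
		std::chrono::steady_clock::now() - start;
	return elapsed.count();
}
```

Calling time_construct on a static array of each type, with the same count and times, gives two timings to compare. Both variants leave every field zero, so any difference comes from how the compiler emits the stores.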

Post PT 9.0 Updates and OSS

Z2C PT 9.0 was announced here and on Reddit last week and released as a pre-alpha. So the question is: what now?

Development on PT 9.1 has started. It will contain bug-fixes, new features and library additions. But this goes for all versions from now on until the implementation is considered done, so I won’t repeat this every week. Instead I’ll talk about long term goals.

One of the Reddit questions was: why not release the source code as OSS? Now, the standard library is released as OSS, but the compiler itself and ZIDE are not. The reason I gave for this was time and perfectionism. Eventually, everything will be released as OSS. The compiler itself will be included in the standard library. Thus, it must look decent and have a stable API. The code is nowhere near ready for this and in consequence it is not OSSed yet.

A lot of time is needed to achieve this and in order to compensate for this, the compiler library will be split up more. A part of it will be separated as a low dependency “scanner”, a library that goes over Z2 code and builds class layout and other meta-information. This new library shall work without the compiler or the assembly and is designed to be very lenient with errors if needed, in order to provide all the code navigation features for ZIDE and other tools. So step one is to do this isolation part and refactor it a bit. This small library will be called “z2c-scan” and will be the first to be OSS. Using this, ZIDE can be OSSed as well, since it has no other dependencies.

The rest of the library will be split up again, with z2c-comp, z2-cpp and z2c-llvm being the compiler, C++ backend and LLVM backend respectively.

And a very long term refactoring process will be executed incrementally on all the code to make it fit for prime time and not leave it as ugly and dense as it can be in some places. In the early stages of the compiler, there was a goal to get it as small as possible. A pretty stupid goal and in consequence the compiler code is very dense and not that easy to maintain. This will change with time.

Other than that, there will be an update every two weeks. Even if these smaller updates are light on features, they do contain bug-fixes and will in turn create a better user experience.