
Performance Optimisation in D - Squish File Compressor

Introduction

Squish is a Windows command line program that I created years (decades!) ago to shrink text and image files for the Cybiko. It was not a great method, but it could reduce some files by a third; ultimately it was too slow to be much use. It worked by using bit codes to represent bytes: if the file had a small number of different bytes, the compression worked pretty well.

E.g. Hello World!

Letter    Occurrences    Bits
l         3              01
o         2              001
(space)   1              101
!         1              0001
H         1              1001
W         1              1101
d         1              00001
e         1              10001
r         1              11001

Compressed bit stream:
10011000 10101001 10111010 01110010 10000100 01000000

In this example, the bytes required drop from 12 to 6! ("Hello World!" is 12 characters, and the codes above total 42 bits, which pad out to 6 bytes.) The downside is that as the number of different characters increases, the compression gets worse; it can end up taking 4 bytes to represent 1 byte!

This method had the advantage of being easy to understand and not too hard to implement.
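
To make the idea concrete, here is a minimal D sketch of the bit-packing side using the hardcoded "Hello World!" table above. This is not the actual Squish code, just the same technique in miniature:

import std.stdio;

// Pack a string into bytes using a character-to-bitcode table
ubyte[] pack(string text,string[char] codes){
	ubyte[] output;
	ubyte current=0;
	int used=0;
	foreach(ch;text){
		foreach(bit;codes[ch]){
			current<<=1;
			if (bit=='1'){current|=1;}
			used++;
			if (used==8){
				output~=current;
				current=0;
				used=0;
			}
		}
	}
	// Pad the final partial byte with zero bits
	if (used>0){output~=cast(ubyte)(current<<(8-used));}
	return output;
}

void main(){
	string[char] codes=['l':"01",'o':"001",' ':"101",
		'!':"0001",'H':"1001",'W':"1101",
		'd':"00001",'e':"10001",'r':"11001"];
	// Prints: 10011000 10101001 10111010 01110010 10000100 01000000
	writefln("%(%08b %)",pack("Hello World!",codes));
}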

Note: If you are looking for a compression method to use, there are many much better ones available!

Porting to the D Programming Language

Looking through my old C programs, I thought this would be an interesting one to port to D. It would gain some memory safety and would give me the chance to improve it.

The port was fairly straightforward; the other guides cover most of what was done. Loops were changed to foreach, dynamic arrays were used, etc...

After this, the program worked, but on my test text file it seemed slow to decompress. Compression was fine, but decompression was noticeably slow.

Benchmarking

To get a rough idea of the time taken, we can use StopWatch from std.datetime.stopwatch:

import std.datetime.stopwatch;
...
StopWatch sw;
sw.start();

go_decompress(output,message);

long msecs = sw.peek.total!"msecs";
writeln("Elapsed time: ",msecs,"ms");

This showed that compression took around 12ms, which seemed fine, but decompression took over 300ms! This was terrible! My computer is not the fastest (an old Pentium Dual-Core), which may be why I noticed this in the first place. The downside of faster PCs is that they often mask inefficiency. On an i5, I doubt this problem would even have been noticed...
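
As an aside, std.datetime.stopwatch also provides a benchmark helper that runs one or more functions a given number of times and returns the elapsed durations, which is handy for repeated measurements. A minimal sketch, where work() is just a placeholder for the code being measured:

import std.datetime.stopwatch : benchmark;
import std.stdio;

void work(){
	// The code being measured would go here
}

void main(){
	auto results=benchmark!(work)(100); // Run work() 100 times
	writeln("Total: ",results[0].total!"msecs","ms");
}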

The Original Decompression Function

This is pretty much the same method as the C version: a block of bits gets built up and compared against a structure containing the bit codes, and if there is a match, a value is output.

bool decompress(ubyte[] input,ref ubyte[] output){
	int l;     // Bits collected so far for the current code
	uint code; // The code being built, stored from the top bit down
	
	foreach(ubyte ch;input){
		// Moving mask to pick out each bit of the byte in turn
		ubyte va=cast(ubyte)0x80;
		foreach(i;0..8){
			if ((ch & va)==va ){
				code+=(0x80000000>>l);
			}
			va=(va & 0xff)>>1;
			
			l++;
			// No code is shorter than 2 bits, so only check from there
			if (l>1){			
				auto outcode=codematch(code);
				if (outcode>=0){
					// Match found: output the byte and reset for the next code
					code=0;
					l=0;
					output~=node[outcode].value;
				}
			}
		}
	}
	return true;
}

It's simple but involves lots of loops. The C version saved directly to a file, but this version outputs to an array and then saves the array.

Optimising!

This involved a bit of trial and error to find the actual issue, so here is everything tried, in order:

Removing a Shift

The bit mask (va) moves along the loaded byte to pick out each bit in turn. Shifts are usually quick, but as we only ever check 8 bit positions, the masks can be precomputed into an array.

// Saves doing lots of shifts
ubyte[8] va=[0x80,0x40,0x20,0x10,0x08,0x04,0x02,0x01];
...
...
foreach(i;0..8){
	if ((ch & va[i])==va[i] ){
		code+=(0x80000000>>l);
	}		
	...
}

Cleaner but gave no noticeable improvement.

Better Memory Allocation

Increasing the size of a dynamic array can be costly: it may involve an allocation and a copy, depending on how the memory management works. Doing a GC profile did not give any indication of a problem. Often memory is over-allocated to avoid too much of a penalty, but we can help by starting with a larger buffer and growing it as we go:

// Assume that the output is at least as big as the input
output.length=input.length;

...

output[size]=node[outcode].value;
size++;
// Reached the current limit of the array so increase it!
if (size==output.length){
	output.length+=1024;
}
...

// Set the array to the actual size so we can easily write it to file later
output.length=size;

Again, a little tidier, but still no real change to the time taken!
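
An alternative worth mentioning: D's dynamic arrays also have a reserve function that pre-allocates capacity without changing the length, so appends with ~= stay cheap and there is no need to track the size by hand. A minimal sketch of that approach:

void main(){
	ubyte[] output;
	output.reserve(1024); // Pre-allocate capacity, length stays 0
	foreach(i;0..100){
		output~=cast(ubyte)i; // No reallocation until the capacity is used up
	}
	assert(output.length==100); // Length only reflects what was appended
}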

Using an Associative Array

Running out of things to improve, I realised that the only real work being done is checking whether the bit code exists and getting its value. Given that the bit code is checked every time it grows, the codematch() function gets called a lot. This is the function:

// Linear search of the code table, returning the index of a match or -1
private int codematch(uint input){
	auto i=0;
	foreach(n;node){
		if (n.code==input){return i;}
		i++;
	}
	return -1;
}

Thinking an associative array could probably do this, the next change was:

// Make our Associative Array
ubyte[int] map;
foreach(n;node){
	map[n.code]=n.value;
}

..
		
// See if the bit code exists, if so, get the value
if (code in map){
	output[size]=map[code];
	..	
}

I did not expect much difference really, thinking that the underlying search would still be some kind of loop, but this reduced the decompression time to around 12ms. Nearly 30 times faster! Clearly this function was the bottleneck!
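
One further tweak D allows here: the in operator actually returns a pointer to the value (or null if the key is missing), so the existence check and the fetch can be combined into a single lookup:

// 'in' gives a pointer to the value, so one probe does both jobs
if (auto p=code in map){
	output[size]=*p;
	..
}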

Final Version

This is the current version of the decompression code with all of the above changes.

bool decompress(ubyte[] input,ref ubyte[] output){
	int l;
	uint code;
	auto size=0;
	
	// Assume that the output is at least as big as the input
	output.length=input.length;
	
	// Saves doing lots of shifts
	ubyte[8] va=[0x80,0x40,0x20,0x10,0x08,0x04,0x02,0x01];
	
	// Make our Associative Array
	ubyte[int] map;
	foreach(n;node){
		map[n.code]=n.value;
	}
	
	foreach(ubyte ch;input){
		foreach(i;0..8){
			if ((ch & va[i])==va[i] ){
				code+=(0x80000000>>l);
			}
			
			l++;
			
			// See if the bit code exists, then get the value			
			if (code in map){
				output[size]=map[code];
				code=0;
				l=0;
				size++;
				
				// Reached the current limit of array so increase!
				if (size==output.length){
					output.length+=1024;
				}
			}
		}
	}
	output.length=size;
	return true;
}
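
For completeness, a hypothetical call site might look like the following; the file names are made up and the node code table is assumed to have already been loaded:

void main(){
	import std.file;
	ubyte[] compressed=cast(ubyte[])read("input.sqz"); // Hypothetical input file
	ubyte[] message;
	if (decompress(compressed,message)){ // decompress() from above
		write("output.txt",message); // Hypothetical output file
	}
}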

Conclusions

To improve your D programs, a good starting point is using StopWatch and checking the Garbage Collector profile log. Not reinventing the wheel is normally a good call too; generally, a built-in or standard library function is faster, better and/or easier to use.
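
For reference, the GC profile log mentioned above needs no code changes; it can be produced by passing an option to the D runtime when running the program (the program name here is just an example):

./squish --DRT-gcopt=profile:1 somefile.txt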

In this example, although the speed-up was large, this program is not critical, and waiting 0.3 seconds instead of 0.01 seconds is not the end of the world. However, if this was a key program that ran often, then the improvements could be the difference between needing a single server or 30! For the benefit of your users and the planet, it is worth running some benchmarks on your code!

Created 11/02/24