
(GitHub repo)

This is a project that I've been working on a little bit here and there for the last year or so. It's based on the classic interview problem that goes something like this:

Given an integer n, generate all sequences of n pairs of well-formed parentheses.

I took that to its logical extreme. What if n is greater than 10 or so? How fast can I make this?

Turns out I can make it Pretty Fast.

Starting out

When I was initially writing my solution, I didn't look at any existing solution so my mind would not be Tainted before I made an attempt. However, it's worthwhile to look at a basic solution to get a baseline.

The highest-rated submission on LeetCode.com is pretty neat. It even claims it's "faster than 96%". Surely it must be pretty fast, right? I'll just modify it a little bit to print to stdout instead of returning, since we'll be dealing with sizes too large to fit in memory. I'll also tweak the ordering to match later iterations.

def generateParenthesis(n: int):
    def dfs(left, right, s):
        if len(s) == n * 2:
            print(s[::-1])
            return 

        if right < left:
            dfs(left, right + 1, s + '(')

        if left < n:
            dfs(left + 1, right, s + ')')

    dfs(0, 0, '')

generateParenthesis(20)

Measuring using pv, I get about 40 MiB/s, which is about 978 ns per line.

Of course, I didn't start with Python; I started with C. I didn't write this step when I was originally working on my solution, but let's port that Python implementation.

#include <stdio.h>

#define SIZE 20

struct buf {
	char b[SIZE * 2 + 1];
};

static void
recurse(int left, int right, int len, struct buf s)
{
	if (len == SIZE * 2) {
		puts(s.b);
		return;
	}

	if (right < left) {
		s.b[SIZE * 2 - len - 1] = '(';
		recurse(left, right + 1, len + 1, s);
	}

	if (left < SIZE) {
		s.b[SIZE * 2 - len - 1] = ')';
		recurse(left + 1, right, len + 1, s);
	}
}

int
main(void)
{
	struct buf s = {0};
	recurse(0, 0, 0, s);
}

Building with

gcc -O3 -Wall -Wextra ./naive.c -o naive

We get around 675 MiB/s (58 ns per line). A considerable improvement, and actually much better than my original implementations. Maybe the "96%" claim had something to it.

But I can still do better.

Making it fast

Representing a parentheses string as a byte string is cool and all, but there are really only two states: open parenthesis and close parenthesis. Even better, they are adjacent to each other in ASCII: ( is 0x28 and ) is 0x29. We can make our whole buffer a single integer.
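
Since the two characters differ only in their low bit, converting between a bit and a byte is plain arithmetic. A tiny sketch of the idea (the helper names are just for illustration, not code from the repo), using the convention that a set bit is an open parenthesis, same as the bit strings below:

/* ASCII adjacency: '(' is 0x28 and ')' is 0x29 */
static char bit_to_char(unsigned bit) { return (char)(')' - bit); } /* 1 -> '(' , 0 -> ')' */
static unsigned char_to_bit(char c)   { return !(c & 1); }          /* '(' -> 1 , ')' -> 0 */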

What about computing the next string? Recursion is Slow, so instead let's find an iterative algorithm.

Starting off, let's look at two consecutive sets of parentheses, represented as bit strings.

1010111000 = ()()((()))
1011001010 = ()(())()()

Hmm.

This looks suspiciously like we're just doing arithmetic in a Weird Base or something along those lines. We add a 1 shifted over to the first set bit, and then reset one less than the number of carried bits to their original positions.

Let's break that down a bit.

First, we add a 1 at the position of the lowest set bit:

  1010111000
+       1000
= 1011000000

Next, we reset one less than the number of carried bits to their original positions. In this case, there were three carried bits, so we'll reset two.

  1010111000
      ^^^ carried in previous step

  1011000000
+       1010
= 1011001010

And voilà, we have the next set of parentheses.
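
If you want to play with this before reaching for any intrinsics, here's a rough sketch of that step in portable C, plus a small driver that walks from the fully alternating pattern ()()...() up to the fully nested one. The names and loop bounds here are just for illustration, not the code from the repo:

#include <stdint.h>
#include <stdio.h>

/* Compute the next parentheses bitmask: add the lowest set bit, then put
 * all but one of the carried bits back at their "home" (alternating)
 * positions. */
static uint64_t
next_paren(uint64_t curr)
{
	uint64_t low     = curr & (0 - curr);        /* lowest set bit */
	uint64_t added   = curr + low;               /* step 1: the carry */
	uint64_t carried = curr & ~added;            /* the run of 1s that rippled away */
	int k = __builtin_popcountll(carried) - 1;   /* reset one less than that */

	uint64_t rst = 0;
	for (int i = 0; i < k; i++)
		rst |= 1ull << (2 * i + 1);          /* bits 1, 3, 5, ... */

	return added | rst;                          /* step 2: restore them */
}

int
main(void)
{
	int n = 3;                                           /* small, for illustration */
	uint64_t curr = 0, last = ((1ull << n) - 1) << n;    /* ((( ... ))) */
	for (int i = 0; i < n; i++)
		curr |= 1ull << (2 * i + 1);                 /* ()() ... () */

	for (;;) {
		for (int i = 2 * n - 1; i >= 0; i--)
			putchar((curr >> i) & 1 ? '(' : ')');
		putchar('\n');
		if (curr == last)
			break;
		curr = next_paren(curr);
	}
}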

Digression: Missed optimizations

For a long time, my "next paren" routine looked something like this:

inline static uint64_t
next_paren_bitmask(uint64_t curr)
{
	// first set bit
	const uint64_t first = _tzcnt_u64(curr);

	// number of contig bits grouped with first
	const uint64_t contig = _tzcnt_u64(~(curr >> first));

	// original bit positions
	const uint64_t orig = 0xAAAAAAAAAAAAAAAA; // 0b1010...

	// reset bits
	const uint64_t rst = _pdep_u64((1 << (contig - 1)) - 1, orig);

	// carried bits
	const uint64_t add = curr + (1 << first);

	return add | rst;
}

This worked, and it carried me to around 7 GiB/s, but there's a pretty major optimization that gcc missed.

Take a look at this function and what happens under gcc and clang (Compiler Explorer).

uint64_t 
sum(uint64_t val) 
{
    return 1ull << _tzcnt_u64(val);
}

gcc generates this assembly:

sum:
        tzcnt   rdi, rdi
        mov     eax, 1
        shlx    rax, rax, rdi
        ret

This is pretty much exactly what you would expect, and it's hard to complain about. That is, until you see clang's codegen:

sum:
        blsi    rax, rdi
        ret

Woah.

That's so much better. It turns out that blsi is an instruction that does exactly what we were trying to do - it isolates the lowest set bit in an integer. According to the Intel Intrinsics Guide, it's equivalent to (-a) & a. That's pretty cool. What makes it even better is that blsi is way faster than tzcnt; it's both lower latency and higher throughput. I hypothesize that this is because it can leverage existing (and fast) arithmetic circuit designs rather than doing a full count of zeroes.

I even looked at gcc's optimizer, and it seems incapable of emitting a blsi for this pattern on its own.
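
For what it's worth, spelling out the (-a) & a identity by hand is enough to get the instruction; in my experience both gcc and clang will turn this into a single blsi when BMI1 is enabled. A tiny sketch:

#include <stdint.h>

/* Isolate the lowest set bit; with -mbmi this typically compiles to blsi. */
uint64_t
lowest_set_bit(uint64_t val)
{
	return val & (0 - val);
}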

I fixed that issue and experimented with some other ways of restoring the carried bits, and here's what I've ended up with:

[[gnu::always_inline]]
static inline uint64_t
next_paren_bitmask(uint64_t curr)
{
	// first set bit
	const uint64_t first = _tzcnt_u64(curr);

	// carry forward
	const uint64_t add = curr + _blsi_u64(curr);

	// number of contig bits grouped with first
	// const uint64_t contig = _tzcnt_u64(add) - first;

	// original bit positions
	const uint64_t orig = 0xAAAAAAAAAAAAAAAA; // 0b1010...

	// the bits that are moved back to their original position.
	// I've found quite a bit of speedup through these iterations
	// const uint64_t rst = _pdep_u64((1 << (contig - 1)) - 1, orig);
	// const uint64_t rst = orig & ((1 << (contig * 2 - 1)) - 1);
	// const uint64_t rst = orig & ((1 << ((contig << 1) - 1)) - 1);
	// const uint64_t rst = _bextr_u64(orig, 0, contig * 2 - 2);
	// const uint64_t rst = _bzhi_u64(orig, contig * 2 - 1);
	// const uint64_t rst = _pdep_u64((_blsi_u64(add) >> (first + 1)) - 1, orig);
	// const uint64_t rst = _bzhi_u64(orig << 1, contig * 2) >> 1;
	// const uint64_t rst = _pdep_u64(_blsmsk_u64(add >> 2) >> first, orig);
	// const uint64_t rst = _pdep_u64(_blsmsk_u64(add) >> first, orig) >> 4;
	const uint64_t rst = _pdep_u64(_blsmsk_u64(add) >> (first + 2), orig);

	const uint64_t ret = add | rst;

	return ret;
}

This is the code that helped bring my program up to 11.3 GiB/s, or 3.38 ns per line.

IO

Another question is how to actually output all of these parentheses. write(2) is too slow.

This was a pretty easy question to answer for me. My main source of inspiration was htfizzbuzz, which uses vmsplice(2) to output to a Linux pipe. It works by letting the process reading from the pipe read directly out of the writing process's memory. It's buggy as heck, and it's definitely a security risk if used in the real world due to how easy it is to corrupt the buffer, but it's wicked fast, and that's what I care about.
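
The core of that approach is a single call. Here's a minimal sketch of handing a finished buffer to the pipe; this is the general technique rather than my exact code, stdout has to actually be a pipe, and the buffer can't be reused until the reader has drained it, which is exactly where the corruption risk comes from:

#define _GNU_SOURCE
#include <fcntl.h>     /* vmsplice */
#include <stdio.h>     /* perror */
#include <sys/uio.h>   /* struct iovec */
#include <unistd.h>    /* STDOUT_FILENO, _exit */

/* Hand the pages of a fully written buffer to the pipe without copying
 * them. The reader ends up reading straight out of our memory, so buf
 * must stay untouched until it has been consumed (hence double buffering). */
static void
splice_out(char *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	while (iov.iov_len > 0) {
		ssize_t n = vmsplice(STDOUT_FILENO, &iov, 1, 0);
		if (n < 0) {
			perror("vmsplice");
			_exit(1);
		}
		iov.iov_base = (char *)iov.iov_base + n;
		iov.iov_len -= (size_t)n;
	}
}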

When it comes to actually writing the output, I use AVX2 instructions to broadcast the bits out to whole bytes and add the result to a buffer of precomputed parentheses and line feeds. That gets spliced to stdout and I get to enjoy big number.
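
I won't walk through the whole output routine here, but the bit-to-byte broadcast at its core might look roughly like this. It's a sketch handling 32 bits at a time rather than my exact code; the real routine also has to interleave line feeds and deal with 40-character lines. Note the convention flip: adding a set bit to '(' produces ')', so a mask that uses 1 = '(' gets complemented first.

#include <immintrin.h>
#include <stdint.h>

/* Expand the low 32 bits of mask to 32 bytes (one bit per byte) and turn
 * each into an ASCII parenthesis by adding the bit to '(' (0x28).
 * Requires AVX2. */
static inline __m256i
bits_to_parens(uint32_t mask)
{
	/* replicate the four mask bytes, eight copies each */
	const __m256i bytes = _mm256_shuffle_epi8(
		_mm256_set1_epi32((int)mask),
		_mm256_setr_epi8(0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,
		                 2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3));

	/* output byte i keeps only bit (i % 8) of its mask byte */
	const __m256i sel  = _mm256_set1_epi64x((long long)0x8040201008040201ULL);
	const __m256i hits = _mm256_cmpeq_epi8(_mm256_and_si256(bytes, sel), sel);

	/* 0xFF -> 1, then 0x28 + {0, 1} gives '(' or ')' */
	return _mm256_add_epi8(_mm256_set1_epi8('('),
	                       _mm256_and_si256(hits, _mm256_set1_epi8(1)));
}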

Conclusion

This was a very fun project, and it was very satisfying pushing the throughput number up. I've done all my testing with 20 pairs, and at that size, I'm bottlenecked by my ability to calculate the next set. If there are still improvements to be made, I think they will be found there.

Regardless of future improvements, I'm pretty happy with my ~300x speedup over the original Python version.