This is a project that I've been working on a little bit here and there for the last year or so. It's based on the classic interview problem that goes something like this:
Given an integer n, generate all sequences of n pairs of well-formed parentheses.
I took that to its logical extreme. What if n is greater than 10 or so? How fast can I make this?
Turns out I can make it Pretty Fast.
Starting out
When I was initially writing my solution, I didn't look at any existing solutions, so my mind would not be Tainted before I made an attempt. However, it's worthwhile to look at a basic solution to get a baseline.
The highest-rated submission on Leetcode.com is pretty neat. It even claims it's "faster than 96%". Surely it must be pretty fast, right? I'll just modify it a little bit to print to stdout instead of returning since we'll be dealing with sizes too large to fit in memory. I'll also tweak the ordering to match later iterations.
def generateParenthesis(n: int):
    def dfs(left, right, s):
        if len(s) == n * 2:
            # the string is built back-to-front, so reverse before printing
            print(s[::-1])
            return
        if right < left:
            dfs(left, right + 1, s + '(')
        if left < n:
            dfs(left + 1, right, s + ')')
    dfs(0, 0, '')

generateParenthesis(20)
Measuring using pv, I get about 40 MiB/s, which is about 978 ns per line.
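For reference, the measurement is just piping into pv and throwing the bytes away, something along these lines (script name illustrative):
python3 parens.py | pv > /dev/null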
Of course, I didn't start with Python; I started with C. I didn't write this step when I was initially writing my solution, but let's port that Python implementation.
#include <stdio.h>

#define SIZE 20

struct buf {
	char b[SIZE * 2 + 1];
};

static void
recurse(int left, int right, int len, struct buf s)
{
	if (len == SIZE * 2) {
		puts(s.b);
		return;
	}
	// fill the buffer back-to-front, mirroring the Python version
	if (right < left) {
		s.b[SIZE * 2 - len - 1] = '(';
		recurse(left, right + 1, len + 1, s);
	}
	if (left < SIZE) {
		s.b[SIZE * 2 - len - 1] = ')';
		recurse(left + 1, right, len + 1, s);
	}
}

int
main(void)
{
	struct buf s = {0}; // zero-initialized, so the terminating NUL is free
	recurse(0, 0, 0, s);
}
Building with
gcc -O3 -Wall -Wextra ./naive.c -o naive
We get around 675 MiB/s (58 ns per line). A considerable improvement, and actually much better than my original implementations. Maybe the "96%" claim had something to it.
But I can still do better.
Making it fast
Representing a parentheses string as a byte string is cool and all, but there are really only two states: open parenthesis and close parenthesis. Even better, they are adjacent to each other in ASCII: ( is 0x28 and ) is 0x29. We can make our whole buffer a single integer.
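To make the representation concrete, something like this sketch decodes a mask back into a string, assuming the convention I'll use for the bit strings below: the most significant of the 2n low bits is the first character, and a set bit means (.

#include <stdint.h>

// Decode an n-pair parentheses bitmask into a NUL-terminated string.
// Convention (assumed): bit 2n-1 is the first character, 1 means '('.
static void
decode(uint64_t mask, int n, char *out)
{
	for (int i = 0; i < n * 2; i++)
		out[i] = (mask >> (n * 2 - 1 - i)) & 1 ? '(' : ')';
	out[n * 2] = '\0';
}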
What about computing the next string? Recursion is Slow, so instead let's find an iterative algorithm.
Starting off, let's look at two consecutive sets of parentheses, represented as bit strings.
1010111000 = ()()((()))
1011001010 = ()(())()()
Hmm.
This looks suspiciously like we're just doing arithmetic in a Weird Base or something along those lines. We add 1 shifted over to the first set bit, and then reset one less than the number of carried bits to their original positions.
Let's break that down a bit.
First, we add a one at the position of the lowest set bit:
  1010111000
+       1000
= 1011000000
Next, we reset one less than the number of carried bits to their original positions. In this case, there were three carried bits, so we'll reset two.
  1010111000
      ^^^ carried in previous step
  1011000000
+       1010
= 1011001010
And voilà, we have the next set of parentheses.
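Putting the whole loop together with portable bit tricks looks something like this (a sketch, not the project's actual code; the real routine below uses BMI intrinsics, and __builtin_ctzll is the GCC/Clang builtin for counting trailing zeros):

#include <stdint.h>
#include <stdio.h>

#define N 20

// Portable sketch of the step described above.
static uint64_t
next_paren(uint64_t curr)
{
	const uint64_t lowest = curr & -curr;  // isolate lowest set bit
	const uint64_t add = curr + lowest;    // carry it forward
	// how many bits the carry consumed
	const uint64_t carried = __builtin_ctzll(add) - __builtin_ctzll(curr);
	// reset carried - 1 bits to their original 1010... positions
	const uint64_t rst = 0xAAAAAAAAAAAAAAAAull & ((1ull << (2 * carried - 1)) - 1);
	return add | rst;
}

int
main(void)
{
	uint64_t curr = 0xAAAAAAAAAAAAAAAAull >> (64 - 2 * N); // ()()...(), the first string
	const uint64_t last = ((1ull << N) - 1) << N;          // ((...)), the last string
	char line[2 * N + 1];
	line[2 * N] = '\0';
	for (;;) {
		// decode MSB-first, as before
		for (int i = 0; i < 2 * N; i++)
			line[i] = (curr >> (2 * N - 1 - i)) & 1 ? '(' : ')';
		puts(line);
		if (curr == last)
			break;
		curr = next_paren(curr);
	}
}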
Digression: Missed optimizations
For a long time, my "next paren" routine looked something like this:
// needs <immintrin.h>, compiled with BMI2 support (e.g. -mbmi2)
inline static uint64_t
next_paren_bitmask(uint64_t curr)
{
	// first set bit
	const uint64_t first = _tzcnt_u64(curr);
	// number of contig bits grouped with first
	const uint64_t contig = _tzcnt_u64(~(curr >> first));
	// original bit positions
	const uint64_t orig = 0xAAAAAAAAAAAAAAAA; // 0b1010...
	// reset bits
	const uint64_t rst = _pdep_u64((1ull << (contig - 1)) - 1, orig);
	// carried bits
	const uint64_t add = curr + (1ull << first);
	return add | rst;
}
This worked, and it carried me to around 7 GiB/s, but there's a pretty major optimization that gcc missed.
Take a look at this function and what happens under gcc and clang (Compiler Explorer).
uint64_t
sum(uint64_t val)
{
return 1ull << _tzcnt_u64(val);
}
gcc generates this assembly:
sum:
tzcnt rdi, rdi
mov eax, 1
shlx rax, rax, rdi
ret
This is pretty much exactly what you would expect, and it's hard to complain about. That is, unless you saw Clang's codegen:
sum:
blsi rax, rdi
ret
Woah.
That's so much better. It turns out that blsi is an instruction that does exactly what we were trying to do - it isolates the lowest set bit in an integer. According to the Intel Intrinsics Guide, it's equivalent to (-a) & a. That's pretty cool. What makes it even better is that blsi is way faster than tzcnt; it's both lower latency and higher throughput. I hypothesize that this is because it can leverage existing (and fast) arithmetic circuit designs rather than doing a full count of zeroes.
I even looked at gcc's optimizer, and it seems like it's completely incapable of inserting a blsi instruction.
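That's easy enough to work around by asking for the instruction explicitly (a quick sketch; note that the two functions differ at val == 0, where blsi returns 0 but the shlx version returns 1 because the shift count gets masked):

#include <immintrin.h> // _blsi_u64, requires BMI1 (e.g. -mbmi)

uint64_t
sum_fast(uint64_t val)
{
	// isolates the lowest set bit; compiles to a single blsi
	return _blsi_u64(val);
}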
I fixed that issue and experimented with some other ways of restoring the carried bits, and here's what I've ended up with:
[[gnu::always_inline]]
static inline uint64_t
next_paren_bitmask(uint64_t curr)
{
	// index of the first set bit
	const uint64_t first = _tzcnt_u64(curr);
	// carry forward
	const uint64_t add = curr + _blsi_u64(curr);
	// number of contig bits grouped with first
	// const uint64_t contig = _tzcnt_u64(add) - first;
	// original bit positions
	const uint64_t orig = 0xAAAAAAAAAAAAAAAA; // 0b1010...
	// the bits that are moved back to their original position.
	// I've found quite a bit of speedup through these iterations
	// const uint64_t rst = _pdep_u64((1 << (contig - 1)) - 1, orig);
	// const uint64_t rst = orig & ((1 << (contig * 2 - 1)) - 1);
	// const uint64_t rst = orig & ((1 << ((contig << 1) - 1)) - 1);
	// const uint64_t rst = _bextr_u64(orig, 0, contig * 2 - 2);
	// const uint64_t rst = _bzhi_u64(orig, contig * 2 - 1);
	// const uint64_t rst = _pdep_u64((_blsi_u64(add) >> (first + 1)) - 1, orig);
	// const uint64_t rst = _bzhi_u64(orig << 1, contig * 2) >> 1;
	// const uint64_t rst = _pdep_u64(_blsmsk_u64(add >> 2) >> first, orig);
	// const uint64_t rst = _pdep_u64(_blsmsk_u64(add) >> first, orig) >> 4;
	const uint64_t rst = _pdep_u64(_blsmsk_u64(add) >> (first + 2), orig);
	const uint64_t ret = add | rst;
	return ret;
}
This is the code that helped bring my program up to 11.3 GiB/s, or 3.38 ns per line.
IO
Another question is how to actually output all of these parentheses. write(2) is too slow.
This was a pretty easy question to answer for me. My main source of inspiration was htfizzbuzz, which uses vmsplice(2) to output to a Linux pipe. It works by letting the process reading from the pipe read directly out of the memory of the writing process. It's buggy as heck, and it's definitely a security risk if used in the real world due to how easy it is to corrupt the buffer, but it's wicked fast, and that's what I care about.
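The core of the output path then looks something like this (a stripped-down sketch, not the project's exact code; the real thing has to double-buffer, since the reader may still be looking at pages that were handed to it):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>

// Hand a buffer's pages to the pipe reader with vmsplice(2).
// Assumes fd 1 is actually a pipe.
static void
splice_out(char *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	while (iov.iov_len > 0) {
		const ssize_t n = vmsplice(1, &iov, 1, 0);
		if (n < 0) {
			perror("vmsplice");
			return;
		}
		iov.iov_base = (char *)iov.iov_base + n;
		iov.iov_len -= (size_t)n;
	}
}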
When it comes to actually writing the output, I use AVX2 instructions to broadcast the bits to entire bytes, and add them to a buffer of precomputed parentheses and line feeds. That gets spliced to stdout and I get to enjoy big number.
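The broadcast itself is the usual shuffle-and-compare trick. Roughly (a sketch for 32 bits at a time, LSB-first; the real buffer layout also has to account for the reversed string order and the interleaved line feeds):

#include <immintrin.h>
#include <stdint.h>

// Sketch: expand 32 bits into 32 parenthesis bytes.
// Bit i of `bits` becomes byte i of the result.
static __m256i
bits_to_parens(uint32_t bits)
{
	// replicate the 4 source bytes so byte i of the vector
	// holds the source byte containing bit i
	const __m256i shuf = _mm256_setr_epi8(
	    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
	    2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3);
	const __m256i sel = _mm256_set1_epi64x(0x8040201008040201ull);
	__m256i v = _mm256_set1_epi32((int)bits);
	v = _mm256_shuffle_epi8(v, shuf);
	v = _mm256_and_si256(v, sel);
	// 0xFF where the bit is set, 0x00 where it is clear
	v = _mm256_cmpeq_epi8(v, sel);
	// ')' is 0x29; adding 0xFF (i.e. -1) to the set lanes gives '(' (0x28)
	return _mm256_add_epi8(_mm256_set1_epi8(0x29), v);
}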
Conclusion
This was a very fun project, and it was very satisfying pushing the throughput number up. I've done all my testing with 20 pairs, and at that size, I'm bottlenecked by my ability to calculate the next set. If there are still improvements to be made, I think they will be found there.
Regardless of future improvements, I'm pretty happy with my ~300x speedup over the original Python version.