I spent a few hours today debugging this; GPU programming is its own special form of hell. It turns out I had to disable ECC and reset the GPU (index 1 in my case):
nvidia-smi -i 1 -e 0
nvidia-smi -i 1 -r
Naturally, there was no indication of an ECC error having occurred, and just resetting the GPU on its own doesn’t help matters.
Sigh.
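For the record, here is a rough sketch of how the two commands above could be wrapped so the order (disable ECC first, then reset) doesn't get forgotten. The helper names `run` and `ecc_off_and_reset` and the `DRY_RUN` variable are made up for illustration. The extra `--query-gpu` line is my own addition for checking the ECC mode beforehand and assumes your driver's nvidia-smi supports the `ecc.mode.current` and `ecc.mode.pending` fields. Changing ECC mode and resetting typically need root, and the reset will likely fail if anything is still using the GPU.

```shell
#!/bin/sh
# Sketch only: disable ECC on one GPU, then reset it.
# run, ecc_off_and_reset and DRY_RUN are hypothetical names; only the
# -i/-e/-r flags come from the commands in the post.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

ecc_off_and_reset() {
    gpu="$1"
    # Read-only check of the current and pending ECC mode (assumed fields).
    run nvidia-smi -i "$gpu" --query-gpu=ecc.mode.current,ecc.mode.pending --format=csv
    # Disable ECC on the selected GPU; stop if that fails.
    run nvidia-smi -i "$gpu" -e 0 || return 1
    # Reset the same GPU, as in the original commands.
    run nvidia-smi -i "$gpu" -r || return 1
}
```

Running `DRY_RUN=1 ecc_off_and_reset 1` only prints the three nvidia-smi commands, which is a safe way to check what would happen before running it for real as root.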