We never failed to fail

I spent my day waiting for a system crash.

We’ve got this serious, hardcore system lockup that happens sometimes with the customer of the moment. Naturally, it only happens when the system is flat-out running their most important computation. Naturally, it can reasonably be said that if this code doesn’t work, then the system is useless to us.

Basically, if I get all the systems rocking on a parallel task, then sometimes after a few hours, one of them will crash hard enough that it doesn’t respond to ping. Naturally, this takes the whole parallel job with it. Ctrl-alt-delete (through the KVM) doesn’t cut it. Needs a finger-on-the-button hard reboot. Of course, I have remote control over the power outlets for these systems … so I don’t have to fly to Maryland every time this happens … but still.

So, I started down the list: Power, cooling, bad memory, flaky filesystems, …

At the same time, I was working down another list: Too much memory use? Too many tasks, colliding on some secret lock-file? Oversubscribing the NFS server?

And yet a third: Bad input? Crappy data files?

Finally, I would get the system rocking and go on to other tasks … until after a while I would exclaim:

MOTHER-FUCKER!

Then I would try again.

Thought I got it. I really did. Left. Dropped the laptop at the hotel. Went to yoga. Good, relaxing, and the knee even handled it well.

When I got back to the hotel, I looked at the computer and said:

MOTHER-FUCKER!

Debug debug debug …


