{"id":2109,"date":"2009-04-02T21:58:59","date_gmt":"2009-04-03T01:58:59","guid":{"rendered":"https:\/\/dwan.org\/?p=2109"},"modified":"2020-11-27T17:59:36","modified_gmt":"2020-11-27T22:59:36","slug":"we-never-failed-to-fail","status":"publish","type":"post","link":"https:\/\/dwan.org\/index.php\/2009\/04\/02\/we-never-failed-to-fail\/","title":{"rendered":"We never failed to fail"},"content":{"rendered":"\n<p>I spent my day waiting for a system crash.<\/p>\n\n\n\n<p>We&#8217;ve got this serious, hardcore system lockup that happens <strong>sometimes<\/strong> with the customer of the moment. Naturally, it only happens when the system is flat-out running their most important computation. Naturally, it can reasonably be said that <code>if this code doesn't work, then the system is useless to us<\/code>.<\/p>\n\n\n\n<p>Basically, if I get all the systems rocking on a parallel task, then <em>sometimes<\/em> after a few hours, one of them will crash hard enough that it doesn&#8217;t respond to <code>ping<\/code>. Naturally this takes the whole parallel job with it. Ctrl-alt-delete (through the KVM) doesn&#8217;t cut it. Needs a finger-on-the-button hard reboot. Of course, I have remote control over the power outlets for these systems &#8230; so I don&#8217;t have to fly to Maryland every time this happens &#8230; but still.<\/p>\n\n\n\n<p>So, I started down the list: Power, cooling, bad memory, flaky filesystems, &#8230;<\/p>\n\n\n\n<p>At the same time I was working down another list: Too much memory use? Too many tasks colliding on some secret lock-file? Oversubscribing the NFS server?<\/p>\n\n\n\n<p>And yet a third: Bad input? Crappy data files?<\/p>\n\n\n\n<p>Finally, I would get the system rocking and go on to other tasks &#8230; until after a while I would exclaim:<\/p>\n\n\n\n<p>MOTHER-FUCKER!<\/p>\n\n\n\n<p>Then I would try again.<\/p>\n\n\n\n<p>Thought I got it. I really did. Left. Dropped the laptop at the hotel. Went to yoga. 
Good, relaxing, and the knee even handled it well.<\/p>\n\n\n\n<p>When I got back to the hotel, I looked at the computer and said:<\/p>\n\n\n\n<p>MOTHER-FUCKER!<\/p>\n\n\n\n<p>Debug debug debug &#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I spent my day waiting for a system crash. We&#8217;ve got this serious, hardcore system lockup that happens sometimes with the customer of the moment. Naturally, it only happens when the system is flat-out running their most important computation. Naturally, it can reasonably be said that&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[26,33],"tags":[],"class_list":["post-2109","post","type-post","status-publish","format-standard","hentry","category-consulting","category-technology"],"_links":{"self":[{"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/posts\/2109","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/comments?post=2109"}],"version-history":[{"count":1,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/posts\/2109\/revisions"}],"predecessor-version":[{"id":2110,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/posts\/2109\/revisions\/2110"}],"wp:attachment":[{"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/media?parent=2109"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/categories?post=2109"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/tags?post=2109"}],"curies":[{"name":"wp","href":"https:\/\/api.w
.org\/{rel}","templated":true}]}}