{"id":374,"date":"2017-12-12T07:04:29","date_gmt":"2017-12-12T12:04:29","guid":{"rendered":"https:\/\/dwan.org\/?p=374"},"modified":"2019-10-25T15:05:33","modified_gmt":"2019-10-25T19:05:33","slug":"two-clouds-and-a-bicycle","status":"publish","type":"post","link":"https:\/\/dwan.org\/index.php\/2017\/12\/12\/two-clouds-and-a-bicycle\/","title":{"rendered":"Two clouds and a bicycle"},"content":{"rendered":"<p>Yesterday, I got to solve a puzzle that I\u2019ve solved more than a few times before.  <em>\u201cHow do we get a moderate amount of data from <em>here<\/em> to <em>there<\/em> in reasonably short order?\u201d<\/em><\/p>\n<p>The specifics in this sort of puzzle change all the time.  What constitutes a moderate amount of data? What sort of distance matters between \u201chere,\u201d and \u201cthere?\u201d Is it miles? vendors? representation?<\/p>\n<p>Other factors, like the timeline, remain constant. \u201cReasonably short order,\u201d usually means \u201cby tomorrow morning,\u201d give or take.  It\u2019s a short but achievable timeline where smart decisions to reduce latencies in the real world (start immediately!) matter as much as or more than decisions about transfer protocols or software tools that might give a more perfect use of resources.<\/p>\n<p>It\u2019s a case where the perfect is the enemy of the good, and where having seen puzzles like this before confers an advantage. As is true throughout life, having friends helps too.<\/p>\n<p>Here\u2019s the story:<\/p>\n<h3>500GB on a jump drive<\/h3>\n<p><a href=\"http:\/\/diamondagedatascience.com\/\">Diamond Age Data Science<\/a> is a consulting company in the business of computational biology and bioinformatics. The founder texted me in the early afternoon: They had agreed to do a particular analysis, \u201cbefore the end of the year.\u201d They planned to use Amazon\u2019s cloud, and the source data was on a 1TB USB disk at <a href=\"https:\/\/labcentral.org\">LabCentral<\/a> in Cambridge, MA. 
The customer had tried a sensible thing \u2013 plug the disk into a laptop and use the GUI to drag and drop the files.<\/p>\n<p>The time estimate was FOUR DAYS from now.<\/p>\n<p>In real terms, this would have consumed one of the two weeks remaining before the end of year holidays. This would put the deadline at risk, and had the potential to interfere with a laptop-free vacation.<\/p>\n<p>In addition, the customer wanted to take their laptop home in the evenings even during the data transfer. That meant stopping and restarting, or perhaps lashing together an impromptu workstation.<\/p>\n<p>It was only about 500GB of data. 500 gigabytes is 4,000 gigabits. At one gigabit per second (a relatively standard wired connection in a modern office), this was not less than 66 minutes worth of transfer. Even accounting for inefficiencies on the order of a factor of two or three \u2013 this shouldn\u2019t have been an <em>overnight<\/em> problem \u2013 much less a week of delay.<\/p>\n<p>I happened to be about a 10 minute bicycle ride from the lab in question, and this is a game I like to play.<\/p>\n<p>To make it more fun, I decided to up the ante from \u201ctomorrow morning\u201d to \u201cbefore dinner.\u201d<\/p>\n<p><em>Could I move the data and still get home in time for dinner with a friend from out of town?<\/em> I figured that it was reasonably possible, but I couldn\u2019t afford to go in the wrong direction for even an hour.<\/p>\n<p>As an aside, overhead and protocol challenges are a <em>really<\/em> big deal in large scale data transfers in clinical medicine, finance, satellite, media\/entertainment, and so on. For a graduate education on this topic, read up on Michelle Munson\u2019s <a href=\"https:\/\/asperasoft.com\">Aspera<\/a>.  Start with the <a href=\"http:\/\/asperasoft.com\/technology\/transport\/fasp\/#tcp-464\">FASP<\/a> protocol.<\/p>\n<h3>Which way to the cloud<\/h3>\n<p>There are a <b>lot<\/b> of places in Boston and Cambridge with good connectivity. 
I figured that I could call in a favor with friends at <a href=\"https:\/\/www.rc.fas.harvard.edu\/\">Harvard\u2019s research computing<\/a>, at the <a href=\"https:\/\/broadinstitute.org\">Broad Institute<\/a>, or even at one of the local pharmaceuticals. Any of these could have worked out fine.<\/p>\n<p>However, all of those groups use fiber optic connections that terminate at the same building: The <a href=\"https:\/\/www.markleygroup.com\/\">Markley Data Center<\/a> at Downtown Crossing. It\u2019s perhaps the single best connected building in New England.  Also, the people who run the place are smart and like to play this sort of game.<\/p>\n<p>So I picked up the disk and biked across the river. It took about 20 minutes.<\/p>\n<p>My bicycle was about a 3Gb\/sec transfer device \u2013 with latencies on the order of 1,200,000 milliseconds.<\/p>\n<h3>Linux Servers are Free<\/h3>\n<p>There\u2019s an old saying \u2013 you never eliminate a bottleneck, you just move it around.  Since I wanted to be home in time for dinner, I wanted to think through, rather than discover those bottlenecks.  One of the big potential bottlenecks was the wireless card on my laptop. Most wireless networks are not full 1Gb\/sec. My Apple laptop, for all that I love it, does not come with a traditional RJ45 network plug.  I was going to need a wired connection.<\/p>\n<p>So I texted ahead and asked the folks at Markley if I could borrow a laptop or a workstation to plug the disk into.  This was also a hedge \u2013 assuming that we could get the data transfer rolling, I wouldn\u2019t be in the same situation as the folks across the river \u2013 waiting on a data transfer before unplugging my laptop.<\/p>\n<p>In the same conversation (stopped by the curb on the Longfellow bridge), we decided to create a virtual machine on <a href=\"http:\/\/www.markleygroup.com\/markley-cloud\">Markley\u2019s private cloud<\/a> for use as a staging machine. 
I\u2019ve been trapped in enough data centers over the years, babysitting some dumb process while my friends ate dinner without me. So I requested a bit of infrastructure. About 90 seconds later, I had an email in my INBOX including some (but not all) of the credentials needed to log into a dedicated, internet-connected server with a terabyte of attached disk.<\/p>\n<p>The big advantage of the VM was that it would be trivial for me to reach it from home. Also, I could install whatever software I needed on it without inconveniencing the Markley staff or forcing them to re-image the loaner laptop. The in-building network is extremely \u201cclean\u201d with almost zero packet loss and sub-millisecond latencies (better than a bicycle by about 7 orders of magnitude). I figured that it was very likely that we could get the data off of the disk to something in-building in a couple of hours without any drama. The wider internet is <em>not<\/em> necessarily so clean and consistent.<\/p>\n<p>I was, at this point, assuming that something would go wrong, and that I would be logging in after my guests went home to bump the process along.<\/p>\n<p>Besides, it\u2019s a 90-second task to configure a virtual machine on a private cloud.<\/p>\n<p>So by the time I got to Downtown Crossing, there was a workstation waiting for me, as well as a virtual machine that I could log into. The nice folks there had also found a Thunderbolt to RJ45 adapter, in case I didn\u2019t want to bother with the loaner laptop. On a whim, mostly curious how fast we could make this happen, I plugged the disk into my own laptop and started a simple \u201cscp\u201d of a test file.<\/p>\n<h3>Too slow!<\/h3>\n<p>I was getting 3 megabytes per second.<\/p>\n<p>3 megabytes per second is 24 megabits per second.  About 2.4% of the number that I had been using for my estimates.  That put me right back into the \u201cseveral days\u201d estimate from before.  
My friend, sitting next to me, tried the same simple copy from the internal disk on his laptop to the server (attached via an identical connection) and got consistent performance of 80 megabytes per second.<\/p>\n<p>In the spirit of not overthinking it (and getting home in time for dinner!) we tested two things at once by swapping the disk over to his machine and starting the in-house copy to the VM. Why mess with it? It was moving into mid-afternoon now, and we wanted this first phase done before we were anywhere close to stressed.<\/p>\n<p>I should note that, at that point, it could <em>totally<\/em> have been the disk.  In that case, we would have been stuck. Fortunately, the connection screamed to life at 80MB\/sec.  He set to work starting a few batches of copies to run in parallel, which squeezed another 10MB\/sec or so out of the connection.  90 megabytes per second is 720 megabits per second, or 72% of the theoretical maximum of the connection.  Our theoretical 66-minute transfer was going to take about 90 minutes to get to the VM.<\/p>\n<h3>Duh<\/h3>\n<p>It wasn\u2019t until half an hour later that I realized the problem with my laptop: I still had my VPN configured. After all that bicycling and thinking I was so clever \u2013 my laptop was doing exactly what I had set it up to do, and spraying packets all over the country, from Seattle to Georgia, so that whatever coffee shop, airline, or hotel I happened to be connecting from couldn\u2019t snoop on me.<\/p>\n<p>After a good laugh, we agreed that I have a really good VPN, all told. Also, we let the copy run; it was already about a third of the way done.<\/p>\n<h3>The Last Bit<\/h3>\n<p>The final step was almost laughably easy.  I wanted the data to wind up on Amazon\u2019s S3. My friend at Diamond Age had configured a user account for me, and had picked a name for the bucket.<\/p>\n<p>It turns out that enough people have solved this problem before that it was mostly copy-pasting examples from the Amazon Command Line FAQ. 
There was the tiniest bit of <a href=\"http:\/\/whatis.techtarget.com\/definition\/yak-shaving\">Yak Shaving<\/a> to get Python, pip, and the Amazon CLI configured on the VM \u2013 but after that I ran a command very much like:<\/p>\n<p><code>aws s3 sync \/data s3:\/\/my-awesome-bucket<\/code><\/p>\n<p>And was rewarded with transfer rates on the order of 66MB\/sec.<\/p>\n<p>I verified that I could get ~60 megabytes per second consistently from the private cloud to the public cloud, and also that we could do both transfers concurrently.  The link from the laptop to the VM was still contentedly chugging along at 90MB\/sec.<\/p>\n<p>I also verified that I could log into the VM from outside of the data center.  I would have felt pretty silly to have to come back the next day to do any necessary cleanup.<\/p>\n<p>My friend and I caught up a bit about our plans for the holidays. Then I biked home as the light faded.<\/p>\n<p>By the time I got home, the initial sync command was done, and all the data from the disk was present on the VM. I ran the sync again, with the options to do a detailed check for any differences that might have crept in by copying files in these two stages.<\/p>\n<p>It finished up while I was writing this post, after dinner.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Yesterday, I got to solve a puzzle that I\u2019ve solved more than a few times before. \u201cHow do we get a moderate amount of data from here to there in reasonably short order?\u201d The specifics in this sort of puzzle change all the time. 
What&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[33],"tags":[],"class_list":["post-374","post","type-post","status-publish","format-standard","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/posts\/374","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/comments?post=374"}],"version-history":[{"count":12,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/posts\/374\/revisions"}],"predecessor-version":[{"id":1144,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/posts\/374\/revisions\/1144"}],"wp:attachment":[{"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/media?parent=374"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/categories?post=374"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dwan.org\/index.php\/wp-json\/wp\/v2\/tags?post=374"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}