Author: cdwan

The network is slow: Part 1

Let me start off by agreeing that yes, the network is slow.

I’ve moved a fair amount of data over the years. Even when it’s only a terabyte or two, the network always seems uncomfortably slow. We never seem to get the performance we sketched out on the whiteboard, so the data transfer takes way longer than expected. The conversation quickly turns to the question of blame, and the blame falls on the network.

No disagreement there. Allow me to repeat: Yes, the network is slow.

This post is the first in a series where I will share a few simple tools and techniques to unpack and quantify that slowness and get things moving. Sometimes, hear me out, it’s not the entire network that’s slow – it’s that damn USB disk you have plugged into your laptop, connected over the guest wi-fi at Panera, and sprayed across half a continent by a helpful corporate VPN.

True story.

My point here is not to show you one crazy old trick that will let you move petabytes at line-rate. Rather, I’m hoping to inspire curiosity. Slow networks are made out of fast and slow pieces. If you can identify and remove the slowest link, that slow connection might spring to life.

This post is about old-school, low-level Unix/Linux admin stuff. There is nothing new or novel here. In fact, I’m sure that it’s been written a bunch of times before. I have tried to strike a balance to make an accessible read for the average command line user, while acknowledging a few of the more subtle complexities for the pros in the audience.

Spoiler alert: I’m not even going to get to the network in this post. This first one is entirely consumed with slinging data around inside my laptop.

Endless zeroes

When you get deep enough into the guts of Linux, everything winds up looking like a file. Wander into directories like /dev or /proc, and you will find files that have some truly weird and wonderful properties. The two special files I’m interested in today both live in the directory /dev. They are named “null” and “zero”.

/dev/null is the garbage disposal of Linux. It silently absorbs whatever is written to it, and never gives anything back. You can’t even read from it!

energon:~ cdwan$ echo "hello world" > /dev/null 
energon:~ cdwan$ more /dev/null
/dev/null is not a regular file (use -f to see it)

/dev/zero is the opposite. It emits an endless stream of binary zeroes. It screams endlessly, but only when you are listening.

If you want your computer to spin its wheels for a bit, you can connect the two files together like this:

energon:~ cdwan$ cat /dev/zero > /dev/null

This does a whole lot of nothing, creating and throwing away zeroes just as fast as one of the processors on my laptop can do it. Below, you can see that my “cat” process is taking up 99.7% of a CPU – which makes it the busiest thing on my system this morning.
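If you want to see that number for yourself rather than eyeballing top, you can start the copy in the background and ask ps about it. This is just a sketch; the exact percentage will vary from machine to machine and moment to moment:

```shell
# Start the pointless zero-copy in the background and remember its process ID.
cat /dev/zero > /dev/null &
CAT_PID=$!

# Give it a moment to get going, then sample its CPU usage.
sleep 2
ps -o pid,%cpu,command -p "$CAT_PID"

# Clean up: stop throwing away zeroes.
kill "$CAT_PID"
```

The %cpu column should land somewhere up in the high nineties, give or take whatever else the machine happens to be doing.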

Which, for me, raises the question: How fast am I throwing away data?

Writing nothing to nowhere

If my laptop, or any other Linux machine, is going to be involved in a data transfer, then the maximum rate at which I can pass data across the CPU matters a lot. My ‘cat’ process above looks pretty efficient from the outside, with that 99.7% CPU utilization, but I find myself curious to know exactly how fast that useless, repetitive data is flowing down the drain.

For this we need to introduce a very old tool indeed: ‘dd’.

When I was an undergraduate, I worked with a team in university IT responsible for data backups. We used dd, along with a few other low level tools, to write byte-level images of disks to tape. dd is a simple tool – it takes data from an input (specified with “if=”) and sends it to an output (specified with “of=”).

The command below reads data from /dev/zero and sends it to /dev/null, just like my “cat” example above. I’ve set it up to write a little over a million 1KB blocks, which works out to exactly a gigabyte of zeroes. On my laptop, that takes about two seconds, for a throughput of right around half a GB/sec.

energon:~ cdwan$ dd if=/dev/zero  of=/dev/null bs=1024 count=1048576
1073741824 bytes transferred in 2.135181 secs (502880950 bytes/sec)

The same command, run on the cloud server hosting this website, finishes in a little under one second.

[ec2-user@ip-172-30-1-114 ~]$ dd if=/dev/zero  of=/dev/null bs=1024 count=1048576
1073741824 bytes (1.1 GB) copied, 0.979381 s, 1.1 GB/s

Some of this difference can be attributed to CPU clock speed. My laptop runs at 1.8GHz, while the cloud server runs at 2.4GHz. There are also differences in the speed of the system memory. There may be interference from other tasks taking up time on each machine. Finally, the system architecture has layers of cache and acceleration tuned for various purposes.

My point here is not to optimize the velocity of wasted CPU cycles, but to inspire a bit of curiosity. Premature optimization is always a risk, but I will happily take a couple of factors of two in performance by thinking through the problem ahead of time.
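If you do feel like indulging that curiosity, one cheap experiment is the block size. Each run of dd below moves the same gigabyte, just in different sized chunks; smaller blocks mean more system calls per byte. A sketch, and your numbers will differ from mine:

```shell
# Move the same 1 GiB through dd with a few different block sizes.
# dd reports its statistics on stderr; tail grabs the throughput line.
for bs in 512 4096 65536 1048576; do
  count=$(( 1073741824 / bs ))
  echo "block size: $bs"
  dd if=/dev/zero of=/dev/null bs="$bs" count="$count" 2>&1 | tail -1
done
```

On most machines the 512-byte runs come in noticeably slower than the megabyte-block runs, which is worth remembering before blaming the disk or the network.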

As an aside, you can find out tons of useful stuff about your Linux machine by poking around in the /proc directory. Look, but don’t touch.

[ec2-user@ip-172-30-1-114 ~]$ more /proc/cpuinfo | grep GHz
model name : Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz

Reading and writing files

So now we’ve got a way to measure the highest speed at which a single process on a single CPU might be able to fling data. The next step is to ask questions about actual files. Instead of throwing away all those zeroes, let’s catch them in a file instead:

energon:~ cdwan$ dd if=/dev/zero  of=one_gig  bs=1024 count=1048576
1073741824 bytes transferred in 7.431081 secs (144493358 bytes/sec)

energon:~ cdwan$ ls -lh one_gig
-rw-r--r--  1 cdwan  staff   1.0G Mar  5 08:57 one_gig

Notice that it took about three and a half times as long to write those zeroes to an actual file instead of hurling them into /dev/null.

The performance when reading the file lands right in the middle of the two measurements:

energon:~ cdwan$ dd if=one_gig of=/dev/null bs=1024 count=1048576
1073741824 bytes transferred in 4.222885 secs (254267367 bytes/sec)

At a gut level, this makes sense. It kinda-sorta ought-to take longer to write something down than to read it back. The caches involved in both reading and writing mean we may see different results if we re-run these commands over and over. Personally, I love interrogating the behavior of a system to see if I can predict and understand the way that performance changes based on my understanding of the architecture.
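One easy way to watch the cache at work is to read the same file twice in a row, reusing the one_gig file from above. (On Linux, root can empty the page cache between runs with echo 3 > /proc/sys/vm/drop_caches; macOS has a purge command that does something similar.)

```shell
# Read the same file twice in a row. The first pass may have to go to the
# actual disk; the second is often served from the page cache and runs faster.
dd if=one_gig of=/dev/null bs=1024 count=1048576
dd if=one_gig of=/dev/null bs=1024 count=1048576
```

If the second run comes back dramatically faster, you’ve just measured your page cache rather than your disk – which is exactly the kind of distinction that matters when you start blaming the network.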

I know, you were hoping to just move data around at speed over this terribly slow network. Here I am prattling on about caches and CPUs and RAM and so on.

As I said above, my point here is not to provide answers but to provoke questions. Agreed that the network is slow – but perhaps there is some part of the network that is most to blame.

I keep talking about that USB disk. There’s a reason: those things are incredibly slow. Here are the numbers for reading that same 1GB file from a thumb drive:

energon:STORE N GO cdwan$ dd if=one_gig_on_usb of=/dev/null bs=1024 count=1048576
1073741824 bytes transferred in 75.596891 secs (14203518 bytes/sec)

That’s enough for one post. In the next installment, I will show a few examples of one of my all time favorite tools: iperf.

Biology is weird

Biology is weird. The data are weird, not least because models evolve rapidly. Today’s textbook headline is tomorrow’s “in some cases,” and next year’s “we used to think.”

It can be hard for non-biologists, particularly tech/math/algorithm/data science/machine learning/AI folks, to really internalize the level of weirdness and uncertainty encoded in biological data.

It is not, contrary to what you have read, anything like the software you’ve worked with in the past.  More on that later.

This post is a call for humility among my fellow math / computer science / programmer type people.  Relax, roll with it, listen first, come up to speed. Have a coffee with a biologist before yammering about how you’re the first smart person to arrive in their field. You’ll learn something. You’ll also save everybody a bit of time cleaning up your mess.


Don’t be the person who walks into a research group meeting carrying a half read copy of “Genome” by Matt Ridley, spouting off about how all you need is to get TensorFlow running on some cloud instances under Lambda and you’re gonna cure cancer.

This is not to speak ill of “Genome,” it’s a great book, and I’m super glad that lots of people have read it – but it no more qualifies you to do the heavy lifting of genomic biology than Lisa Randall’s popular press books prepare you for the mathematical work of quantum physics.

You’ll get more cred with a humble attitude and a well thumbed copy of “Life Ascending” by Nick Lane. For full points, keep Horace Judson’s “The Eighth Day of Creation” on the shelf.  Mine rests between Brooks’ “The Mythical Man Month” and “Personality” by Daniel Nettle.

The More Things Change

Back in 2001, the human genome project was wrapping up.  One of the big questions of the day was how many genes we would find in the completed genome.  First, set aside the important but fundamentally un-answerable question of what, exactly, constitutes a gene.  Taking a simplistic and uncontroversial definition, I recall a plurality of well informed people who put the expected total between 100,000 and 200,000.

The answer?  Maybe a third to a sixth of that.  The private sector effort, published in Science, reported an optimistically specific 26,588 genes.  The public effort, published in Nature, reported a satisfyingly broad 30,000 to 40,000. 

There was a collective “huh,” followed by the sound of hundreds of computational biologists making strong coffee. 

This happens all the time in Biology. We finally get enough data to know that we’ve been holding the old data upside down and backwards.

The central dogma of information flow from DNA to RNA to protein seems brittle and stodgy when confronted with retroviruses, and honestly a bit quaint in the days of CRISPR.  I’ve lost count of the number of lower-case modifiers we have to put on the supposedly inert “messenger molecule” RNA to indicate its various regulatory or even directly bio-active roles in the cell.

Biologists with a few years under their belt are used to taking every observation and dataset with a grain of salt, to constantly going back to basics, and to sighing and making still more coffee when some respected colleague points out that that thing … well … it’s different than we expected.

So no, you’re not going to “cure cancer” by being the first smart person to try applying math to Biology.  But you -do- have an opportunity to join a very long line of well meaning smart people who wasted a bunch of time finding subtle patterns in our misunderstandings rather than doing the work of biology, which is to interrogate the underlying systems themselves.


To this day, whenever I look at gene expression pathways I think: “If I saw this crap in a code review, I would send the whole team home for fear of losing my temper.”

My first exposure to bioinformatics was via a seminar series at the University of Michigan in the late 90’s. Up to that point, I had studied mostly computer science and artificial intelligence. I was used to working with human-designed systems. While these systems sometimes exhibited unexpected and downright odd behaviors, it was safe to assume that a plan had, at some point, existed. Some human or group of humans had put the pieces of the system together in a way that made sense to them.

To my eye, gene expression pathways look contrived. It’s all a bit Rube Goldberg down there, with complex and interlocking networks of promotion and inhibition between things with simple names derived from the names of famous professors (and their pets). 

My design sensibilities keep wanting to point out that there is no way that this mess is how we work, that this thing needs a solid refactor, and that … dammit … where’s the coffee?

It gets worse when you move from example to example and keep finding that these systems overlap and repeat in the most maddening way. It’s like the very worst sort of spaghetti code, where some crazy global variable serves as the index for a whole bunch of loops in semi-independent pieces of the system, all running in parallel, with an imperfect copy paste as the fundamental unit of editing.

This is what happens when we apply engineering principles to understanding a system that was never engineered in the first place.

Those of us who trained up on human designed systems apply those same subconscious biases that show us a face in the shadows of the moon. We’re frustrated when the underlying model is not based on noses and eyes but rather craters and ridges. We go deep on the latest algorithm or compute system – thinking that surely there’s reason and order and logic if only we dig deep enough.

Biologists roll with it. 

They also laugh, stay humble, and drink lots of coffee.

Fixing the Electronic Medical Mess

In my previous blog post, I talked about the fact that medical records are a dumpster fire from a scientific data perspective. Apparently this resonated with people.

This post begins to sketch some ideas for how we might start to correct the problem at its root.

Lots of people have thought deeply about this stuff. One specific example is the Apperta Foundation whose white paper makes a wonderful introduction to the topic.

@bentoth’s second point, above, is exactly correct. Until we put the patient at the center of the medical records system, we’re going to be digging in the trash.

The question is how we get from here to there.

Not Starting From Zero

Before digging in, I want to address a very valid objection to my complaint:

It’s true: Even given the current abysmal state of things, researchers are still making important discoveries. This indicates to me that it will be well worth our while to put some time and effort into improving things at the source. If we can get value out of what we’ve got now, imagine the benefits to cleaning it up!

Who Speaks for the Data?

One of the first steps towards better data, in any organization, is to identify the human beings whose job is, like the Lorax’s, to “speak for the data.” Identifying, hiring, and radically empowering these folks is a recommendation that I make to many of my clients.

Just … please … don’t call them a “data janitor.”

If you tell me that you have “data janitors,” I know that you consider your data to be trash. Beyond that, I also know that you consider curation, normalization, and data integration to be low-respect work that happens after hours and is not part of the core business mission. It’s not a big jump from there to realize that the structures and incentives feeding the problem aren’t going to change. Instead, you’re just going to hire people to pick through the trash and stack up whatever recyclables they can find.

I’ve even heard people talk about hiring a “data monkey.” Really, seriously, just don’t do that, even in casual conversation. It’s not cool.

Who does the work?

It takes a huge amount of work to capture primary observations, and still more effort to connect them to the critical context in which they were created. Good metadata is what allows future users to slice, dice, and confidently use datasets that they did not personally create.

Then there is the sustained work of keeping data sets coherent. Institutional knowledge and assumptions change and practices drift over time. Even though files themselves may not be corrupted, data always seems to rot unless someone is tending it.

This work cannot simply be layered on as yet another task for the care team. Physicians and nurses are already overwhelmed and overworked. Adding another layer of responsibility and paperwork to their already insane schedules will not work.

We need to find a resource that already exists, that scales naturally with the problem, and that also has a strong self-interest in getting things right.

Fortunately, such a resource exists: We need to leverage patients and their families. We need to empower them to curate and annotate their own medical records, and we need to do it in a scalable and transparent way.

I’m willing to bet that if we start there, we’ll wind up with a population who are more than happy, for the most part, to share slices of their data because of the prospective benefits to people other than themselves.

The tools already exist

Health systems don’t encourage it, but patients can and do demand access to the data derived from their own bodies. People suffering from rare or undiagnosed diseases make heavy use of this fact. They self-organize, using services like Patients Like Me or Seqster to carry their own information back and forth between the data silos built by their various providers and caregivers. Similarly, physicians can work with services like the Matchmaker Exchange to find clues in the work of colleagues around the world.

Unfortunately, there is no easy way for this cleaned and organized version of the data to get back into the EMR from which it came. That’s the link to be closed here – people are already enthusiastically doing the work of cleaning this data. They are doing it on their own time and at their own expense because the self-interest is so clear.

The job of the Data Lorax is to find a way to close that loop and bring cleaned data back into the EMR. This is different from what we do today, so we’re going to need to adapt a lot of systems and processes, and even a law or a rule here or there.

Fortunately, it’s in everybody’s interest to make the change.

The Electronic Medical Mess

I posted a quick tweet this morning about the state of data in health care.

Over the years, I’ve worked with at least half a dozen projects where earnest, intelligent, diligent folks have tried to unlock the potential stored in mid to large scale batches of electronic medical records. In every case, without exception, we have wound up tearing our hair and rending our garments over the abysmal state of the data and the challenges in getting access to it at all. It is discordant, incomplete, and frequently just plain-old incorrect.

I claim that this is the result of structural incentives in the business of medicine.

What is a Medical Record?

Years ago the medical record was how physicians communicated amongst themselves. The “clinical narrative” was a series of notes written by a primary care physician, punctuated by requests for information and answers from specialists. Physicians operated with an assumption of privacy in these notes, since patients didn’t generally ask to see them. Of course they were still careful with what they wrote. If things went sideways, those notes might wind up being read aloud in front of a judge and jury.

In the 80’s, electronic medical records (EMRs) added a new dimension to this conversation. EMRs were built, in large part, to support accurate and timely information exchange between health care organizations and “payers” including both corporate and government insurance. EMRs digitized the old clinical narrative basically unchanged. They sometimes allowed in-house lab values to be transferred as data rather than text, though in many cases that sort of feature came much later. Most of the engineering effort went into building a framework for billing and payment.

The savvy reader will note that neither of these is a particularly good way to build a system for the collection of patient data.  Instead, we’re dealing with risk avoidance.

A Question of Risk and Cost

Being the Chief Information Officer (CIO) of a health care system or a hospital is a hard, stressful, and frequently thankless job. Information Technology (IT) is usually seen as a cost center and an expense rather than as a driver of revenue. A savvy CIO is always looking for ways to reduce costs and allow their institution to put more dollars directly into the health care mission. Successful hospital CIOs spend a lot of time thinking about risk. There are operational risks from attacks like ransomware, compliance risks, risks that the hospital will expose patient data inappropriately, financial risks from lost revenue, legal risks from failing to meet standards of care, and many more.

These pressures lead to a very sensible and consistent mindset among hospital CIOs: They have a healthy skepticism of the “new shiny,” an aversion to change, and a visceral awareness of their responsibility for consistent and compliant operations.

So physicians are incentivized to avoid litigation, hospital information systems are incentivized to reduce exposure, and the core software we use for the whole mess is written primarily to support financial transactions.

Every single person I’ve ever met in the business and practice of health care, without exception, wants to improve patient lives. This is not a case where we need to find the bad, the malevolent, or the incompetent people and replace them. Instead, it’s one of those situations where good, smart hardworking people are stuck with a system that we all know needs a solid shake-up.

That means that when someone (like me) shows up and proposes that we change a bunch of hospital practices (including modifying that damn EMR software) so that we can gather better data, it falls a bit flat. If I reveal my grand plan to take the data and use it for some off-label purpose like improving the standard of care globally, I am usually politely but firmly shown the door.

But it gets worse.

Homemade Is Best

Back in the bad old days, it was possible to convince ourselves that observations made by physicians were the best and only data that should be used in the diagnosis of disease. That’s demonstrably untrue in the age of internet connected scales and wearable pulse and sleep monitors. I’ve written before about the reaction I receive when I show up to my doctor as a Patient With A Printout (PWP). Even here in 2019, there are not many primary care physicians who are willing to look at data from a direct to consumer genetics or wellness company.

The above isn’t strictly true. I know lots of physicians who have a very modern approach to data when we talk over coffee or dinner. However, at work, they have to do a job. The way they are allowed to do that job is defined by CIOs and hospital Risk Officers who grow nervous when we try to introduce outside data sources in the clinical context. What assurances do we have that these wearable devices meet any standards of clinical care? Who, they might ask, will be legally responsible if a diagnosis is missed or an incorrect treatment applied?

So we’re left with a population health mindset that says “never order a test unless you know what you’re going to do with the result,” except that in this case it’s “don’t look at a test that was already done, you might wind up with an inconvenient incidental finding, and then we’ll have to talk to legal.”

Health systems incentivize risk avoidance above more accurate or timely data. They do this because they are smart, and because they want to stay in business.

So we collect information with a system tuned for billing, run by people whose focus is on risk avoidance. Is it any wonder that when we extract that data, what we find is a conflicting and occasionally self-contradictory mess?

There’s no incentive to have it any other way.

A Better Way

Here in 2019, most people who pay attention to such things believe that data driven health insights will lead to better clinical outcomes, better quality of life, lower overall costs for health care, and many other benefits.


One ray of hope comes from the online communities that spring up to connect people with rare and terrible diseases. These folks share information amongst friends, family, researchers, and physicians as they search desperately for any hope of a cure. Along the way, they create and curate incredibly valuable data resources. The difference between these patient centric repositories and any extraction that we might get from an EMR is simply night and day.

A former colleague was fond of saying, “a diagnosis of cancer really clarifies your thinking about the relative importance of data privacy.”

Put another way: If we put the patient at the center of the data equation, rather than payment, we’re really not that very far from a much better world – and all those wonderful technologies I mentioned will suddenly be quite useful.

Unfortunately, that’s a political question these days.

So where do we go from here? I’m not sure.

I do know for certain that -merely- flinging the messy pile of junk against the latest Machine Learning / Artificial Intelligence / Natural Language Processing software, without addressing the underlying data quality, is unlikely to yield durable and useful results.

Garbage in, garbage out – as the saying goes.

I would love to hear your thoughts.

Letting the genome out of the bottle

About eleven years ago, in January of 2008, the New England Journal of Medicine published a perspective piece on direct to consumer genetic tests, “Letting the Genome out of the Bottle – Will We Get Our Wish?” The article begins by describing an “overweight” patient who “does not exercise.” This man’s children have given him the gift of a direct to consumer genetics service at the bargain price of $1,000.

The obese person who (did we mention) can’t be troubled to go to the gym is interested in medical advice based on the fact that they have SNPs associated with both diabetes and cardiovascular disease. The message is implied in the first paragraph, and explicitly stated in the last:  “Until the genome can be put to useful work, the children of the man described above would have been better off spending their money on a gym membership or a personal trainer so that their father could follow a diet and exercise regimen that we know will decrease his risk of heart disease and diabetes.”

Get it?  Don’t bother us with data.  We knew the answer as soon as your heavy footfalls sounded in the hallway.  Hit the gym.

The authors give specific advice to their colleagues “for the patient who appears with a genome map and printouts of risk estimates in hand.”  They suggest dismissing them:  “A general statement about the poor sensitivity and positive predictive value of such results is appropriate … For the patient asking whether these services provide information that is useful for disease avoidance, the prudent answer is ‘Not now — ask again in a few years.'”

Nowhere do the authors mention any potential benefit to taking a glance at the sheaf of papers this man is clutching in his hands.

Just 10 years ago, a respected and influential medical journal told primary care physicians to discourage patients from seeking out information about their genetic predisposition to disease.  Should someone have the nerve to bring a “printout,” they advise their peers to employ fear, uncertainty, and doubt. They suggest using some low-level statistical jargon to baffle and deflect, before giving answers based on a population-normal assumption.

The reason I’m writing this post is because I went to the doctor last week and got that exact answer, almost verbatim.  I already went off about this on twitter.  I’m writing this because I think that it may benefit from a more nuanced take.

More on that at the end of the post.

Eight bits of history

For all its flaws, the article does serve as a fun and accessible reminder of how far we have come in a decade.

I did 23andme when it first came out. I’ve downloaded my data from them a bunch of times.  Here are the files that I’ve downloaded over the years, along with the number of lines in each file:

cdwan$ wc -l genome_Christopher_Dwan_*
576119 genome_Christopher_Dwan_20080407151835.txt
596546 genome_Christopher_Dwan_20090120071842.txt
596550 genome_Christopher_Dwan_20090420074536.txt
1003788 genome_Christopher_Dwan_Full_20110316184634.txt
1001272 genome_Christopher_Dwan_Full_20120305201917.txt

The 2008 file contains about 576,000 data points.  That doubled to a bit over a million when they updated their SNP chip technology in 2011.

The authors were concerned that “even very small error rates per SNP, magnified across the genome, can result in hundreds of misclassified variants for any individual patient.”  When I noticed that my results from the 2009 download were different from those in 2008, I wrote a horrible PERL script to figure out the extent of the changes. I still had it sitting around on my laptop, so I ran it again today. I was somewhat shocked that it worked on the first try, a decade and at least two laptops later!

My 23andme results were pretty consistent. Of the SNPs that were reported in both v1 and v2, my measurements differ at a total of 54 loci. That’s an error rate of about one hundredth of one percent. Not bad at all, though certainly not zero.
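The comparison itself doesn’t need much code. Here is a rough shell equivalent of that script, assuming the usual 23andme download format – tab-separated rsid, chromosome, position, and genotype, with comment lines starting with ‘#’ – and with placeholder file names:

```shell
# Count the loci reported in both downloads where the genotype differs.
# (Uses bash process substitution; join needs its inputs sorted on the rsid.)
old=genome_v1.txt
new=genome_v2.txt

join \
  <(grep -v '^#' "$old" | awk '{print $1 "\t" $4}' | sort -k1,1) \
  <(grep -v '^#' "$new" | awk '{print $1 "\t" $4}' | sort -k1,1) \
| awk '$2 != $3 { diffs++ } END { print diffs + 0, "differing loci" }'
```

The join quietly drops SNPs that appear in only one file, which is what you want when comparing a v1 chip against a v2 chip with a different probe set.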

For comparison, consider the height and weight that usually gets taken when you visit  a doctor’s office. In my case, they do these measurements with shoes and clothing on – meaning that I’m an inch taller (winter boots) and about 8 pounds heavier (sweater and coat) if I see my doctor in the winter. Those are variations of between 1% and 5%.

Fortunately, nobody ever looks at adult height or weight as measured at the doctor’s office. They put us on the scale so that the practice can charge our insurance providers for a physical exam, and then the doctor eyeballs us for weight and any concealed printouts.

A data deluge

Back to genomics: $1,000 will buy a truly remarkable amount of data in late 2018.  I just ordered a service from Dante Labs that offers 30x “read depth” on my entire genome.  They commit to measure each of my 3 billion letters of DNA at least 30 times.  Taken together, that’s 90 billion data points, or 180,000 times more measurements than that SNP chip from a decade ago.  Of course, there’s a strong case to be made that those 30 reads of the same location are experimental replicates, so it’s really only 3 billion data points or 6,000 times more data. Depending on how you choose to count, that’s either 12 or 17 doublings over a ten year span.   
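The doubling arithmetic is just a base-2 logarithm of the ratio, here using the 576,119 calls in my 2008 file as the starting point. awk will do the math:

```shell
# How many doublings take you from ~576,000 SNP calls to the new totals?
# log2(x) = log(x) / log(2)
awk 'BEGIN {
  old = 576119                     # SNP calls in the 2008 download
  printf "3 billion data points: %.1f doublings\n", log(3e9 / old) / log(2)
  printf "90 billion data points: %.1f doublings\n", log(90e9 / old) / log(2)
}'
```

Which lands right around the 12 and 17 doublings, depending on whether you count the repeated reads as independent data points.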

Either way, we’re in a world where data production doubles faster than once per year.

This is a rough and ready illustration of the source of the fuss about genomic data.  Computing technology, both CPU and storage, seems to double in capacity per dollar every 18 months. Any industry that exceeds that tempo for a decade or so is going to experience growing pains.

To make the math simple, I omitted the fact that this year’s offering -also- gives me an additional 100x of read depth within the protein coding “exome” regions, as well as some even deeper reading of my mitochondrial DNA.

One real world impact of this is that I’m not going to carry around those raw reads on my laptop anymore. The raw files will take up a little more than 100 gigabytes, which would be 20% of my laptop hard disk (or around 150 CD-ROMs).

I plan to use the cloud, and perhaps something more elegant than a 10 year old single threaded PERL script, to chew on my new data.

The more things change

Back to the point:  I’m writing this post because, here in late 2018, I got the -exact- treatment that the 2008 article recommends. It’s worse than that, because I didn’t even bring in anything as fuzzy as genotypes or risk variants.  Instead, I brought lab results, ordered through Arivale, and generated by a Labcorp facility to HIPAA standards.

I’ve written about Arivale before.  They do a lab workup every six months. That, coupled with data from my wearable and other connected devices provides the basis for ongoing coaching and advice.

My first blood draw from Arivale showed high levels of mercury. I adjusted my diet to eat a bit lower on the food chain. When we measured again six months later, my mercury levels had dropped by 50%. However, other measurements related to inflammation had doubled over the same time period.  Everything was still in the “normal” range – but a fluctuation of a factor of two struck me as worth investigating.

I use one of those fancy medical services where, for an -additional- out-of-pocket annual cost, I can use a web or mobile app to schedule appointments, renew prescriptions, and even exchange secure messages with my care team. Therefore, I didn’t have to do anything as undignified as bringing a sheaf of printouts to my physician’s upscale office on a high floor of a downtown Boston office building.  Instead, I downloaded the PDFs from Arivale and sent them as a message with my appointment request.

When we met, my physician had printed out the PDFs.  Perhaps this is part of that “digital transformation” I’ve heard so much about. The 2008 article is studiously silent on the topic of doctors bearing printouts. I’m guessing it’s okay if they do it.

Anyway, I had the same question as the obese, exercise-averse patient who drew such scorn in the 2008 article:  Is there any medical direction to be had from this data?

My physician’s answer was to tell me that these direct-to-consumer services are “really dangerous.”  He gave me the standard line about how all medical procedures, even minimally invasive ones, have associated risks. We should always justify gathering data in terms of those risks, at a population level. He cautioned me that going down the road of even looking at elevated inflammation markers can lead to uncomfortable, unnecessary, and ultimately dangerous procedures.

Thankfully, he didn’t call me fat or tell me to go get a gym membership.

This, in a nutshell, is our reactive system of imprecision medicine.

This is also an example of our incredibly risk-averse business of medicine, where sensible companies will segment and even destroy data to avoid the danger of accidentally discovering facts that they might be obligated to report or to act on.

This, besides the obvious profit motive, is why consumer electronics and retail outfits like Apple and Amazon are “muscling into healthcare.”

The void does desperately need to be filled, but I think it’s pretty terrible that the companies best poised to exploit the gap are the ones most ruthlessly focused on the bottom line, most extractive in their runaway capitalism, and with histories of terrible practices around both labor and privacy.

A happy ending, perhaps

I really do believe that there is an opportunity here: A chance to radically reshape the practice of medicine. I’m a genomics fanboy and a true believer in the power of data.

To be clear, the cure is not any magical app. The transformation will not be driven simply by encoding our data as XML, JSON, or some other format entirely. No specific variant of machine learning or artificial intelligence is going to un-stick this situation.

It’s not even blockchain.

The answer lies in a balanced approach, with physicians being willing to use data-driven technologies to amplify their own senses, to focus their attention, to rapidly update their recommendations and practices, and to measure and adjust follow-ups and follow-throughs.

To bring it back to our obese patient above, consider the recent work on polygenic risk scores, particularly as they relate to cardiovascular health. A savvy and up-to-date physician would be well advised to look at the genetics of their patients – particularly those of us who don’t present as a perfect caricature of traditional risk factors for heart disease.

I’ve written in the past about another physician who sized me up by eyeball and tried to reject my request for colorectal cancer screening, despite a family history, genetic predisposition, and other indications.  “You look good,” he said, “are you a runner?”

There is a saying that I keep hearing:  “Artificial Intelligence will not replace physicians.  However, physicians who use Artificial Intelligence will replace the ones who do not.”

The same is true for using all the data available. In my opinion, it is well past time to make that change.

I would love to hear what you folks think.

Thank you!

About 20 months ago, I left a fantastic job at the Broad Institute to strike out on my own as an independent consultant. At the time, I was nervous. I was pretty sure that I could manage the nuts and bolts of running a small business. I’ve got experience using spreadsheets to track potential customers and to remind me to follow up on invoices. I’ve managed projects, reviewed contracts, and picked up enough negotiation and other critical soft skills to get by.

The big question in my mind was this: Would people still take me seriously when I wrote from a shared home office or a co-working space in Somerville rather than from a private office on the 11th floor of one of the biggest names in Kendall Square? That was a leap of faith for me. I honestly didn’t know.

Nearly two years in, I’m thrilled to report that it’s working out great.

All of this is because of the amazing professional community of friends, colleagues, vendors, customers, and collaborators that I’ve met and worked with over the years. You folks reading this post made this possible.

You, specifically. Thank you. I’m not going to list all your names, but I recently had a chance to make a picture out of some of your logos:

As Eric Lander frequently says when he speaks in public: “Wow!”

In case you’re wondering, I will probably have a “real job” (paycheck, office, boss) again sometime in the future. Here’s why:

I miss sharing in the mission. One of the hallmarks of a good consultant is that we leave once the need for specialized and time critical services has passed. That leaving is bittersweet. If I do my job right, I get to see client after client outgrow their need for me.

I also miss mentoring, building teams, and working on not just technical efficiency but also on culture, inclusion, fairness, access, and the quality of life. I can give little nudges to these things from the outside, but really making a difference requires time and focus that a consulting engagement usually doesn’t afford.

For all that, I’ve got no plans to rejoin the 9-5 crowd any time soon.

When I left the Broad, I made a deliberate decision to move away from my comfort zone. I didn’t just quit a job, I also moved away from what I already knew and towards what I know to be important in the future. That meant that I set aside perfectly good opportunities to tune up high performance computing systems, and instead spent a summer researching and writing a white paper about Blockchain. I demurred on cloud migrations and dug in to enhance my admittedly basic knowledge of effective, practical information security. I got facile with the language of governance and compliance, and started in on covered entities, HIPAA, and all that jazz.

My goal in all of this was to swim rapidly out of the research shallows, all the way out to the gnarly rapids where data, computing and information intersect with clinical care.

Forget 20 months, I’ve spent nearly 20 years working with genomic data. I want to see what’s holding us back from the long promised genomic medicine revolution, I want to find the very toughest problems, and I want to help solve them.

And really, the core of my gratitude is that I feel like I’m getting a chance to do that.

Thanks to all of you.


This is a personal story about workplace bias.

In my first management gig, between 2004 and 2013, I built an all-male team.

“Keep the company mostly male” was never a goal. In fact, if anyone had said that kind of crap out loud, the whole team would have reacted with disgust. I’m pretty sure that we didn’t break any laws or even “best practices.” We had the required nondiscrimination policies and we took our annual sexual harassment training seriously.

For all that, the numbers are unambiguous: The people we hired and retained for the technical team were almost all men. Since I had a major role in our recruiting, hiring, and workplace practices, I’ve got to own that.

Bias is harder to isolate and pin down than crimes like assault and harassment. Bias feels vague, and that vagueness provides wiggle room. Those of us who hold the power should use it to make change, rather than excuses.

I’m writing this now because I wish that someone had pointed it out to me at the time.

The dangers of monoculture

I’m proud of my nine years at that company. By every metric that we used, I think that we did a really good job. We bootstrapped from four people to fourteen without taking any external investment. We made payroll every single month, launched three products, and did all sorts of cool stuff.

Still, I know that we could have done better – and I’m not just talking about the moral perspective here.

When recruiting, we tended to reach only within our existing network. We hired people that we already knew or had heard of rather than casting a wider net. That’s part of why we recapitulated the biases of our industry.

This also created an intellectual monoculture in which we were all pretty sure of our own superiority. For all that the team was broad-minded and incredibly creative – we were also stuck with a tiny slice of the intellectual landscape.

Year over year this siloing held us back. It made it all too easy to believe that we were the very best. From where I sit now, that comes across as immature. It’s the arrogance of a regional sports champion that has never gone to a national competition.

From the outside, I can see how provincial we were.

That attitude (plus always showing up with an all-male team) certainly cost us customers over the years. The NY Times has a decent article on how all-male sales teams are less competitive.

On the performance side, you don’t have to take my word for it. Read Why Diverse Teams Are Smarter from the Harvard Business Review. Look at the McKinsey report on how more diverse leadership makes companies more profitable.

My very favorite study in this space says it right in the abstract:

Groups with out-group newcomers (i.e., diverse groups) reported less confidence in their performance and perceived their interactions as less effective, yet they performed better than groups with in-group newcomers (i.e., homogeneous groups). Moreover, performance gains were not due to newcomers bringing new ideas to the group discussion. Instead, the results demonstrate that the mere presence of socially distinct newcomers and the social concerns their presence stimulates among oldtimers motivates behavior that can convert affective pains into cognitive gains.

I know a man who is an influential leader at a well known organization. He scans the author lists of scientific papers before reading the abstracts. If he doesn’t recognize the names, he doesn’t bother to go further. He once told me why: “It saves time. If they were any good, I would already have heard of them.”


Nature vs Nurture

The few women who did choose to join the company tended to leave after a much shorter tenure than the men. That difference speaks to workplace culture. I have to own that too, since I was responsible for many of the team’s day to day practices.

With what I know now, I can see that I built a place where it was pretty easy for people like me to succeed. My guess is that the more different a person was from me in their work and life patterns, the harder they would find it to succeed in my organization.

That inattention sabotaged the few people who made it past the filters described above.

As above, there’s no malice required. I just wasn’t paying attention.

Stated baldly, it’s a pretty weak and inexperienced manager who can only manage people just like himself.

It matters

This isn’t all in the past.

Just this year, in 2018, I took flak from several long-term friends because my “political correctness” made it harder for us to organize a joint marketing opportunity.

It’s all too easy to make excuses. The reality is that the state of inclusion in our industry is an embarrassing and broken thing.

It is our job to fix it.

Stat News has published several articles lately that shine a spotlight on some of the most egregious behavior. One stark example is their coverage of the Party At Bio Not At Bio. In case you haven’t heard, sponsors paid to have their logos painted on nearly nude women who danced for the crowd’s amusement.

That’s not a “Mad Men” episode, it’s the state of biotech in 2018.

The sponsors, organizers, and attendees aren’t bad people – but they weren’t paying attention. Making sure that the environment was safe and inclusive didn’t make the list of priorities.

It takes work to overcome these systemic biases. Fortunately, there are resources available. This Twitter list of 300 women in tech you should follow is up to 522 members, which makes it laughable to use the excuse that there are no qualified women available for speaking engagements.

Broccoli in your teeth

It’s still hard to talk about this stuff. I hesitated a long time before writing this post, and still longer before hitting “publish.”

That hesitation is because it is awkward, uncomfortable and weird every single time that I take a customer or a business contact aside and privately point out that they’ve got an all-male team.

People get defensive, evasive, and occasionally even insulting and sanctimonious. They come back with “what about,” and even bring up examples from my own past where I didn’t live up to the ideals that I’m now pushing on them.

It’s important to power through that discomfort: A former colleague ran her group on the “must inform” principle. Whether it was broccoli in the teeth, a wardrobe malfunction, or something more significant – it was the team’s obligation to help each other.

The benefit was clear: Her team never showed up with junk in their teeth, or with easily correctable biases showing.

I wrote this because I wish that my mentors back in the day had said something to me.


I love staying in touch with you, my far-flung network of friends and family. I use this platform to do that. I love seeing the snapshots from your lives, your pets, your children. I love hearing about your spouses and houses and hobbies. I usually fight the urge to engage online about politics.

Sometimes I slip up. When I do engage, it costs me massive amounts of precious time and energy. There is only so much “me” in any given day – I want and need to pour myself into my work, my family and friends, or my community.

As I turn the corner into my 40s, I see that I can still do a lot with my life, and I also see that it is finite. I want to be intentional about how I spend my time. Arguing online, especially in public forums, is exhausting and generally pointless. Even the very best forums are either echo chambers where we all already agree (I’m looking at you, activist atheist community) or else nasty shouting matches where we all play for points from the audience rather than listening and considering and maybe changing our minds.

I’ve gotten to know my local city representative, JT, over the last year or so. He hosts office hours at his home on Friday mornings. We’ve probably sat together for 8 or 10 hours at this point, sometimes with other people joining in, but mostly just the two of us. We mostly agree. Sometimes we don’t. He’s a skilled conversationalist and listener, and has taught me a lot about how he sees the world. I leave those conversations energized and with -more- of “me” to give back.

I’ve had a couple of good conversations over messenger lately – in particular with Robert and Jason. We’ve got very different perspectives – and I value the time that they put into helping me to understand how they see the world, and why.

Please understand that while I care deeply about the world’s continuing slide into madness (which has been going on my entire life, and will continue long after I’m gone), I’m not going to spend a lot of time talking about it on Facebook.

Thanks, all. I hope you have a good day.

Manufacturing improvements apply to HPC

The Strategy Board

My former colleagues at the Broad Institute recently published a marvelous case study. They describe, in a delightfully brisk and jargon-free way, some of the process improvements they used to radically increase the productivity of the genome sequencing pipeline.

This post is about bringing the benefits of their thinking to our high performance computing (HPC) systems.

The fundamental change was to modify the pipeline of work so that instead of each stage “pushing” to the next, stations would “pull” work when they were ready to receive it. This should be familiar to folks who have experience with Kanban. It also overlaps with both Lean and Agile management techniques. My favorite part of the paper is that they applied similar techniques to knowledge work – with similar gains.

The spare text of the manuscript really doesn’t do justice to what we called the “strategy board meeting.” By the time I started attending in 2014 it was a massive thing, with fifty to a hundred people gathering every Wednesday morning. It was standing room only in front of a huge floor-to-ceiling whiteboard covered with brightly colored tape, dry erase writing, and post-it notes. Many of the post-it notes had smaller stickers stuck on them!

Somehow, in an hour or less every week, we would manage to touch on every part of the operation – from blockers in the production pipeline through to experimental R&D.

My favorite part was that it was a living experiment. Some weeks we would arrive to find that the leadership team had completely re-jiggered some part of the board – or the entire thing. They would explain what they were trying to do and how they hoped we would use it, and then we would all give it a try together.

I really can’t explain it better than the paper itself does. It’s 100% worth the read.

The computational analysis pipeline

When I started attending those strategy board meetings in 2014, I was responsible for research computing. This included, among other things, the HPC systems that we used to turn the raw “reads” of DNA sequence into finished data products. This was prior to Broad’s shift to Google’s Cloud Platform, so all of this happened on a large but finite number of computers at a data center in downtown Boston.

At that time, “pull” had not really made its way into the computational side of the house. Once the sequencers finished writing their output files to disk, a series of automatic processes would submit jobs onto the compute cluster. It was a classic “push,” with the potential for a nearly infinite queue of Work In Progress. Classical thinking holds that a healthy queue is a good thing in HPC. It gives the scheduler lots of jobs to choose from, which means that you can keep utilization high.

Unfortunately, it can backfire.

One of the little approximations that we make with HPC schedulers is to give extra priority to jobs that have been waiting a long time to run. On this system, we gave one point of priority (a totally arbitrary number) for every hour that a job had been waiting. On lightly loaded systems, this smooths out small weirdnesses and prevents jobs from “starving.”

In this case, it blew up pretty badly.

At the time, there were three major steps in the genome analysis pipeline: Base calling, alignment, and variant calling.

In the summer of 2015, we accumulated enough jobs in the middle stage of the pipeline (alignment) that some jobs were waiting a really long time to run. This meant that they racked up enormous amounts of extra priority in the scheduler. That extra priority was enough to put them in front of all of the jobs from the final stage of the pipeline.

We had enough work stuck in the middle of the pipeline that the final stage ran only occasionally, if at all.
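
The failure mode is easy to reproduce in miniature. The aging rule below (one point of priority per hour waited) is the one described above; the base priorities and stage names are made-up numbers, purely for illustration:

```python
# Each job has a base priority plus aging: one point per hour spent waiting.
def effective_priority(base, hours_waiting):
    return base + hours_waiting

# Hypothetical base priorities: later pipeline stages are meant to run first.
fresh_variant_call = effective_priority(base=100, hours_waiting=2)
stale_alignment = effective_priority(base=50, hours_waiting=500)

# After a long enough wait, backlogged middle-stage jobs outrank everything
# in the final stage -- the inversion that starved variant calling.
assert stale_alignment > fresh_variant_call
print(stale_alignment, fresh_variant_call)  # 550 102
```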

Unfortunately, it didn’t all tip over and catch fire at once. The pipeline was in a condition from which it was not going to recover without significant intervention, but it would still emit a sample from time to time.

As the paper describes, we were able to expedite particular critical samples – but that only made things worse. Not only did it increase the wait for the long-suffering jobs in the middle of the pipeline, but it also distracted the team with urgent but ultimately disruptive and non-strategic work.


One critical realization was that in order for things to work, the HPC team needed to understand the genomic production pipeline. From a system administration perspective, we had high utilization on the system, jobs were finishing, and so on. It was all too easy to push off complaints about slow turnaround time on samples as just more unreasonable demands from an insatiable community of power-users.

Once we all got in front of the same board and saw ourselves as part of the same large production pipeline, things started to click.

A bitter pill

Once we knew what was going on, it was clear that we had to drain that backlog before things were going to get better. It was a hard decision because it meant deliberately slowing input from the sequencers. We also had to rate-limit output from variant calling.

Once we adopted a practice of titrating work into the system only at sustainable levels, we were able to begin to call our shots. We measured performance, made predictions, hit those predictions, fixed problems that had been previously invisible, and added compute and storage resources as appropriate. It took months to finish digging out of that backlog, and I think that we all learned a lot along the way.
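
A toy model of that change: instead of pushing every new sample into the cluster as it arrives, admit work only when the amount in flight drops below a cap. The cap and sample names here are invented; only the pull discipline itself comes from the paper.

```python
from collections import deque

class PullPipeline:
    """Admit new work only while work-in-progress is below a fixed cap."""

    def __init__(self, wip_limit):
        self.wip_limit = wip_limit
        self.backlog = deque()   # work held upstream (e.g., at the sequencers)
        self.in_flight = set()   # work admitted to the compute cluster

    def arrive(self, sample):
        self.backlog.append(sample)
        self._pull()

    def finish(self, sample):
        self.in_flight.discard(sample)
        self._pull()  # completing work is what pulls the next sample in

    def _pull(self):
        while self.backlog and len(self.in_flight) < self.wip_limit:
            self.in_flight.add(self.backlog.popleft())

pipe = PullPipeline(wip_limit=2)
for s in ["s1", "s2", "s3", "s4"]:
    pipe.arrive(s)
print(sorted(pipe.in_flight), list(pipe.backlog))  # ['s1', 's2'] ['s3', 's4']
pipe.finish("s1")
print(sorted(pipe.in_flight))  # ['s2', 's3']
```

The key property is that work enters the cluster at the rate it leaves, so the queue (and the priority aging that comes with it) can never run away.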

All of this also gave real energy to Broad’s push to use Google’s cloud for compute and data storage. That has been transformational on a number of fronts, since it turns a hardware constraint into a money constraint. Once we were cloud-based we could choose to buy our way out of a backlog, which is vastly more palatable than telling world-leading scientists to wait months for their data.

Seriously, if your organization is made out of human beings, read their paper. It’s worth your time, even if you’re in HPC.

Surfing the hype curve

I’ve spent most of my career on the uncomfortable edge of technology. This meant that I was often the one who got to deal with gear that was being pushed into production just a little bit too early, just a little bit too fast, and just a little bit too aggressively for everything to go smoothly.

This has left me more than a little bit jaded on marketing hype.

Not too long ago I posted a snarky rejoinder on a LinkedIn thread. I said that I had a startup using something called “on-chain AI,” and that we were going to “disrupt the nutraceutical industry.”

I got direct messages from serious sounding people asking if there was still time to get in early on funding me.

Not long after that, a local tech group put out a call for lightning talk abstracts. I went out on a limb and submitted this:

Quantum AI Machine Learning Blockchains on the IoT Cloud Data Ocean: Turning Hype Into Reality

It's easy to get distracted and confused by the hype that surrounds new computing and data storage technologies. This talk will offer working definitions and brutally practical assessments of the maturity of all of the buzzwords in the title.

Somewhat to my horror, they accepted it.

Here are the slides. I would love to hear your thoughts.