Author: cdwan

The Freelancer’s Toolbox

A few years ago, I wrote a series of posts about the mechanics of consulting, developing business, and setting rates. People have told me that they found those useful, so here’s one about tools.

There is a big difference between “doing a little consulting” and going independent for the foreseeable future. You don’t need most (or any) of this infrastructure if you’ve just gotten clearance from your boss to do a one-off gig or if you’ve got a couple of months between roles. You also don’t need an LLC, a tax ID, a bank account, or a corporate credit card. Just track hours, income, and expenses on a spreadsheet, set aside about half of what you make for taxes, declare it all as consulting income, and move on with life.

A person could work for a long time without most of these tools. Many do. After all, why invest $9/mo in a time tracking app when you could just keep notes on a scrap of paper? That argument holds, and honestly it holds pretty well, for accounting and productivity software – again – as long as you’re just doing a couple of gigs per year. A well organized set of spreadsheets, a disciplined approach, and a couple of hours every month or so can take a person far.

It’s that “couple of hours” that eventually turns the corner, at least for me. Tooling up saves time, and the old saying about time is absolutely true for consultants.

With that as context: Here are the tools that I pay for rather than re-inventing them. They save me time, both in the data-entry part and also in the correcting-errors and the showing-my-work parts. The list is long because I’m a big believer in using tools that are built to do the specific thing I’m trying to accomplish rather than using terrible bolted-on functionality, hating it, and then working around it in Excel (looking at you, time tracking in QuickBooks).

JWZ’s law of software envelopment applies: “Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.” To this great statement I would humbly add a corollary: “We don’t have to use that garbage.”

The List

Time Tracking: If you’re going to charge for your time, you need to track it. Even on fixed-fee engagements, you should be able to quickly and accurately answer the question of how many hours you have worked for a particular client. Pro tip for clients: Asking “how much of our 32 hour not-to-exceed will we have consumed as of this afternoon?” is one of the fastest ways to separate the n00bs from the pros.

I use Toggl ($9/mo). It has a straightforward web interface, a browser plugin, and a mobile app. People also like Harvest, which has a robust free tier that does basically the same thing.

Password Management:

You should already be using a password manager.

Get a password manager. Seriously. Please go set it up. I’ll wait. Good password practices are the digital equivalent of hand-washing. Passwords are the bare minimum, and SMS messaging has security holes you could drive a truck through. For that reason, please also set up two-factor with app-based authentication and biometrics, and burn those post-it notes you’ve got stuck under your keyboard. Information hygiene, and its appearance, is doubly important for freelancers. We have access to many different systems and absolutely must not rely on memory or [SHUDDER] re-use to keep our clients’ systems secure.

I use LastPass ($3/mo !!!), though 1Password is also great. I do not recommend relying on browser plugins, Apple’s Keychain, or Google’s password manager for client passwords – mostly because they are not terribly portable if you wind up using client hardware. Also, if you’re like me, you’ve probably got a lot of other stuff on Apple and Google that doesn’t need to mix with your client credentials.

File Sharing: You’re going to need to share files with your clients and you need to be able to do it with at least some minimal consideration for information security. I use DropBox Professional ($19/mo). I bump up from the standard package for the “enhanced auditing capabilities” and versioned rollbacks. Box is also fine, though I find the interface a bit less intuitive. It’s possible to get by with Google Drive or Microsoft’s OneDrive, though I find Google makes it w-a-a-a-y too easy to overshare, and using Microsoft occasionally drops me into a nested hell of re-authentication that always seems to lead, somehow, to my XBox Live account.

Many clients will add you to their shared folder setup, which makes all of this unnecessary. Roll with this and assign a nice strong new password as part of getting it set up.

Microsoft: Go ahead and get yourself a subscription to Microsoft 365 ($9/mo). Just do it. I know that Google is free. Do it anyway. There is a center of mass around Microsoft’s productivity tools and it will save you time and effort to have your own copy.

Adobe: You will need to edit and digitally sign PDFs as part of the contract process. Adobe Acrobat Pro runs $20/mo. Having your own copy will (again) save you time.

Zoom: It is absolutely worth $12.50/mo to never have to cut a good conversation short at 39 minutes.

Sure, you can get by on Chime, Teams, Meet, WebEx or whatever [shudder] … but nothing says “I am totally legit and will show up to do the work I am pitching you” like a 41 minute Zoom meeting.

DocuSign: Having my own Docusign account ($10/mo gets me five “envelopes” per month) turns out to be super useful for pushing decisions forward when I run into a client who is unsure how, exactly, we go about formalizing agreements these days.

Accounting: Make no mistake, Intuit’s QuickBooks is awful, but it’s the best of a bad lot. Hateful as it is – the damn thing is going to live forever. My advice is to use it rather than falling in love with some easier to use, cheaper, more modern alternative that will eventually die like all the rest (I’m totally bitter about the times this has happened in the past).

The team at Intuit are world leaders in the ongoing competition among UI/UX folks to see how little of the screen area an app can dedicate to the thing that the user is actually trying to do. This makes the already high and ever rising price galling. I use their “Plus” offering ($90/mo !!!) because I occasionally subcontract out to other freelancers, I give my accountant a login at tax time, and I give my clients the option to pay by credit card or ACH without me having to deal with their digits.

Oh, but I do hate Intuit’s products. Everybody does. They are living up to what I said on LinkedIn a year ago: The correct rate is “the client is complaining about it but paying anyway.” Well played, Intuit, well played. We complain, but we pay.

Payroll: You don’t have to have a payroll company, but if you do, Gusto ($40/mo + $6/mo per person) is the best.

The reason you might consider having a payroll company (rather than just taking money out of the company whenever you need it) is a little thing called the “S-corp election.” This is a 100% legal and ordinary move that lots of people make every year. Without changing anything about your LLC, you (your accountant, really) can choose to be taxed a bit more like a company and a bit less like a person.

The benefit of the S-corp election is based on the difference between individual and corporate tax rates. The federal corporate tax is a flat 21%. Massachusetts (where I live and do most of my work) adds a flat 8%. Many freelancers find – after adding Medicare, Social Security, and the various other taxes to their base income tax – that their gross taxes are higher than the federal plus state corporate rates.

With the S-corp election you declare and pay yourself the lowest salary that you are able to claim with a straight face. Your payroll company pays you that, taking out all those individual taxes and issuing a W-2. You pay the (presumably lower) corporate tax rate on any remaining profits.
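To make the arithmetic concrete, here is a minimal sketch in Python that follows the simplified framing above. Every number in it is a made-up assumption except the 21% federal and 8% Massachusetts corporate rates mentioned earlier; the real rules are more involved, and the actual decision belongs with your accountant.

```python
# Toy comparison of "take it all as personal income" vs. the S-corp election,
# following the simplified framing in this post. Not tax advice.
gross_profit = 200_000        # what the business cleared this year (made-up number)
reasonable_salary = 90_000    # lowest salary you can claim with a straight face (made up)

individual_rate = 0.40        # assumed blended rate: income tax + Medicare + Social Security
corporate_rate = 0.21 + 0.08  # federal 21% plus the Massachusetts 8% mentioned above

# Option 1: everything flows through as personal consulting income.
tax_all_personal = gross_profit * individual_rate

# Option 2: S-corp election -- pay yourself a salary (taxed like a person, via payroll),
# then pay the corporate rate on whatever profit remains.
tax_s_corp = (reasonable_salary * individual_rate
              + (gross_profit - reasonable_salary) * corporate_rate)

print(f"All personal income: ${tax_all_personal:,.0f}")   # $80,000
print(f"S-corp election:     ${tax_s_corp:,.0f}")         # $67,900
```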

All the software you use to make that happen – from time tracking through accounting to payroll – is a business expense, which means it gets subtracted off prior to taxes.

I promise that this is not a scam.

Conclusion

All-in, that’s about $250/mo in software. It’s certainly not zero, but it’s not going to move the needle if you really commit to freelancing. It all comes out of your profits (AKA income) as business expenses, and you’ll save a -lot- of time.

There are other recurring expenses too, though they depend on your exact practice.

If you do in-person meetings, you should have access to a co-working space so you don’t have to meet clients in coffee shops. Many freelancers choose to pay for a postal address that’s not their home address, both for image and for privacy reasons. I use a minimally fancy American Express card ($500/yr) because, among other things, it lets me create custom statements per client for expenses.

I will leave all of that, and more, for a later post.

As always, I’m very interested to hear your thoughts.

Plenty To Do

One of the things that we’ve got going for us as technologists is that the underlying reality of biology changes pretty slowly. The human genome has been the same size for at least the last 10,000 years, and maybe as long as 300,000 years depending on who you ask. This means that 30 measurements of each of our 3.2 billion base pairs at (give or take) 10 bits per base yields the same ~120 gigabytes (GB) of raw files as it did back at the beginning of high throughput sequencing.

Back in the early 2000s, 100GB was a lot of data – more than the storage capacity of most workstations. Size limits on 32-bit filesystems could quietly truncate files, and a full gigabit connection to the internet was an expensive proposition. Merely storing and accessing the files correctly, much less getting any science done, was a solid day’s work for the journeyman technologist. These days cell phones and USB sticks hold terabytes and nobody bats an eye at using the cloud – at least not for technical reasons.

All of which gives little comfort because scientific ambition expands to fill the space created by technology.

[ME, CHEERFUL] Hey, yo! I’ve built that system you said you needed. The one to host and analyze a hundred thousand human genomes? I’m gonna take the win and cut out for a long weekend.
[BIOLOGIST, ENRAPTURED] What if we sequenced every single cell in a sample? There are thousands of cells per sample! We could use your whole system on just a couple of samples!
[ME, PALM ON FACE] That sounds great. That’s perfect.

Back in the day we measured gene expression using the ratio of intensities from competing varieties of fluorescent molecules crammed into literal divots on a plastic “microarray.” At the end of the day we just needed to store a couple of floating point numbers for each of, perhaps, a million locations of genomic interest. Even complex time-series experiments would generate a few megabytes of raw data. The metadata describing how the experiment had been conducted would often occupy more space on the disk (and more time for the analyst) than the data itself.

Then some (literal) genius was like “what if we used the DNA sequencer for that? We could -count- transcripts directly instead of relying on ratios” and the person with the single cell system chimed in with “oh have I got an idea for you.”

I was chatting with a group recently whose present-generation instrument generates 5 Terabytes (TB) over the course of a week or so. It’s cool science, capable of imaging both transcripts and proteins with sub-cellular resolution. It’s quite likely to reveal some fundamental stuff that will require us to once again update -all- the textbooks and retrain -all- the large language models.

They’re keeping up pretty well for the time being, but the next revision of their platform will increase data volumes fivefold while also reducing runtimes to a little over a day. It’s those geometric accelerations that we technologists need to watch out for. The thing where genomics accelerated faster than Moore’s law for several years was the result of innovations that made it simultaneously 1,000 times faster and also 1,000 times cheaper to sequence DNA.

[BIOLOGIST] What if there was stuff outside the germ line? Some sort of META-genome? Could we sequence that too?
[LAB] 100%
[CLINICAL] I bet we could measure circulating tumor and/or fetal DNA if we just kept on sequencing deeper and deeper.
[LAB] On it.
[GENE EDITING PEOPLE] Everybody, everybody, check this out!
[ME] I had a data model. It was a nice data model.

Not that I’m complaining. This is good, interesting, meaningful work. I’m glad to have lucked into an industry with enough technology challenges to keep me and everybody I know busy for our entire careers – provided that the money people can figure out how to get the industry working again.

I’m confident that future generations of technologists will have ample opportunity to place palm to forehead, breathe deeply, and say “oh wow that’s cool. Yes, of course I can support that, I just don’t know how yet.”

Authorship Indeterminate

I’ve noticed a shift in the AI conversation lately. Folks seem to be converging on the idea that “AI” means systems that create artifacts (text, images, video, sound, code) so similar to those made by humans that it is hard – for the untrained eye at least – to tell the difference. The underlying technology and algorithms certainly overlap with machine learning, data science, and statistics, but it’s the “human like” part that makes it AI these days.

It’s more “authorship indeterminate” than “artificially intelligent.”

It’s fun, but not terribly useful, to debate whether these systems might be or become self-aware in anything like the way that we believe ourselves to be. Humans map consciousness and intentionality onto all kinds of stuff. When my cat meows and pesters me, my instinct is to ask what she wants – as though she was a tiny cat-person. I don’t spend too much time worrying about whether there is really a tiny cat-person inside her fuzzy head.

For what it’s worth, I do totally believe that there is a tiny cat-person in there. I also believe that there is a full-size person in your head – for mostly the same reasons.

We’ve all heard the trope about how consciousness arose because our ancestors outcompeted our not-ancestors by being better at identifying predators in the shadows. Survival of the fittest and all that. I prefer the emerging story that the most successful communities among our ancestors were the ones who best supported, carried, and cut each other slack. As Dr Lou Cozolino says, those who are nurtured best, survive best. Most people understand that it’s better, in a purely self-interested sense, to live in a community of mutual support, and that the way you get that sort of society is by supporting the people around you.

“Do unto others” rests on a belief in others. AI puts that assumption in question.

For me, the biggest risk of AI is that indeterminacy of authorship – the knowledge that there is no human behind most of the words, sounds, or images around us – will prove corrosive to society itself. We have all become jaded to the fact that the person on the other end of the unexpected phone call, letter, or email is not actually looking out for our best interests. Increasingly, we’re cynical that there was ever a person on the other end at all.

Did a bot write this? LOL!

For all that I’m hugely optimistic about the promise of technology to make life better – we also always need to guard against the downsides. We are entering a period where generative AI systems will make it much more difficult to tell the difference between authentic communication and fully-automated manipulation. I’m an optimist in the long run, but I think that this is going to get worse before it gets better.

It is critically important that we, collectively, figure out how to maintain our shared reality and mutual self-interest in the face of systems that cause any reasonable person to doubt. Without undue hyperbole, I think that this could well be the most important work of our age.

I’m interested to hear your thoughts.

A Tale of Three Conferences

I attended three industry conferences (Bio-IT World, Rev4, and BIO) over the last four weeks. This post shares my big-picture takeaway from each conference, as well as a bit about how they stitch together.

I think that I may be the only person who attended all three of these shows. If I’m wrong about that – please reach out! I’m curious to connect and share notes. Even if you only attended two, one, or even none at all – I would still appreciate hearing your thoughts on the themes of biology, technology, AI, and the biotech industry writ large.

My core conclusion is this: Tools change but purpose endures. More on that at the end of the post.

Bio-IT World

The Bio-IT World conference in Boston has been something of a touchstone in my professional life for more than two decades. My recent post, The More Things Change, was informed, among other things, by a trip down memory lane where I scrolled through presentations and agendas from prior years.

It’s not entirely hyperbole to say that half the industry is busy refining large language models (LLMs) to accelerate the creation of manuscripts and regulatory filings, while the other half is building knowledge graphs and domain specific ontologies and using natural language processing (NLP) to cope with an already unmanageable tide of manuscripts and filings. Half of us are building machines to automate the smashing of data down into text and figures while the other half are trying to figure out how to automate the un-smashing process.

What this means is that we are still, as an industry, organized along pre-digital lines. We still assume legacy roles and processes that require humans to write notes to each other across certain organizational lines. Some larger and more forward looking organizations (including several pharmaceutical companies) are overcoming this with internal efforts to de-silo and go digital – but the handoffs between organizations, especially with regulators and the public, still assume human authors and human readers.

Rev 4

I had no idea what I was in for at Rev4. I figured that attending an AI conference in Manhattan really couldn’t be wrong – especially when the keynote roster included Neil deGrasse Tyson, Steven Pinker, Steven Levy, Karim Lakhani, and Cassie Kozyrkov.

My big takeaway was that nobody has all the answers, there is a lot of “hustle and hope” going on, and there is going to be no shortage of interesting, important work for those of us who can keep our eyes on the prize and focus on KPIs and ROIs rather than on some particular tool or other.

Kozyrkov’s frankly brilliant and quotable keynote made a distinction between “thinking” and something she termed “thunking,” likening it to “the sound made by a brick, dropped from shoulder height, or perhaps a lovely evening of manual data entry.” She urged the audience to focus on thinking and leave the thunking to the machines.

She also drew a delightful parallel between “prompt engineering” for LLMs and any sort of management communication. Code is more precise but less expressive than language, so we should expect that it will take a couple of iterations to sufficiently specify the context and intent – no matter whether the audience is an LLM or a human engineer (or a human engineer using an LLM). “Don’t give instructions like a jerk,” she said.

Many of the conference themes were tautologies. How do AI leaders lead? Why they lead with AI of course. Who is going to “win” with AI? Well it’s the people who are “all in” on AI.

As the meme says: “Why tho?”

The hype reminds me of cloud discourse circa 2013. Back then, no matter your enthusiasm and excitement, it was challenging to articulate potential drawbacks or challenges of a “cloud transformation.” Technologists risked being painted as luddite, server-hugging dinosaurs for bringing up uncomfortable facts like “renting is always more expensive than owning” and “okay, we won’t have the data center engineers anymore, but now we need cloud engineers.” It took about ten years for the industry to reach a sensible equilibrium on cloud. I expect a similar sort of timeline on AI.

BIO

I had never been to BIO before. It was by far the biggest of the three shows – with perhaps 15,000 people descending on Boston and taking over the entire seaport / convention district for several days. My takeaway is that it takes a village to get a drug over the finish line. There is a vast world of details, skillsets, teams, companies, and people who neither work directly on the science nor on the technology side of things. Without them all of our clever code and compounds would remain academic.

I also got a peek at how serious business development people approach a conference, and it is not at all the casual “wander the floor and see what’s up” random walk that I have historically taken. I tagged along to various networking events with friends in legal services, real estate, development, resales, and marketing. In doing so, I got a glimpse of the structure, discipline, tenacity, and skill that it takes to succeed at what they do – which is helping us to succeed at what we do.

Anybody who claims that they’ve got it all figured out, that it’s just this one simple trick, and that we won’t need most of the people in this industry is going to be sorely mistaken – just like all the other hopeful hustlers who preceded them.

Purpose

I said at the top that “tools change but purpose endures.” This showed up most clearly at Rev4, but it was certainly present as a theme at Bio-IT and at BIO. Organizations that get distracted by the new shiny, that lose sight of the ‘why’ in favor of the ‘how,’ are going to encounter AI as friction. We are replacing a coal engine with an electric engine – but getting where we want to go remains the point of it all.

This connects back to the data smashing / re-hydration challenge I saw at Bio-IT. We continue to act as though manuscripts and filings are important in their own right. The real goal, for me at least, is to remove technology as a barrier to safer, more effective therapies and longer, healthier lives.

As I’ve said before, I’m still optimistic, and still all in.

Sequencing depth, how much is enough?

This is the fifth in a series of high-level posts reviewing foundational concepts and technologies in genomics. The first four were “How Many Genes Does a Person Have,” “How do Genomes Vary, Person to Person,” “Sanger Sequencing,” and “Sequencing by Synthesis.”


The “depth” of a genome sequence refers to the number of times that we observe each location on the chromosome. A “30x whole genome” means that we have – at minimum, or on average, or for at least one location (check the fine print) – gotten 30 observations at some or all of our ~3.2 billion chromosomal locations. This has emerged as a sort of industry standard. This post explores how that might have happened and whether it’s still the right answer.

There are two main pressures that drive us to sequence deeper:

  1. It drives down error rates
  2. At great depths we can reliably detect trace quantities of tumor (liquid biopsy), fetal (non-invasive prenatal tests), or other non germ-line DNA.

The cost of instrument time and reagents creates an obvious pressure to minimize, or at least optimize, sequencing depth. The costs of data storage and analysis act more subtly in the same direction.

Compute requirements for primary bioinformatic analysis (alignment) scale linearly with depth, and memory requirements can grow even faster than that in comparative (secondary and sometimes tertiary) analyses. I have seen more than a few mutational analysis pipelines crumble when teams casually swapped in deeper sequences with way more reads. The files are bigger, so you should expect any particular process to take longer. Comparative processes go geometric on RAM.

How much disk space does a 30x genome take up?

Since I mentioned data volumes, here is how to derive approximate file sizes for your sequencing data. I’ve gotten paid to answer this question so many times, it’s time to give it away for free.

Each base pair consists of a call – a letter in the four letter alphabet (G, A, T, C) – and an integer for the quality score. Our starting point for estimating data volumes is to represent the call with two bits (four possibilities) and the quality score as an 8-bit integer. That gives us 10 bits per base: 1.25 bytes per base.

Yes yes, we could do better than 8 bits for that integer – especially since the only thing that really matters at the end of the day is whether it’s above or below a particular threshold. More on that later, and figuring out why the industry has not harvested that particular low hanging fruit is left as an exercise for the reader.

Anyway. There are, give or take, 3.2×10⁹ (3.2 billion) chromosomal locations in the human genome, so each 1x whole genome ought to take around 4×10⁹ bytes. Thirty of them (a 30x WGS) ought to take up around 1.2×10¹¹ bytes – approximately 120 gigabytes. I say approximately because our megabyte / gigabyte notation is a little weird.

Conventional metric prefixes for large numbers are defined around powers of 10:

  • 10³ kilo
  • 10⁶ mega
  • 10⁹ giga
  • 10¹² tera
  • 10¹⁵ peta

Data storage was historically defined in terms of powers of two (for reasons). This approximates but does not equal the powers of ten.

  • 2¹⁰ or 1,024 bytes in a kilobyte
  • 2²⁰ or 1,024² or 1,048,576 bytes in a megabyte
  • … and so on

All the standards bodies (IEEE, ISO, the EU, and so on) agreed in the 90s and early 2000s that this was deeply confusing, so they decreed that data volumes shall be defined just like every other metric prefix – in terms of powers of ten. The powers-of-ten units get the old dignified suffixes (KB, MB, GB, and TB) while the powers-of-two units get a silly sounding series (Kibibyte, Mebibyte, Gibibyte, and Tebibyte … with silly suffixes KiB, MiB, GiB, TiB).

This declaration has, of course, changed no hearts and no minds, so now those of us who are aware of the situation just specify both.

So to be really persnickety, 1.2×10¹¹ bytes is about 111.76 GiB, which is the same amount of data as the 120GB I quoted above.
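If you would rather let the computer be persnickety for you, here is the same arithmetic as a minimal Python sketch, using the numbers from above (3.2 billion locations, 10 bits per base, 30x depth):

```python
# Back-of-the-envelope size of a 30x whole genome, using the numbers above.
genome_locations = 3.2e9  # ~3.2 billion chromosomal locations
bits_per_base = 10        # 2 bits for the call (G, A, T, C) + 8 bits for the quality score
depth = 30                # a "30x" whole genome

total_bytes = genome_locations * bits_per_base * depth / 8

print(f"{total_bytes:.2e} bytes")                        # 1.20e+11 bytes
print(f"{total_bytes / 1e9:.0f} GB  (powers of ten)")    # 120 GB
print(f"{total_bytes / 2**30:.2f} GiB (powers of two)")  # 111.76 GiB
```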

As a side note, you should never ever ever use “GB” by itself in bioinformatics or genomics and expect people to understand what you meant. It’s easily confused with “giga base pairs,” which is helpfully abbreviated “GB” by the sequencing folks. Always spell it out, whether it’s gigabytes or gigabases, and speak not of tebibases. This is forbidden.

Real talk about redundant storage

Nobody actually has 120GB sitting around on disk for each of their 30x whole genome sequences. Most teams have somewhere between two and six times that amount, with some but not all of it compressed.

Two to sixfold redundancy. Why? Because we are, collectively, idiots.

When data comes off an Illumina sequencer, it lands in a “run folder” in a precursor format (BCL, for “base call”) where data is organized cycle by cycle from the sequencing reaction. There is one and only one thing that anybody ever does with a run folder, which is to put it through a tool called “bcl2fastq” which converts BCL to FASTQ (get the name?).

Despite the fact that there is no value added between BCL and FASTQ, many organizations retain their run folders forever. If you keep one copy, that’s one whole extra copy of the data. If you also back up that one extra copy to a redundant site / geography (which is not usually a bad idea) … that’s two extra copies.

Seriously, delete your BCLs, they’re an instrument artifact. Yes, I am familiar with the various regulations around clinical data. They don’t really apply here. FASTQ was the landing point and experimental result. BCL is an accident of history that Illumina is working on fixing by putting base calling inside the instrument.

FASTQ files are organized by read, with all the base calls that share the same molecular bar code grouped together along with their associated quality scores. FASTQ and BCL contain the same information content and will compress to approximately the same size.
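For the curious, a FASTQ record is four lines of text: a read ID, the base calls, a “+” separator, and the quality scores encoded as ASCII characters (Phred+33 on modern Illumina instruments). Here is a minimal sketch of walking through a gzipped FASTQ file; the file name and the Q30 filter are just illustrative.

```python
import gzip

# A FASTQ record is four lines: read ID, base calls, a "+" separator, and
# per-base quality scores encoded as ASCII characters (Phred+33).
#
#   @INSTRUMENT:RUN:FLOWCELL:1:101:1234:5678 1:N:0:ACGTACGT   <- hypothetical read ID
#   GATTACAGATTACAGATTACA
#   +
#   IIIIIIIIIIIIIIIIIIIII

def read_fastq(path):
    """Yield (read_id, sequence, quality_scores) tuples from a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().strip()
            if not header:
                break
            seq = handle.readline().strip()
            handle.readline()                           # the "+" separator line
            quals = handle.readline().strip()
            scores = [ord(ch) - 33 for ch in quals]     # Phred+33 decoding
            yield header[1:], seq, scores

# Example: count reads where every base call is at or above Q30.
# n_good = sum(1 for _, _, q in read_fastq("sample_R1.fastq.gz") if min(q) >= 30)
```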

There are three things we ever do with FASTQ reads:

  • Align them to a reference genome
  • Perform a reference free alignment of the reads against each other.
  • Align them to an exome and count up transcripts to try to estimate prevalence and hope it maps to protein expression.

The output of alignment is a Binary Alignment Map, or BAM, file. The BAM contains all the same information content (base calls and quality scores) as either the FASTQ or BCL (assuming that you are not an idiot and include the unaligned reads), but it sorts them into a common frame of reference suitable for downstream use. BAMs are generally -already- compressed by the various tools used to create them. They may superficially appear smaller than the FASTQs and the BCLs … but once you compress the precursor formats you will find that Shannon was right and they’re all the same size at the end of the day.

There’s your third copy. Six if you’re making redundant copies of all of this.

Related: I keep hearing people ask whether or not domain expertise matters in either data science or IT. I assure you that domain expertise matters very much in both domains. Factor of six, I say.

There are several tools that do really good domain specific compressions of FASTQ and BAM. Illumina is pushing the Original Read Archive (ORA) format as part of their DRAGEN hardware accelerator package. PetaGene has some truly impressive results (in addition to other features around cloud transparency). Go ahead and check those out, but also remember the big picture thing. Compression is great, de-duplication is even better, and don’t get me started on unmapped reads.

Fun fact: The Q in FASTQ stands for quality, but the rest of the letters don’t stand for anything at all, at least so far as I know.

Back to Sequencing depth

Illumina’s quality metric, ‘Q’ is a function of the estimated error rate, which is defined as Q = -10·log₁₀(e). At Q20, we expect an error rate of one in 100. Most of us use a hard threshold of Q30 for base calling, leading to an expected rate of incorrect calls of 0.1%, one in a thousand or 10⁻³.

If we observe each locus on the chromosome once, we expect 3.2×10⁹ × 10⁻³ = 3.2×10⁶, or a little over 3 million errors. As an engineering mentor was fond of saying – “even a very small thing, if it happens many many times over, bears watching.”

Our second observation produces an additional 3 million errors, but the vast majority of them will be in places where we previously had a good read. That gives us 6.4×10⁶ (about six and a half million) loci with a single error and 3.2×10⁹ × (10⁻³)² = 3.2×10⁹ × 10⁻⁶ = 3.2×10³ = 3,200 locations where -both- observations are incorrect.

Individual errors accumulate linearly while the sites with overlapping errors drop geometrically. We use this as a sort of filter. When we’re confident that all sites have enough good reads – we can stop.

For locations with multiple errors, we have no guarantee that the incorrect base call will be identical from read to read. Sometimes we get the same wrong answer several times in a row, and sometimes not. If we assume that errors at any location are evenly distributed across the three possible incorrect bases (why not?), at 2x sequencing we expect about 1,100 locations where we have what looks like a very confident base call that is actually two errors stacked on top of each other.

A first plateau

The lowest depth where we expect less than one location to be entirely composed of errors is 4x: 3.2×10⁹ × (10⁻³)⁴ = 3.2×10⁹ × 10⁻¹² = 3.2×10⁻³. However, that’s not good enough, because we will be left with 3.2×10⁹ × (10⁻³)³ = 3.2×10⁹ × 10⁻⁹ = ~3 locations where three out of four of the reads were errors. Sure, we’ll recognize that something was up at that site (most of the time, unless the errors align), but we’re interested in confident calls at all loci – not merely turning our unknown unknowns into known unknowns.

The lowest depth at which we expect fewer than one location where errors make up the majority of the reads is 7x. Three stacked errors are expected at ~3 locations genome-wide, which at 6x would be a tie; at 7x it takes four stacked errors to form a majority, and we expect fewer than one location with four.
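Here is the same back-of-the-envelope model as a few lines of Python. It uses the simple accumulation logic from above – a Q30 error rate, ~3.2 billion loci, independent errors, and no binomial bookkeeping – so treat it as a sketch of the reasoning rather than a proper coverage model.

```python
# Expected number of loci where k observations are all errors, using the simple
# accumulation model above: Q30 error rate, ~3.2e9 loci, independent errors,
# uniform coverage, and no binomial coefficients.
GENOME_LOCI = 3.2e9

def phred_to_error_rate(q):
    """Invert Q = -10 * log10(e): Q30 corresponds to an error rate of 1e-3."""
    return 10 ** (-q / 10)

ERROR_RATE = phred_to_error_rate(30)  # 0.001

def expected_stacked_error_loci(k):
    """Expected count of loci where k independent observations are all errors."""
    return GENOME_LOCI * ERROR_RATE ** k

for k in range(1, 5):
    print(f"{k} stacked errors: ~{expected_stacked_error_loci(k):g} loci")

# 1 stacked error : ~3.2e+06 loci  (a little over 3 million)
# 2 stacked errors: ~3200 loci
# 3 stacked errors: ~3.2 loci
# 4 stacked errors: ~0.0032 loci   <- the "first plateau" at 4x
```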

Recall though, that humans have two copies of each of our chromosomes, meaning that two different observations at the same supposed location might be entirely correct and consistent with the underlying biology. While laboratory techniques do exist to do “phased” sequencing that tags and tracks the two copies of each chromosome independently, they are expensive and introduce still more possible errors.

As an aside, overcoming the phasing gap right on the sequencer without complicated barcoding is one of the most underrated aspects of long read sequencing like Oxford Nanopore and Pac Bio.

Simplistically, you might expect to have to double the sequencing depth (all the way up to 14x) in order to confidently sample twice as much chromosomal terrain. Of course, most loci are homozygous, so you could bring that number back down again if you had a really accurate estimate of the rate of heterozygous sites. This would require deeper sequencing first, on a very large number of individuals, in order to have sufficient confidence in the frequencies of heterozygous sites.

It never ends.

Here there be dragons

Beyond this point the math gives way to big round numbers and hand waving.

The relative abundance of different reads at a heterozygous location is important in quality control. If we have a site where half of the observations say “G” and the other half say “T”, that matches our understanding of the biology in most people. On the other hand, if we see a distribution other than 50/50 at these sites, particularly if it’s 25% / 75%, that is a strong signal of contamination: the data look like they came from four copies of a chromosome rather than two.

Or maybe tumor, fetal, or transplant DNA – as mentioned above. It never ends.

Similarly, the abundance of reads at a particular location can be used to infer the number of repeats or length of copy number variants, and can even provide clues about large scale re-arrangements.

Either way, we need far more than confidence that we’ll make any particular call correctly – we need to have the error rates all the way down in the noise.

So, like, 30x? For research? And like maybe 100x for clinical?

I know people who do this sort of math for a living, and I know that they would cringe at my rough and ready arithmetic.

It’s not wrong though, and seriously – do check on that whole “mean vs. median coverage and over what fraction of the genome.” There’s some serious dirt under that rug in certain shops.

Overcoming Ops Debt

I would like to talk about tech debt’s sneaky sibling, something that I think of as “operations debt.”

Tech debt is the accumulated burden of shortcuts, approximations, defects, and hacks that creep into a product or system as the team focuses on production rather than perfection. A little bit of tech debt is a good thing. Shipping is a feature and your product needs it. There is no virtue in spending additional sprints putting a high shine on a cannonball. In excess, though, tech debt creates an ongoing burden on the team, both because brittle and imperfect code is harder to support – and also through a cumulative duct-tape-on-duct-tape effect that eventually necessitates the dreaded and much maligned rewrite.

Operations debt is a related concept, and might even be a subset of tech debt. It’s the ongoing burden that a team or individual endures when the broader organization continues to rely on them for help with stuff that either never was or is no longer supposed to be their job. The reward for making something useful is an endless parade of people who seek you out for just one little tip, trick, tweak, or bit of advice or support. If you make a useful open source tool, you wind up supporting the whole world unless you are -spectacularly- good at setting boundaries.

The debt metaphor is a good one: Let’s say, for example, that you want to have friends over to watch the big game on Saturday, but you don’t have a TV and you don’t have cash. You -really- want to have a party, so you buy the TV on credit. Everybody has a great time, but when you wake up on Sunday you are left holding a big TV, good memories, and also a monthly payment that cuts into your future finances until you pay it off.

I once had a report who – despite being a director and manager of managers by the time we met – would still come back to his office from time to time and find senior members of other teams sitting on his desk, wanting updates to that spreadsheet he made back when he was “Andy from the lab.”

Operations debt is insidious. Without a deliberate effort to surface it, ops debt doesn’t show up on the backlog, doesn’t get estimated or assigned story points, and can’t be distributed or re-assigned across the team. Your best contributors – the authors of the stuff that gets broad use, and the ones that everybody likes because they are so nice – simply become slower and less productive over time.

Most managers underestimate how little distraction it takes to bring an individual or a team’s velocity to zero.

Curing operations debt takes time and effort, but transparency and accountability – coupled with a commitment to keeping the team moving forward without abandoning the broader organization – can get the job done.

The virtue of having one (1) front door

My starting point for managing ops debt (as well as a host of related challenges) is fourfold:

  • Create a single ticket system for all requests.

    Jira, RT, or even ServiceNow are all fine. I’ve even seen integrations between Slack and Google Forms work – though I question the wisdom of creating yet another chunk of bespoke software that will eventually need to be supported. The particular technology matters less than the organizational and management commitment behind it.
  • Assign an Ops lead / quarterback who is responsible for triage, dispatch, and communication. This person should be technically competent, trusted by the organization, and comfortable holding their ground and pushing back.

    It’s important to note that the ops lead is not expected to actually do all the requested work any more than the quarterback is expected to throw, catch, and block all at the same time. Their job is to -surface- the operational debt in a format that the team can track along with all the rest of their commitments. If things go well and you have the resources, give the operations lead a modest team and ask them to constantly be reducing the most common requests to practice – writing playbooks that allow junior team members or contractors to do the work, or even scripting up self service solutions that nip the debt in the bud.
  • Tell everybody from the CEO on down that they have to use the ticket system or talk to the ops quarterback when making requests of the team. If it doesn’t exist in the ticket system, it doesn’t exist.
  • Practice radical transparency and gently but firmly allow senior leadership to do the job they should have been doing all along – establishing clear priorities for those one-off requests that are leeching the hours in the days away from individual contributors.

Radical Transparency

We are all familiar with trouble ticket systems that function as inscrutable tombs for requests and complaints. That’s why I recommend making the operations queue visible to the entire organization.

Radical candor is the upper right quadrant where we both care deeply and also engage directly. It exists in contrast with ruinous empathy (care but don’t engage – thoughts and prayers), obnoxious aggression (don’t care, just here to tell you what’s wrong), and manipulative insincerity (bless their hearts). Radical candor can be scary at first, but it has become my go-to over the last decade.

Also, I believe that everybody makes better decisions when they have access to better information, so why not show people what’s really going on?

Questions like “what the heck else is the team working on?” and “where am I in the queue?” ought to be self service. At the very least there should be one (1) point of contact who is able to provide authoritative answers. That’s your ops lead / quarterback. Over time, radical transparency pushes questions of inter-departmental prioritization back up the chain of command – where they belong. It’s wildly unfair, yet utterly ordinary, for senior managers to abdicate their core duty and push these questions of priority down on team leads and individual contributors – who try diligently even as the strategic objectives slip away.

TL;DR: Don’t just go sit on Andy’s desk.

People Change Slow

Here’s the fun part: You’ve got to wait 3 to 6 months for people to stop hating the change.

Changing human patterns of behavior is unbelievably slow. Any time you restructure an organization or change a procedure, you can expect 6 months of confusion and complaint. People liked the old way better. They don’t understand this new thing. Why can’t they just sit on Andy’s desk until he updates the spreadsheet like he did last month? Who even uses this ticket system anyway?

Because Andy has a different job now, that’s why, and we’re making space for them to do it.

At about the six month point, you will see the first glimmers of daylight on improvements due to the new way of working. In about 18 months, people will have forgotten about the old way entirely. It really seriously does take 6 months to see the benefits of a reorganization and 18 months before it will feel right to the team. If you make further changes, it resets that clock and corrodes trust that management knows what the heck we’re doing.

That long timeline and inevitable complaint about complexity is one reason to keep it utterly simple. There is one (1) intake for operations support for the team. There is one (1) person whose job is to triage and communicate. There is one (1) master backlog of all the outside requests, and it is force ranked. You can go look at it yourself, and if you don’t like what you see – don’t take it up with Andy, take it up with me.

Trust me, the team will be happier and more productive for it.

The more things change

I used to make a pretty decent living installing the Linux operating system on bare metal servers.

Nobody does that anymore, or at least nobody brags about it on social media, which is basically the same thing.

It was a good gig. I traveled the world, spent a lot of time in rooms with -really- good HVAC, and became so familiar with the timing of the various hardware and firmware processes that I could reliably run five to ten concurrent installations off a single Keyboard / Video / Monitor shakily balanced on a utility cart – swapping cables and hitting keys without having to look at or even connect the screen – like some sort of strange cyberpunk organist.

Eventually BOOTP and IPMI made the details of the hardware timings mostly irrelevant. Servers were equipped to wake up and look for guidance over the network rather than waiting for some junior engineer to plug up a keyboard or a mouse. I pivoted to making a decent living tuning and massaging the moody and persnickety software that hands out IP addresses, routing configurations, hostnames, and instructions around disk partitioning and default boot orders.

Over time, technology ate that too – so I kept moving up the stack – using tools like Rocks, and eventually Chef and Ansible, to make sure that the servers, now contentedly installing their own operating systems, would also automatically pull down updated code for software services. The very best of these systems developed a certain dynamism – checking themselves out of work from time to time to automatically pull down updates and potentially entire new software stacks.

Eventually, in 2007 (ish) Amazon launched EC2, S3, and SQS. At that point it became completely ordinary to have computer programs defining both infrastructure and software. Over time, containerization (Docker), markdown (Rstudio and Jupyter), and Terraform further blurred the lines. Then Kubernetes came along and strung it all together in a way that made my cyberpunk-organist heart sing.

Rewind:

Around the same time that I was installing Linux on servers, I maintained an HTML file – housed on a particular server in a particular building in Ann Arbor Michigan – that linked out to a few dozen or maybe even a few hundred other files on other servers in other (mostly) college towns all over the world. A few dozen, or maybe a few hundred, or even maybe a few thousand people used that file as a stepping stone in their very own personal quest to find information about a thing.

The thing was college a cappella music like the kind in Pitch Perfect. Forgive me.

Then two guys named Larry and Sergey started a company that used computer programs to trawl and index the internet … dynamically generating HTML in response to arbitrary queries. These days the very idea of a -manually- curated list of URLs seems utterly quaint and bespoke. Wikipedia came along as a way to crowdsource a curated reference for a world full of topics.

Now we have Large Language Models (LLMs) like ChatGPT and the rest, which are frankly brilliant tools to provide not just links into documents but dynamically generated and context-sensitive summaries with exquisitely accurate language.

The Future

Here’s my optimism: At no point in this whole story was I “replaced.”

Jobs are certainly going to change. People (and algorithms) who previously brought value by reading, summarizing, and regurgitating pre-existing text … well … that particular job is going the way of my cyberpunk organist gig from the late 90s. It will be uncomfortable to realize that many things we thought were important and unique – like writing summaries for our managers – are actually fairly mechanistic.

I intend to continue writing my artisanal blog posts for the same reason I always have – regardless of whether anybody else reads these words – I get value from the act of writing. As the old quote goes: “Nothing so sharpens the thought process as writing down one’s arguments. Weaknesses overlooked in oral discussion become painfully obvious on the written page.”

I remain optimistic: No matter how clever I ever got about automating computers to install their own operating systems, I was always left with the question of what we were going to use those computers to do. No matter how effortlessly we manage to summarize what has come before, the question of where we are going will remain.

LLMs and other generative AI technologies are absolutely amazing at recapitulating the past but – so far at least – they have yet to be able to direct us into the future.

And that’s cause for optimism.

Sequencing by Synthesis

This is the fourth in a series of high-level posts reviewing foundational concepts and technologies in genomics. The first three were: “How Many Genes Does a Person Have,” “How do Genomes Vary, Person to Person,” and “Sanger Sequencing.” This one is about high throughput DNA sequencing, focusing on Illumina’s Sequencing by Synthesis (SBS) technology.


For the last decade or so, the market for high-throughput DNA sequencing instruments has been utterly dominated by a single company: Illumina. Their “Sequencing By Synthesis” (SBS) approach was originally commercialized by Solexa, who launched the Genome Analyzer in 2006. Illumina acquired Solexa in 2007, and all of their instruments – both the lower throughput MiSeq and the higher capacity HiSeq and NovaSeq – have used variations on the same fundamental process. While other high throughput technologies have made significant inroads in the early 2020s, anybody working with sequence data should be familiar with the fundamentals of SBS.

Two keys to creating high-throughput laboratory processes are multiplexing and miniaturization. We want to concurrently run many different reactions in the same physical container (multiplex), and we also want to use as little stuff (molecules, reagents, energy) per reaction as we possibly can (miniaturize). This usually replaces one set of problems with another. While low throughput processes struggle with the level of effort and cost per reaction, higher throughput processes tend to be more complex. Batch effects and subtle errors inevitably creep in.

It was the combined benefits of miniaturization and multiplexing that drove the radical increase in DNA sequencing capacity and adoption of the early 2000s. High throughput technologies – mostly SBS – meant that sequencing was suddenly both 1,000-fold cheaper per base pair and also 1,000-fold faster per reaction. This caused the industry to, briefly, accelerate “faster than Moore’s law.” It’s important for industry watchers to realize that this acceleration was a one time thing, driven by specific advances in sequencing technology. Single molecule sequencing technologies have the potential to drive a similar change in the next few years by exploiting still further miniaturization (one molecule per read!) combined with exceptionally long read lengths.

Flowcell

The core of SBS technology is the flowcell – a specially prepared piece of glass and plastic slightly larger than a traditional glass microscope slide. Each flowcell has one or more lanes (physical channels) that serve, in an exceptionally broad sense, the same function as the capillary tubes from Sanger sequencing. Both flowcell lanes and capillary tubes are single-use containers into which we put prepared DNA, run a reaction, and from which we read out results. In Sanger sequencing, we prepare millions to billions of copies of the same fragment of DNA, synthesize it in a variety of lengths, and read out the results by weight. Sanger sequencing gives us one read (the order of the residues in a contiguous stretch of DNA) per capillary tube – using millions to billions of molecules to do it. In SBS, we load millions to billions of different fragments of DNA (all of about the same length, and tagged in special ways as described below), copy each of them a few dozen times, and then simultaneously generate millions of reads using high resolution imaging.

In order to spread out and immobilize the DNA fragments so we can keep track of them, the bottom surface of each lane in a flowcell is coated with a “lawn” of oligonucleotide primers. These hybridize with matching reverse complement primers appended to both ends of the DNA fragments to be sequenced. In a sense, these primed locations are the real miniaturized containers for sequencing, and the flowcell – despite its small size relative to human hands – is just a high capacity container.

Fragmentation

Most DNA sequencing technologies require consistently sized fragments of DNA as input. This is accomplished through a process called fragmentation. In enzymatic fragmentation, enzymes are used to cut (“cleave”) the DNA at random locations. By carefully controlling the temperature and time of the reaction, it is possible to achieve consistent fragment lengths. The alternative is sonic fragmentation or sonication, which uses high frequency vibrations to break longer molecules into consistent lengths. Sonication is less sensitive to variations in timing and temperature, but requires additional manipulation of the samples and dedicated instrumentation.

Whatever approach is used, variability in fragment lengths leads to erratic performance of all downstream steps in the process.

Capture

Often, we want to sequence only a subset of the genome. Whole Genome Sequencing, where we sequence everything, is the exception. For example, in an exome, we might want to sequence only the exons that code for proteins. Panels are even more selective, picking out just a few actionable genes and regions of clinical or research interest. Just as with Sanger sequencing, manufactured oligonucleotide primers or baits are used to capture and select out exactly the bits we want while the rest are washed away. Sets of baits developed for a particular purpose are referred to as a capture kit, and the process of selecting out the targeted DNA is somewhat casually called capture.

The attentive reader will notice a substantial amount of chicken vs. egg in this process. We’re sequencing because we don’t know the full DNA sequence present in the sample, and yet we are targeting our efforts using techniques that assume quite a lot of that same knowledge. In a general sense, both things can be true. Genomes are remarkably consistent from person to person and every exon contains highly conserved regions. However, for any sort of detailed analysis, particularly when working with rare or complex variants, it is important to keep in mind the layered stacks of assumptions and biases that go into the data. Also bear in mind that at this point in the process we are still talking about strictly chemical manipulations. We are nowhere near alignment and mapping, which are the algorithmic reconstruction of longer sequences out of shorter ones.

Primers, Molecular Bar Codes, and Multiplexing

Having achieved a consistently sized collection of DNA fragments, and having washed away all the bits that are not of interest, it’s time to affix the primers mentioned above, as well as unique labels so we can tell one sample from the next after mixing them together (multiplexing). These indexes or molecular bar codes (more manufactured oligos) are snippets of DNA with well known sequences that will be unique within any particular lane on a flowcell. They get sequenced along with the samples, and are then used to computationally de-multiplex the reads.

Got that? We apply a physical tag to each fragment, sequence it, and then sort it out digitally on the back end.
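As a sketch of what that digital sorting can look like, here is a minimal demultiplexer in Python. The sample names and index sequences are invented for illustration, and the one-mismatch tolerance is a common convention rather than anything specific to a particular vendor’s tooling.

```python
# Minimal demultiplexing sketch: route each read to a sample based on its index
# ("molecular bar code"). Sample names and 8-base indexes are made up.
SAMPLE_INDEXES = {
    "ACGTACGT": "sample_A",
    "TGCATGCA": "sample_B",
    "GGAATTCC": "sample_C",
}

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(observed_index, max_mismatches=1):
    """Return the sample whose index is within max_mismatches of the observed
    index, or None ("undetermined") if zero or more than one index is that close."""
    hits = [name for idx, name in SAMPLE_INDEXES.items()
            if hamming(observed_index, idx) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None

print(assign_sample("ACGTACGT"))  # sample_A  (exact match)
print(assign_sample("ACGTACGA"))  # sample_A  (tolerates one sequencing error in the index)
print(assign_sample("AAAAAAAA"))  # None      (undetermined)
```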

Adding additional utility sequence to the DNA (like indexes) winds up as an information tradeoff. We can obtain more and more specific information about the history of a particular chunk of DNA by using some of the base pairs on every read as bar codes and tags rather than reading the sample itself. Highly multiplexed technologies like spatial transcriptomics use multiple layers of bar codes – reading less and less sequence from the original sample (which is presumably less important) in exchange for more and more detailed information about where it came from.

In any event, after all of this tagging, fragments from multiple samples are mixed together and loaded onto a lane of a flowcell. As mentioned above, the primers on the ends of the fragments anneal to the oligos on the surface of the flowcell – hopefully resulting in an evenly dispersed lawn of DNA – all attached at one end.

Clustering

The imaging devices used in modern Illumina instruments are not sensitive enough to consistently detect the fluorescence events from single molecules. To overcome this, a physical process called clustering is used to create a group of identical copies around the original fragment that attached at a particular location on the flowcell. Clustering starts by denaturing (separating) the DNA strand, which exposes the primer that was added to the free end. One copy (the one that is not bound to the flowcell) is detached and washes away. The molecule then flexes over to form an arch, binding to one of the nearby primers on the surface of the flowcell. Nucleotides are washed over the flowcell to create a matched pair to this doubly bound strand (a “synthesis reaction”), and then the DNA is denatured yet again, yielding two adjacent single-stranded fragments in reverse complement from each other. This process is repeated, building up clusters of sufficient numbers of molecules to be reliably detected.

It’s important to remember that this is -still- not a digital manipulation or a computer program. We’re dealing with massive numbers of molecules that all just do their thing in solution with very predictable results. Some DNA fragments will inevitably bind close enough to each other that their clusters will interfere. Others will be lost entirely during the series of denaturing and rebuilding steps. Some synthesis reactions will incorporate errors – substituting, omitting, or repeating stretches of one or more nucleotides.

Sequencing (finally)

Finally, we come to sequencing. This part of the process has a lot in common with Sanger sequencing, since it relies on the controlled addition of fluorescent di-deoxynucleotides (ddNTPs). Unlike the Sanger process, which builds out all the various possible lengths of DNA fragment simultaneously and sorts them by weight, SBS proceeds in highly regimented cycles, running one base pair at a time. Each cycle introduces a round of fluorescent ddNTPs to any DNA fragment with the correct residue at the first open position. The flowcell is then illuminated with a laser, and a high resolution image is captured that hopefully shows each of millions of clusters glowing in one and only one of the four frequencies associated with the four types of fluorescent ddNTP. The ddNTPs are then washed away and a single step of normal synthesis is allowed to proceed, advancing the ticker on each DNA fragment by exactly one letter.

The short form is that each cycle gives us information on a single nucleotide, in the same position, from each of millions of fragments of DNA, all at the same time.
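 
If it helps to see the bookkeeping, here’s a toy sketch of that step in Python. It assumes the images have already been boiled down to one intensity per cluster, per channel, per cycle (the genuinely hard part), and it simply picks the brightest of the four channels each time. Real base callers also correct for cross-talk and phasing, and they emit per-base quality scores.

    # Toy base caller (illustrative only). The intensity numbers are invented.
    CHANNELS = ["A", "C", "G", "T"]  # one fluorescent channel per base

    # cycles x clusters x channels
    cycle_intensities = [
        [[900, 40, 30, 25], [20, 35, 880, 50]],  # cycle 1
        [[30, 910, 45, 20], [850, 25, 60, 40]],  # cycle 2
    ]

    def call_bases(cycles):
        """Grow one read per cluster, adding one base per cycle."""
        reads = ["" for _ in cycles[0]]
        for cycle in cycles:
            for i, intensities in enumerate(cycle):
                # The brightest of the four channels is the called base.
                brightest = max(range(4), key=lambda ch: intensities[ch])
                reads[i] += CHANNELS[brightest]
        return reads

    print(call_bases(cycle_intensities))  # -> ['AC', 'GA']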

Illumina sells reagent kits for 50, 100, and 150 cycle runs, each of which can be either single-end or paired-end. In either case, the DNA is first sequenced from the 3′ end. In paired-end sequencing, the DNA is then allowed to arch over and anneal at the 5′ end (as happened during clustering) and is sequenced 5′ to 3′ for an -additional- 50, 100, or 150 base pairs per fragment. Paired-end sequencing provides a critical additional piece of information for downstream informatics, since (assuming that our fragmentation worked well) we know the relative orientation and distance between the two reads.
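 
Here’s a tiny illustration of why that matters downstream, assuming the two reads of a pair have already been aligned to a reference. The coordinates and the expected fragment length below are invented; the point is just that a proper pair should face inward and land roughly one library-fragment-length apart, and anything wildly different is a signal worth investigating (a structural variant, a chimeric fragment, or a bad alignment).

    # Illustrative paired-end sanity check. All numbers are placeholders.
    EXPECTED_INSERT = 350   # assumed mean fragment length from library prep
    TOLERANCE = 150

    def pair_looks_proper(r1_start, r1_is_reverse, r2_start, r2_is_reverse):
        """Check forward/reverse orientation and a plausible distance."""
        if r1_is_reverse == r2_is_reverse:
            return False                      # the reads should face each other
        insert = abs(r2_start - r1_start)
        return abs(insert - EXPECTED_INSERT) <= TOLERANCE

    print(pair_looks_proper(10_000, False, 10_340, True))  # True: a normal-looking pair
    print(pair_looks_proper(10_000, False, 55_000, True))  # False: something odd happened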

One cycle on a modern instrument takes around 10 minutes, which leads to runtimes between 11 hours (50 cycles, single end) and 48 hours (300 cycles, paired end).

The result of all this work is a stack of images which still needs to be computationally processed in order to produce reads. In the early days of high throughput sequencing, the raw images had to be offloaded from the instrument and processed (base calling) via a separate computing environment. At the time, there were endless conversations about whether there was enough potential value to merit retaining incremental data taken during the sequencing process – including the raw CCD images. The hope was that future algorithms might allow us to rescue marginal data or to detect errors more accurately. These images were huge, and the memory of their size still distorts some estimates of the scale of the data “problem” in genomics – mostly among certain data storage vendors who still can’t be troubled to know the difference. As costs have come down and volumes have increased, those conversations have tapered off. Most modern instrument vendors no longer even offer the option of downloading raw CCD images. Base calling now happens on-instrument, and the bioinformatic processes start with demultiplexing – sorting the reads out sample by sample.

I hope to cover the next couple of steps in the standard process – primary and secondary bioinformatic analysis – in a future post.

The hourly / annual pricing fallacy

I’ve heard from several independent consultants lately that prospective clients are pressuring them to cut rates and justifying it with the old fallacy about how your hourly rate should just be your annual salary divided by 2,000 working hours per year. I have written about how this is utter garbage in a longer piece about how to set rates, one of three “how to” pieces on consulting that I posted back in 2017. The other two were about business development and some basic mechanics of consulting.

TL/DR: As an independent, your top line rate (the first number you say, the starting point from which negotiations proceed) should be approximately triple that garbage number. You should be willing to consider discounting down to merely double the garbage number (a 33% discount from the top line) for steady, committed, enjoyable work. Under no circumstances should you start from the garbage number and attempt to negotiate up. That’s a trap, and clients who start the relationship that way will -always- be more trouble than they are worth.
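 
If you want to see that arithmetic spelled out, here it is in a few lines of Python. The salary is a made-up example; substitute your own.

    # The top-line / floor arithmetic from above, using a made-up salary.
    salary = 180_000                   # hypothetical annual salary for the role
    garbage_rate = salary / 2_000      # the fallacy number: $90/hr
    top_line = 3 * garbage_rate        # where you start: $270/hr
    floor = 2 * garbage_rate           # deepest discount, for great steady work: $180/hr

    print(f"garbage: ${garbage_rate:.0f}/hr, top line: ${top_line:.0f}/hr, floor: ${floor:.0f}/hr")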

Talking About Money

Conversations about money can be uncomfortable. Income and wealth (or the lack of either / both) are tied up in people’s minds with concepts of self-worth, success, importance, power, and control. That makes it harder than it needs to be to get good advice about pricing. It also creates a risk that people who feel uncomfortable talking about money will sell themselves short.

Being uncomfortable when talking about money will hold you back – whatever your role or field. It’s well worth your time to get to the point where you can discuss salaries, rates, raises, benefits, potential discounts, promotions, and so on without giggling, twisting your hands together, standing up and walking around, and (worst of all) blurting out something like “okay okay I could do it for half that much is that okay? Wait in fact never mind I’ll do it for free! OMG please like me!”

The pause between when you propose a price and when the other person responds is -super- important. You need to be comfortable giving them a chance to say “yes.” It’s part of a broader truth about the importance of (a) being clear about what you really need / want and (b) leaving enough space for other people to understand and accommodate it.

Anyway, back to setting rates.

Specifics

If you feel compelled to give specific examples of why consultants need to charge more by the hour to avoid shorting themselves, I’ve listed a few below. I can pretty easily justify a 50% markup on salary just by listing out all the stuff that would usually be part of the package for a salaried employee, but (and this is important) do not fall into the trap of being made to justify every dollar above a notional salary comparison. As I said above, clients who treat you like that before the work starts will tend to nickel and dime every aspect of the relationship. It’s far better to let somebody else have that work and find customers who see the value that you bring.

Anyway, here’s an incomplete list:

  • Self-employment tax (about 15% off the top)
  • Health insurance. The average family “Silver” plan costs more than $1k per month.
  • Dental and vision coverage (which strangely do not count as “health.”)
  • Short and long term disability (you need this even more if you are independent)
  • Professional “errors and omissions” insurance (sensible clients will require this)
  • Tax advantaged retirement contributions
  • Vacation and sick time
  • Equipment (laptops, software licenses, space to sit and work)
  • Training to stay current, attendance at conferences, professional development
  • Non-salary compensation like bonuses and equity (really significant for senior folks)

Independents also do a ton of non-billable work. Personally, I spend the whole rest of my work week (everything that is not taken up with paid work) on marketing (this blog post counts!), reaching out to prospective clients, screening them (do they have money? if they have money, will they choose to pay me or tell me stories?), writing SOWs, negotiating terms, invoicing, reminding people about invoices, listening to excuses about late invoices, and so on.

Consulting firms, like legal practices, usually aim for at least 80% billable. I’ve been in consulting on and off for 20+ years, independent for about half of that. Trust me when I say that maintaining 80% billable is far from easy.
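 
If you want to sanity-check the “triple it” advice yourself, here’s a back-of-the-envelope version in Python that combines the overhead list above with an honest billable fraction. Every number in it is a placeholder; plug in your own.

    # Back-of-the-envelope effective-rate check. All figures are placeholders.
    target_take_home = 180_000    # what the equivalent salaried package would pay
    overhead = {
        "self_employment_tax": 0.15 * target_take_home,
        "health_dental_vision": 15_000,
        "disability_and_e_and_o": 5_000,
        "retirement_match_equivalent": 10_000,
        "equipment_software_conferences": 8_000,
    }
    working_hours = 2_000         # the classic 50 weeks x 40 hours
    billable_fraction = 0.5       # realistic for a solo shop, not the 80% firm target

    billable_hours = working_hours * billable_fraction
    required_rate = (target_take_home + sum(overhead.values())) / billable_hours
    print(f"roughly ${required_rate:.0f}/hr just to match the salary package")

Even before accounting for vacation, dry spells between engagements, or any actual profit, that lands in the neighborhood of triple the naive salary-divided-by-2,000 number.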

Make Leaving Easy

The singer / songwriter Ani DiFranco says, “I leave for a living, music is just something I do on the way out the door.” It’s the same with my consulting practice. One of the primary selling points of consultants and contractors is that we work with little to no commitment.

My standard agreement says that “either party can end the engagement at any time via email notification, for any reason or for no reason.” The recent waves of layoffs should have driven home to everybody that even salaried positions should never be treated as any kind of guarantee and that corporations have no particular loyalty to employees.

For consultants, being easy to terminate is actually part of our value proposition. In return for being available for piece work, on short notice, with no benefits, without a title, and with no commitment of further employment, we get a higher rate.

What To Do?

My rule of thumb is to take the salary-based number and triple it to establish a top-line hourly rate for small one-off engagements. That gives lots of breathing room to have a value-based conversation. I discount from there down to about double, mostly for concessions that will reduce my non-billable work. Guaranteed monthly minimum payments, good payment terms, long-term commitments, and freedom to use the client as a public reference are all valuable concessions that I will take into account.

However, and this is important, healthy business relationships are not really about the rate. Value is the important thing. In healthy negotiations, most of the conversation centers on identifying a scope of work that will bring enough value that the rate becomes almost an afterthought. Often, as consultants, we are in a position to drive changes that (for example) help clients de-risk multi-million dollar decisions. I ask questions like “can we put a price-tag on getting this right on the first try?”

Finally, a closing thought: The correct rate for your services is a sweet spot where (a) there are no surprises or gotchas – everybody understands the deal going in, (b) the client is grumbling a bit but still paying it (this means that you are getting a decent slice of the available value), (c) you can maintain a decent work/life balance -AND- meet your financial needs, and (d) you break into a smile when your phone rings and your client asks for more of your time – because you’re getting paid well to do something important and you get to do more of it.

All In on Artificial Intelligence

More than 20 years ago, fresh out of school in the 90s, I built artificial intelligence (AI) systems for a military contractor. I trained neural nets, used natural language processing to populate decision support systems, experimented with genetic algorithms, and refined support vector machines. In 2000 I pivoted to bioinformatics and genomics – mostly because I wanted to help people live longer, healthier lives rather than building algorithms to kill them ever more efficiently.

It’s taken a while for the public sector to catch up, but we’re finally here and I am all in on AI for genomics and precision medicine.

The abstract from a 1998 paper in which I used neural networks to detect and classify military targets.

I’m thrilled that we finally have the data, the algorithms, and most of all the focus and discipline to bring AI to bear on human health. Practically speaking, I’m pivoting my professional practice to preferentially engage with projects and organizations that center AI in the context of large, well curated, cross-domain datasets. That last bit – about the data being well curated and cross-domain – is important because AI is fundamentally a data play. The emerging success of AI in our field is made possible by decades of work creating data platforms (warehouses, lakes, grids, commons, and all the rest).

Domain Expertise Matters

In the early 2000s, I was one of those young technology hotshots who thought that the key to biology was to be found in some clever algorithm. It didn’t take very many years supporting real biology before I abandoned that idea entirely and focused on learning enough science to not embarrass myself in lab meetings any more.

The first biology paper with my name on it, for which I wrote some truly gnarly but not incorrect PERL

It’s hard to overstate the importance of domain expertise and data hygiene as we turn powerful algorithms and tools loose on large amounts of data. Recent history is littered with humiliating racist gaffes by companies who should have known better. Indeed, genomics is only slowly coming to grips with the fact that our datasets and early discoveries were built on a foundation of bias.

All of that is to say, “Garbage In / Garbage Out” still holds true even in the era of AI – as does “buyer beware.” Folks who carelessly fling petabyte-scale S3 buckets through a meat grinder / laundry list of algorithms … well … they will get what you might expect. It’s the same mistake I made early on: a failure to respect the level of knowledge, context, and expertise encoded in seemingly simple concepts like a “gene” or a “diagnosis.”

If you’re ready to leverage those silos of data to power breakthroughs – whether it’s for early stage compound development, incorporating EMR and RWE for clinical decision support, or getting over that last mile into standard-of-care clinical use – give me a call. I would love to work with you, and I will prioritize engagements that support this timely and important mission.