Archive for the ‘Biology’ Category

Hydroponics: Growing an Appreciation for Plants

Tuesday, August 9th, 2022

I once heard a saying – “Don’t feel pity on plants because they can’t move. Feel pity on us, because we have to”. I really didn’t have an appreciation for what this meant until the COVID pandemic hit, which restricted my movement for a couple of years, and I decided to spend some of my new-found time at home learning how to raise plants in my little flat in central Singapore. The result is a small hydroponics system that now lines the sunny windows of my place, yielding fresh herbs weekly that I incorporate into my dishes.

For me, hydroponics really drove home how remarkable plants are: from a bin containing nothing but water and salts, a fully-formed plant emerges. No vitamins, amino acids, or other nutrients – just add sunlight, and the plant produces everything it needs starting from a single, tiny seed. The seed encodes every gene it needs to survive and reproduce – our basil plant, for example, is tetraploid, which means it has four copies of every gene. Perhaps this somewhat explains the adaptability of plant clones – it is almost as if every branch on our basil bush has a separate character, each one trying a different angle at survival. Some branches would grow large and leafy, others small and dense, and if you propagate by a cutting, the resulting plant would inherit the character of the cutting. Thus, a lone plant should not be mistaken as lonely: it needs not a mate to create diverse offspring. Every tetraploid cell contains the genetic diversity of two diploids (whereas a human is one diploid), allowing it to adapt without need of sex or seedlings.

I also did an experiment and grew some sage from seed, and planted one set in dirt and another in hydroponics. Even though from the same seed stock, the resulting individuals bore little resemblance to each other. The dirt-grown sage looked much like the herb you’re familiar with in the grocery store – dark green, covered with fine hair, and densely arranged on a stem. The hydroponically grown sage instead grows like a vine, with long thin green stems between each leaf, the leaves themselves having a lighter color and less hair. The flavor is even a bit different; the hydroponic sage emits a slightly sulfurous odor when disturbed, and exhibits a bit more mint on the palate when eaten.

Even more fascinating is how the plants seem to “groom the water”. I’ve noticed that the most successful plants we’ve tried to grow can lower the pH of the water on their own, and regulate it within a fairly consistent band (more on this later!). Furthermore, they seem to have recruited commensurate organisms to live among their roots. The basil grows long white or translucent roots with a pale white mycorrhiza, while the sage has a brownish symbiont and a short, bushy root ball. Thus I only fully replace the water of the hydroponic system as a last resort if a plant seems diseased; normally I will cycle the water by removing about a half of the reservoir and topping it up, so as not to displace the favored microbes from the ecosystem.

The Setup

The initial inspiration to try hydroponics actually came a bit by chance. We bought some locally grown hydroponic lettuce, and noticed that they were packaged as whole plants, complete with roots. We were curious – could we pluck most of the leaves of this lettuce, and then stick the plants in water, and grow another serving of hydroponic lettuce?

Surprisingly enough, it worked! Even with a crude setup consisting of a handful of generic plant fertilizer and a small aquarium bubbler, we were able to take a single plant and grow a couple more servings of lettuce from it. Unfortunately, with time, the plants started to grow very “stemmy” and pale, and eventually they succumbed to tiny mites that infested their leaves.

Inspired by this initial success, I started to read up a bit on how others did hydroponics. One of the top hits is a blog by Kyle Gabriel, detailing how he built an extremely sophisticated system based around a Raspberry Pi, and a multitude of sensors, valves and pumps. It was sort of a nerd’s dream of how farming could be fully automated. I figured I’m pretty handy with a soldering iron, so maybe I could give a go at building a system like his. So, I dug up a spare Raspberry Pi, some solid state relays and white LEDs left over from when I did the house lighting, and put together a simple system that just automated the lighting and took hourly photos of the plants as they grew. The time-lapses were fascinating!

You can’t really watch a plant grow in real time, but, over a period of days one can easily see patterns in how plants grow and adapt.

With this small success, I put my mind toward further automation – adding various pumps and regulators for the system. However, as I started to put together the BOM for this, I realized very quickly that there was going to be no return on investment for building out a system this complicated. Plus, I really didn’t like that the whole system ran on code – I did not relish the idea of coming home to a room flooded with water or a set of rotten plants because my control program hit a segfault.

So, I sat back and thought about things a bit. First, one observation I had was despite providing the plants with a 10,000 lux light source 12 hours a day, they still had a tendency to grow toward the nearby window. As an experiment, I took one bin and removed it from the regulated light source, and just stuck it up against the window. The plant grew much better with natural sunlight, so, I removed all the artificial lighting, unplugged the Raspberry Pi, and just stuck all the plants against the windowsill (it definitely helps that I live one degree off the equator – it’s eternally summer here, with sunrise at 7AM and sunset at 7PM, 365 days a year). I was happy to save the electricity while getting bigger plants in the process.

For water level automation, I replaced the computer with two float switches in series. One switch cuts off the pump if the water level gets too low in the feed reservoir; the other cuts off the pump if the water level gets too high in the plant’s growth bin. You can use the same type of switch for both purposes; by just mounting the switch upsidedown you can invert the function of the switch.

The current “automated” system, consisting of a reservoir on the left, with a peristaltic pump on top of the reservoir bin, and two float switches. The silicone tube that takes the solution from the reservoir to the plant bin is covered by an old sock to prevent algae from growing in the solution when it’s not moving. There is also an aeration pump, not visible in this photo.

The float switch mounted on the top, functioning as a “break-when-full” switch. You can see the plant’s roots have taken over the entire bin! A couple of spacers were also added to adjust the height of the water.

The float switch mounted on the bottom of a tank, functioning as a “break when empty” switch. In order to provide clearance for the switch on the bottom, a couple of wine corks were hot-glued to the bottom of the bin. The switch comes with a rubber o-ring, creating an effective seal and no leakage.

So, with a couple of storage bins from Daiso, two float switches and a peristaltic pump, I’ve constructed a system that automates the care of our plants for up to two weeks at a time for under $40. No transistors required – just old-school technology dating from the 1800’s!

There is one other small detail necessary for hydroponics – an aeration pump. Any aquarium pump will do – although we eventually upgraded to some fancy silent pumps instead of the cheaper but noisier diaphragm based ones. Some blogs say that the “roots need oxygen” to survive, but my suspicion is actually that the pumps mostly serve to circulate the nutrient solution. If you leave the pump off, the roots will rapidly deplete the water around them of nutrients, and without any circulation you’re relying purely on a slow process of diffusion for nutrients to reach the roots. I’ve noticed that on bins with a low air flow, the roots will grow thick and matted, but bins with a faster air flow, the roots barely need to grow at all – my hypothesis is this reflects the plant allocating less resources toward root growth in bins with greater circulation, because fresh solution is always available with faster circulation.

The Tricky Bit

The electronics were actually the easiest part of the whole enterprise; the hardest part was figuring out what, exactly, I had to add to the water to get the plants to flourish. Once I got this right, the plants basically take care of themselves; of course it helps to pick plant varietals that are pest-resistant, and have the innate ability to regulate the pH level around their roots.

When I started, I was naively aware that plants needed nitrogen-bearing fertilizers. Reading the label on packaged fertilizer solutions, they use an “NPK” system, which stands for nitrogen-phosphorous-potassium. OK, sure, so plants need a bit more than just nitrogen. Surely I could just pick up some of this NPK stuff, dissolve it in some water, and we’re good to go…

…but how much of this should I add? This deceptively simple question lead me down a several-month rat-hole that took many failed experiments and daily journals of observations to find an answer. The core problem is that most plant bloggers like to use units like “one handful” as a unit of measure; the more precise ones would write something to the effect of “one capful per gallon”. As an engineer, units of handfuls and capfuls are extremely dissatisfying: how many grams per liter, dammit!

This lead me to research several academic papers about plant nutrition, which lead to reading graphs about plant growth under “controlled” conditions that lead to astonishingly contradictory results to what the plant bloggers would write: the NPK ratios implied by some of the academic works were wildly different from what the plant bloggers relayed in their actual experience.

It turns out the truth is somewhere in between. A big confounding factor is probably the nature of the soil used in the research, versus the base quality of the water used in your hydroponic system. Most of the research I uncovered was written about fertilizing plants grown in soil, and for example “loamy diatomaceous earth” turns out to be quite a complicated mix of nutrients in and of itself.

The most informative bit of research that I uncovered was experiments done where they would ash a plant after it was grown and measure out all the base elements from the resulting dry weight of the plant. It was here that I learned that, for example, molybdenum is absolutely essential to the growth of plants. It’s almost never mentioned in soil cultures, because dirt almost always has sufficient trace quantities of molybdenum to sustain plants, but water cultures quickly become molybdenum-deficient, and the plants will become pale and sickly without a supplement.

I also learned that plants need calcium and magnesium in astonishingly large quantities; as much as they need phosphorous and potassium. Again, these two nutrients are less discussed in soil-based literature because many rocks are basically made of calcium and magnesium, and as such plants have no trouble extracting what they need from the soil.

Finally, there is the issue of iron. Iron turns out to be the hardest nutrient to balance in a hydroponic system. Despite being extremely plentiful on Earth, and indeed, possibly being the penultimate composition of the entire universe, it is extremely scarce as a free atom in the biosphere. This is in part because it gets strongly bound to other molecules. For example, oxygen binds to myoglobin with a log K1 of 6.18, which means that it is a million times more likely to find oxygen bound to myoglobin than unbound in solution. This may sound strong, but EDTA, a chelating agent, has a log K1 of something like 27.7, so it is one octillion (1,000,000,000,000,000,000,000,000,000) times more likely to exist as bound to iron than unbound in equilibrium. In a way, iron is so biologically important that organic life had an arms race to bind free iron, and some ridiculously potent molecules exist to rapidly sweep the tiniest amount of iron out of solution. Fortunately, as long as I (or more conveniently, the plant itself) can keep the pH of the water below 5.5, I can take advantage of the extremely strong binding of EDTA to iron to keep it dissolved in solution and out of reach of other organisms trying to scavenge it out of the water. The plants can somehow take in the bound iron-EDTA complex, degrade the EDTA and extract the iron for its use (this took a long time and many trials with various iron binding agents to figure out how to remedy the chlorosis that would eventually take over every plant I grew).

Alright, now that I have a vague understanding of the atoms that a plant needs to survive, the question is how do I get them to the plant – and in what ratios? The answer to this is equally as vague and frustrating. You can’t simply throw a chunk of magnesium metal into a bin of water and expect a plant to access it. The magnesium needs to be turned into a salt so that it can readily dissolve into the water. One of the easiest versions of this to buy is magnesium sulfate, MgSO4, also known as epsom salts. So, I can just read the blogs and find the ones that tell you how many grams of magnesium sulfate to add per liter of water and be done with it, right?

Wrong again! It turns out that MgSO4 has several “hydration states” (11 total). Even though it looks like a hard, translucent crystal, Epsom salt is actually more water than magnesium by weight, as 7 molecules of water are bound to every molecule of magnesium sulfate in that preparation.

Of course, no plant blogger ever specifies the hydration states of the salts that they use in their preparations; and many on-line listings for agricultural-grade salts also fail to list the exact hydration state of their salts. Unfortunately this means there can be extremely large deviations in actual nutrient availability if you purchase a dissimilar hydration state from that used by the plant blogger.

That left me with purchasing a set of salts and trying to calculate, from first principles, the ratios that I needed to add to my hydroponics bins. The salts I finally decided on purchasing are:

  • Monopotassium phosphate (anhydrous) K2PO4
  • Potassium sulfate (anhydrous) K2SO4
  • Calcium nitrate Ca(NO3)2•4H2O – hygroscopic
  • Magnesium sulfate MgSO4•7H2O

Plus a pre-mixed micronutrient from a local hydroponics shop that contains the remaining essential elements in the following ratios:

  • Iron as EDTA chelate 21.25 mg/mL
  • Manganese 5.684 mg/mL
  • Boron 0.483 mg/mL
  • Zinc 0.617 mg/mL
  • Copper 0.267 mg/mL
  • Molybdenum 0.471 mg/mL

For the salts, I computed a matrix that allows me to solve for the amount of nutrient I want in solution, by taking the mass fraction of each nutrient available, writing it in matrix form, and then inverting it (had to crack open my linear algebra book from high school to remember what determinants were! Who knew that determinants could be useful for farming…).

You can make the matrix yourself by expressing the ratio of the milligrams of nutrient (as derived by the atomic weight of the nutrient) per milligram of compound (as derived by summing up the weight of all the atoms in the molecular formula, including the hydration state), and putting it into a matrix form like this:

And then taking the coefficients into an inverse matrix calculator and deriving a final format that allows you to plug in your desired NPK ratio and compute the mass of the salts you need to dissolve in water to achieve that:

As a sanity check, I plug the calculated weights back into the forward matrix to make sure I didn’t mess up the math, and I also add up all the dissolved solids to a TDS (total dissolved salts) number, so I can cross-check the resulting solution easily using a cheap TDS meter (link without referral code). In case you want to start from a template, you can download the spreadsheet. The template contains the pre-computed ratio that I currently use for growing all my herbs with compounds that I can source easily from the local market, and it seems to work fairly well for plants ranging from brazilian spinach to basil to sage.

As a side note, calcium nitrate is pretty tricky to handle. It’s very hygroscopic, so if left in ambient humidity, it will absorb water from the atmosphere and “melt away” into a concentrated, syrupy liquid. I usually add a few percent extra by weight over the formula to compensate for the excess water it accumulates over time. Also, I store the substance in an air-tight bag, and I always wear nitrile gloves while handling the compound to avoid damaging my hands.

For the micronutrients, it’s a bit trickier to dose correctly. Fortunately, I have a micropipette set that can measure out solutions in the range from 1uL to 200uL, from back when I did some genetic engineering in my kitchen (pipettes are also surprisingly cheap (without referral code) now). Again, the blogs are not terribly helpful about dosing – you get advice along the lines of “one drop per bucket” or something like that. What’s a drop? What’s a bucket? The exact volume of a drop depends on the surface tension and viscosity of the liquid, but I went with the rule of thumb that one drop is 50uL (20 drops per mL) as a starting point.

Initially, I tried 60uL of micronutrients per 1.5L of solution, but the plants started to show evidence of boron poisoning (this is a great guide for diagnosing plant nutritional problems based on the appearance of the leaves), so after a few iterations and replacements of the water to flush out the excess accumulated micronutrients, I settled on 30uL of micronutrients per 1.5L of solution, with a 15uL per week bump for iron-hungry species like spinach.

At a microliters-per-week consumption rate, even the smallest bottle of 150mL micronutrient solution will last years, but the tricky part is storing it. In order to avoid contaminating the bottle, I aliquot the solution every couple months into a set of 1.5mL eppendorfs which I keep in my wine fridge, alongside the original bottle. Even though I try my best to avoid contaminating the eppendorfs, after a couple weeks a pellet forms at the bottom from some process that is causing the micronutrients to come out of solution, so I typically end up discarding the aliquot before it is entirely used up.

The Final Result

It’s pretty neat to go from a pile of salts to delicious herbs. About a gram of salts go in, and a week later a couple dozen grams of leaves come out!

In go salts…

Out comes basil!

Basil in particular has been a real champ at growing in our hydroponics bins – we are at the point now where between two plants, we’re regularly giving it away to friends because it yields more than we can eat, even though I cook Italian food almost every other night. A handful of basil, a bit of salt, olive oil, tomatoes and garlic and we have a flavorful bruschetta to kick off a meal! Our other favorite is sage, it’s great for flavoring pork and poultry, but it’s very hard to find for some reason in Singapore. So, having a bit of fresh sage around is convenient and saves us a bit of money since it can be quite expensive to buy in specialty stores.

It’s been less practical to grow bulk vegetables, such as spinach. Brazilian spinach has been fairly successful in terms of growth, but it takes about a month for a cutting to grow to maturity, and we need about four plants to make a salad, so we’d need several racks of bins to make a dent in our vegetable consumption. Also, in general our herbs have had less pests than leafy green vegetables; maybe their strong flavor comes from compounds they produce that also serves to repel bugs? So, in addition to being great flavor for our sauces, the herbs have required no pesticides.

Overall, it was satisfying to learn about plant biology while developing a better connection to my food through technology. It was also a calming way to pass time during the pandemic; agriculture requires patience and time, but the reward is visceral. Having kept a miniature farmer’s almanac to decode missing pieces of information from the blogosphere, I have an new appreciation for how such personal journals could lead to scientific discoveries. And, I’m a much better chef than I was a couple years ago. Somehow, just having the fresh herbs around inspired explorations into new and exciting pairings; it gave me a whole new way to think about food.

A footnote on novel H1N1

Friday, August 19th, 2011

A couple years ago, I wrote a post about the H1N1 “swine flu” outbreak, talking a bit about the mechanics of the virus and how it could be hacked. Today I read an interesting tidbit in Nature referencing this article in Science that is a silver lining on the H1N1 cloud.

You know how every flu season there’s a new flu vaccine, yet somehow for other diseases you only need to be vaccinated once? It’s because there’s no vaccine that can target all types of flu. Apparently, a patient who contracted “swine flu” during the pandemic created a novel antibody with the remarkable ability to confer immunity to all 16 subtypes of influenza A. A group of researchers sifted through the white blood cells of the patient and managed to isolate four B cells that contain the code to produce this antibody. These cells have been cloned and are producing antibodies facilitating further research into a potential broad-spectrum vaccine that could confer broad protection against the flu.

For some reason I find this really interesting. I think it’s because at a gut level it gives me hope that if a killer virus did arise that wipes out most of humanity, there’s some evidence that maybe a small group of people will survive it. Also, never getting the flu again? Yes, please! On the other hand, this vaccine will be a fun one to observe as it evolves, particularly around the IP and production rights that results from this. Who owns it, and who deserves credit for it? Does the patient that evolved the antibody deserve any credit? What will be the interplay between the researchers, the funding institutions, the health industry and the consumer market? Should/can the final result or process be patented so that ultimately, a corporation is granted a monopoly on the vaccine (maybe there’s already a ruling on this)? Should we administer the resulting vaccine to everyone, risking the forced evolution of a new “superstrain” of flu that could be even deadlier, or should we restrict it only to the old, weak, and young? While these questions have been asked and sometimes answered in other contexts, everyone can relate to suffering through the flu, so perhaps the public debate around such issues will be livelier.

Reverse Engineering Superbugs

Wednesday, June 8th, 2011

The outbreak of the EHEC O104:H4 E. coli “superbug” in Europe has got me thinking about biology again.

The rise of antibiotic-resistant superbugs are a product of our love of antibiotics. In the absence of antibiotics, a bug that has few resistances will grow faster and more efficiently than one that has to put on bullet-proof armor every morning and lug around heavy artillery. In other words, the biological machinery required to produce antibiotic resistance comes at a fitness cost for the bug. In antibiotic-free conditions, non-resistant strains grow faster than the resistant strains; and with as little as 20 minutes per generation, just a couple days can yield hundreds of generations. This is why, thankfully, not every bug out there has a full suite of drug resistance — a chief enemy of the superbug is the common bug.

According to this evolutionary theory for the acquisition and loss of drug resistance genes, a hospital is an ideal breeding environment for superbugs: they are asceptic (less competition from common bugs), and full of antibiotics (plenty of selective pressure to acquire resistance genes).

Thus it is curious to find superbugs in food. Farms are teeming with common bugs, creating a selective pressure to lose antibiotic resistance genes. While antibiotics are routinely put into farm animal feed, it’s probably not cost-effective to use broad-spectrum antibiotics on such a scale. Perhaps O104:H4 is just a spontaneous coincidence, a fluke — a bug had acquired a set of genes, got lucky and grew, and just as quickly got edged out by more competitive neighbors. This could explain why it’s been tough to find its origin.

Fortunately, the entire sequence of the O104:H4 bug is available for download on the internet. Our friends in China — BGI, located in Shenzhen — acquired a sample and in an unusual act released the sequence for public download. This is unusual because research organizations typically hold this kind of data close to the chest, partially for peer review to vet it before public release, and partially for competitive advantage in academic publications — proprietary access to data is a common method to reduce competition for high-profile publications, and thus ensure your academic reputation. Whatever their reasons are for sharing the data, I think it’s worth noting the contribution, because now everybody in the world can perform an analysis on the bug.

And that’s where the fun begins! Analyzing the sequence data requires a little know-how, but fortunately, my “perlfriend” is a noted bioinformaticist. The raw sequence data provided by BGI is a set oversampled sub-sequences, which have to be assembled based on matching up overlapping regions. Once you assemble the sequence, you get a set of contiguous reads, but there are still gaps. It’s a bit like trying to compose a large picture out of a number of small photos taken at random. With enough sampling you will eventually create a complete picture, but for various technical reasons there are still ambiguities and gaps.

After assembly, the genome of O104:H4 is stitched from over a half million short DNA samples into 513 contiguous fragments of DNA (“contigs” in bio-speak), with a total length of 5.3 million base pairs (notably, wikipedia cites E. coli as having only 4.6 million base pairs, so O104:H4 is probably at least 15% longer — and likewise takes more time to replicate than a non-drug resistant strain). Here’s contig 34 of the assembly:


(Fun fact: the word “Gattaca” occurs 252 times in the genome of O104:H4)

Aside from making gratuitous pop culture references, the raw DNA isn’t very useful to us — it’s as if we were staring at binary machine code. In order to analyze the data, you need to “decompile” the methods contained within the DNA. Fortunately, protein sequences are highly conserved. Thus, a function that has been determined through biological experiment (for example, snipping out the DNA and observing what happens to the cell, or transfecting/transforming the DNA into a new cell and seeing what new abilities are acquired) can be correlated with a sequence of DNA, which can then be pattern-matched over the entire record to determine what functions (genes) are inside the overall genome.

The pieces needed to do this reverse-engineering are a protein database, and a tool called “blastx”. All of these tools are available free for download.

The list of known proteins can be downloaded from Searching for “drug resistance” restricted to E. coli organisms yields a nice list of proteins that have been identified by scientists over the years to confer upon E. coli parts of drug-resistance machinery. Overall, our query to the uniprot database returned 1,378 proteins that are described to confer drug resistance to E. coli.

Have a look at Multidrug transporter emrE []. Inside the link, you’ll find a description of the biological mechanism for its function (it pumps antibiotics out of the cell), its secondary structure (a notion of the shape of the protein) and its 110-residue amino acid sequence.

Here’s another example of a snippet from the database for a drug you may recognize:

>sp|P0AD65|PBP2_ECOLI Penicillin-binding protein 2 OS=Escherichia coli (strain K12) GN=mrdA PE=3 SV=1

(Incidentally, I find it amusing that the sequence for PBP2 is shorter than, for example, my PGP public key block)

PBP2_ECOLI is linked to penicillin resistance, and functions as a mutant of a gene that determines the shape of the bacteria. Reading through the bio-speak, it seems that this resistant variant is adapted to buy Amoxicillin online; bacteria with non-resistant forms of this gene are unable to form properly shaped cell walls and thus die. So, by browsing this database, we are getting a feel for the variety of countermeasures that bacteria has: sometimes they are active (pumping the antibiotic out of the cell) and sometimes they are passive (mutations that enable operation despite the presence of antibiotics).

Now, you need the actual decompiler itself. The program we used is called blast; specifically, a variant known as blastx. Blast stands for “basic local alignment search tool”. This analysis program computes all of the possible translations of the E. coli DNA to protein sequences (there are 6 overall: 5′->3′, 3′->5′, each multiplied by three possible framing positions of the codons), and then does a pattern-matching of the resulting amino acid sequences with the provided database of known drug-resistance sequences. The result is a sorted list of each known drug resistance protein along with the region of the E. coli genome that best matches the protein.

Here’s the output for the penicillin example:

# BLASTX 2.2.24 [Aug-08-2010]
# Query: 43 87880
# Database: uniprot-drug-resistance-AND-organism-coli.fasta
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value\
, bit score
43 sp|P0AD65|PBP2_ECOLI 100.00 632 0 0 29076 30971 1 632 0.0 1281
43 sp|P0AD68|FTSI_ECOLI 25.08 650 458 21 29064 30926 6 574 2e-33 142
43 sp|P60752|MSBA_ECOLI 32.80 186 120 6 12144 12686 378 558 6e-17 87.0
43 sp|P60752|MSBA_ECOLI 27.78 216 148 5 77054 77677 361 566 8e-14 76.6
43 sp|P77265|MDLA_ECOLI 27.98 193 133 6 12141 12701 370 555 2e-10 65.5


Here, you can see that the gene for PBP2_ECOLI has a 100% match inside the genome of O104:H4.

Now that we have this list, we can answer some interesting questions, such as “How many of the known drug resistance genes are inside O104:H4?” I find it fascinating that this question is answered with a shell script:

cat uniprot_search_m9 | awk '{if ($3 > 99) { print;}}' | cut -f2 |grep -v ^# | cut -f1 -d"_" | cut -f3 -d"|" | sort | uniq | wc -l

My perlfriend writes these so quickly and effortlessly it’s as if she’s tying IMs to friends — I half expect to see an “lol” at the end of the script. Anyways, the above script tells us that 1,138 genes are a 100% match against the database of 1,378 genes. If you loosen the criteria up to a 99% match, allowing for one or two mutations per gene — possibly a result of sequencing errors or just evolution — the list expands to 1,224 out of 1,378.

The inverse question is which drug-resistance genes are most definitely not in O104:H4. Maybe by looking at the resistance genes missing from O104:H4, we can gather clues as to which treatments could be effective against the bug.

In order to rule out a drug-resistance gene, we (arbitrarily) set a criteria of any gene with less than 70% best-case matching as “most likely not” a resistance that the bug has. The result of this query reveals that there are 116 genes that are known to confer drug resistance that are less than 70% matching in O104:H4. Here is the list:

A0SKI3 A2I604 A3RLX9 A3RLY0 A3RLY1 A5H8A5 B0FMU1 B1A3K9 B1LGD9 B3HN85 B3HN86 B3HP88 B5AG18 B6ECG5 B7MM15 B7MUI1 B7NQ58 B7NQ59 B7TR24 BLR CML D2I9F6 D5D1U9 D5D1Z3 D5KLY6 D6JAN9 D7XST0 D7Z7R4 D7Z7W9 D7ZDQ3 D7ZDQ4 D8BAY2 D8BEX8 D8BEX9 DYR21 DYR22 DYR23 E0QC79 E0QC80 E0QE33 E0QF09 E0QF10 E0QYN4 E1J2I1 E1S2P1 E1S2P2 E1S382 E3PYR0 E3UI84 E3XPK9 E3XPQ2 E4P490 E5ZP70 E6A4R5 E6A4R6 E6ASX0 E6AT17 E6B2K3 E6BS59 E7JQV0 E7JQZ4 E7U5T3 E9U1P2 E9UGM7 E9VGQ2 E9VX03 E9Y7L7 O85667 Q05172 Q08JA7 Q0PH37 Q0T948 Q0T949 Q0TI28 Q1R2Q2 Q1R2Q3 Q3HNE8 Q4HG53 Q4HG54 Q4HGV8 Q4HGV9 Q4HH67 Q4U1X2 Q4U1X5 Q50JE7 Q51348 Q56QZ5 Q56QZ8 Q5DUC3 Q5UNL3 Q6PMN4 Q6RGG1 Q6RGG2 Q75WM3 Q79CI3 Q79D79 Q79DQ2 Q79DX9 Q79IE6 Q79JG0 Q7BNC7 Q83TT7 Q83ZP7 Q8G9W6 Q8G9W7 Q8GJ08 Q8VNN1 Q93MZ2 Q99399 Q9F0D9 Q9F0S4 Q9F7C0 Q9F8W2 Q9L798

Again, you can plug any of these protein codes into the uniprot database and find out more about them. For example, BLR is the “Beta-lactam resistance protein”:

Has an effect on the susceptibiltiy to a number of antibiotics involved in peptidoglycan biosynthesis. Acts with beta lactams, D-cycloserine and bacitracin. Has no effect on the susceptibility to tetracycline, chloramphenicol, gentamicin, fosfomycin, vacomycin or quinolones. Might enhance drug exit by being part of multisubunit efflux pump. Might also be involved in cell wall biosynthesis.

Unfortunately, a cursory inspection reveals that most of the functions that O104:H4 lacks are just small, poorly understood fragments of machines involved in drug resistance. Which is actually an interesting lesson in itself: there is a popular notion that knowing a DNA sequence is the same as knowing what diseases or traits an organism may have. Even though we know the sequence and general properties of many proteins, it’s much, much harder to link them to a specific disease or trait. At some point, someone has to get their hands dirty and do the “wet biology” that assigns a biological significance to a given protein family. Pop culture references to DNA analysis are glibly unaware of this missing link, which leads to over-inflated expectations for genetic analysis, particularly in its utility for diagnosing and curing human disease and applications in eugenics.

While the result of this just-for-the-fun-of-it exercise isn’t a cure for the superbug, the neat thing about living here in The Future is that just a few days after an outbreak of a deadly disease halfway across the world, the sequence of the pathogen is available for download — and with free, open tools anyone can perform a simple analysis. This is a nascent, but promising, technology ecosystem.

A schematic for M. pneumoniae metabolism

Monday, January 17th, 2011

With the madness of CES over and the Chinese New Year holiday coming up, I finally found some time to catch up on some back issues of Science. I came across a beautiful diagram of the metabolic pathways of one of the smallest bacteria, Mycoplasma Pneumoniae. It’s part of an article by Eva Yus et al (Science 326, 1263-1271 (2009)).

Looking at this metabolic pathway reminds me of when I was less than a decade old, staring at the schematic of an Apple II. Back then, I knew that this fascinatingly complex mass of lines was a map to this machine in front of me, but I didn’t know quite enough to do anything with the map. However, the key was that a map existed, so despite its imposing appearance it represented a hope for fully unraveling such complexities.

The analogy isn’t quite precise, but at a 10,000 foot level the complexity and detail of the two diagrams feels similar. The metabolic schematic is detailed enough for me to trace a path from glucose to ethanol, and the Apple II schematic is detailed enough for me to trace a path from the CPU to the speaker.

And just as a biologist wouldn’t make much of a box with 74LS74 attached to it, an electrical engineer wouldn’t make much of a box with ADH inside it (fwiw, a 74LS74 (datasheet) is a synchronous storage device with two storage elements, and ADH is alcohol deydrogenase, an enzyme coded by gene MPN564 (sequence data) that can turn acetaldehyde into ethanol).

In the supplemental material, the authors of the paper included what reads like a BOM (bill of materials) for M. pneumoniae. Every enzyme (pentagonal boxes in the schematic) is listed in the BOM with its functional description, along with a reference that allows you to find its sequence source code. At the very end is a table of uncharacterized genes — those who do a bit of reverse engineering would be very familiar with such tables of “hmm I sort of know what it should do but I’m not sure yet” parts or function calls.

One Mutation per 15 Cigarettes Smoked

Friday, January 22nd, 2010

Now that’s a memorable factoid. Nature recently published a paper titled “A small-cell lung cancer genome with complex signatures of tobacco exposure” (Nature 463, 184-190 (14 January 2010), Pleasance et al), which as its title implies, contains the summary of the sequence of a cancer genome derived from a lung cancer tumor. It’s an interesting read; I can’t claim to understand it all. At a high level, they found 22,910 somatic substitutions, 65 insertions and deletions, 58 genomic rearrangements, and 334 copy number segment variations were identified; as I understand it, these are uncorrectable errors, i.e. the ones that got past the cell’s natural error-correction mechanisms. That’s out of about 3 gigabases in the entire genome, or an accumulated error rate of about 1 in 5 million.

I’m not an expert on cancer, but the way it was explained to me is that basically every cell has the capacity to become a cancer, but there are several dozen regulatory pathways that keep a cell in check. In a layman sort of way, every cell having the capacity to become a cancer makes sense because we come from an embryonic stem cell, and tumorigenic cancer cells are differentiated cells that have lost their programming due to mutations, thereby returning to being a (rogue) stem cell. So, a cancer happens when a cell accumulates enough non-fatal mutations such that all the regulation mechanisms are defeated. Of course, this is basically a game of Russian roulette; some cells simply gather fatal mutations and undergo apoptosis. In order to become a cancer cell, it has to survive a lot of random mutations, but then again there are plenty of cells in a lung to participate in the process.

Above: a map of the mutations found in the cancer cell. The 23 chromosomes are laid end to end around the edge of the circle. There’s a ton of data in the graph; for example, the light orange bars represent the heterozygous substitution density per 10 megabases. A higher resolution diagram along with a more detailed explanation can be found in the paper.

The tag line for this post is lifted from the discussion section of the paper, where they assume that lung cancer develops after about 50 pack-years of smoking, which roughly translates to the ultimate cancer cell acquiring on average one mutation every 15 cigarettes smoked. Even though this is an over-simplification of the situation, the tag line is memorable because it makes the impact of smoking seem much more immediate and concrete: it’s one thing to say on average, in fifty years, you will get cancer from smoking a pack a day; it’s another to say on average, when you finish that pack of cigarettes, you are one mutation closer to getting cancer.