Reverse Engineering Superbugs

The outbreak of the EHEC O104:H4 E. coli “superbug” in Europe has got me thinking about biology again.

The rise of antibiotic-resistant superbugs is a product of our love of antibiotics. In the absence of antibiotics, a bug that has few resistances will grow faster and more efficiently than one that has to put on bullet-proof armor every morning and lug around heavy artillery. In other words, the biological machinery required to produce antibiotic resistance comes at a fitness cost for the bug. In antibiotic-free conditions, non-resistant strains grow faster than the resistant strains; and with as little as 20 minutes per generation (72 generations a day), just a couple of days can yield well over a hundred generations. This is why, thankfully, not every bug out there has a full suite of drug resistance: a chief enemy of the superbug is the common bug.

According to this evolutionary theory for the acquisition and loss of drug resistance genes, a hospital is an ideal breeding environment for superbugs: it is aseptic (less competition from common bugs), and full of antibiotics (plenty of selective pressure to acquire resistance genes).

Thus it is curious to find superbugs in food. Farms are teeming with common bugs, creating a selective pressure to lose antibiotic resistance genes. While antibiotics are routinely put into farm animal feed, it’s probably not cost-effective to use broad-spectrum antibiotics on such a scale. Perhaps O104:H4 is just a spontaneous coincidence, a fluke — a bug had acquired a set of genes, got lucky and grew, and just as quickly got edged out by more competitive neighbors. This could explain why it’s been tough to find its origin.

Fortunately, the entire sequence of the O104:H4 bug is available for download on the internet. Our friends in China — BGI, located in Shenzhen — acquired a sample and in an unusual act released the sequence for public download. This is unusual because research organizations typically hold this kind of data close to the chest, partially for peer review to vet it before public release, and partially for competitive advantage in academic publications — proprietary access to data is a common method to reduce competition for high-profile publications, and thus ensure your academic reputation. Whatever their reasons are for sharing the data, I think it’s worth noting the contribution, because now everybody in the world can perform an analysis on the bug.

And that’s where the fun begins! Analyzing the sequence data requires a little know-how, but fortunately, my “perlfriend” is a noted bioinformaticist. The raw sequence data provided by BGI is a set of oversampled sub-sequences, which have to be assembled by matching up overlapping regions. Once you assemble the sequence, you get a set of contiguous reads, but there are still gaps. It’s a bit like trying to compose a large picture out of a number of small photos taken at random. With enough sampling you will eventually create a complete picture, but for various technical reasons there are still ambiguities and gaps.
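To make the photo-mosaic analogy concrete, here is a toy Python sketch of greedy overlap merging. It is only an illustration of the idea, not what a real assembler does: real tools cope with sequencing errors, repeats and reverse-complemented reads, and nothing here reflects how the O104:H4 assembly was actually produced. The function names and the `min_len` parameter are made up for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two fragments with the largest overlap.

    Whatever cannot be merged any further is returned as separate
    fragments, the toy equivalent of "contigs" with gaps between them.
    """
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
```

Feeding it a handful of short overlapping strings plus one unrelated string gives back one long merged fragment and the unrelated one left on its own, which is the "complete picture with gaps" situation described above.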

After assembly, the genome of O104:H4 is stitched from over half a million short DNA samples into 513 contiguous fragments of DNA (“contigs” in bio-speak), with a total length of 5.3 million base pairs (notably, Wikipedia cites E. coli as having only 4.6 million base pairs, so O104:H4 is probably at least 15% longer, and likewise takes more time to replicate than a non-drug-resistant strain). Here’s contig 34 of the assembly:


(Fun fact: the word “Gattaca” occurs 252 times in the genome of O104:H4)
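The post does not say how that count was made (one strand or both, within contigs only or across contig boundaries), so the following Python sketch is just one plausible way to count a motif in a sequence string. The helper names are invented for this example.

```python
# Translation table for complementing DNA bases (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def count_motif(seq, motif):
    """Count case-insensitive occurrences of `motif`, overlaps included."""
    seq, motif = seq.upper(), motif.upper()
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count
```

To cover both strands under this approach, one would add `count_motif(seq, "GATTACA")` and `count_motif(reverse_complement(seq), "GATTACA")` for each contig.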

Aside from making gratuitous pop culture references, the raw DNA isn’t very useful to us — it’s as if we were staring at binary machine code. In order to analyze the data, you need to “decompile” the methods contained within the DNA. Fortunately, protein sequences are highly conserved. Thus, a function that has been determined through biological experiment (for example, snipping out the DNA and observing what happens to the cell, or transfecting/transforming the DNA into a new cell and seeing what new abilities are acquired) can be correlated with a sequence of DNA, which can then be pattern-matched over the entire record to determine what functions (genes) are inside the overall genome.

The pieces needed to do this reverse-engineering are a protein database, and a tool called “blastx”. Both are available free for download.

The list of known proteins can be downloaded from UniProt. Searching for “drug resistance” restricted to E. coli organisms yields a nice list of proteins that scientists have identified over the years as parts of E. coli’s drug-resistance machinery. Overall, our query to the UniProt database returned 1,378 proteins that are described to confer drug resistance to E. coli.

Have a look at Multidrug transporter emrE. In its database entry, you’ll find a description of the biological mechanism for its function (it pumps antibiotics out of the cell), its secondary structure (a notion of the shape of the protein) and its 110-residue amino acid sequence.

Here’s another example of a snippet from the database for a drug you may recognize:

>sp|P0AD65|PBP2_ECOLI Penicillin-binding protein 2 OS=Escherichia coli (strain K12) GN=mrdA PE=3 SV=1

(Incidentally, I find it amusing that the sequence for PBP2 is shorter than, for example, my PGP public key block)

PBP2_ECOLI is linked to penicillin resistance, and functions as a mutant of a gene that determines the shape of the bacteria. Reading through the bio-speak, it seems that this resistant variant keeps working in the presence of the antibiotic; bacteria with non-resistant forms of this gene are unable to form properly shaped cell walls and thus die. So, by browsing this database, we are getting a feel for the variety of countermeasures that bacteria have: sometimes they are active (pumping the antibiotic out of the cell) and sometimes they are passive (mutations that enable operation despite the presence of antibiotics).

Now, you need the actual decompiler itself. The program we used is called blast; specifically, a variant known as blastx. Blast stands for “basic local alignment search tool”. This analysis program computes all of the possible translations of the E. coli DNA to protein sequences (there are 6 overall: 5′->3′, 3′->5′, each multiplied by three possible framing positions of the codons), and then does a pattern-matching of the resulting amino acid sequences with the provided database of known drug-resistance sequences. The result is a sorted list of each known drug resistance protein along with the region of the E. coli genome that best matches the protein.
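The six translations mentioned above can be sketched in a few lines of Python. This is only an illustration of the reading-frame idea using the standard genetic code; it is not how blastx is implemented internally, and the function names and frame labels (`+1` to `+3`, `-1` to `-3`) are choices made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5'->3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; '*' is a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate both strands at each of the three codon offsets."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

For the nine-base input `ATGGCCTAA`, frame `+1` reads the codons ATG, GCC, TAA and yields `MA*`, while the other five frames give unrelated short peptides. A search tool then only has to compare amino acid strings against the protein database.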

Here’s the output for the penicillin example:

# BLASTX 2.2.24 [Aug-08-2010]
# Query: 43 87880
# Database: uniprot-drug-resistance-AND-organism-coli.fasta
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
43 sp|P0AD65|PBP2_ECOLI 100.00 632 0 0 29076 30971 1 632 0.0 1281
43 sp|P0AD68|FTSI_ECOLI 25.08 650 458 21 29064 30926 6 574 2e-33 142
43 sp|P60752|MSBA_ECOLI 32.80 186 120 6 12144 12686 378 558 6e-17 87.0
43 sp|P60752|MSBA_ECOLI 27.78 216 148 5 77054 77677 361 566 8e-14 76.6
43 sp|P77265|MDLA_ECOLI 27.98 193 133 6 12141 12701 370 555 2e-10 65.5


Here, you can see that the gene for PBP2_ECOLI has a 100% match inside the genome of O104:H4.

Now that we have this list, we can answer some interesting questions, such as “How many of the known drug resistance genes are inside O104:H4?” I find it fascinating that this question is answered with a shell script:

cat uniprot_search_m9 | awk '{if ($3 > 99) { print;}}' | cut -f2 |grep -v ^# | cut -f1 -d"_" | cut -f3 -d"|" | sort | uniq | wc -l

My perlfriend writes these so quickly and effortlessly it’s as if she’s typing IMs to friends; I half expect to see an “lol” at the end of the script. Anyways, the above script tells us that 1,138 genes are a 100% match against the database of 1,378 genes. If you loosen the criteria up to a 99% match, allowing for one or two mutations per gene (possibly a result of sequencing errors or just evolution), the list expands to 1,224 out of 1,378.
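For readers who find the one-liner opaque, here is a rough Python equivalent. It is a sketch, not the script that was actually used: it assumes the tabular BLAST output shown earlier (comment lines starting with `#`, twelve whitespace-separated fields per hit) and copies the one-liner's strict "greater than" comparison. The function names are invented here.

```python
def short_name(subject_id):
    """'sp|P0AD65|PBP2_ECOLI' -> 'PBP2'.

    Mirrors `cut -f1 -d"_" | cut -f3 -d"|"` from the shell pipeline.
    """
    return subject_id.split("_")[0].split("|")[2]


def best_identity(lines):
    """Map each protein short name to its best % identity over all hits."""
    best = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header and comment lines
        fields = line.split()
        if len(fields) < 12:
            continue  # blank or wrapped lines
        name = short_name(fields[1])
        best[name] = max(best.get(name, 0.0), float(fields[2]))
    return best


def count_above(best, threshold):
    """Number of distinct proteins whose best hit exceeds `threshold`."""
    return sum(1 for pct in best.values() if pct > threshold)
```

With the results file from the one-liner, usage would look like `best = best_identity(open("uniprot_search_m9"))` followed by `count_above(best, 99)`. Changing the threshold is how the criteria are tightened or loosened.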

The inverse question is which drug-resistance genes are most definitely not in O104:H4. Maybe by looking at the resistance genes missing from O104:H4, we can gather clues as to which treatments could be effective against the bug.

In order to rule out a drug-resistance gene, we (arbitrarily) set a criterion: any gene with less than a 70% best-case match is “most likely not” a resistance that the bug has. The result of this query reveals that there are 116 genes known to confer drug resistance that are less than 70% matching in O104:H4. Here is the list:

A0SKI3 A2I604 A3RLX9 A3RLY0 A3RLY1 A5H8A5 B0FMU1 B1A3K9 B1LGD9 B3HN85 B3HN86 B3HP88 B5AG18 B6ECG5 B7MM15 B7MUI1 B7NQ58 B7NQ59 B7TR24 BLR CML D2I9F6 D5D1U9 D5D1Z3 D5KLY6 D6JAN9 D7XST0 D7Z7R4 D7Z7W9 D7ZDQ3 D7ZDQ4 D8BAY2 D8BEX8 D8BEX9 DYR21 DYR22 DYR23 E0QC79 E0QC80 E0QE33 E0QF09 E0QF10 E0QYN4 E1J2I1 E1S2P1 E1S2P2 E1S382 E3PYR0 E3UI84 E3XPK9 E3XPQ2 E4P490 E5ZP70 E6A4R5 E6A4R6 E6ASX0 E6AT17 E6B2K3 E6BS59 E7JQV0 E7JQZ4 E7U5T3 E9U1P2 E9UGM7 E9VGQ2 E9VX03 E9Y7L7 O85667 Q05172 Q08JA7 Q0PH37 Q0T948 Q0T949 Q0TI28 Q1R2Q2 Q1R2Q3 Q3HNE8 Q4HG53 Q4HG54 Q4HGV8 Q4HGV9 Q4HH67 Q4U1X2 Q4U1X5 Q50JE7 Q51348 Q56QZ5 Q56QZ8 Q5DUC3 Q5UNL3 Q6PMN4 Q6RGG1 Q6RGG2 Q75WM3 Q79CI3 Q79D79 Q79DQ2 Q79DX9 Q79IE6 Q79JG0 Q7BNC7 Q83TT7 Q83ZP7 Q8G9W6 Q8G9W7 Q8GJ08 Q8VNN1 Q93MZ2 Q99399 Q9F0D9 Q9F0S4 Q9F7C0 Q9F8W2 Q9L798
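The exact query behind this list isn't shown, so the following Python sketch is a guess at its shape rather than a reconstruction. It assumes you already have a mapping from protein short name to its best % identity in the BLAST results, plus the FASTA file of the protein database so that proteins with no hit at all are also counted as missing. All function names here are hypothetical.

```python
def short_name(subject_id):
    """'sp|P0AD65|PBP2_ECOLI' -> 'PBP2' (same convention as the shell script)."""
    return subject_id.split("_")[0].split("|")[2]


def fasta_names(lines):
    """Collect the short name of every protein in a FASTA database."""
    return {
        short_name(line[1:].split()[0])
        for line in lines
        if line.startswith(">")
    }


def ruled_out(all_names, best, cutoff=70.0):
    """Proteins whose best match is below `cutoff`, or that have no hit."""
    return sorted(name for name in all_names if best.get(name, 0.0) < cutoff)
```

Treating "no hit" as 0% identity is a design choice of this sketch; it is what lets a protein that never shows up in the BLAST output land on the "most likely not" list.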

Again, you can plug any of these protein codes into the uniprot database and find out more about them. For example, BLR is the “Beta-lactam resistance protein”:

Has an effect on the susceptibility to a number of antibiotics involved in peptidoglycan biosynthesis. Acts with beta lactams, D-cycloserine and bacitracin. Has no effect on the susceptibility to tetracycline, chloramphenicol, gentamicin, fosfomycin, vancomycin or quinolones. Might enhance drug exit by being part of multisubunit efflux pump. Might also be involved in cell wall biosynthesis.

Unfortunately, a cursory inspection reveals that most of the functions that O104:H4 lacks are just small, poorly understood fragments of machines involved in drug resistance. Which is actually an interesting lesson in itself: there is a popular notion that knowing a DNA sequence is the same as knowing what diseases or traits an organism may have. Even though we know the sequence and general properties of many proteins, it’s much, much harder to link them to a specific disease or trait. At some point, someone has to get their hands dirty and do the “wet biology” that assigns a biological significance to a given protein family. Pop culture references to DNA analysis are glibly unaware of this missing link, which leads to over-inflated expectations for genetic analysis, particularly in its utility for diagnosing and curing human disease and applications in eugenics.

While the result of this just-for-the-fun-of-it exercise isn’t a cure for the superbug, the neat thing about living here in The Future is that just a few days after an outbreak of a deadly disease halfway across the world, the sequence of the pathogen is available for download — and with free, open tools anyone can perform a simple analysis. This is a nascent, but promising, technology ecosystem.

15 Responses to “Reverse Engineering Superbugs”

  1. […] Reverse Engineering Superbugs @ bunnie’s blog… While the result of this just-for-the-fun-of-it exercise isn’t a cure for the superbug, the neat thing about living here in The Future is that just a few days after an outbreak of a deadly disease halfway across the world, the sequence of the organism is now available for download and with free, open tools anyone can perform a simple analysis. This is a nascent, but promising, technology ecosystem. Filed under: random — by adafruit, posted June 8, 2011 at 2:34 pm Comments (0) […]

  2. Jonas says:

    It’s both awesome and scary that we can determine which antibiotics will and won’t work without being even close to understanding why.

    • Ian Stirling says:

      It’s not quite that simple.
      Knowing a car contains a 1W LED doesn’t mean that the interior lighting is LED based.
      It may simply mean that the owner has an LED stuck to their shoe.

      It’s possible that the gene is present in a nonfunctional state, so you can’t quite say gene for X resistance present, must be resistant to X – it’s at best a hint.

  3. jldugger says:

    “This is unusual because research organizations typically hold this kind of data close to the chest, partially for peer review to vet it before public release, and partially for competitive advantage in academic publications.”

    Technically, the Wellcome trust requires publication of raw data. I think the NSF has a similar attachment. Doubt the same is true in China, but I could be wrong.

    As far as the software analogy goes, I’d not call blast a decompiler so much as a fuzzy matching tool, along the lines of diff on steroids. Other than that, it’s true, we’re a far way from truly decompiling DNA. We’re getting better: we know what promoter / sequences are, and where transcription starts. But beyond that, determining function from first principles and a sequence is a long long ways off.

  4. Mihai Todor says:

    Bunnie, you seem to be asking for another slashdot flood with this article :D

  5. schobi says:

    Thanks for the great introduction to the world of genes and biology!

    One aspect confuses me. The public download post
    writes: “The analysis further showed that this deadly bacterium carries several antibiotic resistance genes, including […] Beta-lactam antibiotics”.
    On other sources I find that Beta-lactam antibiotics are the class of penicillin antibiotics and EHEC is resistant.

    but you wrote that you identified the “Beta-lactam resistance protein” as “most likely not” and “less than 70% matching”.

    How should we read this? Even 70% means that the bacteria has a resistance? How reliable is this matching of bases? Is there anything left at all?

    • bunnie says:

      That’s right, beta-lactam includes penicillin. The question you ask is very apropos to a central challenge of genetic analysis that is currently under-appreciated: “we’ve got lots of data. Now what?” This is where computer science crosses over into real biology.

      Most biological pathways have multiple routes that can achieve a similar goal. You’ll note that in the above analysis, EHEC O104:H4 is a 100% match for PBP2, which confers penicillin resistance. The mechanism for penicillin resistance by PBP2 is completely different from the mechanism implicated by BLR. It seems that BLR may act as more of a pump to evacuate the antibiotic from the cell, whereas PBP2 is a change to the shape of the protein that penicillin binds to, making penicillin effectively useless even if it were inside the cell wall.

      So, even though the “suggested name” of BLR is “Beta Lactam Resistance Protein” according to the uniprot database, the short name alone cannot fully capture the biological subtleties. Also, note that names are assigned sort of in a “discovered first” order, so in fact BLR could have been named long before we knew there were multiple paths for beta-lactam resistance, so some proteins have outdated/inaccurate names that have stuck out of tradition.

      Perhaps more illuminating is the detailed description of BLR: notice all the qualifying “mights” … “might enhance”, “might also be involved” … clearly, while experiments have shown some correlation of BLR with beta lactam resistance, the mechanism itself has yet to be fully understood and remains a subject of further research. At the moment, our computational tools are not powerful enough to determine “in silico” the biological purpose of a protein; we can simply say if a protein is there or not, and we must still rely upon our front-line of “wet biologists” to fill in how to read these tea leaves.

      As for your question about the 70%, I had noted this was an arbitrary measure that we had picked. At 70% or less matching, in the best case 70% of the aligned amino acid residues match the known sequence. Even a mutation of just a few base pairs can disrupt or drastically change protein function, so 70% was chosen as a conservative threshold to “definitely” rule out related function. Going back to my analysis database, I find that BLR doesn’t even exist at any appreciable level of match inside EHEC O104:H4, which probably means that gene simply doesn’t exist in that bug. On the other hand, as noted above, EHEC O104:H4 has a mutation which would allow it to effectively swim in penicillin and not break a sweat, so lacking the gene could in fact be a competitive advantage, as it might be redundant and thus a cost to fitness.

      • Orpheus says:

        Actually, BLR is just one of MANY β-lactam (= a certain 4-member ring found in penicillin-like drugs incl. most cephalosporins) resistance mechanisms. IIRC your description said it bound β-lactams for transport or other unknown inactivation, and though it’s been a LONG time since I was an active molecular biologist, I was taught more about β-lactamases (reusable enzymes that degrade β-lactams; one molecule could disable many β-lactam molecules) and other mechanisms. Some of those did indeed quasi-permanently bind β-lactams, but those were better understood than your BLR seems to be. Manufacturing amino acids and then synthesizing them into a big protein to inactivate a single much smaller β-lactam molecule is a pricey cellular investment.

        (That doesn’t make BLR a bad idea: bulletproof vests are 1000x as expensive and difficult as bullets, but they’ll keep you alive until you develop cheap small anti-bullets. β-lactams themselves are more expensive than the NAG/NAM molecules they replace in some bacterial cell walls [creating a dead end that becomes a hole in the cell wall during growth] –doubly expensive since it can take several holes to kill one bacterium– but still MUCH ‘cheaper’ to make than an enzyme)

        That’s one problem with blindly using a Perl sieve: you don’t see all the anti-β-lactam mechanisms from the list that this strain *does* possess

  6. […] E. coli This post by bunnie gets my vote for blog post of the year. First he shows you where to download the genetic […]

  7. bunnie says:

    I’m surprised nobody caught this before, but I had a pervasive typo in the original post. I had incorrectly named the bug O141:H4, the correct name is O104:H4. That’s fixed now…

  8. Isa says:

    Where do I sign up for one of these “perlfriends?”

  9. tai viinikka says:

    Nice molecular biology primer for techies, bunnie! A bunch of the people I went through my biochem undergrad with have now switched fields and don’t think much about biology any more. :(

    It seems like spending a couple years pondering information flow, feedbacks, transcriptions and translations, and signals and transducers was a worthwhile preparation for things other than just growing cells and mutating genes.


  10. […] E. coli December 1st, 2011 – Posted by: admin in SMB, tags: coli, Decompiling This post by bunnie gets my vote for weblog post of the year. First he shows you where to download the […]