Towards a More Open Secure Element Chip

Tuesday, December 20th, 2022

“Secure Element” (SE) chips have traditionally taken a very closed-source, NDA-heavy approach. Thus, it piqued my interest when an early-stage SE chip startup, Cramium (still in stealth mode), approached me to advise on open source strategy. This blog post explains my reasoning for agreeing to advise Cramium, and what I hope to accomplish in the future.

As an open source hardware activist, I have been very pleased at the progress made by the eFabless/Google partnership in creating an open-to-the-transistors physical design kit (PDK) for chips. This would be about as open as you can get from the design standpoint. However, the partnership currently supports only lower-complexity designs in the 90nm to 180nm technology nodes. Meanwhile, Cramium is planning to tape out their security chip in the 22nm node. A 22nm chip would be much more capable and cost-effective than one fabricated in 90nm (for reference, the RP2040 is fabricated in 40nm, while the Raspberry Pi 4’s CPU is fabricated in 28nm), but it would not be open-to-the-transistors.

Cramium indicated that they want to push the boundaries on what one can do with open source, within the four corners of the foundry NDAs. Ideally, a security chip would be fabricated in an open-PDK process, but I still feel it’s important to engage and help nudge them in the right direction because there is a genuine possibility that an open SDK (but still closed PDK) SE in a 22nm process could gain a lot of traction. If it’s not done right, it could establish poor de-facto standards, with lasting impacts on the open source ecosystem.

For example, when Cramium approached me, their original thought was to ship the chip with an ARM Cortex M7 CPU. Their reasoning is that developers prize a high-performance CPU, and the M7 is one of the best offerings in its class from that perspective. Who doesn’t love a processor with lots of MHz and a high IPC?

However, if Cramium’s chip were to gain traction and ship to millions of customers, it could effectively entrench the ARM instruction set — and more importantly — quirks such as the Memory Protection Unit (MPU) as the standard for open source SEs. We’ve seen the power of architectural lock-in as the x86 serially shredded the Alpha, Sparc, Itanium and MIPS architectures; so, I worry that every new market embracing ARM as a de-facto standard is also ground lost to fully open architectures such as RISC-V.

So, after some conversations, I accepted an advisory position at Cramium as the Ecosystem Engineer under the condition that they also include a RISC-V core on the chip. This is in addition to the Cortex M7. The good news is that a RISC-V core is royalty-free, and the silicon area necessary to add it at 22nm is basically a rounding error in cost, so it was a relatively easy sell. If I’m successful at integrating the RISC-V core, it will give software developers a choice between ARM and RISC-V.

So why is Cramium leaving the M7 core in? Quite frankly, it’s for risk mitigation. The project will cost upwards of $20 million to tape out. The ARM M7 core has been taped out and shipped in millions of products, and is supported by a billion-dollar company with deep silicon experience. The VexRiscv core that we’re planning to integrate, on the other hand, comes with no warranty of fitness, and it is not as performant as the Cortex M7. It’s just my word and the sweat of my brow that will, hopefully, ensure it works well enough to be usable. Thus, I find it understandable that the people writing the checks want a “plan B” that involves a battle-tested core, even if proprietary.

This will understandably ruffle the feathers of open source purists who will certify hardware as “Free” only if it contains solely libre components. I sympathize with their position; however, our choices are either that the open source community somehow provides a CPU core with a warranty of fitness (effectively underwriting a $20 million bill if there is a fatal bug in the core), or that I walk away from the project for “not being libre enough” and allow ARM to take the possibly soon-to-be-huge open source SE market without challenge.

In my view it’s better to compromise and have a seat at the table now, than to walk away from negotiations and simply cede green fields to proprietary technologies, hoping to retake lost ground only after the community has achieved consensus around a robust full-stack open source SE solution. So, instead of investing time arguing over politics before any work is done, I’m choosing to invest time building validation test suites. Once I have a solid suite of tests in hand, I’ll have a much stronger position to argue for the removal of any proprietary CPU cores.

On the Limit of Openness in a Proprietary Ecosystem

Advising on the CPU core is just one of many tasks ahead of me as their open source Ecosystem Engineer. Cramium’s background comes from the traditional chip world, where NDAs are the norm and open source is an exotic and potentially fatal novelty. Fatal, because most startups in this space exit through acquisition, and it’s much harder to negotiate a high acquisition price if prized IP is already available free-of-charge. Thus my goal is to not alienate their team with contumelious condescension about the obviousness and goodness of open source that is regrettably the cultural norm of our community. Instead, I am building bridges and reaching across the aisle, trying to understand their concerns, and explaining to them how and why open source can practically benefit a security chip.

To that end, trying to figure out where to draw the line for openness is a challenge. The crux of the situation is that the perceived fear/uncertainty/doubt (FUD) around a particular attack surface tends to have an inverse relation to the actual size of the attack surface. The chart below illustrates the perceived FUD around each layer of the security hierarchy:

[Figure: perceived FUD, by layer of the security hierarchy]

Generally, the amount of FUD around an attack surface grows with how poorly understood the attack surface is: naturally we fear things we don’t understand well; likewise we have less fear of the familiar. Thus, “user error” doesn’t sound particularly scary, but “direct readout” with a focused ion beam of hardware security keys sounds downright leet and scary, the stuff of state actors and APTs, and also of factoids spouted over beers with peers to sound smart.

However, the actual size of the attack surface is quite the opposite:

[Figure: actual attack-surface size, by layer of the security hierarchy]

In practice, “user error” – weak passwords, spearphishing, typosquatting, or straight-up fat-fingering a poorly designed UX – is common and often remotely exploitable. Protocol errors – downgrade attacks, failures to check signatures, TOCTOUs – are likewise fairly common and remotely exploitable. Next in order are straight-up software bugs – buffer overruns, use-after-frees, and other logic bugs. Due to the sheer volume of code (and, more significantly, the rate of code turnover) involved in most security protocols, there are a lot of bugs, and a constant stream of newly minted bugs with each update.

Beneath this are the hardware bugs. These are logical errors in the implementation of a hardware function, such as memory aliasing, open test access ports, and oversights such as partially mutable cryptographic material (for example, an AES key that can’t be read out, but can be updated one byte at a time). Underneath logical hardware bugs are side channels – leakage of secret information through timing, power, and electromagnetic emissions, which can occur even if the hardware is logically perfect. And finally, at the bottom layer is direct readout – someone with physical access to a chip directly inspecting its arrangement of atoms to read out secrets. While there is ultimately no defense against direct readout of nonvolatile secrets short of zeroizing them on tamper detection, it is an attack surface that is literally measured in microns, and it requires unmitigated physical access to the hardware – a far cry from the ubiquity of “user error” or even “software bugs”.
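To make the “partially mutable key” oversight concrete, here is a toy sketch in Rust (a stand-in cipher and a hypothetical device API – not any real chip’s interface): because each key byte can be tested independently, the search collapses from 2^128 guesses to at most 16 × 256 encryptions.

    // Toy model of a key register that can't be read, but CAN be written
    // one byte at a time: the hardware oversight described above.
    struct Device {
        key: [u8; 16],
    }

    impl Device {
        // Stand-in for AES: any keyed, injective-enough function shows the principle.
        fn encrypt(&self, block: u128) -> u128 {
            self.key.iter().fold(block, |acc, &k| acc.rotate_left(5) ^ (k as u128))
        }
        fn set_key_byte(&mut self, i: usize, val: u8) {
            self.key[i] = val;
        }
    }

    fn main() {
        let secret = *b"correct horse b!";
        let mut dev = Device { key: secret };
        let reference = dev.encrypt(0x1234_5678);

        let mut recovered = [0u8; 16];
        for i in 0..16 {
            for guess in 0..=255u8 {
                dev.set_key_byte(i, guess);
                // Output matches the reference only when the guess equals the
                // original byte, so each byte is attacked independently.
                if dev.encrypt(0x1234_5678) == reference {
                    recovered[i] = guess;
                    break;
                }
            }
            dev.set_key_byte(i, recovered[i]); // restore before moving on
        }
        assert_eq!(recovered, secret);
        println!("recovered the full 128-bit key in at most 4096 tries");
    }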

The current NDA-heavy status quo for SE chips creates an analytical barrier that prevents everyday users like us from determining how big the actual attack surface is. That analytical barrier actually extends slightly up the stack from hardware, into “software bugs”. This is because without intimate knowledge of how the hardware is supposed to function, there are important classes of software bugs we can’t analyze.

Furthermore, the inability of developers to freely write code and run it directly on SEs forces more functionality up into the protocol layer, creating an even larger attack surface.

My hope is that working with Cramium will improve this situation. In the end, we won’t be able to entirely remove all analytical barriers, but hopefully we arrive at something closer to this:

[Figure: the analytical barrier pushed down the stack, with the security-critical blocks open]

Due to various NDAs, we won’t be able to release things such as the mask geometries, and some blocks that are less relevant to security, such as the ADC and USB PHY, will remain proprietary. However, the goal is to have the critical sections responsible for the security logic – the cryptographic accelerators, the RISC-V CPU core, and other related blocks – shared as open source RTL descriptions. This will give us improved, although not perfect, visibility into a significant class of hardware bugs.

The biggest red flag in the overall scenario is that the on-chip interconnect matrix is slated to be a core generated using the ARM NIC-400 IP generator, so this logic will not be available for inspection. The reasoning behind this is, once again, risk mitigation of the tapeout. This is unfortunate, but this also means we just need to be a bit more clever about how we structure the open source blocks so that we have a toolbox to guard against potential misbehavior in the interconnect matrix.

My personal goal is to create a fully OSS-friendly FPGA model of the RISC-V core and their cryptographic accelerators using the LiteX framework, so that researchers and analysts can use this to model the behavior of the SE and create a battery of tests and fuzzers to confirm the correctness of construction of the rest of the chip.

In addition to the work advising Cramium’s engagement with the open source community, I’m also starting to look into non-destructive optical inspection techniques to verify chips in earnest, thanks to a grant I received from NLNet’s NGI0 Entrust fund. More on this later, but it’s my hope that I can find a synergy between the work I’m doing at Cramium and my silicon verification work to help narrow the remaining gaps in the trust model, despite refractory foundry and IP NDAs.

Counterpoint: The Utility of Secrecy in Security

Secrecy has utility in security. After all, every SE vendor runs with this approach, and for example, we trust the security of nuclear stockpiles to hardware that is presumably entirely closed source.

Secrecy makes a lot of sense when:

  • Even a small delay in discovering a secret can be a matter of life or death
  • Distribution and access to hardware is already strictly controlled
  • The secrets would rather be deleted than discovered

Military applications check all these boxes. The additional days, weeks or months delay incurred by an adversary analyzing around some obfuscation can be a critical tactical advantage in a hot war. Furthermore, military hardware has controlled distribution; every mission-critical box can be serialized and tracked. Although systems are designed assuming serial number 1 is delivered to the Kremlin, great efforts are still taken to ensure that is not the case (or that a decoy unit is delivered), since even a small delay or confusion can yield a tactical advantage. And finally, in many cases for military hardware, one would rather have the device self-destruct and wipe all of its secrets, rather than have its secrets extracted. Building in booby traps that wipe secrets can measurably raise the bar for any adversary contemplating a direct-readout attack.

On the other hand, SEs like those found in bank cards and phones are:

  • Widely distributed – often directly and intentionally to potentially adversarial parties
  • Protecting data at rest (value of secret is constant or may even grow with time)
  • Used as a trust root for complicated protocols that typically update over time
  • Protecting secrets where extraction is preferable to self-destruction. The legal system offers remedies for recourse and recovery of stolen assets; whereas self-destruction of the assets offers no recourse

In this case, the role of the anti-tamper countermeasures and side-channel minimization is to raise the investment necessary to recover data from “trivial” to somewhere around “there’s probably an easier and cheaper way to go about this…right?”. After all, for most complicated cryptosystems, the bigger risk is an algorithmic or protocol flaw that can be exploited without any circumvention of hardware countermeasures. If there is a protocol flaw, employing an SE to protect your data is like using a vault, but leaving the keys dangling on a hook next to the vault.

It is useful to contemplate who bears the greatest risk in the traditional SE model, where chips are typically distributed without any way to update their firmware. While an individual user may lose the contents of their bank account, a chip maker may bear a risk of many tens of millions of dollars in losses from recalls, replacement costs and legal damages if a flaw were traced to their design issue. In this game, the player with the most to lose is the chipmaker, not any individual user protected by the chip. Thus, a chipmaker has little incentive to disclose their design’s details.

A key difference between a traditional SE and Cramium’s is that Cramium’s firmware can be updated (assuming an updateable SKU is released; this was a surprisingly controversial suggestion when I brought it up). This is thanks in part to the extensive use of non-volatile ReRAM to store the firmware. This likewise shifts the calculus on what constitutes a recall event. The open source firmware model also means that the code on the device comes, per letter of the license, without warranty; the end customer is ultimately responsible for building, certifying and deploying their own applications. Thus, for a player like Cramium, the potential benefits of openness outweigh those of secrecy and obfuscation embraced by traditional SE vendors.

Summary

My role is to advise Cramium on how to shift the norms around SEs from NDAs to openness. Cramium is not attempting to forge an open-foundry model – they are producing parts using a relatively advanced (compared to your typical stand-alone SE) 22nm process. This process is protected by the highly restrictive foundry NDAs. However, Cramium plans to release much of their design under an open source license, to achieve the following goals:

  • Facilitate white-box inspection of cryptosystems implemented using their primitives
  • Speed up discovery of errors; and perhaps more importantly, improve the rate at which they are patched
  • Reduce the risk of protocol and algorithmic errors, so that hardware countermeasures become the true path of least resistance
  • Build trust
  • Promote wide adoption and accelerate application development

Cramium is neither fully open hardware, nor is it fully closed. My goal is to steer it toward the more open side of the spectrum, but the reality is there are going to be elements that are too difficult to open source in the first generation of the chip.

The Cramium chip complements the eFabless/Google efforts to build open-to-the-transistors chips. Today, one can build chips that are open to the mask level using 90 – 180nm processes. Unfortunately, the level of integration achievable with their current technology isn’t quite sufficient for a single-chip Secure Element. There isn’t enough ROM or RAM available to hold the entire application stack on chip, thus requiring a multi-chip solution and negating the HSM-like benefits of custom silicon. The performance of older processes is also not sufficient for the latest cryptographic systems, such as Post Quantum algorithms or Multiparty Threshold ECDSA with Identifiable Aborts. On the upside, one could understand the design down to the transistor level using this process.

However, it’s important to remember that knowing the mask pattern does not mean you’ve solved the supply chain problem and can trust the silicon in your hands. There are a lot of steps that silicon goes through on its way from foundry to product, and at any of those steps the chip you thought you were getting could be swapped out for a different one; this is particularly easy given that all of the chips available through eFabless/Google’s process use a standardized package and pinout.

In the context of Cramium, I’m primarily concerned about the correctness of the RTL used to generate the chip, and the software that runs on it. Thus, my focus in guiding Cramium is to open sufficient portions of the design such that anyone can analyze the RTL for errors and weaknesses, and less on mitigating supply-chain level attacks.

That being said, RTL-level transparency can still benefit efforts to close the supply chain gap. A trivial example would be using the RTL to fuzz blocks with garbage in simulation; any differences in measured hardware behavior versus simulated behavior could point to extra or hidden logic pathways added to the design. Extra backdoor circuitry injected into the chip would also add loading to internal nodes, impacting timing closure. Thus, we could also do non-destructive, in-situ experiments such as overclocking functional blocks to the point where they fail; with the help of the RTL we can determine the expected critical path and compare it against the observed failure modes. Strong outliers could indicate tampering with the design. While analysis like this cannot guarantee the absence of foundry-injected backdoors, it constrains the things one could do without being detected. Thus, the availability of design source opens up new avenues for verifying correctness and trustability in a way that would be much more difficult, if not impossible, to do without design source.
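To make the first idea concrete, here is a sketch of such a differential fuzzing harness (all names are hypothetical placeholders; in a real setup the model would be, say, a Verilator build of the released RTL, and the silicon side a register-level interface to the physical chip):

    // Differential fuzzing sketch: drive identical pseudo-random stimulus into
    // a simulation of the open RTL and into the physical block, and flag any
    // divergence. A tiny xorshift PRNG keeps the example dependency-free.
    fn xorshift32(state: &mut u32) -> u32 {
        *state ^= *state << 13;
        *state ^= *state >> 17;
        *state ^= *state << 5;
        *state
    }

    // Hypothetical stand-ins; a mismatch between the two would hint at hidden
    // or altered logic in the silicon.
    fn rtl_model(x: u32) -> u32 { x.rotate_left(3) ^ 0xA5A5_A5A5 }
    fn silicon(x: u32) -> u32 { x.rotate_left(3) ^ 0xA5A5_A5A5 }

    fn main() {
        let mut seed = 0xDEAD_BEEF_u32;
        let mut divergences = 0u32;
        for _ in 0..1_000_000 {
            let stimulus = xorshift32(&mut seed);
            if rtl_model(stimulus) != silicon(stimulus) {
                divergences += 1;
                eprintln!("divergence at stimulus {:#010x}", stimulus);
            }
        }
        println!("{} divergences found", divergences);
    }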

Finally, by opening as much of the chip as possible to programmers and developers, I’m hoping that we can get the open source SE chip ecosystem off on the right foot. This way, as more advanced nodes shift toward open PDKs, we’ll be ready and waiting to create a full-stack open source solution that adequately addresses all the security needs of our modern technology ecosystem.

Book Review: Open Circuits

Wednesday, September 21st, 2022

There’s a profound beauty in well-crafted electronics.

Somehow, the laws of physics conspired with the evolution of human consciousness such that sound engineering solutions are also aesthetically appealing: from the ideal solder fillet, to the neat geometric arrangements of components on a circuit board, to the billowing clouds of standard cells laid down by the latest IC place-and-route tools, aesthetics both inspire and emerge from the construction of practical, everyday electronics.

Eric Schlaepfer (@TubeTimeUS) and Windell Oskay (co-founder of Evil Mad Scientist)’s latest book, Open Circuits, is a celebration of the electronic aesthetic, by literally opening circuits with mechanical cross-sections, accompanied by pithy explanations and illustrations. Their masterfully executed cross-sectioning process and meticulous photography blur the line between engineering and art, reminding us that any engineering task executed with soul and care results in something that can inspire feelings of awe (“wow!”) and reflection (“huh.”): that is art.

The pages of Open Circuits contain ample inspiration for both novices and grizzled veterans alike. Having been in electronics for four decades, I sometimes worry I’m becoming numb and cynical as I watch the world’s landfills brim with cheap electronics, built without care and purchased (and disposed of) with even less thought. However, as I thumb through the pages of Open Circuits, that excitement, that awe which I felt as a youth when I traced my fingers along the outlines of the resistors and capacitors of my first computer returns to me. Schlaepfer and Oskay render even the most mundane artifacts, such as the ceramic disc capacitor, in splendid detail – and in ways I’ve never seen before. Prior to now, I had no intuition for the dimensions of an actual capacitor’s dielectric material. I also didn’t realize that every thick film resistor bears the marks of lasers that trim it to its final value. Or just seeing the cross-section of a coaxial cable, as joined through a connector – all of a sudden, the telegrapher’s equations and the time domain reflectometry graphs take on a new and very tangible meaning to me. Ah, I think, so that’s the bump in the TDR graph at the connector interface!

Also breathtaking is the sheer scope of components addressed by Schlaepfer and Oskay. Nothing is too retro, nothing is too modern, nothing is too delicate: if you’ve ever wanted to see a vacuum tube cut in half, they managed to somehow slice straight through it without shattering the thin glass envelope; likewise, if you ever wondered what your smartphone motherboard might look like, they’ve gone and sliced clear through that as well.

One of my favorite tricks of the authors is when they slice through optoelectronic devices: somehow, they manage to cut through multiple LEDs and leave them in an operable state, leading to stunning images such as a 7-segment LED still displaying the number “5” yet revealed in cross-section. I really appreciate the effort that went into mounting that part onto a beautifully fabricated and polished (perhaps varnished?) copper-clad circuit board, so that not only are you treated to the spectacle of the still-functional cross sectioned device, you have the reflection of the device rippling off of a handsomely brushed copper surface. Like I said: any engineering executed with soul and care is also art.

In a true class act, Schlaepfer and Oskay conclude the book with an “Afterword” that shares the secrets of their cross-sectioning and photography techniques. Adhering to the principle of openness, this meta-chapter breaks the fourth wall and gives you a peek into their atelier, showing you the tools and techniques used to generate the images within the book. Such sharing of hard-earned knowledge is a hallmark of true masters; while lesser authors might withhold such trade secrets, fearing others may rise to compete with them, Schlaepfer and Oskay gain an even deeper respect from their fans by disclosing the effort and craft that went into creating the book. Sharing also plants the seeds for a broader community of circuit-openers, preserving the knowledge and techniques for new generations of electronics aficionados.

Even if you’re not a “hardware person”, or even if you’re “not into tech”, the images in Open Circuits are so captivating that they may just tempt you to learn a bit more about it. Or, perhaps more importantly, a wayward young mind may be influenced to realize that hardware isn’t scary: it’s okay to peel back the covers and discover that the fruits of engineering are not merely functional, but also deeply aesthetic as well. I know that a younger version of me would have carried a copy of this book everywhere I went, poring over its pages at every chance.

While I was only able to review an early access electronic copy of their book, I am excited to get the full-color, hard-cover edition of the book. Having published a couple of books with No Starch Press myself, I know the passion with which its founder, Bill Pollock, conducts his trade. He does not scrimp on materials: for The Hardware Hacker, he sprang for silver ink on the endsheets and clear UV spot inks on the cover – extra costs that came out of his bottom line, but which made the hardcover edition look and feel great. So, I’m excited to see these wonderful images rendered faithfully onto the pages of a coffee-table companion book that I will be proud to showcase for years to come.

If you’re also turned on to Open Circuits, pre-order it on No Starch Press’ website, with the discount code “BUNNIESTUDIOS25”, to receive 25% off (no affiliate code or trackback in that link – 100% goes to No Starch and the authors). The code expires Tuesday, October 4. Pre-orders will also receive exclusive phone and desktop wallpaper images that are not in the book!

Hydroponics: Growing an Appreciation for Plants

Tuesday, August 9th, 2022

I once heard a saying – “Don’t feel pity on plants because they can’t move. Feel pity on us, because we have to”. I really didn’t have an appreciation for what this meant until the COVID pandemic hit, which restricted my movement for a couple of years, and I decided to spend some of my new-found time at home learning how to raise plants in my little flat in central Singapore. The result is a small hydroponics system that now lines the sunny windows of my place, yielding fresh herbs weekly that I incorporate into my dishes.

For me, hydroponics really drove home how remarkable plants are: from a bin containing nothing but water and salts, a fully-formed plant emerges. No vitamins, amino acids, or other nutrients – just add sunlight, and the plant produces everything it needs starting from a single, tiny seed. The seed encodes every gene it needs to survive and reproduce – our basil plant, for example, is tetraploid, which means it has four copies of every gene. Perhaps this somewhat explains the adaptability of plant clones – it is almost as if every branch on our basil bush has a separate character, each one trying a different angle at survival. Some branches would grow large and leafy, others small and dense, and if you propagate by a cutting, the resulting plant would inherit the character of the cutting. Thus, a lone plant should not be mistaken as lonely: it needs not a mate to create diverse offspring. Every tetraploid cell contains the genetic diversity of two diploids (whereas a human is one diploid), allowing it to adapt without need of sex or seedlings.

I also did an experiment and grew some sage from seed, planting one set in dirt and another in hydroponics. Even though they came from the same seed stock, the resulting individuals bore little resemblance to each other. The dirt-grown sage looked much like the herb you’re familiar with from the grocery store – dark green, covered with fine hair, and densely arranged on a stem. The hydroponically grown sage instead grows like a vine, with long thin green stems between each leaf, the leaves themselves having a lighter color and less hair. The flavor is even a bit different: the hydroponic sage emits a slightly sulfurous odor when disturbed, and exhibits a bit more mint on the palate when eaten.

Even more fascinating is how the plants seem to “groom the water”. I’ve noticed that the most successful plants we’ve tried to grow can lower the pH of the water on their own, and regulate it within a fairly consistent band (more on this later!). Furthermore, they seem to have recruited commensal organisms to live among their roots. The basil grows long white or translucent roots with a pale white mycorrhiza, while the sage has a brownish symbiont and a short, bushy root ball. Thus I only fully replace the water of the hydroponic system as a last resort, if a plant seems diseased; normally I will cycle the water by removing about half of the reservoir and topping it up, so as not to displace the favored microbes from the ecosystem.

The Setup

The initial inspiration to try hydroponics actually came a bit by chance. We bought some locally grown hydroponic lettuce, and noticed that they were packaged as whole plants, complete with roots. We were curious – could we pluck most of the leaves of this lettuce, and then stick the plants in water, and grow another serving of hydroponic lettuce?

Surprisingly enough, it worked! Even with a crude setup consisting of a handful of generic plant fertilizer and a small aquarium bubbler, we were able to take a single plant and grow a couple more servings of lettuce from it. Unfortunately, with time, the plants started to grow very “stemmy” and pale, and eventually they succumbed to tiny mites that infested their leaves.

Inspired by this initial success, I started to read up a bit on how others did hydroponics. One of the top hits is a blog by Kyle Gabriel, detailing how he built an extremely sophisticated system based around a Raspberry Pi, and a multitude of sensors, valves and pumps. It was sort of a nerd’s dream of how farming could be fully automated. I figured I’m pretty handy with a soldering iron, so maybe I could give a go at building a system like his. So, I dug up a spare Raspberry Pi, some solid state relays and white LEDs left over from when I did the house lighting, and put together a simple system that just automated the lighting and took hourly photos of the plants as they grew. The time-lapses were fascinating!

You can’t really watch a plant grow in real time, but over a period of days one can easily see patterns in how plants grow and adapt.

With this small success, I put my mind toward further automation – adding various pumps and regulators for the system. However, as I started to put together the BOM for this, I realized very quickly that there was going to be no return on investment for building out a system this complicated. Plus, I really didn’t like that the whole system ran on code – I did not relish the idea of coming home to a room flooded with water or a set of rotten plants because my control program hit a segfault.

So, I sat back and thought about things a bit. One observation I had was that despite providing the plants with a 10,000 lux light source 12 hours a day, they still had a tendency to grow toward the nearby window. As an experiment, I took one bin and removed it from the regulated light source, and just stuck it up against the window. The plant grew much better with natural sunlight, so I removed all the artificial lighting, unplugged the Raspberry Pi, and just stuck all the plants against the windowsill (it definitely helps that I live one degree off the equator – it’s eternally summer here, with sunrise at 7AM and sunset at 7PM, 365 days a year). I was happy to save the electricity while getting bigger plants in the process.

For water level automation, I replaced the computer with two float switches in series: one cuts off the pump if the water level gets too low in the feed reservoir; the other cuts off the pump if the water level gets too high in the plant’s growth bin. You can use the same type of switch for both purposes; just mounting the switch upside down inverts its function.
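In code terms, the two series switches implement a simple AND gate, purely electromechanically (a sketch of the truth table only – the whole point is that no computer runs this):

    // The pump receives power only when both series switches are closed:
    // the feed reservoir is not empty AND the grow bin is not full.
    fn pump_powered(reservoir_empty: bool, grow_bin_full: bool) -> bool {
        !reservoir_empty && !grow_bin_full
    }

    fn main() {
        assert!(pump_powered(false, false));  // water available, bin has room: pump runs
        assert!(!pump_powered(true, false));  // reservoir dry: pump stops
        assert!(!pump_powered(false, true));  // grow bin full: pump stops
    }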

The current “automated” system, consisting of a reservoir on the left, with a peristaltic pump on top of the reservoir bin, and two float switches. The silicone tube that takes the solution from the reservoir to the plant bin is covered by an old sock to prevent algae from growing in the solution when it’s not moving. There is also an aeration pump, not visible in this photo.

The float switch mounted on the top, functioning as a “break-when-full” switch. You can see the plant’s roots have taken over the entire bin! A couple of spacers were also added to adjust the height of the water.

The float switch mounted on the bottom of a tank, functioning as a “break when empty” switch. In order to provide clearance for the switch on the bottom, a couple of wine corks were hot-glued to the bottom of the bin. The switch comes with a rubber o-ring, creating an effective seal and no leakage.

So, with a couple of storage bins from Daiso, two float switches and a peristaltic pump, I’ve constructed a system that automates the care of our plants for up to two weeks at a time for under $40. No transistors required – just old-school technology dating from the 1800s!

There is one other small detail necessary for hydroponics – an aeration pump. Any aquarium pump will do – although we eventually upgraded to some fancy silent pumps instead of the cheaper but noisier diaphragm-based ones. Some blogs say that the “roots need oxygen” to survive, but my suspicion is that the pumps mostly serve to circulate the nutrient solution. If you leave the pump off, the roots will rapidly deplete the water around them of nutrients, and without any circulation you’re relying purely on a slow process of diffusion for nutrients to reach the roots. I’ve noticed that in bins with a low air flow, the roots will grow thick and matted, but in bins with a faster air flow, the roots barely need to grow at all – my hypothesis is that this reflects the plant allocating fewer resources toward root growth in bins with greater circulation, because fresh solution is always available with faster circulation.

The Tricky Bit

The electronics were actually the easiest part of the whole enterprise; the hardest part was figuring out what, exactly, I had to add to the water to get the plants to flourish. Once I got this right, the plants basically take care of themselves; of course it helps to pick plant varietals that are pest-resistant, and have the innate ability to regulate the pH level around their roots.

When I started, I was naively aware that plants needed nitrogen-bearing fertilizers. The labels on packaged fertilizer solutions use an “NPK” system, which stands for nitrogen-phosphorus-potassium. OK, sure, so plants need a bit more than just nitrogen. Surely I could just pick up some of this NPK stuff, dissolve it in some water, and we’re good to go…

…but how much of this should I add? This deceptively simple question led me down a several-month rat-hole that took many failed experiments and daily journals of observations to find an answer. The core problem is that most plant bloggers like to use units like “one handful” as a unit of measure; the more precise ones would write something to the effect of “one capful per gallon”. As an engineer, units of handfuls and capfuls are extremely dissatisfying: how many grams per liter, dammit!

This led me to research several academic papers about plant nutrition, and to read graphs of plant growth under “controlled” conditions whose results were astonishingly contradictory to what the plant bloggers would write: the NPK ratios implied by some of the academic works were wildly different from what the plant bloggers relayed from their actual experience.

It turns out the truth is somewhere in between. A big confounding factor is probably the nature of the soil used in the research, versus the base quality of the water used in your hydroponic system. Most of the research I uncovered was written about fertilizing plants grown in soil, and for example “loamy diatomaceous earth” turns out to be quite a complicated mix of nutrients in and of itself.

The most informative bit of research that I uncovered was a set of experiments where researchers would ash a plant after it was grown, and measure out all the base elements from the resulting dry weight of the plant. It was here that I learned that, for example, molybdenum is absolutely essential to the growth of plants. It’s almost never mentioned in soil cultures, because dirt almost always has sufficient trace quantities of molybdenum to sustain plants, but water cultures quickly become molybdenum-deficient, and the plants will become pale and sickly without a supplement.

I also learned that plants need calcium and magnesium in astonishingly large quantities: as much as they need phosphorus and potassium. Again, these two nutrients are less discussed in soil-based literature because many rocks are basically made of calcium and magnesium, and as such plants have no trouble extracting what they need from the soil.

Finally, there is the issue of iron. Iron turns out to be the hardest nutrient to balance in a hydroponic system. Despite being extremely plentiful on Earth, and indeed, possibly being the ultimate composition of the entire universe, it is extremely scarce as a free atom in the biosphere. This is in part because it gets strongly bound to other molecules. For example, oxygen binds to myoglobin with a log K1 of 6.18, which means that it is a million times more likely to find oxygen bound to myoglobin than unbound in solution. This may sound strong, but EDTA, a chelating agent, has a log K1 of something like 27.7, so it is one octillion (1,000,000,000,000,000,000,000,000,000) times more likely to be found bound to iron than unbound at equilibrium. In a way, iron is so biologically important that organic life had an arms race to bind free iron, and some ridiculously potent molecules exist to rapidly sweep the tiniest amount of iron out of solution. Fortunately, as long as I (or more conveniently, the plant itself) can keep the pH of the water below 5.5, I can take advantage of the extremely strong binding of EDTA to iron to keep it dissolved in solution and out of reach of other organisms trying to scavenge it out of the water. The plants can somehow take in the bound iron-EDTA complex, degrade the EDTA, and extract the iron for their own use (it took a long time, and many trials with various iron-binding agents, to figure out how to remedy the chlorosis that would eventually take over every plant I grew).
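To unpack the equilibrium arithmetic: a stability constant of log K1 = 27.7 means the bound-to-unbound ratio at equilibrium is 10^27.7, or roughly 5 × 10^27 – on the order of an octillion – while myoglobin’s log K1 of 6.18 works out to 10^6.18, about 1.5 million, hence the “million times” figure above.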

Alright, now that I have a vague understanding of the atoms that a plant needs to survive, the question is how do I get them to the plant – and in what ratios? The answer is equally vague and frustrating. You can’t simply throw a chunk of magnesium metal into a bin of water and expect a plant to access it. The magnesium needs to be in the form of a salt so that it readily dissolves into the water. One of the easiest versions to buy is magnesium sulfate, MgSO4, also known as Epsom salt. So, I can just read the blogs, find the ones that tell you how many grams of magnesium sulfate to add per liter of water, and be done with it, right?

Wrong again! It turns out that MgSO4 has several “hydration states” (11 total). Even though it looks like a hard, translucent crystal, Epsom salt is actually more water than magnesium by weight, as 7 molecules of water are bound to every molecule of magnesium sulfate in that preparation.
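The arithmetic is worth doing once: MgSO4•7H2O has a molar mass of about 120.4 + 7 × 18.0 = 246.5 g/mol, so magnesium is only 24.3 / 246.5 ≈ 9.9% of Epsom salt by weight (the water of hydration is about 51%), whereas anhydrous MgSO4 is 24.3 / 120.4 ≈ 20.2% magnesium. Confuse the two preparations and your magnesium dose is off by a factor of two.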

Of course, no plant blogger ever specifies the hydration states of the salts that they use in their preparations; and many on-line listings for agricultural-grade salts also fail to list the exact hydration state of their salts. Unfortunately this means there can be extremely large deviations in actual nutrient availability if you purchase a dissimilar hydration state from that used by the plant blogger.

That left me with purchasing a set of salts and trying to calculate, from first principles, the ratios that I needed to add to my hydroponics bins. The salts I finally decided on purchasing are:

  • Monopotassium phosphate (anhydrous) KH2PO4
  • Potassium sulfate (anhydrous) K2SO4
  • Calcium nitrate Ca(NO3)2•4H2O – hygroscopic
  • Magnesium sulfate MgSO4•7H2O

Plus a pre-mixed micronutrient from a local hydroponics shop that contains the remaining essential elements in the following ratios:

  • Iron as EDTA chelate 21.25 mg/mL
  • Manganese 5.684 mg/mL
  • Boron 0.483 mg/mL
  • Zinc 0.617 mg/mL
  • Copper 0.267 mg/mL
  • Molybdenum 0.471 mg/mL

For the salts, I computed a matrix that allows me to solve for the amount of nutrient I want in solution, by taking the mass fraction of each nutrient available, writing it in matrix form, and then inverting it (had to crack open my linear algebra book from high school to remember what determinants were! Who knew that determinants could be useful for farming…).

You can make the matrix yourself by expressing the ratio of the milligrams of nutrient (as derived by the atomic weight of the nutrient) per milligram of compound (as derived by summing up the weight of all the atoms in the molecular formula, including the hydration state), and putting it into a matrix form like this:
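For concreteness, here is an illustrative version of that matrix for the four salts above, with mass fractions computed from atomic weights and rounded to three places (a sketch of the method, not the exact numbers from my spreadsheet):

    [ mg N  ]   [ 0.000  0.000  0.119  0.000 ]   [ mg KH2PO4        ]
    [ mg P  ] = [ 0.228  0.000  0.000  0.000 ] x [ mg K2SO4         ]
    [ mg K  ]   [ 0.287  0.449  0.000  0.000 ]   [ mg Ca(NO3)2•4H2O ]
    [ mg Mg ]   [ 0.000  0.000  0.000  0.099 ]   [ mg MgSO4•7H2O    ]

For example, the 0.119 entry is the nitrogen fraction of calcium nitrate tetrahydrate: 2 × 14.01 / 236.15.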

Then I fed the coefficients into an inverse matrix calculator to derive a final form that lets you plug in your desired NPK ratio and compute the mass of each salt you need to dissolve in water to achieve it:
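In other words, if A is the mass-fraction matrix and n is the vector of desired nutrient masses, the masses of salt to weigh out are m = A⁻¹n. Note that calcium isn’t independently adjustable in this scheme: it tags along with nitrogen at a fixed 0.170 mg per mg of calcium nitrate tetrahydrate, so choosing a nitrogen target sets the calcium level too.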

As a sanity check, I plug the calculated weights back into the forward matrix to make sure I didn’t mess up the math, and I also add up all the dissolved solids into a TDS (total dissolved solids) number, so I can cross-check the resulting solution easily using a cheap TDS meter (link without referral code). In case you want to start from a template, you can download the spreadsheet. The template contains the pre-computed ratio that I currently use for growing all my herbs with compounds that I can source easily from the local market, and it seems to work fairly well for plants ranging from Brazilian spinach to basil to sage.
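If you’d rather not maintain a spreadsheet, the same computation is a few lines of code. Here is a minimal, dependency-free sketch (using the illustrative mass fractions from above and example targets, not my exact recipe):

    // Solve A * m = n for the salt masses m, where A holds mg-of-nutrient per
    // mg-of-salt and n is the desired mg of each nutrient per liter.
    // Plain Gaussian elimination with partial pivoting; no external crates.
    fn solve4(mut a: [[f64; 4]; 4], mut b: [f64; 4]) -> [f64; 4] {
        for col in 0..4 {
            // Bring the row with the largest entry in this column up to the pivot.
            let pivot = (col..4)
                .max_by(|&i, &j| a[i][col].abs().partial_cmp(&a[j][col].abs()).unwrap())
                .unwrap();
            a.swap(col, pivot);
            b.swap(col, pivot);
            // Eliminate the entries below the pivot.
            for row in col + 1..4 {
                let f = a[row][col] / a[col][col];
                for k in col..4 {
                    a[row][k] -= f * a[col][k];
                }
                b[row] -= f * b[col];
            }
        }
        // Back-substitution.
        let mut x = [0.0; 4];
        for row in (0..4).rev() {
            let s: f64 = (row + 1..4).map(|k| a[row][k] * x[k]).sum();
            x[row] = (b[row] - s) / a[row][row];
        }
        x
    }

    fn main() {
        // Rows: N, P, K, Mg. Columns: KH2PO4, K2SO4, Ca(NO3)2.4H2O, MgSO4.7H2O.
        let fractions = [
            [0.000, 0.000, 0.119, 0.000], // N
            [0.228, 0.000, 0.000, 0.000], // P
            [0.287, 0.449, 0.000, 0.000], // K
            [0.000, 0.000, 0.000, 0.099], // Mg
        ];
        // Desired mg of each nutrient per liter of solution (example targets).
        let targets = [100.0, 30.0, 150.0, 30.0];
        let masses = solve4(fractions, targets);
        let tds: f64 = masses.iter().sum(); // cross-check with a cheap TDS meter
        println!("salt masses (mg/L): {:.1?}", masses);
        println!("expected TDS (mg/L): {:.0}", tds);
    }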

As a side note, calcium nitrate is pretty tricky to handle. It’s very hygroscopic, so if left in ambient humidity, it will absorb water from the atmosphere and “melt away” into a concentrated, syrupy liquid. I usually add a few percent extra by weight over the formula to compensate for the excess water it accumulates over time. Also, I store the substance in an air-tight bag, and I always wear nitrile gloves while handling the compound to avoid damaging my hands.

For the micronutrients, it’s a bit trickier to dose correctly. Fortunately, I have a micropipette set that can measure out solutions in the range from 1uL to 200uL, from back when I did some genetic engineering in my kitchen (pipettes are also surprisingly cheap (without referral code) now). Again, the blogs are not terribly helpful about dosing – you get advice along the lines of “one drop per bucket” or something like that. What’s a drop? What’s a bucket? The exact volume of a drop depends on the surface tension and viscosity of the liquid, but I went with the rule of thumb that one drop is 50uL (20 drops per mL) as a starting point.

Initially, I tried 60uL of micronutrients per 1.5L of solution, but the plants started to show evidence of boron poisoning (this is a great guide for diagnosing plant nutritional problems based on the appearance of the leaves), so after a few iterations and replacements of the water to flush out the excess accumulated micronutrients, I settled on 30uL of micronutrients per 1.5L of solution, with a 15uL per week bump for iron-hungry species like spinach.
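To translate that dosing into concrete units, using the iron concentration from the label above: 30uL of concentrate in 1.5L of solution delivers 0.030mL × 21.25mg/mL ≈ 0.64mg of iron, or about 0.43mg/L; the 15uL weekly bump for iron-hungry species adds roughly another 0.21mg/L.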

At a microliters-per-week consumption rate, even the smallest bottle of 150mL micronutrient solution will last years, but the tricky part is storing it. In order to avoid contaminating the bottle, I aliquot the solution every couple months into a set of 1.5mL eppendorfs which I keep in my wine fridge, alongside the original bottle. Even though I try my best to avoid contaminating the eppendorfs, after a couple weeks a pellet forms at the bottom from some process that is causing the micronutrients to come out of solution, so I typically end up discarding the aliquot before it is entirely used up.

The Final Result

It’s pretty neat to go from a pile of salts to delicious herbs. About a gram of salts go in, and a week later a couple dozen grams of leaves come out!


In go salts…


Out comes basil!

Basil in particular has been a real champ at growing in our hydroponics bins – we are at the point now where, between two plants, we’re regularly giving it away to friends because it yields more than we can eat, even though I cook Italian food almost every other night. A handful of basil, a bit of salt, olive oil, tomatoes and garlic, and we have a flavorful bruschetta to kick off a meal! Our other favorite is sage: it’s great for flavoring pork and poultry, but for some reason it’s very hard to find in Singapore. So, having a bit of fresh sage around is convenient, and it saves us a bit of money, since it can be quite expensive to buy in specialty stores.

It’s been less practical to grow bulk vegetables, such as spinach. Brazilian spinach has been fairly successful in terms of growth, but it takes about a month for a cutting to grow to maturity, and we need about four plants to make a salad, so we’d need several racks of bins to make a dent in our vegetable consumption. Also, in general our herbs have had fewer pests than leafy green vegetables; maybe their strong flavor comes from compounds they produce that also serve to repel bugs? So, in addition to providing great flavor for our sauces, the herbs have required no pesticides.

Overall, it was satisfying to learn about plant biology while developing a better connection to my food through technology. It was also a calming way to pass time during the pandemic; agriculture requires patience and time, but the reward is visceral. Having kept a miniature farmer’s almanac to decode missing pieces of information from the blogosphere, I have a new appreciation for how such personal journals could lead to scientific discoveries. And, I’m a much better chef than I was a couple of years ago. Somehow, just having the fresh herbs around inspired explorations into new and exciting pairings; it gave me a whole new way to think about food.

Rust: A Critical Retrospective

Thursday, May 19th, 2022

Since I was unable to travel for a couple of years during the pandemic, I decided to take my new-found time and really lean into Rust. After writing over 100k lines of Rust code, I think I am starting to get a feel for the language, and, like every cranky engineer, I have developed opinions; because this is the Internet, I’m going to share them.

The reason I learned Rust was to flesh out parts of the Xous OS written by Xobs. Xous is a microkernel message-passing OS written in pure Rust. Its closest relative is probably QNX. Xous is written for lightweight (IoT/embedded scale) security-first platforms like Precursor that support an MMU for hardware-enforced, page-level memory protection.

In the past year, we’ve managed to add a lot of features to the OS: networking (TCP/UDP/DNS), middleware graphics abstractions for modals and multi-lingual text, storage (in the form of an encrypted, plausibly deniable database called the PDDB), trusted boot, and a key management library with self-provisioning and sealing properties.

One of the reasons why we decided to write our own OS instead of using an existing implementation such as SeL4, Tock, QNX, or Linux, was that we wanted to really understand what every line of code was doing in our device. For Linux in particular, its source code base is so huge and so dynamic that even though it is open source, you can’t possibly audit every line in the kernel. Code changes are happening at a pace faster than any individual can audit. Thus, in addition to being home-grown, Xous is also very narrowly scoped to support just our platform, to keep as much unnecessary complexity out of the kernel as possible.

Being narrowly scoped means we could also take full advantage of having our CPU run in an FPGA. Thus, Xous targets an unusual RV32-IMAC configuration: one with an MMU + AES extensions. It’s 2022 after all, and transistors are cheap: why don’t all our microcontrollers feature page-level memory protection like their desktop counterparts? Being an FPGA also means we have the ability to fix API bugs at the hardware level, leaving the kernel more streamlined and simplified. This was especially relevant in working through abstraction-busting processes like suspend and resume from RAM. But that’s all for another post: this one is about Rust itself, and how it served as a systems programming language for Xous.

Rust: What Was Sold To Me

Back when we started Xous, we had a look at a broad number of systems programming languages and Rust stood out. Even though its `no-std` support was then-nascent, it was a strongly-typed, memory-safe language with good tooling and a burgeoning ecosystem. I’m personally a huge fan of strongly typed languages, and memory safety is good not just for systems programming: it enables optimizers to do a better job of generating code, plus it makes concurrency less scary. I actually wished for Precursor to have a CPU with hardware support for tagged pointers and memory capabilities, similar to what was done on CHERI, but after some discussions with the team doing CHERI it was apparent they were very focused on making C better and didn’t have the bandwidth to support Rust (although that may be changing). In the grand scheme of things, C needed CHERI much more than Rust needed CHERI, so that’s a fair prioritization of resources. However, I’m a fan of belt-and-suspenders for security, so I’m still hopeful that someday hardware-enforced fat pointers will make their way into Rust.

That being said, I wasn’t going to go back to the C camp simply to kick the tires on a hardware retrofit that backfills just one poor aspect of C. The glossy brochure for Rust also advertised its ability to prevent bugs before they happen through its strict “borrow checker”. Furthermore, its release philosophy is supposed to avoid what I call “the problem with Python”: your code stops working if you don’t actively keep up with the latest version of the language. Also unlike Python, Rust is not inherently unhygienic, in that the advertised way to install packages is not also the wrong way to install packages. Contrast this to Python, where the official docs on packages lead you to add them to the system environment, only to be scolded by Python elders with a “but of course you should be using a venv/virtualenv/conda/pipenv/…, everyone knows that”. My experience with Python would have been so much better if this detail were not relegated to Chapter 12 of 16 in the official tutorial. Rust is also supposed to be better than e.g. Node at avoiding the “oops I deleted the Internet” problem when someone unpublishes a popular package, at least if you use fully specified semantic versions for your packages.

In the long term, the philosophy behind Xous is that eventually it should “get good enough”, at which point we should stop futzing with it. I believe it is the mission of engineers to eventually engineer themselves out of a job: systems should get stable and solid enough that it “just works”, with no caveats. Any additional engineering beyond that point only adds bugs or bloat. Rust’s philosophy of “stable is forever” and promising to never break backward-compatibility is very well-aligned from the point of view of getting Xous so polished that I’m no longer needed as an engineer, thus enabling me to spend more of my time and focus supporting users and their applications.

The Rough Edges of Rust

There’s already a plethora of love letters to Rust on the Internet, so I’m going to start by enumerating some of the shortcomings I’ve encountered.

“Line Noise” Syntax

This is a superficial complaint, but I found Rust syntax to be dense, heavy, and difficult to read, like trying to read the output of a UART with line noise:

Trying::to_read::<&'a heavy>(syntax, |like| { this. can_be( maddening ) }).map(|_| ())?;

In more plain terms, the line above does something like: invoke a method called “to_read” on the object (actually `struct`) “Trying”, with a type annotation of “&heavy” and a lifetime of 'a, with the parameters of “syntax” and a closure taking a generic argument of “like” calling the can_be() method on another instance of a structure named “this” with the parameter “maddening”, with any non-error return values mapped to the Rust unit type “()”, and errors unwrapped and kicked back up to the caller’s scope.

Deep breath. Surely, I got some of this wrong, but you get the idea of how dense this syntax can be.

And then on top of that you can layer macros and directives which don’t have to follow other Rust syntax rules. For example, if you want to have conditionally compiled code, you use a directive like

#[cfg(all(not(baremetal), any(feature = "hazmat", feature = "debug_print")))]

Which says: if either the feature “hazmat” or “debug_print” is enabled and you’re not running on bare metal, use the block of code below (and I surely got this wrong too). The most confusing part about this syntax to me is the use of a single “=” to denote equivalence and not assignment, because stuff in config directives isn’t Rust code. It’s like a whole separate meta-language with a dictionary of key/value pairs that you query.
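To make the key/value flavor of this meta-language concrete, here is a minimal, self-contained sketch (the feature names are made up, mirroring the directive above; with no features enabled, the second definition is the one that compiles):

    // Hypothetical features for illustration; they would be declared under
    // [features] in Cargo.toml to toggle the first branch on.
    #[cfg(any(feature = "hazmat", feature = "debug_print"))]
    fn debug_dump(msg: &str) {
        println!("[debug] {}", msg);
    }

    #[cfg(not(any(feature = "hazmat", feature = "debug_print")))]
    fn debug_dump(_msg: &str) {} // compiled away to a no-op

    fn main() {
        debug_dump("hello"); // which body exists is decided at compile time
    }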

I’m not even going to get into the unreadability of Rust macros – even after having written a few Rust macros myself, I have to admit that I feel like they “just barely work” and probably thar be dragons somewhere in them. This isn’t how you’re supposed to feel in a language that bills itself as reliable. Yes, it is my fault for not being smart enough to parse the language’s syntax, but also, I do have other things to do with my life, like build hardware.

Anyways, this is a superficial complaint. As time passed I eventually got over the learning curve and became more comfortable with it, but it was a hard, steep curve to climb. This is in part because all the Rust documentation is either written in eli5 style (good luck figuring out “feature”s from that example), or you’re greeted with a formal syntax definition (technically, everything you need to know to define a “feature” is in there, but nowhere is it summarized in plain English), and nothing in between.

To be clear, I have a lot of sympathy for how hard it is to write good documentation, so this is not a dig at the people who worked so hard to write so much excellent documentation on the language. I genuinely appreciate the general quality and fecundity of the documentation ecosystem.

Rust just has a steep learning curve in terms of syntax (at least for me).

Rust Is Powerful, but It Is Not Simple

Rust is powerful. I appreciate that it has a standard library which features HashMaps, Vecs, and Threads. These data structures are delicious and addictive. Once we got `std` support in Xous, there was no going back. Coming from a background of C and assembly, I find Rust’s standard library rich and usable – I have read some criticisms that it lacks features, but for my purposes it really hits a sweet spot.

That being said, my addiction to the Rust `std` library has not done us any favors in terms of building an auditable code base. One of the criticisms I used to level at Linux was “holy cow, the kernel source includes things like an implementation of red-black trees – how is anyone going to audit that?”

Now, having written an OS, I have a deep appreciation for how essential these rich, dynamic data structures are. However, the fact that Xous doesn’t include an implementation of HashMap within its repository doesn’t mean that we are any simpler than Linux: indeed, we have just swept a huge pile of code under the rug; just the `collections` portion of the standard library represents about 10k+ SLOC of very high complexity.

So, while Rust’s `std` library allows the Xous code base to focus on being a kernel and not also be its own standard library, from the standpoint of building a minimum attack-surface, “fully-auditable by one human” codebase, I think our reliance on Rust’s `std` library means we fail on that objective, especially so long as we continue to track the latest release of Rust (and I’ll get into why we have to in the next section).

Ideally, at some point, things “settle down” enough that we can stick a fork in it and call it done by, well, forking the Rust repo and saying “this is our attack surface, and we’re not going to change it”. Even then, the Rust `std` repo dwarfs the Xous repo by several multiples in size, and that’s not counting the complexity of the compiler itself.

Rust Isn’t Finished

This next point dovetails into why Rust is not yet suitable for a fully auditable kernel: the language isn’t finished. For example, while we were coding Xous, a feature called `const generics` was introduced. Before this, Rust had no native ability to deal with arrays bigger than 32 elements! This limitation is a bit maddening, and even today there are shortcomings, such as the `Default` trait being unable to initialize arrays larger than 32 elements. This friction led us to cap many things at 32 elements: for example, when we pass the results of an SSID scan between processes, the structure only reserves space for up to 32 results, because the friction of going to a larger, more generic structure just isn’t worth it. That’s a language-level limitation directly driving a user-facing feature.
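For a concrete taste of the friction, here is a sketch with made-up types (not Xous’s actual SSID structure):

    // With const generics, code can finally be generic over array length...
    fn first_nonzero<const N: usize>(data: &[u8; N]) -> Option<u8> {
        data.iter().copied().find(|&b| b != 0)
    }

    // ...but `Default` is still only implemented for arrays of up to 32
    // elements, so result structures like this cap out at 32 entries:
    #[derive(Default)]
    struct SsidScan {
        count: u32,
        ssids: [Option<[u8; 32]>; 32], // make the outer 32 a 33 and the derive fails
    }

    fn main() {
        let scan = SsidScan::default();
        println!("{} of {} slots used", scan.count, scan.ssids.len());
        println!("{:?}", first_nonzero(&[0u8, 0, 7, 1]));
    }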

Also over the course of writing Xous, things like inline assembly and workspaces finally reached maturity, which means we need to go back and revisit some unholy things we did to get those critical few lines of initial boot code, written in assembly, integrated into our build system.

I often ask myself “when is the point we’ll get off the Rust release train?”, and I think the answer is when they finally make `alloc` no longer a nightly-only story. At the moment, `no_std` targets have no access to the heap unless they hop on the “nightly” train, in which case you’re back in the Python-esque nightmare of your code routinely breaking with language releases.

We definitely gave writing an OS in `no_std` + stable a fair shake. The first year of Xous development was all done using `no_std`, at a cost in memory space and complexity. It’s possible to write an OS with nothing but pre-allocated, statically sized data structures, but we had to accommodate the worst-case number of elements in all situations, leading to bloat. Plus, we had to roll a lot of our own core data structures.
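
To illustrate the trade-off (with illustrative types only, not Xous’s actual code): without a heap, every structure is sized for the worst case at compile time, whereas `std` containers only pay for what they use.

```rust
// The no_std style: worst-case sizing, baked in at compile time.
struct LogBufferStatic {
    entries: [[u8; 128]; 64], // 8 KiB reserved, even if only 2 entries are used
    len: usize,
}

// The std/alloc style: the same idea, but it grows on demand.
struct LogBufferHeap {
    entries: Vec<Vec<u8>>,
}
```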

About a year ago, that all changed when Xobs ported Rust’s `std` library to Xous. This means we are able to access the heap in stable Rust, but it comes at a price: now Xous is tied to a particular version of Rust, because each version of Rust has its own unique version of `std` packaged with it. This version tie exists for a good reason: `std` is where the sausage gets made – it’s where fundamentally `unsafe` hardware constructions such as memory allocation and thread creation are turned into “safe” Rust structures. (Also, a fun fact I recently learned: Rust doesn’t have a native allocator for most targets – it simply punts to the native libc `malloc()` and `free()` functions!) In other words, Rust is able to make a strong guarantee about the stable release train not breaking old features in part because of all the loose ends swept into `std`.
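
That “punting” is visible right in the standard library: `std::alloc::System` is documented to be based on `malloc` on Unix platforms (and `HeapAlloc` on Windows). A minimal sketch of what’s going on underneath a safe `Vec::push`:

```rust
// Explicitly selecting the default system allocator; on most hosted
// targets this is a thin wrapper over the native libc allocator.
use std::alloc::System;

#[global_allocator]
static GLOBAL: System = System;

fn main() {
    let mut v = Vec::new();
    v.push(42u32); // a safe API; the unsafe allocation is hidden inside `std`
    println!("{:?}", v);
}
```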

I have to keep reminding myself that having `std` doesn’t eliminate the risk of severe security bugs in critical code – it merely shuffles a lot of critical code out of sight, into a standard library. Yes, it is maintained by a talented group of dedicated programmers who are smarter than me, but in the end, we are all only human, and we are all fair targets for software supply chain exploits.

Rust has a clockwork release schedule – every six weeks, it pushes a new version. And because our fork of `std` is tied to a particular version of Rust, it means every six weeks, Xobs has the thankless task of updating our fork and building a new `std` release for it (we’re not a first-class platform in Rust, which means we have to maintain our own `std` library). This means we likewise force all Xous developers to run `rustup update` on their toolchains so we can retain compatibility with the language.

This probably isn’t sustainable. Eventually, we need to lock down the code base, but I don’t have a clear exit strategy for this. Maybe the next point at which we can consider going back to `no_std` is when `alloc` stabilizes, giving us access to the heap again. We could then decouple Xous from the Rust release train, but we’d still need to backfill `std` features such as HashMap, Thread, and Mutex (Vec, Arc, Rc, and Box would come with `alloc`, and RefCell already lives in `core`) to enable Xous to be efficiently coded.

Unfortunately, stabilizing `alloc` is very hard, and it has been in development for many years now. That being said, I really appreciate the transparency of the Rust project around the development of this feature, and the hard work and thoughtfulness being put into stabilizing it.

Rust Has A Limited View of Supply Chain Security

I think this position is summarized well by the installation method recommended by the rustup.rs installation page:

`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`

“Hi, run this shell script from a random server on your machine.”

To be fair, you can download the script and inspect it before you run it, which is much better than e.g. the Windows .MSI installers for vscode. However, this practice pervades the entire build ecosystem: a stub of code called `build.rs` is potentially compiled and executed whenever you pull in a new crate from crates.io. This, along with “loose” version pinning (you can specify a dependency version as simply “2”, which means you’ll grab whatever the latest published version with a major rev of 2 happens to be), makes me uneasy about the possibility of software supply chain attacks launched through the crates.io ecosystem.
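
To make the mechanism concrete, here’s a hedged sketch of a `build.rs` (a made-up example, not taken from any real crate). Cargo compiles and runs this program on your machine before it builds the crate that ships it:

```rust
// build.rs -- compiled and executed on the build host, with your user's
// privileges, before the crate's own library code is even built.
use std::env;
use std::fs;
use std::path::Path;

fn main() {
    // The legitimate use case: generate code into Cargo's OUT_DIR...
    let out_dir = env::var("OUT_DIR").unwrap();
    fs::write(
        Path::new(&out_dir).join("generated.rs"),
        "pub const GENERATED: u32 = 1;",
    )
    .unwrap();

    // ...but nothing structurally prevents this same script from reading
    // your files, setting environment variables, or spawning programs.
    println!("cargo:rerun-if-changed=build.rs");
}
```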

Crates.io is also subject to a kind of typo-squatting, where it’s hard to determine which crates are “good” or “bad”. Some crates named exactly what you want turn out to be old or abandoned early attempts at the functionality you were looking for, while the more popular, actively-maintained crates have had to take on less intuitive names, sometimes differing by just a character or two from others (to be fair, this problem is not unique to Rust’s package management system).

There’s also the fact that dependencies are chained – when you pull in one thing from crates.io, you also pull in all of that crate’s subordinate dependencies, along with all their build.rs scripts that will eventually get run on your machine. Thus, it is not sufficient to simply audit the crates explicitly specified within your Cargo.toml file — you must also audit all of the dependent crates for potential supply chain attacks as well.

Fortunately, Rust does allow you to pin a crate at a particular version using the `Cargo.lock` file, and you can fully specify a dependent crate down to the minor revision. We try to mitigate this in Xous by having a policy of publishing our Cargo.lock file and specifying all of our first-order dependent crates to the minor revision. We have also vendored in or forked certain crates that would otherwise grow our dependency tree without much benefit.

That being said, much of our debug and test framework relies on some rather fancy and complicated crates that pull in a huge number of dependencies, and much to my chagrin, even when I try to run a build just for our target hardware, the dependent crates for running simulations on the host computer are still pulled in, and their build.rs scripts are at least built, if not run.

In response to this, I wrote a small tool called `crate-scraper`, which downloads the source package for every crate specified in our Cargo.toml file and stores them locally, so we can have a snapshot of the code used to build a Xous release. It also runs a quick “analysis”: it searches for files called build.rs and collates them into a single file, so I can more quickly grep through them to look for obvious problems. Of course, manual review isn’t a practical way to detect cleverly disguised malware embedded within the build.rs files, but it at least gives me a sense of the scale of the attack surface we’re dealing with – and it is breathtaking: about 5,700 lines of code from various third parties that manipulate files, directories, and environment variables, and run other programs on my machine every time I do a build.
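
The collation step is simple enough to sketch here (a simplified reconstruction, not the actual `crate-scraper` source; it assumes the crate sources have already been downloaded under a local `crates/` directory):

```rust
// Walk a tree of vendored crate sources, find every build.rs, and
// concatenate them into one file for manual review.
use std::fs;
use std::io::Write;
use std::path::Path;

fn collect(dir: &Path, out: &mut fs::File) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect(&path, out)?; // recurse into subdirectories
        } else if path.file_name().map_or(false, |n| n == "build.rs") {
            writeln!(out, "// ===== {} =====", path.display())?;
            out.write_all(&fs::read(&path)?)?;
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let mut out = fs::File::create("build-rs-collated.txt")?;
    collect(Path::new("crates"), &mut out)
}
```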

I’m not sure if there is even a good solution to this problem, but, if you are super-paranoid and your goal is to be able to build trustable firmware, be wary of Rust’s expansive software supply chain attack surface!

You Can’t Reproduce Someone Else’s Rust Build

A final nit I have about Rust is that builds are not reproducible between different computers (they are at least reproducible between builds on the same machine if we disable the embedded timestamp that I put into Xous for $reasons).

I think this is primarily because Rust pulls in the full path to the source code as part of the panic and debug strings that are built into the binary. This has led to uncomfortable situations where we have had builds that worked on Windows but failed under Linux, because our path names are very different lengths on the two systems, and that would cause some memory objects to be shifted around in target memory. To be fair, those failures were all due to bugs we had in Xous, which have since been fixed. But it just doesn’t feel good to know that we’re eventually going to have users who report bugs we can’t reproduce, because they have a different path on their build system compared to ours. It’s also a problem for users who want to audit our releases by building their own version and comparing the hashes against ours.

There are some bugs open with the Rust maintainers to address reproducible builds, but given the number of issues they have to deal with in the language, I am not optimistic that this problem will be resolved anytime soon. Assuming the only driver of the unreproducibility is the inclusion of OS paths in the binary, one fix would be to re-configure our build system to run in some sort of chroot environment or virtual machine that fixes the paths in a way that almost anyone else could reproduce. I say “almost anyone else” because this fix would be OS-dependent, so we’d be able to get reproducible builds under, for example, Linux, but it would not help Windows users, where chroot environments are not a thing.

Where Rust Exceeded Expectations

Despite all the gripes laid out here, I think if I had to do it all over again, Rust would still be a very strong contender for the language I’d use for Xous. I’ve done major projects in C, Python, and Java, and all of them eventually suffer from “creeping technical debt” (there’s probably a proper software engineering term for this, I just don’t know it). The problem often starts with some data structure that I couldn’t quite get right on the first pass, because I didn’t yet know how the system would come together; so, in order to figure out how the system comes together, I’d cobble together some code using a half-baked data structure.

Thus begins the descent into chaos: once I get an idea of how things work, I go back and revise the data structure, but now something breaks elsewhere that was unsuspected and subtle. Maybe it’s an off-by-one problem, or the polarity of a sign seems reversed. Maybe it’s a slight race condition that’s hard to tease out. Nevermind, I can patch over this by changing a <= to a <, or fixing the sign, or adding a lock: I’m still fleshing out the system and getting an idea of the entire structure. Eventually, these little hacks tend to metastasize into a cancer that reaches into every dependent module because the whole reason things even worked was because of the “cheat”; when I go back to excise the hack, I eventually conclude it’s not worth the effort and so the next best option is to burn the whole thing down and rewrite it…but unfortunately, we’re already behind schedule and over budget so the re-write never happens, and the hack lives on.

Rust is a difficult language for authoring code because it makes these “cheats” hard – as long as you have the discipline of not using “unsafe” constructions to make cheats easy. However, really hard does not mean impossible – there were definitely some cheats that got swept under the rug during the construction of Xous.

This is where Rust really exceeded expectations for me. The language’s structure and tooling were very good at hunting down these cheats and refactoring the code base, thus curing the cancer without killing the patient, so to speak. This is the point at which Rust’s very strict typing and borrow checker converts from a productivity liability into a productivity asset.

I liken it to replacing a cable in a complicated bundle of cables that runs across a building. In Rust, it’s guaranteed that every strand of wire in a cable chase, no matter how complicated and awful the bundle becomes, is separable and clearly labeled on both ends. Thus, you can always “pull on one end” and see where the other ends are by changing the type of an element in a structure, or the return type of a method. In less strictly typed languages, you don’t get this property; the cables are allowed to merge and affect each other somewhere inside the cable chase, so you’re left “buzzing out” each cable with manual tests after making a change. Even then, you’re never quite sure if the thing you replaced is going to lead to the coffee maker switching off when someone turns on the bathroom lights.

Here’s a direct example of Rust’s refactoring abilities in action in the context of Xous. I had a problem in the way trust levels are handled inside our graphics subsystem, which I call the GAM (Graphical Abstraction Manager). Each Canvas in the system gets a `u8` assigned to it that is a trust level. When I started writing the GAM, I just knew that I wanted some notion of trustability of a Canvas, so I added the variable, but wasn’t quite sure exactly how it would be used. Months later, the system grew the notion of Contexts with Layouts, which are multi-Canvas constructions that define a particular type of interaction. Now, you can have multiple trust levels associated with a single Context, but I had forgotten about the trust variable I had previously put in the Canvas structure – and added another trust level number to the Context structure as well. You can see where this is going: everything kind of worked as long as I had simple test cases, but as we started to get modals popping up over applications and then menus on top of modals and so forth, crazy behavior started manifesting, because I had confused myself over where the trust values were being stored. Sometimes I was updating the value in the Context, sometimes I was updating the one in the Canvas. It would manifest itself sometimes as an off-by-one bug, other times as a concurrency error.

This was always a skeleton in the closet that bothered me while the GAM grew into a 5k-line monstrosity of code with many moving parts. Finally, I decided something had to be done about it, and I was really not looking forward to it. I was assuming that I messed up something terribly, and this investigation was going to conclude with a rewrite of the whole module.

Fortunately, Rust left me a tiny string to pull on. Clippy, the cheerfully named “linter” built into Rust, was throwing a warning that the trust level variable was not being used at a point where I thought it should be – I was storing it in the Context after it was created, but nobody ever referred to it after that. That’s strange – it should be necessary for every redraw of the Context! So, I started by removing the variable, and seeing what broke. This rapidly led me to recall that I was also storing the trust level inside the Canvases within the Context when they were being created, which is why I had this dangling reference. Once I had that clue, I was able to refactor the trust computations to refer only to that one source of ground truth. This also led me to discover other bugs that had been lurking because, in fact, I was never exercising some code paths that I thought I was using on a routine basis. After just a couple hours of poking around, I had a clear-headed view of how this was all working, and I had refactored the trust computation system with tidy APIs that were simpler and easier to understand, without having to toss the entire code base.
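
To make the shape of the bug – and the fix – concrete, here’s a heavily simplified sketch (my own reconstruction; the names, types, and the aggregation rule are all illustrative, not the real GAM code):

```rust
// Before: two sources of truth for trust, updated inconsistently.
struct Canvas {
    trust_level: u8, // added early on, when Canvas was the only structure
}

struct Context {
    canvases: Vec<Canvas>,
    trust_level: u8, // added months later -- the field Clippy flagged as unused
}

// After: trust lives only in the Canvases, and the Context derives it.
struct ContextFixed {
    canvases: Vec<Canvas>,
}

impl ContextFixed {
    // One source of ground truth. Taking the least-trusted Canvas is my
    // invented aggregation rule, purely for illustration.
    fn trust_level(&self) -> u8 {
        self.canvases.iter().map(|c| c.trust_level).min().unwrap_or(0)
    }
}
```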

This is just one of many positive experiences I’ve had with Rust in maintaining the Xous code base. It’s one of the first times I’ve walked into a big release with my head up and a positive attitude, because for the first time ever, I feel like maybe I have a chance of being able to deal with hard bugs in an honest fashion. I’m spending less time making excuses in my head to justify why things were done this way and why we can’t take that pull request, and more time thinking about all the ways things can get better, because I know Clippy has my back.

Caveat Coder

Anyways, that’s a lot of ranting about software for a hardware guy. Software people are quick to remind me that first and foremost, I make circuits and aluminum cases, not code, and that therefore I have no place ranting about software. They’re right – I actually have no “formal” training to write code “the right way”. When I was in college, I learned Maxwell’s equations, not algorithms. I could never be a professional programmer, because I couldn’t pass even the simplest coding interview. Don’t ask me to write a linked list: I already know that I don’t know how to do it correctly; you don’t need to prove that to me. This is because whenever I find myself writing a linked list (or any other foundational data structure, for that matter), I immediately stop myself and question all the life choices that brought me to that point: isn’t this what libraries are for? Do I really need to be re-inventing the wheel? If there is any correlation between doing well in a coding interview and actual coding ability, then you should definitely take my opinions with a grain of salt.

Still, after spending a couple years in the foxhole with Rust and reading countless glowing articles about the language, I felt like maybe a post that shared some critical perspectives about the language would be a refreshing change of pace.

Fixing a Tiny Corner of the Supply Chain

Tuesday, December 14th, 2021

No product gets built without at least one good supply chain war story – a statement especially true in these strange times. Before we get into the details of the story, I feel it’s worth understanding a bit more about the part that caused me so much trouble: what it does, and why it’s so special.

The Part that Could Not Be Found
There’s a bit of magic involved in generating the internal timing signals of virtually every piece of modern electronics. It’s called an oscillator, and typically it’s implemented with a precisely cut fleck of quartz crystal that “rings” when stimulated, vibrating millions of times per second. The accuracy of the crystal is measured in parts-per-million, such that over a month – about 2.5 million seconds – a run-of-the-mill crystal with 50ppm accuracy might drift by about two minutes. In mechanical terms, it’s like filling 1kg (2.2 pound) bags of rice so consistently that no two bags differ by more than a grain of rice; in CS terms, it’s about 15 bits of precision (it’s funny how one metric sounds hard, while the other sounds trivial).
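
To spell out the arithmetic behind those comparisons (assuming a 30-day month and roughly 25mg per grain of rice): 30 days × 86,400 seconds/day ≈ 2.6 million seconds, and 2.6×10⁶ s × 50×10⁻⁶ ≈ 130 seconds – about two minutes of drift. Likewise, 50ppm of a 1kg bag is 50mg, on the order of a grain or two of rice; and one part in 20,000 corresponds to log₂(20,000) ≈ 14.3, or about 15 bits.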

One of the many problems with quartz crystals is that they are big. Here’s a photo from Wikipedia of the inside of a typical oscillator:


CC BY-SA 4.0 by Binarysequence via Wikipedia

The disk on the left is the crystal itself. Because the frequency of the crystal is determined by its physical dimensions, there’s a “physics limit” on how small these things can be. Their large size also imposes a limit on how much power it takes to drive them (it’s a lot). As a result, they tend to be large and power-hungry – it’s not uncommon for a crystal oscillator to be specified to consume a couple milliamperes of current in normal operation (and yes, there are also chips with built-in oscillator circuits that can drive crystals, which reduces power; but they, too, have to burn the energy to charge and discharge some picofarads of capacitance millions of times per second due to the macroscopic nature of the crystal itself).

A company called SiTime has been quietly disrupting the crystal industry by building MEMS-based silicon resonators that can outperform quartz crystals in almost every way. The part I’m using is the SiT8021, and it’s tiny (1.5×0.8mm), surface-mountable (CSBGA), consumes about 100x less power than the quartz-based competition, and has a comparable frequency stability of 100ppm. Remarkably, despite being better in almost every way, it’s also cheaper – if you can get your hands on it. More on that later.

Whenever something like this comes along, I always like to ask “how come this didn’t happen sooner?”. You usually learn something interesting in exploring that question. In the case of pure-silicon oscillators, they have been around forever, but they are extremely sensitive to temperature and aging. Anyone who has designed analog circuits in silicon is familiar with the problem that basically every circuit element is a “temperature-to-X” converter, where X is the parameter you wish you could control. For example, a run-of-the-mill “ring oscillator” with no special compensation would have an initial frequency accuracy of about 50% – going back to our analogies, it’d be like getting a bag of rice that nominally holds 1kg, but is actually filled to somewhere between 0.5kg and 1.5kg – and you would see swings of an additional 30% depending upon the ambient temperature. A silicon MEMS oscillator is a bit better than that, but its frequency output would still vary wildly with temperature; orders of magnitude more than the parts-per-million specified for a quartz crystal.

So, how do they turn something so innately terrible into something better-than-quartz? First, I took a look at the devices under a microscope, and it was immediately obvious that the part is actually two chips that have been bonded face-to-face with each other.


Edge-on view of an SiT8021 already mounted on a circuit board.

I deduced that a MEMS oscillator chip is nestled between the balls that bond the chip to the PCB. I did a quick trawl through the patents filed by SiTime, and I’m guessing the MEMS chip contains at least two separate oscillators. These oscillators are intentionally different, so that their frequency drifts over temperature follow different, but predictable, curves. By comparing the instantaneous difference between the two frequencies, the chip can very precisely measure the absolute temperature of the pair of oscillators. In other words, they took the exact problem that plagues silicon designs and turned it into a feature: they built a very precise temperature sensor out of two silicon oscillators.
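
Here’s a hedged sketch of how that could work, reconstructed from first principles (this is my own toy linear model with invented constants – not anything taken from SiTime’s patents):

```rust
// Two resonators with different temperature coefficients: the frequency
// ratio r = f1/f2 depends only on temperature, so measuring r recovers
// the temperature delta from the calibration point.
struct Calibration {
    a1: f64,   // temperature coefficient of resonator 1 (1/degC)
    a2: f64,   // temperature coefficient of resonator 2 (1/degC)
    f1_0: f64, // calibrated frequency of resonator 1 at the reference temp
    f2_0: f64, // calibrated frequency of resonator 2 at the reference temp
}

impl Calibration {
    // Invert r = (1 + a1*dt)/(1 + a2*dt) for dt.
    fn delta_t(&self, f1: f64, f2: f64) -> f64 {
        let r = (f1 / f2) / (self.f1_0 / self.f2_0);
        (r - 1.0) / (self.a1 - r * self.a2)
    }
}

fn main() {
    let cal = Calibration { a1: -30e-6, a2: -7e-6, f1_0: 5.0e6, f2_0: 4.8e6 };
    // Pretend both resonators sit 10 degC above the calibration point:
    let f1 = cal.f1_0 * (1.0 + cal.a1 * 10.0);
    let f2 = cal.f2_0 * (1.0 + cal.a2 * 10.0);
    println!("recovered dT = {:.2} degC", cal.delta_t(f1, f2)); // prints ~10.00
}
```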

With the temperature of the oscillators known to exquisite precision, one can now compensate for the temperature effects. That’s what the larger of the two chips (the one directly attached to the solder balls) presumably does. It computes an inverse mapping of temperature vs. frequency, constantly adjusting a PLL driven by one of the two MEMS oscillators to derive a precise, temperature-stable net frequency. The controller chip presumably also contains a set of eFuses that are burned in the factory (or by the distributor) to calibrate and set the initial frequency of the device. I didn’t do an acid decap of the controller chip, but it would not be unreasonable for it to be fabricated in 28nm silicon; at this geometry you could fit an entire RISC-V CPU in there with substantial microcode, and effectively “wrap a computer” around the temperature drift problem.

Significantly, the small size of the MEMS resonator compared to a quartz crystal, along with its extremely intimate bonding to the control electronics, means a fundamentally lower limit on the amount of energy required to sustain resonance, which probably goes a long way towards explaining why this circuit is able to reduce active power by so much.

The tiny size of the controller chip means that a typical 300mm wafer will yield about 58,000 chips; going by the “rule of thumb” that a processed wafer is roughly $3k, that puts the price of a raw, untested controller chip at about $0.05. The MEMS device is presumably a bit more expensive, and the bonding process itself can’t be cheap, but at a “street price” of about $0.64 each in 10k quantities, I imagine SiTime is still making good margin. All that being said, a million of these oscillators would fit on about 18 wafers, and the standard “bulk” wafer cassette in a fab holds 25 wafers (and a single fab will pump out about 25,000 – 50,000 wafers a month); so, this is a device that’s clearly ready for mobile-phone scale production.

Despite the production capacity, the unique characteristics of the SiT8021 make it a strong candidate to be designed into mobile phones of all types, so I would likely be competing with companies like Apple and Samsung for my tiny slice of the supply chain.

The Supply Chain War Story
It’s clearly a great part for a low-power mobile device like Precursor, which is why I designed it into the device. Unfortunately, there’s also no real substitute for it. Nobody else makes a MEMS oscillator of comparable quality, and as outlined above, this device is smaller and orders of magnitude lower power than an equivalent quartz crystal. It’s so power-efficient that for many chips, it takes less power to use this off-chip oscillator than to run the chip’s built-in oscillator circuit with a passive crystal. For example, the STM32H7 HSE burns 450uA, whereas the SiT8021 runs at 160uA. To be fair, one also has to drive the pad input capacitance of the STM32, but even with that considered you’re probably around 250uA.
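
As a back-of-the-envelope check on that last number (the pad capacitance value here is my assumption, not a datasheet figure): toggling roughly 5pF through 1.8V at 12MHz draws about I = C × V × f = 5pF × 1.8V × 12MHz ≈ 110µA, and 160µA + 110µA ≈ 270µA – right in the neighborhood of that 250µA estimate.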

To put it in customer-facing terms, if I were forced to substitute commonly available quartz oscillators for this part, the instant-on standby time of a Precursor device would be cut from a bit over 50 hours down to about 40 hours (standby current would go from 11mA up to 13mA).
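
(Sanity check: standby time scales inversely with standby current, so 50 hours × 11mA ÷ 13mA ≈ 42 hours – consistent with the “about 40 hours” figure.)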

If this doesn’t make the part special enough, the fact that it’s an oscillator puts it in a special class with respect to electromagnetic compliance (EMC) regulations. These are the regulations that make sure that radios don’t interfere with each other, and like them or not, countries take them very seriously as trade barriers – by requiring expensive certifications, you’re able to eliminate the competition of small upstarts and cheap import equipment on “radio safety” grounds. Because the quality of radio signals depends directly upon the quality of the oscillator used to derive them, the regulations (quite reasonably) disallow substitutions of oscillators without re-certification. Thus, even if I wanted to take the hit on standby time and substitute the part, I’d have to go through the entire certification process again, at a cost of several thousand dollars and some weeks of additional delay.

Thus this part, along with the FPGA, is probably one of the two parts on the entire BOM that I really could not do without. Of course, I focused a lot on securing the FPGA, because of its high cost and known difficulty to source; but for want of a $0.68 crystal, a $565 product would not be shipped…

The supply chain saga started when I ordered a couple thousand of these in January 2021, back when the part had about a 30-week lead time, giving a delivery sometime in late August 2021. After waiting about 28 weeks, on August 12th, we got an email from our distributor informing us that they had to cancel our entire order for SiT8021s. That’s 28 weeks lost!

The nominal reason given was that the machine used to set the frequency of the chips was broken or otherwise unavailable, and due to supply chain problems it couldn’t be fixed anytime soon. Thus, we had to go to the factory to get the parts. But, in order to order direct from the factory, we had to order 18,000 pieces minimum – over 9x of what I needed. Recall that one wafer yields 58,000 chips, so this isn’t even half a wafer’s worth of oscillators. That being said, 18,000 chips would be about $12,000. This isn’t chump change for a project operating on a fixed budget. It’s expensive enough that I considered recertification of the product to use a different oscillator, if it weren’t for the degradation in standby time.

Panic ensues. We immediately trawl all the white-market distributor channels and buy out all the stock, regardless of the price. Instead of paying our quoted rate of $0.68, we’re paying as much as $1.05 each, but we’re still short about 300 oscillators.

I instruct the buyers to search the gray market channels, and they come back with offers at $5 or $6 for the $0.68 part, with no guarantee of fitness or function. In other words, I could pay 10x the value of the part and get a box of bricks, and the broker could just disappear into the night with my money.

No deal. I had to do better.

By this time, every distributor was repeating the “18k Minimum Order Quantity (MOQ) with long lead time” offer, and my buyers in China waved the white flag and asked me to intervene. After trawling the Internet for a couple hours, I discover that Element14 right here in Singapore (where I live) claims to be able to deliver small quantities of the oscillator before the end of the year. It seems too good to be true.

I ask my buyers in China to place an order, and they balk; the China office repeats that there is simply no stock. This has happened before: due to trade restrictions and regional differences, the inventory in one region may not be orderable in another. So, I agree to order the balance of the oscillators with a personal credit card and consign them directly to the factory. At this point, Element14 is claiming a delivery of 10-12 weeks on the Singapore website: that would just meet the deadline for the start of SMT production at the end of November.

I try to convince myself that I had found the solution to the problem, but something smelled rotten. A month later, I check back into the Element14 website to see the status of the order. The delivery had shifted back almost day-for-day to December. I start to suspect maybe they don’t even carry this part, and it’s just an automated listing to honeypot orders into their system. So, I get on the phone with an Element14 representative, and crazy enough, she can’t even find my order in her system, even though I can see the order in my own Element14 account. She tells me this is not uncommon, and if she can’t see it in her system, then the web order will never be filled. I’m like, is there any way you can cancel the order then? She’s like “no, because I can’t see the order, I can’t cancel it.” But because the representative can’t see the order, it effectively doesn’t exist, and it will never be filled. She recommends I place the order again.

I’m like…what the living fuck. Now I’m starting to sweat bullets; we’re within a few weeks of production start, and I’m considering ordering 18,000 oscillators and reselling the excess as singles via Crowd Supply in a Hail Mary to recover the costs. The frustrating part is, the cost of 300 parts is small – about $200 – but the lack of these parts blocks the shipment of roughly $170,000 worth of orders. So, I place a couple of bets across the board. I go to Newark (Element14, but for the USA) and place an order for 500 units (they also claimed to be able to deliver), and I re-place the order with Element14 Singapore, but this time I put a Raspberry Pi into the cart with the oscillators, as a “trial balloon” to test if the order was actually in their system. They were able to ship the part of the order with the Raspberry Pi to me almost immediately, so I knew they couldn’t claim to “lose the order” like before – but the SiT8021 parts went from having a definitive delivery date to a “contact us for more information” note – not very useful.

I also notice that by this time (mid-October), Digikey is listing most of the SiT8021 parts for immediate delivery, with the exception of the 12MHz, 1.8V version that I need. Now I’m really starting to sweat – one of the hypotheses pushed back at me by the buyer in China was that there was no demand for this part, so it’s been canceled, and that’s why I can’t find it anywhere. If the part’s been canceled, I’m really screwed.

I decide it’s time to reach out to SiTime directly. By hook or by crook, I get in touch with a sales rep, who confirms that the 12MHz, 1.8V version is a valid and orderable part number, and that I should be able to go to Digikey to purchase it. I inform the sales rep that the Digikey website doesn’t list this part number, to which they reply “that’s strange, we’ll look into it”.

Not content to leave it there, I reach out to Digikey directly. I get connected to one of their technical sales representatives via an on-line chat, and after a bit of conversing, I confirm that the parts are in fact shipped to Digikey as blanks, and Digikey has a machine that can program the parts on-site. The technical sales rep also confirms the machine can program that exact configuration of the part, but because the part is not listed on the website, I have to go through a “custom part” quotation.

Aha! Now we are getting somewhere. I reach out to their custom-orders department and request a quotation. A lady responds to me fairly quickly and says she’ll get back to me. About a week passes, no response. I ping the department again, no response.

“Uh-oh.”

I finally do the “unthinkable” in the web age – I pick up the phone, and dial in hoping to reach a real human being to plead my case. I dial the extension for the custom department sales rep. It drops straight to voice mail. I call back again, this time punching the number to draw a lottery ticket for a new sales rep.

Luckily, I hit the jackpot! I got connected with a wonderful lady by the name of Mel, who heard out my problem and immediately took ownership of solving it. I could hear her typing queries into her terminal, hemming and hawing over how strange it was that there was no order code, yet she could still pull up pricing. While I couldn’t look over her shoulder, I could piece together that the issue was a mis-configuration in their internal database. After about 5 minutes of tapping and poking, she informs me that she’ll send a message to their web department and correct the issue. Three days later, my part (along with 3 other missing SKUs) is orderable on Digikey, and a week later I have the 300 missing oscillators delivered directly to the factory – just in time for the start of SMT production.

I hand-wrote Mel a thank-you card and mailed it to Digikey. I hope she received it, because people like her are a rare breed: she has the experience to quickly figure out how the system breaks, the judgment to send the right instructions to the right groups on how to fix it, and the authority to actually make it happen. And she’s actually still working a customer-facing job, not promoted into a corner-office management position where she would never be exposed to a real-world problem like mine.

So, while Mel saved my production run, the comedy of errors still plays on at Element14 and Newark. The “unfindable order” is still lodged in my Element14 account, probably to stay there until the end of time. Newark’s “international” department sent me a note saying there’s been an export compliance issue with the part (since when did jellybean oscillators become subject to ITAR restrictions?!), so I responded to their department to give them more details – but got no response back. I’ve since tried to cancel the order, got no response, and now it just shows a status of “red exclamation mark” on hold and a ship date of Jan 2022. The other Singapore Element14 order that was combined with the Raspberry Pi still shows the ominous “please contact us for delivery” on the ship date, and despite trying to contact them, nobody has responded to inquiries. But hey, Digikey’s Mel has got my back, and production is up and (mostly) running on schedule.

This is just one of many supply chain war stories I’ve had with this production run, but it is perhaps the one with the most unexpected outcome. I feared that the issue was intense competition for the parts making them unavailable, but the ground truth turned out to be much more mundane: a misconfigured website. Fortunately, this small corner of the supply chain is now fixed, and anyone can buy the part that caused me so many sleepless nights.