## Archive for the ‘open source’ Category

### Adding a ChaCha Cipher to Precursor’s TRNG

Tuesday, June 22nd, 2021

This is the second post of a two-part series on Betrusted/Precursor’s True Random Number Generator (TRNG).

A bulletproof source of random numbers is a key component of any cryptosystem, so we’ve done an extensive, months-long characterization of Precursor’s dual, redundant TRNG sources, which consists of an avalanche noise generator and a ring oscillator. We’ve found them to produce passable raw entropy, but you don’t have to take my word for it. You can download our raw data and run your on analysis on it if you like (at least until our ISP cuts us off for serving multiple 10GiB files filled with random data).

However, the system is still missing two features that are generally considered to be best practice:

1. Independent, on-line health monitors of the raw TRNG outputs, discussed in our previous post.
2. Conditioning of the raw data.

Because Precursor uses an FPGA for its SoC, we can add new features to the hardware “on the fly”, and in this post will focus on adding the hardware for the conditioning of raw data. The post is a bit of a slog to read through; I’ll try not to do too many in this style. But occasionally, I think it’s good to have a “reality check” that conveys the depth of difficulties encountered while implementing a feature, instead of making everything look easy or cramming all the details into a tweet. This post will hopefully also serve as a detailed reference for the handful of future developers who want to implement similar features, and would like to avoid some of the mistakes I’ve made.

Despite best efforts to make TRNGs unbiased and flawless, they are really hard to get right. Furthermore, they are only capable of producing high-quality entropy at a limited data rate. Thus, most practical systems take a TRNG output and run it through a cryptographic stream cipher to generate a final datastream; this conditioning of the raw data simultaneously protects against minor flaws in the TRNG while improving the availability of entropy.

So in Precursor, we’d like to add some conditioning on the collected entropy. To be clear, we will always make the raw entropy pool available for inspection — this is absolutely necessary for testing and CI coverage — but it’s a good idea to throw the output of your TRNG into a cryptographically sound stream cipher of some type, so that in the case that your TRNG suffers a sporadic drop-out, you don’t have a disastrous, instantaneous collapse of entropy. Depending on how you configure the conditioning, you can also increase the effective availability of random numbers.

After poking around a bit on the Internet, it seems popular to feed a seed of entropy into the ChaCha20 stream cipher (I refer to it as a “cipher” in this post, but I think more technically because of the way I’m using it, it should be referred to as a CSPRNG – Cryptographically Secure Pseudo Random Number Generator). This is done in the Linux kernel as well as by cryptech.is’s HSM and a few other implementations. The devil, of course, is always in the details. Cryptech.is takes the output of their TRNGs, hashes them with a SHA2 block, and then feeds it into a ChaCha20 stream cipher. Linux maintains an entropy pool that is collected from a variety of low-and-high-quality sources, blends them using some fast and slow techniques based upon the nature of the source, runs it through SHA1, and then into ChaCha20 to create the final stream.

In our implementation, we’re skipping the SHA pre-hash of the entropy pool, for the following reasons:

1. We don’t have the space (as in FPGA gates) or computational resources (as in CPU cycles for software emulation) to do this continuously
2. The raw entropy sources themselves are fairly high-quality (as opposed to in the case of Linux they are taking things like time stamps and network traffic patterns, so using SHA-1 to blind that is quite reasonable); and, we have two entropy sources of different architectures, each source consisting of multiple raw entropy generation elements. In other words, we’re starting from a pretty good place already.
3. The ChaCha20 cipher is mainly to protect against transient drop-out or unpredicted aging issues in the TRNG.
4. The ChaCha20 cipher itself is considered to be cryptographically secure; augmenting it with a SHA pre-hash does not seem to lend any clear benefit in our case.
5. We will prefer to reseed the ChaCha20 cipher at a fairly high rate, rather than seed it occasionally through a more expensive SHA operation.

Also in reading some of the docs out there, it seem that 20 is the minimum number of rounds considered to be cryptographically secure for ChaCha, but “more rounds is better”. Thus, while tweaking the Secworks open-source Chacha implementation to fit in Precursor, I think I’ll also create an option to let the ChaCha rounds function run to at least a minimum number of rounds but otherwise “free run” when the system is idle. This makes it useless as a cipher, but if the power impact is minimal, this should help the diffusion of entropy across all the bits of the cipher and make it more robust against attempts to back-track its state.

The diagram above gives an “artist’s impression” of the architecture of the system. It’s an early sketch, before I fully understood everything going on. The final details came out a little different, but the big ideas are there. Not shown is logic that ensures the ChaCha20 cipher has completed its minimum amount of rounds between each output sample, and automatic re-seeding logic that pulls in 32 bits of entropy from the TRNGs on a regular basis at some re-seeding interval that can be programmed by software.

After installing the tooling necessary to build a Precursor/Betrusted SoC and run simulations, I started writing the code.

Here’s the general method I use to develop code:

1. Think about what I’m trying to do. See above.
2. Write the smaller submodules.
3. Wrap the smaller modules into a simulation framework that shakes most of the skeletons out of the closet.
4. Repeat 1-3, working your way up the chain until you arrive at your full solution.
5. Write drivers for your new feature
6. Slot your feature into the actual hardware
7. Test in real hardware
8. Continuously integrate, if possible, either by re-running your sim against every repo change or better yet recompiling and re-running your test on actual hardware.

The key to this loop is the simulation. The better your simulation, the better your outcome. By “better simulation”, I mean, the less assumptions and approximations made in the test bench. For example, one could simulate a module by hooking it up to a hand-rolled set of Verilog vectors that exercises a couple read and write cycles and verifies nothing explodes; or, one could simulate a module by hooking it up to a fully simulated CPU, complete with power-on reset and multiple clock phases, and using a Rust-based framework to exercise the same reads and writes. The two test benches ostensibly achieve the same outcome, but the latter checks much more of the hairy corner cases.

For Betrusted/Precursor, we developed a comprehensive simulation framework that achieves the latter scenario as much as possible. We simulate a full, gate-level VexRISCV CPU, running a Rust-built BIOS, employing as many of the Xilinx-provided hardware models as we can for things like the PLL and global power-on reset. You can make a new testbench by running the “new_sim.py” script in the deps/gateware/sim directory, and it will automatically copy over all the basics so you can just instantiate your core module and start running simulation code.

Another thing I learned perhaps too late was to pick a good IDE. Python sucks to develop. It sucks even more without an IDE. For a long time I was writing in vanilla emacs. The game-changer for me was being able to control-click through modules and find their definitions; most IDEs support this.

### Writing The Block

I noted in Cryptech.is’s presentations that they were using a 24-round variant with a potential to go to 32; and even though cryptanalysis shows its solid with 20 rounds, perhaps there is still a potential that maybe some knowledge of the output could be used to infer something about the next set of outputs.

So, I still made a very light-fingered change, where the core round function could be advanced by one during idle periods based on a force_round input. In any case, the output of the ChaCha cipher is not used until at least 20 rounds in any case, but on average it will go many more rounds (thousands of rounds) before the output is used, and the number of rounds advanced depends upon the exact timing of the code that accesses the TRNG.

Thus, the overall flow of events for the conditioning block are as follows:

1. On boot, we pull in 384 bits of “key” state (256 bits key + 64 bits iv + 64 bits nonce) from the TRNG, and then initialize the cipher.
2. We also pull in 512 bits of “input” state. This is a fixed quantity that is XOR’d with the ChaCha block state. I worry that tying this to a fixed value could lead to leakage of the ChaCha state to the output, thus, on every boot we load in a new random input value.
3. Once the cipher is initialized, a counter keeps track of how many blocks we’ve generated. Once we have passed a threshold, the conditioner automatically requests another 32-bit word of entropy from the TRNG and does a rotate-and-XOR of the data into the “key” state. The key state is re-incorporated on the next “advance” call of the block.
4. Every “selfmix” interval, the cipher will also advance its rounds counter by one (this is the supplemental entropy/scrambling as noted previously)
5. The user can also supply seeding data at any time, which is folded in with a rotate-and-XOR function. Note for every seed provided, it’s best practice to read 16 words of TRNG output, to ensure that the seed state is incorporated into the TRNG as a whole. It doesn’t harm things to not do that, just, if you kept on applying the same seed over and over again it’ll eventually just XOR into and out of the key pool without doing anything useful.
6. The output of the cipher is latched into a 512-bit register, and urandom data outputs are multiplexed to the kernel and userspace through the CSR. A valid bit is provided, and it should be checked prior to every read of data from the Conditioned TRNG output.

### Building, and Failing

So your code is all writ, now time to try a quick “smoke test” to see if the build fails. And, sure enough, even though the build completes, I’m not meeting timing. We’re using a -1L variant of the Spartan 7 part, which is the lowest power model (important for battery life) but also the slowest performer in the group. So, even though Cryptech.is advertised hitting 100MHz performance on the block, I did note they were using a -3 variant (one of the fastest models) of a fairly high-end Artix FPGA.

So, back to the drawing board — it’s time to drop the clock frequency on the core to 50MHz. This is probably fine from a performance standpoint, as it could still yield 1.6Gbps burst rate of random numbers, but it does introduce some potentially thorny issues with clock domain crossings. In particular, I don’t get to be lazy and just use a bunch of Migen’s BusSynchronizer primitives to bridge 512-bit data between domains. That would be…expensive, in terms of logic resources. Instead, I’m going to have to be careful and use multi-cycle paths with explicit timing exceptions to make sure this works and without consuming undue amounts of resources.

What — you mean you thought getting code to compile was the end of the story? In hardware, every flip flop is a race condition against a clock — just think of it that way. Every time you make a register, you now have a potential concurrency problem. Now scale that up to the tens of thousands of registers in an FPGA, and you get an idea of why correct compilation is only a small portion of the whole story.

### Failing More, in Order to Get Better

After reducing the frequency of the core to 50MHz and fighting for a half day with the Vivado constraints syntax to convince it not to spend half an hour routing cross-clock-domain that are actually relaxed, I finally have the design building with timing closure. The next step is going back into the simulation bench and checking all the corner cases.

I modified the existing trng_managed simulation bench to add a series of tests that would exercise all the new features — including checking the rate of read, all the dual port options, and deliberately forcing some reads to fail by bypassing certain checks on validity of data to really push the engine to its limit. In the process of running these benches, I found more new corner cases, for example, the ChaCha engine would stall if the raw entropy FIFO ran out of seeding data; it would just wait for reboot of the avalanche+ring oscillator TRNGs, a process that can take milliseconds. This shouldn’t happen on the urandom engine — it is the “unlimited” random data engine after all, so some fixes were made to keep the block state evolving even if we ran out of seeding data. However, I have to emphasize this really is a corner case, you have to basically configure the system to constantly reseed and pull entropy out at a rate much higher than you can do anything useful with it to hit that limit. But, that’s the point of testing — to break your system before users do.

### Rinse, Lather, Repeat

Now that the simulation runs, we repeat the loop and see how badly timing is broken. It turns out with all the tweaks in place, I’ve introduced an accidental critical path. While the design can meet timing with this, it takes almost 3x the time to build, which means we’re probably going to start failing builds pretty soon. I should probably fix this problem now, while the design details are fresh in my head.

I open up the design in the Vivado tools and have it plot a timing histogram for me, below is an example of the output:

Above: Example of a slack histogram generated by Vivado. Green bars toward the left indicate paths with less timing slack, to the right have more. So for example, the left-most bin has 206 paths with 0.041ns-0.227ns of slack. The more “left-heavy” the histogram is, the harder the design is to place & route, and the more likely the process will not complete with perfect timing closure.

The left-most bin are the most over-constrained elements, and I can see that the trngmanaged_holding_buf registers are just barely passing.

I generate a schematic view like this inside the Vivado design tool, and backtracing the inputs to the holding_buf registers I can see the signal eventually goes through an enormous combinational path that includes the VexRiscV’s address decoders. This means that we have a critical path that extends from inside the VexRiscV load/store unit all the way to the 512-bit shifter of the ChaCha block.

If it were the case that we actually needed to pull data cycle-by-cycle out of the ChaCha block and somehow change the behavior of the system within a single cycle, then this path would make sense. But in practice, the actual rate is far lower than that — the tightest, most unrolled read loops take at least 2 cycles to execute; but more typically 10-12 cycles per iteration if you’re doing any sort of useful load or store with the data and running it from a rolled-up loop. Furthermore, the ChaCha block is basically a “read-only” block — there’s nothing the CPU is going to do to reconfigure it on the fly that would prevent us from
pipelining this.

So, I go back and add a “depth 1 FIFO” to the output, to break the shift register out of the critical path while still preserving the read-invalidate semantics required by the CSR interface (that is, the contract we have with the CSR interface is the cycle after a random number is read, either the next value should be a valid, new random number; or it can be an invalid number, but the “valid” bit must be cleared that very cycle). Using a simple pipeline register won’t work, because it introduces an extra cycle delay on the read-invalidate semantics and it’s possible, although very unlikely, that a code loop could have back-to-back loads and it will grab a stale “valid” state.

After running another simulation just to make sure we haven’t introduced any new bugs, I run a compilation again and hope this greatly relaxes the timing. It didn’t, but long story short, I had accidentally leaked one path in the FIFO through as a combinational element. After spending a couple hours tracing that issue out, I finally met with some success.

After applying these fixes, the compilation time is back down to its normal value, and we can see that the timing bins at the very far left of the histogram have a fraction of paths as before — 69, down from 206. It looks like the holding_buf register is still causing some troubles, but at least we’ve taken pressure off of the actually most difficult paths, which are inside the Curve 25519 Engine, and not some made-up timing emergencies inside the shift register at the output stage of the TRNG.

The price of all the timing fixes is increased resource utilization. Initial estimates showed about a 6% area cost for the ChaCha block without the timing fixes, but we ended up with around 10.5% area cost once all is said and done. The SoC uses around 68.2% of the FPGA now, which is still fine for routability and gives some space for a few more smaller blocks, but it’s probably at the upper limit of what we can reasonably expect to fit while having short (~8-9 minute) compilation cycles. I’m also making a mental note to myself that as part of the histogram analysis it looks like the next critical path to knock back is in the CSR address decoders. Adding the dozens of registers to read out the results from the TRNG health tests has probably forced an extra layer of LUTs to be instantiated inside the core arbitration logic; thus, pruning some of these registers if they end up being superfluous could be a way to win back some timing margin if I end up in a jam later.

Well, that took some time in simulation. But we’re not done yet! It’s time to run this code on real hardware, and then upgrade the kernel and services of Xous to use it — and let’s not forget CI integration as well.

First things first, let’s just take the design, as-is, with no mods to Xous and see if it boots, and what the power draw change is. The good news is that things boot, and the system draws about 5 mA extra, or about 5% more idle power than it did before. The ChaCha whitener is in the always-on domain, because it has to be able to run while the kernel is idling to produce random data. In particular, the core can “wake up” the avalanche generator and ring oscillators while the CPU is idle to summon more TRNG data even when the CPU is asleep; as such its clock can’t be gated off, but it is better than the alternative of running the CPU to manage these processes. A quick test tweaking the knobs on the duty cycle of the self-mixing for the ChaCha core shows it doesn’t have a measurable impact on the power, which means that probably most of the extra power is just the fixed cost of clocking that much more logic on the fabric.

The next step is to switch over the kernel and core services from using raw random numbers to the new ChaCha conditioned “urandom” source. This should have been easy in theory since everything worked in simulation, but of course — it didn’t work. I was greeted with a full system hang on the first go. This kind of bug is dreaded, because it’s some issue in the full system integration that exists outside of the testbenches, so you have to resort to long compile cycles and “printf” debugging.

Some further poking revealed that even worse yet, on some boots, the system worked, on others, it didn’t. However, the pattern was enough for me to suspect an issue in the reset logic. Going back into my code, I did find a subtle bug where I forgot to assign the reset line for the ChaCha core to the 50MHz clock domain (it was still tied to the reset for the 100MHz clock domain). For good measure, I also added a reset pulse extender, just to make sure it saw a long, solid reset. The polarity of the ChaCha core reset is opposite from what the rest of the system uses, so it’s also possible there is some glitches due to that. Anyways, after another four hours of tweaking and testing these fixes, re-running simulations, and cross-checking documentation about the reset process to really make sure we found a root cause, the system seems to come up as expected from the “urandom” source.

### For TRNGs, The Work is Never Done

The final phase of modifying any TRNG is continuous integration. This is extremely important because you can only get a hint if a TRNG is mis-behaving in subtle ways after collecting dozens of gigabytes of data from it over dozens of runs; and due to the limited rate of entropy, gathering such quantity of data can take weeks. Thus, we have a Precursor unit dedicated to doing nothing but rotating through its entropy sources and feeding its data into a set of test suites to try and uncover any subtle biases in the generators. You can browse the raw data and results of our CI bench here.

Now that we have this conditioned output, we need to integrate that into the test bench. It’s also going to be important to do a few “cold boot with recorded data” runs as well, so we can compare boot-to-boot to make sure that we have actually seeded the generator with random data and we’re not just accidentally running a ChaCha cipher (which should have excellent statistical randomness) with a fixed set of keys (which would have an identical stream boot-to-boot despite the excellent statistics).

In the process of re-integrating these tests into the CI bench, I discovered a number of problems with our original tool for shuttling random numbers to the host for analysis. For a number of reasons, the original CI tool relied on a fairly complicated Rust program with multiple threads, which was getting hard to maintain and the Rust wasn’t adding a lot of value; in fact I found a subtle integration bug that was causing all the tests for the past month to be run only on the ring oscillator!

Since we had a pretty decent Python script for doing USB updates over USB, I modified that into a script currently living in xous-core/tools/trng_test.py. This script is matched to a thread that runs inside the trng server on the Xous side when it is built with one of the TRNG tester options. On the Precursor side, a pair of 512kiB buffers are filled and the host is told which of the two to read from. The host-side Python script then reads the buffers and echos its contents out as binary data to stdout, which can then by piped into an analysis program (such as dieharder) or recorded to disk for later analysis. This whole mess is then wrapped in some other shell-script Frankenstein that’s fired off by a cron job which manages the rebuild, reflashing, and logging of data via a Raspberry Pi that can keep an eye on the serial console for errors during the CI process. It feels like a house made of playing cards, but it works!

### Closing Thoughts

If you made it this far, congratulations. Hopefully it conveys, with some accuracy, the range of difficulties one might encounter when implementing a “typical” feature like this for an FPGA-based SoC, and it may even be useful to the handful of developers who want to actually attempt something like this!

### What’s the Value of Hackable Hardware, Anyway?

Friday, December 11th, 2020

There is plenty of skepticism around the value of hackable products. Significantly, hackability is different from openness: cars are closed-source, yet support vibrant modding communities; gcc is one of the “real OG”s of open source, but few users find it easy to extend or enhance. Is it better to have a garden planted by the most knowledgeable botanists and maintained by experienced gardeners, or an open plot of land maintained by whoever has the interest and time?

Above left: Walled garden of Edzell Castle; above right: Thorncliffe Park community garden.

In the case of hardware products, consumer behavior consistently affirms a strong preference for well-curated gardens. Hardware is hard – not only is it difficult to design and validate, supply chains benefit from economies of scale and predictable user demand. The larger a captive audience, the more up-front money one can invest into developing a better hardware product. However, every decision to optimize comes with inherent trade-offs. For example, anytime symmetry is broken, one must optimize for either a right-handed or a left-handed version.

Above: touching the spot indicated by the red arrow would degrade antenna performance on an iPhone 4. This spot would naturally rest on the palm of a left-handed user. Image adapted from “iPhone 4” by marc.flores, licensed under CC BY 2.0.

Some may recall a decade ago when the iPhone 4 was launched, left-handed people noticed the phone would frequently drop calls. It turned out the iPhone 4 was designed with a critical antenna element that would fail when held naturally by a left-handed person. The late Steve Jobs responded to this problem by telling users to “just avoid holding it that way”. Even if he didn’t mean it, I couldn’t help but feel like he was saying the iPhone 4 was perfect and left-handers like me were just defective humans who should be sent to re-education camps on how to hold things.

Of course, as a hardware engineer, I can also sympathize with why Steve Jobs might have felt this way – clearly, a huge amount of effort and thought went into designing a technical masterpiece that was also of museum-quality construction. It’s frustrating to be told, after spending years and billions of dollars trying to build “the perfect product” that they somehow got it wrong because humans aren’t a homogeneous population. Rumors have it Apple spent tens of millions of dollars building micron-precision production jigs out of injection-molding grade tooling to ensure the iPhone4 was simply perfect in terms of production tolerances; duplicating all of those to make a mirror-image version for left-handers that make up 10% of the market size just made no business sense. It proved to be cheaper and easier, ultimately, to take full refunds or to give out rubber bumpers to the users who requested them.

I do think there is such a thing as “over-designing” a product. For example, contemporary “high concept” smartphone design is minimalist – phone companies have removed headphone jacks, hidden the front camera, and removed physical buttons. There is clearly no place for screws in this world; the love affair of smartphones and adhesives has proven to be … sticky. Adhesives, used in place of screws in modern smartphones, are so aggressive that removing them either requires additional equipment, such as a hot plate and solvents, or simply destroying the outer bezel by breaking the outer case off in pieces and replacing it with an entirely new bezel upon re-assembly. In other words, hacking a modern smartphone necessarily implies the destruction or damage of adhesive-bound parts.

With Precursor, I’m bringing screws back.

Precursor’s screws are unapologetic – I make no attempt to hide them or cover them with bits of tape or rubber inserts. Instead, I’ve sourced custom-made Torx T3 metric screws with a black oxide finish that compliments the overall color scheme of Precursor. Six of them line the front, as a direct invitation for users to remove them and see what’s inside. I’ve already received ample criticism for the decision to show screws as “primitive”, “ugly”, “out of touch with modern trends” — but in the end, I feel the visual clutter of these six screws is a small price to pay for the gain in hackability.

Of course, the anti-screw critics question the value of hackability. Surely, I am sacrificing mass-market appeal to enable a fringe market; if hackability was so important, wouldn’t Apple and Google already have incorporated it into their phones? Wouldn’t we see more good examples of hackability already?

This line of questioning is circular: you can’t get good examples of hacks until you have made hackable products wide-spread. However, the critics are correct, in a way: in order to bootstrap an ecosystem, we’re going to need some good examples of why hackability matters.

In the run-up to crowdfunding Precursor, I was contemplating a good demo hack for Precursor. Fortuitously, a fellow named Matt Campbell opened a GitHub issue requesting a text-to-speech option for blind users. This led me to ask what might be helpful in terms of a physical keyboard design to assist blind people. You can read the thread for yourself, but I’ll highlight that even the blind community itself is divided on whether or not there is such a thing as the “blind ghetto” — an epithet applied by some users who feel that blindness-specific products tend to lag behind modern smartphones, tablets, and laptops. However, given that most modern gadgets struggle to consider the needs of 10% of the population that’s left-handed, I’m readily sympathetic to the notion that gadgets make little to no concession to accommodate the even smaller number of blind users.

Matt was articulate in specifying his preferred design for a pocketable keyboard. He referred me to the “Braille ‘n Speak” (shown above) as an example of an existing braille keyboard. Basically, it takes the six dots that make up braille, and lines them up horizontally into three left and three right sets of buttons, adding a single button in the middle that functions as a space bar. Characters are entered by typing chords that correspond to the patterns of the six dots in the braille alphabet. Not being a braille user myself, I had to look up what the alphabet looked like. I made the guide below based on a snippet from Wikipedia to better understand how such a keyboard might be used.

Ironically, even though Matt had linked me to the picture of the Braille n’ Speak, it still took a while to sink in that a braille variant of Precursor did not need a display. I’m a bit ashamed to admit my first sketches involved trying to cram this set of switches into the existing keyboard area of the Precursor, without first removing the display entirely. I had to overcome my own prejudice about how the world should look and it took me time and empathy to understand this new perspective.

Once I had a better grasp of Matt’s request, I set about designing a customized braille variant. Precursor was designed for this style of hacking: the keyboard is a simple 2-layer PCB that’s cheap and easy to re-design, and the front bezel is also a PCB, which is a bit more expensive to redesign. Fortunately, I was able to amortize setup costs by bundling the braille front bezel variant with another variant that I had to fabricate anyways for the crowdfunding campaign. Beyond that, I also had to come up with some custom key caps to complement the switches.

The major challenge in designing any type of mobile-friendly keyboard is always a trade-off between the hand feel of the switches, versus thinness of the overall design. On one side of the spectrum, full-travel mechanical switches have a wonderful hand feel, but are thicker than a sausage. On the other side of the spectrum, dome switches and printed carbon ink patterns are thinner than a credit card, but can feel mushy and have a limited “sweet spot” — the region of a key switch with optimal tactile feedback and operational force curves. The generally well-regarded Thinkpad keyboards go with a middle-ground solution that’s a few millimeters thick, using a “scissor” mechanism to stabilize the key caps over a silicone dome switch, giving individual keys a bit of travel while ensuring that the “sweet spot” covers the entire key cap. Optimizing key switch mechanisms is hard: some may recall the controversy over Apple’s re-design of the MacBook’s keyboard to use a “butterfly” mechanism, which shaved a couple mm of thickness, but led to lawsuits over a defect where the keyboard allegedly stopped working when small bits of dust or other particles got trapped under it.

Given the short time frame and a shoestring budget, I decided to use an ultra-thin (0.35mm) tactile switch that I could buy off-the-shelf from Digikey and create custom key caps with small dimples to assist users in finding the relatively small sweet spots typical of such switches. I have sincere hopes this is a pretty good final solution; while it lacks a scissor mechanism to spread off-centered force, the simple mechanism meant I didn’t have to stick with a square key cap and could do something more comfortable and ergonomic to focus forces into the sweet spot. At the very least, the mechanism would be no worse than the current mechanism used in Precursor’s existing keyboard for sighted users, which is similarly a dome switch plus a hybrid-polymer key film.

Next, I had to figure out where to place the switches. To assist with this, I printed a 1:1 scale version of the Precursor case, dipped my fingertips in ink, and proceeded to tap on the printout in what felt like a natural fashion.

I then took the resulting ink spots and dimensioned their centers, to place the centroid of each key cap. I also asked my partner, who has smaller hands, to place her fingers over the spots and noted the differences in where her fingers lay to help shape the final key caps for different-sized hands.

Next, using the “master profile” discussed in the previous post on Precursor’s mechanical design, I translated this into a sketch to help create a set of key caps based on splines that matched the natural angle of fingers.

Above, you can see an early sketch of the key caps, showing the initial shape with dimples for centering the fingers.

Before moving ahead and spending a few hundred dollars to build a functional prototype, I decided to get Matt’s feedback on the design. We submitted the design to Shapeways and had a 3D print sent to Matt, which he graciously paid for. After receiving the plastic dummy, his feedback was that the center space bar should be a single key, instead of two separate keys, so I merged the two separate key caps of the space bar together into a single piece, while retaining two separate switches wired in parallel under the space bar. I felt this was a reasonable compromise that would allow for a “sweet spot” that serviced lefties as well as righties.

I then re-designed the keyboard PCB, which was a fairly simple task, because the braille keyboard consists of only eight switches. I just had to be careful to pick row/column pairs that would not conflict during chording and be sure to include the row/column pairs necessary to turn Precursor on after being put to sleep. I also redesigned the bezel; eliminating the display actually makes manufacturing a little bit easier because it also removes a beveling step in the manufacturing process. I kept the RF antenna in exactly the same location, as its design was already well-characterized and it takes a lot of effort to tune the antenna. Finally, I decided to manufacture the key switches out of aluminum. The switches have a lot of fine features and I needed a stiff material that could translate off-target force to the key switches to enlarge the sweet spot as much as possible.

Above: The prototype of Precursor with braille keyboard.

About three weeks later, all the parts for the braille keyboard had arrived. I decided to use purple anodization for the key switches which, combined with the organic key shapes, gives the final design a bit of a Wakanda-esque “Black Panther” aesthetic, especially when mounted in a brass case. The key switch feel is about in line with what I imagined, with the caveat that one of the switches feels a little less nice than the rest, but I think that’s due to a bad solder job on the switch itself. I haven’t had a chance to trace it down because…well, I’ve had to write a bunch of posts like this to fund Precursor. I have also been working with Xobs to refactor Xous in hopes of putting together enough code to send Matt a prototype he can evaluate without having to write gobs of embedded hardware driver code himself.

Above is a quick photo showing the alignment of fingers to keys. Naturally, it’s perfect for my hand because it was designed around it. I’m looking forward to hearing Matt’s opinion about the feel of the keys.

Above is a photo of the custom parts for the braille keyboard. At the top, you can see the custom bezel with key caps and the RF antenna matching circuitry on the top right. On the bottom, you can see the custom keyboard PCB mounted onto a Precursor motherboard. The keyboard PCB is mostly blank and, because of the small number of keys and the flexibility of the FPGA, there’s an option to mount more peripherals on the PCB.

Despite not being yet finalized, I hope this exercise is sufficient to demonstrate the potential value of hackable products. The original design scope for Precursor (née Betrusted) did not explicitly include a braille keyboard option, but thanks to modular design principles and the use of accessible construction materials, I was able to produce a prototype in about a month that has a similar fit and finish as the mainstream product.

As long as humans choose to embrace diversity, I think hackability will have value. A notional “perfect” product implies there’s such a thing as a “perfect” user. However, in reality, even the simple conundrum of left- or right-handedness challenges the existence of a singular “perfect” product for all of humanity. Fortunately, accommodating the wonderfully diverse, quirky, and interesting range of humanity implicates just a few simple engineering principles, such as embracing screws over adhesives, openness, and modularity. That we can’t hack our products isn’t a limitation of physics and engineering. Precursor demonstrates one can build something simultaneously secure and hackable, while being compact and pocketable. This suggests the relative lack of hackable products on the market isn’t a fundamental limitation. Maybe we just need a little more imagination, maybe we need to be a little more open-minded about aesthetics, and maybe companies need to be willing to take brave steps toward openness and inclusivity.

For Apple, true “courage to move on and do something new that betters all of us” was to remove the headphone jack, which resulted in locking users deeper into a walled-garden ecosystem. For hackers like myself, our “courage” is facing blunt criticisms for making “ugly” products with screws in order to facilitate mods, such as braille keyboards, in order to expand the definition of “all of us” beyond a set of privileged, “perfect” users.

I hope this braille keyboard is just the first example of many mods for Precursor that adapt the product for unique end-users, bucking the trend of gaslighting users to mold their behavior and preferences to fit the product. If you’ve got an itch to develop your own yet-to-be-seen feature in a mobile device, please visit our crowdfunding campaign page to learn more about Precursor. We’re close to being funded, but we’ve only a few days left in the campaign. After the campaign concludes on December 15th, the limited edition will no longer be available, and pricing of the standard model goes up. If you like what you see, please consider helping us to bring Precursor to life!

### What is a System-on-Chip (SoC), and Why Do We Care if They are Open Source?

Tuesday, November 10th, 2020

Note: This post originally appeared as an update in the crowdfunding campaign for Precursor.

Modern gadgets are typically built around a single, highly integrated chip, known as a “System on Chip” (SoC). While the earliest home computer motherboards consisted of around a hundred chips, Moore’s Law pushed that down to just a handful of chips by the time 80286 PC/AT clones were mainstream, and the industry has never looked back. Now, a typical SoC integrates a CPU core complex, plus dozens of peripherals, including analog, RF, and power functions; there are even “System in Package” solutions available that package the SoC, RAM, and sometimes even the FLASH die into a single plastic package.

Modern SoCs are exceedingly complex. The “full user’s manual” for a modern SoC is thousands of pages long, and the errata (“bug list”) – if you’re allowed to see it – can be hundreds of pages alone. I put “full user’s manual” in quotations because even the most open, well-documented SoCs (such as the i.MX series from NXP) require a strict NDA to access thousands of pages of documentation on third party Intellectual Property (IP) blocks for functions such as video decoding, graphics acceleration, and security. Beyond the NDA blocks, there is typically a deeper layer of completely unpublished documentation for disused silicon, such as peripherals that were designed-in but did not make the final cut, internal debugging facilities, and pre-boot facilities. Many of these disused features aren’t even well-known within the team that designed the chip!

Disused silicon is a thing because building chips is less like snapping together Legos, and more like a sculptor chiseling away at a marble block: adding a circuit is much harder than deactivating a circuit. Adding a circuit might cost around $1 million in new masks, while delaying the project by about 70 days (at a cost of 100,000 man-hours worth of additional wages); with proper planning, deactivating a circuit may be as simple as a code change, or a small edit to a single mask layer, at a cost of perhaps$10,000 and a few days (assuming wafers were held at intermediate stages to facilitate this style of edit).

Thus a typical SoC mask set starts with lots of extra features, spare logic, and debug facilities that are chiseled away (disused) until the final shape of the SoC emerges. As Michelangelo once said “every block of stone has a statue inside it, and it is the task of the sculptor to discover it,” we could say “every SoC mask set has a datasheet inside it, and it is the task of the validation team to discover it”. Sometimes the final chisel blow happens at boot: an errant feature may be turned off or patched over by pre-boot code that runs even before the CPU executes its first instruction. As a result, even the best documented SoCs will have a non-trivial fraction of transistors that are disused and unaccountable, theoretically invisible to end users.

From a security standpoint, the presence of such “dark matter” in SoCs is worrisome. Forget worrying about the boot ROM or CPU microcode – the BIST (Built in Self Test) infrastructure has everything you need to do code injection, if you can just cajole it into the right mode. Furthermore, SoC integrators all buy functional blocks such as DDR, PCI, and USB from a tiny set of IP vendors. This means the same disused logic motifs are baked into hundreds of millions of devices, even across competing brands and dissimilar product lines. Herein lies a hazard for an unpatchable, ecosystem-shattering security break!

Precursor sidesteps this hazard by implementing its SoC using an FPGA. FPGAs are user-reconfigurable, drastically changing the calculus on the cost of design errors; instead of chiseling away at a block of marble, we are once again building with a Lego set. Of course, this flexibility comes at a cost: an FPGA is perhaps 50x more expensive than a feature-equivalent SoC and about 5-10x slower in absolute MHz. However, it does mean there is no dark matter in Precursor, as every line of code used to describe the SoC is visible for inspection. It also means if logic bugs are found in the Precursor SoC, they can be patched with an update. This drastically reduces the cost to iterate the SoC, making it more economically compatible with an open source approach. In an ideal world, the Precursor SoC design will be thoroughly vetted and audited over the next couple of years, converging on a low-risk path toward a tape out in fixed silicon that can reduce production costs and improve performance all while maintaining a high standard of transparency.

LiteX: The Framework Behind Precursor’s SoC

Precursor’s SoC is built using LiteX. LiteX is a framework created by Florent Kermarrec for defining SoCs using the Migen/MiSoC FHDL, which itself is written in Python 3.6+. The heart of LiteX is a set of “handlers” that will automatically adapt between bus standards and widths. This allows designers to easily mix and match various controllers and peripherals with Wishbone, AXI, and CSR, bus interconnect standards. It is pretty magical to be able to glue an extra USB debug controller into a complex SoC with just a few lines of code, and have an entire infrastructure of bus arbiters and adapters figure themselves out automatically in response. If you want to learn more about LiteX and FPGAs, a great place to start is Florent’s “FPGA_101” mini-course.

A Brief Tour of Precursor’s SoC

Above is a block diagram of Precursor’s SoC, as of October 2020. It’s important to pay attention to the date on documentation, because an FPGA-based SoC can and will change over time. We generally eschew pretty, hand-drawn block diagrams like this because they are out of date almost the day they are finished. Instead, our equivalent of a “programmer’s manual” is dynamically generated by our CI system with every code push, and for Rust programmers we have a tool, svd2utra that automatically translates SVD files generated by LiteX into a Rust API crate. With an open source FPGA-based SoC, automated CI isn’t merely best practice, it’s essential, because small but sometimes important patches in submodule dependencies will regularly affect your design.

Core Complex

The “Core Complex” currently consists of one RISC-V core, implemented using Charles Papon’s VexRiscV. We configured it to support the “RV32IMAC” instruction subset, gave it an MMU, and beefed up the caches. The VexRiscV limits cache size to 4kiB, but effective capacity can be increased by upping the cache associativity. We get about a 10% performance boost by tuning the core to have a two-way I-cache, and a four-way D-cache. We also provision a 32 kiB boot ROM, which currently holds three instructions, but will someday be expanded to include signature checks on code loaded from external memory and a 128kiB on-board SRAM for tightly coupled/higher security operations. The CPU core is adapted to, and arbitrated into, a multi-controller Wishbone bus by LiteX and further adapted into a CSR bus by a dedicated CSR bridge that has been configured to automatically space peripherals on 4-kiB page boundaries, so that they can be individually remapped with the MMU. There’s also an IRQ handler that manages interrupts originating from peripherals sprinkled around the chip.

The Core Complex also includes a set of mostly boilerplate CSRs which perform the following functions:

• “Reboot” allows us to specify a new location for the reset vector
• “Ctrl” allows us to issue a soft reset
• “Timer 0” is the default timer provided by LiteX. It is a high resolution 32-bit timer clocked at the same frequency as the CPU core.
• “CRG” is an interface to control the FPGA’s clock generator. Right now we don’t do much with it, but eventually this is going to play a central role in power management and extending battery life.
• “Git Info” is a static register that provides information about the state of the git repo from which Precursor was built.
• “BtSeed” is a 64-bit number that can be randomized to force entropy into the place-and-route process, in case the end user desires a final FPGA netlist unique to their device without having to modify the code (otherwise the builds are entirely reproducible).
• “Litex ID” is a human-readable text string that identifies the SoC design.
• “TickTimer” is a low-resolution, 64-bit timer clocked in 1 ms increments. It serves as a source of time for the Xous OS.

Debug Block

Adjacent to the Core Complex is a Debug block. The Debug block features a full speed USB MAC/PHY that can tunnel Wishbone packets and serve as an alternate Wishbone controller to the CPU. We use this to drive the debug interface on the CPU, thus allowing GDB to connect to Precursor over USB even when the CPU is halted. In fact, one could build Precursor with no RISC-V CPU and just tunnel Wishbone packets over USB for debug and driver development. The debug block also includes a small CSR peripheral called the “Messible”, which is a 64-entry by 8-bit wide FIFO, useful as a mailbox/scratchpad during debugging.

Memory Mapped and CSR I/O

The memory space of the RISC-V CPU is mapped onto various peripherals and memory blocks via a Wishbone bus. For traditional SoC designers, Wishbone is kind of like AXI, but open source. Wishbone supports fancy features like multiple masters, pipelining, and block transfers. A portion of the Wishbone bus space is further mapped onto a bus called the Configuration and Status Register (CSR) bus.

While Wishbone is high-performance, it requires more interface logic and is happiest when the peripheral’s bit width matches the bus width. CSRs are area-efficient and gracefully accommodates registers of arbitrary bit-width from both a hardware and software API standpoint, but are lower performance. Thus CSRs are ideal for low-to-medium speed I/O tasks (such as the eponymous configuration and status registers), whereas Wishbone is ideal for memory-mapped I/O where improved bandwidth and latency are worth the area overhead.

From a design process, most peripherals start life mapped to CSR space, and are then upgraded to a memory-mapped implementation to meet performance demands. Thus, it’s no coincidence that most peripherals on Precursor are CSR-only devices. Here is a brief description of each CSR peripheral. As a reminder, you can always consult our reference manual for more details.

• “COM SPI” is the SPI bus that connects to the Embedded Controller (EC) SoC. It’s a 20MHz SPI peripheral that has a fixed transfer width of 16-bits. This block is targeted for an upgrade to a memory mapped I/O block.
• “I2C” is an I2C bus controller. Currently, only a real time clock (RTC) chip and an audio CODEC chip are are connected to this I2C bus.
• “BtEvents” is a catch-all block for handling various external real-time interrupt sources. Currently it handles interrupts from the EC and RTC chips.
• “KeyScan” is the keyboard controller. It’s designed to scan a 9×10 keyboard matrix for key hits, using a slow external 32kHz clock source. By decoupling the keyboard scanner from the system core clock, the system can go to a lower power state while waiting for keyboard presses, extending the number of days that Precursor can go between charges.
• “BtPower” is a set of GPIOs dedicated specifically to managing power. It can turn the audio and discrete TRNG on and off, override the EC’s power control commands, activate boost mode for the USB type C port (allowing operation as a DFP or “host”), and engage the self-destruct mechanism.
• “JTAG” is a set of GPIOs looped back to the FPGA’s JTAG pins. These are used in combination with our eFuse API drivers to self-provision AES bitstream encryption fuses on the 7 Series FPGA.
• “XADC” is the interface for the 7-Series XADC block, which is a 12-bit, multi-channel ADC. This is primarily used for the self monitoring of system voltages. In the final production revision, at least one channel of the ADC will also be available as a configuration option on the GPIO internal header so that users have an easier path to integrating analog sensors into Precursor.
• “UART” is a simple 115200, 8-N-1 serial interface which is connected to the debug header for console I/O.
• “BtGpio” is a straight-forward digital I/O block for driving the pins on the GPIO internal header. Note that due to the nature of the FPGA’s implementation, it’s not possible to switch between a digital GPIO function and an analog GPIO function without updating the bitstream.

In addition to the CSR I/Os, a few I/O devices are memory-mapped for high performance:

• “External SRAM” is a 32-bit wide, asynchronous interface that memory-maps 16 MiB of external SRAM. The SRAM is battery-backed so that it can retain state while the SoC is powered off. The intention is to optimize power by reducing sleep/wake overhead. However, this also means that the self-destruct procedure must first clear sensitive data from SRAM before activating the final blow that knocks out the SoC, as the self-destruct circuitry is also powered by the SRAM’s backup power supply. The External SRAM block also has a CSR interface to read out the configuration mode of the SRAM.
• “Audio” is an I2S interface to an external audio CODEC. In addition to a CSR block that configures the I2S interface, it also includes a pair of 256×16 entry memory-mapped sample FIFOs.
• “SPI OPI” is a high-speed SPI-like interface to external FLASH storage that memory-maps 12 8MiB of non-volatile storage. The “O” in OPI stands for octal – it’s an 8-bit bus that runs at 100MHz DDR speeds. It also includes a pre-fetcher that can hold several cache line’s worth of code, optimizing the case of straight-line code execution. High performance on this bus is essential, since the intention is for the CPU to run most code as XIP out of FLASH. It also features a CSR interface to control operations like block erase and page programming.
• “MemLCD” is the frame buffer for the LCD. The Sharp Memory LCD contains its own internal memory, which allows it to retain an image even when the host is powered off. The MemLCD frame buffer is thus a cache for the LCD itself. It manages which lines of the LCD are dirty and will flush only the dirty lines to the LCD upon requests made via the CSR. This improves the perceived update rate of the LCD, which is limited to 10 Hz if the entire screen is being updated, but improves inversely proportional to the fraction of the screen that is static.

Cryptography Complex

All the features described thus far consume about 20% of the FPGA’s logic; the majority of the logic in Precursor’s FPGA is dedicated to the Cryptography Complex.

Above is an amoeba plot that visualizes the relative size of various functions within the Precursor SoC design. Some blocks, such as the semi-redundant SHA-512 and SHA-2 accelerators, are currently included simply because we could fit both of them in the FPGA, and not because we strictly needed both of them. Fortunately, removing the SHA-2 block is as easy as commenting out four lines of code, saving about 2800 SLICE LUTs or about 9% of the device’s resources. LiteX and the svd2rust scripts take care of everything else!

Here’s a quick run-down of the blocks inside the Cryptography Complex:

• “Engine25519” is an arithmetic accelerator for operations in the prime field 2^255-19. It’s a microcoded, 256-bit arithmetic engine capable of computing a 256-bit multiply plus normalization in about one microsecond, about a 30x speedup over running the equivalent code on the RISC-V CPU. It consumes a huge amount of resources, but was deemed essential because the Betrusted secure communications application is built around the Double-Ratchet Algorithm, which relies heavily on this type of math. The CI documentation is probably the best starting point to understand more about the Engine25519 implementation. The block is big enough that later on it will get an entire post dedicated to explaining its function.
• “SHA-512” and “SHA-2” are hardware-accelerated SHA hash blocks. They are derived from Google’s OpenTitan SystemVerilog source code. The SHA-2 block is directly from OpenTitan and included mostly because it was easy to integrate. The SHA-512 block is our own adaptation of the SHA-2 block. This is the historical reason for why we have both in the current build of Precursor, even though most applications will only need one hash or the other to be hardware accelerated.
• “AES” is an AES accelerator also lifted directly from the Google OpenTitan project. It is capable of doing AES 128, 192, and 256, and supports encryption and decryption in ECB, CBC, and CTR modes.
• “KeyROM” is a 256×32 ROM implemented using fixed-location LUTs in the FPGA. Since the ROM’s location is fixed, we can use PrjXray to determine the location of the KeyROM bits in the FPGA’s bitstream. This allows us to edit the key ROMs directly into the FPGA bitstream, thus enabling a transfer of trust from the low-level eFuse AES key into the higher-level functions of the Precursor SoC. We will discuss more about some important, recently-discovered vulerabilities in the FPGA eFuse AES key in a post coming soon.
• “TRNG” is an on-chip, ring oscillator-based TRNG. It uses multiple small rings to collect entropy which are then merged into a single large ring for final measurement. The construction and validation of Precursor’s TRNGs will also get their own post at some point down the road.
• “ICAPE2” is an explicit tie-down for an unused internal debug port in the FPGA fabric. ICAPE2 is Xilinx’s way of allowing an FPGA to introspect and access internal configuration state. We explicitly tie it down so that no other functions can try to claim it. Also, since the ICAPE2 is at a well-known location in the bitstream, it is possible to write a tool that does post-compilation inspection of the bitstream to verify that the ICAPE2 block is in fact deactivated.

Parting Thoughts

That’s it for our whistle-stop tour of the Precursor SoC! We’ve sculpted in the parts that are essential to functionality and security and hope the development community will add more. By commenting out a few lines of code, you can clear out unnecessary blocks and make space for your own creations. Precursor’s code base is entirely open and available for inspection – no hidden test logic or microcode blobs and no NDA required to trace an unambiguous, cycle-accurate path from the release of reset to the execution of the first instruction. This lack of “dark matter” and total transparency of design adds yet another argument in the evidence-based case to trust Precursor’s hardware with your private matters.

If you enjoyed this post, please check out Precursor’s campaign page for more details and project updates!

### Guided Tour of the Precursor Motherboard

Friday, September 25th, 2020

We talk a lot about “verifiable hardware”, but it’s hard to verify something when you don’t know what you’re looking at. This post takes a stab at explaining the major features of the Precursor motherboard by first indicating the location of physical components, then by briefly discussing the rationale behind their curation.

Above is a photo of a pre-production version of Precursor, annotated with the location of key components. Like software, hardware has revisions too. So, when verifying a system, be sure to check the revision of the board first. The final production units will have a clear revision code printed on the back side of every board and we’ll tell you where to look for the code once the location is finalized. There will be a few changes to the board before production, which we’ll talk about later on.

But what do all the components do, and how are they connected? Above is a block diagram that tries to capture the relationship between all the components.

### Trusted and Untrusted Domains

First and foremost, you’ll notice that the design is split into two major domains: the “T-domain” and the “U-domain”. “T” stands for “Trusted”; “U” stands for “Untrusted”. A simplified diagram like this helps to analyze the security of the system, as it clearly illustrates what goes into and out of the T-domain; in other words, it defines the hardware attack surface of the trusted domain. Of course, not shown explicitly on the diagram are the side-channels, such as RF emissions and power fluctuations, which can be used to exfiltrate secret data. Very briefly, RF emissions are mitigated by enclosing the entire T-domain in a Faraday cage. Meanwhile, power fluctuations are mitigated partially through local filtering and partially through the use of constant-time algorithms to perform sensitive computations.

As the “Trusted” name implies, the T-domain is where the secrets go, while the U-domain acts as a first-level firewall to the untrusted Internet. The U-domain is explicitly designed for very low power consumption, so that it can be “always on” while still providing several days of standby time. We refer to the FPGA inside the U domain as the Embedded Controller (EC), and the FPGA inside the T-domain as the System on Chip (SoC) or sometimes simply as “the FPGA”.

### Power Management and the Embedded Controller (EC)

The intention is that the always-on EC listens for incoming wifi packets; only once a valid packet is received will the T-domain be powered on.

Using a low-power EC separate from the SoC allows power-hungry processing to be done in bursts, after which the T-domain powers itself off. Thanks to the “memory LCD” that we have chosen, the display can appear persistently even when the T-domain is powered down. Of course, leaving data on the screen while the T-domain is powered down is a potential security risk, but users can adjust the power policy to trade off between security and battery life based on their particular use case and threat scenario. We anticipate that the T-domain running full bore with no power management would exhaust an 1100 mAh battery in about 6-7 hours. Any time spent in an idle state will greatly extend the battery life; thus for a hypothetical messaging application where the CPU is only active during periods of typing and data transfer, one should be able to achieve a full day of use on a single charge.

### Mapping the T-Domain Attack Surface

Extending the boundary of trust to include human-facing I/O is a core tenet of the Precursor secure design philosophy. Thus, the T domain also includes the keyboard, LCD, and audio elements. This is because deferring the rendering of messages to an untrusted display means that any cryptography used to secure messages can be trivially defeated by a screen scraper. Delegating keystrokes to an untrusted touch controller likewise offers a quick work-around for capturing outgoing secrets through a keyboard logger. To mitigate/prevent this, Precursor incorporates an LCD that can be verified with an optical microscope and a physical keyboard that is trivial to verify with the naked eye. Precursor also forgoes an integrated microphone and instead favors a 3.5mm headphone jack, thus putting users solidly in control of when the device may or may not have the ability to record a conversation.

The green boxes in the block diagram above are connectors. These are items that plug into components that are not integrated into the mainboard. With this in mind, we can define the attack surface of the T-domain. We can see that we expose GPIO, USB, and JTAG to external connectors. We also have a bus to the U-domain that we call the COM bus, as well as a pair of quasi-static pins to communicate power state information and a set of pins to monitor the keyboard for user wake up events. Let’s explore each of these attack surfaces in a little more detail.

1. JTAG A user is required to glue shut the JTAG port when the system needs to be sealed and secrets made inaccessible. This is done by placing a metal shield can over the T-domain and dabbing a specially formulated epoxy into the holes. This simultaneously completes the Faraday cage which reduces side band emissions while making the JTAG port more difficult to access.
2. GPIOs and USB In its default configuration, the GPIOs are inert, and thus a difficult attack surface. We also advocate leaving the USB pins disconnected for secure applications; however, developers may opt to wire them up inside the FPGA, at the risk of opening up the expansive USB attack surface.
3. Raw Power Input The primary postulated attack surface resulting from the raw power input are glitches. Denial of service is of course also an issue, by removing power or by destroying the system by applying too high a voltage; but these are beyond the scope of this discussion. The primary countermeasure against raw power input glitches is a reset monitor that will extend any glitch into a several-millisecond long reset signal if the voltage drops below a prescribed level. Furthermore, local filtering, regulation and power storage removes very short glitches. All T-domain power signals are routed so they are fully contained within the T-domain shield can. No T-domain power signals are exposed as outer-layer traces or vias on either the top or back side of the PCB outside of the T-domain shield.
4. Power State Pins The power state pins allow the EC to coordinate with the FPGA SoC on the current power state. They are structured as “read only” from the SoC, and are also considered to be “advisory”. In other words, the SoC is capable of independently forcing its own power into the on-state; therefore the EC is only able to shut down power to the SoC when it is explicitly allowed by the T-domain. This minimizes the risk of the EC attempting to perform a glitch attack against the SoC by manipulating its access to power.
5. Keyboard Wakeup Pins In order for the EC to know when to power on the system, the EC also has access to a pair of row/column pins on the keyboard matrix. This enables the EC to respond to a two-key chord to wake the system from sleep; however, it also means the EC can potentially monitor a few keys on the keyboard, leading to a potential information leakage. This is mitigated by a set of hardware isolation switches which the SoC uses to deny EC access to the keyboard matrix once the system is powered on.
6. Audio is rendered by way of a CODEC chip. The DVT prototype shown in the photo above uses the LM49352, but a few months ago it was announced to be end-of-life by the vendor, TI. For production, we plan on employing the TLV320AIC3100, a functionally equivalent CODEC which will hopefully have a longer production lifetime. The CODEC chip integrates all the circuitry necessary to amplify the microphone, drive a pair of headphones, and also drive a small speaker for notifications. While it is possible to bury implants within the audio chip, it’s thought that any implant large enough to either record a useful amount of conversation or to do speech-to-text processing of the conversation would create an easily detectable size or power signature, or both. The headphone jack is wired for optimum compatibility with headsets from the Android ecosystem.
7. COM bus Finally, the COM bus is an SPI interface used by the T-domain to talk to the rest of the world. It is directly connected to the EC. The COM bus is structured so that the SoC is the sole controller of the SPI bus; the EC is not able to send data to the SoC unless the SoC allows it. Further packet-level and protocol-level countermeasures are required on the COM bus to harden its attack surface, but at the end of the day, this is the primary pathway for data to reach the T-domain from the outside world, and therefore it should be the primary focus of any software-oriented attack surface analysis.

It is important that COM bus packets be authenticated, encrypted, and serialized prior to hand-off to the EC; the EC can only put T domain data into the appropriate envelopes for routing on the Internet and no more. This allows us to safely delegate to the EC the job of mapping COM bus packets onto a given network interface.

### COM Connects to the Internet

Secure software running on the T-domain should be as oblivious as practical as to what type of Internet connection is implemented by the EC. Thus whether the EC routes COM packets to wifi, LTE, bluetooth, or Ethernet should have no bearing on the security of the T-domain.

For Precursor, we have chosen to add a Silicon Labs WF200 wifi chip to the EC as a primary means of Internet connectivity. The Silicon Labs WF200 contains a substantial amount of un-trustable code and circuitry; however, because the WF200 is in the Untrusted domain, we have no need to trust it, just as we have no need to trust the cable modem or the core network routers on the Internet.

Thus we can safely leverage the substantial co-processing within the WF200 to handle the complications of associating with WAPs, as well as other MAC/PHY-level nuances of wireless Ethernet. This allows us to substantially reduce the power requirements for the system during “screen off” time when it is mainly waiting to receive incoming messages. Furthermore, the WF200 has a well-characterized low power mode which agrees well with bench measurements. This is different from the ESP32, which as of a year ago when the evaluation was done, advertises low power but suffers from power-state transition nuances that prevent a practical system from achieving overall low power consumption.

The EC takes care of uploading firmware to the WF200, as well as servicing its interrupts and transcribing received packets to the T-domain. In addition to these responsibilities, the EC can detect if the system has been physically moved during standby by polling an IMU, and it also manages the battery charger and gas gauge. It also provides a ~1Hz square wave to the LCD that is required by the LCD during standby to continue displaying messages properly.

### Random Number Generators

The T-domain includes a discrete TRNG. This is meant to complement a TRNG integrated into the SoC itself. The benefit of a discrete TRNG is that it can be verified using common lab equipment, such as an oscilloscope; the drawback of a discrete TRNG is that an attacker with physical possession of the device could manipulate its output by drilling through the RF shield and dropping a needle onto millimeter-scale component pads.

The integrated TRNG inside the SoC is less vulnerable to attack by a physically present attacker, but at the expense of being difficult to manually verify. Thus, we provision both discrete and integrated TRNGs, and recommend that developers combine their outputs prior to use in secure applications.

### Keeping Time

A sense of time is important in many cryptographic protocols, thus a Real Time Clock (RTC) is a security-critical element. We chose an RTC that integrates both the crystal and the clock chip into a single hermetically sealed package to reduce the attack surface available to a physically present attacker to manipulate time. The chosen RTC also incorporates basic clock integrity checking, which helps to mitigate simple glitch attacks against the RTC.

### RAM: Why 16MiB?

We provide 16MiB of battery-backed SRAM for secure computations. We made it battery-backed so as to reduce the standby/resume overhead of the system, at the expense of creating a potential attack surface for physically present attackers to recover data from the system.

The choice of 16MiB of SRAM was deliberate and motivated by several factors:

1. Power A larger DRAM would have required using the DRAM PHY on the SoC. This interface is extremely power hungry and would have more than doubled the amount of power consumed when the system is on. Furthermore, keeping the DRAM in self-refresh mode would disallow powering down the FPGA entirely, meaning that the substantial standby leakage power of the SoC would count against the “screen-off” time.
2. Code complexity Precursor is a spin-off from the Betrusted project. One of Betrusted’s goals is to build a codebase that could be audited by an individual or small group within a reasonable amount of time. Choosing a small amount of RAM is the equivalent of burning the boats before a battle to force an advancing army into a win-or-die situation; it confines every choice made in the OS and application layers to prefer simpler, less complex implementations at the expense of more development time and fewer features.
3. Roadmap Eventually, we would like to fit the entire T-domain of Precursor into the footprint of a single chip. Incorporating hundreds of megabytes of RAM on-chip is impractical, even in aggressive process nodes. In a more realistic 28 or 40nm node, we estimate 4-16MiB is a potentially practical amount of RAM to incorporate in a low-cost, low-power, mass-market implementation. Provisioning Precursor with a similar amount of RAM helps to ensure code developed for it will have a migration path to more highly integrated solutions down the road.

### Self-Destruct Mode

Finally, we have provisioned a “self-destruct” feature for users that opt to use battery-backed AES keys to protect their FPGA image. The “self-destruct” mechanism consists of a latch built using discrete transistors. During normal power-on, the system latches into a “normal” mode of operation. However, when the SoC asserts the “KEY_KILL” pin, the latch switches into the “kill” mode of operation. Once in the “kill” mode, power is cut to the T-domain – including the power that backs up the AES key. There is also a set active pull-downs which rapidly discharge the relevant voltage rails to ensure the power lines drop to a level suitable for data erasure in a matter of milliseconds. Although the data erasure only takes a fraction of a second, the only way to get out of “kill” mode is to remove the battery or to wait for the battery to fully discharge.

That wraps up our whirlwind tour of the Precursor motherboard. This post introduced all of the major design features of the Precursor motherboard and briefly summarized the rationale for each choice. The system architecture minimizes the attack surface of trusted components. Furthermore, component choice was guided by the principles of simplicity and transparency while trying to provide a complete but auditable solution for security-sensitive applications. Finally, the mainboard was designed with components only on one side, and all security-critical components are contained within a well defined area, with the hope that this makes it easier to visually inspect and verify units upon receipt by end users.

Liked this post? Sign up to the Precursor funding campaign mailing list to be notified when new posts go live!

### Introducing Precursor

Saturday, September 19th, 2020

Precursor (pre·​cur·​sor | \ pri-ˈkər-sər):
1. one that precedes or gives rise to; a predecessor; harbinger
2. a pocketable open development board

Precursor is a mobile, open source electronics platform. Similar to how a Raspberry Pi or an Arduino can be transformed into an IoT gadget with the addition of a couple breakout boards, some solder, and a bit of code, Precursor is a framework upon which you can assemble a wide variety of DIY mobile applications.

Precursor is unique in the open source electronics space in that it’s designed from the ground-up to be carried around in your pocket. It’s not just a naked circuit board with connectors hanging off at random locations: it comes fully integrated—with a rechargeable battery, a display, and a keyboard—in a sleek, 7.2 mm (quarter-inch) aluminum case.

### Precursor → Betrusted

Followers of my blog will recognize the case design from Betrusted, a secure-communication device. It’s certainly no accident that Precursor looks like Betrusted, as the latter is built upon the former. Betrusted is a great example of the kind of thing that you (and we) might want to make using Precursor. Betrusted is a huge software project, however, and it will require several years to get right.

Precursor, on the other hand, is ready today. And it has all of the features you might need to validate and test a software stack like the one that will drive Betrusted. We are also using the FPGA in Precursor to validate our SoC design, which will eventually give us the confidence we need to tape out a full-custom Betrusted ASIC, thereby lowering production costs while raising the bar on hardware security.

In the meantime, Precursor gives us a prototyping platform that we can use to work through user-experience challenges, and it gives you a way to implement projects that demand a secure, portable, trustable communications platform but that might not require the same level of hardware tamper resistance that a full-custom ASIC solution could provide.

And for developers, the best part is that Betrusted is 100% open source. As we make progress on the Betrusted software stack, we will roll those improvements back into Precursor, so you can count on a constant stream of updates and patches to the platform.

### Hackable. In a Good Way.

Precursor is also unique in that you can hack many aspects of the hardware without a soldering iron. Instead of a traditional ARM or AVR “System on Chip” (SoC), Precursor is powered by the software-defined hardware of a Field Programmable Gate Array (FPGA). FPGAs are a sea of basic logic units that users can wire up using a “bitstream”. Precursor comes pre-loaded with a bitstream that makes the FPGA behave like a RISC-V CPU, but you’re free to load up (or code up) any CPU you like, be it a 6502, an lm32, an AVR, an ARM, or something else. It’s entirely up to you.

This flexibility comes with its own set of trade-offs, of course. CPU speeds are limited to around 100 MHz, and complexity is limited to single-issue, in-order microarchitectures. It’s faster than any Palm Pilot or Nintendo DS, but it’s not looking to replace your smartphone.

### At Its Core

We describe bitstreams using a Python-based Fragmented Hardware Description Language (FHDL) called Migen, which powers the LiteX framework. (Migen is to LiteX as GNU is to Linux, hence we refer to the combination as Migen/LiteX.) The framework is flexible enough that we can incorporate Google’s OpenTitan SHA and AES crypto-cores (written in SystemVerilog), yet powerful enough that we can natively describe a bespoke Curve25519 crypto engine.

If you’ve ever wanted to customize your CPU’s instruction set, experiment with hardware accelerators, or make cycle-accurate simulations of retro-hardware, Precursor has you covered. And the best part is, thanks to Precursor’s highly integrated design philosophy, you can take all that hard work out of the lab and on the road.

### On the Inside

And if you’re itching for a excuse to break out your soldering iron or your 3D printer, Precursor is here to give you one. While its compact form factor might seem limiting at first, we’ve observed that 80% of projects involve adding just one or two domain-specific sensors or hardware modules to a base platform. And most of those additions come on breakout boards that require only a handful of signal wires.

With eight GPIOs (configurable as three differential pairs and two single-ended lines) connected directly to the FPGA, Precursor’s battery compartment is designed to accommodate breakout boards. It also provides multiple power rails. You will find any number of third-party breakout boards with sensors ranging from barometers to cameras and radios ranging from BLE to LTE. Patch them in with a soldering iron, and you’re all set. The main trade-off is that, the more hardware you add, the less space you have left for your battery. Unless of course you build a bigger enclosure…

### On the Outside

If you need even more space or custom mounting hardware, the case is designed for easy fabrication using an aluminum CNC machine or a resin printer. Naturally, our case designs are open source, and the native Solidworks CAD files we provide are constructed such that the enclosure’s length and thickness are parameterized.

Furthermore, Precursor’s bezel is a plain old FR-4 PCB, so if your application does not require a large display and a keyboard, you can simply remove them and replace the bezel with a full-sized circuit board. By way of example, removing the LCD and replacing it with a smaller OLED module would make room for a much larger battery while freeing up space for the custom hardware you might need to build, say, a portable, trustable, VPN-protected LTE hotspot.

### Come Have a Look!

If you’ve ever wanted to hack on mobile hardware, Precursor was made for you. By combining an FPGA dev board, a battery, a case, a display, and a keyboard into a single thin, pocket-ready package, it makes it easier than ever to go from a concept to a road-ready piece of hardware.

Precursor will soon be crowdfunding on Crowd Supply. Learn more about its specifications on our pre-launch page, and sign up for our mailing list so that you can take advantage of early-bird pricing when the campaign goes live.

### Meta

We’ve decided to do an extended pre-launch phase for the Precursor campaign to gauge interest. After all, we are in the middle of an unprecedented global pandemic, and one of the worst economic downturns in recorded history. It might seem a little crazy to try and fund the project now, but it’s also crazy to try and build trustable hardware that can hold up against state-level adversaries. We make decisions not on what is practical, but what is right.

The fact is the hardware is at a stage where we are comfortable producing more units and getting it into the hands of developers. The open question is if developers have the time, interest, and money to participate in our campaign. Initial outreach indicates there might be, but we’ll only find out for sure in the coming months. Precursor is not cheap to produce; I am prepared to accept a failed funding campaign as a possible outcome.

We’re also carefully considering alternate sources of funding, such as grants from organizations that share our values (such as NLNet) and commercial sponsors that will not attach conditions that compromise our integrity (you’ll notice the Silicon Labs banner at Crowd Supply). This will hopefully make the hardware more accessible, especially to qualified developers in need, but please keep in mind we are not a big corporation. As individual humans like you, we need to put food on the table and keep a roof over our heads. Our current plan is to offer a limited number of early bird units at a low price — so if you’re like me and worried about making ends meet next year, subscribe to our mailing list so you can hopefully take advantage of the early bird pricing. And if you’re lucky enough to be in a stable situation, please consider backing the campaign at a higher pricing tier.

Over the coming months, I’ll be mirroring some of the more relevant posts from the campaign onto my blog, sometimes with additional commentary like this. There’s over two years of effort that have gone into building Precursor, and I look forward to sharing with you the insights and knowledge gained on my journey.