Advice on Reliable SSD Chipset?

I spent the weekend transferring my data and applications to a new Crucial 256 GB M225 SSD, and after about 10 hours of operation, the drive simply failed. It failed while “hot” even — the hard drive light jammed on, my MP3s stopped playing, and the system froze up — like my worst nightmare come true. How is this possible? It’s supposed to be solid state. It’s supposed to have an MTBF greater than 1,000,000 hours. Yet, it’s true. No matter what laptop I put it into, I now simply get this on boot:

2100: HDD0 (Hard disk drive) initialization error (1).

The drive isn’t 100% dead per se. There is a little switch on the drive that puts it into configuration mode, and it will identify itself as a Yatapdong Barefoot device (instead of a Crucial device) in this mode. So presumably, the embedded controller is still alive and kicking, but even in this mode I can’t seem to reflash the drive’s firmware or re-initialize it to a state where the normal mode is functional. When I switch it back out of configuration mode, the BIOS refuses to enumerate the drive, and without enumeration I can’t even run a diagnostic on the device. It’s definitely not an OS issue — BIOS-level diagnostics simply refuse to recognize the drive.

Searching around in the Crucial Forum, it seems these drives “have a high failure rate”. So much for a million-hour MTBF. I’m not sure exactly what’s causing it, because the drive is entirely solid state, and the SMT process used to build these drives is a very robust and well-proven technology. My guess is it’s got something to do with the firmware or the controller chip; they already have three firmware releases out for this drive, and disturbingly you can reflash these from inside the native OS — which seems like a great candidate method for deeply embedding malware into a PC. Well, hopefully I can just return this drive for a full refund, since it failed so quickly.

At any rate, I’m wondering if anyone can give some advice on a good, reliable brand of SSD to use. The few hours I did spend with the SSD were quite positive; the performance boost is excellent, and a large number of my common work applications greatly benefited from the extremely fast access times. I’m still a bit spooked by the idea that these drives can fail so easily, but then again, if I were really in a jam I do have the tools to recover the data — these devices use simple TSOP flash memory, so I suppose in the worst case I could dump the ROMs. Fortunately this one failed young, so the quickest solution is to just return it for a refund and try something else.

Poking around a bit on-line, it seems that this Crucial drive uses the same Indilinx controller and firmware as the OCZ Vertex, the Patriot Torx, and the Corsair Extreme series (based on the “Yatapdong Barefoot” ID in configuration mode). On Amazon.com, I can see other users are experiencing exactly the same issue with a Corsair Extreme Indilinx series device. So I think it’s safe to say I’d like to steer clear of a solution based on the Indilinx chipset, at least until this issue is patched — ironically, the Indilinx website’s motto is “Beyond the Spin” (who are these guys, Fox News?), but it seems to me like their website’s got a lot more spin than substance. A link to a datasheet or firmware spec would be nicer than the marketing fluff.

Thanks in advance for the advice!

27 Responses to “Advice on Reliable SSD Chipset?”

  1. T says:

    I can recommend the Intel drives, or the Kingston-rebranded versions of them.
    I had no trouble with several OCZ Indilinx drives, though updating firmware on a Mac is cumbersome.

    Anandtech has endless details on the controllers
    http://anandtech.com/storage/showdoc.aspx?i=3631
    http://www.anandtech.com/storage/showdoc.aspx?i=3531&p=1

  2. Roger says:

    Is there any reason you didn’t go for an Intel drive? It does very well in the various benchmarks around the net and they have detailed how their controller works (yet more commentary around the net). It is more expensive though.

    • bunnie says:

      As far as I know, Intel’s biggest drive is 160 GB — I was also hoping to at least match my current capacity with the new drive (200 GB and really full, constantly deleting stuff — that’s the problem with carrying around a couple of (very useful) VMware images on your laptop). I did read great reviews on the Intel drive, however. I’m not as concerned about the price; it amortizes fairly well compared to other potential purchases since I use my laptop basically 24/7. I would definitely buy one of those if they had a bigger capacity.

      • niczar says:

        I think you should just forget about SSD for now then.

        • bunnie says:

          It does seem the technology may not yet be mature enough to rely upon. That’s highly unfortunate, I wonder how long it will take.

          Then again, I wouldn’t be tossing my original rotating hard disk, so if the SSD fails I *do* have a hot spare that’s pretty easy to update with one of my snapshot backups.

  3. Fred says:

    The Intel drives are the only ones that have no bullshit controller problems or packaging fuckups.

    If your laptop has an optical drive, ditch it and drop the SSD in its place, with a 2.5″ 500 GB HDD in the normal spot. Barring that, if you have an ExpressCard slot, get an eSATA card for it and use a 2.5″ HDD externally, powered from USB.

  4. Jacob says:

    Intel chipset ‘PC29AS21BA0’ is the one to get, which replaced ‘Intel PC29AS21AA0’. Intel calls ‘PC29AS21BA0’ drives G2/Generation 2. TRIM firmwares will soon become available again too, after their ‘second’ firmware corruption issue is resolved.
    I myself have an X25-E SLC drive which I’m happy with, but it’s rather pricey.

    Intel X18-M G2 160GB
    Intel X18-M G2 80GB
    Intel X25-M G2 160GB
    Intel X25-M G2 320GB
    Intel X25-M G2 80GB

    Kingston SSDNow M Series 2 160GB
    Kingston SSDNow M Series 2 80GB
    Kingston SSDNow V Series 40GB

  5. Nick says:

    “The Intel drives are the only ones that have no bullshit controller problems or packaging fuckups.”

    Only when our drive firmware team doesn’t put out an update that will brick a percentage of our user base.

    The MTBF for flash chips might be enormous, but there isn’t a single manufacturer out there that isn’t at some level taking the most slapdash approach possible to the controller firmware.

  6. John says:

    While Intel may be more reliable per se, they are not without missteps; they have had plenty of issues themselves.

  7. Bloom Berg says:

    I wonder what the difficulty is in making a working SSD; every manufacturer seems to have trouble in some way. I know there are certain difficulties in wear leveling, capacity balancing, etc., but someone should have figured all of this out by now.

    Maybe someone should make an open SSD implementation using FPGAs or something.

    BTW, I use a Samsung SSD and recommend it. I think they have a 256 GB model. In my opinion they are extremely fast, although they are not the fastest according to this chart: http://www.harddrivebenchmark.net/high_end_drives.html . I think they don’t have as many problems as other brands (please correct me if I’m wrong).

    • bunnie says:

      This does have me thinking that an Open SSD is not a bad idea. At the price point SSDs are at right now, plus the ability to configure your capacity with connectorized add-on boards, it might even be practical.

      Xilinx does offer a SATA host core and reference platform, and there are some for-purchase SATA device cores that even work in cheap parts like the Spartan-3, so at least that’s proof of existence that “it can be done”. I don’t see any home-grown open source SATA device cores though, not sure how hard that’d be to implement.

      Then again, full-custom ASIC-level hardware might be a bit of a drastic approach. Maybe there is a more accessible solution for the OS community: throw a fast ARM core behind a discrete, stand-alone SATA device controller, and then let people hack all of the device interface and block management algorithms in C or assembly. Burning a couple hundred microseconds of latency traversing some software glue stacks wouldn’t be too noticeable, since it seems you can win a lot of that back in eraseblock, page access, and cache algorithm management.
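
      To make that concrete, here’s a toy sketch of the kind of logical-to-physical remapping table people could hack on in C. The block count and the wear policy are invented purely for illustration; a real controller would pick only from a free-block pool, copy valid data before remapping, and persist the table across power loss:

      ```c
      /* Toy flash translation layer sketch -- purely illustrative.
       * Block count and wear policy are invented; a real FTL would
       * pick only from a free-block pool, copy valid data before
       * remapping, and persist this table across power loss. */
      #include <stdint.h>
      #include <stdio.h>

      #define NUM_BLOCKS 1024                 /* pretend eraseblocks */

      typedef struct {
          uint32_t map[NUM_BLOCKS];           /* logical -> physical block */
          uint32_t erase_count[NUM_BLOCKS];   /* per-physical-block wear   */
      } ftl_t;

      static void ftl_init(ftl_t *f) {
          for (uint32_t i = 0; i < NUM_BLOCKS; i++) {
              f->map[i] = i;                  /* start with an identity map */
              f->erase_count[i] = 0;
          }
      }

      /* On a rewrite, steer the logical block to the least-worn physical
       * block -- the essence of dynamic wear leveling. */
      static uint32_t ftl_remap_for_write(ftl_t *f, uint32_t logical) {
          uint32_t best = f->map[logical];
          for (uint32_t p = 0; p < NUM_BLOCKS; p++)
              if (f->erase_count[p] < f->erase_count[best])
                  best = p;
          f->map[logical] = best;
          f->erase_count[best]++;             /* each rewrite costs an erase */
          return best;
      }

      int main(void) {
          ftl_t f;
          ftl_init(&f);
          for (int i = 0; i < 5; i++)
              printf("logical 7 -> physical %u\n", ftl_remap_for_write(&f, 7));
          return 0;
      }
      ```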

      Hmm…if only I didn’t have to sleep.

      • Don says:

        If you find or decide to write some SATA device cores you’d be a hero for a lot of people :) I’ve been wanting to write a SATA device core for a while now but I just don’t have enough confidence in my abilities. Thankfully the SATA spec is available for about $25 so it’s at least affordable.

    • bunnie says:

      I’m seriously considering a 256 GB Samsung SSD. They do make the memory chips, after all. Not that it means much; Samsung is a massive corporation, and I’ve seen first-hand how even their CPU division has problems coordinating with their memory division. But it does seem to be the only solution that might meet my capacity requirement without relying on Indilinx. I also have to remember never to update the firmware on the drive unless I want to risk a brick.

  9. someone says:

    Take a look at this LWN article:

    http://lwn.net/Articles/353411/

    It advocates the position that SSDs should just expose the raw flash device to the operating system and let it deal with that instead of adding a block device layer.

    • bunnie says:

      That’s an interesting position. In my experience, the whole linux VFS paradigm has trouble integrating with the concept of explicit bad block management (it assumes the hardware is “perfect”), and this is complicated by the fact that ECC and bad block management methods vary across silicon vendors. At chumby, we used to employ raw NAND exposed to the linux kernel, but this has proven unportable across even different NAND silicon vendors, and risky (every now and then we lose a device to a corner case in bad block management that the filesystem implementation didn’t anticipate). As a result, in almost all our new platforms we have adopted managed NAND, in particular microSD cards. With managed NAND, the concept of bad blocks is abstracted away, and we can put filesystems such as ext3 directly onto the hardware, as opposed to relying on filesystems bound through the MTD layer. Interestingly, the cost of a managed NAND is nearly identical to the cost of an equivalent-capacity raw NAND flash, so this convenience comes at almost no price impact.

      Which leads me to wonder, would it be a bad idea to implement a large-scale SSD using an array of (semi)permanently mounted microSD cards? The bad block management would be local to each card, and wear management would operate hierarchically: the master controller would still attempt to remap blocks at a card level to optimize wear, but each card on its own would also manage its own free and bad block lists, and its own ECC as well. The parallelism of the approach could address some of the scalability issues facing larger SSDs, perhaps.

      The master controller could then either present the array of microSD cards as simply an abstract volume, and the OS could trust the controller to do the right thing … or the master controller could operate in “JBOD” mode and reveal the structure of each microSD card to the OS and the OS could then optimize file placement, trimming and striping for maximum performance. I guess this would be the moral equivalent of building a mini-RAID array of managed microSD drives, instead of presenting an abstracted, monolithic drive to the OS.
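
      In the JBOD-ish flavor, the master controller’s address math could be as dumb as RAID-0 chunk striping. A minimal sketch, with the card count and chunk size pulled out of thin air; real firmware would also have to handle card removal, rebuild, and mismatched card capacities:

      ```c
      /* RAID-0 style striping of a logical block address across an array
       * of microSD cards -- illustrative only; card count and chunk size
       * are assumptions, not a real design. */
      #include <stdint.h>
      #include <stdio.h>

      #define NUM_CARDS        32   /* e.g. 32 x 8 GB cards for ~256 GB      */
      #define BLOCKS_PER_CHUNK 128  /* 128 x 512-byte sectors = 64 KB chunks */

      typedef struct {
          uint32_t card;            /* which microSD card in the array */
          uint64_t card_block;      /* block offset within that card   */
      } stripe_loc_t;

      static stripe_loc_t stripe_map(uint64_t logical_block) {
          uint64_t chunk        = logical_block / BLOCKS_PER_CHUNK;
          uint64_t within_chunk = logical_block % BLOCKS_PER_CHUNK;
          stripe_loc_t loc;
          loc.card       = (uint32_t)(chunk % NUM_CARDS);   /* round-robin */
          loc.card_block = (chunk / NUM_CARDS) * BLOCKS_PER_CHUNK + within_chunk;
          return loc;
      }

      int main(void) {
          /* Walk a few chunk-aligned logical blocks to show the fan-out. */
          for (uint64_t lb = 0; lb < 5 * BLOCKS_PER_CHUNK; lb += BLOCKS_PER_CHUNK) {
              stripe_loc_t l = stripe_map(lb);
              printf("logical %llu -> card %u, block %llu\n",
                     (unsigned long long)lb, l.card,
                     (unsigned long long)l.card_block);
          }
          return 0;
      }
      ```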

      The theoretical peak bandwidth of a 4-bit microSD device is around 200 Mbps, whereas an 8-bit raw flash device is around 320 Mbps — and the microSD gets there with half the bus width. So presuming you can do striping and multi-channel access, you should be able to get similar if not better drive performance out of a microSD-based array, although I’m not sure how much worse the SD card protocol is versus the very simple/fast protocol for raw NAND flash.

      Interestingly, the top hit in my AdSense for microSD shows 8 GB of microSD going for $16 today, so a 256 GB array would cost only about $512 — a fair bit cheaper than the list price of most SSDs. And I suppose if you built the array right, so the cards are semi-replaceable, and you did something like parity striping (RAID-5 style) across the microSD cards, then even if one of the devices wears out you could just pull the card, pop in a new one, and let the array regenerate itself.
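
      The back-of-the-envelope math, treating all of the numbers above as rough street prices and datasheet-ish peak rates, and picking an arbitrary 8-way striping width just for illustration:

      ```c
      /* Quick sanity math for a hypothetical 256 GB microSD array.
       * All inputs are rough assumptions, not measured figures. */
      #include <stdio.h>

      int main(void) {
          int    card_gb         = 8;      /* per-card capacity              */
          double card_price      = 16.0;   /* USD, today's street price      */
          double card_mbps       = 200.0;  /* peak 4-bit SD bus, megabits/s  */
          int    target_gb       = 256;
          int    stripe_channels = 8;      /* cards accessed in parallel     */

          int    cards    = target_gb / card_gb;
          double cost     = cards * card_price;
          double agg_MBps = stripe_channels * card_mbps / 8.0; /* bits->bytes */

          printf("%d cards, ~$%.0f total\n", cards, cost);
          printf("~%.0f MB/s aggregate with %d-way striping (vs ~300 MB/s SATA-II)\n",
                 agg_MBps, stripe_channels);
          return 0;
      }
      ```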

      Hm.

    • Felix says:

      I tend to disagree.

      jffs2 is crap compared to the sophisticated wear levelling/log systems the vendors (like Intel) implement in their managed NAND solutions. I know that jffs2 is better than most of the remaining alternatives (which is why I’m using it, too), but it’s far from perfect. It has reliability problems where it shouldn’t, it has performance problems which get worked around (“summary blocks”), and it has some consistency problems, too. And to be honest, I wouldn’t know how to improve it. So I wouldn’t call jffs2 bad; it’s just a tough problem.
      I wouldn’t trust the nand management in most of the cheap cards either. But at least it’s not your fault when they fail.

      Depending on what algorithms are implemented, there is a huge overhead in the data read from NAND versus the actual payload. This overhead should be resolved as early as possible in the datapath. Advanced ECC algorithms, which you *really* need on MLC (so I was told by someone who designed the wear levelling firmware on a device which probably a large percentage of the readers here own and love), aren’t something you want to do in software on a PC.
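
      As a rough illustration of that overhead (a textbook BCH rule of thumb, not any particular controller’s scheme): the parity for a binary BCH code over GF(2^m) costs about m bits per correctable bit, and for a 4096-bit (512-byte) sector m works out to 13, so the spare-area cost grows linearly with the correction strength t:

      ```c
      /* Rough BCH parity-overhead estimate per 512-byte NAND sector.
       * Rule of thumb only: parity ~= m * t bits, with m = 13 for a
       * 4096-bit payload (since 2^13 - 1 > 4096 + parity for small t). */
      #include <stdio.h>

      int main(void) {
          const int m = 13;                   /* GF(2^13) symbol width       */
          for (int t = 1; t <= 16; t *= 2) {  /* correctable bits per sector */
              int parity_bits  = m * t;
              int parity_bytes = (parity_bits + 7) / 8;
              printf("t = %2d correctable bits -> ~%3d parity bits (~%2d bytes) per 512 B\n",
                     t, parity_bits, parity_bytes);
          }
          return 0;
      }
      ```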

      Intel designed a custom controller for their SSD, and it easily saturates a Gen-1 SATA link, and probably already a Gen-2 these days. That’s 300 MB/s in random access. On jffs2, I’m happy if my 10 MB boot-fs gets mounted in less than a few seconds. Yes, that’s totally different hardware. But I simply can’t believe you can run the required algorithms in realtime on a general purpose CPU, to say nothing of the actual implementation.

      So sure, let’s build an OpenSSD. But before we build hardware, simply simulate the whole thing. A NAND is easily modeled in software, including bit errors and failing pages. If you manage to write a translation layer that provides the required throughput and reliability, then porting that to an FPGA should be one of the easier things. And besides, you would have done the embedded linux community a big favor. Switching from raw NAND to managed NAND is something a lot of companies seem to be doing these days, probably after having switched from NOR to NAND a few years ago.
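
      A minimal sketch of what such a NAND model might look like, with an invented 2 KB-page geometry and a fixed bit-error probability (nothing vendor-specific); just enough to enforce the basic rules: program whole pages, erase whole blocks, no overwrite without an erase:

      ```c
      /* Minimal NAND flash model for simulating a translation layer.
       * Geometry and error rates are invented, purely for illustration. */
      #include <stdint.h>
      #include <stdlib.h>
      #include <string.h>

      #define PAGE_SIZE       2048
      #define PAGES_PER_BLOCK 64
      #define NUM_BLOCKS      256          /* ~32 MB model, kept small    */
      #define BIT_ERROR_P     1e-7         /* per-bit flip chance on read */

      typedef struct {
          uint8_t data[NUM_BLOCKS][PAGES_PER_BLOCK][PAGE_SIZE];
          uint8_t written[NUM_BLOCKS][PAGES_PER_BLOCK];
          uint8_t bad[NUM_BLOCKS];         /* factory or wear-out bad blocks */
      } nand_t;

      static int nand_erase(nand_t *n, int blk) {
          if (n->bad[blk]) return -1;
          memset(n->data[blk], 0xFF, sizeof n->data[blk]);    /* erased = all 1s */
          memset(n->written[blk], 0, sizeof n->written[blk]);
          if (rand() % 100000 == 0) n->bad[blk] = 1;          /* rare wear-out   */
          return 0;
      }

      static int nand_program(nand_t *n, int blk, int page, const uint8_t *src) {
          if (n->bad[blk] || n->written[blk][page]) return -1; /* no overwrite */
          memcpy(n->data[blk][page], src, PAGE_SIZE);
          n->written[blk][page] = 1;
          return 0;
      }

      static int nand_read(const nand_t *n, int blk, int page, uint8_t *dst) {
          memcpy(dst, n->data[blk][page], PAGE_SIZE);
          for (int bit = 0; bit < PAGE_SIZE * 8; bit++)        /* inject bit flips */
              if ((double)rand() / RAND_MAX < BIT_ERROR_P)
                  dst[bit / 8] ^= (uint8_t)(1u << (bit % 8));
          return 0;
      }

      int main(void) {
          nand_t *n = calloc(1, sizeof *n);   /* heap: the model is ~32 MB */
          uint8_t page[PAGE_SIZE], readback[PAGE_SIZE];
          memset(page, 0xA5, sizeof page);
          nand_erase(n, 0);
          nand_program(n, 0, 0, page);
          nand_read(n, 0, 0, readback);       /* feed this into an FTL + ECC sim */
          free(n);
          return 0;
      }
      ```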

      To answer the question: I’ve been using the first gen 160GB MLC Intel SSD for 3 months, and it was just amazing, in terms of speed. I can’t say much about reliability though, except that it survived.

      • bunnie says:

        Hi Felix! Thanks for stopping by. Great insight as usual. I agree with you; raw NAND is a really hard problem to solve — chumby has moved to managed NAND almost entirely for our new products. I also agree about simulating the problem first, although I don’t know how I can get the open source community fired up about writing code for simulations.

        I sprang for a Samsung 256 GB SSD. When I google for X25 failures, I do come up with plenty of posts reporting problems, so I think the fabled Intel SSD reliability could be partly good marketing on Intel’s side; in other words, I think everyone is having a tough time with SSD technology. Samsung is probably about as good a brand as Intel as far as these things go, since Samsung is the largest flash memory producer in the world. Also, the 160 GB capacity of the Intel drives is a deal-killer for me. I’ve been struggling with a 200 GB drive, so I couldn’t come up with a plan to scale back to 160 GB without lugging an external hard drive around with me everywhere; and also, it seems that SSDs perform much worse when they get very full, so the extra space will help on that front too.

        That being said, I also purchased a 64 GB USB thumb drive, so this solves the problem of backing up my data while I’m traveling for weeks on end in Asia, where the network pipe is pretty thin to my NAS and cloud file resources. I thought about using the 64 GB USB drive plus the 160 GB Intel SSD as a solution, but I’ve found that USB drives have awful performance for “live” data … they are only really useful for backup purposes.

        I was hoping to find a solution that was more reliable in terms of MTBF; I’m guessing SSDs may be a little too immature to call them “reliable”. But, I do have confidence that in fact they are more reliable in the face of environmental challenges, and finding a drive that works despite the rough treatment I put my equipment through traveling deep in Asia was my primary reason for migration, so I’m still willing to give the SSDs a try.

        Hopefully in three months I can also report back that I’m having a decent experience with this new drive as well.

        I just really wish someone would explain what this infant mortality problem is that we’re seeing with SSDs. It seems the majority of SSD failures I’ve read about happen within the first 2-3 weeks of operation. It seems to have nothing to do with the hardware itself. It smells like something really silly or embarrassing, like a memory leak in the bad block management table that causes operational firmware to be nuked after the drive gets a little full. Or, if the SSD controller stashes its operational firmware somewhere inside one of the NAND devices, it could be that a bad block remap or garbage collection algorithm is mangling the controller’s own firmware. Having spent weeks digging around inside the JFFS2 and YAFFS codebase finding obscure bugs in bad block management, I know how ugly these things can get…if I had the money to keep my broken SSD around, it would be fun to snarf the code out of it and try to figure out exactly what did go wrong…!

      • anonymouse says:

        I don’t think jffs2 is a good idea as a general purpose filesystem, especially not on huge SSDs, because it feels like it was designed for much smaller devices. But I think that whatever alternative people come up with will be better than whatever is implemented in the SSD controllers. From what I hear, SSD wear leveling is very rudimentary, and the slowdowns once every block has been written are pretty hard to avoid. The OS has access to much more information, and much more processing power, than the firmware running on some dinky controller chip, and I suspect it can manage disk space usage much better than firmware that only sees writes to individual blocks. Plus, having this as a linux driver lets people experiment with better allocation strategies much more readily, and maybe they’ll come up with better algorithms for real-world use cases. On the other hand, with an SSD you can just stick it in and pretend it’s a hard drive, which is a big advantage for actually selling the things to people.

  10. Hi,

    The only SSD in the low range worth buying is the Intel X25-M (we last conducted tests a few months ago); we have been using them in our datacenters for a year, and apart from the initial batch they have been wonderful. I also put one in my laptop and it is amazing what a difference it makes. (160 GB; I believe 320 GB is coming soon.)

    We have some machines where the X25-Ms sustain 1-10 MB/sec of continuous random writes, and they have been running for 10 months now.

    Artur

  11. someguy says:

    I think the whole SSD business is a scam, as there’s no reason for such high prices and scarcity – that is, unless you’re a “classical” hard disk manufacturer trying to save your old business for as long as you can. Check out some USB flash RAID projects on the net; it’s pathetic how cheap they can be, though not very reliable with all the connections and stacked protocols/drivers.

    Is it really so hard to implement all this in hardware, hidden behind a classical SATA interface? No, it’s not, and here’s proof:
    http://www.techpowerup.com/75490/Sharkoon_Introduces_Flexi-Drive_S2S_DIY_SSD.html

    Darn cheap-stuff manufacturers, how dare they break the rules of the game set by big shots!

  12. Steve Dawkins says:

    Just last week my Crucial 256GB SSD failed exactly as you describe – disk light locked on and laptop dead – and I’d only had it two weeks! This is my third SSD; the previous two were OCZ Vertex drives, and they failed after 10 months and 3 months of use. I also have a colleague whose Crucial has also just died.
    Can anyone show me how to verify that the Crucial and the OCZ Vertex do use the same firmware?
    I’m very interested in the view that Intel are the ones to go for – does anyone know of any organisation that has actually stress-tested their SSDs?

  13. Andrew Dorn says:

    We have put 80 GB SSDs in our servers; everything works noticeably faster.
