LogicSmith

Birth of a Hard Drive: An Interview with John Treder

hddes4-1.gif

Cross-section of a Quantum Viking 1 hard drive cut in half to reveal the inner details

hddes4-2.gif

A simple overview of hard drive internals.

AD = Albert Dayes
JT = John Treder

AD: How big is the design team for a hard drive? What type of members comprise a design team? (e.g. Electrical Engineers, Mechanical Engineers, etc.)

JT: The size varies, depending on the product's specifications, whether it's a new design or an update to an existing design, and the urgency. The team for Quantum's Viking 2 could be thought of as more or less typical. There were half a dozen mechanical engineers, two technicians, seven electrical engineers, one and a half circuit board layout designers, a heads and media engineer, eight people who wrote control and interface code, three product support engineers, two product test engineers, and half a dozen managers. Quantum is unusual in assigning product support engineers near the beginning of a program. It helps a lot, because they contribute to making the design robust from the beginning, and they understand the ins and outs very thoroughly, so their support after the drive is in production is based more on knowledge than guesswork.

In addition to the team members, there are purchasing, marketing, documentation, servo-writer development and laboratory specialists who get involved, though they're in "service groups". There can be half a dozen of those people who are "project assigned".

AD: Is price point the most important factor in a new hard drive?

JT: Yes. I suspect it even overrides profitability, though hard drive companies can't afford many "loss leaders".

AD: How long is the development cycle from original design specification to production?

JT: Again, it varies. A "bump" product, changing the amount of data per platter, can be done in as little as 6 months. A design from scratch, starting a new product family, will take from 2 1/2 to 5 years, but if it takes more than 3 years, something went wrong. Viking 1 was a crash program for Quantum, a new design and new product series, and it took 16 months. Viking 2 was a significant enhancement of Viking 1, and it took just over 2 years.

AD: In the late 1980s or early 1990s one hard drive maker (a small company, from what I recall) sued all of the other hard drive manufacturers over 3.5 inch disk patents which the company owned. There does not seem to be much litigation in the hard drive business today. Is this due only to the extensive patent cross licensing between the hard drive manufacturers?

JT: That was Rodime. Conner settled for an amount rumored within the company (I worked there at the time) to be > $10 million. Quantum fought, and eventually won, but probably spent more money than Conner. I think the major hard drive manufacturers' upper management has recognized that they'll spend less money on the lawyers if they cross-license than if they sue.

Of course, there aren't nearly as many hard drive companies as there were 10 years ago, so there aren't so many opportunities for litigation. The number of manufacturers has decreased enormously--there are only 7 or 8 that I can point to today, where in the mid to late '80s there were at least 50.

AD: Can you describe the HD design process from your (Mechanical Engineer's) point of view?

JT: Before a program formally begins, there's a period of product definition. People study head and disk technology, look in their crystal balls (cracked and cloudy, of course) to guess what the hot product is going to be in two years, and generally try to figure out how many disks, what kind of heads, what RPM and what capacity the drive should have. Engineering gets a hand in the process. Usually marketing sets up an approximate specification and the guffaws from engineering cause changes.

There are usually two or three Engineering builds, then two or three Pre-production builds before mass production begins. Before the first E-build, we often take an existing drive and try one or two ideas on it. We'll only build two or three samples to test the ideas. The first E-build will be 10 or 20 units. The base is either machined from a solid block of aluminum, or modified from an existing base. These units will have the right number of heads and disks, and any key mechanical developments. They'll usually run on a previous product's PCBA, or a new PCBA that's out of form factor and has a socketed processor so the EEs can plug in an emulator (ICE).

Massive changes can happen between the first and second E-builds. Number of disks, RPM, head technology (MR to GMR, for instance), and spindle motor internal design have all changed in my experience. For the second E-build we try to have all the production technologies, though very often the base casting and the actuator won't be made by production methods. At this time, the circuit board will be specific to the product, though it's often still out of form factor and always socketed. The drive will have close to the right TPI, but it usually won't be formatted to full capacity. Data and servo format details change almost weekly at this stage.

There will be maybe 100 drives built for E-2. Mechanically, we'll be measuring runouts, shock and vibration performance, EMI (electromagnetic interference), contamination and sealing issues, acoustics, seek performance, and whatever else we can think of. It's our last chance to find and fix major problems without affecting the program schedule.

The first P-build is critical to a project's success. It's when it all comes together. Mechanically, the castings will be castings (not machined from solid), stampings will be stamped, and the drive will generally look like a production unit. Electrically, they'll have real silicon for the main ASICs. For the first time, we'll try to have at least some of the drives written to full capacity. At Quantum, they build 1,000 or 1,500 drives at this time. Formal product testing begins. We show samples to OEM customers, but don't give them any.

The second P-build should be a cleanup of problems found so far, and usually doesn't involve major mechanical changes. If big changes are needed in the mechanics, a third P-build will usually have to happen. P-2 drives are given to OEM customers to begin evaluation. They should show 90% of the production units' performance. Code changes come two or three times a week now.

P-2 is usually the darkest time, emotionally. You have a year or more invested, and all you can see are problems. Marketing has been asking why it wasn't ready 6 months ago.

Then you have the mass production launch. There isn't generally much to do, mechanically, at this stage. If there is, you've got BIG trouble! Servo engineers and interface code guys are working like mad to squeeze the performance and get the bugs out.

AD: What is the hardest part of the design and/or the design process?

JT: Inventing a good solution to a key problem. That's also what's the most fun. When you find an answer, it's just wonderful, and usually involves some amount of serendipity. As an example, we were having intermittent spindle motor performance problems with Viking 1, during the E-2 time. Motors that performed poorly also often, but not always, made a buzzing noise. My boss happened to pick up a motor that had been torn apart and twisted the stator, and it came off in his hands. Now, the stator is supposed to be firmly fastened down!

So, over the next month, the motor company engineers and I worked out several ways to improve the stator's fastening to the base. We ended up adding about 30 cents to the motor's cost, but Viking 1, Viking 2 and Atlas 4 have among the quietest motors in the industry.

AD: Are the interface/firmware engineers (I would assume EEs mostly) involved from the beginning of the design or only after the mechanical engineers are done?

JT: The interface guys start a few months after the MEs start. They usually begin to work on some of the E-1 drives and are fully involved by E-2.

AD: What is the most interesting discovery that you made (which is not a trade secret) during the course of your design or re-design?

JT: The one that's most interesting to me is maybe a little bit abstruse. The disks and spindle motor bearings and the base casting all form a complicated set of springs that vibrate with various frequencies as the drive spins. The source of some of the vibration is obvious--imbalance in the disks, irregularities in the ball bearings, torque pulses from the motor, for example. But especially at 7200 RPM and above, there's more vibration amplitude, especially at higher frequencies, than these sources ought to produce.

It's air turbulence. At 5400 RPM, most of the air flow is laminar. At 7200 RPM, the airflow at the outside of the disks is in the transition range between laminar and turbulent. And that causes odd vibrations to come and go, as the airflow changes back and forth.

AD: Do you use computer simulators for all of your designs? (e.g. hardware-software co-design simulator software similar to the software that CARDtools Systems provides)

JT: The EE and code guys do a lot of simulation. I'm not sure what tools they use. Sometimes it looks a lot like Doom. <vbg> The servo guys use a combination of Matlab and custom-developed software.

Mechanically, we do most of our design these days on a solid modeler. Different companies use different solid modeling software. We also use FEM extensively. One person in each mechanical team is usually proficient with FEM. We also do Monte Carlo analysis of assembly tolerances.

AD: Can you discuss/define the terms FEM and Monte Carlo analysis?

JT: FEM = Finite Element Modeling. You use programs such as Nastran or Ansys or Algor or Fluent to model various mechanical problems--stress and deflection, magnetic fields, fluid flow, vibration modes, and so forth.

To become really proficient at using an FEM program requires working at it full time for a year or two. Once you've learned one of them, you can pick up another in six months or so. But I've never met anyone who could use more than one FEM program at the same time--they're all very finicky and different from each other.

You can get results quite easily. It's very difficult to get meaningful results that correlate well with experiment.

FEM used to be mainframe or mini stuff. It started to run on Unix workstations about 10 years ago, and in the last two or three years it's become practical to run FEM on a high-end Windows NT box. Models run in anywhere from a few minutes to overnight. If you have a really big, slow model, it could have taken months to build, and a weekend run doesn't seem all that slow. Big models, in Unix, can usually be parceled out across various machines on your local network, overnight or over the weekend. FEM is basically inverting a few xillion enormous matrices. The problems involve ill-conditioning and slow inner loops. That's why it takes an expert to get good results--ill-conditioning, especially, can give very bad answers without being obvious.

Monte Carlo analysis is used for studying the effect of assembly variables. "Monte Carlo" refers to "rolling the dice". Any time you have many statistically independent variables, for each one of which you can propose a statistical model of its values, and for which you can make a mathematical model of how they combine, you can use a Monte Carlo analysis to come up with a statistical model of how the variables might work together. I don't know of any system that can make a general mathematical model of how the variables might combine, so a Monte Carlo analysis requires writing the core engine of a program for each problem.

Here's an absurdly simple example, the sort of thing that's commonly checked out with a spreadsheet.

Say you have a stack of 6 bricks in your assembly. All the bricks are nominally the same thickness, but there are three brick factories where you buy them, and of course, bricks aren't all =exactly= the same thickness, and the bricks from each factory tend to be a little different. You're going to make a million of the assemblies, so you want to know what you can expect the height of the tallest, shortest, and average stack will be. You'd also like to know what the odds are that the stack will be higher than some "magic height" where it won't fit.

You measure a bunch of bricks from each factory, and calculate the mean and standard deviation of each factory's output. If you're clever, you also make a histogram of thicknesses and see if the distribution matches (within reason) the "normal" distribution (Bell curve, Gaussian distribution).

Then you write a program that takes into account the number of bricks coming from each factory, and each factory's distribution, and roll the dice to make, on paper (or computer, whatever), a large number of brick stacks. Say 20,000 just for laughs. You put the results into a histogram and report the mean, standard deviation, min, max, number over "magic", and so forth.

The advantage of simulation is that you can tinker with the variables.
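The brick-stack example above can be sketched in a few lines of Python. This is only an illustration, not code from the interview: the factory shares, mean thicknesses, standard deviations, the "magic height", and all function names are made-up assumptions, and each factory is assumed to follow a normal distribution.

```python
import random
import statistics

# Hypothetical inputs, one tuple per brick factory:
# (share of purchases, mean thickness in mm, standard deviation in mm)
FACTORIES = [
    (0.5, 57.0, 0.40),
    (0.3, 57.2, 0.60),
    (0.2, 56.9, 0.30),
]
BRICKS_PER_STACK = 6
MAGIC_HEIGHT_MM = 345.0   # hypothetical height above which the stack won't fit
N_STACKS = 20_000


def simulate_stacks(n_stacks=N_STACKS, seed=1):
    """Roll the dice: build n_stacks virtual stacks and return their heights."""
    rng = random.Random(seed)
    shares = [share for share, _, _ in FACTORIES]
    heights = []
    for _ in range(n_stacks):
        # Choose a factory for each brick according to its purchase share,
        # then draw the brick's thickness from that factory's normal model.
        picks = rng.choices(FACTORIES, weights=shares, k=BRICKS_PER_STACK)
        heights.append(sum(rng.gauss(mean, sd) for _, mean, sd in picks))
    return heights


def summarize(heights, magic=MAGIC_HEIGHT_MM):
    """Report the statistics named in the text for a list of stack heights."""
    over = sum(1 for h in heights if h > magic)
    return {
        "mean": statistics.fmean(heights),
        "stdev": statistics.stdev(heights),
        "min": min(heights),
        "max": max(heights),
        "over_magic": over,
        "fraction_over": over / len(heights),
    }


if __name__ == "__main__":
    for key, value in summarize(simulate_stacks()).items():
        print(key, value)
```

Tinkering with the variables then means editing the FACTORIES table (for instance, shifting purchases toward the tightest factory) and rerunning to see how the fraction over the magic height moves.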

For Atlas 4, I wrote a Monte Carlo simulation of where the tracks would be on the disk. I used 34 independent variables, and the "assembly" used a lot of trigonometry to account for the angles as the actuator rotates in going across the disk, and for various "tilts" that happen. I used Borland Pascal 7 to do the job, with objects. The simulation ran at about 100 assemblies per second on an Intel Pentium-90, so you could simulate 50,000 assemblies in less than 10 minutes. It took a couple of hours to make sense of the results, of course.

We ran 14 different sets of input variables before we were happy with the answers. It took me a month to write the program (I was actually rewriting a similar one that I did for Viking 1), and about 3 weeks to go through the analysis loops.

The engineer who did a similar analysis for Viking 2 used a Microsoft Excel add-in, and he used to let the program run overnight on an Intel Pentium-200.

AD: Can you discuss what is involved in the testing process for a hard drive? What basic tests are absolutely required to be passed before shipping the drive?

JT: People make careers out of testing hard drives. There are the various engineering and qualification tests that each product has to pass before it's "shippable", then the detailed production tests that each drive has to pass before it's shipped.

Engineering and qualification tests are by no means identical, but I'll lump them together for an SST-altitude view. I'm a mechanical guy, so I may miss some electrical or software testing in this list. It's not that I don't care, just that I'm ignorant of many details outside my specialty.

Operating and non-operating shock and vibration performance. Non-operating tests look for physical damage. Operating tests look for error rates and performance degradation in addition to physical damage.

Four-corner tests. Drive performance is measured at various combinations and rates of change of temperature and humidity, ranging generally about 5C beyond the specified temperatures, and usually some amount beyond the specified humidity (it's a lot harder to control humidity precisely).

Altitude tests. Drive performance is measured from 200 feet below sea level to at least 10,000 feet above sea level. Flying height and flying height variation are particularly scrutinized.

Voltage limits. Drives are typically specified to run at plus or minus 5% of specified voltage. Testing is commonly done to plus or minus at least 10%. There's normally a test to find out how far off you can go before the drive fails. All combinations of high, low and nominal 5V and 12V are tested.

RFI/EMI tests. The drive's electronic emissions are measured, and its susceptibility to external electromagnetic fields is measured.

Start/stop reliability. Samples are started and stopped massive numbers of times. Starting current, error rates and acoustics are measured at intervals. For a 40,000 start/stop spec, about 1000 drives are spun up and down maybe 80,000 to 100,000 times each. The test takes months.

Acoustics, both idle and seeking, both "new condition" and after various torture sessions. If a drive fails, it can be very hard to find out why and what to do about it. I've probably spent a total of 5 years working on acoustical issues, interspersed with other tasks.

Contamination measurements. This is usually done with drives that have been going through 4-corner or some kind of reliability testing. Test results can be incredibly baffling: hard to interpret, and hard to know what to do about.

All the various interface tests (data rates, error rates and so forth). There are many such tests, and I'm afraid I just don't know much about them in detail.

Latch reliability. All modern drives include some kind of a lock to keep the actuator parked in the landing zone while it's stopping and starting. There are several kinds, and many variations of each design. The lock has to keep the actuator parked while spinning down and not allow any combination of shocks and accelerations to let the heads move out of the landing zone while power is off. Both linear and rotational shocks and accelerations are tested. This is one of the most difficult tests to pass and one of the most hated assignments for a mechanical engineer.

TMR measurements and other servo performance measurements. This is a critical item. Servo performance is subject to strange failures on totally unpredictable combinations of seeking and external influences. Servo engineers have as hard a life as mechanical engineers! I've given you a very cursory discussion of TMR, and it shouldn't be hard to dream up dozens of tests from that, if you have a sufficiently evil mind. <g>

Thousands of drives are run for a few thousand hours each and power consumption, data-handling parameters, and reliability are measured. The final reliability test is so stringent that one hard data error in a couple of thousand drives, over a thousand or more hours each, can halt the program. Such a thing may happen once in two out of three development programs. There's hell to pay when it does!

Production testing

A small percentage of new drives are destructively tested for non-operating shock, internal cleanliness, and such things.

Samples are measured for acoustic performance.

The rest of these tests are run on 100% of the drives. It's typical for such tests to take about 8 hours. The 36-GB Quantum Atlas 4 drive needs about 20 hours to do its testing. That's partly because the error testing takes time directly proportional to the number of disks and partly because that drive gets unusually stringent testing because of its intended market.

Every head on every drive is measured for its reading and writing properties (amplitude, resolution, overwrite capability, PW50 [a measure of how cleanly a transition can be read], and nowadays some MR characteristics that I don't recall). The drive maintains tables of these parameters by head and zone on the disk. There are typically 16 data rate zones.

[ Note by JT: Look at the article about Disk Layout for more information about zones. ]

[ Note by AD: The data are kept in a "secret room" which John Treder will explain a bit more about:

A drive might have 4 disks, 8 heads, and 16 "data rate zones", each of which may have a different number of data sectors. Each head and each zone will have some of its critical read/write parameters measured during the factory test and stored away. That means a dozen or so tables of 128 values each (probably integers) to be kept somewhere. You don't want to put it in ROM, because it would be too expensive to have a unique ROM for each unit you build. And a hard drive is designed to hold variable information. So you simply keep all kinds of running and testing data on the drive. You also keep a good deal of the drive's operating code on disk, and page it to the drive's RAM as needed.

All that stuff is stored in "extra" tracks outside the user's data space. That's our "secret room". The extra tracks are formatted exactly the same as regular data space; they're merely on tracks -1 to -28 (or whatever), and they're part of the outermost data rate zone. There are usually 25 or 30 tracks "reserved" for that purpose, so a hard drive has room to store perhaps 20 or 30 MB of private programs and data. ]
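To make the table arithmetic in the note above concrete, here is a minimal Python sketch of one per-head, per-zone table. It is an illustration only: the parameter names, the flat head-major indexing, and the 16-bit integer encoding are assumptions, not Quantum's actual on-disk layout.

```python
import struct

HEADS = 8        # 4 disks, two surfaces each (from the text)
ZONES = 16       # data rate zones (from the text)
# Hypothetical parameter names; the text only says "a dozen or so" tables.
PARAMETERS = ["amplitude", "resolution", "overwrite", "pw50"]


def slot(head, zone):
    """Flat index of one (head, zone) cell in a 128-entry table."""
    if not (0 <= head < HEADS and 0 <= zone < ZONES):
        raise IndexError(f"no such head/zone: {head}/{zone}")
    return head * ZONES + zone


def pack_table(values):
    """Serialize one table as little-endian signed 16-bit integers (an assumption)."""
    if len(values) != HEADS * ZONES:
        raise ValueError("table must hold one value per head and zone")
    return struct.pack(f"<{HEADS * ZONES}h", *values)


# One table of 8 x 16 = 128 values per measured parameter.
tables = {name: [0] * (HEADS * ZONES) for name in PARAMETERS}
tables["pw50"][slot(head=3, zone=15)] = 1234   # made-up measurement

blob = pack_table(tables["pw50"])
print(len(tables["pw50"]), "values,", len(blob), "bytes per table")
```

Under these assumptions a dozen such tables come to about 3 KB, which is tiny next to the reserved tracks and shows why keeping them on the disk itself is cheap.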

Every surface is scanned for media defects and hard error locations are reallocated. There are algorithms used for "scratch-fill" to eliminate sectors between detected errors; those sectors are likely to have errors that just didn't quite get detected. Several hundred hard errors per surface is normal. The test includes deliberately moving the heads off the track center to find scratches or pits "between" the tracks.
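The "scratch-fill" idea can be sketched roughly as follows. This is a one-dimensional simplification written for illustration, not the algorithm any drive actually uses: real firmware works across tracks as well as along them, and the function name and the max_gap threshold are hypothetical.

```python
def scratch_fill(defects, max_gap):
    """Return the sorted sectors to reallocate: every detected defect, plus
    all sectors lying between two detected defects at most max_gap apart."""
    bad = set(defects)
    ordered = sorted(bad)
    for a, b in zip(ordered, ordered[1:]):
        if b - a <= max_gap:
            # Sectors between two nearby hits are suspect even though they
            # passed the scan, so mark them bad as well.
            bad.update(range(a + 1, b))
    return sorted(bad)


# Detected hard errors at sectors 10 and 13 are close enough to be treated
# as one scratch, so sectors 11 and 12 are reallocated too.
print(scratch_fill([10, 13, 40, 41, 90], max_gap=4))
# prints [10, 11, 12, 13, 40, 41, 90]
```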

Actuator latch opening and closing speed is tested.

Data rates and soft data error rates are measured by head and zone. Even if a drive may pass overall, it can fail on a detail.

Servo parameters such as raw seek times, settling times, stability parameters, and quality of the recording of the servo data are measured. Information about how much current it takes to stay on track, and what the relation is between seek current and acceleration (the torque constant) for several places across the disk is stored in the "secret room".

At Quantum, the production testing is done in an environment that's roughly equivalent to what a typical operating environment might be--ambient temperature around 40C, humidity whatever it is (factories are in Singapore, southern Japan and Ireland, so humidity is generally high), and about 100 drives running in a test cell, 10 drives per shelf, sort of bouncing around.

Criteria to pass the tests are always more stringent than the specs. That's been true everywhere I've worked. I thought about the possibility of trying to give specific numbers for these tests, but as I thought about it I realized that the test criteria change so fast that what I know is certainly already obsolete.

AD: Has there been any consideration to make flash ROM (similar to what most modems have) a standard for hard drives? Or is that considered too dangerous?

JT: During development and often leaking into the early production drives, there's a flashable ROM on the drives. It's replaced as early as possible with hard-coded ROM to save costs. And afterwards, if the ROM needs changing, it's an earthquake-level task.

So it isn't danger, it's a combination of $$$ and tradition.

AD: If cost was not an issue what kind of hard drive would you design for your own use?

JT: 10K RPM, 2 1/2" disks, about 20 GB, with two complete drives, striped, in one housing. It should be able to put about 60 MB/s across the interface, continuously.

It won't happen. It's WAY too expensive, but technically pretty easy to do.

AD: Any thoughts on other storage technologies such as CD-R/CD-RW, Magneto-Optical or DVD? Do you think these technologies will replace hard drives as the primary storage medium?

JT: I don't think CD and its derivatives will replace hard drives because the physics in the way they write, especially, is slower than hard drive magnetics. MO has no advantage over hard drives in speed or data density.

However, sometime before 2010, hard drives will hit the wall in terms of data density, and at this time I don't know of a way around it. That's the first time I've had to say that in my hard drive career. In the past, I've been able to perceive one or more ways around some supposed barrier to speed or capacity. When the data density barrier (technically it's called the superparamagnetic limit) is reached, the only candidate I see for replacement today is some development of flash ROM. They need to cut cost by an order of magnitude and improve writing speed by a couple of orders of magnitude. Those are formidable challenges!

AD: Any common misconceptions about hard drives that end users have that you would like to clear up?

JT: The biggest one is that you'll wear out the bearings by letting the drive run. If you leave a drive running continuously, there's roughly a 1% chance that a bearing will fail in the first 7 years.

The only other one is the argument about whether to leave it running or shut it off. I just said it won't hurt to leave it running. Well, the standard test for starting and stopping ends up with a 0.3% chance of a drive failing to start in the first 20,000 starts.

So my advice is, leave it running if you like, shut it off if you like. It doesn't matter. If your drive fails, it isn't because of your choice in that matter.

AD: Did you work a 40 hour work week during the design process?

JT: As I said before, Ha, ha, ha, ha!! During the heart of a project, 60 or 70 hours was pretty much normal. When there's a crisis, or during a build, 80 to 100 hours of actual working (not just being there) is what you do.

It's funny, of course. One guy will be busting butt, and the guy in the next cube will have nothing out of the ordinary on the fire. Yet engineering is essentially an intellectual sport, so you can't just hand off half your work when you have a crisis.

AD: How much documentation was produced for the Atlas 4? An estimate would be fine.

JT: Depends on what "documentation" means--but let me see--print docs, maybe a pile 20 feet high, if you don't count all the drafts.

The hard-copy documents I kept in my file filled an entire file drawer for each of three programs I worked on at Quantum. I threw away a lot more paper than I kept.

There were 200+ mechanical drawings, with 2 to 15 revisions each (average maybe around 5 revisions). I didn't have a complete solid model file of the drive; my solid model directories for Katana (Quantum's internal project name for Quantum Atlas 4 (SCSI) and Fireball +KA (IDE); they're the same except for the interface) never ran more than about 200 MB. Two other engineers kept solid model directories, too.

The firmware manager had a graph on his office wall about size of the code files--I think it peaked around 4 or 5 gigs.

E-mail & phone-mail messages, I have no idea. My E-mail was constantly overflowing my 4 MB allocation--about 500 messages or so. I had to purge it once every couple of months. Managers had more space <g>.

In other words, lots and lots of docs.

AD: What is the estimated cost to bring a new hard disk to market from start to production?

JT: These days, in the range of $50 million.

AD: Can you discuss a bit about the ATM (Automatic Teller Machine) deposit mechanism that you designed (20+ years ago)?

JT: 20 years is a long time! The problem was to accept deposit envelopes of various sizes and shapes and thicknesses, maybe containing coins, print an identification number on them, pass the UL test for theft resistance, work in pouring rain, fit in the available space, be easy to maintain, and cheap. In general, the usual engineering challenges. <g>

The hardest task was to pass the UL break-in test. The tester was a massive, muscular fellow armed with punches, crowbars, sledgehammers, long grabbers, and other tools. He could study the deposit system for as long as he wanted to, inside and out, before he began his attack. He had half an hour of actual "breaking in" time to try to fish out an envelope. That half hour didn't have to be contiguous. He could bang on the depository, then go around and see if his attack was working. Another UL person timed him with a stopwatch. We passed, barely.

It was also difficult to come up with a reliable envelope printer. I eventually designed a sort of rotary rubber stamp that printed the number every couple of inches along the envelope. If there were coins, it seemed from our testing that there was always a way to make out the number, maybe combining a couple of partial printings.

hddes4-3.jpg

This picture of a Corvette was taken in June of '67 at the Corkscrew at Laguna Seca.

AD: Since you have done race car driving via Sports Car Club of America (SCCA) road racing, do you ever play with car racing simulations/video games?

JT: I've tried a couple, also tried a couple of coin-op games. They're boring.

One of the arcade games had pretty good visuals, comparable to the in-car cameras you see on TV occasionally. The problem I have with all those things is that you don't get the physical feedback. 1.5G+ cornering forces, 1G+ braking, etc. If you get a flat-spotted tire it can literally cause you to see double. The simulations don't do that stuff. Also, they're generally over way too quick. If you imagine more intensity than an arcade racing game, then have it last for 45 minutes or so, that's what you do. I'm a skinny sort of guy, 5' 8" and 140 lb, and I used to sweat off 5 lb or so in a 40 minute session. It was fun (the most fun you can have while dressed), it was intense, and it required total commitment, not only on the race track, but in preparation too. When I was no longer prepared to give the commitment, I retired.

hddes4-4.jpg

This picture of a Ralt RT-4 was taken in March of '86 at Firebird Raceway in Phoenix.

AD: Thank you.

Mr. John Treder has also written more details on hard drive history, design, and performance issues which are included in the following sections.


 

Copyright © 1999 by Albert Dayes