Tuesday, September 28, 2021

PDP-11/70: In Which Josh Finally Writes Some More

Ahem.  Sorry about that, seems like I forgot what I was doing here.  Where was I?

February 9: The Saga of LOAD ADDR

When we left off, the 11/70 was showing signs of life but wasn't executing instructions and, more pressingly, the "Load Addr" switch would stop working after the system warmed up for a minute or so.

Some experimentation revealed that the switch itself was still functioning (eliminating the front panel as the issue) along with most of the logic behind it:  Toggling Load Addr wouldn't load the data in the data switches into the address register, but it would happily load zero into it.  This was further verified by stepping through the microcode and verifying that the correct branch was taken when the switch was toggled.

Having confirmed that, I tried running the system using the KM-11's RC clock.  The KM-11 has three options for controlling the system's clock:  it can run normally using the 33.3333MHz crystal on the TIG, it can single step the processor with the toggle switch, or it can run off of an RC-generated, adjustable clock, also provided by the TIG, but only used when debugging with the KM-11.  This clock is adjustable via a small trimpot on the TIG and allows running the clock slower or faster than normal, for margin testing.

With the RC clock selected and running at about 2/3 normal speed the Load Address switch functioned without issues, even after warming up for several minutes.  This would tend to indicate a marginal component that worked properly when run below its rated speed, but failed at full speed.
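
To put rough numbers on that margin, here is a minimal sketch. It assumes the RC clock was running at exactly 2/3 of the crystal frequency (the text only says "about"), and it computes the raw oscillator period only; how the TIG derives the T-states from that clock is not modeled.

```python
# Nominal crystal frequency on the TIG, taken from the text above.
CRYSTAL_HZ = 33.3333e6


def period_ns(freq_hz: float) -> float:
    """Return the clock period in nanoseconds for a frequency in hertz."""
    return 1e9 / freq_hz


nominal_ns = period_ns(CRYSTAL_HZ)
# Assumption: "about 2/3 normal speed" is treated as exactly 2/3.
slow_ns = period_ns(CRYSTAL_HZ * 2 / 3)

print(f"crystal period:       {nominal_ns:.1f} ns")
print(f"2/3-speed period:     {slow_ns:.1f} ns")
print(f"extra time per cycle: {slow_ns - nominal_ns:.1f} ns")
```

Under that assumption, every clock period gains roughly half again as much settling time, which is plenty to hide a slow or marginal part.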

Since the issue didn't reproduce at slow speeds, stepping through the microcode to find the culprit wasn't an option.  Before bringing out the Big Guns (i.e. The Logic Analyzer) I traced through the flow diagrams to see if any obvious test points emerged.

Looking at FLOWS 14, ADR.00 is the only possible culprit (as it's the only microinstruction in the path that modifies PCA, the internal representation of the PC register).  Since the address isn't getting loaded, either PCA <- BR (load the PC with the Bus Register) isn't functioning, or the T5 clock that triggers the former isn't being generated.

The T5 signal looked fine on the scope, so the PCA <- BR logic was the next thing to look at.  The signal that triggers this operation is generated on the DAP (DAta Paths) board below and is called DAPJ CLKPCA H.  It's generated by the combination of the TIGD T5 H (T5 clock) and RACA UPCA H, a signal generated by the microcode PROMs for instructions that need to load PCA.


Putting the scope on pin 6 of the 74S11 at E43 (i.e. DAPJ CLKPCA H) showed no output at all.  The T5 clock input was clocking just fine, but the RACA UPCA H input was a flat line, indicating that the fault was elsewhere.
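
The reasoning here can be sketched as a tiny gate model. The 74S11 is a 3-input AND gate; only the two inputs named above are modeled, and the third input is assumed (not confirmed from the schematic) to be held high. The function names are made up for illustration.

```python
def and_gate(*inputs: bool) -> bool:
    """One AND section of a 74S11-style gate: high only if every input is high."""
    return all(inputs)


def clkpca(t5: bool, upca: bool) -> bool:
    """Hypothetical model of DAPJ CLKPCA H from TIGD T5 H and RACA UPCA H."""
    # Assumption: the gate's third input is tied high and can be ignored.
    return and_gate(t5, upca, True)


# T5 toggles as it should on every cycle.
t5_samples = [False, True, False, True]

# With UPCA stuck low the output never moves, matching the flat line on the scope.
stuck_low = [clkpca(t5, upca=False) for t5 in t5_samples]
# With UPCA asserted the output simply follows T5.
healthy = [clkpca(t5, upca=True) for t5 in t5_samples]

print("UPCA stuck low:", stuck_low)
print("UPCA asserted: ", healthy)
```

A dead output with one good input and one flat input says the gate is probably doing its job, and the real fault is upstream of the flat input.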

RACA UPCA H is generated on the RAC (ROM and Address Control) board, so the DAP board was reinstalled and the RAC board brought out on the extender for testing.  And after doing so... the problem went away!  I reinstalled the RAC board directly in the backplane and the issue remained fixed.  Despite my best efforts I couldn't get the issue to rear its ugly head again, so I considered it fixed.

This is the first instance of an intermittent backplane connection causing issues, and I suspected it wouldn't be the last.

February 10-12: Making Instructions Execute

Now that it was possible to load addresses reliably via the front panel, it was much easier to toggle in short test programs to debug instruction execution.  As you might recall from the last post, instructions wouldn't execute at all:  Hitting "Start" would immediately return to the Halt state, and the PC would not be incremented.  "Continue" behaved similarly.  

Fortunately, this behavior was traceable using the KM-11 to single-step the microcode.  Tracing through the microcode indicated that the microcode flow was aborting early and returning to the main console loop (via BRK.90) before the expected instruction fetch and PC increment at FET.10.  

Under normal circumstances, the branch to BRK.90 would only be taken if the BRQ signal was set.  BRQ (Brake ReQuest -- literally "put on the brakes, we're stopping this thing") indicates that the processor has been halted for some reason: the halt switch, power failure, an interrupt or a trap, for example, and causes a microcode jump back to the top of the main console loop, aborting the current operation.  In this case, it was short-circuiting the instruction execution before anything at all could happen.

And sure enough, probing the BRQ signal (TMCB BRQ TRUE H), on the TMC (Traps and Misc. Control) board showed that it was stuck high.  

As you can see the TMCB BRQ signals are generated based on a whole slew of inputs there on the left -- interrupt requests, traps, the whole lot -- all being OR'd together.  Tracing backward from the 74S11 at E61, the 74S04 at E55 tested fine but the 74H30 at E70 had an output value stuck at 1.65V regardless of the inputs.  This is not a valid TTL signal level for either a logical 0 or 1, stuck somewhere in no-man's land between the two.  However, the 74S04 (inverter) at E55 thought it was just high enough to count as a "1" and thus provided a nice clean logical 0 at the output on pin 8, resulting in TMCB BRQ TRUE H being stuck high.
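
For a sense of why 1.65V is trouble, here is a small sketch using the standard TTL input thresholds (0.8V maximum for a guaranteed low, 2.0V minimum for a guaranteed high). Treat these as the generic family figures; the exact numbers for the 74H and 74S parts on this board should be checked against their datasheets.

```python
V_IL_MAX = 0.8  # volts: highest input level guaranteed to read as logic 0
V_IH_MIN = 2.0  # volts: lowest input level guaranteed to read as logic 1


def classify_ttl_input(volts: float) -> str:
    """Classify a voltage against the guaranteed TTL input thresholds."""
    if volts <= V_IL_MAX:
        return "low"
    if volts >= V_IH_MIN:
        return "high"
    # Between the thresholds the receiving gate may resolve either way.
    return "undefined"


for v in (0.3, 1.65, 3.4):
    print(f"{v:.2f} V -> {classify_ttl_input(v)}")
```

In the undefined band nothing is guaranteed, so a real gate will pick whichever side its own threshold happens to fall on. Here the inverter at E55 decided it was a 1.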

Debugging the TMC

 

I replaced the 74H30 and now individual instructions were executing, in a certain sense.  Unfortunately, after each instruction, the PC contained garbage rather than the next address.  

For example, executing a simple HALT instruction at address 1000 should increment the PC to 1002 and then halt the processor.  What I was seeing was a halt with 3602 in the PC instead.  Sometimes this value varied, randomly.

This one was puzzling and it took me a bit of poking around to figure out what was going on.  The PC register (also referred to as R7) in the PDP-11 is actually split into two separate registers within the PDP-11 hardware: PCA and PCB.  These are modified during instruction execution and are used to store intermediate values for various address calculations.  As described in the introduction to Chapter 2 of the 11/70 Processor Manual:

The data paths diagram below outlines how data gets around the processor, as controlled by the microcode.  PCA and PCB are in the middle near the top and you can see that after exiting the ALU, a calculated address goes to PCA and then through to PCB; PCB is used as the source value for operations involving the PC in many places:

There will be a quiz later.

So my suspicion immediately fell upon the poor PCB register, and my initial thought was that the transfer from PCA to PCB wasn't working properly and was picking up some bits in the process.  The logic for these two registers is below (well, for bits 6-15, anyway, 0-5 are on another sheet):


PCA and PCB are built out of two sets of three 74S174 hex flip-flops.  The controls for each register consist of a "Clear" (which clears the register) and a "Clock", which loads the data present on the data lines into the flip-flops, thus storing a new value.  As you can see, PCA is directly connected to PCB, and a transfer from A to B will occur when DAPJ CLKPCB H goes high, clocking the data from the output of A into B at the request of the microcode.  The output of B connects to a multitude of other places in the processor.  (The input of A comes from the ALU which itself takes inputs from a number of potential sources -- see the earlier paths diagram to see them all.)

If DAPJ CLKPCB H wasn't getting signaled then PCB wouldn't get updated at all from PCA and apart from that there's no other way for data to get into PCB.  But clearly PCB was getting clocked, since the value in PCB was getting updated (just incorrectly).  Was it possible that the outputs from PCA were incorrect?  Unfortunately no -- probing the outputs of PCA showed the correct value.

Hmm.

After some more experimentation, I noticed that the corrupted data was only present in bits 6 through 11 of the final PC value.  Executing a HALT at address 100000 (which should halt at 100002) instead halted at 102602 (or thereabouts -- as before the value was slightly random).  A HALT executed from 0 halted at 2602, and so on. That corresponds nicely to the arrangements of the three 74S174's comprising PCB, each containing 6 bits.  The "middle" 6 bits of PCB were in the '174 at H47 on the DAP board.  I replaced this with a spare and afterwards, single-stepping instructions worked properly! 
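
The bit-hunting above can be done mechanically by XORing the expected and observed PC values. This sketch uses the three examples from the text and assumes the 16 bits are split across the three '174s as bits 0-5, 6-11 and 12-15; the helper names are made up for illustration.

```python
def differing_bits(expected: int, observed: int) -> list[int]:
    """Return the bit positions (0 = least significant) where two 16-bit words differ."""
    diff = expected ^ observed
    return [bit for bit in range(16) if diff & (1 << bit)]


def chip_for_bit(bit: int) -> int:
    """Index of the hex flip-flop holding a bit, assuming 6 bits per chip.

    0 = bits 0-5, 1 = bits 6-11 (the "middle" chip), 2 = bits 12-15.
    """
    return bit // 6


# (expected PC, observed PC) in octal, from the HALT experiments above.
cases = [(0o001002, 0o003602), (0o100002, 0o102602), (0o000002, 0o002602)]

for expected, observed in cases:
    bits = differing_bits(expected, observed)
    chips = sorted({chip_for_bit(b) for b in bits})
    print(f"{expected:06o} vs {observed:06o}: bits {bits}, chip(s) {chips}")
```

All three cases differ in the same few bits, and they all land in the middle chip, which is what pointed at the '174 at H47.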

 

I tried a variety of different instructions and single-stepping through them appeared to work correctly.  However, at full speed a simple loop -- like a "BR .-1" (000777) -- would not loop for very long.  After running for a few milliseconds at most it would somehow jump past the branch instruction and end up at the next address.  Other short programs behaved similarly, running for a short time before ending up off in the weeds.
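
For reference, here is a short sketch of how a PDP-11 branch target is computed: the low byte of the instruction is a signed offset in words, applied after the PC has already advanced past the instruction. The starting address of 1000 octal is just an example.

```python
def branch_target(address: int, word: int) -> int:
    """Target of a PDP-11 branch instruction located at `address`."""
    offset = word & 0o377      # low byte is a signed offset, counted in words
    if offset & 0o200:         # sign-extend the 8-bit offset
        offset -= 0o400
    # The PC already points at the next word when the offset is applied.
    return (address + 2 + 2 * offset) & 0o177777


BR_LOOP = 0o000777  # BR with an offset field of -1 word

target = branch_target(0o1000, BR_LOOP)
print(f"BR at 001000 branches to {target:06o}")
```

With an offset of -1 the branch lands back on its own address, so the whole loop is a single word. Ending up at the next address means the branch simply failed to be taken at some point.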

I spent the next few days poking and prodding, but the details of this must wait for the next EXCITING INSTALLMENT of this saga.  Which hopefully I'll get written up before the heat death of the universe but we'll just have to see.  Until then, keep on keepin' on.

Saturday, April 24, 2021

PDP-11/70: Even More Stuff

January 2021: Debugging Commences

In my last post I left off my tale of PDP-11/70 restoration in late January, having just powered up the rebuilt supplies after reinstalling them in the chassis.  The next step was to reinstall the processor, cache, and memory and see what happens:

(not quite) The First Power-up with Stuff Installed

The answer is: not much.  The processor was almost entirely unresponsive: it powered up with the RUN and MASTER lights on and wasn't responding to most input from the front panel.  Toggling the "Halt" switch and hitting "Start" caused the RUN light to go out, but that's the only response I got from the console.

Enter the KM11-A:

How does one debug a processor as complex as the 11/70's?  These days, advanced diagnostic tools like Logic Analyzers and digital storage oscilloscopes are commonplace, but in 1974 they weren't really an option.  DEC's solution to this was the KM11-A "Maintenance Set", a pair of boards with an array of lights and four switches.  The lights were used to monitor device state, and the switches controlled the behavior of the device and allowed for single-stepping processors.  The KM11 could be used to debug a variety of DEC hardware -- various PDP-11 processors and a few different peripherals and device controllers.  My KM11-A is a reproduction, which I built over a decade ago to debug my PDP-11/40; since then I've also used it to repair my PDP-11/05.  And now, it's time for the KM11 to work its magic again.

With the KM11 boardset installed (you can see it sticking out from the left-hand side in the above picture) I was able to step the processor through micro-instruction execution.  A toggle switch on the KM11 clocks the processor, and the DATA lights on the front panel show the microcode address in the right 8 bits (with the selector knob turned to "uADDRS FPP/CPU").  (In the above picture it's showing address 200 octal.)  The PDP-11/70's KB11-C processor is microcoded, using an array of small, high-speed bipolar PROMs to store 256 64-bit microcode words.  These 256 words are interpreted by the hardware to implement the PDP-11 instruction set, address memory and the Unibus, and to interface the processor to the front panel.
 
The KB11-C Engineering Drawings contain 14 pages of "flow diagrams" which detail precisely how the microcode executes.  The "KB11-C Processor Manual" (EK-KB11C-TM-001) provides 376 pages explaining exactly how the hardware works.  A typical flow diagram looks like:
Feel Flows
This is FLOWS 14, which diagrams the Console (front panel) portion of the microcode.  Top-center, you can see a starting bubble labeled "CON.00" which marks the start of the console portion of the microcode.  The box below it represents a single microinstruction, and details the operation of this microinstruction in each of the processor instruction cycle's "T-states."  The arrows coming out of this box indicate branches to other microinstructions, depending on the state of the hardware at the time of the instruction execution.  Branches may also lead to other flows (indicated by diamonds). 

Use of the KM11 indicated that the processor was definitely executing microinstructions, and seemed to be following the flow diagrams in the engineering drawings.  This is excellent -- it indicates that a lot of the hardware is functional.
 
Curiously, left to its own devices the processor didn't seem to be executing microinstructions at all and was stuck at micro-address 200 octal.  This is "ZAP.00" in the flow diagrams and is where the processor starts at power up or after a reset.

In the troubleshooting section of the 11/70 service docs (diagram on p. 5-16) it states:
IF LOAD ADRS DOES NOT WORK AND:
- RUN, MASTER & ALL DATA INDICATORS ARE ON
- uADRS = 200 (ZAP)
THEN MEMORY HAS LOST POWER
Which seems to adequately describe the symptoms I was seeing -- there is power-fail hardware in the processor that forces the microcode address to 200 in the event that power is lost, but the AC and DC LO signals (which are what the power supply uses to tell the processor of such a failure) were all fine (after checking again, just to be sure).   Also if this was the case I wouldn't expect that the KM11 would be able to step the processor at all -- the power fail hardware should force the processor's microcode address to 200 at all times until the power failure is resolved.

Probing the processor clock signal on the backplane with an oscilloscope revealed no clock signal at all, just a flat line.  The clock signal is provided from one of three sources on the "TIG" (Timing Generator) board:  Normally it comes from a 33.3333MHz clock crystal.  While debugging with the KM11, it can come either from the MAINT STPR switch on the KM11, or from a special diagnostic RC clock network on the TIG board (this latter can be adjusted to a wide range of frequencies for margin testing).  This lack of a clock signal was definitely an important clue.

Another oddity was revealed after a closer look at the service docs: In Chapter 4 of the Processor Manual, Section 4.1.3 it states:

"The third source of timing [the other two being the crystal clock and a diagnostic R/C network] is the manually-operated, single-step MAINT STPR switch S4, located on the maintenance card.  This switch is only enabled when maintenance card switches S2 and S3 are both set to 1."

Section 4.2.3 confirms this:

"The maintenance card S2 and S1 switches are both set to 1 to allow single timing pulses to be generated by MAINT STPR switch S4.... Removing the S2 or S1 input conditions the MS EN flip-flop to be cleared."

What was interesting about the above is that on my system, switch S4 (MAINT STPR) stepped the processor with switches S1 and S2 set to any configuration.  This being the case, I wondered if the logic that selects the clock source was faulty, and was always selecting the MAINT STPR input.
 
Well, only one way to be sure, and this would require getting the TIG board out on an extender for some extensive probing.  In doing so, I found that no clock signal was being generated by the 33.3333MHz crystal at all; in fact while probing it one of the legs to the crystal fell right off.  This is usually a sign of a faulty component.

So I placed an order on Digi-Key for a replacement.

But then I got impatient and remembered that the rusty burned-out hulk of a PDP-11/45 I picked up along with the 11/70 was in the garage, and the 11/45 also has a TIG board, very similar to the one in the 11/70, and also using a 33.3333MHz crystal.
 
A short while later, the 11/70's TIG had a new, stolen, clock crystal:
Where'd you get that shiny new crystal?

And after reinstalling the TIG back in the backplane and powering up:


It's alive!  A bit.  With a working clock, the processor was able to respond to the front panel and I was able to load addresses and examine and deposit into memory.  However, instructions would not execute -- loading an address and hitting "Start" on the front panel had no effect.  More pressing: after the system warmed up for a minute or two, the "Load Address" switch on the front panel would stop working properly, and would always load "0" rather than what was in the front panel switches.
 
Still, good progress for just a few evenings of research and debugging (and conversing with people on cctalk for advice.)  Over the next few days I started in on investigating these issues... which I'll talk about in my next exciting installment.  Until then... go find something else to read.


Saturday, April 17, 2021

PDP-11/70 Repair: Part One

Looks like a few months have passed since my last post, as seems to be typical.

Never you mind, let's just ask the question: How'd that whole "Restore a rusty soot-covered PDP-11/70" thing turn out?  Well, I don't want to spoil anything.  Let's pick up where we left off.

 

November 2020: Make it Look Good

Looking good!

Well, first I installed the replacement front panel assembly.  That'll get you 90% of the way there, as anyone who restores old computers can tell you.

December 2020, January 2021: Cleaning, Capacitor Reforming, and Fan Replacement

Despite a pretty new face, the computer was still extremely dirty.  When I first brought the system home I'd given the rusty parts a rough sanding to get rid of the grit and loose paint, but the inside of the chassis was still amazingly filthy. 

Dirty Little Fingers

The boards themselves cleaned up quite well: they were all covered in a fine grit of soot and who knows what else, but soaking them in warm soapy water for 10-15 minutes then scrubbing with an old toothbrush eliminated most of the detritus.  After drying in front of a box fan, the gold fingers were cleaned up with liberal use of Scotch Brite(tm) cleaning pads.  

None of the 17 boards that comprise the Processor, Cache, and Memory of the system appeared to be seriously damaged.  That's good!



My next major concern was whether the backplane itself was hiding some corrosion -- corroded pins make poor contact with the boards and might never work reliably.  And these pins aren't exactly trivial to replace -- while it might theoretically be possible to undo the wire wrap to a bad pin, desolder it from the backplane assembly PCB and remove it... no, you know what: it's impossible for all intents and purposes.  If you have a dead pin on a backplane like this, the backplane (and it follows, the computer) is toast.

This weighed fairly heavily on my mind, as soot and moisture could easily have destroyed this computer.


Empty backplane, mostly.
With the boards removed for cleaning the chassis was now empty, with the backplane exposed for easy (if extremely slow and somewhat painful) cleaning. There are 44 slots in the KB11-C backplane, each of which is divided into 6 sections, designated A-F.  Each one of these sections needed to be cleaned.  A strategy I've used in the past is to fold a thin piece of cardboard over a credit card; this jerry-rigged assembly can be dipped in 99% isopropyl alcohol and then used to clean the slot by inserting it and removing it a few times.  Accumulated dirt and light corrosion will be pulled off leaving the slot at least slightly cleaner than it started off.  For good measure, before starting on each slot, I gave it a good dose of contact cleaner to help loosen things up.

Repeat this for all 6 sections of all 44 slots.  I did this over the course of about three weeks, a few slots every night to prevent my fingers from falling off.  Most of the slots were already pretty clean, but on a few the cleaning card came out quite dirty and required a few extra passes.  The last few slots toward the rear of the chassis took the most time to clean.

Had I to do this again, I might have tried removing the entire backplane from the chassis and rinsing it out with water or isopropyl first, but I got pretty good results with this approach.

Reforming Capacitors

In the midst of the above cleaning process, I started in on the power supplies.

The PDP-11/70 gets its power from two H7420a power supplies.  Each H7420 is a large bulky unit with an extremely heavy transformer up front; this transformer provides 30VAC to up to five modular power supply units which take the AC and provide regulated DC.  Different modules provide different voltages; the H745, for example, provides -15V at 10A; the H744 provides +5V at 25A; the H754 gives you +20V and -5V.

Depending on what system you have and what options it has fitted, the H7420 might have a variety of these modules installed.  For the PDP-11/70 system I have, it's entirely H744s -- seven of them, providing a whole lot of +5.  The H7420 itself provides + or -15VDC, as well as 8V and the ACLO and DCLO signals used to let the computer know if it's about to lose power.
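
Just how much +5 is "a whole lot"?  A back-of-the-envelope sketch, using the H744 rating quoted above and assuming all seven modules could be loaded to their full rating at once.  This is rated capacity, not what the machine actually draws.

```python
H744_VOLTS = 5.0   # output voltage per module, from the text above
H744_AMPS = 25.0   # rated output current per module, from the text above
H744_COUNT = 7     # number of H744s in this system

# Assumption: every module runs at its full rated current simultaneously.
total_amps = H744_AMPS * H744_COUNT
total_watts = H744_VOLTS * total_amps

print(f"rated +5V capacity: {total_amps:.0f} A ({total_watts:.0f} W)")
```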

A really dirty H744, prior to cleaning.

Over the course of a month, each of the H744s was removed, disassembled and cleaned.  The capacitors were removed and reformed.  Normally, I like to replace capacitors rather than reforming, just for the sake of reliability and peace of mind.  However, I thought I'd give reforming a try this time around, mostly due to cost considerations:  Each H744 contains three large capacitors, and with seven of them to restore (plus two extra for spares that I happened to have lying about), it was looking like I'd be investing about $750 to replace them all.

I won't go into details on capacitor reforming here -- it's well documented all over on the 'net (David Gesswein has a nice write-up here) and it's not all that exciting.  The upshot of my reforming experience was that of the 28 capacitors in the supplies, four of them ended up being marginal, and two were completely dead.  Not too shabby.  

Testing of the supplies (and the capacitors) was done using an electronic load that I bought for the occasion.  It's a lot more convenient than using banks of resistors, and it looks cool too:

Burning in an H744 (middle) on the bench.  Electronic load on the left, H7420 on right.

The electronic load gives me the ability to vary the load while testing, starting with a small amount of load for initial smoke testing, then ramping it up to really soak test the thing.  I let them run for a couple of hours each.  While this is going on, the output of the supply is monitored on an oscilloscope, to check that ripple is within tolerances (about 200mV, max). This is also a good time to do an initial adjustment of the voltage level (each H744 has a small potentiometer exposed on the front that is used for this purpose).
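
The ripple check itself is just a peak-to-peak measurement compared against the limit.  A minimal sketch follows; the sample voltages are invented purely for illustration (they are not readings from these supplies), and the 200mV limit is the approximate figure mentioned above.

```python
RIPPLE_LIMIT_MV = 200.0  # approximate maximum ripple, from the text above


def ripple_mv(samples_v: list[float]) -> float:
    """Peak-to-peak ripple in millivolts for a list of voltage samples."""
    return (max(samples_v) - min(samples_v)) * 1000.0


def within_limit(samples_v: list[float], limit_mv: float = RIPPLE_LIMIT_MV) -> bool:
    """True if the peak-to-peak ripple does not exceed the limit."""
    return ripple_mv(samples_v) <= limit_mv


# Hypothetical scope samples from a +5V output under load (made-up values).
samples = [5.02, 5.05, 4.98, 5.01, 5.04, 4.97]

print(f"ripple: {ripple_mv(samples):.0f} mV, OK: {within_limit(samples)}")
```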

Of the 9 H744s, all but one tested out fine and required no additional repairs.  The one that failed would occasionally make an interesting short squeaking noise, with an associated drop in voltage.  I put that one back on the shelf for a future investigation (which as of this time has not yet occurred).

Fan Replacement

I have neglected to mention the state of the fans in this system: they were bad.  Very bad.  There are 18 fans in the PDP-11/70: 8 in the two H7420s, and 10 in the processor chassis.  Of these 18 fans, only three actually spun freely and even those sounded pretty bad when doing so.

OBEY THE PAPST FAN

I opted to replace these entirely.  While some of them were designed to be disassembled and cleaned, they were all rusty to the point where I just did not want to bother.  I found a decent supply of Papst fans on eBay for a reasonable price.  These are nice fans, well built with metal blades rather than the more common plastic ones.  Heavy.  Elegant.  Subtle.  Hungarian.  I like them.  

I installed 8 of them in the H7420s and reinstalled six of the H744s (and one H7441 that snuck in there while I wasn't looking.  See if you can find it, it's really exciting).

The supplies: all cleaned up and put back together!

At this point, all the wiring was double-checked, both for continuity and also for shorts and breaks in the insulation.  Everything checked out OK, so I fired the system up with an empty processor chassis, while holding my breath:

Hey, not bad.  No smoke or fire or bad smells, and the front panel lit up (all the lights are on by default since it's disconnected from the logic that normally drives it).  All voltages at the backplane were tested, per the service manual:

This is actually kind of a pain in the neck because you're finding tiny pins in a rat's nest of


The wire-wrap side of the 11/70 backplane: Where's Waldo?

and trying not to accidentally brush against another pin while doing so.  I used a set of small jumper wires that clip over the ends of the wire-wrap pins to help keep things isolated.  Even so it was kind of nerve-wracking.  Long story short: all voltages were present in the right places on the backplane, and the ACLO and DCLO signals were both high, as they should be.

Conclusion:

As January drew to a close, I had gotten the PDP-11/70 to a point where it was clean and safely powering up.  What would the following months bring?  STAY TUNED TO FIND OUT!