BitRot: September 2021

Ahem. Sorry about that, seems like I forgot what I was doing here. Where was I?

February 9: The Saga of LOAD ADDR

When we left off, the 11/70 was showing signs of life but wasn't executing instructions and, more pressingly, the "Load Addr" switch would stop working after the system warmed up for a minute or so.

Some experimentation revealed that the switch itself was still functioning (eliminating the front panel as the issue) along with most of the logic behind it: toggling Load Addr wouldn't load the contents of the data switches into the address register, but it would happily load zero instead. Stepping through the microcode confirmed this: the correct branch was taken when the switch was toggled.

Having confirmed that, I tried running the system using the KM-11's RC clock. The KM-11 has three options for controlling the system's clock: it can run normally using the 33.3333MHz crystal on the TIG, it can single-step the processor with the toggle switch, or it can run off of an adjustable, RC-generated clock that is also provided by the TIG but only used when debugging with the KM-11. This clock is set via a small trimpot on the TIG and allows running the system slower or faster than normal for margin testing.

With the RC clock selected and running at about 2/3 normal speed, the Load Address switch functioned without issues, even after warming up for several minutes. This pointed to a marginal component that worked properly when run below its rated speed but failed at full speed.
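To put a rough number on how much slack the slower clock buys, here is a back-of-the-envelope sketch. It assumes the RC clock simply stands in for the 33.3333MHz crystal, so "about 2/3 normal speed" is treated as 2/3 of the crystal frequency; the names are made up for illustration.

```python
# Assumption: the RC clock replaces the 33.3333 MHz crystal one-for-one,
# so "about 2/3 normal speed" means 2/3 of the crystal frequency.
CRYSTAL_HZ = 33.3333e6


def period_ns(freq_hz: float) -> float:
    """Return the clock period in nanoseconds for a frequency in Hz."""
    return 1e9 / freq_hz


normal_ns = period_ns(CRYSTAL_HZ)
slow_ns = period_ns(CRYSTAL_HZ * 2 / 3)

# Ratio of the two periods: how much longer each cycle lasts when slowed.
stretch = slow_ns / normal_ns

print(f"normal: {normal_ns:.1f} ns, slowed: {slow_ns:.1f} ns, x{stretch:.2f}")
```

Under that assumption each source clock period stretches by half again, which is plenty of extra settling time for a part that is only slightly too slow when warm.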

Since the issue didn't reproduce at slow speeds, stepping through the microcode to find the culprit wasn't an option. Before bringing out the Big Guns (i.e. The Logic Analyzer) I traced through the flows diagrams to see if any obvious test points emerged.

Looking at FLOWS 14, ADR.00 is the only possible culprit (as it's the only microinstruction in the path that modifies PCA, the internal representation of the PC register). Since the address isn't getting loaded, either PCA <- BR (load the PC with the Bus Register) isn't functioning, or the T5 clock that triggers the former isn't being generated.

The T5 signal looked fine on the scope, so the PCA <- BR logic was the next thing to look at. The signal that triggers this operation is generated on the DAP (DAta Paths) board below and is called DAPJ CLKPCA H. It's generated by the combination of the TIGD T5 H (T5 clock) and RACA UPCA H, a signal generated by the microcode PROMs for instructions that need to load PCA.
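As a minimal sketch of that gating, treating the combination as a plain AND of the two named signals (the 74S11 is a 3-input AND gate; any third input is assumed to be held high here, and the function name is hypothetical):

```python
def clkpca(t5: bool, upca: bool) -> bool:
    """Model DAPJ CLKPCA H as the AND of TIGD T5 H and RACA UPCA H.

    Assumption: any remaining input on the real gate is held high.
    """
    return t5 and upca


# A few T5 clock samples, alternating low and high.
t5_samples = [False, True, False, True]

# With UPCA asserted, the output simply follows the T5 clock.
healthy = [clkpca(t5, upca=True) for t5 in t5_samples]

# With UPCA never asserted, the output stays low however T5 toggles.
faulty = [clkpca(t5, upca=False) for t5 in t5_samples]

print("healthy:", healthy)
print("faulty: ", faulty)
```

The point of the model is just that a good T5 clock is not enough: if UPCA never arrives, PCA is never clocked and the address never gets loaded.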

Putting the scope on pin 6 of the 74S11 at E43 (i.e. DAPJ CLKPCA H) showed no output at all. The T5 clock input was clocking just fine, but the RACA UPCA H input was a flat line, indicating that the fault was elsewhere.

RACA UPCA H is generated on the RAC (ROM and Address Control) board, so the DAP board was reinstalled and the RAC board brought out on the extender for testing. And after doing so... the problem went away! I reinstalled the RAC board directly in the backplane and the issue remained fixed. Despite my best efforts I couldn't get the issue to rear its ugly head again, so I considered it fixed.

This is the first instance of an intermittent backplane connection causing issues, and I suspected it wouldn't be the last.

February 10-12: Making Instructions Execute

Now that it was possible to load addresses reliably via the front panel, it was much easier to toggle in short test programs to debug instruction execution. As you might recall from the last post, instructions wouldn't execute at all: Hitting "Start" would immediately return to the Halt state, and the PC would not be incremented. "Continue" behaved similarly.

Fortunately, this behavior was traceable using the KM-11 to single-step the microcode. Tracing through the microcode indicated that the microcode flow was aborting early and returning to the main console loop (via BRK.90) before the expected instruction fetch and PC increment at FET.10.

Under normal circumstances, the branch to BRK.90 would only be taken if the BRQ signal was set. BRQ (Brake ReQuest -- literally "put on the brakes, we're stopping this thing") indicates that the processor has been halted for some reason: the halt switch, power failure, an interrupt or a trap, for example, and causes a microcode jump back to the top of the main console loop, aborting the current operation. In this case, it was short-circuiting the instruction execution before anything at all could happen.

And sure enough, probing the BRQ signal (TMCB BRQ TRUE H) on the TMC (Traps and Misc. Control) board showed that it was stuck high.

As you can see the TMCB BRQ signals are generated based on a whole slew of inputs there on the left -- interrupt requests, traps, the whole lot -- all being OR'd together. Tracing backward from the 74S11 at E61, the 74S04 at E55 tested fine but the 74H30 at E70 had an output value stuck at 1.65V regardless of the inputs. This is not a valid TTL signal level for either a logical 0 or 1, stuck somewhere in no-man's land between the two. However, the 74S04 (inverter) at E55 thought it was just high enough to count as a "1" and thus provided a nice clean logical 0 at the output on pin 8, resulting in TMCB BRQ TRUE H being stuck high.
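To illustrate why 1.65V is trouble, here is a small sketch using the standard TTL input thresholds (a guaranteed 0 at or below 0.8V, a guaranteed 1 at or above 2.0V). The function name is made up for this example, and what a particular gate does with a voltage between the thresholds is not specified, which is exactly the problem.

```python
V_IL_MAX = 0.8  # volts: at or below this, a TTL input is a guaranteed 0
V_IH_MIN = 2.0  # volts: at or above this, a TTL input is a guaranteed 1


def ttl_level(volts: float) -> str:
    """Classify an input voltage against the standard TTL thresholds."""
    if volts <= V_IL_MAX:
        return "low"
    if volts >= V_IH_MIN:
        return "high"
    # Between the thresholds the receiving gate may read it either way.
    return "undefined"


for volts in (0.4, 1.65, 3.4):
    print(f"{volts:.2f} V -> {ttl_level(volts)}")
```

The stuck 74H30 output lands in the "undefined" band; the inverter at E55 just happened to read it as a 1.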


Debugging the TMC

I replaced the 74H30 and now individual instructions were executing, in a certain sense. Unfortunately, after each instruction, the PC contained garbage rather than the next address.

For example, executing a simple HALT instruction at address 1000 should increment the PC to 1002 and then halt the processor. What I was seeing was a halt with 3602 in the PC instead. Sometimes this value varied, randomly.

This one was puzzling and it took me a bit of poking around to figure out what was going on. The PC register (also referred to as R7) in the PDP-11 is actually split into two separate registers within the PDP-11 hardware: PCA and PCB. These are modified during instruction execution and are used to store intermediate values for various address calculations. As described in the introduction to Chapter 2 of the 11/70 Processor Manual:

The data paths diagram below outlines how data gets around the processor, as controlled by the microcode. PCA and PCB are in the middle near the top and you can see that after exiting the ALU, a calculated address goes to PCA and then through to PCB; PCB is used as the source value for operations involving the PC in many places:

There will be a quiz later.

So my suspicion immediately fell upon the poor PCB register, and my initial thought was that the transfer from PCA to PCB wasn't working properly and was picking up some stray bits in the process. The logic for these two registers is below (well, for bits 6-15, anyway, 0-5 are on another sheet):

PCA and PCB are built out of two sets of three 74S174 hex flip-flops. The controls to this consist of a "Clear" (which clears the register) and a "Clock", which loads the data present on the data lines into the flip-flops, thus storing a new value. As you can see, PCA is directly connected to PCB, and a transfer from A to B will occur when DAPJ CLKPCB H goes high, clocking the data from the output of A into B at the request of the microcode. The output of B connects to a multitude of other places in the processor. (The input of A comes from the ALU which itself takes inputs from a number of potential sources -- see the earlier paths diagram to see them all.)

If DAPJ CLKPCB H wasn't getting signaled then PCB wouldn't get updated at all from PCA and apart from that there's no other way for data to get into PCB. But clearly PCB was getting clocked, since the value in PCB was getting updated (just incorrectly). Was it possible that the outputs from PCA were incorrect? Unfortunately no -- probing the outputs of PCA showed the correct value.

Hmm.

After some more experimentation, I noticed that the corrupted data was only present in bits 6 through 11 of the final PC value. Executing a HALT at address 100000 (which should halt at 100002) instead halted at 102602 (or thereabouts -- as before the value was slightly random). A HALT executed from 0 halted at 2602, and so on. That corresponds nicely to the arrangement of the three 74S174's comprising PCB, each containing 6 bits. The "middle" 6 bits of PCB were in the '174 at H47 on the DAP board. I replaced this with a spare and afterwards, single-stepping instructions worked properly!
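The bit-level detective work is easy to reproduce: XOR the expected PC with the observed one and see which bits differ. The sketch below uses the three values quoted above; the helper names are hypothetical, and the mapping of bits to chips (six consecutive bits per 74S174, lowest bits first) is an assumption based on the description of the register.

```python
def bad_bits(expected: int, observed: int) -> list[int]:
    """Return the positions of the bits that differ between two words."""
    diff = expected ^ observed
    return [bit for bit in range(16) if (diff >> bit) & 1]


def chip_index(bit: int) -> int:
    """Assumed layout: 0 = bits 0-5, 1 = bits 6-11 (middle), 2 = bits 12-15."""
    return bit // 6


# (expected PC, observed PC) pairs in octal, taken from the text above.
cases = [(0o1002, 0o3602), (0o100002, 0o102602), (0o2, 0o2602)]

for expected, observed in cases:
    bits = bad_bits(expected, observed)
    chips = sorted({chip_index(b) for b in bits})
    print(f"{expected:06o} -> {observed:06o}: bad bits {bits}, chips {chips}")
```

All three cases differ by the same pattern (octal 2600), and every flipped bit falls in the middle group of six.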

I tried a variety of different instructions and single-stepping through them appeared to work correctly. However, at full speed a simple loop -- like a "BR .-1" (000777) -- would not loop for very long. After running for a few milliseconds at most it would somehow jump past the branch instruction and end up at the next address. Other short programs behaved similarly, running for a short time before ending up off in the weeds.
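For reference, a PDP-11 branch adds twice the sign-extended low byte of the instruction to the already-incremented PC. Here is a minimal sketch of that decode (the function name and the load address 1000 are just for illustration), showing where 000777 should land on a healthy machine:

```python
def branch_target(addr: int, instr: int) -> int:
    """Compute the target of a PDP-11 BR instruction located at addr."""
    # BR is opcode 000400 with an 8-bit signed word offset in the low byte.
    if (instr & 0o177400) != 0o000400:
        raise ValueError("not a BR instruction")
    offset = instr & 0o377
    if offset & 0o200:          # sign-extend the 8-bit offset
        offset -= 0o400
    # The PC has already advanced past the instruction when the offset is added.
    return (addr + 2 + 2 * offset) & 0o177777


target = branch_target(0o1000, 0o000777)
fall_through = 0o1000 + 2

print(f"target: {target:06o}, fall-through: {fall_through:06o}")
```

With a word offset of -1 the branch lands back on its own address, so the loop should spin there indefinitely; ending up at the fall-through address means the branch was somehow not taken.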

I spent the next few days poking and prodding, but the details of this must wait for the next EXCITING INSTALLMENT of this saga. Which hopefully I'll get written up before the heat death of the universe but we'll just have to see. Until then, keep on keepin' on.

Tuesday, September 28, 2021

PDP-11/70: In Which Josh Finally Writes Some More