After a recent SYSGEN the executive crashes when accessing an RK05 disk. This requires a reboot, and on a multi-user system the work of other users could easily get lost. (To avoid lossage, during the 1970's we got in the habit of saving our editor buffers every 10 minutes.)
The executive traps into XDT when a known good RK05 disk (device DK) is being accessed. This problem was easy to reproduce. When it occurs there is a prompt at the console from XDT. I/O from other devices and all processing stop:
    >all dk2:
    >ini dk2:
    OD:120334
    XDT>
This was not a problem on a recently generated RSX-11M V3.2 system. What changed between then and when the problem occurred?
RL01/2 and RP06 (devices DL and DB) disks work without any problem. Therefore I do not suspect F11ACP or the executive. This problem seems to be local to the DK driver.
Initially I suspected that applying autopatch "E" introduced an issue. But inspection showed that the driver source, DKDRV.MAC, was not patched. And examination of DKDRV.MAC seems to indicate that this driver is unchanged since RSX-11M V3.1. This suggests stable, well-tested code.
A few references are needed to understand the problem at a deeper level. These
documents from bitsavers were handy:
* PDP-11 Programming Card (Prog),
* PDP-11/70 Processor Handbook (Proc),
* RSX-11M V3.2 Guide to Writing an I/O Driver (Guide),
* IAS/RSX-11 ODT Reference Manual (ODT).
Since XDT has control of the system, let's use XDT to examine memory and hardware state. (Guide section 3.4: XDT is a subset of ODT).
My crash analysis technique is to quickly gather some info that almost always is needed. With that context I dig deeper. I will show a transcript of the XDT interaction. While using XDT, a <Line-Feed> character (Ctrl-J) can be typed to examine the next location. References to various info will be flagged by ">>> some text". Where ">>>" appears, the >>> and everything that follows on that line are my comments, not part of the XDT interaction.
    OD:120334            >>> OD = trap 4, PC of next instruction
    >>> CPU State
    XDT>$s/030000        >>> PSW
    >>> general purpose registers ...
    XDT>$0/000375        >>> R0
    $1 /063760           >>> R1
    $2 /177400           >>> R2
    $3 /063760           >>> R3
    $4 /053504           >>> R4
    $5 /053200           >>> R5
    $6 /000646           >>> R6: Stack Pointer
    $7 /120334           >>> R7: PC
    >>> Examine code at vicinity of failure
    XDT>120320/005412    >>> neg (r2)
    120322 /005742       >>> tst -(r2)
    120324 /012700       >>> mov #375,r0
    120326 /000375
    120330 /105762       >>> tstb -4(r2)
    120332 /177774
    120334 /100410       >>> bmi 120356
    120336 /132761       >>> bitb #4,12(r1)
    120340 /000004
    120342 /000012
The printout when the crash occurs is "OD:120334". (ODT section 5.2: "OD" indicates an odd address or other trap 4; 120334 is the address of the instruction after the one where the trap occurred.) (Prog: Trap 4 is due to Time Out and other errors.)
As expected PSW shows the CPU is executing code in kernel mode (Prog).
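The PSW fields can be pulled apart mechanically as a cross-check. The following is a minimal Python sketch, run on a modern machine rather than on the PDP-11; the function name `decode_psw` and the dictionary keys are made up for illustration, and the bit positions are the ones given on the programming card (current mode in bits 15-14, previous mode in bits 13-12, priority in bits 7-5).

```python
# Processor mode encodings used in PSW bits 15-14 and 13-12.
MODES = {0: "kernel", 1: "supervisor", 2: "illegal", 3: "user"}

def decode_psw(psw):
    """Split a 16-bit PSW value into the fields used in this analysis."""
    return {
        "current_mode": MODES[(psw >> 14) & 0o3],
        "previous_mode": MODES[(psw >> 12) & 0o3],
        "priority": (psw >> 5) & 0o7,
        "nzvc": psw & 0o17,          # condition codes N, Z, V, C
    }

# Value shown by XDT for $s at the time of the trap.
print(decode_psw(0o030000))
```

For the value 030000 this reports current mode kernel and previous mode user, at priority 0.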
R2 contains the address of the first RK11 device register (Prog). This is unusual. Typically PDP-11 device registers are referenced relative to the address of the Control and Status Register (CSR). For the first RK11 controller in a system, the CSR address is 177404.
Several words of code are examined before and after the PC of the failure. Using (Prog) they are disassembled by hand. Working forward from 120334 is easy. When working backwards, the words at 120332 and 120326 are impossible or unlikely to be instructions, so try again with the previous word.
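The hand disassembly can be double-checked with a few lines of code. This is only a sketch: it knows the three single-operand opcodes and the one branch that matter here (NEG, TST, TSTB, BMI) and three addressing modes, nothing more. The names `operand`, `branch_target` and `disassemble` are hypothetical helpers, not part of any DEC tool.

```python
REG = ["r0", "r1", "r2", "r3", "r4", "r5", "sp", "pc"]
SINGLE_OP = {0o0054: "neg", 0o0057: "tst", 0o1057: "tstb"}

def operand(field, next_word=None):
    """Format a 6-bit operand field (3-bit mode, 3-bit register)."""
    mode, reg = (field >> 3) & 0o7, REG[field & 0o7]
    if mode == 1:
        return f"({reg})"
    if mode == 4:
        return f"-({reg})"
    if mode == 6:
        # The index word follows the instruction; treat it as signed 16-bit.
        x = next_word - 0o200000 if next_word & 0o100000 else next_word
        return f"{'-' if x < 0 else ''}{abs(x):o}({reg})"
    raise NotImplementedError(f"mode {mode} not handled in this sketch")

def branch_target(addr, word):
    """Branch offset is a signed word count relative to the updated PC."""
    off = word & 0o377
    if off & 0o200:
        off -= 0o400
    return addr + 2 + 2 * off

def disassemble(addr, word, next_word=None):
    if word >> 8 == 0o201:                  # BMI is 1004xx
        return f"bmi {branch_target(addr, word):o}"
    name = SINGLE_OP.get(word >> 6)
    if name is None:
        return None                         # not an opcode this sketch knows
    return f"{name} {operand(word & 0o77, next_word)}"

print(disassemble(0o120330, 0o105762, 0o177774))
print(disassemble(0o120334, 0o100410))
```

Feeding it the words from the XDT transcript reproduces the comments written there, including the branch target 120356.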
To properly interpret addresses, we need to understand how RSX-11M has allocated memory. The lowest memory addresses, 000000 through 117777, (the first 20 KW) are used by trap and interrupt vectors, the kernel mode stack, the executive code and the pool (dynamic storage region). The PAR command shows how the remainder of memory is used. This command was executed just before the crash, but since partitions do not move it could also be done later when analyzing.
    >par
    LDRPAR 00120000 00002400 MAIN TASK
    TTPAR  00122400 00040000 MAIN TASK
    DRVPAR 00162400 00031300 MAIN SYS
           00162400 00002000 SUB  DRIVER -DB:
           00164400 00002600 SUB  DRIVER -DD:
           00167200 00001100 SUB  DRIVER -DK:
           00170300 00002000 SUB  DRIVER -DL:
           00172300 00002300 SUB  DRIVER -DM:
           00174600 00001100 SUB  DRIVER -DT:
           00175700 00002400 SUB  DRIVER -DY:
           00200300 00001300 SUB  DRIVER -CR:
           00201600 00000600 SUB  DRIVER -CT:
           00202400 00001100 SUB  DRIVER -LP:
           00203500 00004200 SUB  DRIVER -MM:
           00207700 00000500 SUB  DRIVER -PP:
           00210400 00000300 SUB  DRIVER -PR:
           00210700 00003000 SUB  DRIVER -CO:
    SYSPAR 00213700 00010000 MAIN TASK
    FCPPAR 00223700 00050000 MAIN SYS
           00223700 00047500 SUB  (F11ACP)
    FCSRES 00273700 00020000 MAIN COM
    RMSSEQ 00313700 00040000 MAIN COM
    GEN    00353700 16424100 MAIN SYS
           00353700 00020000 SUB  (...MCR)
           00433700 00040000 SUB  (COT...)
           00533700 00040000 SUB  (QMG...)
           00573700 00040000 SUB  (LPP0  )
    >
(Proc section 5.3, Processor Traps) Trap to vector 4 occurs for Odd Address, UNIBUS Time Out, or Non-Existent Memory. Because this is a mapped system, memory management state is needed to know what address is being accessed by the failing instruction at 120330. (Proc chapter 6) explains the PDP-11/70 memory management.
    XDT>177572/000201    >>> MMR0 - Relocation is enabled
    177574 /000000       >>> MMR1 - N/A (trap incremented registers)
    177576 /050420       >>> MMR2 - N/A (trap PC)
    XDT>172516/000060    >>> MMR3 - 22 bit, no D-Space, UNIBUS map enabled
    XDT>172300/077506    >>> KISDR0 read/write, full 4KW
    172302 /077506       >>> KISDR1 read/write, full 4KW
    172304 /077506       >>> KISDR2 read/write, full 4KW
    172306 /077506       >>> KISDR3 read/write, full 4KW
    172310 /077506       >>> KISDR4 read/write, full 4KW
    172312 /077506       >>> KISDR5 read/write, full 4KW
    172314 /077406       >>> KISDR6 read/write, full 4KW
    172316 /077506       >>> KISDR7 read/write, full 4KW
    172320 /000000       >>> KDSDR0 N/A
    >>> six N/A registers not shown
    172336 /000000       >>> KDSDR7 N/A
    172340 /000000       >>> KISAR0 maps 20KW exec + pool in low memory
    172342 /000200       >>> KISAR1 maps 20KW exec + pool in low memory
    172344 /000400       >>> KISAR2 maps 20KW exec + pool in low memory
    172346 /000600       >>> KISAR3 maps 20KW exec + pool in low memory
    172350 /001000       >>> KISAR4 maps 20KW exec + pool in low memory
    172352 /001672       >>> KISAR5 maps loadable DK driver (576. bytes)
    172354 /003564       >>> KISAR6 maps unknown task in partition GEN
    172356 /177600       >>> KISAR7 maps 4K peripheral device page
    172360 /000000       >>> KDSAR0 N/A
    >>> seven N/A registers not shown
(Proc chapter 6): MMR3 shows that 22-bit addresses are in use. MMR3 also
indicates that Data Space is not being used. Since the PSW shows current
mode is kernel, the kernel mapping registers are being used for address
translation. Kernel APRs map as follows.
* APR0 through APR4 map low memory (the executive).
* APR5 maps the DK driver and whatever immediately follows in memory.
* APR6 maps the address space of some task in the GEN partition.
* APR7 maps the I/O Page.
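The comments attached to the KISAR and KISDR values can be reproduced with a little arithmetic. The sketch below assumes the PDP-11/70 layout described in (Proc chapter 6): a PAR holds the page base in 32-word (64-byte) blocks, and a PDR holds the page length field in bits 14-8, the written bit in bit 6, the expansion direction in bit 3 and the access control field in bits 2-0. The function names are illustrative only.

```python
def par_base(par):
    """Physical base address of a page: the PAR counts 64-byte blocks."""
    return par << 6

def pdr_fields(pdr):
    """Decode the PDR fields used in this analysis."""
    plf = (pdr >> 8) & 0o177
    return {
        # Length as computed for an upward-expanding page.
        "length_bytes": (plf + 1) * 64,
        "written": bool(pdr & 0o100),
        "expand_down": bool(pdr & 0o10),
        "access": pdr & 0o7,             # 6 = read/write
    }

print(f"KISAR5 base = {par_base(0o001672):o}")   # start of the DK driver
print(f"KISAR7 base = {par_base(0o177600):o}")   # start of the I/O page
print(pdr_fields(0o077506))
```

KISAR5 works out to 167200, which is the DK driver's address in the PAR output, and every KISDR decodes to a full 8192-byte (4 KW) read/write page.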
ADDRESS TRANSLATION per (Proc chapter 6):
* XDT used the same APR5 when examining the code as was in use when
the trap occurred. This assures that the failing instruction at virtual
address 120330 is the "tstb -4(r2)" previously identified.
* The byte being tested is at virtual address (contents of R2) - 4 =
177400 - 4 = 177374(8).
* Virtual address 177374 maps through APR7 to physical address (PA) 17777374.
* (Proc section 6.7 Non-Existent Memory Errors): If the high 4 bits of
the PA are one bits, the lower 18 bits are used for a UNIBUS address. Thus
accessing PA 17777374 is mapped to a read of UNIBUS address 777374.
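The translation steps above can be sketched in a few lines of Python. This is a simplified model that ignores page length and access checks; the KISAR values are the ones read with XDT, and the function names `translate` and `unibus_address` are hypothetical.

```python
# Kernel I-space PARs as read from 172340-172356.
KISAR = [0o000000, 0o000200, 0o000400, 0o000600,
         0o001000, 0o001672, 0o003564, 0o177600]

def translate(va, pars=KISAR):
    """Map a 16-bit virtual address to a 22-bit physical address."""
    page = (va >> 13) & 0o7          # top 3 bits select the APR
    offset = va & 0o17777            # remaining 13 bits are the offset
    return ((pars[page] << 6) + offset) & 0o17777777

def unibus_address(pa):
    """Return the 18-bit UNIBUS address if the top 4 PA bits are all ones."""
    if pa >> 18 == 0o17:
        return pa & 0o777777
    return None                      # ordinary memory reference

va = (0o177400 - 4) & 0o177777       # operand of tstb -4(r2)
pa = translate(va)
print(f"VA {va:o} -> PA {pa:o} -> UNIBUS {unibus_address(pa):o}")
```

The same function maps the failing instruction's virtual address, 120330, to a physical address inside the DK driver's region at 167200.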
(Proc Appendix B): The CPU Error Register is at PA 17777766. Examining that register confirms the UNIBUS Time Out. This is reasonable: (Prog) and (Proc Appendix B) show the RK11 registers at UNIBUS addresses 777400 through 777416, and the SIMH configuration printout shows no device using 777374, so nothing responds to a read of that address.
    XDT>177766/000020    >>> CPU Error Register:
                         >>> Only bit 4 is set, thus UNIBUS Time Out.
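Decoding the CPU Error Register by hand is easy for a single bit, but a small table helps when several are set. In this sketch only bit 4 (UNIBUS Time Out) is confirmed by the analysis above; the other bit names are my reading of (Proc Appendix B) and should be checked against the handbook before relying on them. The table and function names are illustrative.

```python
# Bit masks for the PDP-11/70 CPU Error Register (assumed layout, see text).
CPU_ERROR_BITS = {
    0o200: "illegal HALT",
    0o100: "odd address error",
    0o040: "non-existent memory",
    0o020: "UNIBUS time-out",
    0o010: "yellow zone stack limit",
    0o004: "red zone stack limit",
}

def decode_cpu_error(value):
    """List the names of all error bits set in the register value."""
    return [name for mask, name in CPU_ERROR_BITS.items() if value & mask]

print(decode_cpu_error(0o000020))
```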
The crash definitely occurred in DKDRV. Let's look at the source code to get more insight. The Task Builder map of DKDRV.TSK shows there is only one PSECT, called ". BLK.". There is only one source module, DKDRV, that supplied all the contents of that PSECT. Addresses in that PSECT have been relocated to virtual 120000.
    MEMORY ALLOCATION SYNOPSIS:

    SECTION                    TITLE  IDENT  FILE
    -------                    -----  -----  ----
    . BLK.:(RW,I,LCL,REL,CON)  120000 001004 00516.
                               120000 001004 00516. DKDRV  08     RSX11M.OLB;1
Examination of source of DKDRV lets us confirm the manual code disassembly used earlier in the analysis. (Add 120000 to addresses in this listing to get the virtual address where the code will be at runtime.) This shows intent of the code -- this is important because we see the value in R2 gets modified. At line 239 R2 should contain the CSR address of the RK11 controller.
    228 000260 016402 000000G         MOV  S.CSR(R4),R2         ;GET ADDRESS OF CSR
    229 000264 016401 000000G         MOV  S.PKT(R4),R1         ;RETRIEVE ADDRESS OF I/O REQUEST PACKET
    230 000270 116464 000000G 000000G MOVB S.ITM(R4),S.CTM(R4)  ;SET CURRENT DEVICE TIMEOUT COUNT
    231 000276 062702 000006          ADD  #6,R2                ;POINT TO DISK ADDRESS REGISTER
    232 000302 016112 000034          MOV  I.PRM+10(R1),(R2)    ;INSERT DISK ADDRESS
    233 000306 016542 000002G         MOV  U.BUF+2(R5),-(R2)    ;INSERT BUFFER ADDRESS
    234 000312 016542 000000G         MOV  U.CNT(R5),-(R2)      ;INSERT NUMBER OF BYTES TO TRANSFER
    235 000316 006012                 ROR  (R2)                 ;CONVERT TO WORD COUNT
    236 000320 005412                 NEG  (R2)                 ;MAKE NEGATIVE WORD COUNT
    237 000322 005742                 TST  -(R2)                ;POINT BACK TO CSR
    238 000324 012700 000000C         MOV  #IE.DNR&377,R0       ;ASSUME DRIVE NOT READY
    239 000330 105762 177774          TSTB -4(R2)               ;IS DRIVE READY?
    240 000334 100410                 BMI  31$                  ;IF MI YES
The CSR address is loaded into R2 by line 228. R2 is modified, but when the trap occurred during the instruction on line 239, R2 should again contain the CSR address. R2 contained 177400; the correct CSR address is 177404. If R2 had contained 177404 the trap would not have occurred.
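To see why only the last reference traps, it helps to follow R2 through lines 228 to 239 for both the correct and the mistaken CSR value. The sketch below models nothing but the pointer arithmetic of the listing; `trace_r2` is a hypothetical helper, and the register range 177400 through 177416 is the RK11 range quoted earlier, expressed as virtual addresses in the I/O page.

```python
RK11_FIRST, RK11_LAST = 0o177400, 0o177416   # RK11 registers in the I/O page

def trace_r2(csr):
    """Return (R2 at line 239, addresses referenced) for lines 228-239."""
    r2 = csr                       # 228: MOV S.CSR(R4),R2
    refs = []
    r2 += 6                        # 231: ADD #6,R2
    refs.append(r2)                # 232: MOV ...,(R2)
    r2 -= 2
    refs.append(r2)                # 233: MOV ...,-(R2)
    r2 -= 2
    refs.append(r2)                # 234: MOV ...,-(R2); 235-236 reuse (R2)
    r2 -= 2
    refs.append(r2)                # 237: TST -(R2)
    refs.append(r2 - 4)            # 239: TSTB -4(R2)
    return r2, refs

for csr in (0o177404, 0o177400):
    r2, refs = trace_r2(csr)
    outside = [f"{a:o}" for a in refs if not RK11_FIRST <= a <= RK11_LAST]
    print(f"CSR {csr:o}: R2 at line 239 = {r2:o}, outside RK11 = {outside}")
```

With 177404 every reference stays inside the RK11 register block. With 177400 the earlier references still land on RK11 registers (the wrong ones), and only the final TSTB reaches 177374, where nothing answers.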
The wrong CSR address is due to an error made during the prepgen. The
default address provided by SYSGEN.CMD should not have been changed.
Similarly, the TC11 DECtape controller CSR address should not
override the default from SYSGEN.
    >; Enter [L/R,] vector, CSR, highest unit number <0 to 7> for:
    >;
    >* 6. DK controller 0 [D: 220,177404] [S]: ,177400,7
    ...
    >; Enter [L/R,] vector, CSR, number of drives for:
    >;
    >* 3. DT controller 0 [D: 214,177342] [S]: ,177340,8
These problems can be fixed by editing LB:[200,200]SYSSAVED.CMD and removing the explicit CSR addresses for the DK and DT devices. Then SYSGEN Phases 1 and 2 can be re-run. This is what was already demonstrated during the most recent SYSGEN. SYSGEN Phase 3 and what followed in that web page do not need to be done again.