Requesting assistance with disabled CPUs on Origin 2000

mcguire

New member
Greetings, learned colleagues. First post here, please forgive any breach of etiquette. I'm writing to request assistance with the resurrection of an Origin 2400.

We at LSSM are trying to awaken this 2400 after a few years' slumber. This system was mine personally from about fifteen years ago. We had it running at the museum at one time, but rotated it out of the exhibit area to make room for new machines about five years ago. It has sat in a climate-controlled warehouse since then. The config is two chassis in a rack, 16x 400MHz R12K, 32GB of RAM.

It was hanging on power-up, with N4 of the top chassis showing 0x9a ("CPU disabled by an environment variable") for both CPUs. The machine comes up fine if I pull that node board. I've tried cleaning and reseating the DIMMs and the node itself, no joy.

Does anyone have any suggestions as to next steps to diagnose and recover that last node?

I'll be working over there this afternoon and all day tomorrow, and can get additional information if needed. Any and all assistance is greatly appreciated.

Thanks,
-Dave McGuire, LSSM
 
Hi! I'd say enter POD mode and try activating the CPUs again. That's probably not going to fix it, but maybe that, together with some POD debug flags, will turn up other hints about what caused the CPUs to be disabled.
 
Dave, on the off chance that it turns out to be a bad DIMM I just ran across a bunch of Onyx2/Origin 2000 DIMMs that I bought as spares when I had a couple of Onyx2s. I’m assuming the 2400 uses the same DIMMs, so if you need some spares let me know because I have no use for them now.
 
Hi, thanks for the suggestion. The problem with this approach is that the problematic node seems to hang the machine when it's installed; I can't get to a prompt. Is there a way to get it to give me control, so I can enter POD mode?

-Dave
 
Sure: stop it from booting normally. This assumes you don't actually have a real hardware problem. If this is all a software issue, the trouble is likely the way you're entering/booting the PROM.

From the L1 console, immediately after applying AC power to the system and before the PROM boots, try:

Turn off automatic power-on with: "autopower off"

Stay in the L1 and issue a "debug 0x10d"

That will stop diags and dump into POD forcibly at future PROM boots, then you can try resetting stuff.

Use "serial all" to check your DIMM recognition (this doesn't detect bad memory, only whether each DIMM is present or not).

Then do a manual "pwr up" to boot PROM with above debug changes.

If you get it working, issue "debug 0" from the L1 to go back to normal boot. You can then set auto power back to on with "autopower on".
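
For anyone who wants the order of operations in one place, here is a minimal sketch that just prints the steps above as a checklist. The command strings are the ones quoted in this post; the Python names (DEBUG_BRINGUP, RESTORE_NORMAL, checklist) are made up for illustration and are not part of any SGI tool. It does not talk to the console, it only lists what to type.

```python
# Commands are quoted from the post above. The list names and the
# (command, purpose) layout are illustrative assumptions only.

DEBUG_BRINGUP = [
    ("autopower off", "stop the system from powering on automatically"),
    ("debug 0x10d", "stop diags and drop into POD at future PROM boots"),
    ("serial all", "check which DIMMs are recognized (presence only)"),
    ("pwr up", "manual power-up so PROM boots with the debug setting"),
]

RESTORE_NORMAL = [
    ("debug 0", "go back to normal boot"),
    ("autopower on", "turn automatic power-on back on"),
]


def checklist(steps):
    """Return numbered, human-readable lines for (command, purpose) pairs."""
    return [
        f"{i}. {cmd:<14} # {why}"
        for i, (cmd, why) in enumerate(steps, start=1)
    ]


if __name__ == "__main__":
    print("At the L1 prompt, before PROM boots:")
    print("\n".join(checklist(DEBUG_BRINGUP)))
    print("Once the machine is behaving again:")
    print("\n".join(checklist(RESTORE_NORMAL)))
```

Keeping the "restore" steps in their own list is deliberate: it is easy to forget to undo the debug setting once the machine finally boots.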
 
Hi Weblacky, thanks for the suggestion. Do you mean connecting to the MSC?

-Dave
 
Reawakening this old thread with a report of success. We were building out new exhibits and I saw the '2400, which has always been one of my favorite machines due to some personal connections, sitting on the floor looking sad. I was about to rally the troops to roll it back to the warehouse when I decided to take another crack at it. I did a lot of reading in various places, in particular the "purple book", some old posts from the redoubtable jan jaap and a post on the Higher Intellect wiki.

The machine is an Origin 2000, rack config with an MMSC, two modules, eight dual 400MHz R12K nodes, 32GB of RAM. It has been mine for about twenty years but had originally been installed at NASA Goddard Space Flight Center in Greenbelt, Maryland, USA. It now resides at the Large Scale Systems Museum in Pittsburgh, PA, where it is an operational and available exhibit system.

It took nearly a week of work, but now it's back up and running in its full configuration. My approach was to break it down to the simplest configuration (one module, one node) and build it up from there, piece by piece, not moving forward until each problem was fully resolved. Broadly, the major issues were:

- a piece of schmutz that somehow made it onto one node's compression connector
- swapped upper/lower cables on the MMSC
- "not quite dead" Dallas chips "sometimes" retaining config information
- uncleared error logs
- poor seating of router boards in top chassis (probably due to movement)
- corrupt PROM on a node board
- dirty DIMM contacts on another node board
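
The "simplest configuration first, then add one piece at a time" approach can be sketched roughly as below. This is only an illustration of the method, not a tool that was used on the machine: the part names are hypothetical, and `passes` stands in for whatever manual check you do at each step (power up, watch the display, run hinv).

```python
def bring_up(components, passes):
    """Add components one at a time; stop at the first configuration that fails.

    components: ordered list of (hypothetical) part names, simplest first.
    passes:     callable taking the list of installed parts and returning
                True if the machine comes up cleanly in that configuration.
    Returns (installed_ok, culprit); culprit is None if everything passed.
    """
    installed = []
    for part in components:
        trial = installed + [part]
        if not passes(trial):
            # Do not move forward until this part's problem is resolved.
            return installed, part
        installed = trial
    return installed, None


if __name__ == "__main__":
    # Simulated example: pretend one router board is badly seated.
    parts = ["module1/node1", "module1/node2", "module1/router", "module2/node1"]
    bad = {"module1/router"}
    good, culprit = bring_up(parts, lambda cfg: not (bad & set(cfg)))
    print("known good:", good)
    print("fix next:", culprit)
```

Once the culprit is fixed, rerun from the start; with several overlapping faults, each pass only ever exposes the next one.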

All of these factors together basically made the machine a big frustrating mess to troubleshoot, but I worked through it and was rewarded with:

>> hinv
System SGI-IP27
16 400 MHz IP27 Processors
Main memory size: 32768 Mbytes
...

I then used the amazing Love installer to get 6.5.30 onto the machine. Today I'll be installing random software on it and cleaning up the cabling.

-Dave McGuire, LSSM
 
It's been a long time, but here goes:
go into POD mode
clearallogs
initalllogs
clear
reset
If that doesn't work, reset memory and try again.
If that doesn't work, it's likely scrap.
Good luck.
 
