Origin 300 cluster power cycle issues.

nondem

New member
Aug 5, 2020
Sorry to have my first post be a cry for help, but we are still using an SGI cluster in production to model the weather.

We had to power-cycle the system for the first time in 20 years due to a UPS changeover, and I am trying to troubleshoot what appears to be a broken model run.

Is anyone good with these boxes? We actually have a replacement for this system in test, but it's got a few more weeks till it's ready.

Some initial questions:
I ran a shutdown command in the OS (IRIX), and after it completed they moved the power. It came right back up fine and appeared to be working.
This morning, three of the four modules were powered off and the weather model failed to run at midnight. We powered them up, but have no way to know whether it'll work until tonight's run happens. Since those modules are powered up now, will they automatically join the cluster and work, or is there something I need to enter at the L1: prompt (serial)?

The system is completely undocumented, and the online manuals I can find for it give cluster troubleshooting commands that apparently aren't on this system.

I inherited this system from a past sysadmin who didn't record anything, and there is no backup/test system to learn on... just this highly critical production box that we use to predict smoke plumes. This is why I've got a replacement system in the works... I just need to limp this one along for another few weeks.

Also, as a side note: when they pulled the power, the console never actually powered off... I can't imagine this has a 20+ year old UPS battery in the chassis somewhere that is still good? Any thoughts?
 

nondem

New member
Aug 5, 2020
As an added note: should I reboot the OS now that I've powered up the cluster modules, so it sees them at boot? hinv -vm apparently builds its inventory at boot, so it looks like it doesn't include the modules that were powered down.
I could post the output but don't want to flood the thread.
 

Elf

Storybook
Feb 4, 2019
Welcome to the User Group :)

I think we have at least one or two people that are familiar with NUMAlink'ed Origin 300/350s. I will ping them and see if they can at least offer some basic advice. Also, don't be shy about pasting output for troubleshooting purposes. Feel free to inline it as a code block or attach text files to the post.
 

CiaoTime

Public Enemy Number One
Jan 15, 2020
Oakfield, Nova Scotia, Canada
Odd stuff! There's no battery backup for the L1 on board - but if the system was soft powered down, the L1 will continue to run in the background, if that's what you're asking. The L1 is always running as long as the system has some input power.

If you're at the L1 prompt over serial, the command to start up all bricks in the system is as follows:

* pwr up

The asterisk is critical: it's the keystroke that tells the main CPU brick to power up every brick that's linked to it, and without it, only one of the four bricks will come up on reboot. All the bricks in the system should be plugged in and showing the 'Powered Down' state before this command is run. Don't try pressing the power buttons on the front!

Let me know if that's any help!
 

HAL

Administrator
Oct 22, 2019
Hi,
first of all I wonder how the O300 bricks are connected to each other. Is it a cluster setup with 4 bricks where each of them boots IRIX, or are they connected through those thick cables (NUMAlink), with only 1 brick booting and the others providing their resources to a single system image?
The available resources are shown in the hinv (type hinv -v in a shell); so if the hinv only shows 2 or 4 CPUs, then you will have to reboot the bootable brick so it can discover the other bricks and add their resources to the system (this applies to a NUMAlink'ed setup).
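For example, here's a minimal sketch of that check run against a saved copy of the hinv output - the /tmp/hinv.txt file name is just an example, and the sample lines embedded below are trimmed placeholders in the format hinv prints:

```shell
# Sketch: see what the OS saw at boot from a saved `hinv -vm` capture.
# /tmp/hinv.txt and its sample lines are illustrative stand-ins.
cat > /tmp/hinv.txt <<'EOF'
CPU 0 at Module 001c02/Slot 0/Slice A: 1.0 Ghz MIPS R16000 Processor Chip (enabled)
CPU 1 at Module 001c02/Slot 0/Slice B: 1.0 Ghz MIPS R16000 Processor Chip (enabled)
CPU 2 at Module 001c02/Slot 0/Slice C: 1.0 Ghz MIPS R16000 Processor Chip (enabled)
CPU 3 at Module 001c02/Slot 0/Slice D: 1.0 Ghz MIPS R16000 Processor Chip (enabled)
EOF

grep -c '^CPU ' /tmp/hinv.txt                           # total CPUs at last boot
grep -o 'Module 001c[0-9a-f]*' /tmp/hinv.txt | sort -u  # distinct bricks seen
```

If all four bricks had joined, you'd expect to see four distinct module IDs here; only one module ID means only one brick's resources made it into the system image.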
 

nondem

New member
Aug 5, 2020
First THANKS to all of you for the replies!

Here is the current hinv output:

vortex 1# hinv -vm
Location: /hw/module/001c02/node
IP59_4CPU Board: barcode NSG072 part 030-1989-003 rev -B
Location: /hw/module/001c02/IXbrick/xtalk/15
2U_INT_53 Board: barcode NST935 part 030-1809-006 rev -B
Location: /hw/module/001c02/IXbrick/xtalk/15/pci-x/0/1/ioc4
IO9 Board: barcode NSW748 part 030-1771-005 rev -B
4 1.0 GHZ IP35 Processors
CPU: MIPS R16000 Processor Chip Revision: 3.0
FPU: MIPS R16010 Floating Point Chip Revision: 3.0
CPU 0 at Module 001c02/Slot 0/Slice A: 1.0 Ghz MIPS R16000 Processor Chip (enabled)
Processor revision: 3.0. Scache: Size 16 MB Speed 333 Mhz Tap 0x15
CPU 1 at Module 001c02/Slot 0/Slice B: 1.0 Ghz MIPS R16000 Processor Chip (enabled)
Processor revision: 3.0. Scache: Size 16 MB Speed 333 Mhz Tap 0x15
CPU 2 at Module 001c02/Slot 0/Slice C: 1.0 Ghz MIPS R16000 Processor Chip (enabled)
Processor revision: 3.0. Scache: Size 16 MB Speed 333 Mhz Tap 0x15
CPU 3 at Module 001c02/Slot 0/Slice D: 1.0 Ghz MIPS R16000 Processor Chip (enabled)
Processor revision: 3.0. Scache: Size 16 MB Speed 333 Mhz Tap 0x15
Main memory size: 4096 Mbytes
Instruction cache size: 32 Kbytes
Data cache size: 32 Kbytes
Secondary unified instruction/data cache size: 16 Mbytes
Memory at Module 001c02/Slot 0: 4096 MB (enabled)
Bank 0 contains 512 MB (Premium) DIMMS (enabled)
Bank 1 contains 512 MB (Premium) DIMMS (enabled)
Bank 2 contains 512 MB (Premium) DIMMS (enabled)
Bank 3 contains 512 MB (Premium) DIMMS (enabled)
Bank 4 contains 512 MB (Premium) DIMMS (enabled)
Bank 5 contains 512 MB (Premium) DIMMS (enabled)
Bank 6 contains 512 MB (Premium) DIMMS (enabled)
Bank 7 contains 512 MB (Premium) DIMMS (enabled)
Integral SCSI controller 2: Version IDE (ATA/ATAPI) IOC4
CDROM: unit 0 on SCSI controller 2
Integral SCSI controller 0: Version QL12160, low voltage differential
Disk drive: unit 1 on SCSI controller 0 (unit 1)
Integral SCSI controller 1: Version QL12160, low voltage differential
IOC3/IOC4 serial port: tty3
IOC3/IOC4 serial port: tty4
IOC3/IOC4 serial port: tty5
IOC3/IOC4 serial port: tty6
Integral Gigabit Ethernet: tg0, module 001c02, PCI bus 1 slot 4
PCI Adapter ID (vendor 0x10a9, device 0x100a) PCI slot 1
PCI Adapter ID (vendor 0x1077, device 0x1216) PCI slot 3
PCI Adapter ID (vendor 0x14e4, device 0x1645) PCI slot 4
IOC4 firmware revision 83
IOC3/IOC4 external interrupts: 1
HUB in Module 001c02/Slot 0: Revision 2 Speed 200.00 Mhz (enabled)
IP35prom in Module 001c02/Slot n0: Revision 6.210
 

CiaoTime

Public Enemy Number One
Jan 15, 2020
Oakfield, Nova Scotia, Canada
I spat up a little bit of coffee this morning when I read that hinv output, hahah.

IP59_4CPU Board: barcode NSG072 part 030-1989-003 rev -B
What you have there is an Origin 350, not a 300 - with the exceptionally rare IP59 board. Each of the processors on it has double the cache of the lesser models and runs 200 MHz faster - it's the fastest mainboard SGI ever made! Very desirable. Sounds like you've got a beautiful system to work with - at least for that first brick.
 

HAL

Administrator
Oct 22, 2019
ok,
your hinv shows the resources of 1 brick, so if you have a NUMAlink'ed setup of 4 bricks you will have to reboot in order to add the resources of the remaining 3 bricks. Reboot, and already during the POST you will see it running discovery and hopefully detecting the other bricks.
The fact that you have an O350 4x1GHz also means that your setup is not 20 years old - that machine was released around 2004.
 

nondem

New member
Aug 5, 2020
How do I do the reboot you speak of? I've rebooted the OS, but I assume you mean the cluster hardware.

We did a * pwr up at the L1 prompt, and the hardware check came back with this:

Checking hardware inventory ...............
***Warning: Board in module 001c04 is missing or disabled
It previously contained a node-board, barcode RBS343 laser 627d35c3

***Warning: Board in module 001c04 is missing or disabled
It previously contained a IXBRICK board, barcode RBH247 laser 62755bf7

***Warning: Board in module 001c04 is missing or disabled
It previously contained a IXBRICK board, barcode RBH247 laser 62755bf7

***Warning: Board in module 001c10 is missing or disabled
It previously contained a node-board, barcode RBT012 laser 627ddc66

***Warning: Board in module 001c10 is missing or disabled
It previously contained a IXBRICK board, barcode RBX743 laser 6280d943

***Warning: Board in module 001c10 is missing or disabled
It previously contained a IXBRICK board, barcode RBX743 laser 6280d943

***Warning: Board in module 001c08 is missing or disabled
It previously contained a node-board, barcode RBT018 laser 627ddc6c

***Warning: Board in module 001c08 is missing or disabled
It previously contained a IXBRICK board, barcode RBX752 laser 6280d966

***Warning: Board in module 001c08 is missing or disabled
It previously contained a IXBRICK board, barcode RBX752 laser 6280d966
DONE
 

nondem

New member
Aug 5, 2020
More output that might mean something to you experts:


001c02-L1>reset

returning to console mode 001c02 CPU0, <CTRL_T> to escape to L1
Starting PROM Boot process


IP35 PROM SGI Version 6.210 built 02:33:51 PM Aug 26, 2004
Testing/Initializing memory ............... DONE
Copying PROM code to memory ............... DONE
Discovering local IO ...................... DONE
Discovering NUMAlink connectivity .........
Local hub NUMAlink is down.
*** Local network link down
DONE
 

nondem

New member
Aug 5, 2020
HAL said:
"The fact that you have an O350 4x1ghz means also that your setup is not 20 years old - that very machine was released around 2004."
It's possible that it was upgraded in 2004 - at that time it was entirely managed by a vendor.
 

CiaoTime

Public Enemy Number One
Jan 15, 2020
Oakfield, Nova Scotia, Canada
Odd indeed! The L1 controller on c02 is able to see the hardware configuration of c04, c08, and c10 - but then it reports that NUMAlink is down during the normal boot process? That seems like a contradiction at first glance...

The warnings during the hardware inventory process are nothing to worry about: these systems autoconfigure themselves when they're all set up properly, and those messages are just saying that the current configuration differs from the last known configuration.
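If you want a quick summary of which modules those warnings refer to, here's a minimal sketch run against a saved copy of that inventory output - the /tmp/l1check.txt file name is illustrative, and the embedded sample lines are trimmed from the output you pasted:

```shell
# Sketch: list each module the L1 inventory check flagged, once each.
# /tmp/l1check.txt stands in for a saved copy of the console output.
cat > /tmp/l1check.txt <<'EOF'
***Warning: Board in module 001c04 is missing or disabled
***Warning: Board in module 001c04 is missing or disabled
***Warning: Board in module 001c10 is missing or disabled
***Warning: Board in module 001c08 is missing or disabled
EOF

grep 'missing or disabled' /tmp/l1check.txt \
  | grep -o 'module 001c[0-9a-f]*' \
  | sort -u                                  # each affected brick listed once
```

In your case that boils down to c04, c08, and c10 - i.e. exactly the three bricks that were found powered off.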

Would it be possible to get a picture of the back of the system? I'm curious to see if everything's wired up correctly -- though I do imagine it must be, if only the power setup was altered.
 

Unxmaal

Administrator
Feb 8, 2019
From my own bitter experience, I'll recommend that you completely disconnect the NUMAlink cables, then very carefully reconnect them. They can appear to be connected but in fact not be properly seated. Get a flashlight, get up close, and compare the distances from the metal connector housings to the plastic NUMAlink connectors once plugged in. It should be even all the way across. You should also see green and amber lights on the left side of the connectors.
 

weblacky

New member
Jan 13, 2020
Seattle, WA
Nondem,
CiaoTime's first response was actually the critical one, and perhaps you misunderstood it. Later SGIs have a mini embedded system called the L1; in clusters like yours, those mini systems in each brick are controlled by a device in your rack (or a laptop with software) called the L2. All of this has nothing to do with IRIX.

CiaoTime said that under normal circumstances, when you power cycle and boot, only one CPU brick out of a group of up to four will power on; the rest will seem to remain unpowered. You need to use the L1 in the CPU brick that did power up (or, I think, the L2 controller - someone please check that for me, I may be wrong) to perform the power-up command that brings the remaining bricks online correctly.

The reason you're getting the errors has to do with serial numbers. All clusters have a serial number layout and remember their former numbers via the L2. The units lose these numbers when powered down. The L2 helps organize and restore these numbers in each brick as it boots and puts them back into a single system image cluster.

When you tried to manually power them on, they booted without L2 control, never got a serial number, and were rejected by the startup process as unknown bricks.

What was asked is that you go back to the starting point: power on the rack and see only 1 brick of several start. Go to the L1 of that brick and issue the power-up command for the remaining bricks via the command stated earlier; the main CPU brick will then soft-power-cycle the other bricks and reestablish the cluster setup for you.

It sounds dumb that all this isn't remembered, but it's not: this process allows easier brick replacement and such. So I believe this all comes down to where you were when you initially powered on the system.

You misstepped in the startup process of the other bricks, and that's why you had this error. Go back, use the proper commands to do the two-stage power-on, and then let us know what happened.

I think this can also be done from the L2 interface, but someone with more practical experience than me needs to take you through that.

But that's the reason you're seeing what's happening now. You need the main CPU brick to bring up the other bricks for you; don't try to press the power buttons on them to manually start them. That's incorrect cluster startup (as CiaoTime mentioned in the first response).

Good luck.
 

CiaoTime

Public Enemy Number One
Jan 15, 2020
Oakfield, Nova Scotia, Canada
Ooh, yeah, I hadn't even considered that his setup might have an L2 controller. (It probably doesn't, though.) The L2 on the Origin 350 was an optional item: it's an extra piece of hardware sitting outside the chassis that you can connect to for total rack control over LAN or a master serial terminal.

It's usually used for -really- big racks. On a 4-brick system (which likely has only two CPU nodeboards), an L2 is overkill; running a proper startup from L1 should be enough to get everything sorted out. L1 controllers can talk to each other over NUMAlink: connecting to one CPU brick with a serial cable should let you control all four.
 

weblacky

New member
Jan 13, 2020
Seattle, WA
Good to know - it seemed like an expensive system (newest hardware), so why not an L2? It would make things easier to admin. It's just money :).
 

nondem

New member
Aug 5, 2020
I shut down the OS and unplugged the power, then waited until the L1 console reported it was shutting down and I lost the connection.
Then I plugged it back in and got this:

001c02-L1>INFO: 001c02 will power up system in 85 seconds...
INFO: 001c02 will power up system in 80 seconds...
INFO: 001c02 will power up system in 75 seconds...
INFO: 001c02 will power up system in 70 seconds...
INFO: 001c02 will power up system in 65 seconds...
INFO: 001c02 will power up system in 60 seconds...
INFO: 001c02 will power up system in 55 seconds...
INFO: 001c02 will power up system in 50 seconds...
INFO: 001c02 will power up system in 45 seconds...
INFO: 001c02 will power up system in 40 seconds...
INFO: 001c02 will power up system in 35 seconds...
INFO: 001c02 will power up system in 30 seconds...
INFO: 001c02 will power up system in 25 seconds...
INFO: 001c02 will power up system in 20 seconds...
INFO: 001c02 will power up system in 15 seconds...
INFO: 001c02 will power up system in 10 seconds...
INFO: 001c02 will power up system in 5 seconds...
INFO: 001c02 powering up the system.

It stopped there, and so far none of the bricks have power lights on them.
Is there some point in this process where I should press the power buttons on each brick?
 

weblacky

New member
Jan 13, 2020
Seattle, WA
Again, I'll wait for someone else to confirm this. Don't press the power buttons; use the L1 to power up everything. The L1 is always running as long as the power supply has power. If you press the power buttons, you'll be back to the same errors due to an improper startup procedure.

So since you see L1 output, I assume you can type input.
Now type the command:

* pwr up

into the L1 and see if everything starts powering up.
 

nondem

New member
Aug 5, 2020
I just got an L1 prompt and entered * pwr up; it came back without saying anything:

001c02-L1>* pwr up
001c02-L1>

All of the bricks' power lights are still off.
I don't know if I should try to power them with the buttons on them - and if so, at what point in the process? I know you said not to, so I haven't yet.
 

About us

  • Silicon Graphics User Group (SGUG) is a community for users, developers, and admirers of Silicon Graphics (SGI) products. We aim to be a friendly hobbyist community for discussing all aspects of SGIs, including use, software development, the IRIX Operating System, and troubleshooting, as well as facilitating hardware exchange.
