Sorry to have my first post be a cry for help, but we are still using an SGI cluster in production to model the weather.
We had to power-cycle the system for the first time in 20 years due to a UPS changeover and I am trying to troubleshoot what appears to be a broken model-run.
Is anyone good with these boxes? We actually have a replacement for this system in test but it's got a few more weeks till it's ready.
Some initial questions:
I ran a shutdown command in the OS (IRIX), and after it completed they moved the power. It came right back up fine and appeared to be working.
This morning, three of the four modules were powered off and the weather model failed to run at midnight. We powered them up but have no way to know whether it'll work until tonight's run happens. Since those modules are powered up now, will they automatically join the cluster and work, or is there something I need to enter at the L1: prompt (serial)?
The system is completely undocumented and the online manuals I can find for this system give cluster troubleshooting commands that apparently aren't on this system.
I inherited this system from a past sysadmin who didn't record anything, and there is no backup/test system to learn on...just this highly critical production box that we use to predict smoke plumes. This is why I've got a replacement system in the works...I just need to limp this one along for another few weeks.
Also, as a side note: when they pulled the power, the console never actually powered off. I can't imagine this has a 20+ year old UPS battery in the chassis somewhere that is still good? Any thoughts?