Over the past few months, the "interim" computer used as emerald in the WIYN 0.9-m control room -- which is the main interface for users to the HDI dataserver, and a workhorse for data reduction -- has been crashing occasionally. During my own run in January 2017, the computer has crashed once per night on 3 of the 5 nights so far.
Could the problem be bad memory? Last night (Jan 14), at 10:22 PM local time, an error message appeared on all of emerald's terminal windows:
The computer continued to run without a problem after displaying this message, at least for an hour or two.
To check for memory errors, I downloaded the ISO image of the memtest86+ tool and burned it to a CD, which is labelled
MEMTEST86+ v 5.01 1/15/2017
and is currently sitting in the 0.9-m control room, in the bookcase high on the southeastern wall, inside a yellow paper sleeve.
I placed this CD into the drive on emerald and rebooted the machine. It immediately started running the memory test, using the default parameters. As it ran, the program displayed some useful information:
Clock       2667 MHz   x64 mode
L1 cache      32 K     88893 MB/s
L2 cache     256 K     35089 MB/s
L3 cache    8192 K     24466 MB/s
Memory      4086 M     11348 MB/s
RAM          666 MHz   (DDR3-1333)

Memory SPD info:
  Slot 0: 1024 MB DDR3-1333
  Slot 1: 1024 MB DDR3-1333
  Slot 2: 1024 MB DDR3-1333
  Slot 3: 1024 MB DDR3-1333
After one hour, the program was still running. It indicated that no errors had occurred. I tried to stop the program by pressing the "Esc" key, as indicated on the screen -- but there was no response. The testing did stop, but the display did not change otherwise. I was unable to cause the computer to take any action; it was frozen.
So, I rebooted by pressing the 'reboot' button on the front of the tower in the computer room.
When the memtest86+ program started to run again, I pressed "F1" to put the program into Fail Safe mode.
After about 20 minutes, the program finished "Pass 0" with no errors. I left it running and went to eat lunch.
When I returned, I found it had frozen again. The display showed "Pass 1: time 23:36" and "Pass 1: Errors 40". A screenshot is below.
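As you can see, the memory errors occur near address 1587.8 MB. If the four 1024 MB sticks are mapped one after another in the address space, that address falls between 1024 MB and 2048 MB, which (I believe) corresponds to the second memory stick, in Slot 1.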
I conclude that there is a bad memory stick in the computer, in Slot 1 of the motherboard.
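The slot estimate rests on an assumption that I have not verified: that the sticks are mapped sequentially into the address space, Slot 0 first, with no interleaving between memory channels. Under that assumption only, the lookup can be sketched as follows. The function name `slot_for_address` is just an illustration, not part of memtest86+.

```python
def slot_for_address(address_mb, stick_sizes_mb):
    """Return the slot index holding address_mb.

    ASSUMPTION: sticks are mapped back to back in slot order,
    with no channel interleaving. Real motherboards may differ.
    """
    start = 0.0
    for slot, size in enumerate(stick_sizes_mb):
        # Each stick covers the half-open range [start, start + size).
        if start <= address_mb < start + size:
            return slot
        start += size
    raise ValueError("address lies beyond the installed memory")


# Four 1024 MB sticks, as reported by the memtest86+ SPD info.
sticks = [1024, 1024, 1024, 1024]
print(slot_for_address(1587.8, sticks))  # prints 1
```

If the motherboard does interleave its memory channels, a single failing address could belong to a different stick, so swapping sticks between slots and re-running the test would be a cheap way to confirm which one is bad.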
The current memory sticks are type 1024 MB DDR3-1333, according to the memtest86+ program. I do not know how many pins they have, as I have not opened the tower case to check.
The computer site newegg.com does not list any memory sticks of the same size and type, but it does offer 2048 MB DDR3-1333 memory. For example, as shown at this link, there are a number of items available, at a typical price of $15 each.
Once someone has verified that sticks of this type have the proper pin arrangement to fit into emerald's slots, I recommend purchasing 2 sticks of this type, size 2048 MB, and placing them into slots 0 and 1 of the current machine. The two new sticks alone would provide 4096 MB, equal to the current total, or 6144 MB if the existing sticks are left in slots 2 and 3.
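The totals for each option can be checked with a few lines of arithmetic. This is only a sketch: the dictionaries map slot number to stick size in MB, and the names `current`, `new_only` and `mixed` are labels I have made up for the three configurations.

```python
def total_mb(slots):
    """Sum the sizes (in MB) of all sticks in a slot -> size mapping."""
    return sum(slots.values())


# Present configuration, from the memtest86+ SPD info.
current = {0: 1024, 1: 1024, 2: 1024, 3: 1024}

# Two new 2048 MB sticks in slots 0 and 1, slots 2 and 3 empty.
new_only = {0: 2048, 1: 2048}

# Two new sticks plus the existing sticks left in slots 2 and 3.
mixed = {0: 2048, 1: 2048, 2: 1024, 3: 1024}

for name, config in [("current", current), ("new_only", new_only), ("mixed", mixed)]:
    print(name, total_mb(config), "MB")
# prints:
# current 4096 MB
# new_only 4096 MB
# mixed 6144 MB
```

Both new configurations replace the stick in Slot 1, which is the one suspected of being bad.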