Handy Tech & Associates LLC

 

 

Online spare memory

Online spare memory is a technology that increases availability by enabling an administrator to wait until a scheduled downtime to replace a faulty DIMM.

Description of online spare memory

In earlier-generation servers and in servers without online spare memory, when a memory module experiences an excessive number of correctable single-bit errors, the system issues a prefailure warning and the DIMM continues to function in its degraded state. If this occurs, the recommended procedure is to power off the server as soon as possible, replace the faulty DIMM, and restart the system.

With online spare memory, a memory bank with a faulty DIMM automatically fails over to a spare bank of DIMMs. When the faulty DIMM reaches a predefined single-bit error threshold, the ROM starts copying the contents of the failing bank to the spare bank in 128 KB increments. During the copy, the failing bank continues to service all reads, and every write goes to both banks so that the spare bank stays current.

After memory copying is complete, the system ROM switches to the spare bank. At that point, no more reads or writes are made to the failing bank. All reads and writes are made to the spare bank.
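The failover sequence described above can be sketched as a small simulation. This is only an illustrative model, not HP's ROM code: the class name `OnlineSpare`, its methods, and the use of byte arrays to stand in for memory banks are all assumptions made for the example. Only the 128 KB increment, the read-from-failing-bank rule, and the write-to-both-banks rule come from the text.

```python
CHUNK = 128 * 1024  # copy increment taken from the text (128 KB)


class OnlineSpare:
    """Toy model of a failing bank being copied to a spare bank (hypothetical)."""

    def __init__(self, size):
        self.failing = bytearray(size)
        self.spare = bytearray(size)
        self.copied = 0              # number of bytes copied so far
        self.copying = False
        self.active = self.failing   # bank that services reads

    def start_copy(self):
        # In the real system this is triggered by the single-bit error threshold.
        self.copying = True

    def copy_step(self):
        """Copy one 128 KB increment; switch to the spare bank when finished."""
        if not self.copying:
            return
        end = min(self.copied + CHUNK, len(self.failing))
        self.spare[self.copied:end] = self.failing[self.copied:end]
        self.copied = end
        if self.copied == len(self.failing):
            self.copying = False
            self.active = self.spare  # all later reads and writes use the spare

    def write(self, addr, value):
        if self.copying:
            # During the copy, data is written to both banks.
            self.failing[addr] = value
            self.spare[addr] = value
        else:
            self.active[addr] = value

    def read(self, addr):
        # During the copy, the failing bank still provides all reads.
        return self.active[addr]
```

Because writes go to both banks while the copy is in progress, a write to a region that has not been copied yet is not lost: the later copy step simply overwrites the spare with the same value.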

During a scheduled shutdown, the faulty DIMM is replaced with a functioning DIMM. When the server restarts, the memory banks resume their normal functions.

Special requirements for online spare memory

Online spare memory has the following requirements:

  • Online spare memory must be configured in the ROM-Based Setup Utility (RBSU). If you do not configure the server for online spare memory, all memory slots can be used for main memory up to the system maximum.
  • In early ProLiant Generation 2 (G2) servers, the spare memory bank was set by the system ROM and could not be changed. In later servers, the spare bank was set to the last bank populated.

    Note: Check the Maintenance and Service Guide to determine how spare memory is set for each server. Refer to the User Guide for complete memory requirements.

  • If you are using online spare memory, the spare memory bank is not counted during the power-on self-test (POST) and is not added to the system memory count reported to the operating system.
  • The DIMMs in the spare bank must be the same size as or larger than those in the other banks.
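As a rough illustration of the last two requirements, the following sketch checks a bank layout and computes the memory that would be reported. The function name `online_spare_layout` and the data layout (a list of banks, each a list of DIMM sizes in MB) are hypothetical. It also assumes the later-server behavior in which the last populated bank is the spare; check the server's documentation for the actual rule.

```python
def online_spare_layout(banks):
    """Return (system_mb, spare_mb) for a list of populated banks.

    banks: list of banks in population order, each a list of DIMM sizes in MB.
    Assumption: the last populated bank acts as the spare.
    """
    if len(banks) < 2:
        raise ValueError("need at least one main bank and one spare bank")
    *main, spare = banks
    largest_main_dimm = max(size for bank in main for size in bank)
    # Spare DIMMs must be the same size as or larger than those in other banks.
    if min(spare) < largest_main_dimm:
        raise ValueError(
            "Current memory configuration does not support Online Spare")
    # The spare bank is excluded from the memory count reported to the OS.
    system_mb = sum(sum(bank) for bank in main)
    return system_mb, sum(spare)
```

For example, two banks of 2 x 512 MB plus a spare bank of 2 x 1024 MB would report 2048 MB of system memory with 2048 MB reserved for the spare.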

 

Configuring online spare memory

When you configure online spare memory, HP recommends that you test the new memory first and then enable the spare, as follows:
 

  1. Test the new memory.
     
    1. From the Advanced Options screen in the RBSU, disable POST speedup.
    2. From the Advanced Memory Protection screen, disable Online Spare with ECC support.
    3. Restart the system to begin testing the memory. This may take a few minutes, depending on how much memory is installed in the system.
  2. Perform these steps after the memory has been tested.
    1. Enable POST speedup for faster system starts.
    2. Power down the system.
    3. Verify that bank C is populated with memory that is no smaller than the memory in bank A or bank B.
  3. Configure the online spare.
    1. Power on the server. Online spare memory is disabled by default; therefore, all the memory is initially counted and configured as available primary memory.
    2. At the prompt, press F9 to enter RBSU.
    3. From the Advanced Memory Protection screen, enable Online Spare with ECC support. Press ESC twice to return to the main RBSU menu.
    4. Press F10 to exit RBSU and restart your server. When your server restarts, it will enable online spare memory and display the following message: "xxxxMB System Memory and yyyyMB memory reserved for Online Spare".

Note: If the memory size requirements for proper operation are not met, RBSU will not allow you to enable online spare memory and will display the message: "Caution: Current memory configuration does not support Online Spare."

Single-board mirrored memory

Single-board mirrored memory, available in select ProLiant servers, protects against uncorrectable multibit errors without degrading the performance of the memory system.

A single memory board contains two memory banks. One of the banks is designated as the primary bank and the other as the mirror bank. Data sent to memory is written by the memory controller to both banks simultaneously, but the system reads from the primary bank only.

During a read operation, if a multibit error is detected on one or more DIMMs in the primary bank, the system reads from the mirrored bank instead. This process occurs without service intervention or server interruption. Service personnel can replace the failed DIMM during a regularly scheduled shutdown.

Note: Mirroring protects against multibit errors as long as system and mirrored DIMMs do not fail in the same cache line at the same time (a highly unlikely event).

All read and write operations in a mirrored memory configuration are handled by the memory controller. To ensure that all DIMMs are functioning properly, every 24 hours the system switches the primary and mirror designations and begins reading from the other bank.
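The read and write behavior of mirrored memory can be modeled in a few lines. This is a minimal sketch under stated assumptions: the class `MirroredMemory`, the `bad` sets used to simulate multibit errors, and the `swap_roles` method (which a timer would call every 24 hours) are all invented for illustration, and the real logic lives in the memory controller hardware.

```python
class MultibitError(Exception):
    """Raised when a simulated uncorrectable error is hit on a read."""


class MirroredMemory:
    """Toy model of a primary bank and a mirror bank (hypothetical)."""

    def __init__(self, size):
        self.banks = [[0] * size, [0] * size]
        self.bad = [set(), set()]   # addresses with simulated multibit errors
        self.primary = 0            # index of the bank currently read from

    def write(self, addr, value):
        # Every write goes to both banks.
        for bank in self.banks:
            bank[addr] = value

    def _read_bank(self, index, addr):
        if addr in self.bad[index]:
            raise MultibitError(f"bank {index}, address {addr}")
        return self.banks[index][addr]

    def read(self, addr):
        # Normal reads come from the primary bank only.
        try:
            return self._read_bank(self.primary, addr)
        except MultibitError:
            # Fall back to the mirror; this fails only if both banks
            # are bad at the same address.
            return self._read_bank(1 - self.primary, addr)

    def swap_roles(self):
        # Periodic switch of the primary and mirror designations.
        self.primary = 1 - self.primary
```

The sketch also shows why the note above matters: a read is lost only when both banks have an error at the same location, in which case the second `_read_bank` call raises as well.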

Special requirements for mirrored memory

You must meet the following requirements for mirrored memory:

  • Mirrored memory must be configured in the RBSU. If you do not configure the server for memory mirroring, all memory slots on a single board can be used for main memory. If you are using mirrored memory, only half of the physical memory is counted during POST and subsequently reported to the operating system.

  • Each DIMM in a bank must be the same size as its mirror in the other bank.

Implementing mirrored memory

When implementing mirrored memory, follow these steps:

  1. Start the server with POST speedup disabled and memory mirroring disabled. This enables the system to check the status of all the memory.

  2. After the POST process verifies the memory, restart the server and configure memory mirroring and POST speedup.

Hot-plug mirrored memory


Hot-plug mirrored memory is a fault-tolerant memory option that provides a higher level of availability than online spare memory. Hot-plug mirrored memory protects against single-bit and multibit errors. It is targeted for customers who cannot afford to take a server offline to replace a failing DIMM.

Hot-plug mirrored memory works like single-board mirrored memory. The difference is that the primary banks and mirrored banks are located on different memory boards.

To use hot-plug mirrored memory, a server must have two identical memory boards, each containing several banks of DIMMs. The memory controller writes the same data to identically configured banks of DIMMs on both memory boards, but reads data from only one group.

If any bank of DIMMs has a multibit error, the system performs the following actions:

  1. Rereads the correct data from the mirrored bank on the other memory board.

  2. Performs all future reads from the other memory board.

  3. Provides notification of the DIMM failure through the diagnostic display, the memory board LEDs, the front-panel internal health LED, and HP Systems Insight Manager (HP SIM).

If no errors occur, the system periodically switches the set of banks it reads from to ensure that both sets are monitored for memory errors.

Important! If a DIMM exceeds the limit defined by HP for single-bit correctable errors, the system will not fail over to the redundant banks but will notify you of the condition through HP SIM.

Hot-plug mirrored memory also provides hot-plug replacement capability. You do not have to wait for a scheduled shutdown to replace the failed DIMM. You can remove a memory board that has a failed or degraded bank of memory without shutting down the server.

After replacing the bank of DIMMs, you can reinstall the memory board with the server still running. After the board is reinstalled, the system automatically returns to mirrored status.
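The hot-plug replacement cycle can be viewed as a small state machine. The sketch below is an assumption-laden illustration, not HP firmware behavior: the state names, the `HotPlugMirror` class, and the rule that only the board with the failed bank may be pulled are choices made for the example.

```python
from enum import Enum


class State(Enum):
    MIRRORED = "mirrored"
    DEGRADED = "degraded"            # reading from one board after an error
    BOARD_REMOVED = "board removed"  # failed board pulled for service


class HotPlugMirror:
    """Toy state machine for two mirrored memory boards (hypothetical)."""

    def __init__(self):
        self.state = State.MIRRORED
        self.read_board = 0
        self.failed_board = None
        self.notifications = []

    def multibit_error(self, board):
        if self.state is not State.MIRRORED:
            raise RuntimeError("no redundant board left in this sketch")
        # Reread from, and keep reading from, the other board.
        self.read_board = 1 - board
        self.failed_board = board
        self.state = State.DEGRADED
        self.notifications.append(f"DIMM failure on memory board {board}")

    def remove_board(self, board):
        if self.state is not State.DEGRADED or board != self.failed_board:
            raise RuntimeError("only the board with the failed bank can be removed")
        self.state = State.BOARD_REMOVED

    def reinstall_board(self, board):
        if self.state is not State.BOARD_REMOVED or board != self.failed_board:
            raise RuntimeError("board was not removed")
        # After reinstallation the system returns to mirrored status.
        self.failed_board = None
        self.state = State.MIRRORED
```

In this model the server keeps running in every state; only the level of redundancy changes.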

You can also perform a non-hot-plug replacement of failed or degraded DIMMs during a scheduled shutdown.

Note: The special requirements described in the single-board mirrored memory section also apply to hot-plug mirrored memory. A server such as the ML570 G2 can support both single-board and hot-plug mirrored memory.

Hot-plug mirrored memory offers these advantages over single-board mirrored memory:

  • You can mirror more than one bank of DIMMs.

  • No server downtime is necessary to replace a defective DIMM or bank.

 

RAID memory

RAID memory takes memory protection and correction beyond the capabilities of ECC. With RAID memory, multibit errors and full DRAM failure can be corrected. Errors in the ECC detection system can also be detected, but not corrected.

Note: In the context of memory, RAID stands for Redundant Array of Independent DIMMs.

How RAID memory works

Similar to the way data bits are re-created in a RAID 4 drive storage system, an entire data word can be re-created from parity in a RAID 4 memory system.

When a 64-bit cache line is read out of storage into memory, the F8 chipset breaks the cache line into four 16-bit data words. These 16-bit data words are distributed to the four memory controllers on memory cartridges 1 through 4. Before being sent to the memory controllers, the data words pass through the chipset's hardware XOR (exclusive OR) engine for a parity calculation. The parity data is then sent to the memory controller in the fifth cartridge for storage.

During data transactions (whether to or from memory), the chipset checks that each data word is valid. If there is a problem, the chipset can re-create the cache line from any four of the five stored words: the four data words and the parity word held in the fifth cartridge.
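The parity scheme can be demonstrated with plain XOR arithmetic. The following is a software sketch only; the function names are hypothetical, and the word sizes (a 64-bit line split into four 16-bit words) simply follow the figures given in the text. The actual work is done by the chipset's hardware XOR engine.

```python
def split_cache_line(line):
    """Split a 64-bit value into four 16-bit words, most significant first."""
    return [(line >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]


def xor_all(words):
    """XOR a sequence of words together."""
    result = 0
    for word in words:
        result ^= word
    return result


def store(line):
    """Return five words: data for cartridges 1-4 plus parity for cartridge 5."""
    words = split_cache_line(line)
    return words + [xor_all(words)]


def rebuild(cartridges, lost):
    """Re-create the word at index `lost` from the other four words."""
    return xor_all(w for i, w in enumerate(cartridges) if i != lost)


def join(words):
    """Reassemble four 16-bit words into the original 64-bit value."""
    line = 0
    for word in words:
        line = (line << 16) | word
    return line
```

Because XOR is its own inverse, XOR-ing any four of the five stored words yields the fifth, which is why the loss of any single cartridge, data or parity, is recoverable.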

ECC on the cartridge

ECC protection is implemented on the memory cartridge regardless of the RAID memory configuration. Single-bit errors are corrected at the cartridge and good data is sent downstream to the F8 controller.

There is no AECC protection on the cartridge. Multibit errors from the memory cartridges are passed on to the F8 controller, which drops the associated read from the cartridge and uses the RAID capabilities to re-create good data.

Note: As in any RAID 4 system, if data from more than one of the five sources is corrupt (has multibit errors), the read will fail.

Comparing RAID memory and drive arrays

Although RAID memory is similar in concept to hard disk drive arrays, there are some differences.

RAID memory does not have the mechanical delays of seek time and rotational latency associated with hard disk drive arrays. Storage subsystem arrays use a single bus to write the stripes sequentially across multiple drives. In contrast, RAID memory uses five parallel connections so that data is written simultaneously across multiple DIMMs.

Also, RAID memory eliminates the write bottleneck associated with typical storage subsystem RAID implementations. In a storage array, the RAID controller generally performs a read operation of existing parity before a write operation can be completed. If a dedicated parity drive is being used, a bottleneck occurs. However, because RAID memory operates on an entire cache line of data, there is no need to read existing parity before a write operation, thus eliminating this performance bottleneck.
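The difference between the two write paths can be made concrete. In the sketch below, `small_write_parity` shows the standard read-modify-write parity update used for partial-stripe writes in parity-based drive arrays, which needs the old data and the old parity, while `full_line_parity` computes parity from the new data alone, as RAID memory can when it writes a whole cache line. Both function names and the sample values are illustrative assumptions.

```python
from functools import reduce
from operator import xor


def small_write_parity(old_data, old_parity, new_data):
    """Drive-array style update: requires reading old data and old parity."""
    return old_parity ^ old_data ^ new_data


def full_line_parity(words):
    """RAID-memory style: parity computed from the complete new line only."""
    return reduce(xor, words, 0)


# Example values (arbitrary): update one word of a four-word stripe.
stripe = [0x1111, 0x2222, 0x3333, 0x4444]
old_parity = full_line_parity(stripe)

new_stripe = list(stripe)
new_stripe[2] = 0xBEEF

# Both paths produce the same parity, but only the first needs prior reads.
parity_rmw = small_write_parity(stripe[2], old_parity, new_stripe[2])
parity_full = full_line_parity(new_stripe)
```

Both approaches arrive at the same parity value; the full-line path simply avoids the extra reads, which is the bottleneck the paragraph above describes.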