Since the Ampere generation, Nvidia has replaced its flagship Titan cards with 90-series offerings aimed at professionals who also game.
The Nvidia GeForce RTX 5090's GB202 GPU brings with it substantial hardware improvements over the RTX 4090's AD102 and the RTX 3090 Ti's GA102 GPUs.
While both the RTX 3090 Ti and the RTX 4090 offered the option to toggle the VRAM ECC state in the driver, this option is curiously missing on the RTX 5090.
What exactly is ECC memory?
ECC, which stands for error correction code, is a technique that enables memory to self-correct. Memory errors occur if there are bit flips during data transmission or when errors creep into the data as memory cells offload and replenish their charge.
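The self-correction is accomplished in one of two ways. Traditional ECC modules carry a dedicated ninth memory chip that stores check bits for the other eight chips on the RAM module, with the memory controller using those bits to detect and correct errors. With on-die ECC, the checking and correction happen inside each DRAM chip itself.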
Consumer DDR5 system memory supports ECC but not in its entirety. By default, DDR5 RAM can detect multi-bit errors but can only correct single-bit errors through built-in data checking.
Because DDR5 splits each 64-bit memory channel into two 32-bit subchannels, DDR5 ECC RAM comes in 72-bit EC4 modules (two subchannels of 32+4 bits) or 80-bit EC8 modules (two subchannels of 32+8 bits).
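The module widths follow directly from the subchannel layout. A quick sketch of the arithmetic, with the helper name `module_width` being hypothetical:

```python
SUBCHANNELS = 2          # DDR5 splits each channel into two subchannels
DATA_BITS = 32           # data bits per subchannel


def module_width(ecc_bits_per_subchannel: int) -> int:
    """Total module width in bits for a given number of ECC bits per subchannel."""
    return SUBCHANNELS * (DATA_BITS + ecc_bits_per_subchannel)


for name, ecc_bits in {"non-ECC": 0, "EC4": 4, "EC8": 8}.items():
    print(f"{name}: {module_width(ecc_bits)} bits")
```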
ECC memory is rarely needed for consumer use cases. If the term is unfamiliar, chances are you won't need ECC memory.
Nevertheless, ECC memory is paramount in mission-critical and machine learning applications where data integrity has to be maintained along the whole chain.
Google learned this the hard way back in 1999, when skimping on ECC memory led to memory corruption that drastically affected the performance of its search engine.
All GPUs featuring GDDR5 and GDDR6/6X VRAM have a way of detecting memory errors called Error Detection Code (EDC).
Nvidia GPUs refer to this function as Error Detection and Replay (EDR), which is a way of requesting retransmission of bits from the memory controller after performing a cyclic redundancy check (CRC).
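The detect-and-replay idea can be sketched in a few lines. The example below is a simplified model, not Nvidia's implementation: the CRC-8 polynomial, the 32-byte payload, the single-bit error model, and the names `crc8`, `noisy_link`, and `transfer_with_replay` are all assumptions chosen for illustration.

```python
import random

CRC8_POLY = 0x07  # x^8 + x^2 + x + 1, assumed here for illustration


def crc8(data: bytes) -> int:
    """Bitwise CRC-8 over data (initial value 0, no reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ CRC8_POLY) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def noisy_link(data: bytes, flip_probability: float, rng: random.Random) -> bytes:
    """Hypothetical link model: flips one random bit with the given probability."""
    out = bytearray(data)
    if rng.random() < flip_probability:
        bit = rng.randrange(len(out) * 8)
        out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


def transfer_with_replay(data: bytes, flip_probability: float,
                         rng: random.Random, max_replays: int = 8):
    """Send data, check its CRC on arrival, and replay the transfer on mismatch."""
    expected = crc8(data)
    for replays in range(max_replays + 1):
        received = noisy_link(data, flip_probability, rng)
        if crc8(received) == expected:
            return received, replays    # replays = number of retransmissions
    raise RuntimeError("link too noisy: replay limit reached")


if __name__ == "__main__":
    payload = bytes(range(32))
    received, replays = transfer_with_replay(payload, 0.3, random.Random())
    print(f"payload intact: {received == payload}, replays needed: {replays}")
```

Note that this scheme only detects errors and repeats the transfer; it does not correct them the way ECC does. Every replay also costs time, which is where the performance penalty on an unstable memory overclock comes from.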
EDR helps in minimizing pixel artifacts when the VRAM is overclocked, though it may slightly affect performance.
ECC VRAM on the RTX 4090 and RTX 5090
Although not widely discussed, a distinguishing feature of the Nvidia GeForce RTX 3090 Ti and RTX 4090 desktop GPUs is the ability to toggle between ECC and non-ECC memory states via the driver.
However, this feature is absent in the new RTX 5090.
Performance impact of enabling ECC
The RTX 3090 Ti and RTX 4090 implement what can be described as "soft ECC". This approach does not involve a separate chip for maintaining parity; instead, enabling the feature sets aside a portion of the existing VRAM to hold the ECC data that a dedicated ECC chip would otherwise store.
As a result, the total available VRAM and memory speed are decreased. In the case of the RTX 4090, the usable VRAM is reduced from 24 GB to 22.5 GB, with 1.5 GB being set aside for ECC functions.
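Put as a share of total capacity, the reservation works out as follows. The figures are the ones quoted above for the RTX 4090; the variable names are only for illustration.

```python
TOTAL_VRAM_GB = 24.0        # RTX 4090 with ECC disabled
USABLE_WITH_ECC_GB = 22.5   # usable VRAM with ECC enabled

reserved_gb = TOTAL_VRAM_GB - USABLE_WITH_ECC_GB
reserved_fraction = reserved_gb / TOTAL_VRAM_GB

print(f"Reserved for ECC: {reserved_gb} GB ({reserved_fraction:.2%} of total VRAM)")
```

In other words, one sixteenth of the card's memory is given up when ECC is switched on.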
Toggling the ECC state impacts performance. With ECC activated on the RTX 4090, 3DMark Speed Way scores decrease by 6.4%, while Cyberpunk 2077 2.21 Phantom Liberty sees a reduction of approximately 5% in average fps.
The degree of performance impact will vary depending on the workload.
RTX 5090's GDDR7 VRAM is officially spec'd for on-die ECC
With GDDR7, JEDEC has incorporated on-die ECC into the VRAM specification, taking into account the increasing likelihood of errors at higher memory densities. GDDR7 uses on-die ECC with a transparency protocol that informs the memory controller about the type of errors encountered.
According to JEDEC, GDDR7 is capable of 100% correction of 1-bit errors and 100% detection of 2-bit errors, although the detection rate drops slightly to 99.3% for rare 3-bit errors.
The official spec also includes command address parity with command blocking (CAPARBLK) to further improve the reliability of the command address bus.
However, it is not clear whether Blackwell's memory controller uses this on-die ECC capability by default.
The RTX 5090's 512-bit GDDR7 memory is rated for 1.792 TB/s of bandwidth at a per-pin data rate of 28 Gbps, a speed at which transmission errors become more likely. Besides, Nvidia is pitching the RTX 5090 for AI workflows, which can benefit from ECC when training on large datasets.
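The bandwidth figure can be reproduced from the bus width and the per-pin data rate. A minimal sketch, with the function name `peak_bandwidth_gb_s` being hypothetical:

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Each of the bus_width_bits data pins moves data_rate_gbps gigabits per
    second; dividing by 8 converts bits to bytes.
    """
    return bus_width_bits * data_rate_gbps / 8


rtx_5090 = peak_bandwidth_gb_s(bus_width_bits=512, data_rate_gbps=28)
print(f"RTX 5090: {rtx_5090:.0f} GB/s ({rtx_5090 / 1000:.3f} TB/s)")
```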
Despite this, Nvidia's architecture whitepaper only mentions support for "Enhanced Cyclic Redundancy Check (CRC) for Reliability, Availability, and Serviceability (RAS)", which is not the same as ECC.
While it would be safe to expect that Nvidia would enable GDDR7's on-die ECC functionality for the rumored Blackwell workstation GPUs, it remains to be seen if the ECC state toggle would come to the consumer RTX 5090 via a future driver or VBIOS update.
Source(s)
Own