
GeForce GTX 980 Whitepaper

 

GM204 HARDWARE ARCHITECTURE IN-DEPTH

In GeForce GTX 980, each GPC ships with a dedicated raster engine and four SMMs. Each SMM has 128 
CUDA cores, a PolyMorph Engine, and eight texture units. With 16 SMMs, the GeForce GTX 980 ships 
with a total of 2048 CUDA cores and 128 texture units. 

The GeForce GTX 980 features four 64-bit memory controllers (256-bit total). Tied to each memory
controller are 16 ROP units and 512KB of L2 cache. The full chip ships with a total of 64 ROPs and
2048KB of L2 cache (compared with 32 ROPs and 512KB of L2 cache on GK104).
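The chip-level totals quoted above follow directly from the per-unit counts. The sketch below is only an illustration of that arithmetic: the constant and function names are hypothetical, and the GPC count is inferred from "four SMMs per GPC" and "16 SMMs" rather than stated in the text.

```python
# Per-unit counts quoted in the text. All names here are illustrative.
SMM_COUNT = 16
SMMS_PER_GPC = 4
CORES_PER_SMM = 128
TEXTURE_UNITS_PER_SMM = 8
MEMORY_CONTROLLERS = 4
CONTROLLER_WIDTH_BITS = 64
ROPS_PER_CONTROLLER = 16
L2_KB_PER_CONTROLLER = 512


def gm204_totals():
    """Derive chip-level totals from the per-unit counts above."""
    return {
        # Inferred, not stated in the text: 16 SMMs at 4 per GPC.
        "gpcs": SMM_COUNT // SMMS_PER_GPC,
        "cuda_cores": SMM_COUNT * CORES_PER_SMM,
        "texture_units": SMM_COUNT * TEXTURE_UNITS_PER_SMM,
        "memory_bus_bits": MEMORY_CONTROLLERS * CONTROLLER_WIDTH_BITS,
        "rops": MEMORY_CONTROLLERS * ROPS_PER_CONTROLLER,
        "l2_cache_kb": MEMORY_CONTROLLERS * L2_KB_PER_CONTROLLER,
    }


if __name__ == "__main__":
    for name, value in gm204_totals().items():
        print(f"{name}: {value}")
```

Running it reproduces the figures in the two paragraphs above: 2048 CUDA cores, 128 texture units, a 256-bit memory interface, 64 ROPs, and 2048KB of L2 cache.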

The following table provides a high-level comparison of Maxwell vs. our previous-generation GK104 
GPU: 

GPU                      GeForce GTX 680 (Kepler)    GeForce GTX 980 (Maxwell)
SMs                      8                           16
CUDA Cores               1536                        2048
Base Clock               1006 MHz                    1126 MHz
GPU Boost Clock          1058 MHz                    1216 MHz
GFLOPS                   3090                        4612 [1]
Texture Units            128                         128
Texel fill-rate          128.8 Gigatexels/sec        144.1 Gigatexels/sec
Memory Clock             6000 MHz                    7000 MHz
Memory Bandwidth         192 GB/sec                  224 GB/sec
ROPs                     32                          64
L2 Cache Size            512KB                       2048KB
TDP                      195 Watts                   165 Watts
Transistors              3.54 billion                5.2 billion
Die Size                 294 mm²                     398 mm²
Manufacturing Process    28-nm                       28-nm
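The throughput rows of the table can be reproduced from the unit counts and clocks. The sketch below is an illustration, not NVIDIA's own method, and its function names are hypothetical. It rests on two assumptions that are not stated in the text: each CUDA core retires one fused multiply-add per clock, counted as 2 floating-point operations, and the table's "Memory Clock" is the effective per-pin data rate.

```python
def peak_gflops(cuda_cores, base_clock_mhz, flops_per_core_per_clock=2):
    """Peak single-precision GFLOPS at the base clock.

    Assumption: one fused multiply-add per core per clock, i.e. 2 FLOPs.
    """
    return cuda_cores * flops_per_core_per_clock * base_clock_mhz / 1000.0


def texel_fill_rate(texture_units, base_clock_mhz):
    """Texel fill rate in Gigatexels/sec, one texel per unit per clock."""
    return texture_units * base_clock_mhz / 1000.0


def memory_bandwidth(bus_width_bits, data_rate_mhz):
    """Memory bandwidth in GB/sec.

    Assumption: data_rate_mhz is the effective per-pin data rate.
    """
    return bus_width_bits / 8 * data_rate_mhz / 1000.0


if __name__ == "__main__":
    # GeForce GTX 680 (Kepler), values from the table.
    print(round(peak_gflops(1536, 1006)))         # 3090
    print(round(texel_fill_rate(128, 1006), 1))   # 128.8
    # GeForce GTX 980 (Maxwell), values from the table.
    print(round(peak_gflops(2048, 1126)))         # 4612
    print(round(texel_fill_rate(128, 1126), 1))   # 144.1
    print(memory_bandwidth(256, 7000))            # 224.0
```

Under those assumptions the computed values match the table rows for GFLOPS, texel fill-rate, and the GTX 980's memory bandwidth. Substituting the boost clocks would give higher peak figures than the ones the table lists.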

 
The GeForce GTX 980 has twice as many SMs as the GK104 GPU used in the GeForce GTX 680,
released two years ago. Because of the changes implemented in GTX 980’s new Maxwell SM, we were
able to integrate twice as many SMs without doubling the die size. Because each SM also contains its own
dedicated PolyMorph Engine, GeForce GTX 980 also has twice as many geometry units as its direct
predecessor. We discuss the new SM design in more detail in the next section of the
whitepaper.
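The scaling claim in this paragraph can be checked against the table. The sketch below is a rough illustration with hypothetical names. The GTX 680 SM count of 8 is inferred from the statement that GTX 980 has twice as many SMs, and dividing the whole die area by the SM count is only a crude proxy, since the die also holds memory controllers, ROPs, L2 cache, and display logic.

```python
# Figures from the comparison table. The GTX 680 SM count of 8 is
# inferred from "twice as many SMs", not listed explicitly.
GTX_680 = {"sms": 8, "die_mm2": 294.0, "transistors_billion": 3.54}
GTX_980 = {"sms": 16, "die_mm2": 398.0, "transistors_billion": 5.2}


def generation_ratios(old, new):
    """Ratio of each Maxwell figure to its Kepler counterpart."""
    return {
        "sm_count": new["sms"] / old["sms"],
        "die_area": new["die_mm2"] / old["die_mm2"],
        "transistors": new["transistors_billion"] / old["transistors_billion"],
    }


def die_area_per_sm(gpu):
    """Whole-die area divided by SM count, in mm² (a crude proxy only)."""
    return gpu["die_mm2"] / gpu["sms"]


if __name__ == "__main__":
    for name, ratio in generation_ratios(GTX_680, GTX_980).items():
        print(f"{name}: {ratio:.2f}x")
    print(f"GTX 680: {die_area_per_sm(GTX_680):.1f} mm² of die per SM")
    print(f"GTX 980: {die_area_per_sm(GTX_980):.1f} mm² of die per SM")
```

With these inputs the SM count doubles while die area grows by about 1.35x and transistor count by about 1.47x, on the same 28-nm process.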

Based on efficiency and workload analysis, and math vs. texture processing requirements of modern 
games, NVIDIA engineers determined that eight texture units per SMM is the best architectural balance 
for Maxwell; therefore, the total number of texture units is the same as Kepler, 128. However, thanks to 
GeForce GTX 980’s higher clocks, texture fill rate improves by 12% from one generation to the next. To 
improve performance in high AA/high resolution gaming scenarios, we doubled the number of ROPs 

                                                           

 

[1] The GFLOPS and texel fill rates in this table are based on the GPU base clock.
