At the final presentation of the Radeon HD 6900, AMD made things clear from the outset: the GeForce GTX 580 is out of reach. This will disappoint some, because many expected AMD to pull ahead of Nvidia in GPU performance. AMD believed it too, but has probably had to lower its sights. The cause may be a new architecture that proved more complex than expected to get the best out of, but also the excellent GeForce GTX 500 series, which thwarted the original plans.
With Cayman, AMD took a risk by deciding to revise one aspect of the architecture of its compute units that had not changed since the Radeon HD 2900 XT. Put simply, AMD's compute units are described as vec5, which means they can execute up to five instructions in parallel. With such an architecture, however, if the code being run does not offer that many instructions to execute in parallel, the units are not fully exploited, in contrast to Nvidia's scalar architecture, which maintains high efficiency across the widest range of situations. Both approaches are equally valid.
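To make the trade-off concrete, here is a minimal sketch of lane utilization. It is a toy model, not AMD's or Nvidia's actual scheduling: the function name `lane_utilization` and the idea of a fixed number of independent instructions available per cycle are assumptions made for illustration.

```python
def lane_utilization(independent_ops: int, width: int) -> float:
    """Fraction of execution lanes doing useful work in one cycle.

    Toy model (an assumption): the unit can fill at most `width` lanes,
    and only with instructions that are independent of each other.
    """
    if width < 1 or independent_ops < 0:
        raise ValueError("width must be >= 1 and independent_ops >= 0")
    return min(independent_ops, width) / width


# Compare a vec5 unit with a scalar unit for 1 to 5 independent instructions.
for ops in range(1, 6):
    vec5 = lane_utilization(ops, width=5)
    scalar = lane_utilization(ops, width=1)
    print(f"{ops} independent ops: vec5 {vec5:.0%}, scalar {scalar:.0%}")
```

In this model the scalar unit is always fully used as soon as one instruction is available, while the vec5 unit reaches full utilization only when the code exposes five independent instructions.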
Be careful not to confuse a compute unit with a "core", a marketing concept that Nvidia uses to invite comparison with CPUs and that AMD adopted in turn, seizing the opportunity to claim 5 cores per vec5 compute unit. Overall, you can look at things from two angles: one AMD vec5 unit is more powerful than one Nvidia scalar unit, or one AMD core is less efficient than one Nvidia core. With the GF104 of the GeForce GTX 460 and its derivatives, Nvidia moved closer to vector-style operation to increase efficiency, and AMD intends to do the same with Cayman, but in the other direction, going from vec5 to vec4. Cayman's compute units are thus less powerful than those of previous AMD GPUs, but statistically more efficient. More efficient, not more powerful: the distinction is important. On the other hand, being simpler, these units take up less space and consume less power, which makes it possible to increase their number, all other things being equal.
In more detail, previous Radeon GPUs were based on compute units of the 4+1 type, with one execution lane able to handle complex instructions. This is the lane that AMD has decided to get rid of. In return, complex instructions are processed on the other lanes through a succession of simpler operations. These instructions then monopolize 3 of the 4 execution lanes, which makes them much more costly, since only one other instruction can be executed in parallel, against 4 before. Without this somewhat special lane, which was in some cases difficult to feed properly, the compiler's task is greatly simplified. In some cases this may even make the vec4 units more efficient than the previous vec5 units, but overall AMD now needs more vec4 compute units to maintain the same level of performance.
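The cost of complex instructions can be illustrated with a rough cycle count. This is a deliberately simplified sketch: it assumes every instruction is independent and takes one cycle, which real shader code does not guarantee, and the function names `cycles_vec5` and `cycles_vec4` are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def cycles_vec5(simple: int, complex_ops: int) -> int:
    """4+1 unit: one complex op per cycle on the special lane,
    simple ops on the 4 other lanes (or on the special lane when it is free)."""
    return max(complex_ops, ceil_div(simple + complex_ops, 5))


def cycles_vec4(simple: int, complex_ops: int) -> int:
    """vec4 unit: a complex op occupies 3 of the 4 lanes,
    leaving a single lane for one simple op in that cycle."""
    return max(complex_ops, ceil_div(3 * complex_ops + simple, 4))


# Same workload on both unit types: 8 simple ops, with and without complex ops.
for simple, complex_ops in [(8, 0), (8, 2)]:
    print(simple, complex_ops,
          cycles_vec5(simple, complex_ops),
          cycles_vec4(simple, complex_ops))
```

Under these assumptions, 8 simple instructions take 2 cycles on either unit, but adding 2 complex instructions keeps the 4+1 unit at 2 cycles while the vec4 unit needs 4.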
While Cypress, the GPU of the Radeon HD 5800, had 20 blocks of 16 vec5 compute units, Cayman has 24 blocks of 16 vec4 compute units. That is 320 vec5 units against 384 vec4 units, which is less flattering when converted into cores, because it gives only 1536 cores for Cayman against 1600 for Cypress. An important detail, however, concerns the texturing units, whose number is fixed at 4 per block. Cayman therefore sees its texturing power increase by 20% at equal frequency. Note that AMD reports an increase in double-precision throughput, but this is a twisted way of interpreting the fact that the "+1" lane did not support double precision. A Cayman unit is identical to a Cypress unit in this respect.
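The figures above follow from simple multiplication, as this short sketch shows. The helper name `gpu_totals` and its dictionary keys are illustrative choices; the block, unit and lane counts are the ones given in the text.

```python
def gpu_totals(blocks: int, units_per_block: int, lanes_per_unit: int,
               texture_units_per_block: int = 4) -> dict:
    """Derive unit, "core" and texturing unit counts from the block layout."""
    units = blocks * units_per_block
    return {
        "units": units,
        "cores": units * lanes_per_unit,
        "texture_units": blocks * texture_units_per_block,
    }


cypress = gpu_totals(blocks=20, units_per_block=16, lanes_per_unit=5)
cayman = gpu_totals(blocks=24, units_per_block=16, lanes_per_unit=4)

# Relative texturing gain at equal frequency.
texture_gain = cayman["texture_units"] / cypress["texture_units"] - 1

print(cypress)
print(cayman)
print(f"texturing gain: {texture_gain:.0%}")
```

Cayman has more units (384 against 320) but fewer "cores" (1536 against 1600), and its 96 texturing units against 80 account for the 20% gain.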
AMD did not stop there and introduced other, more minor improvements to its architecture. The first concerns geometry processing, which is now parallelized in order to break the limit of one triangle per cycle. Nvidia retains an advantage with small triangles and, above all, with its larger number of simpler geometry processing units, which avoid a bottleneck in the GPU when large amounts of data are generated by tessellation. To combat this problem, AMD had enlarged the dedicated buffer in Barts, the GPU of the Radeon HD 6800, and goes further with Cayman, which is able to move all this data temporarily into video memory to avoid stalling the GPU. This function is not directly exposed, and we do not know whether it engages automatically under certain loads or whether AMD has to enable it manually on a case-by-case basis.
AMD has also improved the ROPs to increase throughput with 16-bit integer and 32-bit float formats. They also become more efficient with antialiasing, as well as for memory writes in compute mode. On the compute side, AMD took inspiration from what Nvidia offers and now allows several different kernels to execute simultaneously, whereas before the GPU had to give them successive execution periods. The same goes for communication with the CPU, which can take place in both directions simultaneously thanks to 2 DMA engines, as on GF100/110. The memory controllers have also been revised to cope more easily with fast GDDR5.