Cycles per element in out-of-order execution
Problemyour text Consider the following code on a machine that run a multiplication in 5 cycles.
double aprod(double a[], int n)
{
int i;
double x, y, z;
double r = 1;
for (i=0; i < n-2; i += 3){
x = a[i]; y=a[i+1]; z=a[i+2];
r = r * (x * (y * z));
}
}
Compute the theoretical Cycles per Element (CPE), where the only limiting factor is the data dependencies.
Reference The problem is part of the Problem 5.9, p. 589, in Computer Systems: A Programmer's Perspective from Bryant et al.
My solution:
You have basically three multiplications per iteration. 1.
dummy1 := (y * z)
dummy2 := x*(dummy1)
r*(dummy2)
The dummy1 and dummy2 were just introduced to clearly state the multiplications. In the code, they are replaced by their definition, e. g. dummy1
is replaced by (y*z)
.
The 3. multiplication of an iteration has no data dependency to 1. and 2. multiplication of the following iteration. Thus, the 3. multiplication of an iteration and 1. as well as 2. multiplications of an iteration can be executed out of order (and instruction-level-parallelism is appliable).
However, 1. and 2. multiplication have an data dependency, thus need to be executed sequentially. Therefore, a machine can compute 3. multiplication of an iteration and 1. multiplication of the next iteration in parallel, and subsequently compute 2. multiplication of the next iteration. Overall, 2 multiplications are computed in parallel and subsequently another multiplication is computed. Therefore, theoretical CPE is 2*5 / 3 = 3.33
.
The solution of the book states on p. 604 that the theoretical CPE is 5 / 3 = 1.67
arguing that 1. and 2. multiplications can be computed in parallel to the 3. multiplication. The theoretical CPE implies that all three multiplications can be computed in parallel.
Does the book miss that there is a data dependency between 1. and 2. multiplication, thus are not parallelizable?
Comments
Post a Comment