How is a loop mapped onto the GPU in blocks, warps, and threads?
I would like some clarification on how a loop is mapped onto the device when using OpenACC. I'm also unsure about the roles of blocks, warps, and threads.
If I have a loop like this:
#pragma acc parallel loop
for (i = 0; i < 1024; i++) {
    vector[i] += 1;
}
And my GPU supports "maximum threads per block = 1024". How is the loop parallelized into blocks? My first thought is that a single block is sufficient to handle the operations, since the vector has 1024 elements. In that case, I think the block is composed of 1024 threads, each one performing the operation vector[i] += 1; with a different index i.
Is my understanding of what a thread is correct?
I would then have 32 warps of 32 threads each. How are they executed? Can all of them run simultaneously?
from Recent Questions - Stack Overflow https://ift.tt/3rAfE79