How is a loop mapped onto the GPU in blocks, warps and threads?

I need some clarification on how a loop is mapped onto the device with OpenACC. I'm also unsure about the roles of blocks, warps and threads.

If I have a loop like this:

#pragma acc parallel loop
for (int i = 0; i < 1024; i++) {
    vector[i] += 1;
}

And my GPU reports "maximum threads per block = 1024". How is the loop parallelized into blocks? My first thought is that a single block is sufficient to handle all the operations, since the vector has 1024 elements. In that case, I think the block consists of 1024 threads, each one performing the operation vector[i] += 1; with a different index i.
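
To make that assumption concrete, here is a sketch of how I could try to force this mapping with explicit clauses. My understanding is that, at least with the NVIDIA compilers, an OpenACC gang maps to a CUDA thread block and the vector lanes map to the threads inside it; the num_gangs(1) and vector_length(1024) values below are my own guesses, and the compiler is free to choose a different schedule:

#include <stdio.h>

int main(void)
{
    float vector[1024];
    for (int i = 0; i < 1024; i++) vector[i] = 0.0f;

    /* Request one gang (block) of 1024 vector lanes (threads),
       so each thread should get exactly one iteration. */
    #pragma acc parallel loop num_gangs(1) vector_length(1024)
    for (int i = 0; i < 1024; i++) {
        vector[i] += 1;
    }

    printf("vector[0] = %f\n", vector[0]); /* expect 1.0 */
    return 0;
}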

Is my understanding of what a thread is correct?

I would then have 32 warps of 32 threads each. How are they executed? Can they all run simultaneously?
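
For reference, the kernel I picture the compiler generating would look roughly like this hand-written CUDA sketch (not actual compiler output; the kernel name, launch configuration and host code are all my own assumptions):

#include <stdio.h>

__global__ void add_one(float *vector)
{
    int i = threadIdx.x;  /* 0..1023 within the single block */
    vector[i] += 1.0f;    /* each thread updates one element */
}

int main(void)
{
    float host[1024] = {0.0f};
    float *dev;
    cudaMalloc(&dev, sizeof(host));
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);

    /* One block of 1024 threads = 32 warps of 32 threads each. */
    add_one<<<1, 1024>>>(dev);

    cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("host[0] = %f\n", host[0]); /* expect 1.0 */
    return 0;
}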


