2023-07-19

4000% Performance Decrease in SYCL when using Unified Shared Memory instead of Device Memory

In SYCL, there are three types of memory: host memory, device memory, and Unified Shared Memory (USM). Data must be exchanged between host and device memory with explicit copies, while data movement to and from USM is managed implicitly by the SYCL runtime.
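To make the difference concrete, here is a minimal, self-contained sketch (illustrative only - the names and sizes are made up, this is not the benchmark code below): a device allocation must be filled through an explicit copy, while a shared allocation can be dereferenced directly from both host and device.

#include <sycl/sycl.hpp>
#include <vector>

int main()
{
    sycl::queue Q;
    constexpr size_t N = 1024;

    // Device memory: not dereferenceable on the host; data must be
    // staged in and out with explicit copies.
    float *dev = sycl::malloc_device<float>(N, Q);
    std::vector<float> host(N, 1.0f);
    Q.memcpy(dev, host.data(), N * sizeof(float)).wait();

    // Unified Shared Memory: dereferenceable from both host and device;
    // the runtime migrates the data implicitly.
    float *shared = sycl::malloc_shared<float>(N, Q);
    shared[0] = 42.0f;  // direct host access, no explicit copy

    Q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        shared[i] = dev[i] * 2.0f;  // both pointers usable in a kernel
    }).wait();

    sycl::free(dev, Q);
    sycl::free(shared, Q);
    return 0;
}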

Unfortunately, while implementing GPU acceleration for a numerical kernel using SYCL, I found up to a 40x (4000%) performance drop just by switching from sycl::malloc_device() to sycl::malloc_shared() - even though all I do is repeatedly resubmit the same SYCL kernel, without ever accessing the data from the host.

When the code is built with sycl::malloc_device() using OpenSYCL, targeting AMD HIP gfx906 (Radeon VII / Instinct MI50), the program finishes in 0.27 seconds:

$ time ./fdtd.elf 
simulate 16974593 cells for 10 timesteps.

real    0m0.271s
user    0m0.253s
sys     0m0.020s

When building the same code with sycl::malloc_shared(), the program takes 10.6 seconds to complete:

$ time ./fdtd.elf 
simulate 16974593 cells for 10 timesteps.

real    0m10.649s
user    0m15.172s
sys     0m0.196s

This is a roughly 39x slowdown (10.649 s / 0.271 s ≈ 39.3, or about 3900%).

After enabling "Above 4G Decoding" and "Re-sizable BAR" support in the BIOS, the run time drops from 10.6 seconds to 3.8 seconds. But this doesn't fix the underlying problem of needless memory transfers - a 14x (roughly 1300%) performance hit is still significant.

I had previously tested a similar kernel built with the Intel DPC++ compiler and saw similar results on the same hardware.

I suspect that the slowdown is caused by needless copying between host and device, but I'm not sure. What heuristics does a SYCL runtime use to decide whether copying is needed?
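For what it's worth, SYCL 2020 does define explicit migration hints for USM: queue::prefetch() and queue::mem_advise() (the advice values for the latter are implementation-defined). If the slowdown really comes from on-demand page migration, prefetching the shared allocations to the device once, before the timestep loop, might help - though I haven't verified whether OpenSYCL honors these hints on the HIP backend. A sketch, using the arrays from the sample code below:

// (inserted in main(), after the CreateArrayXYZN() calls and before
// the timestep loop; only meaningful for the sycl::malloc_shared build)
Q.prefetch(volt.array, volt.size);
Q.prefetch(curr.array, curr.size);
Q.prefetch(vv.array, vv.size);
Q.prefetch(vi.array, vi.size);
Q.wait();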

The sample code is attached below.

ArrayNXYZ.hpp: 4-dimensional array (n, x, y, z) wrapper class.

#pragma once

#include <sycl/sycl.hpp>

template <typename T>
struct ArrayXYZN
{
    ArrayXYZN() {}

    inline T& operator() (const unsigned int n, const unsigned int x, const unsigned int y, const unsigned int z) const
    {
        size_t offset = n * n_stride + x * x_stride + y * y_stride + z;
        return array[offset];
    }

    unsigned long n_stride, x_stride, y_stride, size;
    T *array;
};

template <typename T>
ArrayXYZN<T>* CreateArrayXYZN(sycl::queue Q, const unsigned int* numLines)
{
    unsigned int n_max = 3;
    unsigned int x_max = numLines[0];
    unsigned int y_max = numLines[1];
    unsigned int z_max = numLines[2];

    unsigned long n_stride = x_max * y_max * z_max;
    unsigned long x_stride = y_max * z_max;
    unsigned long y_stride = z_max;

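    // round n_stride up to the next multiple of 128 elements
    // (presumably to keep each of the three component planes aligned)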
    if (n_stride % 128 != 0)
    {
        n_stride += 128 - (n_stride % 128);
    }

    // allocate 1D linear buffer
    size_t size = n_stride * n_max;

#ifdef USM
    T *buf = sycl::malloc_shared<T>(size, Q);
#else
    T *buf = sycl::malloc_device<T>(size, Q);
#endif

    // zero memory
    Q.submit([&](sycl::handler& h) {
        h.memset(buf, 0, size * sizeof(T));
    });
    Q.wait();

    // allocate wrapper class
    ArrayXYZN<T>* array = new ArrayXYZN<T>();
    array->n_stride = n_stride;
    array->x_stride = x_stride;
    array->y_stride = y_stride;
    array->size = size * sizeof(T);
    array->array = buf;

    return array;
}

fdtd.cpp:

#include <cstdio>
#include <sycl/sycl.hpp>
#include "ArrayNXYZ.hpp"

/*
 * UpdateVoltages
 *
 * Using Finite Difference Time Domain (FDTD) method,
 * calculate new electric field array "volt" based on
 * magnetic field "curr" and two electromagnetic field
 * operators "vv" and "vi", precalculated from the
 * physical materials before starting up simulation.
 */
void UpdateVoltages(
        const ArrayXYZN<float>& volt,
        const ArrayXYZN<float>& curr,
        const ArrayXYZN<float>& vv,
        const ArrayXYZN<float>& vi,
        int x, int y, int z
)
{
    // note: each (x, y, z) cell has three polarizations
    // x, y, z, these are different from the cell's
    // coordinates (x, y, z)

    //for x polarization
    float volt0 = volt(0, x, y, z);
    volt0 *= vv(0, x, y, z);
    volt0 +=
        vi(0, x, y, z) * (
        curr(2, x, y  , z  ) -
        curr(2, x, y-1, z  ) -
        curr(1, x, y  , z  ) +
        curr(1, x, y  , z-1)
        );

    //for y polarization
    float volt1 = volt(1, x, y, z);
    volt1 *= vv(1, x, y, z);
    volt1 +=
        vi(1, x, y, z) * (
        curr(0, x  , y, z  ) -
        curr(0, x  , y, z-1) -
        curr(2, x  , y, z  ) +
        curr(2, x-1, y, z  )
        );

    //for z polarization
    float volt2 = volt(2, x, y, z);
    volt2 *= vv(2, x, y, z);
    volt2 +=
        vi(2, x, y, z) * (
        curr(1, x  , y  , z) -
        curr(1, x-1, y  , z) -
        curr(0, x  , y  , z) +
        curr(0, x  , y-1, z)
        );

    volt(0, x, y, z) = volt0;
    volt(1, x, y, z) = volt1;
    volt(2, x, y, z) = volt2;
}

int main(void)
{
    const unsigned int numLines[3] = {257, 257, 257};
    const int timesteps = 10;

    sycl::queue Q;

    ArrayXYZN<float>& volt = *CreateArrayXYZN<float>(Q, numLines);
    ArrayXYZN<float>& curr = *CreateArrayXYZN<float>(Q, numLines);
    ArrayXYZN<float>& vv = *CreateArrayXYZN<float>(Q, numLines);
    ArrayXYZN<float>& vi = *CreateArrayXYZN<float>(Q, numLines);

    size_t size = numLines[0] * numLines[1] * numLines[2];
    fprintf(stderr, "simulate %zu cells for %d timesteps.\n", size, timesteps);

    for (int i = 0; i < timesteps; i++) {
        Q.submit([&](sycl::handler &h) {
            h.parallel_for<class Voltage>(
                sycl::range(numLines[0] - 1, numLines[1] - 1, numLines[2] - 1),
                [=](sycl::item<3> itm) {
                    /*
                     * The first cell on each dimension has data dependency
                     * outside the simulation box (boundary condition).
                     * Ignore them for now.
                     */
                    int x = itm.get_id(0) + 1;
                    int y = itm.get_id(1) + 1;
                    int z = itm.get_id(2) + 1;

                    UpdateVoltages(volt, curr, vv, vi, x, y, z);
                }
            );
        });
        Q.wait();
    }
}
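For reference, the USM variant is selected with the USM macro at compile time. Something like the following should build both variants with OpenSYCL's syclcc wrapper (a sketch only - the exact spelling of the target flag differs between hipSYCL/OpenSYCL releases):

$ syclcc -O3 --hipsycl-targets=hip:gfx906 fdtd.cpp -o fdtd.elf          # device memory
$ syclcc -O3 --hipsycl-targets=hip:gfx906 -DUSM fdtd.cpp -o fdtd.elf    # Unified Shared Memory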

