optimization: support for 2-D/3-D arrays with strides >= for better memory bandwidth utilization #70

@nevion

http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-472173

You can see that, depending on the graphics card model, there are stride values to avoid. Funnily enough, these are sometimes powers of 2, which are fairly common numbers. If you ignore this, you can underutilize the memory channels available in hardware and also increase bank conflicts, so the hit is twofold. The fix is simply to pad the strides slightly to avoid these conflicts (the values to avoid are found via vendor/model documentation). This should be of use to any GPU implementation and likely some accelerators. It should not be performed when the device type is CPU, or for code that does vector operations directly on memory where the hardware has alignment restrictions.

To that end, I tried to use such an array with pyopencl, but it doesn't work out very well for manipulation, printing, or inspection:

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as clarray

    # dtype_of is a helper defined elsewhere (not shown here).

    def make_cl_image_buffer(queue, img_dims, img_dtype, linestep=None):
        img_dtype = dtype_of(img_dtype)
        if linestep is None:
            # Default row stride in bytes: one tightly packed row.
            linestep = img_dims[1] * img_dtype.itemsize
            if queue.device.type == cl.device_type.GPU:
                # Make sure to have an uneven line width for better memory
                # channel utilization, as per the AMD recommendation.
                # Note: linestep is in bytes, so this adds one byte per row.
                if linestep % 2 == 0:
                    linestep += 1
        buf = cl.Buffer(queue.context, cl.mem_flags.READ_WRITE, img_dims[0] * linestep)
        return buf, linestep

    def make_cl_image(queue, img_dims, img_dtype, linestep=None):
        img_dtype = dtype_of(img_dtype)
        img_buf, linestep = make_cl_image_buffer(queue, img_dims, img_dtype, linestep)
        # Byte strides: (row stride, element stride).
        strides = (linestep, img_dtype.itemsize)
        img = clarray.Array(queue, img_dims, img_dtype, strides=strides, data=img_buf)
        return img


    In [3]: img=com.make_cl_image(queue, (2044, 2044), np.uint32)

    In [12]: img.shape
    Out[12]: (2044, 2044)

    In [13]: img.strides
    Out[13]: (8177, 4)
    /home/jason/tmp/pyopencl.git/pyopencl/array.py in get(self, queue, ary, async)
        685                 raise TypeError("'ary' has non-matching type")
        686
    --> 687         assert self.flags.forc, "Array in get() must be contiguous"
        688
        689         if self.size:

    AssertionError: Array in get() must be contiguous
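
One caveat on the session above: the row stride of 8177 bytes is not a multiple of the 4-byte item size, so every row after the first starts at an address that is not aligned for `uint32`, which may matter on devices that require aligned accesses. A minimal sketch of one alternative, assuming the odd-width heuristic is applied to the element count rather than the byte count; `padded_linestep` is a hypothetical helper, not part of pyopencl:

```python
def padded_linestep(width, itemsize):
    """Return a row stride in bytes that holds an odd number of elements.

    Hypothetical helper: padding is added in whole elements, so the
    stride stays a multiple of itemsize and rows remain aligned.
    """
    elements_per_row = width
    if elements_per_row % 2 == 0:
        # Pad by one full element instead of one byte.
        elements_per_row += 1
    return elements_per_row * itemsize


# Same image width and item size as in the session above.
linestep = padded_linestep(2044, 4)
```

Whether an odd element count is the right target for a given device is still something to check against the vendor documentation; the sketch only shows how to keep the padding element-aligned.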

I searched around and found someone else curious about this last year - he also put some work into this:
#54

I wouldn't mind adding some lines to the library to get good, no-surprises manipulation, but I was unsure of the best way to do it, since it seems everything relies on the element-wise kernels. I think the best and most compatible way of doing this is to have the element-wise kernels take an additional stride parameter, which they use to calculate the offset of the element they process.
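
To make the stride-parameter idea concrete, here is a rough host-side emulation in plain Python rather than an actual OpenCL kernel. It assumes 2-D arrays with strides given in bytes; `strided_offset` and `elementwise_add_one` are hypothetical names for illustration and do not exist in pyopencl:

```python
import struct


def strided_offset(gid, shape, strides):
    """Byte offset of flat element index gid in a 2-D array with byte strides."""
    row, col = divmod(gid, shape[1])
    return row * strides[0] + col * strides[1]


def elementwise_add_one(buf, shape, strides, fmt="<I"):
    """Emulate an element-wise kernel: one loop iteration per work item."""
    for gid in range(shape[0] * shape[1]):
        off = strided_offset(gid, shape, strides)
        (value,) = struct.unpack_from(fmt, buf, off)
        struct.pack_into(fmt, buf, off, value + 1)


shape = (3, 4)
strides = (20, 4)  # 4 uint32 values per row plus 4 bytes of padding
buf = bytearray(shape[0] * strides[0])
elementwise_add_one(buf, shape, strides)
```

Each work item touches only the logical elements, so the padding bytes at the end of every row are left alone. A real kernel would do the same index arithmetic on the device, with the strides passed in as extra arguments.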

What do you think? Is it possible for you to add this? If you don't have the time, can you outline how to get it done?

P.S. I tried bringing this up for discussion on the mailing list, but it got stuck in moderation.
