-
Notifications
You must be signed in to change notification settings - Fork 247
Description
You can see that, depending on graphics card model, there are stride values to avoid - funny enough sometimes with strides of powers of 2 - fairly common numbers. If you ignore this, you can get into trouble and under utilize the memory channels available in hardware in addition to increasing bank conflicts (so the hit is 2 fold). The fix is to simply allocate a slightly pad the strides to avoid these conflicts (found via vendor/model documentation). This should be of use to any GPU implementation and likely some accelerators. It should not be performed when the device type is CPU or code that is doing vector operations directly on memory where the hardware has alignment restrictions.
As such, I tried to use such an array with pyopencl but it doesn't work out very well in manipulation/printing/inspection:
def make_cl_image_buffer(queue, img_dims, img_dtype, linestep = None):
img_dtype = dtype_of(img_dtype)
if linestep is None:
linestep = img_dims[1] * img_dtype.itemsize
if queue.device.type == cl.device_type.GPU:
#make sure to have an uneven line width for better memory channel utilization, as per AMD recommendation
if linestep % 2 == 0:
linestep += 1
buf = cl.Buffer(queue.context, cl.mem_flags.READ_WRITE, img_dims[0]*linestep)
return buf, linestep
def make_cl_image(queue, img_dims, img_dtype, linestep = None):
img_dtype = dtype_of(img_dtype)
img_buf, linestep = make_cl_image_buffer(queue, img_dims, img_dtype, linestep)
strides = (linestep, dtype_of(img_dtype).itemsize)
img = clarray.Array(queue, img_dims, img_dtype, strides=strides, data=img_buf)
return img
In [3]: img=com.make_cl_image(queue, (2044, 2044), np.uint32)
In [12]: img.shape
Out[12]: (2044, 2044)
In [13]: img.strides
Out[13]: (8177, 4)
/home/jason/tmp/pyopencl.git/pyopencl/array.py in get(self, queue, ary, async)
685 raise TypeError("'ary' has non-matching type")
686
--> 687 assert self.flags.forc, "Array in get() must be contiguous"
688
689 if self.size:
AssertionError: Array in get() must be contiguous
I searched around and found someone else curious about this last year - he also put some work into this:
#54
I wouldn't mind adding some lines to the library to accomplish good-no-surprises manipulation but I was unsure the best way to do it since it seems everything relies on the element wise kernels. I think the best/most compatible way of doing this is having the element wise kernels take in an additional stride parameter which they use to calculate the offset of the element they process.
What do you think? Is it possible for you to add this? If you don't have the time - can you outline what is the way to get it done?
ps. Tried bringing this up for discussion on the ML, got stuck in moderation.