float8 swizzle operator fails on NVIDIA OpenCL

To utilize SIMD in my OpenCL code on CPU targets, I cast a float[5] array pointer to a float8 * and use the swizzle operators to group adjacent instructions into vector operations. See my changes:

https://github.com/fangq/mcxcl/commit/fc2d97830a380c73a427eab84a64aca3f3531dcc

For example, the previous code segment (lines 100-104)

#define FUN(x)  (4.f*(x)*(1.f-(x)))

    t[0]=FUN(t[0]);
    t[1]=FUN(t[1]);
    t[2]=FUN(t[2]);
    t[3]=FUN(t[3]);
    t[4]=FUN(t[4]);

was replaced by (lines 100-101)

    t->s0123=((float4)4.0*t->s0123*((float4)1.0-t->s0123));
    t->s4=FUN(t->s4);

This change compiles and runs fine on Intel OpenCL and AMD OpenCL, but fails on NVIDIA OpenCL (all t values become wildly large numbers). On Intel OpenCL, I gained about 15% speed thanks to SSE operations.

I know it probably should not impact speed on NVIDIA's OpenCL because there are no vector registers, but I still expect the above swizzle syntax to be valid on NVIDIA and to produce correct values.

Can someone tell me if you see anything wrong with the above change?

I'd be concerned about alignment issues. For the NVIDIA case where the t values become wildly large numbers, are there any errors reported if you run the code under cuda-memcheck?

@txbob: This question is in the context of OpenCL; does cuda-memcheck work with OpenCL programs?

If this were CUDA code, there would be an issue with alignment when simply casting a 'float' pointer (requiring 32-bit alignment) to a 'float8' pointer, as txbob points out. I am not sure that CUDA actually provides a built-in type 'float8'. It definitely has 'float4', which requires 128-bit alignment.

I don’t know what OpenCL specifies with regard to pointer alignment for built-in compound types. @FangQ, could you point at the relevant section of the OpenCL specification?

[Later:] Here is what I found in the OpenCL specification regarding alignment (section 6.1.5, "Alignment of Types"):

    A data item declared to be a data type in memory is always aligned to the size of the data type in bytes. For example, a float4 variable will be aligned to a 16-byte boundary as will a char4 variable.

This suggests that in OpenCL, just like in CUDA, one cannot simply cast a ‘float’ pointer to a ‘float8’ pointer, then dereference the ‘float8’ pointer and expect this to work by design. It may “happen to work” by chance, if the original ‘float’ pointer happens to satisfy the alignment requirements of a ‘float8’ pointer.
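
To make the "by chance" part concrete, here is a quick host-side check one could run (plain C; the helper name is my own, and this is just a sketch, not code from the project):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Returns nonzero if p sits on an n-byte boundary. */
    static int is_aligned(const void *p, size_t n)
    {
        return ((uintptr_t)p % n) == 0;
    }

    int main(void)
    {
        float t[5];
        /* A float* is only guaranteed 4-byte alignment, while a float8
           access assumes a 32-byte boundary; whether this prints 1 is
           entirely up to where the compiler happens to place t. */
        printf("32-byte aligned: %d\n", is_aligned(t, 32));
        return 0;
    }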

Does OpenCL support unions? If so, a safe way to handle the situation would be to create a union of a 'float8' and an array of 'float'; this should cause the union to be suitably aligned for 'float8' data.
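
For illustration, a minimal OpenCL C sketch of that idea (untested, and the identifiers are my own invention):

    /* The union inherits the strictest member alignment, so the float8
       member forces 32-byte alignment for both views of the storage. */
    typedef union {
        float  f[8];   /* scalar view */
        float8 v;      /* vector view, 32-byte aligned */
    } floatvec8;

    __kernel void demo(__global float *out)
    {
        floatvec8 t;
        t.v = (float8)(0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f);
        /* Swizzle on the properly aligned vector view: */
        t.v.s0123 = (float4)(4.0f) * t.v.s0123 * ((float4)(1.0f) - t.v.s0123);
        out[get_global_id(0)] = t.f[0];
    }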

I'm not particularly knowledgeable in OpenCL, but I ran a test case with a very simple kernel that was structured correctly. cuda-memcheck appeared to run correctly and reported no errors. I then introduced an invalid 8-byte write into the same code and reran it, and cuda-memcheck correctly reported the error:

$ cuda-memcheck ./t1_bug
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
=========     at 0x00000048 in test_rotate
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0x05361f40 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libnvidia-opencl.so.1 [0x14c8fc]
=========     Host Frame:./t1_bug [0x2b29]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21d65]
=========     Host Frame:./t1_bug [0x28c9]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000048 in test_rotate
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x05361f40 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libnvidia-opencl.so.1 [0x14c8fc]
=========     Host Frame:./t1_bug [0x2b29]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21d65]
=========     Host Frame:./t1_bug [0x28c9]
=========
========= Error: process didn't terminate successfully
=========        The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch host side errors.
========= Internal error (20)

So it might be worth a try. I cannot say categorically that cuda-memcheck supports OpenCL.

Thanks for the comment. I tried cuda-memcheck; it did not report any memory error, although the code produced incorrect RNG states. Steps to reproduce:

git clone https://github.com/fangq/mcxcl.git
cd mcxcl
git checkout fe4cc1fe011a0885dcf3b221ef471b35a8ee3ce9 .  # use the swizzle operators
cd src
make
cd ../example/quicktest/
./listgpu.sh           # list all GPUs
./run_qtest.sh         # this runs the benchmark using the first GPU (-G 1)

The above benchmark fails with an incorrect absorption fraction (19.9% instead of the expected 17.7%), and the kernel returns almost immediately due to the failed RNG. (Because NVIDIA's OpenCL does not support printf, I only managed to identify the RNG issue after passing my RNG states back to the host and printing them.)

Running cuda-memcheck failed to reveal any memory issues:

fangq@wazu$ cuda-memcheck ../../bin/mcxcl -t 16384 -T 64 -g 10 -n 10 -f qtest.inp -s qtest -r 1 -a 0 -b 0 -k ../../src/mcx_core.cl -d 0 -G 1

========= CUDA-MEMCHECK
workdev=1
===============================================================================
=                     Monte Carlo eXtreme (MCX) -- OpenCL                     =
=           Copyright (c) 2009-2016 Qianqian Fang <q.fang at neu.edu>         =
=                                                                             =
=                    Computational Imaging Laboratory (CIL)                   =
=             Department of Bioengineering, Northeastern University           =
===============================================================================
$MCXCL$Rev::    $ Last Commit $Date::                     $ by $Author:: fangq$
===============================================================================
- code name: [Vanilla MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] Logistic-Lattice [Seed Length] 5
initializing streams ...	init complete : 0 ms
build program complete : 2 ms
- [device 0] threadph=0 oddphotons=10 np=10.0 nthread=16384 repetition=1
set kernel arguments complete : 2 ms
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ... 	kernel complete:  	12 ms
retrieving flux ... 	transfer complete:	13 ms
normalizing raw data ...	normalization factor alpha=20000000.000000
saving data to file ... 216000 1	saving data complete : 14 ms

simulated 10 photons (10) with 1 CUs with 16384 threads (repeat x1)
MCX simulation speed: 1.00 photon/ms
total simulated energy: 10.00	absorbed: 19.89206%
(loss due to initial specular reflection is excluded in the total)
========= ERROR SUMMARY: 0 errors

Changing GPUs (-G 01 or -G 001) did not change the results much.

One question I have is how byte alignment works with arrays of built-in types, given what the OpenCL spec says about alignment.

What I did was cast a float t[5] to a float8 *t8 and only read/write the s0 through s4 components.
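
In code form, the pattern is essentially this (a restatement for clarity, not a verbatim excerpt from the repository):

    __kernel void pattern(__global float *out)
    {
        float t[5] = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f};  /* guaranteed 4-byte alignment only */
        float8 *t8 = (float8 *)t;   /* 32-byte alignment is NOT guaranteed here */
        t8->s0123 = (float4)(4.0f) * t8->s0123 * ((float4)(1.0f) - t8->s0123);
        t8->s4 = 4.f * t8->s4 * (1.f - t8->s4);
        out[get_global_id(0)] = t[0];
    }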

Could this cause a byte-alignment mismatch?

I replaced my float t[5] with either float8 t[0] or a union defined in the form below:

typedef union MCX_Float8{
    RandType f[5];
    float8   v;
} Float8  __attribute__ ((aligned (32)));
...
Float8 t[0];

The code now produces the correct results, as expected. But I had meant to avoid doing this because I thought I could save 3 floats of register space, especially since the kernel is already register-hungry.

I guess I just don’t fully understand why casting a float t[5] to float8 *t fails to work.

On a side note, is cuda-memcheck supposed to catch misaligned pointers?

Yes, for CUDA codes, cuda-memcheck should identify an attempt to use (dereference) a misaligned pointer.

An array of T has the alignment requirements of T (the type of each element). So an array float t[5] is aligned according to the requirements of float, i.e. 32-bit aligned.

Predefined, built-in compound types (under the hood: structs with additional alignment attributes) either have the alignment indicated by their size, e.g. float4 is 128-bit aligned, or the next higher power of two if their size is not a power of two, e.g. float3 is also 128-bit aligned.
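
The array rule is easy to verify on the host side (C11; a sketch of my own, not tied to the project code):

    #include <stdio.h>

    int main(void)
    {
        /* An array only inherits the alignment requirement of its element type. */
        printf("alignof(float)    = %zu\n", _Alignof(float));     /* typically 4 */
        printf("alignof(float[5]) = %zu\n", _Alignof(float[5]));  /* also 4, not 32 */
        return 0;
    }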

So float t[5] is 32-bit aligned, and float8 is 256-bit (32-byte) aligned. Maybe that explains it.

Does that mean I cannot cast a "float t[4]" to float4 either?

Also, I made the following change to declare my float t[5] array as 128-byte aligned, but it still gives incorrect results.

diff --git a/src/mcx_core.cl b/src/mcx_core.cl
index 7fa8e23..bf4aaf6 100644
--- a/src/mcx_core.cl
+++ b/src/mcx_core.cl
@@ -308,7 +308,8 @@ __kernel void mcx_main_loop(const int nphoton, const int ophoton,__global const
      int flipdir=0;
 
      //for MT RNG, these will be zero-length arrays and be optimized out
-     RandType t[RAND_BUF_LEN],tnew[RAND_BUF_LEN];
+     RandType t[RAND_BUF_LEN] __attribute__ ((aligned (128)));
+     RandType tnew[RAND_BUF_LEN] __attribute__ ((aligned (128)));
      float4 prop;    //can become float2 if no reflection
 
      float len,cphi,sphi,theta,stheta,ctheta,tmp0,tmp1;

Yes, the same alignment issue exists for 'float t[4]' vs. 'float4 t'. I don't think alignment attributes can meaningfully be applied to arrays, which would explain why your approach failed. But I have never used OpenCL, so by all means check the specification to confirm or refute.
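
For what it's worth, OpenCL's built-in vloadn/vstoren functions are specified to require only the alignment of the scalar element type, so they may offer a register-friendly alternative to the union (worth double-checking in the specification). A sketch, untested and not adapted to the MCX code:

    __kernel void logistic_step(__global float *out)
    {
        float t[5] = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f};  /* 4-byte alignment suffices */
        float4 v = vload4(0, t);                      /* loads t[0..3] safely */
        v = (float4)(4.0f) * v * ((float4)(1.0f) - v);
        vstore4(v, 0, t);                             /* stores back into t[0..3] */
        t[4] = 4.f * t[4] * (1.f - t[4]);             /* scalar tail element */
        out[get_global_id(0)] = t[0] + t[4];
    }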