Bug with !$acc routine seq?

Hi,

I’m getting quite confused by a code that I’m trying to parallelize with OpenACC. For simplicity, I have reduced it to this toy example:

module test_m
  implicit none
contains
  
  function dou(a)
    !$acc routine seq
    double precision :: dou,a

    dou=a*2.0
  end function dou
end module test_m


program test
  use test_m
  implicit none
  
  double precision, dimension(1000,1000) :: a,b
  integer :: i,j
  
  a=0.0 ; b=0.0
   
  !$acc data copy(a,b)
  !$acc parallel loop collapse(2) default(none) private(i,j)
  do i=1,1000
     do j=1,1000
        a(i,j) = (i*1000.0)*j
        b(i,j) = (i*1000.0)*j

        a(i,j) = 2.0*a(i,j)
        b(i,j) = dou(b(i,j))
     end do
  end do
  !$acc end parallel loop
  !$acc end data
  
  print*, sum(a), sum(b)
end program test

First, when I compile it, I get a message saying that the loop could not be parallelized because it contains a call, which looks strange, since I already added !$acc routine seq to the only function being called.

[test]$ pgf90 --version
pgf90 18.10-1 64-bit target on x86-64 Linux -tp haswell

[test]$ pgf90 -acc -Minfo  -O3 test.f90 -o test
dou:
      5, Generating acc routine seq
         Generating Tesla code
test:
     21, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     23, Generating copy(b(:,:),a(:,:))
     24, Accelerator kernel generated
         Generating Tesla code
         25, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         26,   ! blockidx%x threadidx%x collapsed
     26, Loop not vectorized/parallelized: contains call
     36, sum reduction inlined
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
[test]$

More worryingly, the results depend on the order of the assignments. When I run the code as shown, the output matches the serial version:

[test]$ ./test
    501000499999440.0         501000499999440.0

But if I just reverse the order of the two assignments in the main loop, I get incorrect results:

[test]$ diff -C 2 test2.f90 test.f90
*** test2.f90   Mon Apr 29 10:05:51 2019
--- test.f90    Mon Apr 29 10:05:42 2019
***************
*** 28,33 ****
          b(i,j) = (i*1000.0)*j

-         b(i,j) = dou(b(i,j))
          a(i,j) = 2.0*a(i,j)
       end do
    end do
--- 28,33 ----
          b(i,j) = (i*1000.0)*j

          a(i,j) = 2.0*a(i,j)
+         b(i,j) = dou(b(i,j))
       end do
    end do
[test]$ pgf90 -acc -O3 test2.f90 -o test2
[test]$ ./test2
    501000499999440.0         0.000000000000000

This looks like a textbook example, so I don’t understand how I can get this strange behaviour…

Ángel de Vicente

Hi Angel,

26, Loop not vectorized/parallelized: contains call

This is coming from the host side, where the compiler can’t vectorize or auto-parallelize the loop due to the call. It’s extraneous to the device code generation. To see only the OpenACC compiler feedback messages, use the option “-Minfo=accel”.
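For example, the compile line from your post, narrowed to the accelerator feedback, would look like:

[test]$ pgf90 -acc -Minfo=accel -O3 test.f90 -o test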

then I get incorrect results:

This does look like a compiler issue where it’s incorrectly optimizing away some of the offsets into the fixed-size array. I have therefore filed a problem report (TPR #27100) and sent it to our compiler engineers for further investigation.

The error seems to occur only with our older non-LLVM-based compilers, so you can work around the problem by using the LLVM back end. We made LLVM the default in 19.1, but in 18.10 and earlier you need to add the flag “-Mllvm” to your compile line, or set your PATH to “$PGI/linux86-64-llvm/18.10/bin” to get LLVM.
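For instance, with 18.10 the compile line becomes:

[test]$ pgf90 -acc -Mllvm -O3 test.f90 -o test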

Also, the error seems to occur only with fixed-size arrays, so another workaround is to make “a” and “b” allocatable arrays (see the sketch below).
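Here is a minimal sketch of that workaround applied to your main program (the test_m module is unchanged; only the declarations plus an allocate/deallocate pair differ from the original):

program test
  use test_m
  implicit none

  ! allocatable instead of fixed-size arrays
  double precision, dimension(:,:), allocatable :: a,b
  integer :: i,j

  allocate(a(1000,1000), b(1000,1000))
  a=0.0 ; b=0.0

  !$acc data copy(a,b)
  !$acc parallel loop collapse(2) default(none) private(i,j)
  do i=1,1000
     do j=1,1000
        a(i,j) = (i*1000.0)*j
        b(i,j) = (i*1000.0)*j

        a(i,j) = 2.0*a(i,j)
        b(i,j) = dou(b(i,j))
     end do
  end do
  !$acc end parallel loop
  !$acc end data

  print*, sum(a), sum(b)

  deallocate(a,b)
end program test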

Thanks for the report!
Mat

Hello,

OK, I will use -Minfo=accel for the time being, then.

Also, I can confirm that with 18.7-0 the incorrect output was still there without -Mllvm, but that with -Mllvm all seems fine.

Many thanks,
AdV