Runtime error with nvfortran 20.7

hyzhou · September 15, 2020, 4:27pm

Hi,

My group is working on porting a CFD code to GPU with OpenACC. As a first step, we wanted to compile the code on CPU but issues came up.

On a Linux x86 system, the code hits a segmentation fault when built with nvfortran -g -Ktrap=fp -r8 -O0. In some tests, it reports the line number where an array index goes out of bounds, but the reported array bounds make no sense to me:

0: Subscript out of range for array mhdflux_v (FaceFlux.f90: 2983)
    subscript=2, lower bound=1640695287988, upper bound=1640695287990, dimension=1

In some other tests, it does not show the line number, just messages like

[ny01:09070] *** Process received signal ***
[ny01:09070] Signal: Segmentation fault (11)
[ny01:09070] Signal code:  (128)
[ny01:09070] Failing at address: (nil)

The code that is causing this issue is related to the usage of derived types in Fortran. We have a large derived type declared in one module and used in another module, with a mixture of scalars and vectors that looks like

type, public :: Param
     integer :: iLeft,  jLeft, kLeft
     integer :: iRight, jRight, kRight
     integer :: iBlockFace
     integer :: iDimFace
     integer :: iFluidMin = 1, iFluidMax = nFluid
     integer :: iVarMin   = 1, iVarMax   = nVar
     integer :: iEnergyMin = nVar+1, iEnergyMax = nVar + nFluid

     integer :: iFace, jFace, kFace   

     real :: CmaxDt
     real :: Area2, AreaX, AreaY, AreaZ, Area = 0.0
     real :: DeltaBnL, DeltaBnR
     real :: DiffBb ! (1/4)(BnL-BnR)^2
     real :: StateLeft_V(nVar)
     real :: StateRight_V(nVar)
     real :: FluxLeft_V(nVar+nFluid), FluxRight_V(nVar+nFluid)

     real :: Normal_D(3), NormalX, NormalY, NormalZ
     real :: Tangent1_D(3), Tangent2_D(3)
     real :: B0n, B0t1, B0t2
     real :: UnL, Ut1L, Ut2L, B1nL, B1t1L, B1t2L
     real :: UnR, Ut1R, Ut2R, B1nR, B1t1R, B1t2R

     real :: MhdFlux_V(     RhoUx_:RhoUz_)
     real :: MhdFluxLeft_V( RhoUx_:RhoUz_)
     real :: MhdFluxRight_V(RhoUx_:RhoUz_)

     real :: Enormal
     real :: Unormal_I(nFluid+1) = 0.0
     real :: UnLeft_I(nFluid+1)
     real :: UnRight_I(nFluid+1)
     real :: EtaJx, EtaJy, EtaJz, Eta     
     real :: InvDxyz, HallCoeff
     real :: HallJx, HallJy, HallJz
     logical :: UseHallGradPe = .false.
     real :: BiermannCoeff, GradXPeNe, GradYPeNe, GradZPeNe
     real :: DiffCoef, EradFlux=0.0, RadDiffCoef
     real :: HeatFlux, IonHeatFlux, HeatCondCoefNormal     
     real :: bCrossArea_D(3) = 0.0
     real :: B0x=0.0, B0y=0.0, B0z=0.0
     real :: ViscoCoeff
     logical :: IsBoundary
     real :: InvClightFace, InvClight2Face
     logical :: DoTestCell = .false.
     logical :: IsNewBlockVisco = .true.
     logical :: IsNewBlockGradPe = .true.
     logical :: IsNewBlockCurrent = .true.
     logical :: IsNewBlockHeatCond = .true.
     logical :: IsNewBlockIonHeatCond = .true.
     logical :: IsNewBlockRadDiffusion = .true.
     logical :: IsNewBlockAlfven = .true.
  end type Param

An object of this derived type is passed between several subroutines to set the parameters and intermediate values.
One of the arrays with declared range 2:4 in this derived type caused the issue. I have tried several different approaches to resolve this issue:

- turn off OpenMP
- use a local array (copy) instead of a pointer to the vectors
- reference p%MhdFlux_V, etc. directly, without using the associate block
- change the vector range from 2:4 to 1:3
- move the vectors into a separate type declaration

However, none of these worked. An older version of this module that does not use derived types compiles and runs without issue, which suggests that the problem is tied to how the derived type is used.

With -O2 or above, the code does not produce a runtime error, but the result is wrong. We have confirmed that the same code has no issue with gfortran, nagfor and ifort. We have also run valgrind on the gcc build, and it showed no memory issues.
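
A minimal sketch of how the reported bounds could be checked in isolation, assuming the array is a component with declared range 2:4 that is reached both directly and through associate, in the host and in a contained subroutine. The module, type, and variable names here (ModParamSketch, ParamSketch, Flux_V) are hypothetical stand-ins, and the parameter values are assumptions chosen only to give the 2:4 range; this is not the actual code from FaceFlux.f90.

```fortran
! Hypothetical names throughout; only the 2:4 range mirrors the description above.
module ModParamSketch
  implicit none
  integer, parameter :: RhoUx_ = 2, RhoUz_ = 4   ! assumed values giving the range 2:4
  type, public :: ParamSketch
     real :: MhdFlux_V(RhoUx_:RhoUz_) = 0.0
  end type ParamSketch
end module ModParamSketch

program check_bounds
  use ModParamSketch
  implicit none
  type(ParamSketch) :: p

  ! Bounds seen through the component itself
  print *, 'direct:   ', lbound(p%MhdFlux_V, 1), ubound(p%MhdFlux_V, 1)

  ! Bounds seen through an associate name in the host
  associate (Flux_V => p%MhdFlux_V)
    print *, 'associate:', lbound(Flux_V, 1), ubound(Flux_V, 1)
    call child
  end associate

contains

  subroutine child()
    ! Same selector again, this time inside a contained subroutine
    associate (Flux_V => p%MhdFlux_V)
      print *, 'contained:', lbound(Flux_V, 1), ubound(Flux_V, 1)
    end associate
  end subroutine child

end program check_bounds
```

A standard-conforming compiler should report a lower bound of 2 and an upper bound of 4 on all three lines, so any line that prints something else, or a crash before it prints, narrows down which access path is affected.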

MatColgrove · September 15, 2020, 6:46pm

Would you be able to provide a reproducing example that we can use to investigate?

If not, can you post the section of code where the out-of-bound error occurs?
If you run the code through a debugger, does the seg fault occur in the same spot?

In your type, the MhdFlux_V array, which is the one named in the out-of-bounds error, is declared as:

 real :: MhdFlux_V(     RhoUx_:RhoUz_)
 real :: MhdFluxLeft_V( RhoUx_:RhoUz_)
 real :: MhdFluxRight_V(RhoUx_:RhoUz_)

Though, I don’t see where “RhoUx_” or “RhoUz_” are declared. Where do these variables get declared and what are their values?

hyzhou · September 15, 2020, 6:57pm

The indexes RhoUx_ and RhoUz_ are constant parameters declared in another module. I tried to reproduce the issue with a simple program, but have had no success yet. Do you mind if I share the entire code with a makefile and instructions to run?

Thanks.

MatColgrove · September 15, 2020, 7:00pm

No, the full source is fine. If we can better understand why it’s erroring, it may be easier to write a reproducer, assuming that it’s a compiler issue.

hyzhou · September 15, 2020, 7:05pm

Since our code is not fully open-source yet, is there a way I can share the files privately rather than posting them publicly online?

hyzhou · October 8, 2020, 4:08pm

It turned out that nvfortran 20.7 does not correctly handle the associate construct in a contained subroutine that accesses the derived type components.

MatColgrove · March 24, 2022, 6:09pm

Hi hyzhou,

We were able to reduce the original issue down to the following small reproducer. There appeared to be a problem with internal procedures that are called within an associate when the internal procedure has an identical associate. If the associate expression (o%x in this case) was changed to something else (say, o%y), the code worked fine. However, the original version should be correct, and engineering has fixed the problem in our 22.3 release.

For example:

% cat test.F90
program p
  type t
    integer :: x(10)
    integer :: y(10)
  end type
  type(t) :: o
  associate (z=>o%x) ! this fails if child also has associate (z=>o%x)
  ! associate (z=>o%y) ! this works if child has associate (z=>o%x)
    call child
  end associate
contains
  subroutine child()
    associate (z=>o%x) ! this fails if parent has associate (z=>o%x)
    ! associate (z=>o%y) ! this works if parent has associate (z=>o%x)
      print *, lbound(z),ubound(z)
    end associate
  end subroutine
end

Fails in 22.2:
% nvfortran test.F90 -V22.2 -fast; a.out
Segmentation fault

Works correctly in 22.3:
% nvfortran test.F90 -V22.3 -fast ; a.out
            1           10

-Mat
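
For anyone who cannot move to 22.3, one possible way to avoid the pattern is sketched below. It is an assumption, not something confirmed in this thread or tested against the affected releases: the internal procedure receives the array as an argument, so it does not repeat the parent's associate. The program name p_workaround and the dummy argument name a are hypothetical.

```fortran
! Untested assumption: avoids repeating the parent's associate in the child.
program p_workaround
  implicit none
  type t
    integer :: x(10)
    integer :: y(10)
  end type
  type(t) :: o
  associate (z => o%x)
    call child(z)            ! pass the associated array explicitly
  end associate
contains
  subroutine child(a)
    ! Assumed-shape dummy: the lower bound restarts at 1 inside child,
    ! so an array declared 2:4 would be seen as 1:3 here.
    integer, intent(in) :: a(:)
    print *, lbound(a), ubound(a)
  end subroutine
end program p_workaround
```

The trade-off is the bounds remapping noted in the comment: code in the child that indexes with the original lower bound would need the dummy declared with an explicit lower bound instead, for example a(2:).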
