Description
Description of bug
I am attempting to use the CUDA extension for NDTensors. I am finding that tensor contraction sometimes fails, along with other functions like permute, as well as the NDTensors.generic_randn call used in the example of the NDTensors CUDA extension (NDTensors/ext/examples/NDTensorCUDA.jl). They all fail with ERROR: LoadError: Scalar indexing is disallowed. (or emit a similar warning in the REPL, where scalar indexing is allowed).
Minimal code demonstrating the bug or unexpected behavior
Minimal runnable code
import CUDA
import ITensors
import NDTensors
i, j, k = ITensors.Index(3), ITensors.Index(5), ITensors.Index(2)
A, B = ITensors.randomITensor(i, j), ITensors.randomITensor(j, k)
# works when running script directly, but warns of scalar indexing in REPL
cuA, cuB = NDTensors.cu(A), NDTensors.cu(B)
contracted = cuA * cuB
@show NDTensors.cpu(contracted) ≈ A * B
# without @CUDA.allowscalar, fails with `ERROR: LoadError: Scalar indexing is disallowed.`
permuted = ITensors.permute(cuA, (j, i))
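For completeness, this is a sketch of the slow scalar-fallback check I mean in the comment above (assuming CUDA.@allowscalar, the usual GPUArrays escape hatch); it only confirms the permutation produces the right values, it does not make permute GPU-friendly:
# Not a real workaround: @allowscalar falls back to very slow element-wise access on the GPU array.
permuted = CUDA.@allowscalar ITensors.permute(cuA, (j, i))
@show NDTensors.cpu(permuted) ≈ ITensors.permute(A, (j, i))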
Expected output or behavior
NDTensors.cu() should work without a warning in the REPL on an ITensors.ITensor, and ITensors.permute should work without the scalar indexing error.
Actual output or behavior
Output of minimal runnable code
NDTensors.cpu(contracted) ≈ A * B = true
ERROR: LoadError: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] assertscalar(op::String)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/uOYfN/src/GPUArraysCore.jl:103
[3] getindex
@ ~/.julia/packages/GPUArrays/5XhED/src/host/indexing.jl:9 [inlined]
[4] getindex
@ ~/.julia/packages/StridedViews/MjjHH/src/stridedview.jl:97 [inlined]
[5] macro expansion
@ ~/.julia/packages/Strided/O8zfl/src/mapreduce.jl:330 [inlined]
[6] macro expansion
@ ./simdloop.jl:77 [inlined]
[7] macro expansion
@ ~/.julia/packages/Strided/O8zfl/src/mapreduce.jl:329 [inlined]
[8] _mapreduce_kernel!(f::Strided.CaptureArgs{NDTensors.var"#69#70", Tuple{Strided.Arg, Strided.Arg}}, op::Nothing, initop::Nothing, dims::Tuple{Int64, Int64}, blocks::Tuple{Int64, Int64}, arrays::Tuple{StridedViews.StridedView{Float64, 2, CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, typeof(identity)}, StridedViews.StridedView{Float64, 2, CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, typeof(identity)}, StridedViews.StridedView{Float64, 2, CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, typeof(identity)}}, strides::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}, Tuple{Int64, Int64}}, offsets::Tuple{Int64, Int64, Int64})
@ Strided ~/.julia/packages/Strided/O8zfl/src/mapreduce.jl:229
[9] _mapreduce_block!(f::Any, op::Any, initop::Any, dims::Tuple{Int64, Int64}, strides::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}, Tuple{Int64, Int64}}, offsets::Tuple{Int64, Int64, Int64}, costs::Tuple{Int64, Int64}, arrays::Tuple{StridedViews.StridedView{Float64, 2, CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, typeof(identity)}, StridedViews.StridedView{Float64, 2, CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, typeof(identity)}, StridedViews.StridedView{Float64, 2, CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, typeof(identity)}})
@ Strided ~/.julia/packages/Strided/O8zfl/src/mapreduce.jl:152
[10] _mapreduce_order!(f::Any, op::Any, initop::Any, dims::Tuple{Int64, Int64}, strides::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}, Tuple{Int64, Int64}}, arrays::Tuple{StridedViews.StridedView{Float64, 2, CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, typeof(identity)}, StridedViews.StridedView{Float64, 2, CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, typeof(identity)}, StridedViews.StridedView{Float64, 2, CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, typeof(identity)}})
@ Strided ~/.julia/packages/Strided/O8zfl/src/mapreduce.jl:138
[11] _mapreduce_fuse!
@ ~/.julia/packages/Strided/O8zfl/src/mapreduce.jl:116 [inlined]
[12] copyto!
@ ~/.julia/packages/Strided/O8zfl/src/broadcast.jl:35 [inlined]
[13] materialize!
@ ./broadcast.jl:884 [inlined]
[14] materialize!
@ ./broadcast.jl:881 [inlined]
[15] permutedims!(R::NDTensors.DenseTensor{Float64, 2, Tuple{ITensors.Index{Int64}, ITensors.Index{Int64}}, NDTensors.Dense{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}}, T::NDTensors.DenseTensor{Float64, 2, Tuple{ITensors.Index{Int64}, ITensors.Index{Int64}}, NDTensors.Dense{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}}, perm::Tuple{Int64, Int64}, f::NDTensors.var"#69#70")
@ NDTensors ~/.julia/packages/NDTensors/olKso/src/dense/densetensor.jl:246
[16] permutedims!!(R::NDTensors.DenseTensor{Float64, 2, Tuple{ITensors.Index{Int64}, ITensors.Index{Int64}}, NDTensors.Dense{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}}, T::NDTensors.DenseTensor{Float64, 2, Tuple{ITensors.Index{Int64}, ITensors.Index{Int64}}, NDTensors.Dense{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}}, perm::Tuple{Int64, Int64}, f::Function)
@ NDTensors ~/.julia/packages/NDTensors/olKso/src/dense/densetensor.jl:188
[17] permutedims!!
@ ~/.julia/packages/NDTensors/olKso/src/dense/densetensor.jl:186 [inlined]
[18] permutedims
@ ~/.julia/packages/NDTensors/olKso/src/tensoralgebra/generic_tensor_operations.jl:5 [inlined]
[19] permutedims
@ ~/.julia/packages/ITensors/Qe99p/src/tensor_operations/permutations.jl:64 [inlined]
[20] _permute(as::NDTensors.NeverAlias, T::NDTensors.DenseTensor{Float64, 2, Tuple{ITensors.Index{Int64}, ITensors.Index{Int64}}, NDTensors.Dense{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}}, new_inds::Tuple{ITensors.Index{Int64}, ITensors.Index{Int64}})
@ ITensors ~/.julia/packages/ITensors/Qe99p/src/tensor_operations/permutations.jl:69
[21] permute(as::NDTensors.NeverAlias, T::ITensors.ITensor, new_inds::Tuple{ITensors.Index{Int64}, ITensors.Index{Int64}})
@ ITensors ~/.julia/packages/ITensors/Qe99p/src/tensor_operations/permutations.jl:73
[22] #permute#271
@ ~/.julia/packages/ITensors/Qe99p/src/tensor_operations/permutations.jl:54 [inlined]
[23] permute(T::ITensors.ITensor, new_inds::Tuple{ITensors.Index{Int64}, ITensors.Index{Int64}})
@ ITensors ~/.julia/packages/ITensors/Qe99p/src/tensor_operations/permutations.jl:38
[24] top-level scope
@ ~/it_test2/test.jl:17
in expression starting at /home/william/it_test2/test.jl:17
Output of `cuA, cuB = NDTensors.cu(A), NDTensors.cu(B)` in REPL
┌ Warning: Performing scalar indexing on task Task (runnable) @0x00007f5e6b5fc970.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArraysCore ~/.julia/packages/GPUArraysCore/uOYfN/src/GPUArraysCore.jl:
Also, running `NDTensors/ext/examples/NDTensorCUDA.jl` results in a similar error
ERROR: LoadError: Scalar indexing is disallowed.
Invocation of setindex! resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] assertscalar(op::String)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/uOYfN/src/GPUArraysCore.jl:103
[3] setindex!
@ ~/.julia/packages/GPUArrays/5XhED/src/host/indexing.jl:17 [inlined]
[4] generic_randn(arraytype::Type{CuArray{T, 1} where T}, dim::Int64)
@ NDTensors ~/.julia/packages/NDTensors/olKso/src/abstractarray/fill.jl:8
[5] main()
@ Main ~/ITensors.jl/NDTensors/ext/examples/NDTensorCUDA.jl:21
[6] top-level scope
@ ~/ITensors.jl/NDTensors/ext/examples/NDTensorCUDA.jl:116
in expression starting at /home/william/ITensors.jl/NDTensors/ext/examples/NDTensorCUDA.jl:116
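For what it's worth, the setindex! failure above looks like the general GPUArrays restriction on element-wise writes into a CuArray outside the REPL, rather than anything specific to the example script. A minimal sketch of that pattern in plain CUDA.jl (no NDTensors involved), plus a device-side alternative, assuming scalar indexing is disallowed as it is when running a script:
import CUDA

# Filling a CuArray element by element hits the same scalar-indexing error in a script:
data = CUDA.CuArray{Float64}(undef, 4)
for n in 1:length(data)
    data[n] = randn()  # ERROR: Scalar indexing is disallowed.
end

# Generating the random numbers on the device avoids scalar indexing entirely:
data_ok = CUDA.randn(Float64, 4)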
Version information
- Output from versioninfo():
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 12 × Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 1 on 12 virtual cores
- Other relevant version info:
(it_test2) pkg> status ITensors
Status `~/it_test2/Project.toml`
[9136182c] ITensors v0.3.42
(it_test2) pkg> status NDTensors
Status `~/it_test2/Project.toml`
[23ae76d9] NDTensors v0.2.10
(it_test2) pkg> status CUDA
Status `~/it_test2/Project.toml`
[052768ef] CUDA v4.4.1
julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.2
NVIDIA driver 535.104.5
CUDA libraries:
- CUBLAS: 12.1.3
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.5
- CUSPARSE: 12.1.0
- CUPTI: 18.0.0
- NVML: 12.0.0+535.104.5
Julia packages:
- CUDA: 4.4.1
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0
Toolchain:
- Julia: 1.9.2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
1 device:
0: NVIDIA GeForce RTX 3090 (sm_86, 22.728 GiB / 24.000 GiB available)
I am very new to Julia (and ITensors.jl), so apologies if there is something obvious I'm missing!