C contiguous stride not working for the example #178

Jinghong-Zhang · 2025-03-11T16:55:35Z

Jinghong-Zhang
Mar 11, 2025

Hi cuquantum team, I am experimenting with the example python/samples/cutensornet/tensornet_example.py, and I wonder whether I can use C-contiguous arrays both for the contraction path optimization and for the actual contraction. To do this, I replaced
strides_in = (0,0,0,0) with strides_in = (A_d.strides, B_d.strides, C_d.strides, D_d.strides), and also replaced

desc_net = cutn.create_network_descriptor(handle,
    num_inputs, num_modes_in, extents_in, strides_in, modes_in, qualifiers_in,  # inputs
    nmode_R, extent_R, 0, modes_R,  # output
    data_type, compute_type)

with

desc_net = cutn.create_network_descriptor(handle,
    num_inputs, num_modes_in, extents_in, strides_in, modes_in, qualifiers_in,  # inputs
    nmode_R, extent_R, R_d.strides, modes_R,  # output
    # nmode_R, extent_R, 0, modes_R,  # output
    data_type, compute_type)

and changed nothing else. However, the output is:

cuTensorNet-vers: 20600
===== device info ======
GPU-local-id: 0
GPU-name: NVIDIA A100-SXM4-40GB MIG 3g.20gb
GPU-clock: 1410000
GPU-memoryClock: 1215000
GPU-nSM: 42
GPU-major: 8
GPU-minor: 0
========================
Include headers and define data types.
Define network, modes, and extents.
(262144,)
((4,), (4,), (4,), (4,))
Initialize the cuTensorNet library and create a network descriptor.
Traceback (most recent call last):
  File "pathtotest/cutensornet_test/stride_modification.py", line 125, in <module>
    cutn.contraction_optimize(handle, desc_net, optimizer_config, workspace_limit, optimizer_info)
  File "cuquantum/cutensornet/cutensornet.pyx", line 754, in cuquantum.cutensornet.cutensornet.contraction_optimize
  File "cuquantum/cutensornet/cutensornet.pyx", line 768, in cuquantum.cutensornet.cutensornet.contraction_optimize
  File "cuquantum/cutensornet/cutensornet.pyx", line 277, in cuquantum.cutensornet.cutensornet.check_status
cuquantum.cutensornet.cutensornet.cuTensorNetError: ALL_HYPER_SAMPLES_FAILED (24): CUTENSORNET_STATUS_ALL_HYPER_SAMPLES_FAILED

Could you tell me what happened here, or how I can use C-contiguous arrays in this example instead of F-contiguous ones? Thanks in advance!

yangcal · 2025-03-11T18:25:20Z

yangcal
Mar 11, 2025
Maintainer

Hello,

cuTensorNet does support C-contiguous arrays, but your current approach misses two things:

1. All input/output arrays are generated as 1D cupy.ndarrays in the script, so the A_d.strides in your change has only one entry and does not correspond to the full ndarray. At the top of the script, you would need to do cp.random.random((np.prod(extent_A),), dtype=np.float32).reshape(extent_A) so that A_d.strides has one entry per mode. This is needed for all input/output arrays for generic strides support.
2. cuTensorNet requires strides in units of elements, while cupy.ndarray.strides is in bytes (scaled by the element's item size). strides_in therefore needs to be modified as below, and the same applies to the strides specification for R_d:

strides_in = [[s // o.itemsize for s in o.strides] for o in (A_d, B_d, C_d, D_d)]
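
Both points can be checked on the host. The following is a minimal sketch that assumes NumPy as a stand-in for CuPy (the two report strides in bytes in the same way); the helper name element_strides and the extent (8, 4, 2) are illustrative and do not come from the sample script.

```python
import numpy as np

def element_strides(arr):
    """Convert byte strides (NumPy/CuPy convention) to strides in elements."""
    return tuple(s // arr.itemsize for s in arr.strides)

# Hypothetical extent, chosen only for illustration.
extent_A = (8, 4, 2)

# A flat 1D buffer, as in the original sample: it has a single stride.
flat = np.random.random(int(np.prod(extent_A))).astype(np.float32)

# Reshaping yields a C-contiguous view with one stride per mode.
shaped = flat.reshape(extent_A)

# An F-contiguous copy of the same data, for comparison.
fortran = np.asfortranarray(shaped)

flat_strides = element_strides(flat)
c_strides = element_strides(shaped)
f_strides = element_strides(fortran)

print(shaped.strides)  # byte strides: (32, 8, 4)
print(flat_strides)    # (1,)
print(c_strides)       # (8, 2, 1)
print(f_strides)       # (1, 8, 32)
```

The 1D buffer exposes a single stride, which is why passing A_d.strides from the unmodified sample cannot describe a multi-mode tensor.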

One more question, is there any particular reason why you're not using our pythonic API cuquantum.contract/Network directly? It supports ndarray as input (any strides) and works just like np.einsum.

5 replies

Jinghong-Zhang Mar 11, 2025
Author

Hello,

Thank you so much for your reply! It solved my problem.

As for the question, I was trying to contract very large tensors (storing the tensors alone takes several GB) using cuquantum.contract. But the automatic slicing does not seem to work well, because the intermediates created are often larger than it expects, so the number of slices is often too small. It often reports an OOM error, so I had to slice the contraction manually using for-loops. I felt there should be a better way to handle the slices, so I looked at the example and tried to find the slicing options in the cutensornet API for finer control.

I ended up concluding that there is no option to break an axis into multiple slices (i.e., if I have an axis with extent 125, it can only create 125 slices, instead of 5 slices with 25 elements each). Sometimes I only need 250 slices to fit in memory, but if no axis has extent 250, it will pick the next largest axis, which is often much larger in my case, say 1160. The contraction with 1160 slices is then much slower than my manual slicing (i.e., 125 slices along the original axis and 2 slices along the axis with 1160 elements). I wonder if there is some option that I didn't discover. I would appreciate any guidance on this.

yangcal Mar 11, 2025
Maintainer

Hello,

We do support slicing not just by 1 per slice, but by any other integer that evenly divides the full extent. This is documented here for contraction execution, and a minimal example is provided below:

import cupy as cp
from cuquantum import contract

t = cp.random.random((100, 100))

out = contract('ij,jk->ik', t, t, optimize={'slicing': [('i', 50), ('j', 20)]}) # note that both 50 and 20 can evenly divide the shape 100
out1 = cp.einsum('ij,jk->ik', t, t)
print(abs(out-out1).sum())
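
To make the idea of a sliced extent concrete, here is a minimal host-side sketch in NumPy. It only illustrates the concept of cutting the shared mode j into equal chunks and summing the partial results; it is not how cuTensorNet implements slicing internally, and it assumes the sliced extent is the size of each chunk, as the "1 per slice" wording above suggests. The function name sliced_matmul is hypothetical.

```python
import numpy as np

def sliced_matmul(a, b, slice_extent):
    """Contract 'ij,jk->ik' by slicing the shared mode j into chunks of slice_extent."""
    j = a.shape[1]
    if j % slice_extent != 0:
        raise ValueError("slice_extent must evenly divide the extent of mode j")
    out = np.zeros((a.shape[0], b.shape[1]), dtype=np.result_type(a, b))
    # Each slice is a smaller contraction; the partial results are accumulated.
    for start in range(0, j, slice_extent):
        stop = start + slice_extent
        out += a[:, start:stop] @ b[start:stop, :]
    return out

rng = np.random.default_rng(0)
t = rng.random((100, 100))

ref = np.einsum('ij,jk->ik', t, t)
out = sliced_matmul(t, t, 20)  # 5 slices of extent 20 along j
max_err = np.abs(out - ref).max()
print(max_err)
```

Each iteration only needs a (100, 20) and a (20, 100) block at a time, which is the memory saving slicing is meant to provide; the requirement that the chunk size divide the extent mirrors the constraint noted in the snippet above.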

Jinghong-Zhang Mar 11, 2025
Author

Hello,

Thanks for your answer; it is very helpful.

I was originally wondering whether the number of slices on different axes can be obtained directly from the contraction path optimization instead of being chosen manually, because one doesn't know the size of the intermediates a priori. I also wonder whether autotuning (selecting the best cuTENSOR kernel) happens inside the cuquantum.contract function, or whether I should use cutensornet if I want an autotuned contraction.

yangcal Mar 11, 2025
Maintainer

The path finder component of our library currently does not support searching for a path with a sliced extent other than 1. This is because slicing, intermediate formation, and optimization all happen concurrently, and there is no known good way to consider a non-1 sliced extent without significantly increasing the search space. The execution part can support this, just like the snippet that I shared, assuming you know a priori which mode you want to slice. I think you should also be able to provide a path with the optimize arg, but you just can't use our path finder to find a path with non-1 slicing.

RE: autotune, the function cuquantum.contract is meant for one-time use, so autotuning is not performed when calling it. Autotuning is generally recommended for cases where you need to execute the same contraction multiple times, for example with the same shapes but different data. This is exposed as the Network.autotune API documented here, and one example can be found here.

Jinghong-Zhang Mar 11, 2025
Author

Thanks! This was very helpful and I learned a lot.
