Applying `batched_mul` from `NNlib`/`NNlibCUDA` to a `CuArray` tensor and a `CuSparseMatrix` falls back to the generic method and performs scalar indexing. Is there a GPU-efficient way to broadcast multiplication of a `CuSparseMatrix` over a `CuArray` tensor?

P.S. There is currently an issue where the `similar` call in `batched_mul` causes the output to be a CPU array, but even fixing that does not seem to solve the scalar indexing problem.

MWE:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, NNlib, NNlibCUDA
CUDA.allowscalar(false)
a = CuArray(ones(2, 2, 3))
# CuSparseMatrix is abstract and has no constructor; use a concrete format
b = CuSparseMatrixCSC(sparse(ones(2, 2)))
c = batched_mul(a, b)  # this does scalar indexing
```
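One possible workaround, since the sparse matrix here is shared across all batch slices: fold the batch dimension into a single sparse-dense multiply using the identity `(A*B)' = B'*A'`, so that the sparse matrix lands on the left, where CUSPARSE's sparse-times-dense kernel applies. This is only a sketch under that assumption (a single `b` for every slice), not a general `batched_mul` replacement:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

a = CuArray(ones(2, 2, 3))                 # m × n × K batch of dense slices
b = CuSparseMatrixCSC(sparse(ones(2, 2)))  # n × p sparse matrix, shared across the batch

# Each slice computes a[:, :, k] * b. Rewriting via (A*B)' = B'*A',
# all K slices collapse into one sparse' * dense product:
at = permutedims(a, (2, 1, 3))             # transpose every slice: n × m × K
ct = b' * reshape(at, size(at, 1), :)      # one CUSPARSE call: p × (m*K)
c  = permutedims(reshape(ct, size(b, 2), size(a, 1), size(a, 3)), (2, 1, 3))
# c[:, :, k] should equal a[:, :, k] * b, with no scalar indexing
```

Whether `b' * dense` actually dispatches to a CUSPARSE kernel (rather than a generic fallback) depends on the CUDA.jl version, so this is worth checking with `CUDA.allowscalar(false)` enabled.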