Release Notes :: CUDA Toolkit Documentation


This section summarizes the changes in CUDA 11.0 GA since the 11.0 RC release.

General CUDA

  • Added support for Ubuntu 20.04 LTS on x86_64 platforms.
  • Arm server platforms (arm64 sbsa) are supported with NVIDIA T4 GPUs.

NPP New Features

  • Added Batched Image Label Markers Compression, which removes the sparseness between the marker label IDs output by a LabelMarkers call.
  • Added Image Flood Fill functionality, which fills a connected region of an image with a specified new value.
  • Stability and performance fixes to Image Label Markers and Image Label Markers Compression.
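
To illustrate what label compression means, the following CPU-side sketch remaps sparse label IDs to a contiguous range. It is not the NPP API: the function name compressLabels, the treatment of 0 as background, and the choice to hand out new IDs in ascending order of the old ones are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical CPU reference, not an NPP call: remap sparse marker label IDs
// to the contiguous range 1..N. Label 0 is assumed to be background and kept.
// Returns N, the number of distinct non-zero labels.
std::uint32_t compressLabels(std::vector<std::uint32_t>& labels) {
    std::map<std::uint32_t, std::uint32_t> remap;  // old ID -> new ID
    for (std::uint32_t id : labels) {
        if (id != 0) remap.emplace(id, 0u);
    }
    std::uint32_t next = 0;
    for (auto& entry : remap) entry.second = ++next;  // ascending old-ID order
    for (std::uint32_t& id : labels) {
        if (id != 0) id = remap[id];
    }
    assert(next <= labels.size());  // there cannot be more labels than pixels
    return next;
}
```

With this sketch, the labels {0, 7, 7, 42, 0, 1000, 42} become {0, 1, 1, 2, 0, 3, 2}, so downstream code can size per-label buffers by the label count rather than by the largest ID.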

nvJPEG New Features

  • nvJPEG allows the user to allocate separate memory pools for each chroma subsampling format, which helps avoid memory re-allocation overhead. This behavior is enabled by passing the newly added flag NVJPEG_FLAGS_ENABLE_MEMORY_POOLS to the nvjpegCreateEx API.
  • The nvJPEG encoder now allows the compressed bitstream to reside in GPU memory.

cuBLAS New Features

  • cuBLASLt Matrix Multiplication adds support for fused ReLU and bias operations for all floating point types except double precision (FP64).
  • Improved batched TRSM performance for matrices larger than 256.
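
As a rough picture of what a fused bias and ReLU operation computes, the sketch below evaluates D = ReLU(A * B + bias) on the CPU. It is not the cuBLASLt API. The function name matmulBiasRelu is hypothetical, and the sketch assumes column-major storage and a bias vector with one entry per row of D, broadcast across its columns.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not the cuBLASLt API.
// Computes D = ReLU(A * B + bias) for column-major matrices:
// A is m x k, B is k x n, D is m x n. The bias vector is assumed to hold
// m entries, one per row of D, broadcast across the columns.
std::vector<float> matmulBiasRelu(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  const std::vector<float>& bias,
                                  std::size_t m, std::size_t n, std::size_t k) {
    assert(A.size() == m * k && B.size() == k * n && bias.size() == m);
    std::vector<float> D(m * n);
    for (std::size_t col = 0; col < n; ++col) {
        for (std::size_t row = 0; row < m; ++row) {
            float acc = bias[row];  // start from the bias term
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[row + p * m] * B[p + col * k];
            }
            D[row + col * m] = std::max(acc, 0.0f);  // ReLU
        }
    }
    return D;
}
```

Fusing these steps means the bias addition and the ReLU are applied while the product is being written out, instead of in separate passes over D.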

cuSOLVER New Features

  • Added a 64-bit API for GESVD. The new routine cusolverDnGesvd_bufferSize() supplies the parameters missing from the 32-bit API cusolverDn[S|D|C|Z]gesvd_bufferSize(), so that it can estimate the workspace size accurately.
  • Added single-process multi-GPU Cholesky factorization capabilities (POTRF, POTRS and POTRI) to the cusolverMG library.
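
For readers unfamiliar with the LAPACK-style names, POTRF factors a symmetric positive definite matrix as A = L * L^T and POTRS uses that factor to solve A x = b. The single-threaded sketch below shows only this mathematics; it is not the cusolverMG API. The function names choleskyLower and choleskySolve are hypothetical, column-major storage is assumed, and POTRI (the inverse) is not covered.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not the cusolverMG API.
// In-place Cholesky factorization A = L * L^T of an n x n symmetric positive
// definite matrix stored column-major. Only the lower triangle is read and
// overwritten with L. Returns false if a non-positive pivot is found.
bool choleskyLower(std::vector<double>& A, std::size_t n) {
    assert(A.size() == n * n);
    for (std::size_t j = 0; j < n; ++j) {
        double diag = A[j + j * n];
        for (std::size_t p = 0; p < j; ++p) diag -= A[j + p * n] * A[j + p * n];
        if (diag <= 0.0) return false;
        diag = std::sqrt(diag);
        A[j + j * n] = diag;
        for (std::size_t i = j + 1; i < n; ++i) {
            double v = A[i + j * n];
            for (std::size_t p = 0; p < j; ++p) v -= A[i + p * n] * A[j + p * n];
            A[i + j * n] = v / diag;
        }
    }
    return true;
}

// Solve A x = b in place, given the factor L produced by choleskyLower:
// forward substitution with L, then back substitution with L^T.
void choleskySolve(const std::vector<double>& L, std::size_t n,
                   std::vector<double>& b) {
    assert(L.size() == n * n && b.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t p = 0; p < i; ++p) b[i] -= L[i + p * n] * b[p];
        b[i] /= L[i + i * n];
    }
    for (std::size_t i = n; i-- > 0;) {
        for (std::size_t p = i + 1; p < n; ++p) b[i] -= L[p + i * n] * b[p];
        b[i] /= L[i + i * n];
    }
}
```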

cuSOLVER Resolved Issues

  • Fixed an issue where SYEVD/SYGVD would fail and return error code 7 if the matrix was zero and its dimension was larger than 25.

cuSPARSE New Features

  • Added new Generic APIs for Axpby (cusparseAxpby), Scatter (cusparseScatter), Gather (cusparseGather) and Givens rotation (cusparseRot). The __nv_bfloat16 and __nv_bfloat162 data types and 64-bit indices are also supported.
  • This release adds the following features for cusparseSpMM:

    • Support for row-major layout with both the CSR and COO formats
    • Support for 64-bit indices
    • Support for __nv_bfloat16 and __nv_bfloat162 data types
    • Support for the following strided batch mode:
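
Returning to the generic vector APIs named in the first cuSPARSE item above (Axpby, Scatter, Gather and Givens rotation), their element-wise semantics can be sketched on the CPU as follows. This is not the cuSPARSE API: the SparseVec type and the function names are hypothetical, float is used in place of the supported data types, and the indices are assumed to be unique and in range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference types and functions, not the cuSPARSE API.
// A sparse vector stores its non-zero values and their positions in the
// corresponding dense vector; 64-bit indices are used here.
struct SparseVec {
    std::vector<std::int64_t> indices;
    std::vector<float> values;
};

// Axpby: Y = alpha * X + beta * Y, with X sparse and Y dense.
void axpby(float alpha, const SparseVec& x, float beta, std::vector<float>& y) {
    assert(x.indices.size() == x.values.size());
    for (float& v : y) v *= beta;
    for (std::size_t i = 0; i < x.indices.size(); ++i) {
        y[static_cast<std::size_t>(x.indices[i])] += alpha * x.values[i];
    }
}

// Gather: copy the entries of dense Y selected by X's indices into X's values.
void gather(const std::vector<float>& y, SparseVec& x) {
    assert(x.indices.size() == x.values.size());
    for (std::size_t i = 0; i < x.indices.size(); ++i) {
        x.values[i] = y[static_cast<std::size_t>(x.indices[i])];
    }
}

// Scatter: write X's values into dense Y at X's indices.
void scatter(const SparseVec& x, std::vector<float>& y) {
    assert(x.indices.size() == x.values.size());
    for (std::size_t i = 0; i < x.indices.size(); ++i) {
        y[static_cast<std::size_t>(x.indices[i])] = x.values[i];
    }
}

// Givens rotation: X' = c * X + s * Y and Y' = c * Y - s * X,
// applied only at the positions listed in X's indices.
void rot(float c, float s, SparseVec& x, std::vector<float>& y) {
    assert(x.indices.size() == x.values.size());
    for (std::size_t i = 0; i < x.indices.size(); ++i) {
        const std::size_t idx = static_cast<std::size_t>(x.indices[i]);
        const float xv = x.values[i];
        const float yv = y[idx];
        x.values[i] = c * xv + s * yv;
        y[idx] = c * yv - s * xv;
    }
}
```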

cuFFT New Features

  • cuFFT now accepts the __nv_bfloat16 input and output data type for power-of-two sizes; computations within the kernels are performed in single precision.
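
The pattern of bfloat16 storage with single-precision arithmetic can be sketched on the host as follows. These helpers are hypothetical and are not the cuda_bf16.h API. They assume that bfloat16 is the upper 16 bits of an IEEE-754 binary32 value and that conversion rounds to nearest even.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical host-side helpers, not the cuda_bf16.h API.
// bfloat16 keeps the sign bit, the 8 exponent bits and the top 7 mantissa
// bits of a binary32 value, so conversion works on the upper 16 bits.
std::uint16_t floatToBf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        // NaN: keep it a NaN after truncation by forcing a mantissa bit.
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    const std::uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;  // round to nearest, ties to even
    return static_cast<std::uint16_t>(bits >> 16);
}

float bf16ToFloat(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Inputs and output are stored as bfloat16; the arithmetic itself is done
// in single precision and rounded once at the end.
std::uint16_t bf16MulAdd(std::uint16_t a, std::uint16_t b, std::uint16_t c) {
    return floatToBf16(bf16ToFloat(a) * bf16ToFloat(b) + bf16ToFloat(c));
}
```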

Known Issues

  • Starting with CUDA 11.0, the minimum recommended GCC compiler is GCC 5, due to C++11 requirements in CUDA libraries such as cuFFT and CUB. On distributions such as RHEL 7 or CentOS 7, which may use an older GCC toolchain by default, a newer GCC toolchain is recommended with CUDA 11.0. Newer GCC toolchains are available with the Red Hat Developer Toolset.
  • cublasGemmStridedBatchedEx() and cublasLtMatmul() may cause misaligned memory access errors in rare cases, when Atype or Ctype is CUDA_R_16F or CUDA_R_16BF, strideA, strideB or strideC is not a multiple of 8, and the internal heuristics select certain Tensor Core enabled kernels. A suggested workaround is to specify CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_<A,B,C,D>_BYTES according to the matrix stride used when calling cublasLtMatmulAlgoGetHeuristic().
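
The workaround in the last item needs an alignment value that matches the strides in use. One possible way to derive it is sketched below: take the largest power-of-two byte alignment shared by the base address and the batch stride. The helper stridedBatchAlignmentBytes is hypothetical and not part of cuBLAS, and the 256-byte cap is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of cuBLAS: the largest power-of-two byte
// alignment that every matrix of a strided batch is guaranteed to have,
// given the base address, the batch stride in elements and the element size.
// The default cap of 256 bytes is an assumption.
std::uint32_t stridedBatchAlignmentBytes(std::uintptr_t baseAddress,
                                         std::uint64_t strideElems,
                                         std::uint32_t elemSizeBytes,
                                         std::uint32_t maxAlignment = 256) {
    assert(elemSizeBytes > 0 && maxAlignment > 0);
    const std::uint64_t strideBytes = strideElems * elemSizeBytes;
    std::uint32_t alignment = 1;
    // Keep doubling while both the base and the stride stay divisible.
    while (alignment * 2 <= maxAlignment &&
           baseAddress % (alignment * 2) == 0 &&
           strideBytes % (alignment * 2) == 0) {
        alignment *= 2;
    }
    return alignment;
}
```

For a well-aligned base pointer and 2-byte elements, a stride of 8 elements yields 16 bytes while a stride of 6 elements yields only 4 bytes. A value computed this way could then be set on the matmul preference, for example through cublasLtMatmulPreferenceSetAttribute(), before the heuristic is queried.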

Deprecations

The following functions have been removed:

  • cusparse<t>gemmi()
  • cusparseXaxpyi, cusparseXgthr, cusparseXgthrz, cusparseXroti, cusparseXsctr