
What's new?#

Release 1.5.1 (2026-02-03)#

Compiler: NVCC emulation#

  • -Wregister no longer defaults to an error in nvcc dialect mode.
  • -Xfatbin and -Xptxas are now parsed to extract meaningful flags from them.
  • nvcc -v's path outputs now reflect the actual CUDA install search process.
  • Fixed an edgecase of the target_overload attribute that led to incompatibility with some versions of the standard library.
  • Don't crash when a variable is illegally declared __host__ __device__.
  • nv_weak is now an alias of weak.
  • Added support for the -ptx flag.
  • Fixed some edgecases of argument translation for list-form arguments.
  • Dependency information is no longer duplicated for CUDA mode (for -M and friends).
  • Even more missing-typename relaxations to support the nvcc dialect.
  • Unqualified lookups of dependent members in base classes are now allowed in nvcc mode.
  • Fix some builtin macro pedantry so that programs checking with #if <foo> == 1, as well as those that simply use #ifdef <foo>, work (a sketch follows this list).
  • Implement some simple IR-level optimisations for __byte_perm().
  • --extended-lambda and --relaxed-constexpr now drive the same macro as they do in NVIDIA's implementation, even though they still do not change the behaviour of the semantic analyzer.
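
A hedged illustration of the builtin-macro fix above. The release notes do not name the affected macros, so __CUDACC__ (which CUDA compilers define to 1) stands in here:

    // Both styles of feature test now behave consistently. __CUDACC__ is only
    // an illustrative stand-in for the macros the fix actually touches.
    #ifdef __CUDACC__
        // reached whenever the macro is defined at all
    #endif
    #if __CUDACC__ == 1
        // reached only when the macro expands to the literal 1
    #endif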

Compiler: Inline PTX support#

  • Add minifloat types (fp8/6/4) to the PTX type system, pending implementation of the corresponding tensor core ops.
  • Fix subtle miscompiles surrounding edgecases for .relu PTX operations. Seems the semantics of the .relu operator differ between different opcodes...
  • Fix miscompile when both .satf and .relu are provided on the same instruction.
  • Implement __nv_is_extended_device_lambda_with_preserved_return_type. Although it is largely meaningless given our proper lambda support, user code relies on it.
  • Treat .mmio as an alias of .volatile, at least until we support some devices with actual IO devices stuck to the GPU.
  • Fix a compiler crash when passing a pointer to a struct directly into a PTX asm block.
  • Fix a compiler crash when compiling certain SIMD PTX instructions, such as fma.sat.f32x2.
  • Fix a type inference failure for instructions that include vector expressions with constants in them, such as asm("mov.b32 %0, {0, %1};" : "=f"(f) : "h"(h));.
  • Fix a compiler crash when a mandatory round-mode operand was omitted from a float-to-int cvt instruction.
  • Fix failure to compile certain cases of cvt, such as f64<->f16.
  • Fix failure to compile vector-splat cases of cvt for small float types.

Compiler: Other#

  • Fix a miscompilation in which rounding-mode state would sometimes be leaked, leaving the rounding mode register set to a non-default value after executing certain combinations of special-rounding-mode operations.
  • clangd now crashes less often.

Library#

  • Newly-added CUDA APIs:
    • cudaMalloc3D
    • cuCtxSynchronize_v2
    • cuCtxGetDevice_v2
    • cuCtxWaitEvent
    • cuFuncGetName
    • cuFuncGetParamInfo
    • cuStreamWriteValue32
  • Add the NVML APIs for querying process information.
  • Small improvements to exception handling performance.
  • Fix an implicit synchronisation bug for memsets. Memsets targeting pinned host memory are supposed to implicitly synchronise the affected stream (a sketch of this rule follows this list).
  • Fix a crash when using the extra argument of various launch APIs to pass a user-provided kernel argument buffer containing objects with "diamond of doom" inheritance.
  • half2char now properly converts to signed char, not unsigned char.
  • Fixed handling of -0 and NaN in various Math API routines.
  • Fixed a few cases where denorms were flushed on the "wrong side" of the operation. NVIDIA don't document which operations flush denorms before doing the operation (eg. min()), and which ones flush afterwards.
  • Fixed an endianness issue affecting bf16x2 assembly APIs such as __low2bfloat16.
  • Entirely disabled SDMA for gfx9xx devices except gfx90a, due to apparent hardware bugs. This should resolve issues where cudaMemcpys would mysteriously not synchronise with the host even when asked to. Seems SDMA barriers sometimes just don't work...?
  • Fixed various header defects (missing typedef enum, implicit STL includes, etc.)
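
A minimal sketch of the implicit-synchronisation rule mentioned above, assuming unified addressing so the pinned host pointer can be passed to cudaMemset directly (error checking omitted):

    #include <cuda_runtime.h>
    #include <cassert>

    int main() {
        unsigned char* pinned = nullptr;
        cudaHostAlloc(reinterpret_cast<void**>(&pinned), 4096, cudaHostAllocDefault);

        // cudaMemset is normally asynchronous with respect to the host, but a
        // memset whose target is pinned host memory must synchronise, so the
        // host can observe the result without an explicit cudaDeviceSynchronize().
        cudaMemset(pinned, 0xAB, 4096);
        assert(pinned[0] == 0xAB);

        cudaFreeHost(pinned);
        return 0;
    }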

Other#

  • scaleenv no longer leaks PS1 to the env.
  • The scripts in scale-validation are now much simpler, hopefully making it more obvious that they are just copies of the upstream build scripts!

Release 1.5.0 (2025-11-15)#

Platform#

  • Compiler is now based on llvm 20.1.8
  • Using rocm 7.1 versions of rocBLAS etc.

Supported architectures#

  • All architectures are now enabled in the free version of SCALE. See new EULA for details.
  • Newly-supported AMD GPU architectures:
    • gfx950 (MI350x/MI355x)
    • gfx1151 (Strix-halo etc.)
    • gfx1200 (RX 9060 XT etc.)
    • gfx1201 (RX 9070 XT etc.)
    • gfx908 (MI100)
    • gfx906 (MI50/MI60)

Compiler: Inline PTX support#

  • Improve PTX diagnostics for unused instruction flags
  • Support for q constraints (128-bit asm inputs)
  • Diagnostics for implicit truncation via asm constraints.
  • New PTX instructions:
    • movmatrix
    • bar/barrier: only cases that can be represented as __syncthreads_*.

Compiler: NVCC emulation#

  • New NVCC flags:
    • -keep
    • -keep-dir
    • -link
    • -preprocess (alias of -E)
    • -libdevice-directory/-ldir: Meaningless
    • -target-directory/-target-dir: Meaningless
    • -cudadevrt: no-op because our devrt implementation uses no smem, so no point.
    • -opt-info: alias of -Rpass, so it can do more than just -inline
  • Fix resolution of relative paths for nvcc's option-files flag.
  • Fix compiler crash when fence.cluster appears in inline PTX in dead code.
  • Accept even more cases of invalid identifiers in unused code in nvcc mode.
  • Improvements and fixes to the __shfl* optimiser. More DPP, less UB.
  • Add a compiler warning to complain about mixing __constant__ and constexpr.
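
A sketch of the pattern the new warning is aimed at (the variable name is illustrative):

    // __constant__ already places the object in device constant memory;
    // combining it with constexpr is the mix the new warning complains about.
    __constant__ constexpr float kScale = 2.0f;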

Compiler (misc.)#

  • Introduce builtins for conversion between arbitrary types (fp8 here we come)
  • Further improvements to deferred diagnostics, especially surrounding typo'd identifiers
  • Add [[clang::getter]]
  • Fix kernel argument buffer alignment sometimes being wrong
  • Microsoft extensions are no longer enabled in CUDA mode.
  • Arithmetic involving threadIdx and friends now compiles.
  • Various small optimiser enhancements.

Runtime#

  • Fix a rare race condition resolving cudaEvents
  • Slightly improve performance of every API by optimising error handling routines.
  • Introduce nvtx3 support as a header-only library, like it should be.
  • New API: cudaEventRecordWithFlags (usage sketched after this list)
  • Respect CUDA_GRAPHS_USE_NODE_PRIORITY environment variable.
  • Implement UUID-query APIs, and make them consistent between the driver API and nvml
  • Support the new __nv_* atomic functions
  • Support (and ignore) the ancient CU_CTX_MAP_HOST/cudaDeviceMapHost flags. This feature is always enabled.
  • Startup-time improvements.
  • Fix crash when empty CUDA graphs are executed.
  • Fix occasionally picking the wrong cache coherence domain for SDMA engines and breaking everything.
  • fp16x2/bf16x2 are now trivially-copyable in C++11+: a very tiny extension.
  • Fix intermittent crash in cuMemUnmap
  • Add the new CUDA13 aligned vector types
  • Workaround SDMA bug that gave incorrect results for cuMemsetD16 on some devices.
  • Many random header compatibility/typedef fixes.
  • Accuracy improvements to tanh() and sin().
  • Fix crash when millions of cudaStreams are destroyed all at once.
  • Crash correctly when the GPU executes a trap instruction.
  • Fix the GPU trap handler on GFX94x devices.
  • Fix a few C89-compatibility issues in less-commonly-used headers.
  • Make headers warning-free on GCC, since not everyone uses -isystem properly.
  • Slightly improve performance of nvRTC.
  • Raise an error if the user attempts to execute PTX as AMDGPU machine code, instead of actually trying it.
  • Fix a few cases where the runtime library and RTC compiler would disagree about what architecture to build for.
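
A usage sketch for cudaEventRecordWithFlags; with cudaEventRecordDefault it behaves like plain cudaEventRecord (error checking omitted):

    #include <cuda_runtime.h>

    int main() {
        cudaEvent_t ev;
        cudaStream_t stream;
        cudaEventCreate(&ev);
        cudaStreamCreate(&stream);

        // Record the event on the stream with explicit flags.
        cudaEventRecordWithFlags(ev, stream, cudaEventRecordDefault);
        cudaEventSynchronize(ev);

        cudaStreamDestroy(stream);
        cudaEventDestroy(ev);
        return 0;
    }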

Release 1.4.2 (2025-09-19)#

Platform/Packaging#

  • Simplified packaging/installation: removed dependency on rocm package repos. Removal of more rocm components from the package is in development.
  • Using rocm 6.4.1 versions of rocBLAS etc.

Compiler Diagnostics#

  • Warn about PTX instruction flags that don't actually do anything.
  • Warn about PTX variable declarations leaking into the enclosing function, since this may cause ptxas failures when building for NVIDIA.
  • Warn about passing a generic address space C++ pointer into an asm operand for a non-generic PTX memory instruction if the corresponding addrspacecast is not a no-op.
  • Detect when the user seems to have gotten cvta.to and cvta mixed up.
  • Cleanly reject .param address space in PTX: using that in inline asm is undefined behaviour.
  • Diagnose accidental address operands (eg. use of [] for cvta is a common screwup).
  • Proper errors for integer width mismatch in PTX address space conversions. Implicitly truncating pointers is bad.
  • Disallow overloads that differ only in return type and sidedness in clang-dialect mode.
  • Implement a fancier-than-nvcc edition of -Wdefault-stream-launch, warning about implicit use of the legacy default stream via any launch mechanism (see the sketch after this list), with opt-in support for warning about any use of the default stream (ptds or not).
  • PTX diagnostics now correctly point at the offending operand, where applicable.
  • Correctly report source locations for diagnostics in PTX blocks that use C constraints.
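
A sketch of what the default-stream warning reacts to, assuming -Wdefault-stream-launch is enabled:

    #include <cuda_runtime.h>

    __global__ void k() {}

    void launch() {
        k<<<1, 1>>>();           // implicit legacy default stream: warned about

        cudaStream_t s;
        cudaStreamCreate(&s);
        k<<<1, 1, 0, s>>>();     // explicit stream: no warning by default
        cudaStreamDestroy(s);
    }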

Compiler Optimisation#

  • Use DPP to optimise applicable __shfl*/__reduce* calls, and loops that perform common reduction/cumsum idioms (one such idiom is sketched after this list).
  • Improved instruction selection for shared-memory reads of sub-word types.
  • Vectorisation support for __shfl* (eg. turns multiple i8 shuffles into an i32 or i64 shuffle). Particularly useful for exploiting architectures that have support for i64 shuffles.
  • Ability to move extending conversions across shuffles, to avoid shuffling useless bits.
  • Don't generate redundant warp reductions prior to uniform atomic ops.
  • Constant propagation for __shfl*.
  • Improved program startup time in the presence of CUDA translation units with no device code.
  • Improvements to i1 vectorisation.
  • Don't discard __builtin_provable too early when doing LTO or building bitcode libraries.
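
One of the reduction idioms the DPP optimisation targets, sketched as a plain warp sum; whether a given loop is actually lowered to DPP depends on the analysis described above:

    __device__ float warp_sum(float v) {
        // Classic shuffle-down reduction over a 32-lane warp.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;   // lane 0 ends up holding the warp-wide sum
    }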

NVCC Compatibility#

  • Support for __shared__ and __constant__ keywords on anonymous unions/structs (see the sketch after this list).
  • Allow __device__ keyword to redundantly accompany __shared__ or __constant__.
  • Remove spurious warning about passing C++ pointers to memory ops.
  • Commas in asm statement argument lists are, it turns out, optional.
  • Support forward-unknown-opts.
  • Correctly handle missing template keywords in more dependent name lookup scenarios.
  • Tolerate wrong-sided access during constant evaluation.
  • Cleaner diagnostics when handling redundant commas in the presence of multiple templates closing at once.
  • Allow out-of-line redeclaration of namespaced template functions with conflicting signatures (this becomes an overload, when it should be a compile error).
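
A sketch of the anonymous-union pattern now accepted (names are illustrative; assumes a 256-thread block):

    __global__ void k(float* out) {
        // __shared__ on an anonymous union: both members alias the same 1 KiB
        // of shared memory and are injected into the enclosing scope.
        __shared__ union {
            float as_float[256];
            int   as_int[256];
        };
        as_float[threadIdx.x] = static_cast<float>(threadIdx.x);
        __syncthreads();
        out[threadIdx.x] = as_float[255 - threadIdx.x];
    }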

Inline PTX Support#

  • Support for undocumented syntax quirks, like spaces in the middle of directives.
  • Support for a wider variety of constant expressions.
  • Fix miscompilation of min.xorsign.abs.
  • Fix miscompilation of testp.finite in -ffast-math mode.
  • Correct behaviour of dp4a/dp2a in the presence of overflow.
  • Correctly parse identifiers including % on asm blocks with no inputs/outputs.
  • Fix miscompile of shifts/bmsk with extremely large shift amounts.
  • Implement insane implicit asm-input conversion rules.
  • Fix miscompiles in video instructions using min/max as the secondary op, since real behaviour turned out to differ from the manual.
  • Avoid compiler crash when trying to constant-evaluate a PTX asm input that has a type error.
  • tf32 support, via upcasting to fp32.
  • Fix some corner cases of mixing vector splats with _ operands.
  • Fix crash parsing video byte selectors.
  • Correctly handle implicit vector-of-vector types, because it turns out those are at least semantically a thing.
  • Newly-supported instructions:
    • cp.async.*
    • min/max, 3-input edition.
    • add/sub/mul/fma/cvt with non-default rounding modes.
    • mma/wmma for sub-byte integer types.
    • mma/wmma for all remaining shapes, in all datatypes except fp8/6/4.
    • red/atom for vector types.
    • cvt.pack, 4-input edition.
    • nanosleep
  • Newly-supported special registers (reading %globaltimer is sketched after this list):
    • %globaltimer
    • %globaltimer_hi
    • %globaltimer_lo
    • %warpid
    • %nwarpid
    • %smid
    • %nsmid

Compiler Misc.#

  • Fixed compiler crash when doing decltype(lambda) in certain conditions.
  • __syncthreads_and() and friends no longer use any shared memory.
  • Improvements to compile speed, especially in the presence of inline PTX. Extremely so in the presence of mma/wmma.

Library: New APIs#

See the API diffs for precise information.

  • Added most commonly-used NVML APIs.
  • More cuSOLVER/cuSPARSE.
  • Non-default-rounding-mode APIs (conversions and arithmetic).
  • Added 1D CUarray copy APIs.
  • Added device-side versions of device/context property query APIs (cudaDeviceGetAttribute etc).
  • bmma_sync.
  • atomicAdd for float2 and float4 (see the sketch after this list).
  • Added API for programmatically controlling SCALE's exception behaviour.
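
A sketch of the new vector atomicAdd overloads (the float4 form is analogous):

    __global__ void accumulate(float2* acc, const float2* parts, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(acc, parts[i]);   // float2 overload added in this release
    }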

Library: Fixes#

  • Allocator improvements prevent premature OOM due to address space fragmentation in long-running applications with a high level of memory churn.
  • Fix const-correctness in coop-groups.
  • Removed some macro leaks.
  • __sad() now behaves correctly in the presence of integer overflow.
  • Fix bugs in FFT plan creation.
  • Fix some edgecases relating to deleting the active CUcontext before popping it.
  • Fixed use of cuCtxSetCurrent when the stack is empty.
  • Don't crash when unloading a module at the same time as launching one of its kernels.
  • scaleenv now works with zsh, and does not pollute the shell environment.
  • Endless tiny fixes, random macros, header compatibility tweaks, etc.
  • Device PCI IDs now match the format of NVIDIA CUDA exactly.
  • Fix some edgecases where denorms weren't being flushed, but should be.
  • Stream creation is faster.
  • Don't crash when the printf buffer is larger than 4GB.
  • Fix rare hang when using the IPC APIs.

Release 1.3.1 (2025-05-12)#

Compiler#

  • Fixed a bug in the handling of weak device-side symbols which broke the device binaries for certain projects.
  • Fixed various PTX miscompilations.
  • Added support for approximate-math PTX instructions (lg2.approx and friends).

Library#

  • Fixed many small bugs in the device-side APIs.
  • Per-thread-default-stream actually works now, rather than silently using the legacy stream.
  • Fixed a race condition in the fft library.

Thirdparty Project demos#

  • GROMACS now works. SCALE appears to support a wider selection of AMD architectures than the HIP port, and seems to perform somewhat better (on MI210, at least!).

Release 1.3.0 (2025-04-23)#

Platform#

  • Upgraded from llvm17 to llvm19.1.7.
  • Support for rocm 6.3.1.
  • Support for gfx902 architecture.
  • Enterprise edition: Support for new architectures:
    • gfx908
    • gfx90a
    • gfx942

Packaging#

  • Packages for Rocky9 are now available.
  • Package repos for Ubuntu and Rocky9 to simplify installation/upgrades.

New Features#

  • Added scaleenv, a new and much easier way to use SCALE.
  • Support for simulating a warp size of 32 even on wave64 platforms, fixing many projects on such platforms (see the sketch after this list).
  • Support for bfloat16.
  • Compatibility improvements with non-cmake buildsystems.
  • Added SCALE_CUDA_VERSION environment variable to tell SCALE to impersonate a specific version of CUDA.
  • SCALE_EXCEPTIONS now supports a non-fatal mode.
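
A sketch of the kind of warp-level code this helps: sources written against CUDA's 32-lane warps, full 32-bit masks included, behave as expected when the simulated warp size of 32 is in effect on wave64 hardware:

    __global__ void lane_ids(int* out) {
        // warpSize reports 32 and full-warp masks are 32 bits wide, matching
        // what code written for NVIDIA GPUs assumes.
        unsigned mask = __activemask();
        int lane = threadIdx.x % warpSize;
        out[blockIdx.x * blockDim.x + threadIdx.x] = __popc(mask) + lane;
    }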

Library wrappers#

  • Added most of cuFFT.
  • Added lots more cuSolver and cuSPARSE.
  • Filled in some missing NVTX APIs.
  • Added homeopathic quantities of nvtx3.

Library#

  • Lazy-initialisation of primary contexts now works properly, fixing some subtle lifecycle issues.
  • Added some missing undocumented headers like texture_types.h.
  • Added the IPC memory/event APIs
  • Added many multi-GPU APIs
  • Added cuMemcpyPeer/cuMemcpyPeerAsync.
  • Rewritten device allocator to work around HSA bugs and performance issues.
  • Fix a crash when initialising SCALE with many GPUs with huge amounts of memory.
  • Added CUDA IPC APIs. Among other things, this enables CUDA-MPI applications to work, including AMGX's distributed mode.
  • Fixed lots of multi-GPU brokenness.
  • Implemented the CU_CTX_SYNC_MEMOPS context flag.
  • Fixed accuracy issues in some of the CUDA Math APIs.
  • fp16 headers no longer produce warnings for projects that include them without -isystem.
  • Improved performance and correctness of cudaMemcpy/memset.
  • Fix subtle issues with pointer attribute APIs.
  • Improvements to C89 compatibility of headers.
  • Added more Cooperative Groups APIs.
  • Support for grid_sync() (see the sketch after this list).
  • Fix some wave64 issues with cooperative_groups.h.
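
A sketch of grid-wide synchronisation through cooperative groups, which the grid_sync() support above enables; the kernel must be launched cooperatively (e.g. via cudaLaunchCooperativeKernel):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void two_phase(float* data, unsigned long long n) {
        cg::grid_group grid = cg::this_grid();
        unsigned long long i = grid.thread_rank();
        if (i < n) data[i] *= 2.0f;
        grid.sync();              // every block reaches this point before any continues
        if (i < n) data[i] += 1.0f;
    }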

Compiler#

  • __launch_bounds__ now works correctly, significantly improving performance (see the sketch after this list).
  • Device atomics are now much more efficient.
  • Denorm-flushing optimisations are no longer skipped when they aren't supposed to be.
  • Ability to use DPP to optimise warp shuffles in some cases. Currently, this only works if the individual shfl is provably equivalent to a DPP op, not when loop analysis would be required. __shfl_xor is your friend.
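
A sketch of the __launch_bounds__ annotation whose handling this release fixes: it caps the block size (here 256 threads, with at least 2 resident blocks per multiprocessor requested), which feeds register allocation and occupancy decisions:

    __global__ void __launch_bounds__(256, 2)
    saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }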

NVCC Interface#

  • Corrected the behaviour of the nvcc -odir flag during dependency generation.
  • Added the nvcc -prec-sqrt and -prec-div flags.
  • -use_fast_math now matches NVIDIA's behaviour, instead of mapping to clang's -ffast-math, which does too much.
  • --device-c no longer inappropriately triggers the linker.
  • Newly-supported nvcc flags:
    • -arch=native
    • -jump-table-density (ignored)
    • -compress-mode (ignored)
    • -split-compile-extended (ignored)

NVCC Semantics#

  • Support broken template code in more situations in nvcc mode.
  • Allow invalid const-correctness in unexpanded template code in nvcc mode.
  • Allow trailing commas in template argument lists in nvcc mode.
  • Fix a parser crash when explicitly calling operator<<<int>() in CUDA mode.
  • Fix a crash when using --compiler-options to pass huge numbers of options through to -Wl.

Diagnostics#

  • Warning for unused PTX variables
  • Error for attempts to return the carry bit (undefined behaviour on NVIDIA).
  • Compiler diagnostic to catch some undefined behaviour patterns with CUDA atomics.

PTX#

  • New instructions supported:
    • sm_100 variants of redux.
    • Mixed-precision add/sub/fma FP instructions.
    • membar
    • bar.warp.sync
    • fence (partial)
    • mma (software emulated)
    • wmma (software emulated)
  • Fixed parsing of hex-float constants.
  • Support for PTX C constraints (dynamic asm strings).
  • f16/bf16 PTX instructions no longer depend on the corresponding C++ header.
  • asm blocks can now refer to variables declared in other asm blocks, including absolutely cursed patterns.
  • Fixed an issue where template-dependent asm strings were mishandled.
  • Fixed various parsing issues (undocumented syntax quirks etc.).
  • Fixed a crash when trying to XOR floating point numbers together.

Thirdparty Project demos#

Things that now appear to work include:

  • CUDA-aware MPI
  • MAGMA
  • whisper.cpp
  • TCLB

Release 1.2.0 (2024-11-27)#

Library Enhancements#

  • Support for gfx900 architecture.
  • Support for gfx1102 architecture.

Other#

  • Support for Ubuntu 24.04.
  • Upgraded to ROCm 6.2.2.

Release 1.1.0 (2024-10-31)#

Library Enhancements#

  • Added much of the CUDA graph API.
  • Improvements to multi-GPU handling.
  • Fixed rare shutdown-time segfaults.
  • Added many random API functions. As usual, see the diff.

PTX#

  • f16x2, u16x2 and s16x2 types.
  • fns instruction
  • Fixed miscompile of sad instruction.

Thirdparty Project demos#

The scale-validation repo now has working demos for the following additional projects:

  • FLAMEGPU2
  • GPUJPEG
  • gpu_jpeg2k

Release 1.0.2.0 (2024-09-05)#

Documented a record of the CUDA APIs already available in SCALE, and those still to come: Implemented APIs.

Library Enhancements#

  • Kernel arguments larger than 4KB no longer crash the library.
  • Programs that ignore CUDA error codes can no longer get stuck in a state where the library always returns the error code you ignored.
  • Fixed synchronisation bugs when using synchronous cuMemset* APIs.
  • Fixed implicit synchronisation behaviour of cuMemcpy2D/cuMemcpy2DAsync().
  • Fixed precision issues in fp16 exp2(), rsqrt(), and h2log().
  • cudaEventRecord for an empty event no longer returns a time in the past.
  • Fixed occupancy API behaviour in edgecases that are not multiples of warp size.
  • Fixed rare crashes during static de-initialisation when library wrappers were in use.
  • All flags supported by SCALE's nvcc are now also accepted by our nvrtc implementation.
  • Various small header fixes.

Compiler Enhancements#

  • decltype() now works correctly for __host__ __device__ functions.
  • -Winvalid-constexpr no longer defaults to -Werror, for consistency with nvcc.
  • PTX variable names including % are no longer rejected.
  • Support for nvcc's nonstandard permissiveness surrounding missing typename keywords in dependent types.
  • Support for nvcc's wacky "split declaration" syntax for __host__ __device__ functions (with a warning):
    int foo();
    __device__ int foo();
    __host__ int foo() {
        return 5;
    }
    // foo() is a __host__ __device__ function. :D
    
  • Newly-supported compiler flags (all of which are aliases for standard flags, or combinations thereof):
    • -device-c
    • -device-w
    • -pre-include
    • -library
    • -output-file
    • -define-macro
    • -undefine-macro

New CUDA APIs#

Math APIs#

  • exp10(__half)
  • exp2(__half)
  • rcp(__half)
  • rint(__half)
  • h2exp10(__half2)
  • h2exp2(__half2)
  • h2rcp(__half2)
  • h2rint(__half2)

Release 1.0.1.0 (2024-07-24)#

This release primarily fixes issues that prevent people from successfully compiling their projects with SCALE. Many thanks to those users who submitted bug reports.

CUDA APIs#

  • The extra argument to cuLaunchKernel is now supported.
  • Added support for some more undocumented NVIDIA headers.
  • Fix various overload resolution issues with atomic APIs.
  • Fix overload resolution issues with min/max.
  • Added various undocumented macros to support projects that are explicitly checking cuda include guard macros.
  • lrint() and llrint() no longer crash the compiler. :D
  • Newly supported CUDA APIs:
    • nvrtcGetNumSupportedArchs
    • nvrtcGetSupportedArchs
    • cudaLaunchKernelEx, cuLaunchKernelEx, cudaLaunchKernelExC: some of the performance-hint launch options are no-ops.
    • __vavgs2, __vavgs4
    • All the atomic*_block() and atomic*_system() variants.
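
A sketch of the scoped atomic variants: the _block form restricts visibility to the current thread block, while the _system form extends it to other devices and the host:

    __global__ void counters(int* block_local, int* system_wide) {
        atomicAdd_block(block_local, 1);    // visible within this thread block only
        atomicAdd_system(system_wide, 1);   // visible system-wide
    }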

Compiler#

  • Improved parsing of nvcc arguments:
    • Allow undocumented option variants (-foo bar, --foo bar, --foo=bar, and -foo=bar are always allowed, it seems).
    • Implement "interesting" quoting/escaping rules in nvcc arguments, such as embedded quotes and \,. We now correctly handle cursed arguments like: '-Xcompiler=-Wl\,-O1' '-Xcompiler=-Wl\,-rpath\,/usr/lib,-Wl\,-rpath-link\,/usr/lib'
  • Support for more nvcc arguments:

    • NVCC-style diagnostic flags: -Werror, -disable-warnings, etc.
    • --run, --run-args
    • -Xlinker, -linker-options
    • -no-exceptions, -noeh
    • -minimal: no-op. Exact semantics are undocumented, and build times are reasonably fast anyway.
    • -gen-opt-lto, -dlink-time-opt, -dlto. No-ops: device LTO not yet supported.
    • -t, --threads, -split-compile: No-ops: they're flags for making compilation faster and are specific to how nvcc is implemented.
    • -device-int128: no-op: we always enable int128.
    • -extra-device-vectorization: no-op: vectorisation optimisations are controlled by the usual -O* flags.
    • -entries, -source-in-ptx, -src-in-ptx: no-ops: there is no PTX.
    • -use-local-env, -idp, -ddp, -dp, etc.: ignored since they are meaningless except on Windows.
  • Allow variadic device functions in non-evaluated functions.

  • Don't warn about implicit conversion from cudaLaneMask_t to bool.
  • __builtin_provable no longer causes compiler crashes in -O0/-O1 builds.
  • Fixed a bug causing PTX asm blocks inside non-template, non-dependent member functions of template classes to sometimes not be compiled, causing PTX to end up in the AMD binary unmodified.
  • CUDA launch tokens with spaces (ie.: myKernel<< <1, 1>> >()) are now supported.
  • Building non-cuda C translation units with SCALE-nvcc now works.

Other#

  • The meson build system no longer regards SCALE-nvcc as a "broken" compiler.
  • hsakmtsysinfo no longer explodes if it doesn't like your GPU.
  • New documentation pages.
  • Published more details about thirdparty testing, including the build scripts.

Release 1.0.0.0 (2024-07-15)#

Initial release