v0.19.0
This release features official, out-of-the-box support for compiling dpctl for specified AMD GPU architectures, the addition of the new function tensor.top_k, a radix-sort-based implementation of sorting functions, and improved interoperability with DLPack through tensor.dldevice_to_sycl_device and tensor.sycl_device_to_dldevice.
A number of adjustments were also made to improve the performance of dpctl reductions (e.g., sum, min, max), accumulators (e.g., cumulative_sum, cumulative_logsumexp), and copy-and-cast operations.
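For readers unfamiliar with the accumulators named above, the following is a minimal pure-Python sketch of what a cumulative log-sum-exp computes: element i of the output is the logarithm of the sum of exponentials of inputs 0 through i. The function name cumulative_logsumexp_reference is hypothetical and the code is not dpctl's implementation; it only illustrates the arithmetic, assuming finite inputs (or negative infinity).

```python
import math


def cumulative_logsumexp_reference(xs):
    """Running log(sum(exp(x))) over a sequence of floats (illustration only)."""
    out = []
    running = -math.inf  # log of an empty sum
    for x in xs:
        hi = max(running, x)
        if hi == -math.inf:
            # Both terms are exp(-inf) == 0, so the running sum stays empty.
            running = -math.inf
        else:
            # Factor out the larger term so exp() never overflows.
            running = hi + math.log(math.exp(running - hi) + math.exp(x - hi))
        out.append(running)
    return out


if __name__ == "__main__":
    print(cumulative_logsumexp_reference([0.0, 0.0, 0.0]))
```

Subtracting the running maximum before exponentiating is a common way to keep the computation stable for large inputs such as 1000.0, where a direct exp() call would overflow.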
Added
- Support for compiling dpctl for specified AMD GPU architecture with use of CodePlay oneAPI plug-in gh-1731
- Added tensor.top_k per Python Array API specification gh-1921
- Added functions tensor.dldevice_to_sycl_device and tensor.sycl_device_to_dldevice for converting between DLPack and sycl devices, and a method get_device_id to dpctl.SyclDevice to improve interoperability with DLPack protocol gh-1953
- Added DPCTL_OFFLOAD_COMPRESS cmake option (set to OFF by default) to toggle --offload-compress linker option when building dpctl gh-1961
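As an illustration of what a top-k query returns, here is a minimal pure-Python sketch for a one-dimensional sequence. The function name top_k_reference and the mode values "largest" and "smallest" are assumptions made for this example; the sketch does not call dpctl, and the order in which tied elements are returned by the real function may differ.

```python
def top_k_reference(values, k, mode="largest"):
    """Return (top values, their indices) for a 1-D sequence (illustration only)."""
    if mode not in ("largest", "smallest"):
        raise ValueError("mode must be 'largest' or 'smallest'")
    if not 0 <= k <= len(values):
        raise ValueError("k must be between 0 and len(values)")
    # Order the indices by the value they point at, then keep the first k.
    order = sorted(
        range(len(values)),
        key=values.__getitem__,
        reverse=(mode == "largest"),
    )
    indices = order[:k]
    return [values[i] for i in indices], indices


if __name__ == "__main__":
    vals, idx = top_k_reference([3, 1, 4, 1, 5, 9, 2, 6], 3)
    print(vals, idx)  # [9, 6, 5] [5, 7, 4]
```

Sorting the whole input is the simplest way to express the result; a production implementation would typically avoid a full sort when k is small.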
Changed
- Improved performance of copy-and-cast operations from numpy.ndarray to tensor.usm_ndarray for contiguous inputs gh-1829
- py_sort and py_argsort now throw py::value_error if inputs are not C-contiguous gh-1838
- Improved performance of copying operation to C-/F-contig array, with optimization for batch of square matrices gh-1850
- Improved performance of tensor.argsort function for all types gh-1859
- Improved performance of tensor.sort and tensor.argsort for short arrays in the range [16, 64] elements gh-1866
- Implemented radix sort algorithm to be used in dpt.sort and dpt.argsort gh-1867, gh-1883
- Extended dpctl.SyclTimer with device_timer keyword, implementing different methods of collecting device times gh-1872
- dpctl changed to see GPU devices out of the box in virtual environment on Windows gh-1922
- Improved performance of tensor.cumulative_sum, tensor.cumulative_prod, tensor.cumulative_logsumexp as well as performance of boolean indexing gh-1923, gh-1942
- Improved performance of tensor.min, tensor.max, tensor.logsumexp, tensor.reduce_hypot for floating point type arrays by at least 2x gh-1932, gh-1937
- Updated Cython examples to use scikit-build gh-1935
- Reduced binary size of _tensor_accumulation_impl by 13 MB gh-1957
- Extended tensor.asarray to support objects that implement __usm_ndarray__ property to be interpreted as usm_ndarray objects gh-1959
- tensor.usm_ndarray object disallows implicit conversions to NumPy array gh-1964
- stream arguments in tensor.usm_ndarray methods now raise an error if stream is not a tensor.SyclQueue gh-1969
- dpctl initialization sets subprocess to use SPAWN method on Linux to enable gdb-oneapi to debug kernels submitted from Python applications gh-1971
- Reduced binary size of _tensor_elementwise_impl gh-1976
- Allow dpctl.SyclQueue.memcpy to and from multi-dimensional buffers gh-1985
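The radix sort mentioned in the list above can be illustrated with a short least-significant-digit sketch in pure Python. This is a generic textbook version for non-negative integers, not the SYCL kernels shipped in dpctl; the function name radix_sort_reference and the bits_per_pass parameter are hypothetical.

```python
def radix_sort_reference(keys, bits_per_pass=8):
    """LSD radix sort for non-negative integers (illustration only)."""
    if any(key < 0 for key in keys):
        raise ValueError("this sketch handles non-negative integers only")
    out = list(keys)
    if not out:
        return out
    mask = (1 << bits_per_pass) - 1
    max_key = max(out)
    shift = 0
    # One stable bucketing pass per digit, least significant digit first.
    while (max_key >> shift) > 0:
        buckets = [[] for _ in range(mask + 1)]
        for key in out:
            buckets[(key >> shift) & mask].append(key)
        # Concatenating buckets in order preserves the order set by earlier passes.
        out = [key for bucket in buckets for key in bucket]
        shift += bits_per_pass
    return out


if __name__ == "__main__":
    print(radix_sort_reference([170, 45, 75, 90, 802, 24, 2, 66]))
```

The number of passes depends on the key width rather than on comparisons between elements, which is the property that makes radix sort attractive for fixed-width numeric types.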
Fixed
- Fixed a bug in tensor.roll for very large values of shift gh-1869
- Fix for tensor.result_type when all inputs are Python built-in scalars gh-1877
- Improved error in constructors tensor.full and tensor.full_like when provided a non-numeric fill value gh-1878
- Added a check for pointer alignment when copying to C-contiguous memory gh-1890, gh-1891
- Fixed dpctl installed into virtual environment not finding DPC++ runtime libraries by adding DPCTL_WITH_REDIST cmake option (set to OFF by default) gh-1893
- Fixed incorrect result (issue gh-1901) in tensor.cumulative_sum and in advanced indexing gh-1902
- Fixed __setitem__() for tensor.usm_ndarray when passed an empty boolean mask gh-1915
- tensor.from_dlpack docstring now shows that return type can be NumPy array and stipulates when this will be the case gh-1919
- Fixed docstring in helper class in DLPack tests gh-1920
- Fixed a bug in tensor.astype where copy=False would not be respected for 1d arrays when order keyword is specified gh-1928
- Replaced deprecated CL/sycl.hpp with recommended sycl/sycl.hpp in examples gh-1933
- Fixed tensor.take_along_axis and tensor.put_along_axis raising an error for tensor.uint64 indices when given an array of dimension greater than 1 gh-1934
- Fixed unexpected results of tensor.sum with a requested output type of bool gh-1958
- Use std::move to avoid unnecessary copying of temporary in triul_ctor.cpp gh-1960
- Make stream a keyword-only argument in tensor.usm_ndarray.to_device per requirement by array API specification gh-1966
- Improve efficiency of copy implementation and avoid an unnecessary kernel invocation in tensor.argsort for 1d input gh-1967
- Corrected uses of NumPy constructors with tensor.usm_ndarray inputs in test suite gh-1968
- Fixed array API namespace inspection utilities showing complex128 as a valid dtype on devices without double precision and device keywords not working with dpctl.SyclQueue or filter strings gh-1979
- Fixed a bug in test_sycl_device_interface.cpp which would cause compilation to fail with Clang version 20.0 gh-1989
- Fixed memory leaks in smart-pointer-managed USM temporaries in synchronizing kernel calls gh-2002
- UsmNDArray_MakeSimpleFromPtr and UsmNDArray_MakeFromPtr now raise an error when provided an invalid typenum before attempting to create the array gh-2003
- Fixed typos in tensor.from_numpy and tensor.astype gh-2006
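To make the tensor.roll entry above concrete, the sketch below shows the expected semantics of a roll whose shift is far larger than the array length: the result should match a roll by the shift reduced modulo the length. The function name roll_reference is hypothetical, the code is pure Python, and it makes no claim about how the fix in gh-1869 was implemented.

```python
def roll_reference(values, shift):
    """Circularly shift a 1-D list to the right by shift positions (illustration only)."""
    n = len(values)
    if n == 0:
        return []
    # Python's % always yields a result in [0, n), even for negative shifts.
    s = shift % n
    if s == 0:
        return list(values)
    return values[-s:] + values[:-s]


if __name__ == "__main__":
    print(roll_reference([1, 2, 3, 4, 5], 2))
    print(roll_reference([1, 2, 3, 4, 5], 10**18 + 2))
```

Both calls are expected to produce the same list, because 10**18 is a multiple of 5.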
Maintenance
- Revert pinning of cmake to 3.26 on Windows gh-1823
- Update black version used in Python code style workflow gh-1828
- Fixed CI/CD workflow for building conda packages on Windows gh-1831
- Revert work-around in test_sycl_kernel_submit.py for problem in MKL 2024.2.0 gh-1836
- Do not use Mambaforge variant of miniforge as deprecated gh-1844
- Use pybind11=2.13.6 gh-1845
- Remove unnecessary include in C++ header file gh-1846
- Build translation unit "simplify_iteration_space.cpp" compiled multiple times as a static library gh-1847
- Add instructions for installing dpctl from Intel PyPi channel gh-1860
- Fix warnings when generating docs gh-1855, gh-1861
- Align conda recipe with conda-forge's {{ stdlib("c") }} migration gh-1868
- Add missing include of SYCL header to "math_utils.hpp" gh-1899
- Add support of CV-qualifiers in is_complex<T> helper gh-1900
- Tuning work for elementwise functions with modest performance gains (under 10%) gh-1889
- Reduce binary size of accumulators by saving repeated expressions to a temporary gh-1896
- Added workflow to run nightly tests of dpctl gh-1903, gh-1905
- Support and testing for Python 3.13 for dpctl gh-1941, gh-1943
- Change libtensor to use std::size_t and dpctl::tensor::ssize_t throughout and fix missing includes for std::size_t and size_t gh-1950
- Fixed some unqualified size_t and fixed-width integral types in libtensor gh-1955
- Add versioneer as a build requirement in documentation on building dpctl from source gh-1972
- Remove const qualifiers for class and struct members gh-1974, gh-1975
- Various code quality improvements to test_sycl_queue_submit_local_accessor_arg.cpp gh-1990
- Added Python 3.12 to package metadata gh-2005
- Miscellaneous changes to continuous integration/delivery (CI/CD) supporting scripts:
gh-1837, gh-1839, gh-1848, gh-1853, gh-1854, gh-1856, gh-1858, gh-1863, gh-1864, gh-1865, gh-1881, gh-1882, gh-1884, gh-1886, gh-1888, gh-1897, gh-1898, gh-1909, gh-1916, gh-1927, gh-1940, gh-1948, gh-1949, gh-1952, gh-1962, gh-1963, gh-1973, gh-1980, gh-1981, gh-1983, gh-1988
New Contributors
- @sommerlukas made their first contribution in #1985