Device Interface¶
Version Queries¶
- pycuda.VERSION¶
Gives the numeric version of PyCUDA as a variable-length tuple of integers. Enables easy version checks such as VERSION >= (0, 93).
New in version 0.93.
- pycuda.VERSION_STATUS¶
A text string such as “rc4” or “beta” qualifying the status of the release.
New in version 0.93.
- pycuda.VERSION_TEXT¶
The full release name (such as “0.93rc4”) in string form.
New in version 0.93.
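A minimal sketch of a version check using these attributes:
import pycuda

# Refuse to run on PyCUDA versions that predate a needed feature.
assert pycuda.VERSION >= (0, 93)
print("PyCUDA %s (status: %r)" % (pycuda.VERSION_TEXT, pycuda.VERSION_STATUS))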
Error Reporting¶
- exception pycuda.driver.Error¶
Base class of all PyCUDA errors.
- exception pycuda.driver.CompileError¶
Thrown when pycuda.compiler.SourceModule compilation fails.
- msg¶
New in version 0.94.
- stdout¶
New in version 0.94.
- stderr¶
New in version 0.94.
- command_line¶
New in version 0.94.
- exception pycuda.driver.MemoryError¶
Thrown when mem_alloc() or related functionality fails.
- exception pycuda.driver.LogicError¶
Thrown when PyCUDA was confronted with a situation where it is likely that the programmer has made a mistake. LogicErrors do not depend on outer circumstances defined by the run-time environment.
Example: CUDA was used before it was initialized.
- exception pycuda.driver.LaunchError¶
Thrown when kernel invocation has failed. (Note that this will often be reported by the next call after the actual kernel invocation.)
- exception pycuda.driver.RuntimeError¶
Thrown when an unforeseen run-time failure is encountered that is not likely due to programmer error.
Example: A file was not found.
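As a sketch of how these exceptions surface in practice, the following deliberately feeds nvcc invalid source and inspects the CompileError attributes listed above (the kernel body is made up for illustration):
import pycuda.autoinit  # noqa: F401 -- creates a context on import
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

try:
    # Missing semicolon: nvcc will reject this.
    SourceModule("__global__ void broken() { int x }")
except cuda.CompileError as err:
    print("compilation failed:", err.msg)
    print("nvcc said:", err.stderr)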
Constants¶
- class pycuda.driver.ctx_flags¶
Flags for Device.make_context(). CUDA 2.0 and above only.
- SCHED_AUTO¶
If there are more contexts than processors, yield, otherwise spin while waiting for CUDA calls to complete.
- SCHED_SPIN¶
Spin while waiting for CUDA calls to complete.
- SCHED_YIELD¶
Yield to other threads while waiting for CUDA calls to complete.
- SCHED_MASK¶
Mask of valid scheduling flags in this bitfield.
- SCHED_BLOCKING_SYNC¶
Use blocking synchronization. CUDA 2.2 and newer.
- MAP_HOST¶
Support mapped pinned allocations. CUDA 2.2 and newer.
- LMEM_RESIZE_TO_MAX¶
Keep local memory allocation after launch. CUDA 3.2 and newer. Rumored to decrease Fermi launch overhead?
New in version 2011.1.
- FLAGS_MASK¶
Mask of valid flags in this bitfield.
- class pycuda.driver.event_flags¶
Flags for Event. CUDA 2.2 and newer.
- DEFAULT¶
- BLOCKING_SYNC¶
- DISABLE_TIMING¶
CUDA 3.2 and newer.
New in version 0.94.
- INTERPROCESS¶
CUDA 4.1 and newer.
New in version 2011.2.
- class pycuda.driver.device_attribute¶
- MAX_THREADS_PER_BLOCK¶
- MAX_BLOCK_DIM_X¶
- MAX_BLOCK_DIM_Y¶
- MAX_BLOCK_DIM_Z¶
- MAX_GRID_DIM_X¶
- MAX_GRID_DIM_Y¶
- MAX_GRID_DIM_Z¶
- TOTAL_CONSTANT_MEMORY¶
- WARP_SIZE¶
- MAX_PITCH¶
- CLOCK_RATE¶
- TEXTURE_ALIGNMENT¶
- GPU_OVERLAP¶
- MULTIPROCESSOR_COUNT¶
CUDA 2.0 and above only.
- SHARED_MEMORY_PER_BLOCK¶
Deprecated as of CUDA 2.0. See below for replacement.
- MAX_SHARED_MEMORY_PER_BLOCK¶
CUDA 2.0 and above only.
- REGISTERS_PER_BLOCK¶
Deprecated as of CUDA 2.0. See below for replacement.
- MAX_REGISTERS_PER_BLOCK¶
CUDA 2.0 and above.
- KERNEL_EXEC_TIMEOUT¶
CUDA 2.2 and above.
- INTEGRATED¶
CUDA 2.2 and above.
- CAN_MAP_HOST_MEMORY¶
CUDA 2.2 and above.
- COMPUTE_MODE¶
CUDA 2.2 and above. See compute_mode.
- MAXIMUM_TEXTURE1D_WIDTH¶
- MAXIMUM_TEXTURE2D_WIDTH¶
- MAXIMUM_TEXTURE2D_HEIGHT¶
- MAXIMUM_TEXTURE3D_WIDTH¶
- MAXIMUM_TEXTURE3D_HEIGHT¶
- MAXIMUM_TEXTURE3D_DEPTH¶
- MAXIMUM_TEXTURE2D_ARRAY_WIDTH¶
- MAXIMUM_TEXTURE2D_ARRAY_HEIGHT¶
- MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES¶
CUDA 3.0 and above.
New in version 0.94.
- MAXIMUM_TEXTURE2D_LAYERED_WIDTH¶
- MAXIMUM_TEXTURE2D_LAYERED_HEIGHT¶
- MAXIMUM_TEXTURE2D_LAYERED_LAYERS¶
- MAXIMUM_TEXTURE1D_LAYERED_WIDTH¶
- MAXIMUM_TEXTURE1D_LAYERED_LAYERS¶
CUDA 4.0 and above.
New in version 2011.1.
- SURFACE_ALIGNMENT¶
CUDA 3.0 (post-beta) and above.
New in version 0.94.
- CONCURRENT_KERNELS¶
CUDA 3.0 (post-beta) and above.
New in version 0.94.
- ECC_ENABLED¶
CUDA 3.0 (post-beta) and above.
New in version 0.94.
- PCI_BUS_ID¶
CUDA 3.2 and above.
New in version 0.94.
- PCI_DEVICE_ID¶
CUDA 3.2 and above.
New in version 0.94.
- TCC_DRIVER¶
CUDA 3.2 and above.
New in version 0.94.
- MEMORY_CLOCK_RATE¶
- GLOBAL_MEMORY_BUS_WIDTH¶
- L2_CACHE_SIZE¶
- MAX_THREADS_PER_MULTIPROCESSOR¶
- ASYNC_ENGINE_COUNT¶
- UNIFIED_ADDRESSING¶
CUDA 4.0 and above.
New in version 2011.1.
- MAXIMUM_TEXTURE2D_GATHER_WIDTH¶
- MAXIMUM_TEXTURE2D_GATHER_HEIGHT¶
- MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE¶
- MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE¶
- MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE¶
- PCI_DOMAIN_ID¶
- TEXTURE_PITCH_ALIGNMENT¶
- MAXIMUM_TEXTURECUBEMAP_WIDTH¶
- MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH¶
- MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS¶
- MAXIMUM_SURFACE1D_WIDTH¶
- MAXIMUM_SURFACE2D_WIDTH¶
- MAXIMUM_SURFACE2D_HEIGHT¶
- MAXIMUM_SURFACE3D_WIDTH¶
- MAXIMUM_SURFACE3D_HEIGHT¶
- MAXIMUM_SURFACE3D_DEPTH¶
- MAXIMUM_SURFACE1D_LAYERED_WIDTH¶
- MAXIMUM_SURFACE1D_LAYERED_LAYERS¶
- MAXIMUM_SURFACE2D_LAYERED_WIDTH¶
- MAXIMUM_SURFACE2D_LAYERED_HEIGHT¶
- MAXIMUM_SURFACE2D_LAYERED_LAYERS¶
- MAXIMUM_SURFACECUBEMAP_WIDTH¶
- MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH¶
- MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS¶
- MAXIMUM_TEXTURE1D_LINEAR_WIDTH¶
- MAXIMUM_TEXTURE2D_LINEAR_WIDTH¶
- MAXIMUM_TEXTURE2D_LINEAR_HEIGHT¶
- MAXIMUM_TEXTURE2D_LINEAR_PITCH¶
CUDA 4.1 and above.
New in version 2011.2.
- MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH¶
- MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT¶
- COMPUTE_CAPABILITY_MAJOR¶
- COMPUTE_CAPABILITY_MINOR¶
- MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH¶
CUDA 5.0 and above.
New in version 2014.1.
- STREAM_PRIORITIES_SUPPORTED¶
CUDA 5.5 and above.
New in version 2014.1.
- class pycuda.driver.pointer_attribute¶
CUDA 4.0 and above.
New in version 2011.1.
- class pycuda.driver.profiler_output_mode¶
CUDA 4.0 and above.
New in version 2011.1.
- class pycuda.driver.function_attribute¶
Flags for Function.get_attribute(). CUDA 2.2 and newer.
- MAX_THREADS_PER_BLOCK¶
- SHARED_SIZE_BYTES¶
- CONST_SIZE_BYTES¶
- LOCAL_SIZE_BYTES¶
- NUM_REGS¶
- PTX_VERSION¶
CUDA 3.0 (post-beta) and above.
New in version 0.94.
- BINARY_VERSION¶
CUDA 3.0 (post-beta) and above.
New in version 0.94.
- MAX¶
- class pycuda.driver.func_cache¶
See Function.set_cache_config(). CUDA 3.0 (post-beta) and above.
New in version 0.94.
- PREFER_NONE¶
- PREFER_SHARED¶
- PREFER_L1¶
- PREFER_EQUAL¶
CUDA 4.1 and above.
New in version 2011.2.
- class pycuda.driver.shared_config¶
See Function.set_shared_config(). CUDA 4.2 and above.
- class pycuda.driver.array_format¶
- UNSIGNED_INT8¶
- UNSIGNED_INT16¶
- UNSIGNED_INT32¶
- SIGNED_INT8¶
- SIGNED_INT16¶
- SIGNED_INT32¶
- HALF¶
- FLOAT¶
- class pycuda.driver.array3d_flags¶
- LAYERED¶
CUDA 4.0 and above.
New in version 2011.1.
- SURFACE_LDST¶
CUDA 3.1 and above.
New in version 0.94.
- CUBEMAP¶
- TEXTURE_GATHER¶
CUDA 4.1 and above.
New in version 2011.2.
- class pycuda.driver.address_mode¶
- WRAP¶
- CLAMP¶
- MIRROR¶
- BORDER¶
CUDA 3.2 and above.
New in version 0.94.
- class pycuda.driver.compute_mode¶
CUDA 2.2 and newer.
- DEFAULT¶
- PROHIBITED¶
- EXCLUSIVE_PROCESS¶
CUDA 4.0 and above.
New in version 2011.1.
- class pycuda.driver.jit_option¶
CUDA 2.1 and newer.
- MAX_REGISTERS¶
- THREADS_PER_BLOCK¶
- WALL_TIME¶
- INFO_LOG_BUFFER¶
- INFO_LOG_BUFFER_SIZE_BYTES¶
- ERROR_LOG_BUFFER¶
- ERROR_LOG_BUFFER_SIZE_BYTES¶
- OPTIMIZATION_LEVEL¶
- TARGET_FROM_CUCONTEXT¶
- TARGET¶
- FALLBACK_STRATEGY¶
- class pycuda.driver.jit_target¶
CUDA 2.1 and newer.
- COMPUTE_10¶
- COMPUTE_11¶
- COMPUTE_12¶
- COMPUTE_13¶
- COMPUTE_20¶
CUDA 3.0 and above.
New in version 0.94.
- COMPUTE_21¶
CUDA 3.2 and above.
New in version 0.94.
- class pycuda.driver.host_alloc_flags¶
Flags to be used to allocate Pagelocked Host Memory.
- PORTABLE¶
- DEVICEMAP¶
- WRITECOMBINED¶
- class pycuda.driver.mem_attach_flags¶
Flags to be used to allocate Managed Memory.
New in version 2014.1.
- GLOBAL¶
- HOST¶
- SINGLE¶
- class pycuda.driver.mem_host_register_flags¶
- PORTABLE¶
- DEVICEMAP¶
CUDA 4.0 and newer.
New in version 2011.1.
- class pycuda.driver.limit¶
Limit values for Context.get_limit() and Context.set_limit().
CUDA 3.1 and newer.
New in version 0.94.
- STACK_SIZE¶
- PRINTF_FIFO_SIZE¶
- MALLOC_HEAP_SIZE¶
CUDA 3.2 and above.
Devices and Contexts¶
- pycuda.driver.get_version()¶
Obtain the version of CUDA against which PyCUDA was compiled. Returns a 3-tuple of integers as (major, minor, revision).
- pycuda.driver.get_driver_version()¶
Obtain the version of the CUDA driver on top of which PyCUDA is running. Returns an integer version number.
- pycuda.driver.init(flags=0)¶
Initialize CUDA.
Warning
This must be called before any other function in this module.
See also pycuda.autoinit.
- class pycuda.driver.Device(number)¶
- class pycuda.driver.Device(pci_bus_id)
A handle to the number’th CUDA device. See also pycuda.autoinit.
Changed in version 2011.2: The pci_bus_id version of the constructor is new in CUDA 4.1.
- static count()¶
Return the number of CUDA devices found.
- name()¶
- pci_bus_id()¶
CUDA 4.1 and newer.
New in version 2011.2.
- compute_capability()¶
Return a 2-tuple indicating the compute capability version of this device.
- total_memory()¶
Return the total amount of memory on the device in bytes.
- get_attribute(attr)¶
Return the (numeric) value of the attribute attr, which may be one of the device_attribute values.
All device_attribute values may also be directly read as (lower-case) attributes on the Device object itself, e.g. dev.clock_rate.
- get_attributes()¶
Return all device attributes in a dict, with keys from device_attribute.
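A short sketch of querying attributes both ways (device 0 is assumed to exist):
import pycuda.driver as cuda

cuda.init()
dev = cuda.Device(0)

# Explicit query via device_attribute...
print(dev.get_attribute(cuda.device_attribute.MAX_THREADS_PER_BLOCK))
# ...the lower-case attribute shortcut...
print(dev.max_threads_per_block)
# ...or everything at once, as a dict keyed by device_attribute values.
attrs = dev.get_attributes()
print(len(attrs), "attributes on", dev.name())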
- make_context(flags=ctx_flags.SCHED_AUTO)¶
Create a Context on this device, with flags taken from the ctx_flags values.
Also make the newly-created context the current context.
- retain_primary_context()¶
Return the Context obtained by retaining the device’s primary context, which is the one used by the CUDA runtime API. Unlike Device.make_context(), the newly-created context is not made current.
CUDA 7.0 and newer.
New in version 2020.1.
- can_access_peer(dev)¶
CUDA 4.0 and newer.
New in version 2011.1.
- __hash__()¶
- __eq__()¶
- __ne__()¶
- class pycuda.driver.Context¶
An equivalent of a UNIX process on the compute device. Create instances of this class using Device.make_context(). See also pycuda.autoinit.
- detach()¶
Decrease the reference count on this context. If the reference count hits zero, the context is deleted.
- push()¶
Make self the active context, pushing it on top of the context stack. CUDA 2.0 and above only.
- static pop()¶
Remove any context from the top of the context stack, deactivating it. CUDA 2.0 and above only.
- static get_device()¶
Return the device that the current context is working on.
- static synchronize()¶
Wait for all activity in the current context to cease, then return.
- static set_limit(limit, value)¶
See limit for possible values of limit.
CUDA 3.1 and above.
New in version 0.94.
- static get_limit(limit)¶
See limit for possible values of limit.
CUDA 3.1 and above.
New in version 0.94.
- static set_cache_config(cc)¶
See func_cache for possible values of cc.
CUDA 3.2 and above.
New in version 0.94.
- static get_cache_config()¶
Return a value from func_cache.
CUDA 3.2 and above.
New in version 0.94.
- static set_shared_config(sc)¶
See shared_config for possible values of sc.
CUDA 4.2 and above.
New in version 2013.1.
- static get_shared_config()¶
Return a value from shared_config.
CUDA 4.2 and above.
New in version 2013.1.
- get_api_version()¶
Return an integer API version number.
CUDA 3.2 and above.
New in version 0.94.
- enable_peer_access(peer, flags=0)¶
CUDA 4.0 and above.
New in version 2011.1.
- disable_peer_access(peer, flags=0)¶
CUDA 4.0 and above.
New in version 2011.1.
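The usual lifecycle, sketched without pycuda.autoinit (pop() and detach() balance the make_context() call):
import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()  # created *and* made current
try:
    print("active device:", cuda.Context.get_device().name())
    cuda.Context.synchronize()  # wait for all work in this context
finally:
    ctx.pop()     # deactivate (remove from the context stack)
    ctx.detach()  # drop our reference so the context can be freed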
Concurrency and Streams¶
- class pycuda.driver.Stream(flags=0)¶
A handle for a queue of operations that will be carried out in order.
- synchronize()¶
Wait for all activity on this stream to cease, then return.
- is_done()¶
Return True iff all queued operations have completed.
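A sketch of enqueueing an asynchronous copy on a stream; note the page-locked host buffer, which asynchronous transfers require (see Pagelocked Host Memory below):
import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda

stream = cuda.Stream()
a = cuda.pagelocked_empty(1024, np.float32)
a[:] = 42.0
dev_buf = cuda.mem_alloc(a.nbytes)

cuda.memcpy_htod_async(dev_buf, a, stream)
print(stream.is_done())  # may be False while the copy is in flight
stream.synchronize()     # block until the queue has drained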
- class pycuda.driver.Event(flags=0)¶
An event is a temporal ‘marker’ in a Stream that allows taking the time between two events, such as the time required to execute a kernel. An event’s time is recorded when the Stream has finished all tasks enqueued before the record() call.
See event_flags for values for the flags parameter.
- synchronize()¶
Wait until the device execution stream reaches this event. Return self.
- query()¶
Return True if the device execution stream has reached this event.
- time_since(event)¶
Return the time in milliseconds that has passed between self and event. Use this method as end.time_since(start). Note that this method will fail with an “invalid value” error if either of the events has not been reached yet. Use synchronize() to ensure that the event has been reached.
- time_till(event)¶
Return the time in milliseconds that has passed between event and self. Use this method as start.time_till(end). Note that this method will fail with an “invalid value” error if either of the events has not been reached yet. Use synchronize() to ensure that the event has been reached.
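A typical timing sketch using a pair of events:
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda

start, end = cuda.Event(), cuda.Event()
start.record()
# ... enqueue kernel launches and/or copies here ...
end.record()
end.synchronize()  # make sure the end event has been reached
print("elapsed: %g ms" % start.time_till(end))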
- ipc_handle()¶
Return a bytes object representing an IPC handle to this event. Requires Python 2.6 and CUDA 4.1.
- static from_ipc_handle(handle)¶
Requires Python 2.6 and CUDA 4.1.
Memory¶
Global Device Memory¶
- pycuda.driver.mem_get_info()¶
Return a tuple (free, total) indicating the free and total memory in the current context, in bytes.
- pycuda.driver.mem_alloc(bytes)¶
Return a DeviceAllocation object representing a linear piece of device memory.
- pycuda.driver.to_device(buffer)¶
Allocate enough device memory for buffer, which adheres to the Python buffer interface. Copy the contents of buffer onto the device. Return a DeviceAllocation object representing the newly-allocated memory.
- pycuda.driver.from_device(devptr, shape, dtype, order='C')¶
Make a new numpy.ndarray from the data at devptr on the GPU, interpreting them using shape, dtype and order.
- pycuda.driver.from_device_like(devptr, other_ary)¶
Make a new numpy.ndarray from the data at devptr on the GPU, interpreting them as having the same shape, dtype and order as other_ary.
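A round trip through device memory using these helpers might look like this (a sketch):
import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda

a = np.arange(16, dtype=np.float32)
devptr = cuda.to_device(a)            # allocate and copy in one step
b = cuda.from_device_like(devptr, a)  # copy back with a's shape and dtype
assert (a == b).all()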
- pycuda.driver.mem_alloc_pitch(width, height, access_size)¶
Allocates a linear piece of device memory at least width bytes wide and height rows high that can be accessed using a data type of size access_size in a coalesced fashion.
Returns a tuple (dev_alloc, actual_pitch) giving a DeviceAllocation and the actual width of each row in bytes.
- class pycuda.driver.DeviceAllocation¶
An object representing an allocation of linear device memory. Once this object is deleted, its associated device memory is freed.
Objects of this type can be cast to int to obtain a linear index into this Context’s memory.
- free()¶
Release the held device memory now instead of when this object becomes unreachable. Any further use of the object is an error and will lead to undefined behavior.
- as_buffer(size, offset=0)¶
Return the pointer encapsulated by self as a Python buffer object, with the given size and, optionally, offset.
New in version 2014.1.
- pycuda.driver.mem_get_ipc_handle(devptr)¶
Return an opaque bytes object representing an IPC handle to the device pointer devptr.
New in version 2011.2.
Requires CUDA 4.1 and Python 2.6.
- class pycuda.driver.IPCMemoryHandle(ipc_handle, flags=ipc_mem_flags.LAZY_ENABLE_PEER_ACCESS)¶
New in version 2011.2.
Requires CUDA 4.1 and Python 2.6.
Objects of this type can be used in the same ways as a DeviceAllocation.
- close()¶
- class pycuda.driver.PointerHolderBase¶
A base class that facilitates casting to pointers within PyCUDA. This allows the user to construct custom pointer types that may have been allocated by facilities outside of PyCUDA proper, but still need to be objects to facilitate RAII. The user needs to supply one method to facilitate the pointer cast:
- get_pointer()¶
Return the pointer encapsulated by self.
- as_buffer(size, offset=0)¶
Return the pointer encapsulated by self as a Python buffer object, with the given size and, optionally, offset.
New in version 2014.1.
Pagelocked Host Memory¶
Pagelocked Allocation¶
- pycuda.driver.pagelocked_empty(shape, dtype, order='C', mem_flags=0)¶
Allocate a pagelocked numpy.ndarray of shape, dtype and order.
mem_flags may be one of the values in host_alloc_flags. It may only be non-zero on CUDA 2.2 and newer.
For the meaning of the other parameters, please refer to the numpy documentation.
- pycuda.driver.pagelocked_zeros(shape, dtype, order='C', mem_flags=0)¶
Like pagelocked_empty(), but initialized to zero.
- pycuda.driver.pagelocked_empty_like(array, mem_flags=0)¶
- pycuda.driver.pagelocked_zeros_like(array, mem_flags=0)¶
The numpy.ndarray instances returned by these functions have an attribute base that references an object of type PagelockedHostAllocation.
- class pycuda.driver.PagelockedHostAllocation¶
Inherits from HostPointer.
An object representing an allocation of pagelocked host memory. Once this object is deleted, its associated host memory is freed.
- free()¶
Release the held memory now instead of when this object becomes unreachable. Any further use of the object (or its associated numpy array) is an error and will lead to undefined behavior.
- get_flags()¶
Return a bit field of values from host_alloc_flags.
Only available on CUDA 3.2 and newer.
New in version 0.94.
- class pycuda.driver.HostAllocation¶
A deprecated name for PagelockedHostAllocation.
Aligned Host Memory¶
- pycuda.driver.aligned_empty(shape, dtype, order='C', alignment=4096)¶
Allocate a numpy.ndarray of shape, dtype and order, with data aligned to alignment bytes.
For the meaning of the other parameters, please refer to the numpy documentation.
New in version 2011.1.
- pycuda.driver.aligned_zeros(shape, dtype, order='C', alignment=4096)¶
Like aligned_empty(), but with initialization to zero.
New in version 2011.1.
- pycuda.driver.aligned_empty_like(array, alignment=4096)¶
New in version 2011.1.
- pycuda.driver.aligned_zeros_like(array, alignment=4096)¶
New in version 2011.1.
The numpy.ndarray instances returned by these functions have an attribute base that references an object of type AlignedHostAllocation.
- class pycuda.driver.AlignedHostAllocation¶
Inherits from HostPointer.
An object representing an allocation of aligned host memory.
Post-Allocation Pagelocking¶
- pycuda.driver.register_host_memory(ary, flags=0)¶
Returns a numpy.ndarray which shares memory with ary. This memory will be page-locked as long as the return value of this function is alive.
The returned array’s base attribute contains a RegisteredHostMemory instance, whose base attribute in turn contains ary.
CUDA 4.0 and newer.
ary’s data address and size must be page-aligned. One way to achieve this is to use the functions in Aligned Host Memory.
New in version 2011.1.
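A sketch combining aligned_empty() with register_host_memory(); the 4096-byte page size is an assumption that holds on most Linux systems:
import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda

# 1024 float32 values = 4096 bytes: both address and size are page-aligned.
ary = cuda.aligned_empty((1024,), np.float32, alignment=4096)
pinned = cuda.register_host_memory(ary)

# ... use pinned in asynchronous transfers ...

pinned.base.unregister()  # release the page-lock, not the memory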
- class pycuda.driver.RegisteredHostMemory¶
Inherits from HostPointer.
CUDA 4.0 and newer.
New in version 2011.1.
- unregister()¶
Unregister the page-lock on the host memory held by this instance. Note that this does not free the memory, it only frees the page-lock.
- base¶
Contains the Python object from which this instance was constructed.
Managed Memory¶
CUDA 6.0 adds support for a “Unified Memory” model, which creates a managed virtual memory space that is visible to both CPUs and GPUs. The OS will migrate the physical pages associated with managed memory between the CPU and GPU as needed. This allows a numpy array on the host to be passed to kernels without first creating a DeviceAllocation and manually copying the host data to and from the device.
Note
Managed memory is only available for some combinations of CUDA device, operating system, and host compiler target architecture. Check the CUDA C Programming Guide and CUDA release notes for details.
Warning
This interface to managed memory should be considered experimental. It is provided as a preview, but for now the same interface stability guarantees as for the rest of PyCUDA do not apply.
Managed Memory Allocation¶
- pycuda.driver.managed_empty(shape, dtype, order='C', mem_flags=0)¶
Allocate a managed numpy.ndarray of shape, dtype and order.
mem_flags may be one of the values in mem_attach_flags.
For the meaning of the other parameters, please refer to the numpy documentation.
Only available on CUDA 6.0 and newer.
New in version 2014.1.
- pycuda.driver.managed_zeros(shape, dtype, order='C', mem_flags=0)¶
Like managed_empty(), but initialized to zero.
Only available on CUDA 6.0 and newer.
New in version 2014.1.
- pycuda.driver.managed_empty_like(array, mem_flags=0)¶
Only available on CUDA 6.0 and newer.
New in version 2014.1.
- pycuda.driver.managed_zeros_like(array, mem_flags=0)¶
Only available on CUDA 6.0 and newer.
New in version 2014.1.
The numpy.ndarray instances returned by these functions have an attribute base that references an object of type ManagedAllocation.
- class pycuda.driver.ManagedAllocation¶
An object representing an allocation of managed host memory. Once this object is deleted, its associated CUDA managed memory is freed.
- free()¶
Release the held memory now instead of when this object becomes unreachable. Any further use of the object (or its associated numpy array) is an error and will lead to undefined behavior.
- get_device_pointer()¶
Return a device pointer that indicates the address at which this memory is mapped into the device’s address space. For managed memory, this is also the host pointer.
- attach(mem_flags, stream=None)¶
Alter the visibility of the managed allocation to be one of the values in mem_attach_flags. A managed array can be made visible to the host CPU and the entire CUDA context with mem_attach_flags.GLOBAL, or limited to the CPU only with mem_attach_flags.HOST. If mem_attach_flags.SINGLE is selected, then the array will only be visible to the CPU and the provided instance of Stream.
Managed Memory Usage¶
A managed numpy array is constructed and used on the host in a similar manner to a pagelocked array:
from pycuda.autoinit import context
import pycuda.driver as cuda
import numpy as np
a = cuda.managed_empty(shape=10, dtype=np.float32, mem_flags=cuda.mem_attach_flags.GLOBAL)
a[:] = np.linspace(0, 9, len(a)) # Fill array on host
It can be passed to a GPU kernel, and used again on the host without an explicit copy:
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void doublify(float *a)
{
a[threadIdx.x] *= 2;
}
""")
doublify = mod.get_function("doublify")
doublify(a, grid=(1,1), block=(len(a),1,1))
context.synchronize() # Wait for kernel completion before host access
median = np.median(a) # Computed on host!
Warning
The CUDA Unified Memory model has very specific rules regarding concurrent access of managed memory allocations. Host access to any managed array is not allowed while the GPU is executing a kernel, regardless of whether the array is in use by the running kernel. Failure to follow the concurrency rules will generate a segmentation fault, causing the Python interpreter to terminate immediately.
Users of managed numpy arrays should read the “Unified Memory Programming” appendix of the CUDA C Programming Guide for further details on the concurrency restrictions.
If you are encountering interpreter terminations due to concurrency issues, the faulthandler module (http://pypi.python.org/pypi/faulthandler) may be helpful in locating the point in your Python program where the faulty access is occurring.
Arrays and Textures¶
- class pycuda.driver.ArrayDescriptor¶
- width¶
- height¶
- format¶
A value of type array_format.
- num_channels¶
- class pycuda.driver.ArrayDescriptor3D¶
- width¶
- height¶
- depth¶
- format¶
A value of type array_format. CUDA 2.0 and above only.
- num_channels¶
- class pycuda.driver.Array(descriptor)¶
A 2D or 3D memory block that can only be accessed via texture references.
descriptor can be of type ArrayDescriptor or ArrayDescriptor3D.
- free()¶
Release the array and its device memory now instead of when this object becomes unreachable. Any further use of the object is an error and will lead to undefined behavior.
- get_descriptor()¶
Return an ArrayDescriptor object for this 2D array, like the one that was used to create it.
- get_descriptor_3d()¶
Return an ArrayDescriptor3D object for this 3D array, like the one that was used to create it. CUDA 2.0 and above only.
- class pycuda.driver.SurfaceReference¶
Note
Instances of this class can only be constructed through Module.get_surfref().
CUDA 3.1 and above.
New in version 0.94.
- set_array(array, flags=0)¶
Bind self to the Array array.
As long as array remains bound to this surface reference, it will not be freed; the surface reference keeps a reference to the array.
- get_array()¶
Get back the Array to which self is bound.
Note
This will be a different object than the one passed to set_array(), but it will compare equal.
- class pycuda.driver.TextureReference¶
A handle to a binding of either linear memory or an Array to a texture unit.
- set_array(array)¶
Bind self to the Array array.
As long as array remains bound to this texture reference, it will not be freed; the texture reference keeps a reference to the array.
- set_address(devptr, bytes, allow_offset=False)¶
Bind self to a chunk of linear memory starting at the integer address devptr, encompassing a number of bytes. Due to alignment requirements, the effective texture bind address may be different from the requested one by an offset. This method returns this offset in bytes. If allow_offset is False, a nonzero value of this offset will cause an exception to be raised.
Unlike for Array objects, no life support is provided for linear memory bound to texture references.
- set_address_2d(devptr, descr, pitch)¶
Bind self as a 2-dimensional texture to a chunk of global memory at devptr. The line-to-line offset in bytes is given by pitch. Width, height and format are given in the ArrayDescriptor descr.
set_format() need not and should not be called in addition to this method.
- set_format(fmt, num_components)¶
Set the texture to have array_format fmt and to have num_components channels.
- set_address_mode(dim, am)¶
Set the address mode of dimension dim to am, which must be one of the address_mode values.
- set_flags(flags)¶
Set the flags to a combination of the TRSF_XXX values.
- get_array()¶
Get back the Array to which self is bound.
Note
This will be a different object than the one passed to set_array(), but it will compare equal.
- get_address_mode(dim)¶
- get_filter_mode()¶
- get_format()¶
Return a tuple (fmt, num_components), where fmt is of type array_format, and num_components is the number of channels in this texture.
(Version 2.0 and above only.)
- get_flags()¶
- pycuda.driver.TRSA_OVERRIDE_FORMAT¶
- pycuda.driver.TRSF_READ_AS_INTEGER¶
- pycuda.driver.TRSF_NORMALIZED_COORDINATES¶
- pycuda.driver.TR_DEFAULT¶
- pycuda.driver.matrix_to_array(matrix, order)¶
Turn the two-dimensional numpy.ndarray object matrix into an Array. The order argument can be either “C” or “F”. If it is “C”, then tex2D(x,y) is going to fetch matrix[y,x], and vice versa for “F”.
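A sketch of binding the resulting Array to a texture reference; the texture and kernel names are made up for illustration:
import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
texture<float, 2, cudaReadModeElementType> my_tex;

__global__ void copy_tex(float *dest, int w)
{
    int x = threadIdx.x, y = threadIdx.y;
    dest[y*w + x] = tex2D(my_tex, x, y);
}
""")

matrix = np.random.randn(4, 4).astype(np.float32)
ary = cuda.matrix_to_array(matrix, "C")  # tex2D(x, y) fetches matrix[y, x]

texref = mod.get_texref("my_tex")
texref.set_array(ary)

out = np.zeros_like(matrix)
mod.get_function("copy_tex")(cuda.Out(out), np.int32(4),
                             block=(4, 4, 1), texrefs=[texref])
assert np.allclose(out, matrix)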
- pycuda.driver.np_to_array(nparray, order, allowSurfaceBind=False)¶
Turn a numpy.ndarray with 2D or 3D structure into an Array. The order argument can be either “C” or “F”. If allowSurfaceBind is passed as True, the returned Array can be read and written with a SurfaceReference in addition to reads by a TextureReference. The function automatically detects the dtype and adjusts the number of channels to a supported array_format. It also adds direct support for the np.float64, np.complex64 and np.complex128 formats.
Example of use:
#include <pycuda-helpers.hpp>

texture<fp_tex_double, 3, cudaReadModeElementType> my_tex;  // complex128: fp_tex_cdouble
                                                            // complex64 : fp_tex_cfloat
                                                            // float64   : fp_tex_double
surface<void, 3, cudaReadModeElementType> my_surf;  // Surfaces in 2D need 'cudaSurfaceType2DLayered'

__global__ void f()
{
  ...
  fp_tex3D(my_tex, i, j, k);
  fp_surf3Dwrite(myvar, my_surf, i, j, k, cudaBoundaryModeClamp);  // fp extensions don't need width in bytes
  fp_surf3Dread(&myvar, my_surf, i, j, k, cudaBoundaryModeClamp);
  ...
}
New in version 2015.1.
- pycuda.driver.gpuarray_to_array(gpuparray, order, allowSurfaceBind=False)¶
Turn a GPUArray with 2D or 3D structure into an Array. Same structure and use as np_to_array().
New in version 2015.1.
- pycuda.driver.make_multichannel_2d_array(matrix, order)¶
Turn the three-dimensional numpy.ndarray object matrix into a 2D Array with multiple channels.
Depending on order, the matrix’s shape is interpreted as
height, width, num_channels for order == “C”,
num_channels, width, height for order == “F”.
Note
This function assumes that matrix has been created with the memory order order. If that is not the case, the copied data will likely not be what you expect.
Initializing Device Memory¶
- pycuda.driver.memset_d8(dest, data, count)¶
- pycuda.driver.memset_d16(dest, data, count)¶
- pycuda.driver.memset_d32(dest, data, count)¶
Note
count is the number of elements, not bytes.
- pycuda.driver.memset_d2d8(dest, pitch, data, width, height)¶
- pycuda.driver.memset_d2d16(dest, pitch, data, width, height)¶
- pycuda.driver.memset_d2d32(dest, pitch, data, width, height)¶
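For instance, a sketch of filling a buffer of 32-bit words with a pattern (note the element count):
import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda

n = 256
buf = cuda.mem_alloc(n * 4)          # room for 256 32-bit words
cuda.memset_d32(buf, 0x01020304, n)  # count is elements, not bytes

result = cuda.from_device(buf, (n,), np.uint32)
assert (result == 0x01020304).all()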
Unstructured Memory Transfers¶
- pycuda.driver.memcpy_htod(dest, src)¶
Copy from the Python buffer src to the device pointer dest (an int or a DeviceAllocation). The size of the copy is determined by the size of the buffer.
- pycuda.driver.memcpy_htod_async(dest, src, stream=None)¶
Copy from the Python buffer src to the device pointer dest (an int or a DeviceAllocation) asynchronously, optionally serialized via stream. The size of the copy is determined by the size of the buffer.
src must be page-locked memory; see, e.g., pagelocked_empty().
New in version 0.93.
- pycuda.driver.memcpy_dtoh(dest, src)¶
Copy from the device pointer src (an int or a DeviceAllocation) to the Python buffer dest. The size of the copy is determined by the size of the buffer.
- pycuda.driver.memcpy_dtoh_async(dest, src, stream=None)¶
Copy from the device pointer src (an int or a DeviceAllocation) to the Python buffer dest asynchronously, optionally serialized via stream. The size of the copy is determined by the size of the buffer.
dest must be page-locked memory; see, e.g., pagelocked_empty().
New in version 0.93.
- pycuda.driver.memcpy_dtod(dest, src, size)¶
- pycuda.driver.memcpy_dtod_async(dest, src, size, stream=None)¶
CUDA 3.0 and above.
New in version 0.94.
- pycuda.driver.memcpy_peer(dest, src, size, dest_context=None, src_context=None)¶
- pycuda.driver.memcpy_peer_async(dest, src, size, dest_context=None, src_context=None, stream=None)¶
CUDA 4.0 and above.
New in version 2011.1.
- pycuda.driver.memcpy_dtoa(ary, index, src, len)¶
- pycuda.driver.memcpy_atod(dest, ary, index, len)¶
- pycuda.driver.memcpy_htoa(ary, index, src)¶
- pycuda.driver.memcpy_atoh(dest, ary, index)¶
- pycuda.driver.memcpy_atoa(dest, dest_index, src, src_index, len)¶
Structured Memory Transfers¶
- class pycuda.driver.Memcpy2D¶
- src_x_in_bytes¶
X offset of the origin of the copy. (initialized to 0)
- src_y¶
Y offset of the origin of the copy. (initialized to 0)
- src_pitch¶
Size of a row in bytes at the origin of the copy.
- set_src_host(buffer)¶
Set the buffer, which must be a Python object adhering to the buffer interface, to be the origin of the copy.
- set_src_device(devptr)¶
Set the device address devptr (an int or a DeviceAllocation) as the origin of the copy.
- set_src_unified(buffer)¶
Same as set_src_host(), except that buffer may also correspond to device memory.
CUDA 4.0 and above. Requires unified addressing.
New in version 2011.1.
- dst_x_in_bytes¶
X offset of the destination of the copy. (initialized to 0)
- dst_y¶
Y offset of the destination of the copy. (initialized to 0)
- dst_pitch¶
Size of a row in bytes at the destination of the copy.
- set_dst_host(buffer)¶
Set the buffer, which must be a Python object adhering to the buffer interface, to be the destination of the copy.
- set_dst_device(devptr)¶
Set the device address devptr (an int or a DeviceAllocation) as the destination of the copy.
- set_dst_unified(buffer)¶
Same as set_dst_host(), except that buffer may also correspond to device memory.
CUDA 4.0 and above. Requires unified addressing.
New in version 2011.1.
- width_in_bytes¶
Number of bytes to copy for each row in the transfer.
- height¶
Number of rows to copy.
- __call__([aligned=True])¶
Perform the specified memory copy, waiting for it to finish. If aligned is False, tolerate device-side misalignment for device-to-device copies that may lead to loss of copy bandwidth.
- __call__(stream)
Perform the memory copy asynchronously, serialized via the Stream stream. Any host memory involved in the transfer must be page-locked.
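A sketch of a host-to-device copy into a pitched allocation, driven by Memcpy2D:
import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda

h, w = 16, 16
src = np.arange(h * w, dtype=np.float32).reshape(h, w)
row_bytes = w * src.dtype.itemsize

# Pitched allocation: the returned pitch may exceed row_bytes.
dev_buf, pitch = cuda.mem_alloc_pitch(row_bytes, h, 4)

copy = cuda.Memcpy2D()
copy.set_src_host(src)
copy.src_pitch = row_bytes
copy.set_dst_device(dev_buf)
copy.dst_pitch = pitch
copy.width_in_bytes = row_bytes
copy.height = h
copy(aligned=True)  # synchronous; pass a Stream to run it asynchronously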
- class pycuda.driver.Memcpy3D¶
Memcpy3D has the same members as Memcpy2D, and additionally all of the following:
- src_z¶
Z offset of the origin of the copy. (initialized to 0)
- dst_z¶
Z offset of the destination of the copy. (initialized to 0)
- depth¶
Memcpy3D is supported on CUDA 2.0 and above only.
- class pycuda.driver.Memcpy3DPeer¶
Memcpy3DPeer has the same members as Memcpy3D, and additionally all of the following:
- set_src_context(ctx)¶
- set_dst_context(ctx)¶
CUDA 4.0 and newer.
New in version 2011.1.
Code on the Device: Modules and Functions¶
- class pycuda.driver.Module¶
Handle to a CUBIN module loaded onto the device. Can be created with module_from_file() and module_from_buffer().
- get_function(name)¶
Return the Function name in this module.
Warning
While you can obtain different handles to the same function using this method, these handles all share the same state that is set through the set_XXX methods of Function. This means that you can’t obtain two different handles to the same function and Function.prepare() them in two different ways.
- get_global(name)¶
Return a tuple (device_ptr, size_in_bytes) giving the device address and size of the global name.
The main use of this method is to find the address of pre-declared __constant__ arrays so they can be filled from the host before kernel invocation.
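A sketch of filling a __constant__ array before a launch (the symbol name is made up):
import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__constant__ float coeffs[4];

__global__ void scale(float *a) { a[threadIdx.x] *= coeffs[threadIdx.x]; }
""")

devptr, size_in_bytes = mod.get_global("coeffs")
cuda.memcpy_htod(devptr, np.array([1, 2, 3, 4], dtype=np.float32))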
- get_texref(name)¶
Return the TextureReference name from this module.
- get_surfref(name)¶
Return the SurfaceReference name from this module.
CUDA 3.1 and above.
New in version 0.94.
- pycuda.driver.module_from_buffer(buffer, options=[], message_handler=None)¶
Create a Module by loading a PTX or CUBIN module from buffer, which must support the Python buffer interface. (For example, str and numpy.ndarray do.)
- Parameters:
options – A list of tuples (jit_option, value).
message_handler – A callable that is called with arguments of (compile_success_bool, info_str, error_str), which allows the user to process error and warning messages from the PTX compiler.
Loading PTX modules as well as non-default values of options and message_handler are only allowed on CUDA 2.1 and newer.
- class pycuda.driver.Function¶
Handle to a __global__ function in a Module. Create using Module.get_function().
- __call__(arg1, ..., argn, block=block_size[, grid=(1, 1)[, stream=None[, shared=0[, texrefs=[][, time_kernel=False]]]]])¶
Launch self, with a thread block size of block. block must be a 3-tuple of integers.
arg1 through argn are the positional C arguments to the kernel. See param_set() for details. See especially the warnings there.
grid specifies, as a 2-tuple, the number of thread blocks to launch, as a two-dimensional grid. stream, if specified, is a Stream instance serializing the copying of input arguments (if any), execution, and the copying of output arguments (again, if any). shared gives the number of bytes available to the kernel in extern __shared__ arrays. texrefs is a list of TextureReference instances that the function will have access to.
The function returns either None or the number of seconds spent executing the kernel, depending on whether time_kernel is True.
This is a convenience interface that can be used instead of the param_*() and launch_*() methods below. For a faster (but mildly less convenient) way of invoking kernels, see prepare() and prepared_call().
arg1 through argn are allowed to be of the following types:
- Subclasses of numpy.number. These are sized number types such as numpy.uint32 or numpy.float32.
- DeviceAllocation instances, which will become a device pointer to the allocated memory.
- Instances of ArgumentHandler subclasses. These can be used to automatically transfer numpy arrays onto and off of the device.
- Objects supporting the Python buffer interface. These chunks of bytes will be copied into the parameter space verbatim.
- GPUArray instances.
Warning
You cannot pass values of Python’s native int or float types to param_set. Since there is no unambiguous way to guess the size of these integers or floats, it complains with a TypeError.
Note
This method has to guess the types of the arguments passed to it, which can make it somewhat slow. For a kernel that is invoked often, this can be inconvenient. For a faster (but mildly less convenient) way of invoking kernels, see prepare() and prepared_call().
- param_set_texref(texref)¶
Make the TextureReference texref available to the function.
- prepare(arg_types, shared=None, texrefs=[])¶
Prepare the invocation of this function by
- setting up the argument types as arg_types. arg_types is expected to be an iterable containing type characters understood by the struct module or numpy.dtype objects. (In addition, PyCUDA understands ‘F’ and ‘D’ for single- and double precision floating point numbers.)
- registering the texture references texrefs for use with this function. The TextureReference objects in texrefs will be retained, and whatever these references are bound to at invocation time will be available through the corresponding texture references within the kernel.
Return self.
- prepared_call(grid, block, *args, shared_size=0)¶
Invoke self using launch_grid(), with args, a grid size of grid, and a block size of block. Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.
Changed in version 2012.1: shared_size was added.
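A sketch of the prepared-invocation workflow; the kernel and the "fPP" type string (one float, two pointers) are illustrative:
import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void axpy(float a, float *x, float *y)
{ int i = threadIdx.x; y[i] = a * x[i] + y[i]; }
""")
axpy = mod.get_function("axpy")

n = 128
x = np.random.randn(n).astype(np.float32)
y = np.random.randn(n).astype(np.float32)
x_dev, y_dev = cuda.to_device(x), cuda.to_device(y)

axpy.prepare("fPP")  # argument types fixed once, up front
axpy.prepared_call((1, 1), (n, 1, 1), 2.0, x_dev, y_dev)

assert np.allclose(cuda.from_device_like(y_dev, y), 2.0 * x + y)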
- prepared_timed_call(grid, block, *args, shared_size=0)¶
Invoke self using launch_grid(), with args, a grid size of grid, and a block size of block. Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.
Return a 0-ary callable that can be used to query the GPU time consumed by the call, in seconds. Once called, this callable will block until completion of the invocation.
Changed in version 2012.1: shared_size was added.
- prepared_async_call(grid, block, stream, *args, shared_size=0)¶
Invoke self using launch_grid_async(), with args, a grid size of grid, and a block size of block, serialized into the pycuda.driver.Stream stream. If stream is None, do the same as prepared_call(). Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.
Changed in version 2012.1: shared_size was added.
- get_attribute(attr)¶
Return one of the attributes given by the function_attribute value attr.
All function_attribute values may also be directly read as (lower-case) attributes on the Function object itself, e.g. func.num_regs.
CUDA 2.2 and newer.
New in version 0.93.
- set_cache_config(fc)¶
See func_cache for possible values of fc.
CUDA 3.0 (post-beta) and newer.
New in version 0.94.
- set_shared_config(sc)¶
See shared_config for possible values of sc.
CUDA 4.2 and newer.
New in version 2013.1.
- local_size_bytes¶
The number of bytes of local memory used by this function.
On CUDA 2.1 and below, this is only available if this function is part of a pycuda.compiler.SourceModule. It replaces the now-deprecated attribute lmem.
- shared_size_bytes¶
The number of bytes of shared memory used by this function.
On CUDA 2.1 and below, this is only available if this function is part of a pycuda.compiler.SourceModule. It replaces the now-deprecated attribute smem.
- num_regs¶
The number of 32-bit registers used by this function.
On CUDA 2.1 and below, this is only available if this function is part of a pycuda.compiler.SourceModule. It replaces the now-deprecated attribute registers.
- set_shared_size(bytes)¶
Set shared to be the number of bytes available to the kernel in extern __shared__ arrays.
Warning
Deprecated as of version 2011.1.
- set_block_shape(x, y, z)¶
Set the thread block shape for this function.
Warning
Deprecated as of version 2011.1.
- param_set(arg1, ... argn)¶
Set the positional arguments of this function to arg1 through argn.
Warning
Deprecated as of version 2011.1.
- param_set_size(bytes)¶
Size the parameter space to bytes.
Warning
Deprecated as of version 2011.1.
- param_seti(offset, value)¶
Set the integer at offset in the parameter space to value.
Warning
Deprecated as of version 2011.1.
- param_setf(offset, value)¶
Set the float at offset in the parameter space to value.
Warning
Deprecated as of version 2011.1.
- launch()¶
Launch a single thread block of self.
Warning
Deprecated as of version 2011.1.
- launch_grid(width, height)¶
Launch a width*height grid of thread blocks of self.
Warning
Deprecated as of version 2011.1.
- class pycuda.driver.ArgumentHandler(array)¶
- class pycuda.driver.In(array)¶
Inherits from ArgumentHandler. Indicates that buffer array should be copied to the compute device before invoking the kernel.
- class pycuda.driver.Out(array)¶
Inherits from ArgumentHandler. Indicates that buffer array should be copied off the compute device after invoking the kernel.
- class pycuda.driver.InOut(array)¶
Inherits from ArgumentHandler. Indicates that buffer array should be copied both onto the compute device before invoking the kernel, and off it afterwards.
Profiler Control¶
CUDA 4.0 and newer.
- pycuda.driver.initialize_profiler(config_file, output_file, output_mode)¶
output_mode is one of the attributes of profiler_output_mode.
New in version 2011.1.
- pycuda.driver.start_profiler()¶
New in version 2011.1.
- pycuda.driver.stop_profiler()¶
New in version 2011.1.
Just-in-time Compilation¶
- pycuda.compiler.DEFAULT_NVCC_FLAGS¶
New in version 2011.1.
If no options are given in the calls below, the value of this list-type variable is used instead. This may be useful for injecting necessary flags into the compilation of automatically compiled kernels, such as those used by the module pycuda.gpuarray.
The initial value of this variable is taken from the environment variable PYCUDA_DEFAULT_NVCC_FLAGS.
If you modify this variable in your code, please be aware that this is a globally shared variable that may be modified by multiple packages. Please exercise caution in such modifications; you risk breaking other people’s code.
- class pycuda.compiler.SourceModule(source, nvcc='nvcc', options=None, keep=False, no_extern_c=False, arch=None, code=None, cache_dir=None, include_dirs=[])¶
Create a Module from the CUDA source code source. The Nvidia compiler nvcc is assumed to be on the PATH if no path to it is specified, and is invoked with options to compile the code. If keep is True, the compiler output directory is kept, and a line indicating its location in the file system is printed for debugging purposes.
Unless no_extern_c is True, the given source code is wrapped in extern “C” { … } to prevent C++ name mangling.
arch and code specify the values to be passed for the -arch and -code options on the nvcc command line. If arch is None, it defaults to the current context’s device’s compute capability. If code is None, it will not be specified.
cache_dir gives the directory used for compiler caching. If None, then cache_dir is taken to be PYCUDA_CACHE_DIR if set, or a sensible per-user default. If passed as False, caching is disabled.
If the environment variable PYCUDA_DISABLE_CACHE is set to any value, then caching is disabled. This preference overrides any value of cache_dir and can be used to disable caching globally.
This class exhibits the same public interface as pycuda.driver.Module, but does not inherit from it.
Change note: SourceModule was moved from pycuda.driver to pycuda.compiler in version 0.93.
- pycuda.compiler.compile(source, nvcc="nvcc", options=None, keep=False, no_extern_c=False, arch=None, code=None, cache_dir=None, include_dirs=[])¶
Perform the same compilation as the corresponding SourceModule constructor, but only return the resulting cubin file as a string. In particular, do not upload the code to the GPU.
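A sketch of compiling ahead of time and loading the result separately:
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda
from pycuda.compiler import compile

cubin = compile("__global__ void noop() { }")  # just a string of cubin data
mod = cuda.module_from_buffer(cubin)
noop = mod.get_function("noop")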