Initialize the GPGPU server (Anaconda, CUDA, ...)
Install Anaconda
CUDA Toolkit installation on Ubuntu
Step 1. (Optional) Follow the Installation Guide from NVIDIA to check the prerequisites.
There are two types of errors in the CUDA Runtime: sticky and non-sticky.
| | Sticky | Non-sticky |
|---|---|---|
| Description | The behavior is undefined in the event of a CUDA error which corrupts the CUDA context. This type of error is evident because it is "sticky", meaning once it occurs, every single CUDA API call will return that error, until the context is destroyed. | Non-sticky errors are cleared automatically after they are returned by a CUDA API call. |
| Examples | Any "crashed kernel" type error (invalid access, unspecified launch failure, etc.) | An attempt to cudaMalloc more data than is available in device memory. Such an operation will return an out-of-memory error. |
| How to recover | The only method to restore proper device functionality after a non-recoverable ("sticky") CUDA error is to terminate the host process that initiated it. | The error is cleared after being returned, and subsequent (valid) CUDA API calls can complete successfully, without returning an error. |
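The two behaviors in the table can be observed directly. The sketch below (it assumes a CUDA-capable device and a hypothetical `badKernel` that dereferences a null pointer) first triggers a non-sticky out-of-memory error that clears itself, then crashes a kernel to poison the context with a sticky error:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that crashes: writing through a null pointer
// is an invalid global access, which produces a sticky error.
__global__ void badKernel(int *p) {
    *p = 42;
}

int main() {
    void *buf = nullptr;

    // Non-sticky: request far more memory than any device has.
    cudaError_t err = cudaMalloc(&buf, (size_t)1 << 60);
    printf("oversized cudaMalloc: %s\n", cudaGetErrorString(err));
    // The error was returned and cleared; a valid call now succeeds.
    err = cudaMalloc(&buf, 1024);
    printf("valid cudaMalloc:     %s\n", cudaGetErrorString(err));

    // Sticky: crash a kernel with an invalid access.
    badKernel<<<1, 1>>>(nullptr);
    err = cudaDeviceSynchronize();
    printf("after crashed kernel: %s\n", cudaGetErrorString(err));
    // From here on, every CUDA API call in this process returns an error
    // until the process terminates.
    err = cudaMalloc(&buf, 1024);
    printf("cudaMalloc after sticky error: %s\n", cudaGetErrorString(err));
    return 0;
}
```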
When we launch a CUDA kernel, it runs like a Red Bull racing car and cannot be stopped before it finishes. However, we need a timeout mechanism to stop the kernel when it enters an infinite loop or a deadlock. Unfortunately, NVIDIA does not offer a clean way to stop a running kernel, which is especially needed when doing mutation testing on CUDA kernels, where a mutant may well hang forever.
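Lacking a real cancellation API, one workaround is a host-side watchdog: launch the kernel asynchronously on a stream, poll the stream against a deadline, and kill the whole host process on timeout (since a hung kernel leaves the context unusable anyway, and a sticky state would have to be discarded with the process regardless). A minimal sketch, with a hypothetical `maybeHangs` kernel standing in for the mutant under test:

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical mutant under test: it may or may not loop forever.
__global__ void maybeHangs() {
    // mutated kernel body goes here
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Launch asynchronously so the host keeps control.
    maybeHangs<<<1, 1, 0, stream>>>();

    auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);
    // cudaStreamQuery returns cudaErrorNotReady while work is still running.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        if (std::chrono::steady_clock::now() > deadline) {
            fprintf(stderr, "kernel timed out; aborting host process\n");
            std::exit(124);  // let a parent process (the test runner) restart us
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    printf("kernel finished in time\n");
    return 0;
}
```

Running each mutant in a child process, so the test runner survives the `exit`, is the natural way to structure this.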
In CUDA, the following parameters can be passed when launching a kernel:

- Grid dimensions: a `dim3` variable that can specify up to three dimensions (x, y, and z). For example, `dim3 gridDim(16, 8, 1);` launches a two-dimensional 16x8 grid of thread blocks.
- Block dimensions: also a `dim3` variable. For example, `dim3 blockDim(256, 1, 1);` gives each thread block 256 threads.
- Dynamic shared memory size: specified in the `<<<...>>>` syntax. For example, `myKernel<<<gridDim, blockDim, sharedMemSize>>>();` allocates `sharedMemSize` bytes of dynamic shared memory per thread block.
- Stream: a `cudaStream_t` variable identifying the stream to launch on.

Question: if each of the following functions is run by one thread, how many combinations of (A, B) are possible?
```cuda
__device__ int X = 1, Y = 2;
__device__ void writeXY()
{
    X = 10;
    Y = 20;
}
__device__ void readXY()
{
    int B = Y;
    int A = X;
}
```
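Under the GPU's weak memory model, with no fence between the two writes or the two reads, all four combinations of A in {1, 10} and B in {2, 20} are possible: neither thread's two memory operations are guaranteed to become visible in program order. A fence restores the intuitive ordering. A sketch, assuming the accesses are also made `volatile` so the compiler cannot cache them in registers:

```cuda
#include <cuda_runtime.h>

__device__ volatile int X = 1, Y = 2;

__device__ void writeXY_fenced() {
    X = 10;
    __threadfence();  // make the write to X visible before the write to Y
    Y = 20;
}

__device__ void readXY_fenced(int *a, int *b) {
    *b = Y;
    __threadfence();  // do not let the read of X appear to happen before the read of Y
    *a = X;
}
```

With the fences in place, observing B == 20 guarantees A == 10, eliminating the (A, B) = (1, 20) outcome.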
Note: in this article, "memory" refers at the hardware level to device memory (VRAM).

To improve memory-access concurrency, shared memory is abstractly divided into equally sized "memory modules", each of which can be accessed independently. Such a module is called a memory bank (hereafter, bank). Consequently, if n memory addresses map to n different banks, those n addresses can be accessed simultaneously, giving in theory n times the bandwidth of accessing a single bank.
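The difference between conflict-free and conflicted access patterns can be sketched in a kernel. This assumes the usual layout on recent GPUs of 32 banks, each 4 bytes wide, so an `int` at index `i` lives in bank `i % 32`:

```cuda
#include <cuda_runtime.h>

// One warp (32 threads) reads shared memory with two different strides.
__global__ void bankDemo(const int *in, int *out) {
    __shared__ int tile[32 * 32];
    int t = threadIdx.x;  // 0..31, a single warp

    tile[t] = in[t];
    __syncthreads();

    // Stride 1: addresses t map to banks t % 32, all distinct -> conflict-free,
    // the whole warp is served in one transaction.
    int conflictFree = tile[t];

    // Stride 32: addresses t * 32 all map to bank 0 -> a 32-way conflict,
    // serialized into 32 transactions.
    int conflicted = tile[t * 32];

    out[t] = conflictFree + conflicted;
}
```

Padding the inner dimension (e.g. `tile[32][33]`) is the classic trick for breaking such stride conflicts.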
Goal: explore how existing testing theory can be used and adapted to the specific use case of GPU kernels.
Two directions