Tag: GPGPU

Init the GPGPU server (Anaconda, CUDA ...)

Install Anaconda

Guide

CUDA Toolkit installation on Ubuntu

Step 1. (Optional) Follow the Installation Guide from NVIDIA to check the prerequisites.

Yujie Liu · Less than 1 minute

CUDA Runtime Error and Restore

The CUDA Runtime reports two types of errors: sticky and non-sticky.

CUDA Error: Sticky vs. Non-sticky

	Sticky	Non-Sticky
Description	A sticky error is one that corrupts the CUDA context, after which behavior is undefined. It is easy to recognize because it is "sticky": once it occurs, every subsequent CUDA API call returns that error until the context is destroyed.	Non-sticky errors are cleared automatically after they are returned by a CUDA API call.
Examples	Any "crashed kernel" type of error (invalid access, unspecified launch failure, etc.).	An attempt to `cudaMalloc` more data than is available in device memory. Such an operation returns an out-of-memory error.
How to recover	The only way to restore proper device functionality after a non-recoverable ("sticky") CUDA error is to terminate the host process that triggered the error.	The error is cleared after being returned, and subsequent (valid) CUDA API calls can complete successfully without returning an error.
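The non-sticky case from the table can be checked with a short experiment. The sketch below is only an illustration under stated assumptions: the helper name `oomIsNonSticky` is hypothetical, the 1 PiB request is assumed to exceed the memory of any installed device, and error handling is reduced to the minimum.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: returns true if an out-of-memory error from cudaMalloc
// does not prevent a later, valid cudaMalloc from succeeding.
bool oomIsNonSticky() {
    void *huge = nullptr;
    // Assumption: 1 PiB is more than any device can provide.
    size_t tooMuch = static_cast<size_t>(1) << 50;
    cudaError_t err = cudaMalloc(&huge, tooMuch);
    if (err != cudaErrorMemoryAllocation) {
        return false;  // unexpected result (no device, or the request succeeded)
    }
    std::printf("first call:  %s\n", cudaGetErrorString(err));

    // cudaGetLastError() returns the recorded error and resets it to cudaSuccess.
    cudaGetLastError();

    // The context is still healthy, so a small allocation should work.
    void *small = nullptr;
    err = cudaMalloc(&small, 1024);
    std::printf("second call: %s\n", cudaGetErrorString(err));
    if (err != cudaSuccess) {
        return false;
    }
    cudaFree(small);
    return true;
}
```

A sticky error would behave differently: after a kernel performs an invalid access, every later call in the same process keeps returning the error, and only ending the process clears it.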

Yujie Liu · Less than 1 minute

How to stop a running CUDA kernel (when timeout)

Once a CUDA kernel is launched, it runs like a Red Bull racing car and cannot be stopped until it finishes. However, we need a timeout mechanism to stop a kernel that enters an infinite loop or a deadlock. NVIDIA does not offer a clean way to stop a running kernel, which is exactly what I need when doing mutation testing on CUDA kernels.
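One commonly used workaround, not necessarily the one this article goes on to describe, is to run the kernel in a child process and kill that process when the deadline passes. The sketch below assumes a POSIX system and assumes the parent process has not initialized CUDA before calling `fork()`. The names `runWithTimeout`, `quickWork`, and `hangWork` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <chrono>
#include <thread>

__global__ void noop() {}

// Spins until *flag becomes non-zero, which never happens in this sketch.
__global__ void spinForever(volatile int *flag) {
    while (*flag == 0) {}
}

// Hypothetical workload that finishes quickly; returns a process exit code.
int quickWork() {
    noop<<<1, 1>>>();
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}

// Hypothetical workload that never finishes on its own.
int hangWork() {
    int *flag = nullptr;
    if (cudaMalloc(&flag, sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(flag, 0, sizeof(int));
    spinForever<<<1, 1>>>(flag);
    cudaDeviceSynchronize();  // blocks until the process is killed
    return 0;
}

// Runs work() in a child process. Returns true if the child exits with
// code 0 before the deadline; otherwise kills the child and returns false.
bool runWithTimeout(int (*work)(), int timeoutMs) {
    pid_t pid = fork();
    if (pid < 0) return false;   // fork failed
    if (pid == 0) _exit(work()); // child: CUDA is initialized only here

    auto deadline = std::chrono::steady_clock::now() +
                    std::chrono::milliseconds(timeoutMs);
    int status = 0;
    while (std::chrono::steady_clock::now() < deadline) {
        if (waitpid(pid, &status, WNOHANG) == pid) {
            return WIFEXITED(status) && WEXITSTATUS(status) == 0;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    kill(pid, SIGKILL);          // the driver tears down the child's context
    waitpid(pid, &status, 0);    // reap the child
    return false;
}
```

Calling `runWithTimeout(hangWork, 1000)` should return `false` after about one second, while `runWithTimeout(quickWork, 10000)` should return `true`. The cost of this design is a fresh process and CUDA context per run, and how quickly the device becomes available again after the kill depends on the driver.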

Yujie Liu · About 4 min

CUDA Tutorial Chinese

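Kernel launch

In CUDA, the following parameters can be passed when launching a kernel function:

Grid dimension: specifies the number and arrangement of thread blocks launched on the GPU. It is expressed as a dim3 variable, which can specify three dimensions (x, y, and z). For example, dim3 gridDim(16, 8, 1); launches a two-dimensional grid of 16x8 thread blocks.
Block dimension: specifies the number and arrangement of threads in each thread block. It is also expressed as a dim3 variable. For example, dim3 blockDim(256, 1, 1); means each thread block contains 256 threads.
Dynamic shared memory size: at launch time, a given amount of dynamic shared memory can be allocated to each thread block. The size is specified in the <<<...>>> syntax of the kernel call. For example, myKernel<<<gridDim, blockDim, sharedMemSize>>>(); allocates sharedMemSize bytes of dynamic shared memory to each thread block.
Stream: an optional parameter that specifies the stream in which the kernel executes. Streams can be used to control the execution order and concurrency of kernels. If none is given, the default stream is used. A stream is represented by a variable of type cudaStream_t.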
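The four launch parameters above can be seen together in one small example. This is a minimal sketch, not code from the tutorial: the kernel `blockSum` and the host helper `sumOnGpu` are hypothetical names, the block size of 256 is an arbitrary power of two, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block sums its slice of `in` in dynamic shared memory, then adds
// the partial sum to *out. Assumes blockDim.x is a power of two.
__global__ void blockSum(const int *in, int *out, int n) {
    extern __shared__ int tile[];  // size comes from the 3rd launch argument
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, tile[0]);
}

// Hypothetical host helper: sums a vector on the GPU using an explicit stream.
int sumOnGpu(const std::vector<int> &host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return 0;

    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(dIn, host.data(), n * sizeof(int),
                    cudaMemcpyHostToDevice, stream);
    cudaMemsetAsync(dOut, 0, sizeof(int), stream);

    dim3 block(256, 1, 1);                         // threads per block
    dim3 grid((n + block.x - 1) / block.x, 1, 1);  // enough blocks to cover n
    size_t sharedBytes = block.x * sizeof(int);    // dynamic shared memory
    blockSum<<<grid, block, sharedBytes, stream>>>(dIn, dOut, n);

    int result = 0;
    cudaMemcpyAsync(&result, dOut, sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // wait for all work queued on the stream

    cudaStreamDestroy(stream);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Because the copy, the memset, the kernel, and the copy back are all queued on the same stream, they run in order, and a single `cudaStreamSynchronize` is enough before reading `result`.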

Yujie Liu · About 2 min

Memory Fence

The problem

Question: in the code below, each function is run by a different thread. How many combinations of A and B are possible?

__device__ int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    Y = 20;
}

__device__ void readXY()
{
    int B = Y;
    int A = X;
}
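For comparison, here is a hedged sketch of the same two functions with `__threadfence()` inserted, following the pattern in the CUDA C++ Programming Guide. Without fences, the weakly ordered memory model does not guarantee that another thread observes the two writes in program order, so seeing B == 20 together with A == 1 is not ruled out. With the fences, that combination should no longer occur. The `volatile` qualifier, the output parameters of `readXY`, the `race` kernel, and the `runOnce` helper are additions made for this illustration.

```cuda
#include <cuda_runtime.h>

// volatile is an addition here so that every access really goes to memory.
__device__ volatile int X = 1, Y = 2;

__device__ void writeXY() {
    X = 10;
    __threadfence();  // the write to X is ordered before the write to Y
    Y = 20;
}

// Unlike the original, this version reports A and B through pointers.
__device__ void readXY(int *A, int *B) {
    *B = Y;
    __threadfence();  // the read of Y is ordered before the read of X
    *A = X;
}

// Block 0 writes and block 1 reads, so the two threads may run concurrently.
__global__ void race(int *out) {
    if (threadIdx.x == 0) {
        if (blockIdx.x == 0) {
            writeXY();
        } else {
            readXY(&out[0], &out[1]);
        }
    }
}

struct AB { int A; int B; };

// Hypothetical host helper: runs the race once and returns what the reader saw.
// X and Y are not reset afterwards, so call this once per process.
AB runOnce() {
    int *dOut = nullptr;
    cudaMalloc(&dOut, 2 * sizeof(int));
    race<<<2, 1>>>(dOut);
    int h[2] = {0, 0};
    cudaMemcpy(h, dOut, sizeof(h), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dOut);
    return AB{h[0], h[1]};
}
```

With the fences in place, the observable results should be limited to (A, B) = (1, 2), (10, 2), or (10, 20). A single run shows only one of them, so this demonstrates the pattern rather than proving the ordering.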

Yujie Liu · About 3 min

CUDA Shared Memory Bank

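Note: in this article, "memory" refers, at the hardware level, to GPU memory.

Properties of Memory Banks

To improve concurrent memory utilization, shared memory is abstractly divided into equally sized "memory modules" with the following properties:

Memory within a single "memory module" can only be accessed serially by the threads of the same warp within a kernel (one thread accessing one address at a time);
Memory in different "memory modules" can be accessed by the kernel simultaneously.

Such a "memory module" is called a Memory Bank (hereafter Bank). Therefore, if n memory addresses fall in n different Banks, these n addresses can be accessed simultaneously, and the theoretical bandwidth utilization is n times that of accessing a single Bank.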
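The mapping from addresses to Banks can be made concrete with a small calculation. The sketch below assumes 32 Banks that are each 4 bytes wide and a warp of 32 threads, which matches common NVIDIA GPUs but is an assumption here rather than something stated above. The helper names `bankOfByteOffset` and `conflictDegree` are hypothetical.

```cuda
#include <cuda_runtime.h>

constexpr int kNumBanks = 32;       // assumption: 32 banks
constexpr int kBankWidthBytes = 4;  // assumption: each bank is 4 bytes wide
constexpr int kWarpSize = 32;

// Bank that holds the given byte offset into shared memory.
__host__ __device__ inline int bankOfByteOffset(int byteOffset) {
    return (byteOffset / kBankWidthBytes) % kNumBanks;
}

// If thread t of a warp accesses the 4-byte word t * strideWords, returns the
// largest number of distinct words that fall into the same bank. 1 means no
// conflict; k means the access is serialized into k steps.
inline int conflictDegree(int strideWords) {
    int worst = 0;
    for (int bank = 0; bank < kNumBanks; ++bank) {
        int distinct = 0;
        for (int t = 0; t < kWarpSize; ++t) {
            int word = t * strideWords;
            if (bankOfByteOffset(word * kBankWidthBytes) != bank) continue;
            // Threads reading the same word share one access, so count it once.
            bool seen = false;
            for (int u = 0; u < t; ++u) {
                if (u * strideWords == word) seen = true;
            }
            if (!seen) ++distinct;
        }
        if (distinct > worst) worst = distinct;
    }
    return worst;
}
```

Under these assumptions, a stride of 1 word touches 32 different Banks and has no conflict, a stride of 32 words puts every access in Bank 0 and serializes 32 ways, and a stride of 33 words spreads the accesses across all Banks again. This is why shared-memory tiles are often padded by one column.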

Yujie Liu · About 2 min