Yesterday I learned how powerful CUDA parallel computing can be, so I decided to try it myself and set up an environment for a quick test.
After setting up the environment (which took quite some effort), running deviceQuery.exe shows the graphics card's information:
C:\Users\Pouee>"D:\Program Files\CUDA\Development\extras\demo_suite\deviceQuery.exe"
D:\Program Files\CUDA\Development\extras\demo_suite\deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1060 6GB"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 6144 MBytes (6442450944 bytes)
(10) Multiprocessors, (128) CUDA Cores/MP: 1280 CUDA Cores
GPU Max Clock rate: 1709 MHz (1.71 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: zu bytes
Total amount of shared memory per block: zu bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: zu bytes
Texture alignment: zu bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1, Device0 = GeForce GTX 1060 6GB
Result = PASS
The output above shows that my card has 1280 CUDA cores and that each block can run at most 1024 threads; with up to 65535 blocks per grid dimension, the card can schedule over 60 million threads for parallel computation.
Below is the test program (multiplying two 1024*1024 matrices with a naive, unoptimized algorithm):
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <chrono>

using namespace std;

const int N = 1024;

// Naive matrix multiplication: each block handles one row of C,
// each thread within the block one element
// (blockIdx.x = row, threadIdx.x = column).
__global__ void matMulKernel(int *c, const int *a, const int *b)
{
    int i = blockIdx.x;
    int j = threadIdx.x;
    c[i*N + j] = 0;
    for (int k = 0; k < N; ++k)
        c[i*N + j] += a[i*N + k] * b[k*N + j];
}

int main()
{
    int *a = (int *)malloc(sizeof(int) * N*N);
    int *b = (int *)malloc(sizeof(int) * N*N);
    int *c = (int *)malloc(sizeof(int) * N*N);   // GPU result
    int *d = (int *)malloc(sizeof(int) * N*N);   // CPU reference result

    // Note: with operands this large the 32-bit accumulation overflows;
    // CPU and GPU wrap the same way here, so the comparison still passes.
    for (int i = 0; i < N*N; ++i) {
        a[i] = rand() % 100000;
        b[i] = rand() % 100000;
    }

    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaSetDevice(0);
    cudaMalloc((void**)&dev_a, N*N * sizeof(int));
    cudaMalloc((void**)&dev_b, N*N * sizeof(int));
    cudaMalloc((void**)&dev_c, N*N * sizeof(int));
    cudaMemcpy(dev_a, a, N*N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*N * sizeof(int), cudaMemcpyHostToDevice);

    using namespace chrono;
    auto start = system_clock::now();
    matMulKernel<<<N, N>>>(dev_c, dev_a, dev_b);
    cudaDeviceSynchronize();   // kernel launches are asynchronous; wait before timing
    cudaMemcpy(c, dev_c, sizeof(int) * N*N, cudaMemcpyDeviceToHost);
    auto end = system_clock::now();
    auto duration = duration_cast<microseconds>(end - start);
    auto gpu = duration.count();
    cout << "GPU total time: " << gpu << endl;

    cout << "CPU starting" << endl;
    start = system_clock::now();
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            d[i*N + j] = 0;
            for (int k = 0; k < N; ++k) {
                d[i*N + j] += a[i*N + k] * b[k*N + j];
            }
        }
    }
    end = system_clock::now();
    duration = duration_cast<microseconds>(end - start);
    auto cpu = duration.count();
    cout << "CPU total time: " << cpu << endl;
    cout << "speedup: " << cpu / 1.0 / gpu << endl;

    // Verify the GPU result against the CPU reference
    for (int i = 0; i < N*N; i++) {
        if (d[i] != c[i]) {
            cout << "error " << i << endl;
            break;
        }
    }

    // Free device memory before resetting the device
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaDeviceReset();
    free(a); free(b); free(c); free(d);
    return 0;
}
The results are as follows (the CPU is an i5-8400; times are in microseconds):
C:\Users\Pouee>C:\Users\Pouee\source\repos\MatrixMul\x64\Release\MatrixMul.exe
GPU total time: 19022
CPU starting
CPU total time: 3649257
speedup: 191.844
C:\Users\Pouee>C:\Users\Pouee\source\repos\MatrixMul\x64\Release\MatrixMul.exe
GPU total time: 18769
CPU starting
CPU total time: 3718320
speedup: 198.11
C:\Users\Pouee>C:\Users\Pouee\source\repos\MatrixMul\x64\Release\MatrixMul.exe
GPU total time: 18809
CPU starting
CPU total time: 3802177
speedup: 202.147
In this example the GPU is roughly 200 times faster than the CPU.
There is also a good article on this topic: link