Yesterday I learned how powerful CUDA parallel computing is and suddenly wanted to try it myself, so I set up an environment and ran a quick test.

After setting up the environment (which took quite a bit of effort), running deviceQuery.exe shows the graphics card's information:

C:\Users\Pouee>"D:\Program Files\CUDA\Development\extras\demo_suite\deviceQuery.exe"
D:\Program Files\CUDA\Development\extras\demo_suite\deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060 6GB"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6144 MBytes (6442450944 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1709 MHz (1.71 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               zu bytes
  Total amount of shared memory per block:       zu bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          zu bytes
  Texture alignment:                             zu bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1, Device0 = GeForce GTX 1060 6GB
Result = PASS

The output above shows that my card has 1280 CUDA cores, that a grid can contain up to 65,535 blocks (per y/z dimension; the x dimension allows even more), and that each block can run up to 1,024 threads. In other words, a single kernel launch on this card can spawn over 60 million parallel threads (65,535 × 1,024 ≈ 67 million).
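
If you would rather read these properties from your own code than run the sample binary, the runtime API exposes them through cudaGetDeviceProperties. A minimal sketch (the fields shown are standard members of cudaDeviceProp):

#include <cstdio>
#include "cuda_runtime.h"

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  SMs: %d, max threads/block: %d, warp size: %d\n",
               prop.multiProcessorCount, prop.maxThreadsPerBlock, prop.warpSize);
        printf("  max grid size: (%d, %d, %d)\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("  global memory: %zu bytes\n", prop.totalGlobalMem);
    }
    return 0;
}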

Below is the test program: it multiplies two 1024×1024 matrices using a naive, unoptimized algorithm.


#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <chrono>
using namespace std;

const int N = 1024;

// Naive matrix multiplication: one thread per output element.
// blockIdx.x selects the row, threadIdx.x selects the column.
__global__ void matMulKernel(int *c, const int *a, const int *b)
{
    int i = blockIdx.x;
    int j = threadIdx.x;
    int sum = 0;  // accumulate in a register instead of global memory
    for (int k = 0; k < N; ++k)
        sum += a[i*N + k] * b[k*N + j];
    c[i*N + j] = sum;
}

int main()
{
    int *a = (int *)malloc(sizeof(int)*N*N);
    int *b = (int *)malloc(sizeof(int)*N*N);
    int *c = (int *)malloc(sizeof(int)*N*N);  // GPU result
    int *d = (int *)malloc(sizeof(int)*N*N);  // CPU result, for verification
    // Note: with operands up to 99,999 the dot products overflow int.
    // CPU and GPU wrap the same way here, so the comparison still passes,
    // but use a smaller modulus (or a wider type) if the values matter.
    for (int i = 0; i < N*N; ++i) {
        a[i] = rand() % 100000;
        b[i] = rand() % 100000;
    }

    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    // Error checks on the CUDA calls are omitted for brevity.
    cudaSetDevice(0);
    cudaMalloc((void**)&dev_a, N*N * sizeof(int));
    cudaMalloc((void**)&dev_b, N*N * sizeof(int));
    cudaMalloc((void**)&dev_c, N*N * sizeof(int));

    cudaMemcpy(dev_a, a, N*N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*N * sizeof(int), cudaMemcpyHostToDevice);

    using namespace chrono;

    // The GPU interval covers the kernel plus the device-to-host copy.
    auto start = system_clock::now();
    matMulKernel<<<N, N>>>(dev_c, dev_a, dev_b);
    cudaDeviceSynchronize();
    cudaMemcpy(c, dev_c, sizeof(int) * N*N, cudaMemcpyDeviceToHost);
    auto end = system_clock::now();
    auto duration = duration_cast<microseconds>(end - start);
    auto gpu = duration.count();
    cout << "GPU total time (us): " << gpu << endl;

    cout << "CPU starting" << endl;
    start = system_clock::now();
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            d[i*N + j] = 0;
            for (int k = 0; k < N; ++k) {
                d[i*N + j] += a[i*N + k] * b[k*N + j];
            }
        }
    }
    end = system_clock::now();
    duration = duration_cast<microseconds>(end - start);
    auto cpu = duration.count();
    cout << "CPU total time (us): " << cpu << endl;
    cout << "Speedup (CPU/GPU): " << cpu / 1.0 / gpu << endl;

    // Verify the GPU result against the CPU reference.
    for (int i = 0; i < N*N; ++i) {
        if (d[i] != c[i]) {
            cout << "error at " << i << endl;
            break;
        }
    }

    // Release device memory before tearing down the context.
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaDeviceReset();

    free(a); free(b); free(c); free(d);
    return 0;
}
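
One caveat about the measurement: the chrono interval above includes the device-to-host copy, and system_clock is fairly coarse for GPU work. To time just the kernel, CUDA events are the usual tool. A sketch of how the launch could be wrapped (my illustration, not part of the program above):

cudaEvent_t startEv, stopEv;
cudaEventCreate(&startEv);
cudaEventCreate(&stopEv);

cudaEventRecord(startEv);                    // enqueue start marker on the stream
matMulKernel<<<N, N>>>(dev_c, dev_a, dev_b);
cudaEventRecord(stopEv);                     // enqueue stop marker after the kernel
cudaEventSynchronize(stopEv);                // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, startEv, stopEv);  // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(startEv);
cudaEventDestroy(stopEv);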

The results are as follows (the CPU is an i5-8400; times are in microseconds):

C:\Users\Pouee>C:\Users\Pouee\source\repos\MatrixMul\x64\Release\MatrixMul.exe
GPU total time (us): 19022
CPU starting
CPU total time (us): 3649257
Speedup (CPU/GPU): 191.844

C:\Users\Pouee>C:\Users\Pouee\source\repos\MatrixMul\x64\Release\MatrixMul.exe
GPU total time (us): 18769
CPU starting
CPU total time (us): 3718320
Speedup (CPU/GPU): 198.11

C:\Users\Pouee>C:\Users\Pouee\source\repos\MatrixMul\x64\Release\MatrixMul.exe
GPU total time (us): 18809
CPU starting
CPU total time (us): 3802177
Speedup (CPU/GPU): 202.147

In this example the GPU is roughly 200 times faster than the single-threaded CPU version.
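
Since the kernel above is deliberately unoptimized, it is worth noting what the standard first optimization looks like: shared-memory tiling, where each block stages TILE×TILE sub-matrices in on-chip shared memory so each global value is loaded once per tile instead of once per thread. A sketch, assuming N is a multiple of TILE (TILE, matMulTiled, and the dim3 launch shape are my naming, not from the program above):

#define TILE 32

__global__ void matMulTiled(int *c, const int *a, const int *b, int n)
{
    __shared__ int As[TILE][TILE];
    __shared__ int Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int sum = 0;

    // Walk across the tiles of a's row and b's column.
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of each tile into shared memory.
        As[threadIdx.y][threadIdx.x] = a[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Partial dot product over the tile, served from shared memory.
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    c[row * n + col] = sum;
}

// Launch shape: dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE);
// matMulTiled<<<grid, block>>>(dev_c, dev_a, dev_b, N);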

There is also a good article on this topic here: link