CUDA C programming with 2 video cards

I am very new to CUDA programming and was reading the 'CUDA C Programming Guide' provided by NVIDIA (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf).

On page 25, it has the following C code that performs matrix multiplication. Can you tell me how I can make that code run on two devices (assuming I have two NVIDIA CUDA-capable cards in my computer)? Can you show me an example?

// Matrices are stored in row-major order: 
// M(row, col) = *(M.elements + row * M.stride + col) 
typedef struct { 
    int width; 
    int height; 
    int stride; 
    float* elements; 
} Matrix; 

// Get a matrix element 
__device__ float GetElement(const Matrix A, int row, int col) 
{ 
    return A.elements[row * A.stride + col]; 
} 

// Set a matrix element 
__device__ void SetElement(Matrix A, int row, int col, float value) 
{ 
    A.elements[row * A.stride + col] = value; 
} 

// Thread block size (defined before its first use in GetSubMatrix) 
#define BLOCK_SIZE 16 

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is 
// located col sub-matrices to the right and row sub-matrices down 
// from the upper-left corner of A 
__device__ Matrix GetSubMatrix(Matrix A, int row, int col) 
{ 
    Matrix Asub; 
    Asub.width = BLOCK_SIZE; 
    Asub.height = BLOCK_SIZE; 
    Asub.stride = A.stride; 
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col]; 
    return Asub; 
} 

// Forward declaration of the matrix multiplication kernel 
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix); 

// Matrix multiplication - Host code 
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE 
void MatMul(const Matrix A, const Matrix B, Matrix C) 
{ 
    // Load A and B to device memory 
    Matrix d_A; 
    d_A.width = d_A.stride = A.width; d_A.height = A.height; 
    size_t size = A.width * A.height * sizeof(float); 
    cudaMalloc(&d_A.elements, size); 
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice); 
    Matrix d_B; 
    d_B.width = d_B.stride = B.width; d_B.height = B.height; 
    size = B.width * B.height * sizeof(float); 
    cudaMalloc(&d_B.elements, size); 
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice); 

    // Allocate C in device memory 
    Matrix d_C; 
    d_C.width = d_C.stride = C.width; d_C.height = C.height; 
    size = C.width * C.height * sizeof(float); 
    cudaMalloc(&d_C.elements, size); 

    // Invoke kernel 
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); 
    dim3 dimGrid(B.width/dimBlock.x, A.height/dimBlock.y); 
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C); 

    // Read C from device memory 
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost); 

    // Free device memory 
    cudaFree(d_A.elements); 
    cudaFree(d_B.elements); 
    cudaFree(d_C.elements); 
} 

// Matrix multiplication kernel called by MatMul() 
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) 
{ 
    // Block row and column 
    int blockRow = blockIdx.y; 
    int blockCol = blockIdx.x; 

    // Each thread block computes one sub-matrix Csub of C 
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol); 

    // Each thread computes one element of Csub 
    // by accumulating results into Cvalue 
    float Cvalue = 0; 

    // Thread row and column within Csub 
    int row = threadIdx.y; 
    int col = threadIdx.x; 

    // Loop over all the sub-matrices of A and B that are 
    // required to compute Csub 
    // Multiply each pair of sub-matrices together 
    // and accumulate the results 
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m)
    {
        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory 
    // Each thread writes one element 
    SetElement(Csub, row, col, Cvalue); 
}

2012-07-16 Dale B

I know that NVIDIA brought some improvements to its multi-device API with CUDA 3.0. Whenever CUDA interacts with a GPU, it does so in the context of a thread. If you want to interact with multiple GPUs, you have to manage that manually yourself, both in the code and by manually decomposing the specific mathematical operation you want to perform (in this case a matrix multiply, which is probably not that hard but not exactly trivial either, since it needs a map/reduce approach). Edit: it will be easier to help you if you state exactly what you are looking for. – Svend

There is no "automatic" way to run a CUDA kernel on multiple GPUs.

You will have to devise a way to decompose the matrix multiplication problem into independent operations that can be executed in parallel (so one on each GPU, in parallel). As a simple example:

C = A.B is equivalent to C = [A].[B1|B2] = [A.B1|A.B2], where B1 and B2 are suitably sized matrices containing the columns of the matrix B, and | denotes column-wise concatenation. You can compute A.B1 and A.B2 as separate matrix multiplication operations, and then perform the concatenation when copying the resulting sub-matrices back to host memory.
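To make the identity concrete, here is a minimal host-only sketch that computes the two column blocks separately and concatenates them. The helper names hostMatMul, columnBlock and splitMatMul are made up for this illustration and do not come from the programming guide; matrices are assumed to be dense and row-major.

```cuda
#include <cstddef>
#include <vector>

// Hypothetical helper: naive row-major product C = A * B on the host,
// where A is n x k, B is k x m and C is n x m.
std::vector<float> hostMatMul(const std::vector<float>& A,
                              const std::vector<float>& B,
                              int n, int k, int m)
{
    std::vector<float> C(static_cast<size_t>(n) * m, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            for (int e = 0; e < k; ++e)
                C[i * m + j] += A[i * k + e] * B[e * m + j];
    return C;
}

// Hypothetical helper: copy columns [col0, col0 + cols) of the k x m
// matrix B into a dense k x cols matrix.
std::vector<float> columnBlock(const std::vector<float>& B,
                               int k, int m, int col0, int cols)
{
    std::vector<float> out(static_cast<size_t>(k) * cols);
    for (int r = 0; r < k; ++r)
        for (int c = 0; c < cols; ++c)
            out[r * cols + c] = B[r * m + col0 + c];
    return out;
}

// Compute C = [A.B1 | A.B2]: multiply A with each half of B separately,
// then write each partial result into its own column range of C.
std::vector<float> splitMatMul(const std::vector<float>& A,
                               const std::vector<float>& B,
                               int n, int k, int m)
{
    int half = m / 2;                    // B1 holds the first half of the columns
    int widths[2]  = { half, m - half };
    int offsets[2] = { 0, half };
    std::vector<float> C(static_cast<size_t>(n) * m);
    for (int part = 0; part < 2; ++part) {
        std::vector<float> Bp = columnBlock(B, k, m, offsets[part], widths[part]);
        std::vector<float> Cp = hostMatMul(A, Bp, n, k, widths[part]);
        for (int r = 0; r < n; ++r)
            for (int c = 0; c < widths[part]; ++c)
                C[r * m + offsets[part] + c] = Cp[r * widths[part] + c];
    }
    return C;
}
```

The two iterations of the loop over part do not depend on each other, which is what makes it possible to hand each one to a different GPU.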

Once you have a suitable decomposition scheme, implement it using the standard multi-GPU facilities in the CUDA 4.x API. For a good overview of multi-GPU programming using the CUDA APIs, I recommend watching Paulius Micikevicius' excellent talk from GTC 2012, which is available as a streaming video and PDF here.
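As a rough illustration of how the column decomposition could map onto several devices, here is a sketch under a few assumptions. SimpleMatMul, MultiGpuMatMul and TILE are hypothetical names. The kernel is a deliberately naive one (no shared memory), swapped in for the guide's tiled kernel to keep the example short. Matrices are plain row-major float arrays rather than the Matrix struct from the question. Only runtime calls that I am sure exist are used (cudaGetDeviceCount, cudaSetDevice, cudaMalloc, cudaMemcpy, cudaMemcpy2D, cudaFree).

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // assumed tile size; same value as BLOCK_SIZE in the question

// Naive kernel (no shared memory): C = A * B for an n x k matrix A and a
// k x cols column block B, all densely packed in row-major order.
__global__ void SimpleMatMul(const float* A, const float* B, float* C,
                             int n, int k, int cols)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= cols) return;
    float acc = 0.0f;
    for (int e = 0; e < k; ++e)
        acc += A[row * k + e] * B[e * cols + col];
    C[row * cols + col] = acc;
}

// Hypothetical wrapper: A is n x k, B is k x m, C is n x m (host, row-major).
// Returns false if no device is found or any CUDA call fails.
bool MultiGpuMatMul(const float* A, const float* B, float* C,
                    int n, int k, int m)
{
    int devices = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0)
        return false;
    if (devices > m) devices = m;  // give every device at least one column

    std::vector<float*> dA(devices, nullptr), dB(devices, nullptr),
                        dC(devices, nullptr);
    std::vector<int> col0(devices, 0), cols(devices, 0);
    bool ok = true;

    // Phase 1: upload inputs and launch one kernel per device. Kernel
    // launches are asynchronous, so device d can compute while the host
    // is busy uploading data to device d + 1.
    for (int d = 0; d < devices && ok; ++d) {
        col0[d] = static_cast<int>(static_cast<long long>(m) * d / devices);
        cols[d] = static_cast<int>(static_cast<long long>(m) * (d + 1) / devices) - col0[d];
        size_t rowBytes = cols[d] * sizeof(float);
        ok = ok && cudaSetDevice(d) == cudaSuccess;
        ok = ok && cudaMalloc(&dA[d], sizeof(float) * n * k) == cudaSuccess;
        ok = ok && cudaMalloc(&dB[d], rowBytes * k) == cudaSuccess;
        ok = ok && cudaMalloc(&dC[d], rowBytes * n) == cudaSuccess;
        ok = ok && cudaMemcpy(dA[d], A, sizeof(float) * n * k,
                              cudaMemcpyHostToDevice) == cudaSuccess;
        // Strided copy: take columns [col0, col0 + cols) out of every row of B.
        ok = ok && cudaMemcpy2D(dB[d], rowBytes, B + col0[d], m * sizeof(float),
                                rowBytes, k, cudaMemcpyHostToDevice) == cudaSuccess;
        if (!ok) break;
        dim3 block(TILE, TILE);
        dim3 grid((cols[d] + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        SimpleMatMul<<<grid, block>>>(dA[d], dB[d], dC[d], n, k, cols[d]);
        ok = cudaGetLastError() == cudaSuccess;
    }

    // Phase 2: copy each column block into its place in C (the
    // concatenation step), then release the device buffers.
    for (int d = 0; d < devices; ++d) {
        if (cudaSetDevice(d) != cudaSuccess) { ok = false; continue; }
        if (ok)
            ok = cudaMemcpy2D(C + col0[d], m * sizeof(float), dC[d],
                              cols[d] * sizeof(float), cols[d] * sizeof(float),
                              n, cudaMemcpyDeviceToHost) == cudaSuccess;
        cudaFree(dA[d]);
        cudaFree(dB[d]);
        cudaFree(dC[d]);
    }
    return ok;
}
```

Every device receives a full copy of A, so this layout trades extra memory and transfer time for simplicity. Pinned host memory with cudaMemcpyAsync and streams would probably overlap the transfers better, but that is beyond this sketch.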

2012-07-16 09:45:32 talonmies

Thank you very much for all your answers. They helped a lot. –

The basics are described in the CUDA C Programming Guide, section 3.2.6.

Basically, you can set which GPU the current host thread operates on by calling cudaSetDevice(). Either way, you have to write your own code to decompose your routines and split them across multiple GPUs.
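For reference, a minimal sketch of device selection might look like the following. VisitAllDevices is a hypothetical helper name; the runtime calls themselves (cudaGetDeviceCount, cudaSetDevice, cudaGetDeviceProperties) are part of the CUDA runtime API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: make each visible device current in turn and print
// its name. Returns the number of devices visited (0 if none or on error).
int VisitAllDevices()
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess)
        return 0;

    int visited = 0;
    for (int d = 0; d < deviceCount; ++d) {
        // After this call, allocations and kernel launches issued by this
        // host thread target device d.
        if (cudaSetDevice(d) != cudaSuccess)
            continue;
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess)
            continue;
        std::printf("device %d: %s\n", d, prop.name);
        ++visited;
    }
    return visited;
}
```

Memory allocated while one device is current belongs to that device, so each GPU needs its own cudaMalloc calls and its own copies of the input data.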

2012-07-16 09:47:37 inf
