Basic Tutorial

1. GASPI Execution model

  • GASPI features an SPMD / MPMD style of execution.
  • All GASPI procedures come with the prefix gaspi_.
  • All procedures have a return value.
  • All potentially blocking procedures have a timeout mechanism.

2. Error handling

Procedure return values:

  typedef enum
  {
    GASPI_ERROR   = -1,
    GASPI_SUCCESS =  0,
    GASPI_TIMEOUT =  1
  } gaspi_return_t;
  • GASPI_ERROR
    • designated operation failed -> check the error vector
    • advice: always check the return value!
  • GASPI_SUCCESS
    • designated operation successfully completed
  • GASPI_TIMEOUT
    • designated operation could not be finished in the given period of time
    • not necessarily an error
    • the procedure has to be invoked again in order to fully complete the designated operation

3. The GASPI timeout mechanism for potentially blocking procedures

Timeout: gaspi_timeout_t

  • GASPI_TEST ( value = 0 )
    • procedure completes local operations
    • procedure does not wait for data from other processes
  • GASPI_BLOCK ( value = -1 )
    • wait indefinitely (blocking)
  • value > 0
    • maximum time in milliseconds the procedure waits for data from other ranks in order to make progress; this is not a hard bound on execution time

The procedure is guaranteed to return.

4. Process management

gaspi_proc_init
gaspi_return_t
gaspi_proc_init ( gaspi_timeout_t const timeout )
  • initialization of resources
    • set up of communication infrastructure if requested
    • set up of default group GASPI_GROUP_ALL
  • rank assignment
    • position in the machinefile corresponds to the rank ID
  • no default segment creation
gaspi_proc_term
gaspi_return_t
gaspi_proc_term ( gaspi_timeout_t timeout )
  • clean up
    • wait for outstanding communication to be finished
    • release resources
  • not a collective operation!
gaspi_proc_rank
gaspi_return_t
gaspi_proc_rank ( gaspi_rank_t *rank )
  • process/rank identification; returns the rank ID
gaspi_proc_num
gaspi_return_t
gaspi_proc_num ( gaspi_rank_t *proc_num )
  • returns the total number of processes/ranks

5. Hello world

Example: Hello world in GASPI

/*
 * This file is part of a small series of tutorials,
 * which aims to demonstrate key features of the GASPI
 * standard by means of small but expandable examples.
 * Conceptually the tutorial follows an MPI course
 * developed by EPCC and HLRS.
 *
 * Contact point for the MPI tutorial:
 *                 rabenseifner@hlrs.de
 * Contact point for the GASPI tutorial:
 *                 daniel.gruenewald@itwm.fraunhofer.de
 *                 mirko.rahn@itwm.fraunhofer.de
 *                 christian.simmendinger@t-systems.com
 */

#include "success_or_die.h"

#include <GASPI.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc,
     char *argv[])
{
  SUCCESS_OR_DIE(gaspi_proc_init (GASPI_BLOCK));

  gaspi_rank_t rank;
  gaspi_rank_t num;

  SUCCESS_OR_DIE(gaspi_proc_rank (&rank));
  SUCCESS_OR_DIE(gaspi_proc_num (&num));

  printf ("Hello world from rank %d of %d\n",
		rank,
		num);

  SUCCESS_OR_DIE(gaspi_proc_term (GASPI_BLOCK));

  return EXIT_SUCCESS;
}
#ifndef SUCCESS_OR_DIE_H
#define SUCCESS_OR_DIE_H

#include <GASPI.h>
#include <stdio.h>
#include <stdlib.h>

#define SUCCESS_OR_DIE(f...)                                            \
  do                                                                    \
  {                                                                     \
    const gaspi_return_t r = f;                                         \
    if (r != GASPI_SUCCESS)                                             \
    {                                                                   \
      printf("Error: '%s' [%s:%i]: %i\n",#f,__FILE__,__LINE__,r);       \
      exit (EXIT_FAILURE);                                              \
    }                                                                   \
  } while (0)
#endif

6. Segments

  • software abstraction of hardware memory hierarchy
    • NUMA
    • GPU
    • Xeon Phi
  • a single partition of the Partitioned Global Address Space (PGAS)
    • contiguous block of virtual memory
    • no pre-defined memory model
  • memory management is up to the application
    • locally / remotely accessible
    • local access by ordinary memory operations
    • remote access by GASPI communication routines

GASPI provides only a few relatively large segments

  • segment allocation is expensive
  • the total number of supported segments is limited by hardware constraints

GASPI segments have an allocation policy

  • GASPI_MEM_UNINITIALIZED
    • memory is not initialized
  • GASPI_MEM_INITIALIZED
    • memory is initialized (zeroed)
gaspi_segment_create
gaspi_return_t
gaspi_segment_create ( gaspi_segment_id_t segment_id
                     , gaspi_size_t size
                     , gaspi_group_t group
                     , gaspi_timeout_t timeout
                     , gaspi_alloc_t alloc_policy )
  • collective shortcut for
    • gaspi_segment_alloc (for more details - see GASPI specification)
    • gaspi_segment_register (for more details - see GASPI specification)

After successful completion, the segment is locally and remotely accessible by all ranks in the group.

gaspi_segment_ptr
gaspi_return_t
gaspi_segment_ptr ( gaspi_segment_id_t segment_id
                  , gaspi_pointer_t *pointer )
  • returns a local pointer to the allocated segment

Example: Segment allocation in GASPI

/*
 * This file is part of a small series of tutorials,
 * which aims to demonstrate key features of the GASPI
 * standard by means of small but expandable examples.
 * Conceptually the tutorial follows an MPI course
 * developed by EPCC and HLRS.
 *
 * Contact point for the MPI tutorial:
 *                 rabenseifner@hlrs.de
 * Contact point for the GASPI tutorial:
 *                 daniel.gruenewald@itwm.fraunhofer.de
 *                 mirko.rahn@itwm.fraunhofer.de
 *                 christian.simmendinger@t-systems.com
 */

#include "success_or_die.h"

#include <GASPI.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc,
     char *argv[])
{
  static const int VLEN = 1 << 2;
  SUCCESS_OR_DIE(gaspi_proc_init (GASPI_BLOCK));

  gaspi_rank_t iProc, nProc;
  SUCCESS_OR_DIE(gaspi_proc_rank (&iProc));
  SUCCESS_OR_DIE(gaspi_proc_num (&nProc));

  gaspi_segment_id_t const segment_id = 0;
  gaspi_size_t const segment_size = VLEN * sizeof(double);

  SUCCESS_OR_DIE(gaspi_segment_create
		 (segment_id,
		  segment_size,
		  GASPI_GROUP_ALL,
		  GASPI_BLOCK,
		  GASPI_MEM_UNINITIALIZED
		  ));

  gaspi_pointer_t array;
  SUCCESS_OR_DIE(gaspi_segment_ptr (segment_id,
				    &array));

  for (int j = 0; j < VLEN; ++j)
    {
      ((double *) array)[j] = (double) (iProc * VLEN + j);
      printf ("rank %d elem %d: %f \n",
		    iProc,
		    j,
		    ((double *) array)[j]);
    }

  SUCCESS_OR_DIE(gaspi_proc_term (GASPI_BLOCK));
  return EXIT_SUCCESS;
}

7. Queues in GASPI

Several queues are available to handle communication requests

  • requests are submitted to one of the supported queues

Advantages

  • more scalability
  • channels for different types of requests
  • similar types of requests are queued and synchronized together, but independently of other request types
  • separation of concerns

Fairness of transfers posted to different queues is guaranteed

  • no queue should see its communication requests delayed indefinitely
  • a queue is identified by its ID
  • synchronization of calls by the queue
  • queue order does not imply message order on the network / in remote memory
  • a subsequent notify call is guaranteed to be non-overtaking for all previous posts to the same queue and rank.
gaspi_wait
gaspi_return_t
gaspi_wait ( gaspi_queue_id_t queue
           , gaspi_timeout_t timeout )
  • waits for local completion of all requests in a given queue
  • after successful completion, all involved local buffers are valid

8. GASPI One-sided Communication

One-sided communication:

  • the entire communication is managed by the local process only
  • the remote process is not involved
  • advantage: no inherent synchronization between the local and the remote process in every communication request
  • still: at some point the remote process needs to know about data availability
  • managed by weak synchronization primitives

Several notifications are available for a given segment

  • identified by a notification ID
  • logical association of a memory location with a notification
gaspi_write_notify
gaspi_return_t
gaspi_write_notify ( gaspi_segment_id_t segment_id_local
                   , gaspi_offset_t offset_local
                   , gaspi_rank_t rank
                   , gaspi_segment_id_t segment_id_remote
                   , gaspi_offset_t offset_remote
                   , gaspi_size_t size
                   , gaspi_notification_id_t notification_id
                   , gaspi_notification_t notification_value
                   , gaspi_queue_id_t queue
                   , gaspi_timeout_t timeout )
  • posts a put request into a given queue for transferring data from a local segment into a remote segment
  • posts a notification with a given value to the same queue
  • remote visibility of the notification guarantees remote data visibility of all previously posted writes to the same queue, segment and remote rank
gaspi_notify_waitsome
gaspi_return_t
gaspi_notify_waitsome ( gaspi_segment_id_t segment_id
                      , gaspi_notification_id_t notific_begin
                      , gaspi_number_t notification_num
                      , gaspi_notification_id_t *first_id
                      , gaspi_timeout_t timeout )
  • monitors a contiguous subset of notification IDs for a given segment
  • returns successfully if at least one of the monitored IDs has been remotely updated to a value not equal to zero
gaspi_notify_reset
gaspi_return_t
gaspi_notify_reset ( gaspi_segment_id_t segment_id
                   , gaspi_notification_id_t notification_id
                   , gaspi_notification_t *old_notification_val)

  • atomically resets a given notification ID and yields the old value

9. Communication Example

Round robin communication with gaspi_write_notify and gaspi_notify_waitsome.

  • init local buffer
  • write to remote buffer
  • wait for data availability
  • print result

Example: Round robin communication in GASPI

/*
 * This file is part of a small series of tutorials,
 * which aims to demonstrate key features of the GASPI
 * standard by means of small but expandable examples.
 * Conceptually the tutorial follows an MPI course
 * developed by EPCC and HLRS.
 *
 * Contact point for the MPI tutorial:
 *                 rabenseifner@hlrs.de
 * Contact point for the GASPI tutorial:
 *                 daniel.gruenewald@itwm.fraunhofer.de
 *                 mirko.rahn@itwm.fraunhofer.de
 *                 christian.simmendinger@t-systems.com
 */

#include "success_or_die.h"
#include "assert.h"
#include "waitsome.h"

#include <GASPI.h>
#include <stdio.h>
#include <stdlib.h>

#define RIGHT(iProc,nProc) ((iProc + nProc + 1) % nProc)
#define LEFT(iProc,nProc) ((iProc + nProc - 1) % nProc)

int
main(int argc,
     char *argv[])
{
  static const int VLEN = 1 << 2;

  SUCCESS_OR_DIE(gaspi_proc_init (GASPI_BLOCK));

  gaspi_rank_t iProc, nProc;
  SUCCESS_OR_DIE(gaspi_proc_rank (&iProc));
  SUCCESS_OR_DIE(gaspi_proc_num (&nProc));

  gaspi_segment_id_t const segment_id = 0;
  gaspi_size_t const segment_size = 2 * VLEN * sizeof(double);

  SUCCESS_OR_DIE(gaspi_segment_create
		 (segment_id,
		  segment_size,
		  GASPI_GROUP_ALL,
		  GASPI_BLOCK,
		  GASPI_MEM_UNINITIALIZED
		  ));

  gaspi_pointer_t array;
  SUCCESS_OR_DIE(gaspi_segment_ptr (segment_id,
				    &array));

  double * src_array = (double *) (array);
  double * rcv_array = src_array + VLEN;

  for (int j = 0; j < VLEN; ++j)
    {
      src_array[j] = (double) (iProc * VLEN + j);
    }

  gaspi_notification_id_t data_available = 0;
  gaspi_queue_id_t queue_id = 0;

  gaspi_offset_t loc_off = 0;
  gaspi_offset_t rem_off = VLEN * sizeof(double);


  SUCCESS_OR_DIE(gaspi_wait (queue_id,
			     GASPI_BLOCK));

  SUCCESS_OR_DIE(gaspi_write_notify( segment_id,
		 loc_off,
		 RIGHT (iProc, nProc),
		 segment_id,
		 rem_off,
		 VLEN * sizeof (double),
		 data_available,
		 1 + iProc,
		 queue_id,
		 GASPI_BLOCK
		 ));

  gaspi_notification_id_t id;
  gaspi_notification_t expected = 1 + LEFT(iProc,
					   nProc);
  
  SUCCESS_OR_DIE(gaspi_notify_waitsome (segment_id,
					data_available,
					1,
					&id,
					GASPI_BLOCK));
  ASSERT(id == data_available);

  gaspi_notification_t value;
  SUCCESS_OR_DIE(gaspi_notify_reset (segment_id,
				     id,
				     &value));
  ASSERT(value == expected);

  for (int j = 0; j < VLEN; ++j)
    {
      printf ("rank %d rcv elem %d: %f \n",
		    iProc,
		    j,
		    rcv_array[j]);
    }


  SUCCESS_OR_DIE(gaspi_wait (queue_id,
			     GASPI_BLOCK));

  SUCCESS_OR_DIE(gaspi_proc_term (GASPI_BLOCK));

  return EXIT_SUCCESS;
}
#ifndef ASSERT_H
#define ASSERT_H

#include <stdio.h>
#include <stdlib.h>

#define ASSERT(x...)                                                    \
  do                                                                    \
  {                                                                     \
    if (!(x))                                                           \
    {                                                                   \
      fprintf (stderr, "Error: '%s' [%s:%i]\n", #x, __FILE__, __LINE__); \
      exit (EXIT_FAILURE);                                              \
    }                                                                   \
  } while (0)

#endif
#ifndef WAITSOME_H
#define WAITSOME_H

#include <GASPI.h>
void wait_or_die ( gaspi_segment_id_t
                 , gaspi_notification_id_t
                 , gaspi_notification_t expected
                 );

#endif
#include "waitsome.h"
#include "assert.h"
#include "success_or_die.h"

void
wait_or_die(gaspi_segment_id_t segment_id,
	    gaspi_notification_id_t notification_id,
	    gaspi_notification_t expected
	    )
{
  gaspi_notification_id_t id;
  SUCCESS_OR_DIE(gaspi_notify_waitsome (segment_id,
					notification_id,
					1,
					&id,
					GASPI_BLOCK));

  ASSERT(id == notification_id);

  gaspi_notification_t value;
  SUCCESS_OR_DIE(gaspi_notify_reset (segment_id,
				     id,
				     &value));

  ASSERT(value == expected);
}

Advanced Tutorial

10. Pipelined Matrix Transpose

  • highlights the interoperability between GASPI and MPI
  • demonstrates how to use notified communication for improved scalability

Pipelined Transpose

Pipelined global matrix transpose (column-based matrix distribution)

  • hybrid implementation with a global transpose followed by a local transpose
  • the required communication to all target ranks is issued in a single communication step
  • download Pipelined Matrix Transpose