PBS with ALPS Launch Example
Below is a sample PBS script that uses ALPS (`aprun`) to launch applications. In this example, one application uses 4 nodes, another uses 1 node, and the SOS aggregator runs on 1 node. The applications were compiled with TAU measurement support, and TAU was configured to send data to SOS. The TAU/SOS integration works such that TAU will fork-exec the `sosd` listener daemon on each node if one isn't already running (see `SOS_FORK_COMMAND` in the script below). This is necessary because ALPS only allows one application to be launched with `aprun` on each node.
#!/bin/bash -l
#PBS -N test
#PBS -q debug
#PBS -A ACCOUNTID
#PBS -l nodes=6,walltime=00:10:00
#PBS -j oe
#PBS -o both.out
# echo all commands - for debugging purposes
set -x
###
# -------------------- set the environment -------------------- #
###
# change to the directory where we were launched
cd $PBS_O_WORKDIR
# get the current working directory
export cwd=$(pwd)
echo ${cwd}
# load modules
module swap PrgEnv-pgi PrgEnv-gnu
module load flexpath/1.12
module load adios/1.12.0
module load python/2.7.9
# set some output file names
export xmainOut="xmain.out"
export readOut="read2.out"
# SOS_FORK_COMMAND is used when TAU can't connect an SOS client to an existing SOS listener.
# The application MPI ranks will self-organize, and only 1 rank from each node will
# attempt to launch the SOS listener on that node, using this command.
# TAU substitutes @LISTENER_RANK@ with that node's listener rank, computed from
# the node's index among the application's nodes plus the SOS_LISTENER_RANK_OFFSET value.
export sos_cmd="${cwd}/sosd -l 5 -a 1 -w ${cwd}"
export SOS_FORK_COMMAND="${sos_cmd} -k @LISTENER_RANK@ -r listener"
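# For example, on the reader node below (where SOS_LISTENER_RANK_OFFSET=1), the
# substituted command would expand to something like (illustrative sketch):
#   ${cwd}/sosd -l 5 -a 1 -w ${cwd} -k 1 -r listener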
# Set the TCP port that the listener will listen on, and that clients will
# attempt to connect to.
export SOS_CMD_PORT=22500
# Set the directory that the SOS listeners and aggregators will use to establish
# EVPath links to each other. This must be a shared filesystem path, and it
# must be writable, of course.
export SOS_EVPATH_MEETUP=${cwd}
# Tell TAU that it should connect to SOS and send TAU data to SOS whenever
# adios_close() or adios_advance_step() is called, and when the application terminates.
export TAU_SOS=1
# Uncomment if verbose TAU output is required for debugging.
# export TAU_VERBOSE=1
# Make sure sosd can find libenet.so, if rpath wasn't used
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/sw/xk6/flexpath/1.12/cle5.2_gnu4.9.3/lib
# clean up old files, if necessary
rm -rf sosd.* profile* *.out
###
# -------------------- Launch the SOS aggregator -------------- #
###
# launch the aggregator - ALPS will take the first node of the allocation
# and the aggregator will be "rank" 0 in the SOS processes. Launch
# in the background, so we can continue launching other aprun calls.
aprun -n 1 -N 1 ${sos_cmd} -k 0 -r aggregator &
# sleep a few seconds, just to allow the aggregator to safely boot
sleep 5
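# (Optional, assumed) sanity check: the daemons keep their sosd.* runtime files
# in ${cwd} (stale ones were removed above), so their presence after the sleep
# suggests the aggregator booted. This check is a guess, not part of SOS itself.
ls -l ${cwd}/sosd.* 2>/dev/null || echo "warning: no sosd.* files found yet"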
###
# --------- Launch the listener app in the pipeline ----------- #
###
# 1 node doing reader
# ALPS will take the second node of the allocation
# Tell SOS how many application ranks per node there are
export SOS_APP_RANKS_PER_NODE=16
# Tell SOS what "rank" its listeners should start with - the
# aggregator was "rank" 0, so this node's listener will be 1
export SOS_LISTENER_RANK_OFFSET=1
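# In other words, a listener's SOS "rank" is its node's index within this
# aprun launch plus SOS_LISTENER_RANK_OFFSET: 0 + 1 = 1 for this single node.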
# Where should TAU write the profile data?
export PROFILEDIR=profiles_reader2
# Go! in the background, so we can continue launching aprun calls
aprun -n 16 -N 16 ./reader2 5 x z > ${readOut} 2>&1 &
# Wait a bit. Just to be safe...
sleep 3
# last 4 nodes doing xmain
# ALPS will take the third, fourth, fifth, sixth nodes of the allocation
# Tell SOS how many application ranks per node there are
export SOS_APP_RANKS_PER_NODE=16
# Tell SOS what "rank" its listeners should start with - the
# aggregator was "rank" 0, and the reader node was 1, so these
# nodes' listeners will start at 2 and be 2, 3, 4, and 5
export SOS_LISTENER_RANK_OFFSET=2
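# Same arithmetic as above: node indices 0-3 plus the offset of 2 yield
# listener ranks 2, 3, 4, and 5 across the four xmain nodes.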
# Where should TAU write the profile data?
export PROFILEDIR=profiles_xmain
# Go! in the foreground, so when xmain exits, the PBS allocation will exit.
aprun -n 64 -N 16 ./xmain > ${xmainOut} 2>&1
# wait for clean exit
sleep 3
wait
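When the job completes, the application output should be in read2.out and xmain.out, the TAU profiles in the profiles_reader2 and profiles_xmain directories, and the sosd.* files (the daemons' databases and runtime state) in the working directory. Exactly which sosd.* files appear depends on the sosd build and options, so treat that last point as a rough expectation rather than a guarantee.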