Applications Performance

Results of running a standard NCOM ocean model case on Neo, Janeway, and a few other machines.

The Fortran compiler does not yet know about the SN1 (i.e. you can't use f77 -Ofast=ip35), so further optimization may be available later.

The single-processor time is in line with the clock speed increase from 300 to 400 MHz. (Note that it is still significantly slower than the Alpha processors.)

It was very difficult to get 0.5GB of memory for a single-processor job from Miser on Janeway, and some single-processor runs were much slower than the times reported here, probably because the allocated memory was not local to the processor (or the local memory bandwidth was shared with another job). This suggests that the smaller per-processor memory capacity of Neo is going to be a significant factor on NRL's workload.

Taking the 16-processor time as the most representative of scaling (since it probably isn't optimal to use all the processors of the SN1 on one job), the SN1 is about 2x as fast as the 300MHz O2K and 3x as fast as the T3E. As usual, SHMEM scales better than MPI, but both MPI and SHMEM performance improve on the SN1.

The following were runs on Janeway with ssusage:

janeway 183> pwd
/scratch/wallcraf/NCOM1/med12h
janeway 184> ll *log
-rw-r--r--    1 wallcraf   28638 Jul  3 10:59 ncom1_sgimpisr04.log
-rw-r--r--    1 wallcraf   31421 Jul  3 11:01 ncom1_sgimpisr16.log
-rw-r--r--    1 wallcraf   35142 Jul  3 11:05 ncom1_sgimpisr32.log
-rw-r--r--    1 wallcraf   26991 Jul  3 10:56 ncom1_sgione.log
-rw-r--r--    1 wallcraf   28425 Jul  3 11:13 ncom1_sgishmem04.log
-rw-r--r--    1 wallcraf   30111 Jul  3 11:14 ncom1_sgishmem16.log
-rw-r--r--    1 wallcraf   32335 Jul  3 11:15 ncom1_sgishmem32.log
janeway 185> cd SN1_400
/scratch/wallcraf/NCOM1/med12h/SN1_400
janeway 186> ll *log
-rw-r--r--    1 wallcraf   28536 Jul  1 08:51 ncom1_sn1mpisr04.log
-rw-r--r--    1 wallcraf   31241 Jul  1 08:52 ncom1_sn1mpisr16.log
-rw-r--r--    1 wallcraf   34740 Jul  1 08:55 ncom1_sn1mpisr32.log
-rw-r--r--    1 wallcraf   26920 Jun 30 17:54 ncom1_sn1one.log
-rw-r--r--    1 wallcraf   27916 Jul  2 12:22 ncom1_sn1shmem04.log
-rw-r--r--    1 wallcraf   29242 Jul  2 12:22 ncom1_sn1shmem16.log
-rw-r--r--    1 wallcraf   31019 Jul  2 12:57 ncom1_sn1shmem32.log

Wall time for 20 time steps of NCOM (NRL Coastal Ocean Model) on a 432x208x31 domain (requires about 0.5GB of memory).

A) Single processor times.
ncom1_alpha7one.log: time = 86.61230 (750 MHz)
ncom1_alpha6one.log: time = 96.45800 (666 MHz)
ncom1_sn1one.log: time = 143.50466 (400 MHz)
ncom1_sgione.log: time = 198.60332 (300 MHz)
ncom1_sp3one.log: time = 214.59000 (222 MHz)
ncom1_E10Kone.log: time = 301.49290 (400 MHz)
ncom1_sgione.log: time = 330.10620 (195 MHz)
ncom1_t3eone.log: time = 355.62373 (450 MHz)

Machine types:
Samsung Alpha 750MHz (Linux),
Compaq Alpha 667MHz (Linux),
SGI SN1 400MHz (IRIX64),
SGI Origin 300MHz (IRIX64),
IBM SP3 222MHz (AIX),
Sun E10000 400MHz (Solaris),
SGI Origin 195MHz (IRIX64),
Cray T3E 450MHz (unicosmk).
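
As a rough check on the clock-speed remark, the single-processor times in A) can be normalized by clock rate. The sketch below uses only the times and clock speeds listed above. The dictionary layout, the short machine labels, and the "time x MHz" metric (proportional to the number of clock cycles spent on the run) are illustrative assumptions, not part of the original benchmark.

```python
# Single-processor NCOM wall times (seconds) and clock speeds (MHz) from list A.
# The short labels are made up here for readability.
single_cpu = {
    "Alpha 750": (86.61230, 750),
    "Alpha 666": (96.45800, 666),
    "SN1 400": (143.50466, 400),
    "Origin 300": (198.60332, 300),
    "SP3 222": (214.59000, 222),
    "E10000 400": (301.49290, 400),
    "Origin 195": (330.10620, 195),
    "T3E 450": (355.62373, 450),
}

SN1_TIME = single_cpu["SN1 400"][0]


def relative_to_sn1(name):
    """Wall time of `name` divided by the SN1 wall time (>1 means slower)."""
    return single_cpu[name][0] / SN1_TIME


def giga_cycles(name):
    """Wall time x clock rate: billions of clock cycles spent on the run."""
    seconds, mhz = single_cpu[name]
    return seconds * mhz / 1000.0


for name in single_cpu:
    print(f"{name:11s} {relative_to_sn1(name):5.2f}x SN1 time, "
          f"{giga_cycles(name):6.1f} Gcycles")
```

On these numbers the 300 MHz Origin takes about 1.38x as long as the SN1, slightly more than the 400/300 = 1.33 clock ratio.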

B) Sorted by machine type, ordered by 16-processor speed. Parallel method is either SHMEM or MPI using sendrecv. SN1 times are on a 32-processor beta machine.
ncom1_sn1one.log: time = 143.50466 (400 MHz)
ncom1_sn1shmem04.log: time = 38.79059
ncom1_sn1shmem16.log: time = 10.44291
ncom1_sn1shmem32.log: time = 6.52013
ncom1_sn1one.log: time = 143.50466 (400 MHz)
ncom1_sn1mpisr04.log: time = 39.76463
ncom1_sn1mpisr16.log: time = 11.96889
ncom1_sn1mpisr32.log: time = 9.45634
ncom1_sgione.log: time = 198.60332 (300 MHz)
ncom1_sgishmem04.log: time = 56.82219
ncom1_sgishmem16.log: time = 19.09513
ncom1_sgishmem32.log: time = 10.54070
ncom1_sgione.log: time = 198.60332 (300 MHz)
ncom1_sgimpisr04.log: time = 57.44376
ncom1_sgimpisr16.log: time = 21.48812
ncom1_sgimpisr32.log: time = 13.65315
ncom1_sgione.log: time = 330.10620 (195 MHz)
ncom1_sgishmem04.log: time = 68.05619
ncom1_sgishmem16.log: time = 31.25177
ncom1_sgishmem32.log: time = 25.68538
ncom1_t3eone.log: time = 355.62373 (450 MHz)
ncom1_t3eshmem04.log: time = 97.62309
ncom1_t3eshmem16.log: time = 32.56402
ncom1_t3eshmem32.log: time = 23.75500
ncom1_sgione.log: time = 330.10620 (195 MHz)
ncom1_sgimpisr04.log: time = 70.24065
ncom1_sgimpisr16.log: time = 41.54240
ncom1_sgimpisr32.log: time = 24.28571
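
The scaling claims can be checked directly from the times in B). The following is a minimal sketch, assuming each block of parallel times belongs to the machine named by the single-processor line that precedes it. The machine labels and function names are hypothetical; the numbers are copied from the log lines above.

```python
# NCOM wall times (seconds) from list B, keyed by (machine, method),
# then by processor count. Block-to-machine grouping is an assumption
# based on the single-processor line that heads each block.
times = {
    ("SN1 400", "SHMEM"): {1: 143.50466, 4: 38.79059, 16: 10.44291, 32: 6.52013},
    ("SN1 400", "MPI"):   {1: 143.50466, 4: 39.76463, 16: 11.96889, 32: 9.45634},
    ("O2K 300", "SHMEM"): {1: 198.60332, 4: 56.82219, 16: 19.09513, 32: 10.54070},
    ("O2K 300", "MPI"):   {1: 198.60332, 4: 57.44376, 16: 21.48812, 32: 13.65315},
    ("O2K 195", "SHMEM"): {1: 330.10620, 4: 68.05619, 16: 31.25177, 32: 25.68538},
    ("T3E 450", "SHMEM"): {1: 355.62373, 4: 97.62309, 16: 32.56402, 32: 23.75500},
    ("O2K 195", "MPI"):   {1: 330.10620, 4: 70.24065, 16: 41.54240, 32: 24.28571},
}


def speedup(key, nproc):
    """Single-processor time divided by the nproc-processor time."""
    return times[key][1] / times[key][nproc]


def efficiency(key, nproc):
    """Parallel efficiency: speedup divided by processor count."""
    return speedup(key, nproc) / nproc


def ratio_to_sn1(key, nproc=16, method="SHMEM"):
    """How many times longer `key` takes than the SN1 with the given method."""
    return times[key][nproc] / times[("SN1 400", method)][nproc]


for key in times:
    machine, method = key
    print(f"{machine} {method:5s} 16-proc speedup {speedup(key, 16):5.2f}, "
          f"efficiency {efficiency(key, 16):4.2f}, "
          f"{ratio_to_sn1(key):4.2f}x SN1 SHMEM time")
```

With SHMEM at 16 processors this gives a ratio of roughly 1.8 for the 300MHz O2K and 3.1 for the T3E, consistent with the "about 2x" and "3x" figures quoted earlier.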


Table of standard NLOM benchmark times on Neo compared to the O2Ks and the T3E. (Note that the 300MHz O2K times quoted here were run on an SGI benchmark machine, not Janeway. The 195MHz times are from Odyssey when it had 128 processors.)

This is a larger problem than the NCOM benchmark, and it is also a completely realistic run (not just timing a few time steps). The benchmark was described in the Fall NAVO MSRC Navigator, see:

http://www.navo.hpc.mil/cgi-bin/Navigator/navigator.cgi

Taking 28-processor times as representative, the SN1 is 1.5x as fast as the O2K-300 (in line with clock speed) and 2x as fast as the T3E. Note that it is 2.8x faster than the 195MHz O2K, so the 8MB cache on the 300 and 400 MHz R12000s probably gives a boost for this code over the 4MB cache on the 195MHz R10000.
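
The ratios quoted here follow from the 28-processor SHMEM times in the NLOM table on this page. A short sketch of the arithmetic, with hypothetical labels and function name, compares each measured ratio with the corresponding clock ratio:

```python
# 28-processor SHMEM times in minutes, taken from the NLOM table on this page.
minutes_28 = {"SN1 400": 10.3, "O2K 300": 15.9, "O2K 195": 28.9, "T3E-900": 21.0}


def sn1_advantage(machine):
    """How many times longer `machine` takes than the SN1 at 28 processors."""
    return minutes_28[machine] / minutes_28["SN1 400"]


for machine in ("O2K 300", "T3E-900", "O2K 195"):
    print(f"SN1 is {sn1_advantage(machine):.2f}x as fast as {machine}")

# Clock ratios, for comparison with the measured O2K ratios.
print(f"400/300 MHz = {400 / 300:.2f}, 400/195 MHz = {400 / 195:.2f}")
```

The measured ratio against the 195MHz O2K (about 2.8) is well above the 400/195 clock ratio (about 2.05), which is the gap the cache-size explanation is meant to account for.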

Performance of NRL Layered Ocean Model on HPC Platforms
MACHINE       PARALLEL METHOD  NUM. CPUS  TIME       MFLOPS  SPEEDUP
Cray T3E-900  SHMEM                   14  44.1 mins   1,064  (450 MHz)
Cray T3E-900  SHMEM                   28  21.0 mins   2,236  2.10x 14 nodes
Cray T3E-900  SHMEM                   56  10.2 mins   4,591  2.06x 28 nodes
Cray T3E-900  SHMEM                  112   5.7 mins   8,184  1.79x 56 nodes
Cray T3E-900  SHMEM                  224   3.4 mins  13,601  1.68x 112 nodes
SGI SN1       SHMEM                   14  24.1 mins   1,948  (400 MHz)
SGI SN1       SHMEM                   28  10.3 mins   4,534  2.33x 14 nodes
SGI SN1       MPI                     14  24.3 mins   1,926  (400 MHz)
SGI SN1       MPI                     28  10.6 mins   4,423  2.30x 14 nodes
SGI O2K       SHMEM                   14  39.1 mins   1,198  (300 MHz)
SGI O2K       SHMEM                   28  15.9 mins   2,940  2.45x 14 nodes
SGI O2K       SHMEM                   56   7.4 mins   6,322  2.15x 28 nodes
SGI O2K       SHMEM                  112   4.1 mins  11,478  1.82x 56 nodes
SGI O2K       MPI                     14  40.1 mins   1,168  (300 MHz)
SGI O2K       MPI                     28  16.2 mins   2,900  2.48x 14 nodes
SGI O2K       MPI                     56   8.6 mins   5,472  1.89x 28 nodes
SGI O2K       MPI                    112   7.0 mins   6,705  1.23x 56 nodes
SGI O2K       SHMEM                   14  57.6 mins     814  (195 MHz)
SGI O2K       SHMEM                   28  28.9 mins   1,625  2.00x 14 nodes
SGI O2K       SHMEM                   56  15.6 mins   3,015  1.86x 28 nodes
SGI O2K       SHMEM                  112   7.8 mins   6,030  2.00x 56 nodes

  • Times are for a 3-day 1/32 degree Atlantic STG ocean model run
  • Grid size 2048 x 1344 x 5
  • Run includes typical I/O and data sampling
  • Does not include initialization time (before first time step)
  • MFLOPS based on hardware trace of single processor O2K run, MFLOPS = 2,813,188 / (Time in seconds)
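
The MFLOPS and SPEEDUP columns can be approximately reproduced from the TIME column with the formula in the last note. The sketch below does this for the Cray T3E-900 SHMEM rows. It assumes the constant 2,813,188 is the total work in millions of floating point operations (which is what the formula implies); the function and variable names are hypothetical. Because the tabulated times are rounded to 0.1 minute, the recomputed MFLOPS match the table only to within a percent or two.

```python
# Constant from the note: MFLOPS = 2,813,188 / (time in seconds).
# Assumed here to be total work in millions of floating point operations.
TOTAL_MFLOP = 2_813_188


def mflops(minutes):
    """Sustained MFLOPS for a run that took `minutes` of wall time."""
    return TOTAL_MFLOP / (minutes * 60.0)


def step_speedup(prev_minutes, minutes):
    """Speedup relative to the previous (half as many CPUs) row."""
    return prev_minutes / minutes


# (cpus, minutes, tabulated MFLOPS) for the Cray T3E-900 SHMEM rows.
t3e_rows = [
    (14, 44.1, 1064),
    (28, 21.0, 2236),
    (56, 10.2, 4591),
    (112, 5.7, 8184),
    (224, 3.4, 13601),
]

prev = None
for cpus, minutes, tabulated in t3e_rows:
    line = f"{cpus:4d} CPUs: {mflops(minutes):8.0f} MFLOPS (table: {tabulated})"
    if prev is not None:
        line += f", {step_speedup(prev, minutes):.2f}x vs previous row"
    print(line)
    prev = minutes
```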

View Halo Benchmark Graph


Send comments or questions to ccshelp@cmf.nrl.navy.mil
