Holly Cluster

UCLA Mathematics Consulting Group





The HOLLY cluster is a set of ten Intel Pentium III rack machines, each running at 1.0 GHz with 2.0 GB of memory per node, plus a fileserver with a 10-disk RAID array. These machines run the current version of SuSE Linux, 8.2. The nodes are numbered holly01 through holly10, and the file server is hollyfs.

The purpose of these machines is to run short-term serial jobs, generally for parametric studies and algorithmic development, for the Applied Mathematics Group.


Running jobs on HOLLY


Summary: Compile your code, rsh to hollyfs, run the batch job submitter, view your job in the queue, view your output and/or job execution statistics.

1) First, it is assumed that you have already compiled your code (in C, C++, F77, F90, etc.).
2) From the directory where your code is located, spawn a remote shell to hollyfs. To do this, type: rsh hollyfs
3) To run the batch job submitter, simply type:

job.q


A queueing script will then appear in your UNIX window and guide you through a series of questions for submitting your job. First and foremost, you should "build" the command script that runs the job. To do this, select "b" for build; the queueing script then creates the command script for you.

If, instead of using the batch submission (queueing) procedure, you log in directly to one of the holly machines, you can also run jobs that way. However, you run the risk of overloading a machine that may already be running jobs. In addition, as long as your job is running there, the queueing system will overlook that machine when submitting a new job. That is, other users who want to submit to holly may not be apprised of the true status of the available machines while you are running your job outside of the queueing system.


Reviewing job queue status on HOLLY


To review the job queue status on HOLLY, simply type:

qstat


Then, you will see a screen which displays the jobs running on the various HOLLY nodes. It will look something like this:

job-ID  prior  name        user  state  submit/start at      queue      master  ja-task-ID
------------------------------------------------------------------------------------------
    21      0  hello.sh.c  ra    t      09/26/2003 15:47:48  holly03.q  MASTER

To interpret this: the job-ID is 21, the user is ra, the file submitted is hello.sh.cmd (the name column shows it truncated to hello.sh.c), the time of submission is 3:47 PM on 9/26/2003, and the job ran on queue 'holly03.q' (one of the HOLLY nodes). You may also see jobs that other users have submitted to the holly queue.
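As a rough illustration of reading the queue column, the following sketch lists the HOLLY nodes that have no job in a qstat listing. The function name idle_nodes is hypothetical, and the sketch assumes that queue names always have the form hollyNN.q, as in the sample line. Jobs started outside the queueing system do not appear in qstat, so a node reported as idle may still be busy.

```shell
#!/bin/sh
# Sketch only: report nodes holly01..holly10 with no hollyNN.q entry
# in qstat-style output read from standard input.
# Assumption: queue names look like holly03.q, as in the sample listing.
idle_nodes() {
    awk '
        {
            # Remember every node whose queue name appears in any field.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^holly[0-9][0-9]\.q$/)
                    busy[substr($i, 1, 7)] = 1
        }
        END {
            # Print the nodes that were never seen.
            for (n = 1; n <= 10; n++) {
                node = sprintf("holly%02d", n)
                if (!(node in busy))
                    print node
            }
        }
    '
}
```

On hollyfs this could be used as: qstat | idle_nodes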


Sample job run on HOLLY



The following is the sequence of screens that appears in your command-line window after you run 'job.q':


Notice to users of the job.queue script:
The output for SGE jobs generated by the job.queue script
will be written to two files:


'jobname'.joblog will contain the output from the 'jobname'.cmd script.


'jobname'.output will contain the output from the program or script being executed.


Enter to continue.




Functions (acceptable abbreviations are shown in CAPS)
Menu: Display this menu
Build: Build a SGE .cmd file for Serial
Submit: Submit a SGE .cmd file for execution
STatus: Display the status of SGE jobs for ra
SYsstat: Display the status of SGE jobs for the system
Hold: Hold a SGE job
RELease: Release a SGE job that is held
RESet: Reset the priority of a SGE job
Cancel: Cancel a SGE job
Quit: Exit this script

Command: b (<-- for build)




Enter the name of the program or script to be executed : hello.sh (<-- for example)



Checking for duplicate queue control files.
You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.cmd" file.

Do you want to remove this file and continue (y or n)?
'default n': y (<-- for example)




Enter any arguments for the hello.sh program or script (default none):



The "hello.sh.cmd" file has been built. Would you like to submit it (y or n)? : y (<-- for example)



Checking for duplicate output files.
You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.joblog" file.
You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.output" file.

Do you want to overwrite these files and continue (y or n)?
'default n': y (<-- for example)




your job 19 ("hello.sh.cmd") has been submitted
Current SGE job status for ra

job-ID  prior  name        user  state  submit/start at      queue      master  ja-task-ID
------------------------------------------------------------------------------------------
    19      0  hello.sh.c  ra    qw     09/25/2003 12:00:05

Enter to continue.




BRIEF EXPLANATION OF WHAT WENT ON ABOVE.

1) You ran 'job.q'.
2) It displayed a notice screen; you hit enter
3) It gave you a menu; you selected 'b' to build the command script
4) It asked for the name of the file to run (e.g., hello.sh)
5) It found that you already had a command script; it asked whether to remove it; you answered yes
6) It asked you to enter any arguments; you had none, and hit enter
7) It told you the command file was built; it asked whether to submit it; you answered yes
8) It checked for duplicate output files; it asked whether to overwrite them; you answered yes
   (please note: each output file is NOW tagged with the job ID number at the end)
9) It submitted your job, and gave you a message indicating its status in the queue
10) If you hit enter again, you return to the main menu, where you can type 'q' to quit
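Once the job has finished, the two output files can be viewed from the same directory. The sketch below prints every .joblog and .output file for a given job name. The function name show_job_files is hypothetical, and the trailing * in each pattern is an assumption meant to cover file names that carry a job ID tag at the end, since the exact form of that tag is not given here.

```shell
#!/bin/sh
# Sketch only: print the .joblog and .output files for one job name.
# Assumption: any job ID tag is appended after ".joblog" / ".output".
show_job_files() {
    name=$1
    for f in "$name".joblog* "$name".output*; do
        # A pattern that matched nothing is left as-is by the shell; skip it.
        [ -f "$f" ] || continue
        echo "==> $f"
        cat "$f"
    done
}
```

For the example above, this would be called as: show_job_files hello.sh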



FURTHER NOTES

1) If you have already built your command script and just want to run it again (perhaps with
   different input only), type 's' for submit instead of 'b' for build. When you 'submit' a job, you
   still need only type the filename (e.g., hello.sh), and NOT the command script name hello.sh.cmd.
2) You will receive 2 emails at your address regarding the job: the first tells you when the job
   started, and the second gives you full details of the job (runtime, completion time, etc.).
   Samples of these emails are below.
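For a parametric study, one possible approach is to submit a small wrapper script that runs your compiled program once per parameter value, one after another, as a single serial job. The following is only a sketch: run_sweep, the program name ./myprog, and the result_<value>.txt file names are all hypothetical, and it assumes your program takes the parameter as its first argument and writes its results to standard output.

```shell
#!/bin/sh
# Sketch only: run one program serially over a list of parameter values.
# Usage: run_sweep PROGRAM VALUE...
# Assumption: PROGRAM takes the parameter as its first argument
# and writes its results to standard output.
run_sweep() {
    prog=$1
    shift
    for p in "$@"; do
        # One result file per value, so earlier runs are not overwritten.
        "$prog" "$p" > "result_$p.txt" || echo "run with parameter $p failed" >&2
    done
}

# Example call (hypothetical program name and values):
#   run_sweep ./myprog 0.1 0.5 1.0
```

A script containing such a call would then be the name you enter when the build step asks for the program or script to be executed.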



The EMAILS that you will receive regarding the job you just ran

Subject: Job 19 (hello.sh.cmd) Started

Job 19 (hello.sh.cmd) Started
User = ra
Queue = holly03.q
Host = holly03.math.ucla.edu
Start Time = 09/25/2003 12:00:13

Subject: Job 19 (hello.sh.cmd) Complete

Job 19 (hello.sh.cmd) Complete
User = ra
Queue = holly03.q
Host = holly03.math.ucla.edu
Start Time = 09/25/2003 12:00:13
End Time = 09/25/2003 12:00:13
User Time = 00:00:00
System Time = 00:00:00
Wallclock Time = 00:00:00
CPU = NA
Max vmem = NA
Exit Status = 0
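
If you save the completion email to a file, its "Field = value" lines can be read by a script. The sketch below extracts a single field and checks the exit status. The function names email_field and job_succeeded are hypothetical, and the sketch assumes the layout shown in the sample emails above, with an exit status of 0 taken to mean the job ended normally.

```shell
#!/bin/sh
# Sketch only: read a job email on standard input and print the value
# of the first line of the form "FIELD = value".
email_field() {
    awk -F' = ' -v key="$1" '$1 == key { print $2; exit }'
}

# Succeeds when the email on standard input reports "Exit Status = 0".
job_succeeded() {
    [ "$(email_field 'Exit Status')" = "0" ]
}
```

For example, with the email saved in a file of your choosing, email_field 'Wallclock Time' < that file prints the run time, and job_succeeded < that file can guard a follow-up step.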