
Lintula

N1 Grid Engine


General information

N1 Grid Engine is meant for running computational tasks as batch jobs. The system automatically distributes jobs to computers that the user has access to and that have free resources. The grid includes the following computers: the outolintu cluster and classrooms TC217 and TC407 from the Institute of Signal Processing, the tltmatlab nodes from the Institute of Communications Engineering, and some of the Lintula ssh servers. By default a job may run for 24 hours; for longer jobs the user needs to set a new limit, see Resource requirements.

Before using Grid Engine, the user must add a few environment variables to the current shell. For bash users this can be done as follows:

source /share/sge_root/outolintu/common/settings.sh
And for tcsh users like this:
source /share/sge_root/outolintu/common/settings.csh
This line can be added to the user's shell initialization file.
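If the same initialization file is used on machines where /share/sge_root may not be mounted, it can be safer to load the settings only when the file is readable. A minimal sketch for bash or sh, assuming the settings path given above; the variable name SGE_SETTINGS is only illustrative:

```shell
# Illustrative variable name; the path is the one given in these instructions.
SGE_SETTINGS=/share/sge_root/outolintu/common/settings.sh

# Load the Grid Engine environment only if the settings file is readable,
# so the same init file also works on machines without /share/sge_root.
if [ -r "$SGE_SETTINGS" ]; then
    . "$SGE_SETTINGS"
fi
```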

One user can have 10 jobs running at the same time. The number of jobs waiting in the queue is not limited.

Sending a job

Jobs are submitted to Grid Engine with the qsub command. A job can be a single command or a shell script. Jobs can be submitted from the Lintula ssh servers (harakka, kiuru), Sun Ray clients, employee workstations, and the outolintu and tltmatlab nodes.

Submitting a job:

mustaharakka:~$ qsub ./testi.sh
Your job 155 ("testi.sh") has been submitted
mustaharakka:~$
If the job is a binary file, it must be submitted with qsub -b y.
mustaharakka:~$ qsub -b y uname
Your job 165 ("uname") has been submitted
mustaharakka:~$
A job consisting of multiple commands can also be submitted without writing a shell script. qsub without parameters reads the commands from the stdin stream. This way commands can be entered from the keyboard or from a file, and the job is submitted when the input ends.
mustaharakka:~$ qsub
uname
ls -l
<CTRL-D>
Your job 166 ("STDIN") has been submitted
mustaharakka:~$
More instructions for submitting a job are available with qsub -help and man qsub.
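The stdin form can also be driven from a script instead of the keyboard. The following is only a sketch: the function names job_commands and submit_job are made up for this example, and submit_job falls back to printing the commands on machines where qsub is not installed.

```shell
# Print the commands that make up the job, one per line
# (the same two commands as in the stdin example above).
job_commands() {
    cat <<'EOF'
uname
ls -l
EOF
}

# Pipe the commands to qsub, which reads the job from stdin.
# On a machine without qsub the commands are only printed.
submit_job() {
    if command -v qsub >/dev/null 2>&1; then
        job_commands | qsub
    else
        job_commands
    fi
}
```

Calling submit_job on a machine with Grid Engine set up should then behave like typing the two commands after qsub and ending the input.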

Results

The stdout stream of a finished job can be found in the user's home directory in the file <jobname>.o<jobid>[.taskid], and the stderr stream in <jobname>.e<jobid>[.taskid].

For example, the outputs of the previous job named "STDIN" are in the files STDIN.o166 and STDIN.e166.
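The naming rule can be written down as a small helper, which may be handy in scripts that collect results. This is a sketch only; the function name job_output_files is hypothetical and the function just prints the names relative to the home directory.

```shell
# Print the stdout and stderr file names for a job, one per line.
# Usage: job_output_files <jobname> <jobid> [taskid]
job_output_files() {
    name=$1
    suffix=$2
    # Tasks of an array job get an extra ".taskid" suffix.
    if [ -n "${3:-}" ]; then
        suffix="$suffix.$3"
    fi
    printf '%s\n' "$name.o$suffix" "$name.e$suffix"
}
```

For the job above, job_output_files STDIN 166 prints STDIN.o166 and STDIN.e166.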

Resource requirements

Jobs can be given soft and hard resource requirements. Hard requirements must be fulfilled in order for the job to be scheduled. Grid Engine tries to satisfy soft requirements, but the job can also be run on a machine which doesn't meet all of them. By default requirements are considered hard; soft requirements are requested with the -soft parameter. Resource requirements are given by submitting a job with the -l parameter.

The architecture that jobs are run on is one of sol-sparc64, lx24-x86 or lx24-amd64.

Here we submit a job which can only be run on a Solaris machine:

mustaharakka:~$ qsub -l arch=sol-sparc64 ./job.sh
Your job 175 ("job.sh") has been submitted
mustaharakka:~$
Multiple requirements can be given separated by commas:
mustaharakka:~$ qsub -l arch=sol-sparc64,mf=3G ./job.sh
Your job 176 ("job.sh") has been submitted
mustaharakka:~$
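When the requirements are assembled in a script, the comma-separated argument for -l can be built from a list. A minimal sketch; the helper name join_resources is an assumption made for this example:

```shell
# Join resource requirements into the comma-separated form expected by -l.
# Usage: join_resources arch=sol-sparc64 mf=3G
join_resources() {
    # "$*" joins the arguments with the first character of IFS;
    # the subshell keeps the IFS change local to this function.
    ( IFS=,; printf '%s\n' "$*" )
}
```

It could then be used as qsub -l "$(join_resources arch=sol-sparc64 mf=3G)" ./job.sh, which is equivalent to the command shown above.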

Some of the available resources are listed here:

Resource          Explanation
arch              Architecture the job will be run on, one of lx24-x86, lx24-amd64 or sol-sparc64.
h_cpu=HH:MM:SS    cputime, time the job is allowed to spend on the CPU. When the limit is reached the job will be killed.
h_rt=HH:MM:SS     realtime, time the job is allowed to spend. When the limit is reached the job will be killed. By default the limit is 24 hours.
                  Bigger limits, for example 4 days, can be given like this:
                  qsub -l h_rt=96:0:0 ./skripti.sh
mf=<amount>       memoryfree, required amount of free physical memory.
                  For example, require the host running the job to have at least 1 GB of free memory:
                  qsub -l mf=1G ./skripti.sh
mt=<amount>       memorytotal, required amount of total physical memory.
vf=<amount>       virtualfree, required amount of free virtual memory.
vt=<amount>       virtualtotal, required amount of total virtual memory.
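Since h_rt is given in hours, minutes and seconds, limits expressed in days have to be converted first (4 days = 4 × 24 = 96 hours). A small sketch of that arithmetic; the function name days_to_h_rt is hypothetical, and it prints 96:00:00, which denotes the same duration as the 96:0:0 used above:

```shell
# Convert a run time in whole days to the HH:MM:SS form used by h_rt.
# Usage: days_to_h_rt <days>
days_to_h_rt() {
    printf '%d:00:00\n' "$(( $1 * 24 ))"
}
```

For example: qsub -l h_rt="$(days_to_h_rt 4)" ./skripti.sh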

Forbidding the restart of the job

If a computer running jobs crashes or is restarted, its jobs are sent to some other available computer. If this is not desirable, restarting of a job can be forbidden with qsub -r no.

Selecting queue to use

If a job needs to be sent to a specific queue, this can be done with qsub -q <queue>.
For example, submit a job to be run on the outolintu cluster:
qsub -q outolintu ./testi.sh
Multiple queues can also be specified, and one of them is used:
qsub -q TC407,TC217 ./testi.sh
Available queues are:

Queues which use computers in the classrooms may be suspended during the daytime if the classrooms are in use. New jobs can be submitted at any time, even if the queues are suspended.

Array job

If the user wants to run the same program multiple times with different inputs, it can be submitted as an array job. This way one job consists of multiple tasks, which can be distributed to different machines. A script skripti.sh which is run as an array job can look like this:
./prog input_${SGE_TASK_ID}.txt >> output_${SGE_TASK_ID}.txt
The desired number of tasks is given with the -t parameter:
qsub -t 1-20 ./skripti.sh
The program prog is now run as 20 different tasks with inputs input_1.txt to input_20.txt, and each task directs its output to a different file.
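When skripti.sh is tried out by hand before it is submitted, SGE_TASK_ID is not set by Grid Engine. One way to handle this is sketched below; the function name task_files is made up, and both the fallback to task 1 and the treatment of the literal value "undefined" (which non-array jobs are assumed to see) are assumptions of this example.

```shell
# Print the input and output file names for one task, one per line.
# Usage: task_files [task-id]
# Without an argument the id is taken from $SGE_TASK_ID, or 1 if that is unset.
task_files() {
    id=${1:-${SGE_TASK_ID:-1}}
    # Assumption: outside an array job the variable may hold "undefined".
    if [ "$id" = "undefined" ]; then
        id=1
    fi
    printf '%s\n' "input_${id}.txt" "output_${id}.txt"
}
```

Inside the job script the two names could then be read from task_files and passed to ./prog in the same way as above.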

Status of the jobs

The status of queued and running jobs can be checked with the qstat command:
mustaharakka:~$ qstat -u aoksavuo
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    156 0.55500 testi.sh   aoksavuo     r     04/02/2007 16:06:14 tltmatlab@tltmatlab2.cs.tut.fi     1
    157 0.55500 testi.sh   aoksavuo     r     04/02/2007 16:06:14 tltmatlab@tltmatlab2.cs.tut.fi     1
    158 0.00000 testi.sh   aoksavuo     qw    04/02/2007 16:14:45                                    1
mustaharakka:~$
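With many jobs in the listing, a per-state summary can be easier to read than the full table. The following sketch counts jobs by the state column; the function name count_job_states is hypothetical, and it assumes that the listing starts with two header lines and that job names contain no spaces.

```shell
# Read a qstat listing on stdin and print "<state> <count>" for each job state.
count_job_states() {
    # Skip the two header lines; the state is the fifth whitespace-separated field.
    awk 'NR > 2 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}
```

Used as qstat -u aoksavuo | count_job_states, the listing above would give "qw 1" and "r 2".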
qstat doesn't show jobs which have already finished, but information about those can be seen with qacct.
mustaharakka:~$ qacct -j 155
==============================================================
qname        tltmatlab
hostname     tltmatlab2.cs.tut.fi
group        aoksavuo
owner        aoksavuo
project      NONE
department   defaultdepartment
jobname      testi.sh
jobnumber    155
taskid       undefined
account      sge
priority     0
qsub_time    Mon Apr  2 15:37:45 2007
start_time   Mon Apr  2 15:37:59 2007
end_time     Mon Apr  2 15:47:59 2007
.
.
maxvmem      8.273M
mustaharakka:~$

Removing of the jobs

A job submitted with qsub can be removed from the queue with the qdel command. The job must be identified by its job-ID, which can be obtained with the qstat command.
mustaharakka:~$ qdel 158
aoksavuo has registered the job 158 for deletion
mustaharakka:~$

Default settings

A user can store default values for qsub in a .sge_request file in their home directory. This way there is no need to give these parameters on the command line every time a job is submitted. Parameters given on the command line still override the ones in the .sge_request file.

Example of a .sge_request file which makes qsub use only Linux machines with the lx24-x86 architecture:

mustaharakka:~$ cat .sge_request
-l arch=lx24-x86
mustaharakka:~$
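Several defaults can be combined in the same file. The sketch below writes such a file to a path given as an argument, so that it can be inspected before it is copied to ~/.sge_request. The function name write_sge_request is hypothetical, and placing one option per line is an assumption about the file format; the options themselves are the ones described in these instructions.

```shell
# Write a set of default qsub parameters to the given file.
# Usage: write_sge_request <file>
write_sge_request() {
    cat > "$1" <<'EOF'
-l arch=lx24-x86,h_rt=96:0:0
-r no
EOF
}
```

Note that requirements stored in .sge_request also apply to qlogin, so overly strict defaults can prevent interactive jobs from being scheduled.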

Interactive job

In addition to batch jobs, it is possible to run interactive jobs on the outolintu and tltmatlab nodes. An interactive job is started with the qlogin command, which opens an ssh connection to the least loaded node. The job is considered done when the connection is closed. A session opened with qlogin should not be used to run screen or similar programs, because after the connection is closed there is no guarantee that the next interactive connection will be to the same machine. Because interactive jobs use SSH to connect to the execution host, using SSH keys and an SSH agent is recommended to make this more convenient.
mustaharakka:~$ qlogin
Your job 274 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 274 has been successfully scheduled.
Establishing /share/sge_root/bin/qlogin_command session to host tltmatlab2.cs.tut.fi ...
Last login: Fri Apr 20 12:11:59 2007 from jalokotka.cs.tut.fi
-----------------------------------------------------------------------
Welcome to Lintula! (TUT/CS)              http://www.cs.tut.fi/lintula/
-----------------------------------------------------------------------

* You can change your password with passwd -command on
  Lintula Solaris machines (for example kaarne.cs.tut.fi).

* You can check your disk usage with vquota -command.

-----------------------------------------------------------------------
tltmatlab2:~$

It is also possible to give resource requirements to qlogin with the -l parameter.
For example, to require that the machine running the interactive job has at least 2 GB of free RAM:

qlogin -l mf=2G

If the user doesn't have permission to use the outolintu or tltmatlab nodes, or has requirements in .sge_request that those machines can't fulfill, qlogin gives an error message like this:

mustaharakka:~$ qlogin
Your job 278 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...timeout (3 s) expired while waiting on socket fd 5


Your "qlogin" request could not be scheduled, try again later.
mustaharakka:~$

25.05.2011