Batch-deleting Xgrid jobs and more
http://mind.qbi.uq.edu.au/xgrid/index.html
Xgrid - Beyond the hype
Introduction
There are websites which explain what Xgrid is and give you a bit of hype. There are websites which tell you how to set Xgrid up from a systems administration perspective. There are websites that get you started with some simple jobs. What there isn't is a website that allows you to get beyond these initial tasks and get some real work done. Hopefully this website fills the gap. To make my own biases explicit, I am examining Xgrid from the perspective of a scientific programmer.
What is Xgrid
Xgrid is a technology to make building a compute cluster easy. It was introduced by Apple with OS X 10.4 (with technology previews during 10.3). Xgrid is not a way of making your word processor or game run faster. Xgrid is simply a way of connecting computers together, and it provides a queue to which jobs can be submitted. It allows developers to easily farm out jobs that can be done in parallel. It should be noted that Xgrid does not provide interprocess communication (IPC). If the parallel jobs need to communicate then the developer is responsible for including some method of IPC in their application. For the remainder of this article I'm going to assume that either your friendly sys admin has set up an Xgrid for you or you have gone to the websites above, followed the instructions, and got an Xgrid up and running.
Review of the simple Xgrid jobs
Apple has provided a command line tool, xgrid, to submit jobs to an Xgrid. The easiest job you can run to test the grid is provided on the xgrid man page.

xgrid -p somepassword -h computername.domainname -job run /usr/bin/cal 6 2006

where computername.domainname is the name of the computer that the Xgrid controller is running on and somepassword is the password that the controller is expecting. The expected result is, of course,

     June 2006
 S  M Tu  W Th  F  S
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30

If something goes wrong here, check that you are using the Xgrid password and not your own user password :). You can leave out the password and controller name if you set them as environment variables. I'm going to assume that you have set these environment variables for the rest of this document. This example assumes that you are running the bash shell. If you don't know what shell you are running then you are running the bash shell.

export XGRID_CONTROLLER_HOSTNAME=controllername.domainname
export XGRID_CONTROLLER_PASSWORD=somepassword

You can see what jobs are running on the Xgrid by using the Xgrid Admin tool which Apple provides in the Server Admin Tools. If you use sleep as your job it will run long enough for you to be able to see it in the Xgrid Admin tool.

xgrid -job run /bin/sleep 20
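Because every later command relies on those two environment variables, a script can check them before it talks to the controller. The following is only a minimal sketch: the function name xgrid_env_ok is made up for illustration and is not part of the xgrid tool.

```shell
# Hypothetical helper (not part of xgrid): succeed only if both
# controller variables used in this article are set and non-empty.
xgrid_env_ok() {
    if [ -z "${XGRID_CONTROLLER_HOSTNAME:-}" ]; then
        echo "XGRID_CONTROLLER_HOSTNAME is not set" >&2
        return 1
    fi
    if [ -z "${XGRID_CONTROLLER_PASSWORD:-}" ]; then
        echo "XGRID_CONTROLLER_PASSWORD is not set" >&2
        return 1
    fi
    return 0
}

# Possible use at the top of a submission script:
#   xgrid_env_ok || exit 1
```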
A small touch of realism
Hopefully everything has gone well and you are quite excited by the potential that Xgrid offers. Let's now try to do something more realistic than running commands that you can assume are installed on every Xgrid agent machine. Suppose you have created your own program that you want to run on the grid and, furthermore, the program needs an input file. How do you get that going? We can simulate having your own program by copying cat to the local directory

cp /bin/cat cat2

and create an input file by

echo "hello world" > file.txt

Run our new cat2 command on Xgrid by

xgrid -job run cat2 file.txt

which gives the expected output

hello world

It should be noted here that Xgrid has automagically shipped file.txt off to the agent that needed it. We were not required to tell Xgrid what input files were needed for the job. All that was specified was the command to run and the command line arguments to give to the command.
Running jobs asynchronously
Using the run command after -job means that the job runs synchronously in your terminal. This is not what you really want if you are going to run a lot of jobs. Luckily, running a job asynchronously is simple.

xgrid -job submit cat2 file.txt

which gives the job identifier as its output, for example,

{jobIdentifier = 842; }

To collect the results of this job when you are ready you use

xgrid -job results -id 842

which prints the results onto the screen. If you want the standard output and standard error to go into files then

xgrid -job results -id 842 -so stdoutfilename.txt -se stderrfilename.txt

does the job.
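When jobs are submitted from a script, the identifier has to be pulled out of that output before the results can be fetched. Here is a minimal sketch that assumes the output has the form shown above; parse_job_id is a hypothetical helper name, not an xgrid feature.

```shell
# Hypothetical helper: read xgrid's submit output on standard input and
# print just the numeric id from a line like "{jobIdentifier = 842; }".
parse_job_id() {
    sed -n 's/.*jobIdentifier = \([0-9][0-9]*\);.*/\1/p'
}

# Possible use:
#   id=$(xgrid -job submit cat2 file.txt | parse_job_id)
#   xgrid -job results -id "$id"
```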
First Gotcha
Try running the cal example again but this time leave off the explicit path. That is, run

xgrid -job submit cal 6 2006

which gives the expected output

{jobIdentifier = 943; }

You can check that the job has actually finished, and is not just waiting to execute, with either the Xgrid Admin tool or

xgrid -job attributes -id 943

which produces

{
    jobAttributes = {
        activeCPUPower = 0;
        applicationIdentifier = "com.apple.xgrid.cli";
        dateNow = 2006-06-08 17:34:21 +1000;
        dateStarted = 2006-06-08 17:33:39 +1000;
        dateStopped = 2006-06-08 17:33:40 +1000;
        dateSubmitted = 2006-06-08 17:33:38 +1000;
        jobStatus = Finished;
        name = cal;
        percentDone = 100;
        taskCount = 1;
        undoneTaskCount = 0;
    };
}

From this we can see that the job is actually finished. Now try to retrieve the results

xgrid -job results -id 943

and you get absolutely nothing. There is no indication anywhere that the job failed to execute properly, just a null output. The lesson to learn here is that if you are writing your own code to run on Xgrid then make sure you output something so that you know a job has truly been successful.
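One way to act on that lesson is to have your program print a known marker as its last line and to test for it after collecting the results. The sketch below assumes such a convention; the marker text JOB_DONE and the function name results_ok are invented for illustration.

```shell
# Hypothetical convention: the job prints this marker as its final line.
SENTINEL="JOB_DONE"

# Succeed only if the results file named by $1 is non-empty and its
# last line is exactly the marker.
results_ok() {
    [ -s "$1" ] || return 1
    [ "$(tail -n 1 "$1")" = "$SENTINEL" ]
}

# Possible use:
#   xgrid -job results -id 943 -so out.txt
#   results_ok out.txt || echo "job 943 produced no valid output" >&2
```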
It's all a matter of timing
So far we have only submitted one job at a time to the grid. This is obviously unrealistic: if you only have a single job you may as well execute it on your local box. Also, one of the things we need to know in order to decide when it is appropriate to use Xgrid is how fast we can submit jobs to the controller. That is, what sort of overhead is there in just farming out the work? To test the submission speed I used the following script.

#!/bin/bash
# multistartasync.sh
# The command line argument $1 is the total number of jobs to submit
ii=0
while [ "$ii" -lt "$1" ]
do
    # Submit in the background so that the submissions overlap
    xgrid -job submit /bin/sleep 35 &
    ii=$((ii + 1))
done

When this script is used to submit 30 jobs, the typical difference in submission time from the first job to the last job, as reported by the Xgrid Admin tool, is 10 seconds. (Note that if you leave the & off the xgrid command and force the submission to be serial then the time is approximately 15 seconds.) This overhead can also be measured by the time that it takes to simply delete the finished jobs from the queue. The following script was used to delete the jobs after they had finished.

#!/bin/bash
# xgriddelete.sh
# delete all xgrid jobs between $1 and $2 inclusive
ii=$1
iiEnd=$((1 + $2))
while [ "$ii" -lt "$iiEnd" ]
do
    xgrid -job delete -id "$ii" &
    ii=$((ii + 1))
done

This script too took 10 seconds on a typical run. We can conclude that the overhead of simply talking to the controller is about 1/3 of a second per job.
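The per-job figure is just the measured total divided by the number of jobs, and the same number can be used to project the cost of a larger run. A small sketch of that arithmetic, using integer milliseconds so that it works in plain shell arithmetic; the function names are made up for illustration.

```shell
# per_job_ms TOTAL_SECONDS JOBS
# Average overhead per job in whole milliseconds.
per_job_ms() {
    echo $(( $1 * 1000 / $2 ))
}

# estimate_seconds PER_JOB_MS JOBS
# Projected total overhead in whole seconds (rounded down).
estimate_seconds() {
    echo $(( $1 * $2 / 1000 ))
}

# With the measurements quoted in this section (30 jobs in 10 seconds):
#   per_job_ms 10 30          prints 333
#   estimate_seconds 333 100  prints 33
```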
If you get the following error

./multistartasync.sh: fork: Resource temporarily unavailable

it is due to the maximum number of user processes that OS X will allow you to have running. By default an individual user can have 100 processes. This limit can be raised but I will now argue against this approach. I intend to use Xgrid to process one or two hundred jobs at a time. The timing information presented above means that it will take about 30 seconds for one hundred jobs, or about one minute for two hundred, simply to submit them. If the jobs were extremely time consuming (say an hour each) then this extra minute of overhead would be negligible. Unfortunately a fair number of my jobs can be run in under 30 seconds. It would be nice if there was a way of reducing the overhead. Fortunately there is.
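As an aside, the fork error on its own can be sidestepped without raising the process limit by pausing after each group of background submissions. This is a different technique from the one described next and it does nothing to reduce the per-job overhead. The sketch below is an assumption-laden variant of multistartasync.sh; submit_throttled and its argument order are invented for illustration.

```shell
# submit_throttled TOTAL MAX_BG COMMAND [ARG...]
# Run COMMAND TOTAL times in the background, never leaving more than
# MAX_BG of them running at once.
submit_throttled() {
    total=$1
    max_bg=$2
    shift 2
    i=0
    while [ "$i" -lt "$total" ]
    do
        "$@" &
        i=$((i + 1))
        # After each full group, wait for all of them to finish
        if [ $((i % max_bg)) -eq 0 ]; then
            wait
        fi
    done
    # Collect whatever is left from the final, partial group
    wait
}

# Possible use, keeping at most 20 submissions in flight:
#   submit_throttled 200 20 xgrid -job submit /bin/sleep 35
```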
Batch Processing
The idea behind Xgrid batch processing is that a single job can have many tasks. This means that I could submit one hundred tasks in one job submission, which takes only a fraction of the time needed to submit each task as an individual job. The job and all the tasks can be specified in either a plist or an XML version of a plist. The xgrid man page has a simple example and a complicated example of the job specification. The one I used to submit 3 tasks in a single job is as follows

<?xml version="1.0"?>
<plist version="1.0">
<array>
  <dict>
    <key>name</key>
    <string>MultiJob</string>
    <key>taskSpecifications</key>
    <dict>
      <key>0</key>
      <dict>
        <key>command</key>
        <string>/usr/bin/uname</string>
        <key>arguments</key>
        <array>
          <string>-a</string>
        </array>
      </dict>
      <key>1</key>
      <dict>
        <key>command</key>
        <string>/usr/bin/uname</string>
        <key>arguments</key>
        <array>
          <string>-a</string>
        </array>
      </dict>
      <key>2</key>
      <dict>
        <key>command</key>
        <string>/usr/bin/uname</string>
        <key>arguments</key>
        <array>
          <string>-a</string>
        </array>
      </dict>
    </dict>
  </dict>
</array>
</plist>

Save this file to job3.xml and submit it to Xgrid by

xgrid -job batch job3.xml

The submission time was typically about 0.6 seconds. When 100 tasks were submitted in the job the time for submission typically increased to 1.5 seconds. This is a substantial improvement over the 30 second alternative above and doesn't require messing about with the maximum number of allowed processes.
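Writing a hundred task entries by hand is tedious, so the specification is better generated. The following sketch prints a file with the same structure as job3.xml for any number of identical tasks. The function name make_batch_xml is invented for illustration, and the sketch assumes the command and its arguments contain no characters that need XML escaping (such as < or &).

```shell
# make_batch_xml N COMMAND [ARG...]
# Print a batch specification with N identical tasks, keyed 0 to N-1,
# in the same layout as job3.xml. No XML escaping is performed.
make_batch_xml() {
    count=$1
    cmd=$2
    shift 2
    echo '<?xml version="1.0"?>'
    echo '<plist version="1.0">'
    echo '<array>'
    echo '<dict>'
    echo '<key>name</key>'
    echo '<string>MultiJob</string>'
    echo '<key>taskSpecifications</key>'
    echo '<dict>'
    i=0
    while [ "$i" -lt "$count" ]
    do
        echo "<key>$i</key>"
        echo '<dict>'
        echo '<key>command</key>'
        echo "<string>$cmd</string>"
        echo '<key>arguments</key>'
        echo '<array>'
        for arg in "$@"
        do
            echo "<string>$arg</string>"
        done
        echo '</array>'
        echo '</dict>'
        i=$((i + 1))
    done
    echo '</dict>'
    echo '</dict>'
    echo '</array>'
    echo '</plist>'
}

# Possible use (job100.xml is just an example file name):
#   make_batch_xml 100 /usr/bin/uname -a > job100.xml
#   xgrid -job batch job100.xml
```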