Xgrid - Beyond the hype

Introduction

There are websites which explain what Xgrid is and give you a bit of hype.

There are websites which tell you how to set Xgrid up from a systems administration perspective.

There are websites that get you started with some simple jobs.

What there isn't is a website that allows you to get beyond these initial tasks and get some real work done. Hopefully this website fills the gap. To make my own biases explicit, I am examining Xgrid from the perspective of a scientific programmer.

What is Xgrid

Xgrid is a technology that makes building a compute cluster easy. It was introduced by Apple with OS X 10.4 (with technology previews during 10.3). Xgrid is not a way of making your word processor or game run faster. It simply connects computers together and provides a queue to which jobs can be submitted, so that developers can easily farm out jobs that can be done in parallel. It should be noted that Xgrid does not provide interprocess communication (IPC). If the parallel jobs need to communicate then the developer is responsible for building some method of IPC into the application.

For the remainder of this article I'm going to assume that either your friendly sys admin has set up an Xgrid for you or you have gone to the websites above, followed the instructions, and got an Xgrid up and running.

Review of the simple Xgrid jobs

Apple has provided a command line tool xgrid to submit jobs to an Xgrid. The easiest job you can run to test the grid is provided on the xgrid man page.

xgrid -p somepassword -h computername.domainname -job run /usr/bin/cal 6 2006

where computername.domainname is the name of the computer that the Xgrid controller is running on and somepassword is the password that the controller is expecting. The expected result is, of course,

     June 2006
 S  M Tu  W Th  F  S
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30

If something goes wrong here, check that you are using the Xgrid password and not your own user password :). You can leave out the password and controller name if you set them as environment variables. I'm going to assume that you have set these environment variables for the rest of this document. This example assumes that you are running the bash shell. If you don't know what shell you are running then you are running the bash shell.

export XGRID_CONTROLLER_HOSTNAME=controllername.domainname
export XGRID_CONTROLLER_PASSWORD=somepassword
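
Since the rest of this document relies on these two variables, a script can check them before it talks to the grid. The following is only a sketch: check_xgrid_env is a name I made up and is not part of xgrid.

```shell
#!/bin/bash
# check_xgrid_env is a hypothetical pre-flight check, not part of xgrid.
# It reports which of the two variables above are missing or empty and
# returns non-zero if any of them is.
check_xgrid_env() {
    missing=0
    for name in XGRID_CONTROLLER_HOSTNAME XGRID_CONTROLLER_PASSWORD; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It could be used as a guard, for example check_xgrid_env && xgrid -job run /bin/sleep 20.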

You can see what jobs are running on the Xgrid by using the Xgrid Admin tool which Apple provides in the Server Admin Tools. If you use sleep as your job it will run long enough for you to be able to see it in the Xgrid Admin tool.

xgrid -job run /bin/sleep 20

A small touch of realism

Hopefully everything has gone well and you are quite excited by the potential that Xgrid offers. Let's now try to do something more realistic than running commands that you can assume are installed on every Xgrid agent machine. Suppose you have created your own program that you want to run on the grid and furthermore the program needs an input file. How do you get that going? We can simulate having your own program by copying cat to the local directory

cp /bin/cat cat2

and create an input file by

echo "hello world" > file.txt

Run our new cat2 command on Xgrid by

xgrid -job run cat2 file.txt

which gives the expected output

hello world

It should be noted here that Xgrid has automagically shipped file.txt off to the agent that needed it. We were not required to tell Xgrid what input files were needed for the job. All that was specified was the command to run and the command line arguments to give to the command.

Running jobs asynchronously

Using the run command after -job means that the job runs synchronously in your terminal. This is not what you really want if you are going to run a lot of jobs. Luckily, running a job asynchronously is simple.

xgrid -job submit cat2 file.txt

which gives the job identifier as its output, for example,

{jobIdentifier = 842; }

To collect the results of this job when you are ready you use

xgrid -job results -id 842

which prints the results onto the screen. If you want the standard output and standard error to go into files then

xgrid -job results -id 842 -so stdoutfilename.txt -se stderrfilename.txt

does the job.
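
If you want to script the submit and collect steps, the identifier has to be pulled out of the submit output first. Here is a minimal sketch, assuming the output always has the {jobIdentifier = N; } form shown above. The function name parse_job_id is my own invention, not something xgrid provides.

```shell
#!/bin/bash
# parse_job_id is a hypothetical helper (not part of xgrid). It assumes the
# submit output looks like "{jobIdentifier = 842; }" as shown above.
parse_job_id() {
    # Keep only the digits that follow "jobIdentifier = ".
    printf '%s\n' "$1" | sed -n 's/.*jobIdentifier = \([0-9][0-9]*\).*/\1/p'
}

# Example with the output shown above; prints 842.
parse_job_id '{jobIdentifier = 842; }'
```

With a live grid this might be used as id=$(parse_job_id "$(xgrid -job submit cat2 file.txt)") followed later by xgrid -job results -id "$id".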

First Gotcha

Try running the cal example again but this time leave off the explicit path. That is, run

xgrid -job submit cal 6 2006

which gives the expected output

{jobIdentifier = 943; }

If you check that the job is actually finished and not just waiting to execute with either the Xgrid Admin tool or

xgrid -job attributes -id 943

which produces

{
    jobAttributes = {
        activeCPUPower = 0; 
        applicationIdentifier = "com.apple.xgrid.cli"; 
        dateNow = 2006-06-08 17:34:21 +1000; 
        dateStarted = 2006-06-08 17:33:39 +1000;  
        dateStopped = 2006-06-08 17:33:40 +1000; 
        dateSubmitted = 2006-06-08 17:33:38 +1000; 
        jobStatus = Finished; 
        name = cal; 
        percentDone = 100; 
        taskCount = 1;  
        undoneTaskCount = 0; 
    }; 
}

we can see that the job is actually finished. Now try to retrieve the results

xgrid -job results -id 943

and you get absolutely nothing. There is no indication anywhere that the job failed to execute properly. Just a null output. The lesson to learn here is that if you are writing your own code to run on Xgrid then make sure you output something so that you know a job has truly been successful.
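
In a script, one way to guard against this silent failure is to treat a job that is Finished but returned nothing as suspect. The sketch below works on text already captured from the attributes and results commands. It assumes the jobStatus line has the form shown above, and the helper names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers, not part of xgrid. They work on text that has already
# been captured from "xgrid -job attributes" and "xgrid -job results".

# Print the value of the jobStatus line, e.g. "Finished".
job_status() {
    printf '%s\n' "$1" | sed -n 's/.*jobStatus = \([A-Za-z]*\);.*/\1/p'
}

# Succeed only if the job is Finished AND its results are non-empty.
# $1 is the attributes text, $2 is the results text.
job_really_succeeded() {
    [ "$(job_status "$1")" = "Finished" ] && [ -n "$2" ]
}

# Example using the situation above: Finished, but nothing came back.
if job_really_succeeded 'jobStatus = Finished; ' ''; then
    echo "job ok"
else
    echo "job finished with no output: treat as failed"
fi
```

This only works if every job of your own is written to print at least something on success, which is the lesson above.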

It's all a matter of timing

So far we have only submitted one job at a time to the grid. This is obviously unrealistic: if you only have a single job you may as well execute it on your local box. Also, one of the things we need to know in order to decide when it is appropriate to use Xgrid is how fast we can submit jobs to the controller. That is, what sort of overhead is there in just farming out the work? To test the submission speed I used the following script.

#!/bin/bash
# multistartasync.sh
# The command line argument $1 is the total number of jobs to submit
ii=0
while [ "$ii" -lt "$1" ]
do
    # Submit in the background so that the submissions overlap
    xgrid -job submit /bin/sleep 35 &
    ii=$((ii + 1))
done

When this script is used to submit 30 jobs, the typical difference in submission time, as reported by the Xgrid Admin tool, from the first job to the last job is 10 seconds. (Note that if you leave the & off the xgrid command and force the submission to be serial then the time is approximately 15 seconds.) This overhead can also be measured by the time that it takes to simply delete the finished jobs from the queue. The following script was used to delete the jobs after they had finished.

#!/bin/bash
# xgriddelete.sh
# delete all xgrid jobs between $1 and $2 inclusive

ii=$1
iiEnd=$((1 + $2))

while [ "$ii" -lt "$iiEnd" ]
do
    # Delete in the background so that the requests overlap
    xgrid -job delete -id "$ii" &
    ii=$((ii + 1))
done

This script also took 10 seconds on a typical run. We can conclude that the overhead of simply talking to the controller is about 1/3 of a second per job.
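
The arithmetic behind that figure, and what it implies for larger runs, can be written out as a small script. It uses only the measurement quoted above (10 seconds for 30 jobs) and assumes the cost per job stays constant as the number of jobs grows, which I have not measured. The function name is illustrative.

```shell
#!/bin/bash
# submit_overhead TOTAL_SECS JOBS NEW_JOBS
# Scale a measured total submission time to a different number of jobs,
# assuming the per-job cost stays constant (an assumption, not a measurement).
submit_overhead() {
    awk -v secs="$1" -v jobs="$2" -v n="$3" 'BEGIN {
        per_job = secs / jobs
        printf "%.2f s per job, %.0f s for %d jobs\n", per_job, per_job * n, n
    }'
}

submit_overhead 10 30 100   # 0.33 s per job, 33 s for 100 jobs
submit_overhead 10 30 200   # 0.33 s per job, 67 s for 200 jobs
```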

If you get the following error

./multistartasync.sh: fork: Resource temporarily unavailable

it is due to the maximum number of user processes that OS X will allow you to have running. By default an individual user can have 100 processes. This limit can be raised but I will now argue against this approach. I intend to use Xgrid to process one or two hundred jobs at a time. The timing information presented means that it will take about 30 seconds (for one hundred jobs) or one minute (for two hundred) simply to submit the jobs. If the jobs were extremely time consuming (say an hour) then this extra overhead would be negligible. Unfortunately a fair number of my jobs can be run in under 30 seconds. It would be nice if there were a way of reducing the overhead. Fortunately there is.

Batch Processing

The idea behind Xgrid batch processing is that a single job can have many tasks. This means that I could submit one hundred tasks in one job submission, which would take only a fraction of the time needed to submit each task as an individual job. The job and all its tasks can be specified in either a plist or an XML version of a plist. The xgrid man page has a simple example and a complicated example of the job specification. The one I used to submit 3 tasks is as follows

<?xml version="1.0"?>
<plist version="1.0">
  <array>
    <dict>
      <key>name</key>
      <string>MultiJob</string>
      <key>taskSpecifications</key> 
      <dict>
        <key>0</key>
        <dict>
          <key>command</key>
          <string>/usr/bin/uname</string>
          <key>arguments</key>
          <array>
            <string>-a</string>
          </array>
        </dict>
        <key>1</key>
        <dict>
          <key>command</key>
          <string>/usr/bin/uname</string>
          <key>arguments</key>
          <array>
            <string>-a</string>
          </array>
        </dict>
        <key>2</key>
        <dict>
          <key>command</key>
          <string>/usr/bin/uname</string>
          <key>arguments</key>
          <array> 
            <string>-a</string>
          </array>
        </dict>
      </dict>
    </dict>
  </array>
</plist>

Save this file to job3.xml and submit it to Xgrid by

xgrid -job batch job3.xml

The submission time was typically about 0.6 seconds. When 100 tasks were submitted in the job, the submission time typically increased to 1.5 seconds. This is a substantial improvement over the 30 seconds needed to submit the same work as individual jobs, and it doesn't require messing about with the maximum number of allowed processes.
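
Writing a 100-task file by hand is tedious, so the specification is better generated. The sketch below prints a file with the same structure as job3.xml for any number of identical tasks. The function name make_batch is my own, and it assumes every task runs /usr/bin/uname -a as in the example above.

```shell
#!/bin/bash
# make_batch N: print a batch specification with N tasks numbered 0..N-1,
# following the layout of job3.xml above. Illustrative helper, not part of xgrid.
make_batch() {
    n="$1"
    i=0
    cat <<'EOF'
<?xml version="1.0"?>
<plist version="1.0">
  <array>
    <dict>
      <key>name</key>
      <string>MultiJob</string>
      <key>taskSpecifications</key>
      <dict>
EOF
    # One <key>/<dict> pair per task; the key is the task number.
    while [ "$i" -lt "$n" ]; do
        cat <<EOF
        <key>$i</key>
        <dict>
          <key>command</key>
          <string>/usr/bin/uname</string>
          <key>arguments</key>
          <array>
            <string>-a</string>
          </array>
        </dict>
EOF
        i=$((i + 1))
    done
    cat <<'EOF'
      </dict>
    </dict>
  </array>
</plist>
EOF
}

# Count the tasks in a generated 3-task specification; prints 3.
make_batch 3 | grep -c '<key>command</key>'
```

Redirect the output to a file, for example make_batch 100 > job100.xml, and submit it with xgrid -job batch job100.xml.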

Stepping up the complexity

Really using Xgrid

Gotchas

If you are developing on a G5 and there are G4s in the grid, make sure that the -mtune and -mcpu compiler options and 64-bit integer math are not used. These are not the default settings, but if you have been playing around with the settings on a project to get good performance on your desktop G5 then you must remember to undo them.

Xgrid Study Group

Monday, July 2, 2007

Batch deleting xgrid jobs, and more