Monday, July 2, 2007

Batch-deleting Xgrid jobs, and other notes

Xgrid - Beyond the hype

Introduction

There are websites which explain what Xgrid is and give you a bit of hype.

There are websites which tell you how to set Xgrid up from a systems administration perspective.

There are websites that get you started with some simple jobs.

What there isn't is a website that allows you to get beyond these initial tasks and get some real work done. Hopefully this website fills the gap. To make my own biases explicit, I am examining Xgrid from the perspective of a scientific programmer.

What is Xgrid

Xgrid is a technology to make building a compute cluster easy. It was introduced by Apple with OS X 10.4 (with technology previews during 10.3). Xgrid is not a way of making your word processor or game run faster. Xgrid simply connects computers together and provides a queue to which jobs can be submitted. It allows developers to easily farm out jobs that can be done in parallel. It should be noted that Xgrid does not provide interprocess communication (IPC). If the parallel jobs need to communicate then the developer is responsible for including some method of IPC in their application. For the remainder of this article I'm going to assume that either your friendly sys admin has set up an Xgrid for you or you have gone to the websites above, followed the instructions, and got an Xgrid up and running.

Review of the simple Xgrid jobs

Apple has provided a command line tool xgrid to submit jobs to an Xgrid. The easiest job you can run to test the grid is provided on the xgrid man page.

xgrid -p somepassword -h computername.domainname -job run /usr/bin/cal 6 2006

where computername.domainname is the name of the computer that the Xgrid controller is running on and somepassword is the password that the controller is expecting. The expected result is, of course,

     June 2006
 S  M Tu  W Th  F  S
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30

If something goes wrong here, check that you are using the Xgrid password and not your own user password :). You can leave out the password and controller name if you set them as environment variables. I'm going to assume that you have set these environment variables for the rest of this document. This example assumes that you are running the bash shell. If you don't know what shell you are running then you are running the bash shell.

export XGRID_CONTROLLER_HOSTNAME=controllername.domainname
export XGRID_CONTROLLER_PASSWORD=somepassword

You can see what jobs are running on the Xgrid by using the Xgrid Admin tool which Apple provides in the Server Admin Tools. If you use sleep as your job it will run long enough for you to be able to see it in the Xgrid Admin tool.

xgrid -job run /bin/sleep 20

A small touch of realism

Hopefully everything has gone well and you are quite excited by the potential that Xgrid offers. Let's now try to do something more realistic than running commands that you can assume are installed on every Xgrid agent machine. Suppose you have created your own program that you want to run on the grid and furthermore the program needs an input file. How do you get that going? We can simulate having your own program by copying cat to the local directory

cp /bin/cat cat2

and create an input file by

echo "hello world" > file.txt

Run our new cat2 command on Xgrid by

xgrid -job run cat2 file.txt

which gives the expected output

hello world

It should be noted here that Xgrid has automagically shipped file.txt off to the agent that needed it. We were not required to tell Xgrid what input files were needed for the job. All that was specified was the command to run and the command line arguments to give to the command.

Running jobs asynchronously

Using the run command after -job means that the job runs synchronously in your terminal. This is not what you really want if you are going to run a lot of jobs. Luckily, running a job asynchronously is simple.

xgrid -job submit cat2 file.txt

which gives the job identifier as its output, for example,

{jobIdentifier = 842; }

To collect the results of this job when you are ready you use

xgrid -job results -id 842

which prints the results onto the screen. If you want the standard output and standard error to go into files then

xgrid -job results -id 842 -so stdoutfilename.txt -se stderrfilename.txt

does the job.
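In a script, the identifier printed by submit has to be pulled out of that text before it can be passed to results. The following is a minimal sketch, assuming the output always has the form shown above; extract_job_id is a made-up helper name, not part of xgrid.

```shell
#!/bin/bash
# Sketch: pull the numeric ID out of the text that "xgrid -job submit" prints.
# extract_job_id is a hypothetical helper name, not part of xgrid.
extract_job_id() {
    # Reads text such as "{jobIdentifier = 842; }" on stdin and prints "842".
    sed -n 's/.*jobIdentifier = \([0-9][0-9]*\);.*/\1/p'
}

# Demonstration on the sample output from the text (no controller needed).
echo '{jobIdentifier = 842; }' | extract_job_id
```

With a working grid, something like id=$(xgrid -job submit cat2 file.txt | extract_job_id) followed by xgrid -job results -id "$id" should then tie the two steps together.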

First Gotcha

Try running the cal example again but this time leave off the explicit path. That is, run

xgrid -job submit cal 6 2006

which gives the expected output

{jobIdentifier = 943; }

If you check that the job is actually finished and not just waiting to execute with either the Xgrid Admin tool or

xgrid -job attributes -id 943

which produces

{
    jobAttributes = {
        activeCPUPower = 0; 
        applicationIdentifier = "com.apple.xgrid.cli"; 
        dateNow = 2006-06-08 17:34:21 +1000; 
        dateStarted = 2006-06-08 17:33:39 +1000;  
        dateStopped = 2006-06-08 17:33:40 +1000; 
        dateSubmitted = 2006-06-08 17:33:38 +1000; 
        jobStatus = Finished; 
        name = cal; 
        percentDone = 100; 
        taskCount = 1;  
        undoneTaskCount = 0; 
    }; 
}

we can see that the job is actually finished. Now try to retrieve the results

xgrid -job results -id 943

and you get absolutely nothing. There is no indication anywhere that the job failed to execute properly. Just a null output. The lesson to learn here is that if you are writing your own code to run on Xgrid then make sure you output something so that you know a job has truly been successful.
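One way to follow that advice without touching the program itself is to wrap it in a small script that always prints a marker line. This is only a sketch: run_checked and the two marker strings are invented here for illustration, and xgrid itself knows nothing about them.

```shell
#!/bin/bash
# Sketch: run a command and always print a marker line afterwards, so that
# empty results can be told apart from a task that never ran.
# run_checked and the XGRID_TASK_* strings are hypothetical names.
run_checked() {
    status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        echo "XGRID_TASK_OK"
    else
        echo "XGRID_TASK_FAILED status=$status"
    fi
    return "$status"
}

# Demonstration with a command that exists everywhere.
run_checked echo "hello world"
```

If the results of a job contain neither marker, the wrapper itself never ran, which is the situation described above.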

It's all a matter of timing

So far we have only submitted one job at a time to the grid. This is obviously unrealistic: if you only have a single job you may as well execute it on your local box. Also, one of the parameters we need in order to know when it is appropriate to use Xgrid is how fast we can submit jobs to the controller. That is, what sort of overhead is there in just farming out the work? To test the submission speed I used the following script.

#!/bin/bash
# multistartasync.sh
# The command line argument $1 is the total number of jobs to submit
ii=0
while [ "$ii" -lt "$1" ]
do
    # The trailing & submits each job in the background without waiting
    xgrid -job submit /bin/sleep 35 &
    ii=$((ii + 1))
done

When this script is used to submit 30 jobs, the typical difference in submission time, as reported by the Xgrid Admin tool, from the first job to the last job is 10 seconds. (Note that if you leave the & off the xgrid command and force the submission to be serial then the time is approximately 15 seconds.) This overhead time can also be measured by the time that it takes to simply delete the finished jobs from the queue. The following script was used to delete the jobs after they had finished.

#!/bin/bash
# xgriddelete.sh
# Delete all xgrid jobs with identifiers between $1 and $2 inclusive

ii=$1
iiEnd=$(($2 + 1))

while [ "$ii" -lt "$iiEnd" ]
do
    xgrid -job delete -id "$ii" &
    ii=$((ii + 1))
done

This script too took 10 seconds on a typical run. We can conclude that the overhead of simply talking to the controller is about 1/3 of a second per job.
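The arithmetic behind that figure, and what it implies for larger runs, can be checked with a few lines of shell. This is a rough sketch that assumes the overhead scales linearly with the number of jobs; estimate_overhead is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: per-job overhead from a measurement, projected onto a planned run.
# estimate_overhead is a hypothetical helper; it assumes linear scaling.
# usage: estimate_overhead MEASURED_SECONDS MEASURED_JOBS PLANNED_JOBS
estimate_overhead() {
    awk -v secs="$1" -v jobs="$2" -v planned="$3" 'BEGIN {
        per_job = secs / jobs
        printf "%.2f s per job, about %.0f s for %d jobs\n", per_job, per_job * planned, planned
    }'
}

# 30 jobs took 10 seconds; project that onto 100 and 200 jobs.
estimate_overhead 10 30 100
estimate_overhead 10 30 200
```

With the measurements above this prints 0.33 s per job, and roughly 33 s and 67 s for 100 and 200 jobs respectively.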

If you get the following error

./multistartasync.sh: fork: Resource temporarily unavailable

it is due to the maximum number of user processes that OS X will allow you to have running. By default an individual user can have 100 processes. This limit can be raised but I will now argue against this approach. I intend to use Xgrid to process one or two hundred jobs at a time. The timing information presented means that it will take about 30 seconds (or one minute) to simply submit the jobs. If the jobs were extremely time consuming (say an hour) then this extra minute overhead would be negligible. Unfortunately a fair number of my jobs can be run in under 30 seconds. It would be nice if there was a way of reducing the overhead. Fortunately there is.

Batch Processing

The idea behind Xgrid batch processing is that a single job can have many tasks. This means that I could submit one hundred tasks in one job submission, which would take only a fraction of the time needed to submit each task as an individual job. The job and all the tasks can be specified in either a plist or an XML version of a plist. The xgrid man page has a simple example and a complicated example of the job specification. The one I used to submit 3 tasks in a single job is as follows

<?xml version="1.0"?>
<plist version="1.0">
  <array>
    <dict>
      <key>name</key>
      <string>MultiJob</string>
      <key>taskSpecifications</key> 
      <dict>
        <key>0</key>
        <dict>
          <key>command</key>
          <string>/usr/bin/uname</string>
          <key>arguments</key>
          <array>
            <string>-a</string>
          </array>
        </dict>
        <key>1</key>
        <dict>
          <key>command</key> 
          <string>/usr/bin/uname</string>
          <key>arguments</key>
          <array>
            <string>-a</string>
          </array>
        </dict> 
        <key>2</key>
        <dict>
          <key>command</key>
          <string>/usr/bin/uname</string>
          <key>arguments</key>
          <array> 
            <string>-a</string>
          </array>
        </dict>
      </dict>
    </dict>
  </array>
</plist>

Save this file to job3.xml and submit it to Xgrid by

xgrid -job batch job3.xml

The submission time was typically about 0.6 seconds. When 100 tasks were submitted in the job the time for submission typically increased to 1.5 seconds. This is a substantial improvement over the above 30 second alternative and doesn't require messing about with the maximum number of allowed processes.
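Writing a hundred task entries by hand is tedious, so the specification is better generated. The sketch below prints a file in the same layout as job3.xml with N identical tasks; make_batch_plist is a hypothetical helper, the job name MultiJob is simply reused from the example, and the arguments are not XML-escaped.

```shell
#!/bin/bash
# Sketch: generate a batch specification with N identical tasks, keyed 0..N-1,
# in the same layout as job3.xml. make_batch_plist is a hypothetical helper.
# Arguments are not XML-escaped, so this assumes they contain no <, > or &.
# usage: make_batch_plist N COMMAND [ARG...]
make_batch_plist() {
    count=$1
    cmd=$2
    shift 2
    printf '%s\n' '<?xml version="1.0"?>' '<plist version="1.0">' \
        '  <array>' '    <dict>' \
        '      <key>name</key>' '      <string>MultiJob</string>' \
        '      <key>taskSpecifications</key>' '      <dict>'
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '        <key>%s</key>\n' "$i"
        printf '%s\n' '        <dict>' '          <key>command</key>'
        printf '          <string>%s</string>\n' "$cmd"
        printf '%s\n' '          <key>arguments</key>' '          <array>'
        for arg in "$@"; do
            printf '            <string>%s</string>\n' "$arg"
        done
        printf '%s\n' '          </array>' '        </dict>'
        i=$((i + 1))
    done
    printf '%s\n' '      </dict>' '    </dict>' '  </array>' '</plist>'
}

# Demonstration: count the tasks in a generated 3-task specification.
make_batch_plist 3 /usr/bin/uname -a | grep -c '<key>command</key>'
```

Redirecting the output to a file, for example make_batch_plist 100 /usr/bin/uname -a > job100.xml, gives something that can be passed to xgrid -job batch in the same way as job3.xml.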

Stepping up the complexity

Really using Xgrid

Gotchas

If you are developing on a G5 and there are G4s in the grid, make sure that G5-specific mtune and mcpu settings and 64-bit integer math are not used. These are not the default settings, but if you have been playing around with the settings on a project to get good performance on your desktop G5 then you must remember to undo them.

Friday, June 15, 2007

Running batch jobs

The batch job specification is based on a format used all over the place in Mac OS X, called the 'plist' format.

The format is very picky about commas, semicolons and brackets, but if you are careful, you should be able to easily modify it and expand it. The example specification (not reproduced in this post) is a batch file for one job with just 2 tasks. For both tasks, this will result in the execution of the command

/Users/Shared/fasta-tutorial/fasta

with the following 3 arguments for the first task (and similar for the second task)

-q
/Users/Shared/fasta-tutorial/magic-worm-gene.seq
/Users/Shared/fasta-tutorial/chromosomeX.fa

Submitting the batch format

Save the specification file to the Desktop as fasta-job.txt. Then,

xgrid -h localhost -job batch ~/Desktop/fasta-job.txt

To retrieve the results:

xgrid -h localhost -job results -id 6752 -out ~/result_files/

A very useful trick is to use xgrid itself to generate examples of well-formed plist files. Here is for instance what you could do in the Terminal:

xgrid -h localhost -job submit /usr/bin/cal 10 2005
   {jobIdentifier = 412; }
xgrid -h localhost -job specification -id 412 > ~/Desktop/xgrid-cal-job.xml

Now you have a fresh new file on your Desktop called 'xgrid-cal-job.xml', perfectly formatted and ready to be tweaked for your own purposes. In fact, the specification of any job submitted to Xgrid can be retrieved this way.

Wednesday, May 9, 2007

A discussion about environment variables on the Apple Xgrid mailing list

I am hoping someone can point me in the right direction. Recently I set up a cluster of 6 XServe nodes. I am trying to perform a series of Monte Carlo simulations on the cluster, submitting the jobs via Xgrid (and am rather new at using Xgrid). The code requires certain user defined environment variables to be set at run time. I actually set these manually within the /etc/profile and /etc/bashrc on each node. The executable I am trying to run was compiled with g++ 3.3 and takes a series of values at the command line as input. Every time I submit, the code throws an exception saying that an environment variable is not set. I am at a loss of what to do.

Recently I began using GridStuffer for job submission. In an attempt to bypass the environment variable problem I wrote a simple shell script which first sets the variable and then calls the program with the command line input. Here it is:

----------------------

#!/bin/sh

export G4LEDATA=/usr/local/Geant4.7.0/DataFiles/G4EMLOW2.3

echo $G4LEDATA

./proton true true false ICRU-49p false false monoenergetic pencil 400 400.0 154.0 150.0 0.0 30.0 20.0 water water 1.0 1.0 1000

----------------------

The name of the script is then listed in the first and only line of a text file that I use as input to GridStuffer. I do indeed get the variable $G4LEDATA sent to stdout, but the program will not run beyond a certain point. There is nothing written to stderr. If I don't set $G4LEDATA, the exception I mentioned above is sent to stderr.

Another thing I tried was to copy the executable (proton) to each node, all in the same directory. Then in the script I have /directory/proton true true ..... instead of ./proton true true.... However, as far as I can tell the program never executes.

My apologies for being rather verbose, but I am really stuck at this point. I openly acknowledge my lack of knowledge with Xgrid and GridStuffer, and think that the problem is in my not fully understanding how either really works. I have completely turned off all password authentication between the controller and agents since the cluster is not online (completely stand-alone). A couple of questions:

(1) How does the controller log into the agent to transfer files? I am assuming it is as a generic user. Shouldn't the user have full access to environment variables defined in /etc/profile?

(2) With GridStuffer, is there a better way than what I am doing to submit the job? For instance, using the directives -dirs and -files to force certain files to be copied to the agent?

Any help will be greatly appreciated.

P.S. Charles, I hope you see this because your help would be very beneficial..

Thanks,

Dan

--------

Dr. Dan J. Fry

Physicist

Henry M. Jackson Foundation For The Advancement of Military Medicine

Walter Reed Army Medical Center

6900 Georgia Avenue, NW

Washington, DC 20307

I did some testing with environment variables to make sure, but it seems quite certain that Xgrid will not load environment variables, or more precisely the shell won't, even if explicitly called using /bin/sh somewhere in the text. This is not too surprising.

The shell script approach you propose should work better for this purpose. If you really need to set up the environment from the information specific for the agent, you might alternatively read /etc/profile manually to load the env var there in your script (not sure how to do that _exactly_).

Now, it seems your program still won't run "beyond a certain point" within the script. What exactly happens then? Does your program only need $G4LEDATA? Or is another env var missing? Look for messages in the agent Console. Anyway, I would definitely go ahead with the script wrapping approach and iron out the other problems then, which might be different.

Regarding the GridStuffer format, you only need -files to explicitly force the addition of a file to the job (and you probably don't need -dirs). If 'proton' is in the same path as the input file, and if you don't need any other files to run the program, then you are fine: no -files needed, GridStuffer will figure it out. If you can have everything set up on the agent, even better, then use only full paths for the program and files in the job submission.

Finally, Xgrid agent will usually run as user 'nobody' (unless you are using Kerberos auth or you manually start the agent as a different user).

hope that helps,

charles
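The reply leaves open how a wrapper script might read /etc/profile itself. One possible shape is sketched below; it is not taken from the thread. load_env_file is a hypothetical helper, and a throwaway file stands in for /etc/profile so the example runs anywhere.

```shell
#!/bin/sh
# Sketch: load environment settings from a file before starting the program.
# load_env_file is a hypothetical helper name.
load_env_file() {
    # Source the file into the current shell if it is readable; skip it otherwise.
    if [ -r "$1" ]; then
        . "$1"
    fi
}

# Demonstration with a temporary file standing in for /etc/profile.
demo_profile=$(mktemp)
echo 'export G4LEDATA=/usr/local/Geant4.7.0/DataFiles/G4EMLOW2.3' > "$demo_profile"
load_env_file "$demo_profile"
rm -f "$demo_profile"
echo "G4LEDATA=$G4LEDATA"
```

In a real wrapper the call would presumably be load_env_file /etc/profile, followed by the program invocation. Whether a given /etc/profile can be sourced cleanly from a non-login shell running as the agent's user is an assumption that would need checking on each agent.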

Gate needs to be installed on every Xgrid agent

From Apple's Xgrid mailing list:

1. Question: Is it possible NOT to duplicate the whole environment (binaries, libraries, etc...), and just send the input files? In other words, is it possible to have the executable already on the agents (as we have), and just the input files on the controller? Answer: Yes. The simplest way to do this is to pre-install the executables, and simply use a shell script which invokes them as the 'command'.

2. One thing about the DarwinPorts version: it builds povray using shared libraries, which means that when you run that version of povray, it looks in /usr/lib (or /opt/usr/lib--I don't remember) to find the required libraries (libpng for example). That means you have to pre-install those libraries on all of the agent computers, which can be a problem if you don't have direct control over the agents. Anton Raves' compile instructions don't use shared libraries, but copy them into the executable (which makes the executable larger, but makes it run better on the agents).

3. For example, using Xgrid to run thousands of iterations of a Monte Carlo is a really good idea. The trick is that you either need a stand-alone program for the computation, or you need your computational environment (Mathematica, Matlab, etc.) pre-installed on every agent machine.

Xgrid tutorials on the web

The Xgrid Tutorials (Part I): Xgrid Basics

http://www.macresearch.org/the_xgrid_tutorials_part_i_xgrid_basics

Tuesday, May 8, 2007

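Setting up my own Mac OS X machine as an Xgrid controller

The goal is to see whether Gate can run on it...

Run sudo xgridctl c start

The new controller then shows up in Xgrid Admin, but it cannot be added. The password is probably wrong, but I don't know where to set it.

Another approach: XgridLite

After installation, an XgridLite pane appears under Other in System Preferences, where the controller can be managed and its password set.

Manual for the Xserve cluster node

Local download: XserveClusterNodeQuickStart.pdf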

Name: Shawn@TutorSky.Net

Location: Philadelphia, PA, United States

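Shawn is Chinese and currently does research on medical devices at a university in the United States. This is a study group he created around his own learning interests; friends who share those interests are welcome to join. Learn together, improve together. Sharing is good!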

Xgrid Study Group