Grid Engine

Troubleshooting and/or Quick Reference


Troubleshooting

Windows Troubleshooting

Problem: The execution daemon on your Windows system is not starting.

No messages file can be found in <execd_spool_dir>/messages, and you get an error message like:

daemonize error: child exited before sending state.

Reason:
Your sge_execd can't daemonize. One reason for this issue might be a forking problem in your Windows installation when it is running in DEP mode.
You can check this by starting any application that forks, for example qconf -mconf (which starts an external editor) or gcc. If these commands also terminate with a segfault or exit immediately, this is the cause of your problem.
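
For a quick check from an Interix shell, run a couple of commands that fork child processes (a sketch; hello.c stands for any trivial C source file you have at hand):

  % qconf -mconf                  # forks an external editor; it should open and close normally
  % gcc -o /tmp/hello hello.c     # gcc forks its compiler subprocesses

If either command segfaults or terminates immediately, the fork problem described above is the cause.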


Solution:
If you are running Windows XP SP2 with Interix (SFU 3.5) and your processes can't fork, have a look at: http://support.microsoft.com/kb/929141




Installation Troubleshooting



User Troubleshooting

Problem: The output of UGE jobs seems to be buffered; the job's output file is only filled after the job ends.

We would like to see the output appear while the job is still running. What can we do?

Reason:
NFS implements loose caching, which means that files are written with a delay of a few seconds or only after writing has finished. It looks as if UGE is buffering the output files.
This is not UGE behaviour; it is by NFS design.


Solution:
The effect can be made visible by doing a tail -f on the job's output file, once located on a local volume and once on an NFS volume. Local files are written immediately, while the NFS copy only shows the output after the cache has been flushed.
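
A sketch of such a comparison (paths and the test command are hypothetical, assuming /tmp is a local volume on the execution host and /home is NFS-mounted):

  % qsub -b y -j y -o /tmp/test.local     /bin/sh -c 'for i in 1 2 3 4 5; do date; sleep 10; done'
  % qsub -b y -j y -o /home/user/test.nfs /bin/sh -c 'for i in 1 2 3 4 5; do date; sleep 10; done'

  % tail -f /tmp/test.local        # on the execution host: lines appear as the job writes them
  % tail -f /home/user/test.nfs    # lines may only appear after the NFS cache has been flushed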




Admin Troubleshooting

Problem: Problems with hostname resolution. Some hosts return long hostnames, others return short ones.

Reason:
Hostname resolution problems are very common with first UGE installations. In most cases the cause is a wrong setup; only in the rarest cases is it a UGE issue.


Solution:
1. Where do you get your hostnames from?

  - /etc/hosts
  - nis
  - dns

The lookup order is configured in /etc/nsswitch.conf, in an entry such as "hosts: files dns nis". Compare this with other working hosts, or if you are not sure, use files. This setting should be the same on all hosts which are part of your installation.
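
A sketch of the relevant line and a way to verify it (the exact sources depend on your site; keep the line identical on all cluster hosts):

  # /etc/nsswitch.conf -- the "hosts" line controls the lookup order
  hosts:  files dns nis

  # getent (on Linux) resolves using exactly this order, so it can be used to check the result:
  % getent hosts execd1_hostname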

2. If you are using the files configuration:

  - Edit the /etc/hosts file.
    Do not map the hostname to the localhost entry, i.e. avoid a line like: 127.0.0.1 localhost <your hostname>
    192.168.1.10 qmaster_hostname.your-domain qmaster_hostname                              <- this is optional
    192.168.1.11 execd/submit/admin_hostname.your-domain execd/submit/admin_hostname        <- this is optional
    192.168.1.12 execd1/submit1/admin1_hostname.your-domain execd1/submit1/admin1_hostname  <- this is optional
    ...
  - Copy all entries into the /etc/hosts files of all your execd/submit/admin hosts.

3. If you are using nis:

  - Execute the command: ypcat -t hosts.byname and check in the output that all hosts of your cluster are listed and that the
    hostnames are correct and consistently either long or short.

4. If you are using dns:

  - Execute the command: nslookup <hostname> for each qmaster/submit/execd/admin host and check in the output that the hostnames
    are correct and consistently either long or short.

5. Then check that all hosts answer with the same form of hostname (long or short):

  - on the master host using gethostbyname -aname <qmaster hostname>
  - on the master host using gethostbyname -aname <execd/submit/admin hostname>
  - on each execd/admin/submit host using gethostbyname -aname <qmaster hostname>
  - on each execd/admin/submit host using gethostbyname -aname <execd/submit/admin hostname>

Either ALL hosts in a cluster have to resolve as short hostnames (without domain), or ALL hosts in a cluster have to resolve with long names (including the domain).
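
The gethostbyname utility ships in the Grid Engine utilbin directory; a sketch of the check, assuming a standard installation and example hostnames:

  # run these on the qmaster host and on every execd/submit/admin host
  % $SGE_ROOT/utilbin/$($SGE_ROOT/util/arch)/gethostbyname -aname qmaster_hostname
  % $SGE_ROOT/utilbin/$($SGE_ROOT/util/arch)/gethostbyname -aname execd1_hostname
  # every host must report the names in the same form: all short or all long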

If you are using a mixed setup, e.g. a qmaster host with two network interfaces where hosts from two different subnets submit jobs to the cluster, it is possible to set up a host_aliases file at: <sge_root>/<cell>/common/host_aliases
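
A minimal sketch of such a file (hostnames are placeholders; the first name on a line is the unique name UGE should use, the following names are treated as aliases for it; see the man page referenced below for the exact format):

  # <sge_root>/<cell>/common/host_aliases
  qmaster_hostname  qmaster_hostname.subnet-a.your-domain  qmaster_hostname.subnet-b.your-domain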


For documentation please look into the man page: man host_aliases. After that, the hosts must be resolvable and return the right hostnames. If all of this is working, remove the non-working host from the cluster and add it again. If possible, e.g. on test systems with just a few hosts, a reinstallation can be done.

Problem: What does "Skipping remaining x orders" mean, and do I have to be concerned?

You get error messages in your qmaster messages file looking like this:

01/25/2012 16:10:21|worker|grid-master|W|Skipping remaining 29 orders
01/25/2012 16:10:22|schedu|grid-master|E|unable to find job 4961392 from the scheduler order package


Reason:

The job with JOB_ID 4961392 could not be found, either because it was deleted or because of some other error.


Solution:
If the job has been deleted there is no reason for concern. Otherwise, depending on the error messages, further checks have to be done to find out why the job is gone.

If the job was deleted, this message is not an error, so no fix is required.
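
One way to check what happened to the job (a sketch, using the job ID from the message above):

  % qstat -j 4961392     # reports an error if the job is no longer known to the qmaster
  % qacct -j 4961392     # the accounting record shows whether the job finished normally or was deleted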


Problem: UGE shows wrong m_socket counts for my execution hosts.

You are running a machine with 2 sockets and 4 cores per socket, which should be reported with the topology SCCCCSCCCC, but it is reported like this:

SCSCSCSCSCSCSCSC

Reason:
You are running an old kernel version (kernel version < 2.6.16).


Solution:
Update your kernel to version 2.6.16 or higher.
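
A quick way to check the kernel version and the topology reported for a host (a sketch; <execd_hostname> is a placeholder and the resource names assume the standard UGE host values):

  % uname -r                                                   # must report 2.6.16 or newer
  % qhost -h <execd_hostname> -F m_socket,m_core,m_topology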


Problem: Problems with running large OpenMPI jobs in UGE with more than 129 slots (or any other fixed slot count).

You are running into the problem that jobs in UGE cannot request more than 129 slots. The request itself works and UGE shows no error and does not crash; the job simply hangs. When the same job is run outside of UGE, larger jobs with more than 129 slots work.

Reason:
The mpirun command looks like this:

mpirun -mca orte_rsh_agent ssh:rsh -mca ras_gridengine_verbose 100 -server-wait-time 60 /path/to/job/binary

The orte_rsh_agent parameter is set to ssh:rsh, which might be the problematic module here. OpenMPI seems to limit the number of concurrent rsh processes in this module.

Without UGE, the module is not loaded and no limit is set, so more than 129 tasks run. When used with UGE, OpenMPI loads additional modules that set this limit, and jobs using more than 129 slots won't run.


Solution:
To check this limit use the ompi_info command:

  % ompi_info -all | grep plm_rsh_num_concurrent

MCA plm: parameter "plm_rsh_num_concurrent" (current value: "128", data source: default value)   <---- here the limit of 128 is set.


To work around this problem, the plm_rsh_num_concurrent parameter can be set on the mpirun call:

  mpirun -mca plm_rsh_num_concurrent 256 -np 2000   <---- here set to 256

Setting this parameter fixes the problem.
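
As a usage note (standard OpenMPI behaviour, not specific to UGE): MCA parameters can alternatively be exported as environment variables before calling mpirun, which avoids editing every mpirun command line:

  % export OMPI_MCA_plm_rsh_num_concurrent=256
  % mpirun -np 2000 /path/to/job/binary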