Running mpirun's cpi example produces a "child process..." error message

火星人 @ 2014-03-04

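Background:
0. The operating system is Fedora.
1. There are two workstations, and each can ssh into the other without a password.
2. Both have mpich-1.2.7p1 installed.
3. I edited /usr/local/mpich/share/machines.LINUX on both workstations and added the following:
    n1:4
    n2:4
4. With only one CPU, everything works normally.
5. With more than one CPU, however, running /usr/local/mpich/bin/mpirun -np 4 cpi
shows the following error messages: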
p0_5250:p4_error:child process exited while making connection to remote process
on n2:0

p0_5250:(6.288410) net_send:could not write to fd=4, error=32

I also tried using a single workstation, i.e. with only "n1:4" in machines.LINUX, and got messages similar to the above.
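To make the machines.LINUX entries above concrete, here is a minimal sketch that reads the `host:ncpu` lines and totals the available process slots. It assumes that a bare host name counts as one slot and that `#` starts a comment; the function names `parse_machines` and `total_slots` are hypothetical helpers, not part of MPICH.

```python
from typing import List, Tuple


def parse_machines(text: str) -> List[Tuple[str, int]]:
    """Parse 'host' or 'host:ncpu' lines into (host, ncpu) pairs.

    Assumptions (not taken from the thread): '#' starts a comment,
    and a host listed without ':ncpu' provides one slot.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank lines and comment-only lines
        host, sep, count = line.partition(":")
        entries.append((host.strip(), int(count) if sep else 1))
    return entries


def total_slots(entries: List[Tuple[str, int]]) -> int:
    """Sum the per-host slot counts."""
    return sum(ncpu for _, ncpu in entries)


if __name__ == "__main__":
    sample = "n1:4\nn2:4\n"  # the two entries quoted in the post
    entries = parse_machines(sample)
    print(entries)               # [('n1', 4), ('n2', 4)]
    print(total_slots(entries))  # 8
```

With the two entries from the post this yields eight slots, so `-np 4` is within what the file advertises; the failure is therefore more likely in starting the remote processes than in the slot count.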

I found the following on Google:
step 1.Installed ubuntu on my laptop. update all package/ software through internet.
step 2. As u have done making ssh password free from master to slave.
step 3. edit /etc/hosts with all master and slave host names
step 4. Before installation of mpich. configure NFS on master node
step 5 install mpich into shared directory also edit ../mpich/util/machines/machines.LINUX which will have nodes entry
step 6 run tstmachines -v LINUX

(http://www.artima.com/forums/flat.jsp?forum=264&thread=143616)
Judging from these steps, does running mpirun also require NFS to be set up? I am not sure whether I have understood this correctly.

Many thanks!
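《Solution》

It may be a permissions problem with the NFS share on the nodes.

Which user do you log in as over ssh between the machines? What permissions does that user have after logging in?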
《Solution》

Thanks, runer :)
Both ssh root@192.168.1.192 and ssh rsync@192.168.1.192 log in successfully.
The permissions were set with chmod 600 .ssh/authorized_keys.
Sorry to trouble you, but could you take another look?
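As a rough way to double-check the key-file permissions mentioned above, the sketch below reports any of the home directory, `~/.ssh`, or `~/.ssh/authorized_keys` that is missing or writable by group or others. It assumes sshd runs with `StrictModes` enabled (the OpenSSH default), under which such paths can cause public-key logins to be refused; `check_ssh_paths` is a hypothetical helper name.

```python
import os
import stat


def writable_by_others(path: str) -> bool:
    """True if the group or other write bit is set on path."""
    return bool(stat.S_IMODE(os.stat(path).st_mode) & 0o022)


def check_ssh_paths(home: str):
    """Return (path, problem) pairs for the paths a strict sshd inspects.

    Assumption: sshd uses StrictModes, so group/other-writable paths
    may make it ignore authorized_keys.
    """
    candidates = [
        home,
        os.path.join(home, ".ssh"),
        os.path.join(home, ".ssh", "authorized_keys"),
    ]
    problems = []
    for path in candidates:
        if not os.path.exists(path):
            problems.append((path, "missing"))
        elif writable_by_others(path):
            problems.append((path, "group/other writable"))
    return problems


if __name__ == "__main__":
    for path, problem in check_ssh_paths(os.path.expanduser("~")):
        print(f"{path}: {problem}")
```

This would need to be run on each node, as the same user that mpirun runs under, since a login that works for one account says little about another.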
《Solution》

I just ran tstmachines LINUX to test the system. The output is as follows:
---------------------------------------
# ./tstmachines LINUX
Errors while trying to run ssh n1 -n true
Unexpected response from n1:
--> Permission denied (publickey,gssapi-with-mic).
If your .cshrc, login, .bashrc, or other startup file
contains a command that generates any output when logging in,
such as fortune or hostname or even echo, you should modify
that startup file to only print such a message when the
process is attached to a terminal.  Examples of how to do
this are in the Users Manual.  If you do not do this, MPICH
will still work, but this script and the test programs will
report problems because they compare expected output from
what the programs produce.
    The test of ssh <machine> true  failed on some machines.
    This may be due to problems in your .login or .cshrc files;
    some common problems are described when detected.  Look at the
    output above to see what the problem is.

    If the problem is something like 'permission denied', then the
    remote shell command ssh does not allow you to run programs.
    See the documentation about remote shell and rhosts.

Errors while trying to run ssh n1 -n /bin/ls /usr/local/mpich/sbin/mpichfoo
Unexpected response from n1:
--> Permission denied (publickey,gssapi-with-mic).
Unexpected response from n2:
--> /bin/ls: cannot access /usr/local/mpich/sbin/mpichfoo: No such file or directory
    The ls test failed on some machines.
    This usually means that you do not have a common filesystem on
    all of the machines in your machines list; MPICH requires this
    for mpirun (it is possible to handle this in a procgroup file; see
    the documentation for more details).

    Other possible problems include:
        The remote shell command ssh does not allow you to run ls.
           See the documentation about remote shell and rhosts.
        You have a common file system, but with inconsistent names.
           See the documentation on the automounter fix.


2 errors were encountered while testing the machines list for LINUX
No machines seem to be available!
------------------------------------------
Does this mean I need to read the material on remote shell and rhosts?
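The first test in the output above runs `ssh <machine> -n true` and treats any output as a failure. A minimal sketch of the same probe, run per host, might look like the following. The `-o BatchMode=yes` option is an addition here (it makes ssh fail instead of prompting for a password), and `probe_command` and `probe_hosts` are hypothetical names, not part of tstmachines.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def probe_command(host: str) -> List[str]:
    """Build the probe: the 'ssh <host> -n true' call from the output above,
    plus BatchMode (an assumption added here) to avoid password prompts."""
    return ["ssh", "-o", "BatchMode=yes", host, "-n", "true"]


def probe_hosts(
    hosts: Iterable[str],
    run: Callable = subprocess.run,
    timeout: float = 15.0,
) -> Dict[str, str]:
    """Return {host: message} for every host whose probe fails or prints anything."""
    failures = {}
    for host in hosts:
        try:
            result = run(
                probe_command(host),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            failures[host] = "timed out"
            continue
        output = (result.stdout + result.stderr).strip()
        # Like tstmachines, treat any output as a problem, not just a bad exit code.
        if result.returncode != 0 or output:
            failures[host] = output or f"exit status {result.returncode}"
    return failures
```

Calling `probe_hosts(["n1", "n2"])` on the machine that launches mpirun should return an empty dictionary once every host, including the local one, accepts a non-interactive login. In the output above it is n1 that answers "Permission denied".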
《Solution》

I followed the fix described in the manual:
    If this fails, then you may need a '.rhosts' or '/etc/hosts.equiv' file (you may need
to see your system administrator) or you may need to use the p4 server. Another possible
problem is the choice of the remote shell program; some systems have several. Check with
your systems administrator about which version of rsh or remsh you should be using.
    If your system allows a '.rhosts' file, do the following:
    • Create a file .rhosts in your home directory
    • Change the protection on it to user read/write only: chmod og-rwx .rhosts.
    • Add one line to the .rhosts file for each processor that you want to use. The format is
host username
    For example, if your username is doe and you want to use machines a.our.org and
b.our.org, your .rhosts file should contain
a.our.org doe
b.our.org doe
    Note the use of fully qualified host names (some systems require this).
    On networks where the use of .rhosts files is not allowed, (such as the one in MCS at
Argonne), you should use the p4 server to run on machines that are not trusted by the
machine that you are initiating the job from.
    Finally, you may need to use a non-standard rsh command within MPICH. MPICH must be
reconfigured with -rsh=command name, and perhaps also with -rshnol if the remote shell
command does not support the -l argument. Systems using Kerberos and/or AFS may
need this.

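I created a .rhosts file in root's home directory and added the host names with the corresponding user names, but the error is still there.
Do I also need to recompile mpich?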
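For reference, the three steps quoted from the manual (create the file, restrict it to the owner, add one `host username` line per machine) can be sketched as below. The helper names are hypothetical, and the hosts and user in the example are the ones from the manual excerpt. Note that the tstmachines output earlier in the thread shows ssh as the remote shell, so it is not certain that a .rhosts file has any effect on this setup.

```python
import os
from typing import Iterable


def rhosts_text(hosts: Iterable[str], username: str) -> str:
    """One 'host username' line per machine, as the manual excerpt describes."""
    return "".join(f"{host} {username}\n" for host in hosts)


def write_rhosts(path: str, hosts: Iterable[str], username: str) -> None:
    """Write the file readable and writable by its owner only."""
    # Create with mode 0600 so the file is never briefly readable by others.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(rhosts_text(hosts, username))
    # If the file already existed with looser permissions, tighten them
    # (the effect of 'chmod og-rwx' on an owner-readable file).
    os.chmod(path, 0o600)


if __name__ == "__main__":
    # Hosts and user taken from the manual's example; prints the file body only.
    print(rhosts_text(["a.our.org", "b.our.org"], "doe"), end="")
```

The file would go in the home directory of the user who launches mpirun, on every node listed in machines.LINUX.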
《Solution》

My guess is that /etc/hosts.equiv has not been set up correctly on each of your nodes.
《Solution》

Just use Rocks.

http://coctec.com/docs/service/show-post-5442.html