Errors when running the example programs: asking for help

tao jiang
2010-04-10
2012-10-08
  • tao jiang
    2010-04-10

    Mr. Gu:
    Hello! I have installed Sector/Sphere on our lab cluster. The security server, master
    server, and slaves all start successfully, and every slave periodically receives messages from the master:
    recv cmd 192.168.0.21 6000 type 1
    recv cmd 192.168.0.21 6000 type 1
    I assume this means the system is running normally. I then ran the testfs example on a client; the output of one of the slaves is:

    This Sector slave is successfully initialized and running now.
    slave process: GMP 32771 DATA 32772
    recv cmd 192.168.0.21 6000 type 105
    recv cmd 192.168.0.21 6000 type 105
    recv cmd 192.168.0.21 6000 type 203
    starting SPE ... 1 32796 randwriter 4
    rendezvous connect 192.168.0.24 32796
    connected
    new job /tmp/guide.dat 1 1
    rendezvous connect 192.168.0.20 32785
    rendezvous connect 192.168.0.20 32785
    completed 100 192.168.0.24 32794
    sending data back... 0
    report 192.168.0.21 6000 2
    recv cmd 192.168.0.21 6000 type 1
    recv cmd 192.168.0.21 6000 type 105
    comp server closed 192.168.0.24 32794 19
    reportSphere 192.168.0.21 6000 4
    recv cmd 192.168.0.21 6000 type 1
    recv cmd 192.168.0.21 6000 type 110
    ===> start file server 192.168.0.21 6000
    recv cmd 192.168.0.21 6000 type 110
    ===> start file server 192.168.0.21 6000
    rendezvous connect source 192.168.0.20 32785
    /home/rhett/data///test/sort_input.1.dat
    connected
    rendezvous connect source 192.168.0.20 32785
    /home/rhett/data///test/sort_input.1.dat.idx
    connected
    file server closed 192.168.0.20 32785 0
    report 192.168.0.21 6000 1
    file server closed 192.168.0.20 32785 2.33472
    report 192.168.0.21 6000 1

    The output of the other slave is:
    recv cmd 192.168.0.21 6000 type 103
    recv cmd 192.168.0.21 6000 type 110
    ===> start file server 192.168.0.21 6000
    rendezvous connect source 192.168.0.24 32796 /home/rhett/data//tmp/guide.dat
    connected
    file server closed 192.168.0.24 32796 0
    report 192.168.0.21 6000 1
    recv cmd 192.168.0.21 6000 type 110
    ===> start file server 192.168.0.21 6000
    rendezvous connect source 192.168.0.24 32796
    /home/rhett/data//tmp/guide.dat.idx
    connected
    file server closed 192.168.0.24 32796 0
    report 192.168.0.21 6000 1
    recv cmd 192.168.0.21 6000 type 203
    starting SPE ... 0 32796 randwriter 3
    rendezvous connect 192.168.0.24 32796
    connected
    new job /tmp/guide.dat 0 1
    recv cmd 192.168.0.21 6000 type 110
    ===> start file server 192.168.0.21 6000
    rendezvous connect source 192.168.0.17 32772
    /home/rhett/data//tmp/guide.dat.idx
    connected
    file server closed 192.168.0.17 32772 4e-05
    report 192.168.0.21 6000 1
    recv cmd 192.168.0.21 6000 type 110
    ===> start file server 192.168.0.21 6000
    rendezvous connect source 192.168.0.17 32772 /home/rhett/data//tmp/guide.dat
    connected
    file server closed 192.168.0.17 32772 3.2e-05
    report 192.168.0.21 6000 1
    recv cmd 192.168.0.21 6000 type 1
    completed 100 192.168.0.24 32794
    sending data back... 0
    report 192.168.0.21 6000 2
    recv cmd 192.168.0.21 6000 type 105
    recv cmd 192.168.0.21 6000 type 105
    recv cmd 192.168.0.21 6000 type 105
    comp server closed 192.168.0.24 32794 21
    reportSphere 192.168.0.21 6000 3
    recv cmd 192.168.0.21 6000 type 1
    recv cmd 192.168.0.21 6000 type 111
    recv cmd 192.168.0.21 6000 type 111
    report 192.168.0.21 6000 1

    The system currently uses two slaves, and the replication factor is set to 2. I changed the size of the file generated by the corresponding UDF in funcs to 1 GB (by reducing the iteration counts of both for loops by a factor of ten), and then checked the files generated under the test directory on the two slaves. On one of them the result is:

    $ ll
    total 1055736
    -rw-rw-r-- 1 rhett rhett 1000000000 Apr 10 23:49 sort_input.0.dat
    -rw-rw-r-- 1 rhett rhett 80000008 Apr 10 23:50 sort_input.0.dat.idx

    On the other slave:
    $ ll
    total 1144
    -rw-rw-r-- 1 rhett rhett 1167360 Apr 11 06:25 sort_input.1.dat
    -rw-rw-r-- 1 rhett rhett 0 Apr 11 06:25 sort_input.1.dat.idx

    Before this change, running testfs with the unmodified UDF gave similar results: the files generated on the two slaves differ in size, and the difference is large. Is this correct?
    If not, where did it go wrong? Thanks!
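
    For what it's worth, the sizes on the first slave are self-consistent if randwriter (named in the logs above) writes 100-byte records and the .idx file holds one 8-byte offset per record plus a final offset; this is my own assumption, inferred purely from the numbers above:

    $ echo $((1000000000 / 100))
    10000000
    $ echo $(((10000000 + 1) * 8))
    80000008

    So 80000008 bytes is exactly what a complete index for a 1 GB data file should be, whereas the 1167360-byte sort_input.1.dat and its zero-byte .idx on the second slave look truncated.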

    Also, when running testfs, the client output is:

    $ ./testfs
    open file tmp/guide.dat 192.168.0.20 32785
    open file tmp/guide.dat.idx 192.168.0.20 32785
    start time 1270910785
    JOB 8 2
    2 spes found! 2 data seg total.
    connect SPE 192.168.0.20 3
    connect SPE 192.168.0.17 4

    I hope I have managed to describe the problem clearly. Thank you for taking time out of your busy schedule to look at my rather basic questions!

     
  • tao jiang
    2010-04-10

    When I went on to run the testdc example, the client printed the following:
    $ ./testdc
    start time 1270913085
    JOB 1001167360 -1
    You have specified the number of records to be processed each time, but there
    is no record index found.
    data segmentation error.
    failed to find any computing resources

    Could this be because the file that testfs generated on one of the slaves is incomplete?
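
    My guess is yes: the error says testdc specifies the number of records to process per segment, which requires the record index, and the sort_input.1.dat.idx shown earlier is zero bytes, so no records can be located. A quick check to run on each slave (path taken from the transcripts above):

    $ ls -l /home/rhett/data/test/sort_input.*.dat.idx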

     
  • tao jiang
    2010-04-10

    While testdc runs, nothing unusual appears on either slave; they just keep printing:
    recv cmd 192.168.0.21 6000 type 1
    recv cmd 192.168.0.21 6000 type 1
    recv cmd 192.168.0.21 6000 type 1
    recv cmd 192.168.0.21 6000 type 1
    recv cmd 192.168.0.21 6000 type 1
    recv cmd 192.168.0.21 6000 type 1

     
  • tao jiang
    2010-04-13

    I found the cause of the first problem: one of the slaves had run out of storage space, so the file it created was incomplete. I rearranged the nodes so that the slaves run on machines
    with more free space, but testfs still fails. All four slaves print the same output:
    $ ./start_slave
    scanning /home/rhett/data/
    This Sector slave is successfully initialized and running now.
    slave process: GMP 32772 DATA 32773
    recv cmd 192.168.0.17 6000 type 1
    recv cmd 192.168.0.17 6000 type 1
    recv cmd 192.168.0.17 6000 type 1
    recv cmd 192.168.0.17 6000 type 1
    recv cmd 192.168.0.17 6000 type 1
    recv cmd 192.168.0.17 6000 type 105
    recv cmd 192.168.0.17 6000 type 203
    starting SPE ... 1 32776 randwriter 9
    rendezvous connect 192.168.0.17 32776
    connected
    Floating point exception

    All of them exit with a floating point exception. This is the client output:
    $ ./testfs
    open file tmp/guide.dat 192.168.0.24 32771
    open file tmp/guide.dat.idx 192.168.0.20 32771
    start time 1271162220
    JOB 16 4
    4 spes found! 4 data seg total.
    connect SPE 192.168.0.20 279
    connect SPE 192.168.0.21 280
    connect SPE 192.168.0.23 281
    connect SPE 192.168.0.24 282
    SPE lost 192.168.0.21
    SPE lost 192.168.0.20
    SPE lost 192.168.0.23
    SPE lost 192.168.0.24
    Cannot allocate SPE for certain data segments. Process failed.
    all SPEs failed

    Mr. Gu, what could be causing this problem? Thanks!

     
  • tao jiang
    2010-04-13

    One more question: if a slave exits unexpectedly and I rerun start_slave on it without
    restarting the security server and the master, the restart fails with the following message:
    $ ./start_slave
    scanning /home/rhett/data/
    slave join rejected. code: -102

    Surely it shouldn't behave this way?

     
  • Yunhong Gu
    2010-04-14

    For the floating point error, can you run at least one slave under gdb and
    post the stack trace here (from "where" or "bt")?
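
    For example, a minimal session (binary name taken from your transcripts; adjust the path if needed):

    $ gdb ./start_slave
    (gdb) run
    (run until the slave stops with SIGFPE, then:)
    (gdb) bt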

    For the 2nd problem (-102), it is because the master needs some time (about 10
    minutes) to remove the lost slave from its metadata. If you restart the slave
    immediately, you will get this error.
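
    So the simplest workaround is to wait before rejoining, e.g. (a sketch):

    $ sleep 600 && ./start_slave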

    Thanks
    Yunhong

     
  • tao jiang
    2010-04-15

    Thanks, Mr. Gu! I will post the trace results here soon.