Thread: [Linuxptp-users] ptp4l is dead with Segmentation Fault
PTP IEEE 1588 stack for Linux
From: Takahiro S. <tsh...@gm...> - 2012-03-08 06:57:50
Hello,

I am trying ptp4l with a hardware clock.
The IEEE 1588 hardware is the Intel EG20T.
The sync seems OK.
However, ptp4l sometimes dies when we transfer many data simultaneously.
The error is a segmentation fault.
I checked the core dump.

[root@Fedora14-LinuxTeam-dev linuxptp-code]# ./ptp4l -i eth7 -m -v
Segmentation fault (coredump)
[root@Fedora14-LinuxTeam-dev linuxptp-code]# gdb ptp4l core.21111
GNU gdb (GDB) Fedora (7.2-51.fc14)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/okada/work/ieee1588/test-app/linuxptp-code/ptp4l...done.
[New Thread 21111]
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/65/726c76f0ef4466fe3e7f4b0b78d51d49bf149b
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib/libpthread.so.0
Core was generated by `./ptp4l -i eth7 -m -v'.
Program terminated with signal 11, Segmentation fault.
#0 0x002a7f12 in __cmsg_nxthdr () from /lib/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.13-1.i686
(gdb) where
#0 0x002a7f12 in __cmsg_nxthdr () from /lib/libc.so.6
#1 0x0804ec51 in receive (fd=9, buf=0xbff89258, buflen=44, hwts=0x9abe6e0, flags=8192) at udp.c:286
#2 0x0804edda in udp_send (fda=0x9a7b0e0, event=1, buf=0x9abe680, len=44, hwts=0x9abe6e0) at udp.c:343
#3 0x0804ca2c in port_delay_request (p=0x9a7b0d0) at port.c:354
#4 0x0804dd60 in port_event (p=0x9a7b0d0, fd_index=3) at port.c:886
#5 0x0804a3bb in clock_poll (c=0x8051360) at clock.c:316
#6 0x080493da in main (argc=5, argv=0xbff89ad4) at ptp4l.c:195
(gdb) q

The error occurred in the receive function in udp.c.
I assume that errno == EAGAIN happens many times.
I tried usleep(1000); instead of usleep(1);

The result seems OK.
My questions are:
1. Did anyone face the same issue?
2. Is this modification correct?

Thanks and Best regards,
From: Richard C. <ric...@gm...> - 2012-03-08 18:11:57
On Thu, Mar 08, 2012 at 03:57:42PM +0900, Takahiro Shimizu wrote:
> Hello,
>
> I am trying ptp4l with a hardware clock.
> The IEEE 1588 hardware is the Intel EG20T.
> The sync seems OK.
> However, ptp4l sometimes dies when we transfer many data simultaneously.

What do you mean by "many data simultaneously"?

> The error is a segmentation fault.
> I checked the core dump.

> Core was generated by `./ptp4l -i eth7 -m -v'.
                                       |
Wow, you have a lot of interfaces! ----+

> Program terminated with signal 11, Segmentation fault.
> #0 0x002a7f12 in __cmsg_nxthdr () from /lib/libc.so.6

This is strange. It seems that the control buffer is corrupted.

Did you modify the receive() function?

Can you provide a hex dump of the control[] buffer and the values of
'struct msghdr msg' before and after the call to recvmsg?

> The error occurred in the receive function in udp.c.
> I assume that errno == EAGAIN happens many times.
> I tried usleep(1000); instead of usleep(1);
>
> The result seems OK.
> My questions are:
> 1. Did anyone face the same issue?

Yes, this issue appears when the driver is very slow to deliver a Tx
time stamp.

> 2. Is this modification correct?

I think it is fine to increase the usleep time for testing, but you
should find out why the time stamp is so slow.

The better solution (for ptp4l) is to allow changing the value of
'try_again' as command line or config file option.

HTH,
Richard
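Editor's sketch: the idea of making the retry count (and the sleep between retries) configurable could look roughly like the code below. This is only an illustration under stated assumptions, not the actual ptp4l code. The thread confirms only that a 'try_again' value and a usleep(1) exist in udp.c; the names retry_cfg, poll_with_retry, fake_poll and demo_retry are hypothetical, and the fake poll function stands in for a driver that is slow to deliver the Tx time stamp.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical settings; in ptp4l they would come from the command
 * line or the config file. Neither field name exists in linuxptp. */
struct retry_cfg {
	int attempts;   /* how many times to poll for the time stamp */
	long sleep_us;  /* pause between polls, in microseconds */
};

/* A poll callback returns >= 0 on success, or -1 with errno set. */
typedef int (*poll_fn)(void *ctx);

static int poll_with_retry(poll_fn fn, void *ctx, const struct retry_cfg *cfg)
{
	struct timespec pause = {
		.tv_sec = cfg->sleep_us / 1000000,
		.tv_nsec = (cfg->sleep_us % 1000000) * 1000,
	};
	int i, res = -1;

	for (i = 0; i < cfg->attempts; i++) {
		res = fn(ctx);
		if (res >= 0 || errno != EAGAIN)
			break;                   /* success or a hard error */
		if (i + 1 < cfg->attempts)
			nanosleep(&pause, NULL); /* not ready yet: wait, then retry */
	}
	return res;
}

/* Stand-in for a slow driver: fails with EAGAIN a given number of
 * times, then reports 44 bytes received. */
static int fake_poll(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		errno = EAGAIN;
		return -1;
	}
	return 44;
}

/* Run the retry loop against the fake driver. */
static int demo_retry(int failures, int attempts)
{
	struct retry_cfg cfg = { .attempts = attempts, .sleep_us = 1000 };

	return poll_with_retry(fake_poll, &failures, &cfg);
}
```

With this sketch, demo_retry(3, 5) returns 44 because the fourth poll succeeds, while demo_retry(3, 2) returns -1 because the attempts run out first. The caller still has to handle the -1 case, which is exactly where the crash in this thread came from.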
From: Takahiro S. <tsh...@gm...> - 2012-03-09 01:50:16
Hello Richard,

Thank you for the response.

> What do you mean by "many data simultaneously"?

I connect 2 CrownBay (Intel E6xx CPU + Intel PCH EG20T) boards and 1 PC in the same network.
I installed Linux kernel 3.3-rc6 on the 2 CrownBays.
I installed Windows 7 on the PC.
CrownBay-1 is a PTP master.
CrownBay-2 is a PTP slave.
A Samba server is running on CrownBay-2.
ptp4l is running on both CrownBay boards.
I mount the directory of CrownBay-2 on the PC.
I copy many files between the PC and CrownBay-2 while ptp4l is running.

> Wow, you have a lot of interfaces! ----+

I am testing many boards (with different MAC addresses) using the same HDD.
Therefore the interface number keeps increasing. :)

> This is strange. It seems that the control buffer is corrupted.
>
> Did you modify the receive() function?

No, I didn't.

> Can you provide a hex dump of the control[] buffer and the values of
> 'struct msghdr msg' before and after the call to recvmsg?

OK. I'll try it.

> Yes, this issue appears when the driver is very slow to deliver a Tx
> time stamp.

I see.

> I think it is fine to increase the usleep time for testing, but you
> should find out why the time stamp is so slow.

I see. That may be what is happening.
The EG20T GbE DMA hardware uses descriptor-type DMA.
The GbE driver just assigns the Tx buffer address to a DMA descriptor.
The actual Tx is executed later.

> The better solution (for ptp4l) is to allow changing the value of
> 'try_again' as command line or config file option.

I think that seems better.

Thanks and Best regards,

Takahiro Shimizu
From: Takahiro S. <tsh...@gm...> - 2012-03-09 09:59:40
Hello Richard,

>> Can you provide a hex dump of the control[] buffer and the values of
>> 'struct msghdr msg' before and after the call to recvmsg?
>
> OK. I'll try it.

I am trying to dump them.
However, if I modify the source code to dump them, the issue does not happen.
It depends critically on timing.
I will keep trying. If I am able to get the dump, I will send it.

Thanks,

Takahiro Shimizu
From: Richard C. <ric...@gm...> - 2012-03-09 11:52:25
On Fri, Mar 09, 2012 at 06:59:29PM +0900, Takahiro Shimizu wrote:
> Hello Richard,
>
> >> Can you provide a hex dump of the control[] buffer and the values of
> >> 'struct msghdr msg' before and after the call to recvmsg?
> >
> > OK. I'll try it.
>
> I am trying to dump them.
> However, if I modify the source code to dump them, the issue does not happen.
> It depends critically on timing.
> I will keep trying. If I am able to get the dump, I will send it.

Please forget what I said.

I think I see the bug. I will send a patch to try out soon.

Thanks,
Richard
From: Takahiro S. <tsh...@gm...> - 2012-03-09 14:30:28
Hello Richard,

Thank you for the information.

> Please forget what I said.
>
> I think I see the bug. I will send a patch to try out soon.

I will wait for your patch.

Thanks,

Takahiro Shimizu
From: Richard C. <ric...@gm...> - 2012-03-10 07:26:21
On Fri, Mar 09, 2012 at 11:30:17PM +0900, Takahiro Shimizu wrote:
> Hello Richard,
>
> Thank you for the information.
>
> > Please forget what I said.
> >
> > I think I see the bug. I will send a patch to try out soon.
>
> I will wait for your patch.

I think you are sometimes not getting any time stamp at all, and this
causes the try_again loop to end without ever calling recvmsg()
successfully. When this happens, the control[] buffer contains
uninitialized data from the stack.

Can you please try the following patch?

Thanks,
Richard

---
diff --git a/udp.c b/udp.c
index 8a3dc7b..d097f98 100644
--- a/udp.c
+++ b/udp.c
@@ -257,6 +257,7 @@ static int receive(int fd, void *buf, int buflen,
 	struct msghdr msg;
 	struct timespec *ts = NULL;
 
+	memset(control, 0, sizeof(control));
 	memset(&msg, 0, sizeof(msg));
 	msg.msg_iov = &iov;
 	msg.msg_iovlen = 1;
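Editor's sketch: the failure mode described above can be reproduced in isolation. The code below is a minimal illustration, assuming Linux with glibc, and is not the linuxptp receive() function. The helper names count_socket_cmsgs and demo_empty_socket are hypothetical. It makes one recvmsg() attempt on a socket with nothing queued, so the call fails with EAGAIN and leaves msg.msg_controllen still claiming a full buffer, and then walks the control buffer anyway. Because the buffer was zeroed first, the first header has cmsg_len == 0 and glibc's CMSG_NXTHDR() returns NULL instead of following garbage lengths off the stack.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper, not part of linuxptp: make one non-blocking
 * recvmsg() attempt on fd, then walk the control buffer and count the
 * SOL_SOCKET-level control messages found. */
static int count_socket_cmsgs(int fd)
{
	char data[64];
	/* The union keeps the buffer suitably aligned for struct cmsghdr. */
	union {
		char buf[256];
		struct cmsghdr align;
	} control;
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg;
	struct cmsghdr *cm;
	int found = 0;

	/* Same idea as the patch: never leave stack garbage in control. */
	memset(&control, 0, sizeof(control));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control.buf;
	msg.msg_controllen = sizeof(control.buf);

	/* The result is deliberately ignored to mirror the case where no
	 * call to recvmsg() ever succeeds. */
	(void) recvmsg(fd, &msg, MSG_DONTWAIT);

	/* A zeroed header has cmsg_len == 0, which ends the walk. */
	for (cm = CMSG_FIRSTHDR(&msg); cm != NULL; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level == SOL_SOCKET)
			found++;
	}
	return found;
}

/* Exercise the helper on a socket that has nothing to read. */
static int demo_empty_socket(void)
{
	int sv[2], n;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;
	n = count_socket_cmsgs(sv[0]);
	close(sv[0]);
	close(sv[1]);
	return n;
}
```

Calling demo_empty_socket() returns 0: no control message is found and the walk terminates cleanly. A stricter alternative, which the thread does not discuss, would be to skip the walk entirely unless recvmsg() returned successfully.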
From: Richard C. <ric...@gm...> - 2012-03-11 20:19:24
I went ahead and pushed a fix for this issue, so please just try the
latest git version of linuxptp (7421e74a).

Thanks,
Richard
From: Takahiro S. <tsh...@gm...> - 2012-03-11 23:43:53
Hello Richard,

Thank you. I'll try the latest one today.

2012/3/12 Richard Cochran <ric...@gm...>

> I went ahead and pushed a fix for this issue, so please just try the
> latest git version of linuxptp (7421e74a).
>
> Thanks,
> Richard

Best regards,

Takahiro Shimizu
From: Takahiro S. <tsh...@gm...> - 2012-03-12 01:07:18
Hello Richard,

I tried the latest linuxptp.
The sync completion time is fine, and it is robust even if timestamping sometimes fails.
I sometimes see "received SYNC without timestamp" and "port 1: bad message".

Thanks and Best regards,

2012/3/12 Takahiro Shimizu <tsh...@gm...>

> Hello Richard,
>
> Thank you. I'll try the latest one today.
>
> 2012/3/12 Richard Cochran <ric...@gm...>
>
>> I went ahead and pushed a fix for this issue, so please just try the
>> latest git version of linuxptp (7421e74a).
>>
>> Thanks,
>> Richard
>
> Best regards,
>
> Takahiro Shimizu