From: <dr...@bf...> - 2006-10-18 11:24:10
|
Hi,

I use Chromium on Windows XP to drive 6 servers from one client over a 1 Gbps TCP/IP network. Every now and then Chromium locks up. The lock-up occurs in __tcpip_read_exact()'s call to recv(): even though select() reported the socket as ready for reading, the first call to recv() blocks. If I kill the server at the other end of this connection, recv() returns with the correct error code. See the stack trace below for more details.

Does anyone else experience similar problems? I suspect this might be a bug in some network driver (Intel), but I'm not sure.

Michael

  NTDLL.DLL!_KiFastSystemCallRet@0()
  NTDLL.DLL!_ZwWaitForSingleObject@12() + 0xc bytes
  MSWSOCK.DLL!_SockWaitForSingleObject@16() + 0x3c8 bytes
  MSWSOCK.DLL!_WSPRecv@36() + 0x1487 bytes
  WS2_32.DLL!_recv@16() + 0x6f bytes
  crutil.dll!__tcpip_read_exact(unsigned int sock=1892, void * buf=0x0012fb98, unsigned int len=4) Line 212 + 0x14 bytes C
  crutil.dll!crTCPIPReceiveMessage(CRConnection * conn=0x00add6c0) Line 868 + 0xf bytes C
  crutil.dll!crTCPIPRecv() Line 1089 + 0xc bytes C
  crutil.dll!crNetRecv() Line 1183 + 0x5 bytes C
  tilesortspu.dll!tilesortspu_SwapBuffers(int window=1, int flags=0) Line 103 + 0x5 bytes C
  opengl32.dll!stubSwapBuffers(const window_info_t * window=0x00edfe98, int flags=0) Line 987 C
  opengl32.dll!wglSwapBuffers_prox(HDC__ * hdc=0xb5010b1e) Line 275 + 0xb bytes C
  gdi32.dll!_SwapBuffers@4() + 0x25 bytes
  glut32.dll!10009e55()
  [Frames below may be incorrect and/or missing, no symbols loaded for glut32.dll]
  atlantis.exe!Display() Line 349 C
  glut32.dll!100050c2()
  atlantis.exe!WhalePilot(_fishRec * fish=0x00411c60) Line 61 + 0x26 bytes C
  atlantis.exe!Animate() Line 169 + 0xa bytes C
  0012ff58()
  glut32.dll!100048d6()
  atlantis.exe!main(int argc=3, char * * argv=0x00ad5d68) Line 426 C
  atlantis.exe!mainCRTStartup() Line 398 + 0x11 bytes C
  kernel32.dll!_BaseProcessStart@4() + 0x23 bytes
|
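For readers unfamiliar with the receive path in the trace: judging only from its name and the frame `__tcpip_read_exact(..., len=4)`, that function reads an exact number of bytes (here a 4-byte length header) by calling recv() until enough have arrived. The sketch below shows that general pattern. It is not Chromium's source; it assumes a POSIX socket API rather than Winsock, and `read_exact` is a hypothetical name. On a blocking socket every recv() call in the loop may wait indefinitely, which is why missing bytes show up as a hang rather than as an error.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper illustrating an "exact read" on a blocking socket:
 * keep calling recv() until exactly `len` bytes have arrived.
 * Returns len on success, 0 if the peer closed the connection,
 * -1 on error. Each recv() may block until more data arrives. */
ssize_t read_exact(int sock, void *buf, size_t len)
{
    char *p = buf;
    size_t got = 0;

    while (got < len) {
        ssize_t n = recv(sock, p + got, len - got, 0);
        if (n == 0)
            return 0;              /* orderly shutdown by the peer */
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;             /* genuine socket error */
        }
        got += (size_t)n;          /* partial read: ask for the rest */
    }
    return (ssize_t)got;
}
```

Because recv() may return fewer bytes than requested, the loop is needed even when all bytes are already in flight.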
From: Brian P. <bri...@tu...> - 2006-10-18 14:37:27
|
Michael Dürig wrote:
> Hi
>
> I use Chromium on Windows XP to drive 6 servers from one client over a
> 1Gbps TCP/IP network. Every now and then Chromium locks up. The lock-up
> occurs in __tcpip_read_exact()'s call to recv(). Even though select()
> did report the socket to be ready for reading, the first call to recv()
> blocks. If I kill the server of this connection, recv() returns with the
> correct error code. See the stack trace below for more details.
> Does anyone else experience similar problems? I suspect this might be a
> bug in some network driver (Intel) but I'm not sure.

If there are any bugs in Chromium's packer/unpacker code, a common symptom is for the network layer to get stuck in recv(), waiting for bytes that aren't coming.

Is the lock-up only happening with certain apps? Do those apps work OK on other Chromium systems?

-Brian
|
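Brian's point can be made concrete with a length-prefixed read: if a sender announces N bytes but delivers fewer, a plain exact-read loop waits forever. As a diagnostic, a hypothetical `read_exact_timeout` helper (POSIX sockets assumed; this is not Chromium code) can wait in select() with a timeout before each recv() and report a stall instead of hanging.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

#define READ_STALLED (-2)  /* hypothetical status: no data within timeout */

/* Read exactly len bytes, but give up with READ_STALLED if no data
 * arrives within timeout_ms between chunks. Returns len on success,
 * 0 on peer close, -1 on error. */
ssize_t read_exact_timeout(int sock, void *buf, size_t len, int timeout_ms)
{
    char *p = buf;
    size_t got = 0;

    while (got < len) {
        fd_set rfds;
        struct timeval tv;
        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        int r = select(sock + 1, &rfds, NULL, NULL, &tv);
        if (r < 0) {
            if (errno == EINTR)
                continue;          /* retry; the timeout restarts */
            return -1;
        }
        if (r == 0)
            return READ_STALLED;   /* bytes the header promised never came */

        ssize_t n = recv(sock, p + got, len - got, 0);
        if (n == 0)
            return 0;
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```

In a scenario where the header claims 8 bytes but only 3 follow, the header read succeeds and the body read returns `READ_STALLED` after the timeout.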
From: <dr...@bf...> - 2006-10-18 15:29:28
|
>> I use Chromium on Windows XP to drive 6 servers from one client over a
>> 1Gbps TCP/IP network. Every now and then Chromium locks up. The lock-up
>> occurs in __tcpip_read_exact()'s call to recv(). Even though select()
>> did report the socket to be ready for reading, the first call to recv()
>> blocks. If I kill the server of this connection, recv() returns with the
>> correct error code. See the stack trace below for more details.
>> Does anyone else experience similar problems? I suspect this might be a
>> bug in some network driver (Intel) but I'm not sure.
>
> If there are any bugs in Chromium's packer/unpacker code, a common
> symptom is for the network layer to get stuck in recv(), waiting for
> bytes that aren't coming.

I don't think it's a bug in Chromium, since the block occurs on the first call to recv() after select() reported the socket to be readable. That is, crTCPIPRecv() calls select(), finds the socket to be readable, and calls crTCPIPReceiveMessage(), which blocks right away on the statement

  if ( __tcpip_read_exact( sock, &len, sizeof(len)) <= 0 )

So this really shouldn't block, should it?

> Is the lock-up only happening with certain apps? Do those apps work
> OK on other Chromium systems?

It does happen with at least a couple of apps I tried, one of which is atlantis. I didn't test on other systems, though.

Michael
|
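The first step of the flow Michael describes, crTCPIPRecv() using select() to find a readable connection among several, might look roughly like the sketch below. It assumes POSIX sockets, and `first_readable` is a hypothetical helper, not Chromium's implementation. Under select()'s usual contract, a socket it reports as readable can be read without blocking, which is exactly the assumption the subsequent 4-byte header read relies on.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: wait up to timeout_ms for any of the n sockets
 * (e.g. one per server) to become readable. Returns the index of the
 * first readable socket, -1 on error, or -2 if the timeout expired. */
int first_readable(const int *socks, int n, int timeout_ms)
{
    for (;;) {
        fd_set rfds;
        struct timeval tv;
        int maxfd = -1;

        FD_ZERO(&rfds);
        for (int i = 0; i < n; i++) {
            FD_SET(socks[i], &rfds);
            if (socks[i] > maxfd)
                maxfd = socks[i];
        }
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        int r = select(maxfd + 1, &rfds, NULL, NULL, &tv);
        if (r < 0) {
            if (errno == EINTR)
                continue;           /* retry; the full timeout restarts */
            return -1;
        }
        if (r == 0)
            return -2;              /* nothing became readable in time */

        for (int i = 0; i < n; i++)
            if (FD_ISSET(socks[i], &rfds))
                return i;           /* caller would now read the header */
    }
}
```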
From: <dr...@bf...> - 2006-11-23 13:33:56
Attachments:
patch2.patch
|
Michael Dürig wrote:
>>> I use Chromium on Windows XP to drive 6 servers from one client over a
>>> 1Gbps TCP/IP network. Every now and then Chromium locks up. The lock-up
>>> occurs in __tcpip_read_exact()'s call to recv(). Even though select()
>>> did report the socket to be ready for reading, the first call to recv()
>>> blocks. If I kill the server of this connection, recv() returns with the
>>> correct error code. See the stack trace below for more details.
>>> Does anyone else experience similar problems? I suspect this might be a
>>> bug in some network driver (Intel) but I'm not sure.
>> If there are any bugs in Chromium's packer/unpacker code, a common
>> symptom is for the network layer to get stuck in recv(), waiting for
>> bytes that aren't coming.
>
> I don't think it's a bug in Chromium, since the block occurs on the first
> call to recv() after select() reported the socket to be readable. That
> is, crTCPIPRecv() calls select(), finds the socket to be readable, and
> calls crTCPIPReceiveMessage(), which blocks right away on the statement
>
>   if ( __tcpip_read_exact( sock, &len, sizeof(len)) <= 0 )
>
> So this really shouldn't block, should it?
>
>> Is the lock-up only happening with certain apps? Do those apps work
>> OK on other Chromium systems?
>
> It does happen with at least a couple of apps I tried, one of which is
> atlantis. I didn't test on other systems, though.

OK, some news on this. It only happens on one machine, and on that machine only when running Windows; it does not happen on Linux. Even after switching to a different NIC from a different vendor, with a different driver, the problem remained. I discussed the issue on microsoft.public.win32.programmer.networks. Someone recommended using non-blocking sockets instead of blocking sockets because 'select is typically not used with blocking sockets'.
So I hacked tcpip.c so that it now uses non-blocking sockets. I'm not sure whether this is of interest to anyone else, but just in case, here is the patch.

Michael
|
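The contents of patch2.patch are not reproduced here. As a rough illustration of the non-blocking approach only (POSIX assumed; `set_nonblocking` and `read_exact_nb` are hypothetical names), the socket is put into O_NONBLOCK mode with fcntl(), and an EAGAIN/EWOULDBLOCK result from recv() sends the reader back into select() instead of leaving the thread stuck inside recv(). On Winsock, the corresponding calls would be ioctlsocket() with FIONBIO and a WSAEWOULDBLOCK check via WSAGetLastError().

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Switch a socket to non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int sock)
{
    int flags = fcntl(sock, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(sock, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Read exactly len bytes from a non-blocking socket. If recv() reports
 * EAGAIN/EWOULDBLOCK (for instance after a spurious "readable" report),
 * wait in select() again rather than hanging in recv().
 * Returns len on success, 0 on peer close, -1 on error. */
ssize_t read_exact_nb(int sock, void *buf, size_t len)
{
    char *p = buf;
    size_t got = 0;

    while (got < len) {
        ssize_t n = recv(sock, p + got, len - got, 0);
        if (n > 0) {
            got += (size_t)n;
            continue;
        }
        if (n == 0)
            return 0;                      /* peer closed the connection */
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                     /* genuine socket error */

        /* No data yet: block in select() until the socket is readable. */
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);
        if (select(sock + 1, &rfds, NULL, NULL, NULL) < 0 && errno != EINTR)
            return -1;
    }
    return (ssize_t)got;
}
```

The cost is an extra select() round-trip whenever readiness turns out to be spurious; the benefit is that such a report can no longer wedge the receiving thread.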
From: Samuel T. <sam...@en...> - 2006-11-23 15:10:36
|
On Thu 23 Nov 2006 14:33:47 +0100, Michael Dürig wrote:
> Someone recommended using non-blocking sockets instead of blocking
> sockets because 'select is typically not used with blocking
> sockets'.

Mmm, they have a strange notion of select() then. select() is particularly useful with blocking sockets precisely because it lets you know whether recv() will block or not. If that's not the case, it's a bug in the select() implementation.

Samuel
|
From: <dr...@bf...> - 2006-11-23 17:30:32
|
Samuel Thibault wrote:
> On Thu 23 Nov 2006 14:33:47 +0100, Michael Dürig wrote:
>> Someone recommended using non-blocking sockets instead of blocking
>> sockets because 'select is typically not used with blocking
>> sockets'.
>
> Mmm, they have a strange notion of select() then. select() is
> particularly useful with blocking sockets precisely because it lets you
> know whether recv() will block or not. If that's not the case,
> it's a bug in the select() implementation.

Yes, that's what I thought. But for the moment it seemed easier to hack around the bug than to get MS to fix it.

Michael
|