From: Leif M. <lei...@ta...> - 2009-05-25 04:57:32
Santo,

We have been doing some more testing, following all of your steps with "root" zones, but everything is working as expected.

Rereading your email today, however, I think I may know what the problem is. The error you show in #4:

---
INFO | jvm 2 | 2009/05/20 12:12:34 | java.net.SocketException: Address already in use
---

This is a bug in the Solaris versions of the Wrapper that was fixed in version 3.3.0. It actually has nothing to do with zones and should happen on any Solaris system if you start one copy of the Wrapper, stop it, and then immediately start a new copy. The first copy binds to port 31000, and once it stops the system keeps that port in the TIME_WAIT state for two minutes. During that time, the second instance of the Wrapper was not correctly recognizing the cause of the SocketException that was thrown.

On many platforms a BindException is thrown. But Solaris throws a SocketException, which could mean anything. It is necessary to check the text of the message to see if it is a bind problem. That text starts with "errno: 48" for older JVMs, but is "Address already in use" on newer ones.

The bug can be found here. It also starts out thinking it is a zone problem:
https://sourceforge.net/tracker/?func=detail&aid=1594073&group_id=39428&atid=425187

Could you please give this a try with the 3.3.5 release?
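
For reference, the message check described above could look roughly like the following in Java. This is only a sketch and not the Wrapper's actual code: the class and method names are made up for illustration, and matching the newer text with contains() is an assumption that allows for a suffix after "Address already in use".

```java
import java.net.BindException;
import java.net.SocketException;

// Hypothetical helper, for illustration only (not part of the Wrapper).
class BindFailureCheck {
    /** Returns true if the exception looks like an "address already in use" bind failure. */
    static boolean isBindFailure(SocketException e) {
        // Many platforms throw the more specific BindException directly.
        if (e instanceof BindException) {
            return true;
        }
        // A plain SocketException only identifies itself through its message.
        String message = e.getMessage();
        if (message == null) {
            return false;
        }
        // Older JVMs: text starts with "errno: 48". Newer JVMs: "Address already in use".
        return message.startsWith("errno: 48")
                || message.contains("Address already in use");
    }

    public static void main(String[] args) {
        System.out.println(isBindFailure(new SocketException("Address already in use")));
        System.out.println(isBindFailure(new SocketException("Connection reset")));
    }
}
```

A caller that gets true back can move on to the next port instead of treating the exception as fatal.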
Thanks,
Leif

On Wed, May 20, 2009 at 10:25 PM, Santo74 <gds...@de...> wrote:
>
> Leif,
>
> In the meantime our solaris system is up and running again (a sparc system
> by the way)
> and we already did some new tests with the wrapper and could reproduce the
> following:
>
> zone1 runs 2 wrappers (our application consists of 2 separate components and
> on some systems they both need to run, hence the 2 wrapper instances)
> The default port ranges are used and everything runs as expected:
> 2 connections between wrapper and jvm, 32000 - 31000 and 32001 - 31001
>
> localhost.31000      localhost.32000      49152      0 49152      0 ESTABLISHED
> localhost.32000      localhost.31000      49152      0 49170      0 ESTABLISHED
> localhost.31001      localhost.32001      49152      0 49152      0 ESTABLISHED
> localhost.32001      localhost.31001      49152      0 49170      0 ESTABLISHED
>
> We will keep this zone1 running as is and test some scenarios on zone2:
>
> 1) define an explicit port range for the wrapper (something outside the 32000
>    range) and start it.
>    This works:
>
> localhost.31000      localhost.42700      49152      0 49152      0 ESTABLISHED
> localhost.42700      localhost.31000      49152      0 49170      0 ESTABLISHED
>
> 2) define an explicit port range for the wrapper (within the 32000 range,
>    but excluding the 2 ports in use on zone 1 (i.e. 32000 and 32001)).
>    This works:
>
> localhost.31000      localhost.32002      49152      0 49152      0 ESTABLISHED
> localhost.32002      localhost.31000      49152      0 49170      0 ESTABLISHED
>
> 3) remove the explicit port range and restart with the default settings.
>    This surprisingly also works
>    (apparently because the previous jvm port is in TIME_WAIT, which causes
>    the jvm to use another port (which is however also in use on zone1)):
>
> localhost.31000      localhost.42700      49170      0 49152      0 TIME_WAIT
> localhost.31001      localhost.32000      49152      0 49152      0 ESTABLISHED
> localhost.32000      localhost.31001      49152      0 49170      0 ESTABLISHED
>
> 4) restart again without changing anything (i.e. again use all the defaults).
>    Doesn't work this time (jvm tries to use port 31000 again):
>
> INFO | jvm 2 | 2009/05/20 12:12:34 | java.net.SocketException: Address already in use
>
> 5) define an explicit port range for the jvm and start it
>    Doesn't work either
>
> -> There is only 1 situation where we were able to start a wrapper on a
> second zone, using the same ports as on the first zone.
> Strangely enough this is caused by the fact that the previously allocated
> jvm port is in a TIME_WAIT state.
> At least, that's the only explanation we have for this.
>
> We also verified the type of zones configured on our test server and it
> appears that our server is using "root" zones.
> Therefore I asked one of our consultants to verify the type of zones used at
> one of our customers.
> If they also use "root" zones, it might be that this has something to do
> with the issue.
> Unfortunately because of the long weekend it will take at least until Monday
> before we have any news on this.
>
> Another interesting piece of info that we found is the following:
>
> -----
> On a Solaris system with zones installed, the zones can communicate with
> each other over the network. The zones all have separate bindings, or
> connections, and the zones can all run their own server daemons. These
> daemons can listen on the same port numbers without any conflict.
> The IP stack resolves conflicts by considering the IP addresses for incoming
> connections. The IP addresses identify the zone.
> -----
>
> Which would mean that solaris should take care of the port conflicts if they
> arise.
> And it seems to do this in case the initial port is in TIME_WAIT, but not in
> "normal" cases.
>
> regards,
>
> gds
>
>
> Leif Mortenson-3 wrote:
>>
>> Santo,
>> We created "sparse" zones rather than "root" zones because they share
>> much of the file system with the underlying OS. Our thinking was that
>> this would be more likely to show any resource conflicts.
>>
>> I agree that there are likely some configuration differences between
>> your systems and ours. We have been actively attempting to locate the
>> cause of this problem, but any information that you could provide
>> would be very helpful in narrowing this down.
>>
>> Our tests are being run on an x86 server within a virtual machine. We
>> also have one Sparc server. But that is running Solaris 9 natively
>> and is being used for our build process. If possible, we would like
>> to avoid reinstalling that with Solaris 10 as we would need to restore
>> it later. I would be very surprised if a problem like this would work
>> differently on x86 vs Sparc however.
>>
>> Cheers,
>> Leif
>>
>> On Wed, May 20, 2009 at 5:50 PM, Santo74 <gds...@de...> wrote:
>>>
>>> Leif,
>>>
>>> This is very strange, because we haven't come across any solaris 10
>>> environment (with zones) not having this issue.
>>> Therefore it indeed looks like some configuration differences (or
>>> something) in comparison with your system.
>>> However, I still find it strange that other applications (not using the
>>> service wrapper) are not having this issue on the same zones.
>>> As for the security, it's true that our application runs under a
>>> dedicated user account (and therefore doesn't have full (root)
>>> privileges), but the IBM Tivoli Policy Server (which I mentioned before)
>>> is also running under a dedicated account (with limited privileges) as
>>> far as I know.
>>>
>>> This morning I heard that most of the problems with our solaris server
>>> should be solved later today, which means that I can hopefully start
>>> testing again next Monday (long weekend over here).
>>>
>>> Thanks,
>>>
>>> gds
>>>
>>>
>>>
>>> Leif Mortenson-3 wrote:
>>>>
>>>> Santo,
>>>> We have done some tests with a server configured with 3 Zones as well
>>>> as done some more research.
>>>>
>>>> It does not appear to be possible to have multiple Zones "share" an IP address.
>>>> So they will each have their own IP. For that reason, as I understand
>>>> it, there should be no reason why any of the Zones would ever have any
>>>> conflict with bound ports.
>>>>
>>>> Below you will find the netstat output from 3 Zones on the same
>>>> machine each running a copy of the Wrapper. Each has an SSH
>>>> connection to the Zone as well as the two between the Wrapper and its
>>>> JVM. In all cases, the port numbers are the same.
>>>>
>>>> Because you have had reports from a few customers, I am sure that
>>>> "something" is happening. But from the information to date, I am not
>>>> sure what the cause might be. Is it possible that there are some
>>>> security configurations set up on one or more of the Zones that would
>>>> prevent the Wrapper from starting?
>>>>
>>>> The Wrapper will loop over its 1000 possible ports looking for the
>>>> first one that it is able to bind to. If all 1000 fail to bind then
>>>> it reports that fact to the user. Rather than all 1000 ports
>>>> actually being already bound, it may be that the OS is refusing to
>>>> allow the Wrapper to bind to those ports for security reasons?
>>>>
>>>> Anyway, here is the netstat output from our 3 Zones.
>>>>
>>>> ---
>>>> jupiter:
>>>> TCP: IPv4
>>>> Local Address        Remote Address       Swind Send-Q Rwind Recv-Q State
>>>> -------------------- -------------------- ----- ------ ----- ------ -----------
>>>> jupiter.22           192.168.0.128.59013  18816      0 49232      0 ESTABLISHED
>>>> localhost.31000      localhost.32000      49152      0 49152      0 ESTABLISHED
>>>> localhost.32000      localhost.31000      49152      0 49170      0 ESTABLISHED
>>>>
>>>> Active UNIX domain sockets
>>>> Address          Type       Vnode            Conn             Local Addr      Remote Addr
>>>> ffffffff889688f8 stream-ord 00000000         ffffffff89c8bac0 /tmp/.X11-unix/X0
>>>> ffffffff88968ac0 stream-ord 00000000         00000000         /tmp/.X11-unix/X0
>>>> ffffffff87a38728 stream-ord 00000000         00000000         /tmp/.X11-unix/X0
>>>> ffffffff88968730 stream-ord 00000000         ffffffff89c8bac0 /tmp/.X11-unix/X0
>>>> ffffffff87a38560 stream-ord 00000000         ffffffff89c8bac0 /tmp/.X11-unix/X0
>>>> ffffffff87a38008 stream-ord ffffffff8852b780 00000000         /var/run/zones/kore.console_sock
>>>> ffffffff88968c88 stream-ord 00000000         00000000         /tmp/.X11-unix/X0
>>>> ffffffff87a38398 stream-ord ffffffff89c8bac0 00000000         /tmp/.X11-unix/X0
>>>> ffffffff87a38ab8 stream-ord ffffffff882b5880 00000000         /var/run/zones/europa.console_sock
>>>> ffffffff87a38c80 stream-ord ffffffff87a3d740 00000000         /var/run/.inetd.uds
>>>>
>>>> europa:
>>>> TCP: IPv4
>>>> Local Address        Remote Address       Swind Send-Q Rwind Recv-Q State
>>>> -------------------- -------------------- ----- ------ ----- ------ -----------
>>>> europa.22            192.168.0.128.55040  13440      0 49232      0 ESTABLISHED
>>>> localhost.31000      localhost.32000      49152      0 49152      0 ESTABLISHED
>>>> localhost.32000      localhost.31000      49152      0 49170      0 ESTABLISHED
>>>>
>>>> Active UNIX domain sockets
>>>> Address          Type       Vnode            Conn             Local Addr      Remote Addr
>>>> ffffffff87a381d0 stream-ord ffffffff87de6740 00000000         /var/run/.inetd.uds
>>>>
>>>> kore:
>>>> TCP: IPv4
>>>> Local Address        Remote Address       Swind Send-Q Rwind Recv-Q State
>>>> -------------------- -------------------- ----- ------ ----- ------ -----------
>>>> kore.22              192.168.0.128.56248  17664      0 49232      0 ESTABLISHED
>>>> localhost.31000      localhost.32000      49152      0 49152      0 ESTABLISHED
>>>> localhost.32000      localhost.31000      49152      0 49170      0 ESTABLISHED
>>>>
>>>> Active UNIX domain sockets
>>>> Address          Type       Vnode            Conn             Local Addr      Remote Addr
>>>> ffffffff87a388f0 stream-ord ffffffff8aab4600 00000000         /var/run/.inetd.uds
>>>> ---
>>>>
>>>> We will keep poking around, but please let me know if you are able to
>>>> collect any more information.
>>>>
>>>> Cheers,
>>>> Leif
>>>>
>>>>
>>>> On Mon, May 18, 2009 at 8:14 PM, Santo74 <gds...@de...> wrote:
>>>>>
>>>>> Leif,
>>>>>
>>>>> Regarding the issue of restarting the application after it crashed or
>>>>> was forcibly killed, I will keep an eye on it and report back with more
>>>>> info whenever it should happen again.
>>>>> It is indeed not the behaviour that I would expect from the wrapper,
>>>>> especially now that you confirmed that it isn't allocating the whole
>>>>> range.
>>>>>
>>>>> As you already mentioned correctly, I won't be able yet to verify if I
>>>>> can start the app on a second zone after having stopped the app on the
>>>>> first zone for at least 2 min.
>>>>>
>>>>> Concerning your last question: our dev/test system is currently
>>>>> configured with 5 zones, all using their own ip address.
>>>>> I have no idea about the configuration of the solaris systems at our
>>>>> customers.
>>>>>
>>>>> regards,
>>>>>
>>>>> gds
>>>>>
>>>>>
>>>>> Leif Mortenson-2 wrote:
>>>>>>
>>>>>> Santo,
>>>>>> We are in the process of setting up a Solaris 10 server to do some
>>>>>> testing with Zones in house. We have a Solaris 9 server, but our
>>>>>> Solaris 10 testing has been done IN a zone on Sun's EZqual loaner
>>>>>> server. I will let you know what we find.
>>>>>>
>>>>>> As I explained, the Wrapper never actually attempts to allocate all
>>>>>> 1000 ports unless they are already blocked. If the first instance of
>>>>>> your application uses ports 32000 and 31000 and that crashes, it is
>>>>>> possible that the 32000 port will be locked for 2 minutes so the
>>>>>> second invocation of the JVM would use 32001 and 31000. But the
>>>>>> other 999 ports would have never been accessed so I can imagine no
>>>>>> reason why they would be locked.
>>>>>>
>>>>>> In your case with Solaris Zones, you say that the Wrapper can not
>>>>>> start on the second Zone when one is running on the first. Are you
>>>>>> able to verify that the wrapper on the second Zone works if the first
>>>>>> had not been running for at least 2 minutes? I am wondering if it is
>>>>>> a configuration issue.
>>>>>>
>>>>>> We will be able to test this shortly ourselves. And it doesn't sound
>>>>>> like you will be able to test it until your system is back up and
>>>>>> running.
>>>>>>
>>>>>> Sorry for this next question as it may show my lack of knowledge with
>>>>>> Solaris Zones:
>>>>>> With your system, are both Zones sharing the same IP address? If so,
>>>>>> they should not be able to share ports on that IP. In this case
>>>>>> however, we are only binding to localhost, so it should not matter.
>>>>>>
>>>>>> I will post back as soon as we have gotten this tested out.
>>>>>>
>>>>>> Cheers,
>>>>>> Leif
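
The port search described in the quoted messages above (try each port in the range in turn, stop at the first one that binds, and report a failure only if every port is refused) can be sketched in Java as follows. This is an illustration under assumptions, not the Wrapper's implementation: the class and method names are hypothetical, and a range such as 32000 to 32999 is inferred from the "32000 range" and "1000 possible ports" mentioned in the thread.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical helper, for illustration only (not part of the Wrapper).
class PortScan {
    /**
     * Tries each port in [first, last] on the loopback address and returns a
     * socket bound to the first one that is free, or null if none could be bound.
     */
    static ServerSocket bindFirstFree(int first, int last) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        for (int port = first; port <= last; port++) {
            try {
                // Succeeds only if nothing else holds this port on loopback.
                return new ServerSocket(port, 1, loopback);
            } catch (IOException e) {
                // Port is in use or refused by the OS; fall through to the next one.
            }
        }
        return null; // every port in the range was refused
    }

    public static void main(String[] args) throws IOException {
        ServerSocket socket = bindFirstFree(32000, 32999);
        if (socket == null) {
            System.out.println("No port in the range could be bound");
        } else {
            System.out.println("Bound to port " + socket.getLocalPort());
            socket.close();
        }
    }
}
```

Because the loop stops at the first success, ports later in the range are never touched unless the earlier ones are refused, which matches the point made above that the other 999 ports are normally never accessed.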