With git.50b74e I've been seeing fairly regular client freezes. When they happen, I'm seeing (on the console) errors that suggest XML parse errors. I've attached the log file. You can see the event at "WARNING: XML fail". I have not included the save game because it is not reproducible. It does happen regularly, just not in any reproducible way. It has all the earmarks of a race condition somewhere.
Ugh. These are hard to find. Can you suggest a git revision that did not have the problem?
Yeah. I hate these types of things, too. I looked hard to see if I could find a reproducible case. It does seem to trigger more often with a busier game but the game itself doesn't seem to matter. It seems to happen most often at the start of a turn. Maybe it's related to the size of the overall message from the server?
I don't have a specific revision where the problem appeared. However, the problem appeared some time in the past week or so. I don't recall any problems like this from two weeks ago.
GIven the variation in the error messages (it's not always the same XML tag in the error), it looks like it's related to the communication between the server and client.
I've also noticed another issue that may be related. I mention it here in case it helps. Over the same time frame, I've had colonies suddenly become unclickable (can't open the colony screen for them at all). This is also not reproducible. In fact, a reconnect resolves the problem in this case. This seems to trigger an NPE but I don't have a log file handy for that. I could see something like this happening due to XML corruption, too. This seems to happen more frequently than the XML fail thing. It seems to present in most cases by showing colonies with size 0 on the map. Sometimes taking an action on the tile with the colony fixes it (presumably because the colony information is re-learned from the server).
Quite credibly. That would certainly open a bigger window for a race.
Perhaps you could try running with 0.11.6-release and see if it is found there?
The Colony panel is the most complex part of the UI, and it is often a canary for other trouble. A log file with the crash would be good, although it may well be a secondary effect of this race.
That one I have seen, and indeed, thought I had fixed with git.1c641ea. If you are past there and still see 0-size colonies, that is very likely a serious serialization bug which deserves its own report. OTOH, it is not impossible that if we fix that problem, the race here could go away.
I gave that a try. Neither the XML failure nor the unclickable zero sized colonies problem showed up. I played long enough that one or the other should have showed up if it was going to based on the frequency I've been seeing it.
Last edit: William Astle 2015-11-10
Cool. I am relieved that it must then be something we have introduced since the release.
OK, two weeks ago would be around 27 Oct. That was about the time I put in some big changes (enhanced trade routes) which would be my best guess ATM as to where things went wrong (in git.d8047bf, 91f54ec). I have not see the possible-race yet but have had little chance to play, however I definitely saw the 0-size colonies on 4 Nov (and tried to fix them with git.1c641ea). So the interesting tests would be whether you see (either) bug before git.d8047bf or after git.1c641ea.
I saw the zero size colony thing appear with git.1c641ea. The XML fail one didn't show up after quite a long time though that isn't as conclusive as I'd like. However, git.50bf74 (which it turns out is the actual revision I was using for the original report - stupid typos) is after that point, too, at least date-wise.
I didn't see either problem with git.76311fe though my git-fu is very weak so I may not have had a clean tree. (I apparently do not think the right way to actually understand the git way of doing things.)
We have all been there. "git checkout 76311fe" works for me. Good that we have a baseline then.
OK, so we have the interval (76311fe .. 50bf74] for both bugs. Time to look for a reproducible case and use "git bisect".
I have played about a two hours of games in various stages and been unable to reproduce. Perhaps the extra tracing I have enabled has suppressed the race. Do we have any more sightings using a git version in the above range?
I haven't had a chance to play much lately. However, I have observed that it happens more often with "large" games. I've attached a save game where I have fairly regularly run into problems. Maybe it will help.
I did see it show up with the current trunk (63cb1fa) using this save game. In this case, the zero size colony one. The log shows some things like "Fixing bogus unit location". I've attached the log as well.
OK, the log is interesting indeed. Nasty concurrent modification crash there, which I am testing a fix for... not convinced it is the cause of the race, but it is definitely a good one to find and eliminate. Then, I think you are right, and the "bogus unit" messages are an indicator of the misbehaviour underlying the 0-size colonies. So I will chase that after the prior fix lands. After that... lets hope the race is gone.
The concurrent modification should now be prevented (git.2c38e06), one possible cause of the "bogus unit" problem has been fixed (git.cda603f), and extra logging has been added (git.c075fb2). Please attach log files following 0-sized colony sightings with trunk.
Had it happen immediately with the attached save game. Two colonies showed up as zero sized at the start of the next turn. I've attached the log and screen shots showing the 0 sized colonies.
Alas, not for me. This is exasperatingly difficult to reproduce. git.2749fa7 adds yet more logging...
Here's another one, this time with "FINEST" logging. It took a few turns from the save save game as the previous one but eventually Roanoke went zero sized. This one includes the NPE when trying to open Roanoke.
Interestingly, moving a nearby unit onto the Roanoke tile fixed the zero size problem, presumably because the client got corrected information from the server as a result.
I haven't seen any hard freezes lately but those were much rarer.
I managed to trigger it again (with git.2749fa7). It took a lot longer but the log is attached. I also got the following on the console though it may not be related:
<colony id="user-content-colony:12219" owner="player:1" tile="tile:1383" settlementtype="model.settlement.colony" name="Roanoke" established="74" sonsofliberty="100" oldsonsofliberty="100" tories="0" oldtories="0" liberty="9069" immigration="230" productionbonus="2"><goodscontainer id="user-content-goodsContainer:12220"><storedgoods><goods type="model.goods.bells" amount="9069"></goods><goods type="model.goods.crosses" amount="230"></goods><goods type="model.goods.food" amount="164"></goods><goods type="model.goods.furs" amount="173"></goods><goods type="model.goods.hammers" amount="96"></goods><goods type="model.goods.horses" amount="212"></goods><goods type="model.goods.lumber" amount="120"></goods><goods type="model.goods.muskets" amount="50"></goods><goods type="model.goods.ore" amount="24"></goods><goods type="model.goods.sugar" amount="43"></goods><goods type="model.goods.tobacco" amount="12"></goods><goods type="model.goods.tools" amount="98"></goods></storedgoods><oldstoredgoods><goods type="model.goods.bells" amount="8939"></goods><goods type="model.goods.crosses" amount="229"></goods><goods type="model.goods.food" amount="163"></goods><goods type="model.goods.furs" amount="173"></goods><goods type="model.goods.hammers" amount="48"></goods><goods type="model.goods.horses" amount="212"></goods><goods type="model.goods.muskets" amount="45"></goods><goods type="model.goods.ore" amount="12"></goods><goods type="model.goods.sugar" amount="43"></goods><goods type="model.goods.tobacco" amount="6"></goods><goods type="model.goods.tools" amount="81"></goods></oldstoredgoods></goodscontainer><ability id="user-content-model.ability.coastalOnly" value="false"></ability><building id="user-content-building:13305" colony="colony:12219" buildingtype="model.building.fortress"></building></colony>
The NPE is expected once the colony goes zero sized. The console output was to check that the colony is empty. Thanks for posting that, it confirms that the problem can be traced to a bad update from the server. I will take out the output soon.
Moving a unit into the colony is exactly what I now suspect is causing the problem. The problem is alas intermittant. You are correct that it will trigger a corrective update... if it does not break.
I have cleaned up a bit of dodgy code in git.ea2265a, but have little expectation we have got to the bottom of this. The log file shows that the server is writing animation updates to the client using the privileges of the Sioux player but for updates to all players --- the Sioux can not see inside your colony, so the update lacks units and thus the colony size becomes zero. This is utterly puzzling, and AFAICT unrelated to any recent changes. I suspect more logging will be needed...
Triggered again with git.2aec5af. I've attached the log. In this case, it was Newport that went zero sized.
Actually finally managed to reproduce the race today. But only on one machine that I can only access occasionally. More logging went into git.0fef903.
Finally! Overnight testing turned up a useful crash, which gave me a clue as to what might be behind this. I would be especially interested in hearing of more sightings of this problem with trunk post git.a5eeb13.
I haven't seen client hangs (the original problem reported in this bug) recently. I've had a number of other weird things show up but none of them have caused a client hang. If I can figure out a test case or even a sensible description of the other weirdnesses, I'll file separate bug reports.
I have seen zero size colonies popping up but they tend to repair relatively quickly on their own (immediately or after the end of turn) or on a client reconnect. I have't been able to figure out a good trigger for them but it seems to happen most often if I do some sort of input while the client is already performing actions, and it tends to trigger most often on "big" games with lots of units or colonies.
OK, the zero sized colony issue was recently finally tracked down with a reproducable test case (BR#2982). I am hoping we are all done here, as I have not seen the XML problem for quite some time.
Time to move this to Pending. I think between the DOM removal and the BR#2982 fix the zero sized colonies have been significantly reduced. I suspect the AI is still doing it in some strange cases, but I think you are right and that these are auto-repairing quickly.