Hi Tim, I got your message (on my phone though). You're probably at work now. Do you have AIM access there? If so, sign on! :)
While trying to get logging to work, I think I found another bug. We're getting type cast exceptions in the selfMaint() method as we look up the expiration time. TimeSpan.TotalMilliseconds returns a double rather than a long. One way to correct this is to simply store DateTime.Now as the value of the socket timeout instead of converting it to a long/double. This will make the code a little more readable and shouldn't be a performance hit (DateTime is a value type anyway that just holds the number of ticks).
I'm not certain what the extent of this bug is or what the symptoms are, but seemingly the result is that sockets aren't closing when they should, since the exception breaks out of the loop immediately.
Also, does logging for your simple benchmark project work for you? I'm not sure why, but I can't get any logging whatsoever. I've mostly used NLog and never log4net, but I've tried a few different app.config examples and still no luck. Any ideas?
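To make the DateTime suggestion above concrete, here is a minimal sketch. It is not the library's actual selfMaint() code: the class name, method name, and the dictionary of last-used timestamps are all hypothetical stand-ins. It only shows that comparing DateTime values directly avoids any cast from TotalMilliseconds.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the pool's socket table; not the library's real types.
public static class IdleSocketTracker
{
    // Returns the keys whose last-used timestamp is older than maxIdle.
    public static List<string> FindExpired(
        IDictionary<string, DateTime> lastUsed, DateTime now, TimeSpan maxIdle)
    {
        var expired = new List<string>();
        foreach (KeyValuePair<string, DateTime> entry in lastUsed)
        {
            // DateTime - DateTime yields a TimeSpan, so no long/double cast is needed.
            if (now - entry.Value > maxIdle)
            {
                expired.Add(entry.Key);
            }
        }
        return expired;
    }
}
```

Passing the current time in as a parameter (rather than reading DateTime.Now inside the method) is just a choice that makes the check easy to exercise with fixed timestamps.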
Yes, I noticed the exceptions as well. I've broadly narrowed (…?) it down to the TCP connections timing out. I would really like to squash that if I could.
As time frees up here are my three development tasks that I would like to tackle:
-Clean up the API (the longer it goes on like this the more of a pain it will be to change later)
-Fix those exceptions that occur when the sockets time out
-Make an App.config section handler so we can configure all of the options for the client API in the App or Web.config file (I just thought of this one today).
And wow, that's quite a setup you have there. Ours isn't quite as big as that right now, but there is a good possibility that we'll need to scale out that big, so we're planning ahead :)
As for an explanation on how the client handles failover:
The socket pool keeps persistent connections to each of the memcached servers. When the client loses a connection to one, it tries to reconnect after the reconnect timeout. If it doesn't connect, it doubles the timeout and tries again. It keeps doubling the timeout every time it fails to connect. If it ever reconnects, the timeout length is reset. This works really well because if there is a slight network "hiccup" you'll reconnect very quickly, but if the node goes down for a very long time the client quickly ignores it. The failover code could use a little more work, and it's all related to those exceptions. There isn't really any "redundancy", only failover. But the failover process works pretty well and it's how the other clients I looked at handle it.
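The doubling behaviour described above can be sketched roughly like this. The class and member names are hypothetical (they are not the pool's real API), and the sketch assumes no upper cap on the delay, which the actual client may or may not have.

```csharp
using System;

// Hypothetical helper, not the library's API: tracks the reconnect delay for one server.
public class ReconnectBackoff
{
    private readonly TimeSpan initialDelay;
    private TimeSpan currentDelay;

    public ReconnectBackoff(TimeSpan initialDelay)
    {
        this.initialDelay = initialDelay;
        this.currentDelay = initialDelay;
    }

    // How long to wait before the next reconnect attempt.
    public TimeSpan CurrentDelay
    {
        get { return currentDelay; }
    }

    // Call after a failed connection attempt: the next wait is twice as long.
    public void RecordFailure()
    {
        currentDelay = TimeSpan.FromTicks(currentDelay.Ticks * 2);
    }

    // Call after a successful reconnect: the delay goes back to its initial value.
    public void RecordSuccess()
    {
        currentDelay = initialDelay;
    }
}
```

With a 1 second initial delay, three failures in a row give waits of 2, 4, and 8 seconds, and one successful reconnect drops it back to 1 second.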
Are you familiar with how memcached works? It's really just a hashtable of hashtables. The first level of hashing decides "what server does this go on?" then the next level of hashing happens on the server and it says "where does this item go in my hashtable?" Since all the clients use the same hashing algorithm they all end up coming up with the same values. That's one thing I had to explain to some of the devs on my team. It's only a cache, it's not a persistent store. If one of the nodes goes down, there will be a small hiccup in our web application because most of the stuff will have to be re-cached.
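As a rough illustration of the first level of hashing ("what server does this go on?"), here is a sketch. The FNV-1a hash and the simple modulo are assumptions picked for the example because they are deterministic across processes; they are not necessarily what this client or any other memcached client uses. ServerSelector and PickServer are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical example of client-side server selection; not the library's code.
public static class ServerSelector
{
    // FNV-1a 32-bit hash, used here only because it gives the same value on every client.
    public static uint Hash(string key)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in Encoding.UTF8.GetBytes(key))
            {
                hash ^= b;
                hash *= 16777619;
            }
            return hash;
        }
    }

    // Maps a key to one server in the list. Every client with the same
    // list (in the same order) picks the same server for the same key.
    public static string PickServer(string key, string[] servers)
    {
        if (servers == null || servers.Length == 0)
        {
            throw new ArgumentException("At least one server is required.", "servers");
        }
        return servers[(int)(Hash(key) % (uint)servers.Length)];
    }
}
```

The second level of hashing happens inside the memcached server itself, so the client never sees it.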
Unfortunately I haven't had time to tune and performance test our cache yet, so we're just using the default values right now for our limited beta test of our system. We're a pretty small development shop and I'm pretty pressed for time just trying to add features and squash bugs in the application. Unfortunately the users can't see how cool memcached is, they only see how cool the interface for our application looks and that gets priority.
I'm copying this email to the developer mailing list. I'll IM you my AIM screenname.
Have a great weekend and drink up!
This sounds great. It's 'alpha' but it's working great for us so far. We've been stable with our patched client for a few days now running on 25+ web servers with 10 memcache boxes with just a small handful of unhandled exceptions.
I've recently reimaged our machines from a 2.4 kernel to an SMP 2.6 kernel. Load on each box went from ~85% to less than 5%! This was extremely encouraging and paves the way to more extensive use of memcache throughout our site.
Go ahead and post the bugs and emails--Anything that gets more people to download the library and get more eyes on the code would be great.
I'd like to gain a bit more insight into some aspects of the client. One question I get frequently is about redundancy. From looking at the code, it seems like if a node goes down then it's marked as down and the next server on the list is used instead. Since every client performs the same check and goes on to the same next server, there's barely a performance hit. Then the server is checked periodically (at increasing time intervals?) until it comes back up. Is this about right? Can you describe this process any further? Do you have some ideas for improving it?
What do you think is a good value to use for max idle connections? I currently have roughly 2000 simultaneous connections to each memcache node and I'm pretty sure that most of these are idle sockets in client pools, though I'm not certain how many are actually used. It'd be cool to be able to see various client stats to get more transparency into things. I think the Perl client does something like it already. I could take a closer look at this next week and get some more concrete ideas together.
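The "next server on the list" behaviour, as I read it, might look something like the sketch below. This is only a guess at the idea: Failover and PickLiveServer are hypothetical names, and the set of down servers is assumed to be maintained elsewhere.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of skipping servers that are marked down; not the client's code.
public static class Failover
{
    // Starting at the server the key hashed to, walk the list (wrapping around)
    // and return the first server not marked down, or null if all are down.
    public static string PickLiveServer(int startIndex, string[] servers, ICollection<string> down)
    {
        for (int offset = 0; offset < servers.Length; offset++)
        {
            string candidate = servers[(startIndex + offset) % servers.Length];
            if (!down.Contains(candidate))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

Because every client walks the same list in the same order, they all land on the same substitute server for a given key.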
This email is getting long... are you on aim or any im? i'm **** on aim or msn: ****
gonna go out drinking.. ttyl! :)
I made the changes to the library and re-uploaded the binaries. I also have two sets of project/solutions, one for VS.NET 2003 and .NET 1.1 and one for VS.NET 2005 and .NET 2.0. They both run off the exact same source files (although I think I may make the 2.0 version work with the native GZIP stuff new in the 2.0 framework at some point).
Keep in mind though that technically the project is still in alpha. Mainly because the API isn't very clean. I would like to clean it up to adhere more to the .NET coding standards. It would be pretty easy to change any code (mostly just going from lowercase stuff to uppercase and small stuff like that), but I'll make sure to put any changes in the changelog.
p.s. Would you mind if I copied this message to the development mailing list for the project so that it looks like there is some activity? It might help us out if some other people have an indication that something is going on with this project.
I'll see if I can send diffs later on when I have a chance. For the first bug, I changed the sleep in the catch block to this:
Thread.Sleep((int)interval * 10);
and the while loop changed to this:
while ((count = gzi.Read(tmp, 0, 2048)) != 0)
Please confirm that -1 really is never returned (or why it would be), but we've had no problems so far.
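For reference, .NET's Stream.Read is documented to return 0 at the end of the stream rather than -1 (the -1 convention comes from Java's InputStream.read). Here is a small self-contained sketch of the read loop. It uses System.IO.Compression.GZipStream from .NET 2.0 as a stand-in, on the assumption that the gzip stream the client actually uses follows the same Stream.Read contract; GzipHelper is a hypothetical name.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper showing a read loop that terminates on a 0 return value.
public static class GzipHelper
{
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var gz = new GZipStream(output, CompressionMode.Compress))
            {
                gz.Write(data, 0, data.Length);
            }
            // ToArray still works after the stream has been closed.
            return output.ToArray();
        }
    }

    public static byte[] Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzi = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            byte[] tmp = new byte[2048];
            int count;
            // Read returns 0 at end of stream, so "> 0" exits the loop
            // (and would also exit if some stream did return -1).
            while ((count = gzi.Read(tmp, 0, tmp.Length)) > 0)
            {
                output.Write(tmp, 0, count);
            }
            return output.ToArray();
        }
    }
}
```

Testing for "> 0" instead of "!= 0" is a small defensive choice: it covers both conventions without spinning.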
Regarding 2.0, we are using it for some projects but not for those using this client so it'd be great if you can maintain two branches if you start using 2.0-only stuff.
We're using this pretty heavily at this point and I'll be sure to let you know as soon as we find new issues.
Hey thanks for the extra set of eyes. Yes, the nanoseconds stuff caused a lot of small little errors when I was porting it from Java (which uses milliseconds).
If you made any changes, would it be possible to send me a diff of your project? I could incorporate them (and give you credit) and repost the project.
On another note, are you using .NET 1.1 or 2.0? One thing I would like to do is move the project to .NET 2.0 because the library performs so much faster (I think serialization in .NET 2.0 has been much improved). If not, I'll make sure to keep both project files around and build it for both frameworks.
Hi Tim, great work with the memcached C# client. We've been using it over here in production with a lot of good results. I've had to modify the client in a couple of ways so far to make it work and wanted to let you know so you could consider fixing it.
In SockIOPool.Maintain(): when you pass a TimeSpan into Thread.Sleep you should instead just pass in the number of milliseconds (5000). The TimeSpan(long) constructor takes ticks, which are 100-nanosecond units, not milliseconds. This causes 100% CPU time since it's polling way faster than it should.
In Memcacheclient.LoadItems you have a while loop that reads until gzi.Read returns -1. I'm not sure if it ever returns -1, but looking at the zip code, it does return 0. This again was causing 100% CPU time as it never left this tight loop.
Please keep up the awesome work you're doing and I will let you know if I find other issues (I suspect the maintenance thread has a bug somewhere causing exceptions but haven't nailed it yet).
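The tick/millisecond mix-up in the first item is easy to see in isolation. The sketch below is not the pool's code, and SleepIntervals and its methods are hypothetical names. It just contrasts the two ways of building a TimeSpan from the same number.

```csharp
using System;

// Hypothetical illustration of ticks vs. milliseconds; not the library's code.
public static class SleepIntervals
{
    // Mistake: TimeSpan(long) treats its argument as 100 ns ticks,
    // so 5000 becomes half a millisecond instead of five seconds.
    public static TimeSpan FromTicksByMistake(long intervalMs)
    {
        return new TimeSpan(intervalMs);
    }

    // Intended: interpret the number as milliseconds.
    public static TimeSpan FromMilliseconds(long intervalMs)
    {
        return TimeSpan.FromMilliseconds(intervalMs);
    }
}
```

Passing the int millisecond count straight to Thread.Sleep(int), as suggested above, sidesteps the conversion entirely.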