From: Bernard L. <be...@va...> - 2008-07-10 18:37:15
|
Dear all: Here's the release plan for the upcoming 3.1 release. I believe all the important, show-stopping backport proposals in the STATUS file for 3.1 branch have already been processed and voted on. So all the remaining backport proposals will be moved from "BACKPORT PROPOSALS" to "BACKPORT PROPOSALS NEXT VERSION" except for documentation patches. As for... * gmond: avoid latency and timeouts when using the tcpconn python module If this causes issues, we could just turn it off by default and put in documentation about its potential pitfalls on certain platforms. Let's get this done by Friday and roll out a beta. We'll test this for a week, and roll out RC1, RC2, etc. etc. Regards, Bernard |
From: Brad N. <BNI...@no...> - 2008-07-10 19:43:53
|
>>> On 7/10/2008 at 12:37 PM, in message <d4c...@ma...>, "Bernard Li" <be...@va...> wrote: > Dear all: > > Here's the release plan for the upcoming 3.1 release. > > I believe all the important, show-stopping backport proposals in the > STATUS file for 3.1 branch have already been processed and voted on. > So all the remaining backport proposals will be moved from "BACKPORT > PROPOSALS" to "BACKPORT PROPOSALS NEXT VERSION" except for > documentation patches. > > As for... > > * gmond: avoid latency and timeouts when using the tcpconn python module > > If this causes issues, we could just turn it off by default and put in > documentation about its potential pitfalls on certain platforms. > > Let's get this done by Friday and roll out a beta. We'll test this > for a week, and roll out RC1, RC2, etc. etc. > Sounds good. Make it happen. :) Brad |
From: Carlo M. A. B. <ca...@sa...> - 2008-07-11 10:36:48
|
On Thu, Jul 10, 2008 at 11:37:23AM -0700, Bernard Li wrote: > > Here's the release plan for the upcoming 3.1 release. do you mean 3.1.0? > I believe all the important, show-stopping backport proposals in the > STATUS file for 3.1 branch have already been processed and voted on. actually there is 1 showstopper reported there which has not yet a resolution, and that is the probable licensing issue between the BSD ganglia-webfrontend and the GPL templatePower class. sadly, I hadn't heard back from Ron (the author of templatePower) on the alternatives we might be able to go with and so can't comment in that. but checking again all legalese that seems to be tied into the files in the web frontend, it might seem we could be OK after all as the terms of the BSD license there (which actually look more like a MIT license to me) seem compatible with the terms of GPLv2. but of course IANAL and we should probably seek advice from one (maybe debian legal or fedora legal could help there). > So all the remaining backport proposals will be moved from "BACKPORT > PROPOSALS" to "BACKPORT PROPOSALS NEXT VERSION" except for > documentation patches. 2 of them still needing votes, and most likely some other ones still not proposed for backport or committed and dealing with the issues that we had been saying will be "put in documentation" like the one proposed below, the upgrading instructions or building/packaging recommendations for CentOS 4 users (including dependencies that are not available in the official repositories). > As for... > > * gmond: avoid latency and timeouts when using the tcpconn python module > > If this causes issues, we could just turn it off by default and put in > documentation about its potential pitfalls on certain platforms. It is definitely unstable and not likely to be fixed before the freeze, so IMHO would be better deleted (not turned off by default) as there is no way to do that reliably in a clean way AFAIK. 
If we would have contrib for 3.1.0, adding it back there in both versions (the python 2.3 compatible one, and the more reliable python 2.4 compatible version) might be a good idea, so that users can use them and configure them as needed (if they agree to the annoyances/risks), but since that is very likely to delay tagging the beta since today is already the "release date" proposed, will be most likely better to just cut it clean. Carlo |
From: Jarod W. <jw...@re...> - 2008-07-11 13:22:17
|
On Friday 11 July 2008 07:02:04 am Carlo Marcelo Arenas Belon wrote: > On Thu, Jul 10, 2008 at 11:37:23AM -0700, Bernard Li wrote: > > Here's the release plan for the upcoming 3.1 release. > > do you mean 3.1.0? > > > I believe all the important, show-stopping backport proposals in the > > STATUS file for 3.1 branch have already been processed and voted on. > > actually there is 1 showstopper reported there which has not yet a > resolution, and that is the probable licensing issue between the BSD > ganglia-webfrontend and the GPL templatePower class. > > sadly, I hadn't heard back from Ron (the author of templatePower) on the > alternatives we might be able to go with and so can't comment in that. > > but checking again all legalese that seems to be tied into the files in > the web frontend, it might seem we could be OK after all as the terms > of the BSD license there (which actually look more like a MIT license > to me) seem compatible with the terms of GPLv2. > > but of course IANAL and we should probably seek advice from one (maybe > debian legal or fedora legal could help there). Tom Callaway is Fedora's first line of defense when it comes to licensing questions, and either he knows the answer from looking into tons of package licensing issues already, or knows who to talk to. -- Jarod Wilson jw...@re... |
From: Carlo M. A. B. <ca...@sa...> - 2008-07-11 16:17:43
|
On Fri, Jul 11, 2008 at 09:22:31AM -0400, Jarod Wilson wrote: > On Friday 11 July 2008 07:02:04 am Carlo Marcelo Arenas Belon wrote: > > > > but checking again all legalese that seems to be tied into the files in > > the web frontend, it might seem we could be OK after all as the terms > > of the BSD license there (which actually look more like a MIT license > > to me) seem compatible with the terms of GPLv2. > > > > but of course IANAL and we should probably seek advice from one (maybe > > debian legal or fedora legal could help there). > > Tom Callaway is Fedora's first line of defense when it comes to licensing > questions, and either he knows the answer from looking into tons of package > licensing issues already, or knows who to talk to. right, thanks Jarod and welcome Tom. so to summarize the relevant facts are : * the Ganglia Web Frontend (which has been with us since 3.0) has been always under a BSD license (not sure which one really because our COPYING file has no conditions that I can count and so looks more like a MIT/X11 license to my untrained eye), but whichever it is, does not have the advertising clause. http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/web/COPYING?view=markup * the frontend is built using a template class called TemplatePower that is itself licensed under the GPLv2 or later and that has been with the frontend since the first release as well. http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/web/class.TemplatePower.inc.php?revision=459&view=markup * the frontend is build in PHP and so all source code is of course available and distributed with it, and so all distribution requirements for both licenses are met. 
* the BSD license we are using doesn't add any restrictions (like the advertising clause that usually comes from the original BSD license) that are not already part of the GPL license and therefore would seem (at least to my untrained eye) that they are indeed compatible and fine for distribution together : http://www.gnu.org/philosophy/license-list.html Carlo |
From: Brad N. <BNI...@no...> - 2008-07-11 15:29:08
|
>>> On 7/11/2008 at 5:02 AM, in message <20080711110204.GB7724@tapir>, Carlo Marcelo Arenas Belon <ca...@sa...> wrote: > On Thu, Jul 10, 2008 at 11:37:23AM -0700, Bernard Li wrote: >> >> So all the remaining backport proposals will be moved from "BACKPORT >> PROPOSALS" to "BACKPORT PROPOSALS NEXT VERSION" except for >> documentation patches. > > 2 of them still needing votes, and most likely some other ones still > not proposed for backport or committed and dealing with the issues that > we had been saying will be "put in documentation" like the one proposed > below, > the upgrading instructions or building/packaging recommendations for CentOS > 4 > users (including dependencies that are not available in the official > repositories). > >> As for... >> >> * gmond: avoid latency and timeouts when using the tcpconn python module >> >> If this causes issues, we could just turn it off by default and put in >> documentation about its potential pitfalls on certain platforms. > > It is definitely unstable and not likely to be fixed before the freeze, so > IMHO would be better deleted (not turned off by default) as there is no way > to do that reliably in a clean way AFAIK. > Disabling it is just a matter of a file name change from tcpconn.py to tcpconn.pyoff or something like that. The same thing would have to be done for the tcpconn.pyconf file as well (tcpconn.pyoff). I would suggest we just make the file name change and still distribute it for those that want to use it anyway. It still works reliably, it just has a wait timeout issue that is really only noticeable when using the -m parameter. Brad |
From: Brad N. <BNI...@no...> - 2008-07-11 16:07:18
|
>>> On 7/10/2008 at 12:37 PM, in message <d4c...@ma...>, "Bernard Li" <be...@va...> wrote: > Dear all: > > Here's the release plan for the upcoming 3.1 release. > > I believe all the important, show-stopping backport proposals in the > STATUS file for 3.1 branch have already been processed and voted on. > So all the remaining backport proposals will be moved from "BACKPORT > PROPOSALS" to "BACKPORT PROPOSALS NEXT VERSION" except for > documentation patches. > > As for... > > * gmond: avoid latency and timeouts when using the tcpconn python module > > If this causes issues, we could just turn it off by default and put in > documentation about its potential pitfalls on certain platforms. > > Let's get this done by Friday and roll out a beta. We'll test this > for a week, and roll out RC1, RC2, etc. etc. > I would just like to make a comment about version numbers as we are about to generate our first release of 3.1. I noted this on the wiki several months ago under the section "Generating a Release Candidate and GA Release" (http://ganglia.wiki.sourceforge.net/ganglia_works) which describes the same release versioning process that the Apache project uses. This also goes back to our discussions about 3.1.0 vs. 3.1.1 version number. The Apache project does not use the labels Alpha, Beta, RCx for any of the actual tarball file names or internal version numbers in the source code itself. The only time these labels are used are in the mailing list announcements during the testing period. The reason why these labels are not used in the file name or in the source code is so that a tarball only has to be rolled once and if determined during the testing period to be releasable, no alterations to the actual tarball are made. It is simply released officially. What this means is that if a 3.1.1 tarball is tagged in SVN, rolled, posted for testing and fails testing, that version number is simply thrown away. Once the fix is made in trunk and backported, a new tag is created (ie. 
3.1.2) in SVN, the tarball is rolled and the process starts over again. If the "Beta" or "RCx" label were used in the source code or file name, source code changes would have to be made (even though minor) and the tarball would have to be re-rolled, which carries the potential of reintroducing a bug or build issue. Another advantage to this is that all public releases (including betas or RCs) of a tarball are tracked in SVN. It also avoids the confusion that arises when multiple testing tarballs carry the same version number as the actual release. The only downside to this is the fact that our first official release might be 3.1.3 rather than 3.1.0. This could be a little misleading to some users; however, the Apache project has been successfully doing this for a long time with no problem or confusion. The first release of Apache 2.0 was not 2.0.0; it was actually 2.0.35. NOTE: I do have to mention that the first release of Apache 2.2 was actually 2.2.0, but that was because the Apache project moved to an odd/even versioning scheme (like the Linux kernel) where all odd minor version numbers are considered "in-development" and all even minor version numbers are releases. That is why nobody ever saw an official Apache 2.1 release but rather skipped directly to 2.2. All 2.1.x versions were development/testing only. We could do that as well, but that hasn't been the precedent so far in the Ganglia project. If we did, then our first official release would be 3.2.0 rather than 3.1.<whatever>. My preference would be to stick to the 3.1.x scheme as described in the wiki and the paragraph above. Comments? Brad |
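The roll-once versioning scheme described in this message can be sketched in a few lines of Python. This is only an illustration, not project tooling: the function names and the test history are hypothetical, and the version strings are assumed to have the plain "major.minor.patch" form.

```python
def is_development_minor(version):
    """Odd/even scheme: an odd minor number marks a development-only line."""
    major, minor, patch = (int(part) for part in version.split("."))
    return minor % 2 == 1


def first_official_release(candidates):
    """Return the first tagged version whose tarball passed testing.

    `candidates` is a list of (version, passed_testing) pairs in tag order.
    A version that fails testing is thrown away; nothing is re-rolled.
    """
    for version, passed in candidates:
        if passed:
            return version
    return None


# Hypothetical history: two tarballs fail testing, the third is released as-is.
history = [("3.1.0", False), ("3.1.1", False), ("3.1.2", True)]
print(first_official_release(history))   # prints 3.1.2
print(is_development_minor("2.1.9"), is_development_minor("2.2.0"))  # prints True False
```

Under this scheme the tarball that passes testing is the one that ships, which is why the first official version number cannot be known in advance.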
From: Carlo M. A. B. <ca...@sa...> - 2008-07-11 19:35:24
|
On Fri, Jul 11, 2008 at 10:07:08AM -0600, Brad Nicholes wrote: > >>> On 7/10/2008 at 12:37 PM, in message > <d4c...@ma...>, "Bernard Li" > <be...@va...> wrote: > > Dear all: > > > > Here's the release plan for the upcoming 3.1 release. > > > > I believe all the important, show-stopping backport proposals in the > > STATUS file for 3.1 branch have already been processed and voted on. > > So all the remaining backport proposals will be moved from "BACKPORT > > PROPOSALS" to "BACKPORT PROPOSALS NEXT VERSION" except for > > documentation patches. > > > > As for... > > > > * gmond: avoid latency and timeouts when using the tcpconn python module > > > > If this causes issues, we could just turn it off by default and put in > > documentation about its potential pitfalls on certain platforms. > > > > Let's get this done by Friday and roll out a beta. We'll test this > > for a week, and roll out RC1, RC2, etc. etc. > > I would just like to make a comment about version numbers as we are about > to generate our first release of 3.1. I noted this on the wiki several > months ago under the section "Generating a Release Candidate and GA Release" > (http://ganglia.wiki.sourceforge.net/ganglia_works) which describes the same > release versioning process that the Apache project uses. This also goes > back to our discussions about 3.1.0 vs. 3.1.1 version number. 
right, the first beta that Bernard is going to generate sometime today will be either called 3.1.0 or 3.1.1 (depending on what he decides to do, and which will be most likely 3.1.0 since there shouldn't be any technical reason not to anyway and he expressed several times that is what he wanted to do) since we had been testing snapshots for more than a year, I am pretty sure is going to be rock solid (except of course for the platforms that will have no support and that we are most likely going to have to defer to the next release but will be interesting to test as well, even if that means will need to have unofficial patches applied to them to work for 3.1.0) > The Apache project does not use the labels Alpha, Beta, RCx for any of > the actual tarball file names or internal version numbers in the source > code itself. The only time these labels are used are in the mailing list > announcements during the testing period. The reason why these labels > are not used in the file name or in the source code is so that a tarball > only has to be rolled once and if determined during the testing period to > be releasable, no alterations to the actual tarball are made. It is simply > released officially. This could be a little confusing, but we agreed to it so be it, hopefully again, since we had been testing this for a long time, the beta won't need to be thrown away but used AS-IS all the way through the RCs and we would make a 3.1.0 official release instead of having to resort into a 3.1.25 like Apache 2.0 did. any one willing to take some bets? > If we did, then our first official release would be 3.2.0 rather than > 3.1.<whatever>. My preference would be to stick to the 3.1.x scheme > as described in the wiki and the paragraph above. Agree, we could reconsider it when 3.2.0 gets released. Carlo |
From: Carlo M. A. B. <ca...@sa...> - 2008-07-11 16:50:37
|
On Fri, Jul 11, 2008 at 09:29:06AM -0600, Brad Nicholes wrote: > >>> On 7/11/2008 at 5:02 AM, in message <20080711110204.GB7724@tapir>, Carlo > Marcelo Arenas Belon <ca...@sa...> wrote: > > > > It is definitely unstable and not likely to be fixed before the freeze, so > > IMHO would be better deleted (not turned off by default) as there is no way > > to do that reliably in a clean way AFAIK. > > Disabling it is just a matter of a file name change from tcpconn.py to > tcpconn.pyoff or something like that. The same thing would have to be > done for the tcpconn.pyconf file as well (tcpconn.pyoff). That is what I meant by "not in a clean way", as it will leave dead code around and will get most likely people confused by the funny names and will require them to rename files (which are under a package manager and then will complain as being missing and won't be removed at uninstall time obstructing the removal for other directories as well). in any case, if documented clearly I have no reason to object but that is just because I won't be affected anyway as I don't use our provided RPM packages. but on that line, remember that it might not be implemented the way you envision for all available packages (which is what I meant by unreliably) as the copying of the files is done now by the SPEC and that could result in even more confusion. > I would suggest we just make the file name change and still distribute it > for those that want to use it anyway. My suggestion was to make a file name change as well into the contrib directory, where it won't get in the way and will be also available for those that want to use it, but since there is no contrib yet distributed then cleanly removing it (it will be available from our repository in the web anyway for whoever wants to install it) looks like the best next option. > It still works reliably, it just has a wait timeout issue that is really > only noticeable when using the -m parameter. 
but that would result in some metric samples failing silently and therefore in some holes in the RRD values that could then result in mysterious drops in the graphs or flat lines. Carlo |
From: Carlo M. A. B. <ca...@sa...> - 2008-07-11 18:21:08
|
On Fri, Jul 11, 2008 at 12:10:36PM -0400, Tom spot Callaway wrote: > > Now, the web front-end is composed of MIT licensed pages and the one > GPLv2+ licensed page. MIT and GPLv2+ are compatible, so this is not a > problem. In my opinion, the web front-end is not a derived work of the > libganglia code (doesn't take code from it, doesn't link to the > library), so there is no concern around licensing incompatibility > between the ASL 1.1 portion of the libganglia license and the GPLv2+ php > page. This is great news, and that is also supported by the fact that the ganglia web frontend was originally and independent package (before 3.0) and so has cleared at least for me any doubts about the legality of distributing it with the upcoming 3.1 release. > However, if you disagree and think that the web front-end is a derived > work, you would need to either relicense (or replace) the code under the > ASL 1.1 license or the GPLv2+ license to resolve the conflict. Probably a nice thing to do for a future release and just so every possible interpretation of our license mix is covered as you suggested. > A few additional points worth mentioning: > > 1. A large chunk of the code in that tarball does not have license > attribution in the code itself. The only reliable way to determine the > license of code is to have the license attribution in the source file > itself (usually in the initial comment header). I would highly encourage > you to do this for all of your source code as soon as possible. Remember > that code moves often, and people forget what "COPYING" said (or even > which "COPYING" it came from). Agree, and definitely something I was looking forward to after we are done with this release. > 2. 
You should correct the "BSD" license references in your code, Agree, using "MIT" is definitely more accurate, but in our defense "BSD" is a confusing license name anyway as it can really mean different things, some of which are functionally equivalent to a "MIT" license, like the "2 clause BSD", and "MIT" was after all based on BSD. There is also the fact that this all was started as part of a UC Berkeley project and therefore Matt might have been playing the regents a prank when he used instead a "MIT" license and put the regent names inside ;) and so, since he is still at shooting distance from Berkeley, calling it BSD helps avoid any animosities directed at him or us. In any case, since the original intent was to use a "3 clause BSD" license from what I recall, and that is functionally equivalent to a "MIT" license, I don't think this should be considered a showstopper anyway, but something to work on in the near future. > it is clearly not BSD licensed (with the exception of the freebsd metrics > code). and the other "BSD" metric code which is also under the "Original BSD (AKA 4 clause BSD)" license and that we will hopefully replace soon with something more modern as well. > Hope that helps, Thanks a lot for your great advice, we surely owe you one, and take for granted the next time we meet that beer is on me. Carlo |
From: Brad N. <BNI...@no...> - 2008-07-11 18:24:05
|
>>> On 7/11/2008 at 11:15 AM, in message <20080711171554.GB13456@tapir>, Carlo Marcelo Arenas Belon <ca...@sa...> wrote: > On Fri, Jul 11, 2008 at 09:29:06AM -0600, Brad Nicholes wrote: >> >>> On 7/11/2008 at 5:02 AM, in message <20080711110204.GB7724@tapir>, Carlo >> Marcelo Arenas Belon <ca...@sa...> wrote: >> > >> > It is definitely unstable and not likely to be fixed before the freeze, so >> > IMHO would be better deleted (not turned off by default) as there is no way >> > to do that reliably in a clean way AFAIK. >> >> Disabling it is just a matter of a file name change from tcpconn.py to >> tcpconn.pyoff or something like that. The same thing would have to be >> done for the tcpconn.pyconf file as well (tcpconn.pyoff). > > That is what I meant by "not in a clean way", as it will leave dead code > around and will get most likely people confused by the funny names and > will require them to rename files (which are under a package manager and > then will complain as being missing and won't be removed at uninstall > time obstructing the removal for other directories as well). > > in any case, if documented clearly I have no reason to object but that is > just because I won't be affected anyway as I don't use our provided > RPM packages. > > but on that line, remember that it might not be implemented the way you > envision for all available packages (which is what I meant by unreliably) > as the copying of the files is done now by the SPEC and that could result > in even more confusion. > I guess I would just rather see it distributed so that the user can decide what they want to do rather than us making the decision for them. >> I would suggest we just make the file name change and still distribute it >> for those that want to use it anyway. 
> > My suggestion was to make a file name change as well into the contrib > directory, where it won't get in the way and will be also available for > those that want to use it, but since there is no contrib yet distributed > then cleanly removing it (it will be available from our repository in > the web anyway for whoever wants to install it) looks like the best next > option. > I would agree as well if we had a contrib/ directory. But just because we don't should not mean that we remove it completely and make it unavailable for those that would still like to use it. >> It still works reliably, it just has a wait timeout issue that is really >> only noticeable when using the -m parameter. > > but that would result in some metric samples failing silently and therefore > in some wholes in the RRD values that could then result in mysterious drops > in the graphs or flat lines. > No, and the reason is that the actual gathering of the data is threaded. tcpconn.py spins up its own gathering thread that periodically exec's netstat and updates an internal array of metrics. When the gmond main thread requests the metrics, all it does is read the internal array and return whatever the last gathered value was. There is no delay to gmond at all. At worst, the tcpconn gathering thread might delay occasionally, which has no effect on anything else. It was written this way on purpose so that gmond would never be at the mercy of the python exec code, netstat delays in execution or OS delays. The delay only shows up for gmond when the tcpconn metric_cleanup() function is called and the main gmond process has to wait for the tcpconn gathering thread to shut down. That's why you see the delay with the -m parameter and nowhere else. The gmond -m option causes metric_init(), which starts the gathering thread, and metric_cleanup(), which shuts it down, to happen one immediately after the other. Gmond has to delay waiting for the thread cleanup. 
Also tcpconn.py takes a RefreshRate parameter that can be set in the tcpconn.pyconf configuration file. This parameter determines how often the tcpconn gathering thread should attempt to exec netstat to get a new value for the internal structure. The gathering of the netstat value and the gathering of the gmond metric can be on two different cycles for the simple fact that latency can't be pre-determined. This issue really is just a -m parameter annoyance. Brad |
From: Carlo M. A. B. <ca...@sa...> - 2008-07-11 19:16:57
|
On Fri, Jul 11, 2008 at 12:24:01PM -0600, Brad Nicholes wrote: > >>> On 7/11/2008 at 11:15 AM, in message <20080711171554.GB13456@tapir>, Carlo > Marcelo Arenas Belon <ca...@sa...> wrote: > > I guess I would just rather see it distributed so that the user can decide > what they want to do rather than us making the decision for them. and I agree with you on that, the only difference of opinions comes on how to distribute that and if that is feasible now (see below). > > My suggestion was to make a file name change as well into the contrib > > directory, where it won't get in the way and will be also available for > > those that want to use it, but since there is no contrib yet distributed > > then cleanly removing it (it will be available from our repository in > > the web anyway for whoever wants to install it) looks like the best next > > option. > > I would agree as well if we had a contrib/ directory. But just because > we don't should not mean that we remove it completely and make it > unavailable for those that would still like to use it. there is also the possibility of just adding the "contrib" into this first release and using instead that (which should be safe enough) and has been already voted for backport (but for the next release). feel free to commit that then and base disabling this metric / documentation on the contrib directory which should satisfy all raised concerns. if you are going that route, it might be also a good idea to backport including ganglia-rrd-modify.pl into the contrib which has been approved also and was dependent on that first backport. 
but if you are going that route (and this is where this starts becoming a risky proposition) is that would be also nice to backport the original python 2.4 compatible version which doesn't have the problem the 2.3 compatible version has and that would be a better fit for the majority of the users (except for the ones stuck with python 2.3 like CentOS 4 users and that have other problems getting ganglia running as well, like the lack of an APR1 official package they could use as a dependency), but then that version doesn't exist yet (even if it will be easy to come up with as you explained by rolling back the 2.3 compatibility patches) and hasn't been tested probably as much as the buggy one. > >> It still works reliably, it just has a wait timeout issue that is really > >> only noticeable when using the -m parameter. > > > > but that would result in some metric samples failing silently and therefore > > in some wholes in the RRD values that could then result in mysterious drops > > in the graphs or flat lines. > > No and the reason why is because the actual gathering of the data is > threaded. tcpconn.py spins up its own gathering thread that periodically > exec's netstat and updates an internal array of metrics. > When the gmond main thread requests the metrics, all it does is read the > internal array and return whatever the last gathered value was. Ok, but then that spawning netstat thread will randomly fail, an so depending on the frequency it fails compared with the polling gmond does you will get flat lines. > There is no delay to gmond at all. At worst, the tcpconn gathering thread > might delay occasionally which has no effect on anything else. It was > written this way on purpose so that gmond would never be at the mercy of > the python exec code, netstat delays in execution or OS delays. Good to know, and definitely a sound architectural design. 
> The delay only shows up for gmond when the tcpconn metric_clean() function > is called and the main gmond process has to wait for the tcpconn gathering > thread to shutdown. That's why you see the delay in with the -m parameter > and no where else. Well, as you explained you also see it at shutdown. > The gmond -m option causes the metric_init(), which starts the gathering > thread and the metric_cleanup() which shuts down the gathering thread, > to happen one immediately after the other. Gmond has to delay waiting > for the thread cleanup. And this is IMHO a bug, but a fix for it is not something that will be ready to release anytime soon as spelled in the STATUS file. It would be better if the metric_init() doesn't initialize the "spawning netstat thread" but leave that to the collection method that is scheduled by gmond and who would just need to do the first sample and initialize that thread the first time it is called. This way the metric_cleanup() method won't need to wait either for the `gmond -m` case which shouldn't execute any metric collection code in principle. Carlo |
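The alternative proposed in this message, starting the gathering thread on the first collection call instead of in metric_init(), might look like the sketch below. It is an illustration under assumptions, not a patch for tcpconn.py: the `LazyGatherer` class and `fake_netstat` are hypothetical, and the sampler is a stub rather than a real netstat call.

```python
import threading


def fake_netstat():
    # Stand-in for running netstat and parsing its output (hypothetical value).
    return {"tcp_established": 3}


class LazyGatherer:
    """Starts its background thread only when a value is first requested."""

    def __init__(self, sample, refresh_rate):
        self._sample = sample
        self._refresh_rate = refresh_rate
        self._lock = threading.Lock()
        self._stop_event = threading.Event()
        self._thread = None
        self._last = {}

    def _loop(self):
        # Event.wait() returns True once shutdown() has set the event.
        while not self._stop_event.wait(self._refresh_rate):
            values = self._sample()
            with self._lock:
                self._last = values

    def read(self, name):
        with self._lock:
            if self._thread is None:
                # First scheduled collection: sample inline, then go threaded.
                self._last = self._sample()
                self._thread = threading.Thread(target=self._loop)
                self._thread.daemon = True
                self._thread.start()
            return self._last.get(name, 0)

    def shutdown(self):
        """Return True if a thread had to be joined, False otherwise."""
        self._stop_event.set()
        if self._thread is None:
            return False
        self._thread.join()
        return True


# Init followed directly by cleanup (the `gmond -m` case): nothing to wait for.
unused = LazyGatherer(fake_netstat, 5.0)
print(unused.shutdown())                # prints False

# Normal operation: the first read samples once and starts the thread.
active = LazyGatherer(fake_netstat, 5.0)
print(active.read("tcp_established"))   # prints 3
print(active.shutdown())                # prints True
```

With this arrangement a caller that never collects a metric never creates a thread, so cleanup has nothing to join.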
From: Brad N. <BNI...@no...> - 2008-07-11 19:56:08
|
>>> On 7/11/2008 at 1:42 PM, in message <20080711194215.GB14407@tapir>, Carlo Marcelo Arenas Belon <ca...@sa...> wrote: > On Fri, Jul 11, 2008 at 12:24:01PM -0600, Brad Nicholes wrote: >> >>> On 7/11/2008 at 11:15 AM, in message <20080711171554.GB13456@tapir>, Carlo >> Marcelo Arenas Belon <ca...@sa...> wrote: >> >> I guess I would just rather see it distributed so that the user can decide >> what they want to do rather than us making the decision for them. > > and I agree with you on that, the only difference of opinions comes on how > to distribute that and if that is feasible now (see below). > >> > My suggestion was to make a file name change as well into the contrib >> > directory, where it won't get in the way and will be also available for >> > those that want to use it, but since there is no contrib yet distributed >> > then cleanly removing it (it will be available from our repository in >> > the web anyway for whoever wants to install it) looks like the best next >> > option. >> >> I would agree as well if we had a contrib/ directory. But just because >> we don't should not mean that we remove it completely and make it >> unavailable for those that would still like to use it. > > there is also the possibility of just adding the "contrib" into this first > release and using instead that (which should be safe enough) and has been > already voted for backport (but for the next release). > > feel free to commit that then and base disabling this metric / documentation > on the contrib directory which should satisfy all raised concerns. > > if you are going that route, it might be also a good idea to backport > including ganglia-rrd-modify.pl into the contrib which has been approved also > and was dependent on that first backport. 
> > but if you are going that route (and this is where this starts becoming a > risky proposition) is that would be also nice to backport the original > python 2.4 compatible version which doesn't have the problem the 2.3 > compatible version has and that would be a better fit for the majority of > the users (except for the ones stuck with python 2.3 like CentOS 4 users > and that have other problems getting ganglia running as well, like the lack > of an APR1 official package they could use as a dependency), but then that > version doesn't exist yet (even if it will be easy to come up with as you > explained by rolling back the 2.3 compatibility patches) and hasn't been > tested probably as much as the buggy one. > Wow, I think I would rather just release it as is and fix all of this in the next version. This issue really isn't that big of a deal. Especially since it is Friday and Bernard is ready to roll. >> >> It still works reliably, it just has a wait timeout issue that is really >> >> only noticeable when using the -m parameter. >> > >> > but that would result in some metric samples failing silently and therefore >> > in some wholes in the RRD values that could then result in mysterious drops >> > in the graphs or flat lines. >> >> No and the reason why is because the actual gathering of the data is >> threaded. tcpconn.py spins up its own gathering thread that periodically >> exec's netstat and updates an internal array of metrics. >> When the gmond main thread requests the metrics, all it does is read the >> internal array and return whatever the last gathered value was. > > Ok, but then that spawning netstat thread will randomly fail, an so > depending on the frequency it fails compared with the polling gmond does > you will get flat lines. > No, There aren't flat lines. A value is always being returned and I have never seen the netstat thread fail in normal use. 
The only reason a failure appears with -m is that the metric_cleanup() function is called and there is a race condition. I have been running this code for months now and have never seen any kind of failure other than the -m parameter case. >> There is no delay to gmond at all. At worst, the tcpconn gathering thread >> might delay occasionally which has no effect on anything else. It was >> written this way on purpose so that gmond would never be at the mercy of >> the python exec code, netstat delays in execution or OS delays. > > Good to know, and definitely a sound architectural design. > >> The delay only shows up for gmond when the tcpconn metric_clean() function >> is called and the main gmond process has to wait for the tcpconn gathering >> thread to shutdown. That's why you see the delay in with the -m parameter >> and no where else. > > Well, as you explained you also see it at shutdown. > Right, which isn't a problem. Any threaded application has to wait for its threads to terminate on a clean shutdown. >> The gmond -m option causes the metric_init(), which starts the gathering >> thread and the metric_cleanup() which shuts down the gathering thread, >> to happen one immediately after the other. Gmond has to delay waiting >> for the thread cleanup. > > And this is IMHO a bug, but a fix for it is not something that will be ready > to release anytime soon as spelled in the STATUS file. > > It would be better if the metric_init() doesn't initialize the "spawning > netstat thread" but leave that to the collection method that is scheduled > by gmond and who would just need to do the first sample and initialize > that thread the first time it is called. > > This way the metric_cleanup() method won't need to wait either for the > `gmond > -m` case which shouldn't execute any metric collection code in principle. > Right, I agree this is a bug and the solution you describe is exactly what should be done for the next version.
But this bug doesn't prevent the tcpconn.py module from functioning normally and providing good metric data. At the very worst, it is a -m annoyance and occasionally a shutdown delay. IMO, the severity of the bug is not enough to pull it from the release. Besides, this is a Python module: if a user really requires a fix before the next release, updating the module is no more than a file copy. No rebuilding of anything is required. Brad |
From: Carlo M. A. B. <ca...@sa...> - 2008-07-11 22:12:14
|
On Fri, Jul 11, 2008 at 01:56:07PM -0600, Brad Nicholes wrote: > >>> On 7/11/2008 at 1:42 PM, in message <20080711194215.GB14407@tapir>, Carlo > Marcelo Arenas Belon <ca...@sa...> wrote: > > > > there is also the possibility of just adding the "contrib" into this first > > release and using instead that (which should be safe enough) and has been > > already voted for backport (but for the next release). > > > > feel free to commit that then and base disabling this metric / documentation > > on the contrib directory which should satisfy all raised concerns. > > I think I would rather just release it as is and fix all of this in > the next version. This issue really isn't that big of a deal. well considering how much debate has been done about this with no other work committed and proposed for vote this might be the only option left. another issue with releasing it AS-IS is that the implementation is buggy, it does the external interaction with netstat in a non standard/problematic way (so it will run in python 2.3), but that is not explained/documented anywhere and since this is the first release will be most likely used as a template AS-IS for future modules which will just then inherit its bugs. > Especially since it is Friday and Bernard is ready to roll. Well, Bernard proposed to fix this on Wednesday in a private list and you approved even if no patch was ever produced. So I was really hoping that wouldn't be the case. Indeed in those 2 days I'd worked for improving the contrib inclusion patches just in case they will be needed for whatever solution you were planning to implement. Also in those 2 days, I'd worked on debugging and improving this module as well but I definitely agree that would be too destabilizing this late of the game. 
I don't really use the python modules in my setup (most of them won't run or even get installed on any of the platforms I use daily and that I was focusing my testing on), and really didn't know it was a known issue, since no one had reported it before Ulf did (which was a surprise to me, knowing how much testing and stabilization work has been done for this release). I was of course waiting for that proposed fix to materialize but will hack one out if that is really needed. > >> >> It still works reliably, it just has a wait timeout issue that is really > >> >> only noticeable when using the -m parameter. > >> > > >> > but that would result in some metric samples failing silently and therefore > >> > in some wholes in the RRD values that could then result in mysterious drops > >> > in the graphs or flat lines. > >> > >> No and the reason why is because the actual gathering of the data is > >> threaded. tcpconn.py spins up its own gathering thread that periodically > >> exec's netstat and updates an internal array of metrics. > >> When the gmond main thread requests the metrics, all it does is read the > >> internal array and return whatever the last gathered value was. > > > > Ok, but then that spawning netstat thread will randomly fail, an so > > depending on the frequency it fails compared with the polling gmond does > > you will get flat lines. > > No, There aren't flat lines. A value is always being returned If the same value is always being returned because the other thread failed to spawn netstat and get fresh values, then the line is flat. > and I have never seen the netstat thread fail in normal use. The only > reason why a failure appears with the -m is because of the metric_clean() > function was called and there was a race condition. I have been running > this code for months now and have never seen any kind of failure other > than the -m parameter case.
That is encouraging, and hopefully means that whenever the final fix is committed it will resolve all issues left. > >> The gmond -m option causes the metric_init(), which starts the gathering > >> thread and the metric_cleanup() which shuts down the gathering thread, > >> to happen one immediately after the other. Gmond has to delay waiting > >> for the thread cleanup. > > > > And this is IMHO a bug, but a fix for it is not something that will be ready > > to release anytime soon as spelled in the STATUS file. > > > > It would be better if the metric_init() doesn't initialize the "spawning > > netstat thread" but leave that to the collection method that is scheduled > > by gmond and who would just need to do the first sample and initialize > > that thread the first time it is called. > > > > This way the metric_cleanup() method won't need to wait either for the > > `gmond > > -m` case which shouldn't execute any metric collection code in principle. > > right, I agree this is a bug and the solution you describe is exactly > what should be done for the next version. But this bug doesn't prevent > tcpconn.py module from functioning normally and providing good metric data. > At the very worst, it is a -m annoyance and occasionally a shutdown delay. > IMO, the severity of the bug is not enough to pull it from the release. > Besides the fact that this is a python module. If a user really required > a fix before the next release, updating this module is no more than a file > copy. No rebuilding of anything is required. OK, your previous comments seemed to imply otherwise. I already think I made my points clear. Your call, but the idea of bug delaying through procrastination doesn't seem that appealing. Carlo |
From: Carlo M. A. B. <ca...@sa...> - 2008-07-12 09:15:40
|
On Thu, Jul 10, 2008 at 01:43:54PM -0600, Brad Nicholes wrote: > >>> On 7/10/2008 at 12:37 PM, in message > <d4c...@ma...>, "Bernard Li" > <be...@va...> wrote: > > > > Let's get this done by Friday and roll out a beta. We'll test this > > for a week, and roll out RC1, RC2, etc. etc. > > Sounds good. Make it happen. :) OK, I know that technically it is still Friday in Honolulu (just checked), but it is not helping that we have not yet produced a package, that there are no commits in either trunk or the 3.1 branch, and that all discussion about known bugs and their criticality has gone silent while the STATUS page has no remaining open items for this release. Either way, for anyone who has been waiting (like me) for this moment, I have just uploaded an "unofficial" release package for 3.1.0 at: http://tapir.sajinet.com.pe/ganglia/ganglia-3.1.0.tar.gz This could at least be used by packagers as a starting point for rebasing their packages onto the official release package (which should otherwise be very similar, except that it will be bootstrapped with older versions of autotools), and by everyone else who has been anxiously waiting for a stable version of ganglia 3.1 that they can install for testing on their favorite platform/cluster and give a spin.
The main features of this release are:

* Dynamically loaded metric support (DSO)
* Scriptable metric support with Python
* Modular frontend graph support
* Platform support for DragonFlyBSD
* Improved native metric support for Windows (built with Cygwin)
* Bug fixes and enhancements

It is known to work (if sometimes in somewhat tricky ways) on:

* Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
* [Open]Solaris
* FreeBSD
* NetBSD
* OpenBSD
* DragonFlyBSD
* Cygwin (no support for DSO yet)
* AIX (no support for DSO yet)

It might or might not work (most likely not, even if it compiles) on:

* Darwin (AKA MacOS/X)
* HPUX
* Tru64 (AKA OSF/1)
* Irix

Read all the README, INSTALL and any other documentation you can get hold of, as a lot of things have changed since 3.0.7. Also be careful when upgrading from 3.0: you have to do it in a way that doesn't mix ganglia 3.0 and 3.1 nodes in the same cluster (as defined by a multicast address or unicast collector node), as that might misbehave (nothing that could crash your cluster, but not supported nonetheless).

Please report back with any problems you find or, even better, with reports about how wonderfully this is working in your specific configuration and how it is the best thing since sliced bread by making that old mainframe you had lying around running OpenVMS into a great cluster reporting tool.

happy testing Carlo |
From: Brad N. <BNI...@no...> - 2008-07-14 16:03:46
|
Thanks Carlo for doing this. Did you happen to create a 3.1.0 tag in SVN as well that matches the tarball? If not then let's create the tag. That way we can be sure that no matter who rolls the official tarball, we always get the same code. Bernard, are you available to roll and post an official tarball? Let's get this thing tested and released :) Brad >>> On 7/12/2008 at 3:40 AM, in message <20080712094059.GA19337@tapir>, Carlo Marcelo Arenas Belon <ca...@sa...> wrote: > On Thu, Jul 10, 2008 at 01:43:54PM -0600, Brad Nicholes wrote: >> >>> On 7/10/2008 at 12:37 PM, in message >> <d4c...@ma...>, "Bernard Li" >> <be...@va...> wrote: >> > >> > Let's get this done by Friday and roll out a beta. We'll test this >> > for a week, and roll out RC1, RC2, etc. etc. >> >> Sounds good. Make it happen. :) > > OK, I know that technically it is still Friday in Honolulu (just checked) > but the fact that practically we didn't yet produce a package and there are > no commits in either trunk or the 3.1 branch, while all discussions about > known bugs and their criticality has gone silent while the STATUS page has > no remaining open items for this release is not helping. > > either way, so that anyone who has been waiting (like me) for this moment > I had just uploaded an "unofficial" release package for 3.1.0 in : > > http://tapir.sajinet.com.pe/ganglia/ganglia-3.1.0.tar.gz > > this could be at least used by packagers as a starting point to rebasing > their packages for the official release package (which should be otherwise > very similar except that bootstrapped with more ancient versions of > autotools) > and for everyone else that was waiting with anxiety for a stable version of > ganglia 3.1 that they can install for testing in their favorite > platform/cluster and give it a spin. 
> > the main features of this release are : > > * Dynamically loaded metric support (DSO) > * Scriptable metric support with Python > * Modular frontend graph support > * Platform support for DragonFlyBSD > * Improved native metric support for Windows (Built with CygWin) > * Bug fixes and Enhancements > > and has been known to work (if sometimes in somehow tricky ways) in : > > * Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE) > * [Open]Solaris > * FreeBSD > * NetBSD > * OpenBSD > * DragonflyBSD > * Cygwin (no support for DSO yet) > * AIX (no support for DSO yet) > > it might work or not (most likely not, even if would maybe compile) in : > > * Darwin (AKA MacOS/X) > * HPUX > * Tru64 (AKA OSF/1) > * Irix > > read all the README, INSTALL and any other documentation you can get a hold > of as a lot of things had changed since 3.0.7, and be also careful when > upgrading from 3.0 as you would have to do it in a way that doesn't mix > ganglia 3.0 and 3.1 nodes in the same cluster (as defined by a multicast > address or unicast collector node) as that might misbehave (nothing that > could crash your cluster though, but not supported nonetheless) > > please report back with any problems you will find or even better with > reports > about how wonderfully this is working in your specific configuration and how > it is the best thing since sliced bread by making that old mainframe you had > lying around running OpenVMS into a great cluster reporting tool. > > happy testing > > Carlo |
From: Bernard L. <be...@va...> - 2008-07-14 22:40:13
|
Hi all: On Mon, Jul 14, 2008 at 9:03 AM, Brad Nicholes <BNI...@no...> wrote: > Thanks Carlo for doing this. Did you happen to create a 3.1.0 tag in SVN as well that matches the tarball? If not then let's create the tag. That way we can be sure that no matter who rolls the official tarball, we always get the same code. > > Bernard, are you available to roll and post an official tarball? > > Let's get this thing tested and released :) Sorry I was out of town this weekend and didn't get to check my email until now. Thanks Carlo for releasing tarball in my stead. I have tagged 3.1.0 (r1562) and have put the tarball and RPMs (built on CentOS 4.x) on http://www.ganglia.info/testing. I encourage everybody to test it out and let us know of any critical issues. Thanks everybody for their hard work and support to get this release out! Cheers, Bernard |
From: Carlo M. A. B. <ca...@sa...> - 2008-07-15 08:04:13
|
On Mon, Jul 14, 2008 at 03:40:20PM -0700, Bernard Li wrote: > On Mon, Jul 14, 2008 at 9:03 AM, Brad Nicholes <BNI...@no...> wrote: > > > Bernard, are you available to roll and post an official tarball? > > > > Let's get this thing tested and released :) > > Sorry I was out of town this weekend and didn't get to check my email > until now. Considering that you were the one who proposed the release date, and that you are the only Release Manager, could we officially delegate that task whenever a sudden conflict arises? > Thanks Carlo for releasing tarball in my stead. No problem; doing a release tarball is not rocket science and I've been doing unofficial ones since I offered to do releases for 3.0 (when you said you were no longer interested and wanted to focus on 3.1 instead). Should I assume that with 3.1 released I should finally be doing the 3.0 releases, or have you reconsidered? > I have tagged 3.1.0 (r1562) and have put the tarball and RPMs (built > on CentOS 4.x) on http://www.ganglia.info/testing. I encourage > everybody to test it out and let us know of any critical issues. There hasn't been an official announcement for this release (except the unofficial one I made on ganglia-developers); feel free to use that message as a baseline for the official announcement and call for testing, which should also go to ganglia-general and the ganglia-announce list IMHO. But I would think that we could use some better references to documentation, including references to upgrading instructions from previous releases or the previous snapshots (especially if using our RPM packages for CentOS 4) and a reference to known issues like:

* no support for C++ to create DSO modules
* no spoofing from modular metrics (use gmetric if spoofing is needed)
* race condition for tcpconn python metric (affects gmond -m)

and all the others we said we would document while we kept going, and that I have already lost track of (as I wasn't the one doing the documenting).

Carlo |
From: Brad N. <bni...@no...> - 2008-07-15 14:14:49
|
>>> On 7/15/2008 at 2:29 AM, in message <20080715082949.GB9804@tapir>, Carlo Marcelo Arenas Belon <ca...@sa...> wrote: > On Mon, Jul 14, 2008 at 03:40:20PM -0700, Bernard Li wrote: >> On Mon, Jul 14, 2008 at 9:03 AM, Brad Nicholes <BNI...@no...> wrote: >> >> > Bernard, are you available to roll and post an official tarball? >> > >> > Let's get this thing tested and released :) >> >> Sorry I was out of town this weekend and didn't get to check my email >> until now. > > Considering that you were the one that proposed the release date, and you > are the only Release Manager could we then instead officially delegate > that task if sudden conflict arises? > >> Thanks Carlo for releasing tarball in my stead. > > No problem; doing a release tarball is not rocket science and I'd been > doing unofficial ones since I offered doing releases for 3.0 (when you > said you were no longer interested and wanted to focus in 3.1 instead). > > Should I assume that with 3.1 released I should then be doing the 3.0 > releases finally, or have you reconsidered? > >> I have tagged 3.1.0 (r1562) and have put the tarball and RPMs (built >> on CentOS 4.x) on http://www.ganglia.info/testing. I encourage >> everybody to test it out and let us know of any critical issues. > > There hasn't been an official announcement for this release (except the > unofficial one I did in ganglia-developers), feel free to use that message as > a baseline for the official announcement and call for testing, which should > also include ganglia-general and the ganglia-announce list IMHO. 
> > But I would think that we could use some better references to documentation, > including references for upgrading instructions from previous releases or > the previous snapshots (specially if using our RPM packages for CentOS 4) > and a reference to known issues like : > > * no support for C++ to create DSO modules > * no spoofing from modular metrics (use gmetric if spoofing is needed) > * race condition for tcpconn python metric (affects gmond -m) > > and all others we said we will document and kept going and that I lost track > of already (as I wasn't the one doing that documenting) > > Carlo I will take care of the announcement email when I get into the office today. The email that you sent out for the unofficial testing tarball looks good. I will base it off of that. For the additional documentation, I will start a Release Notes page on the wiki and try to get some of these things documented. If everybody else could jump in at that point and update the wiki, I think we can get that taken care of. Brad |
From: Brad N. <BNI...@no...> - 2008-07-15 20:05:30
|
>>> On 7/15/2008 at 8:14 AM, in message <487...@no...>, "Brad Nicholes" <bni...@no...> wrote: > > I will take care of the announcement email when I get into the office today. > The email that you sent out for the unofficial testing tarball looks good. > I will base it off of that. For the additional documentation, I will start a > Release Notes page on the wiki and try to get some of these things > documented. If everybody else could jump in at that point and update the > wiki, I think we can get that taken care of. > I have just sent an announcement to all three ganglia mailing lists. I got a bounce message from the announce list; could somebody with email list rights take care of that? Also, I have started a new page on the wiki site for current release notes (http://ganglia.wiki.sourceforge.net/ganglia_release_notes). This should be the place where we can document known issues, add platform build tips or anything else that we feel would be valuable information. Please help update this site so that we can make the testing of Ganglia 3.1 go as smoothly as possible. I didn't set a time limit on the testing period. Bernard had mentioned one week, followed by another testing tarball. I would suggest two weeks, with additional testing tarballs if required. It doesn't really matter to me, I am just happy that we are moving this release forward. Thanks everybody, Brad |
From: Bernard L. <be...@va...> - 2008-07-15 20:22:03
|
Hi Brad: On Tue, Jul 15, 2008 at 1:05 PM, Brad Nicholes <BNI...@no...> wrote: > I have just sent an announcement to all three ganglia mailing lists. I got a bounce message from the announce list, if somebody with email list rights could take care of that. Also, I have started a new I just approved that email and added your email address to the 'always accept' list. I didn't see the email in the SF.net mail archives yet but if somebody who subscribes to ganglia-announce can just let us know that it went through, I would appreciate it. > page on the wiki site for current release notes (http://ganglia.wiki.sourceforge.net/ganglia_release_notes). This should be the place where we can document known issues, add platform build tips or anything else that we feel would be valuable information. Please help update this site so that we can make the testing of Ganglia 3.1 go as smoothly as possible. I didn't set a time limit on the testing period. Bernard had mentioned one week with a follow up of another testing tarball. I would suggest two weeks with an additional testing tarballs if required. It doesn't really matter to me, I am just happy that we are moving this release forward. Two week testing period works for me. I am also glad that we're moving forward. Cheers, Bernard |