[root@scribe ~]# scribed -c /etc/scribed/default.conf
[Mon May 11 18:44:09 2009] "STATUS: STARTING"
[Mon May 11 18:44:09 2009] "STATUS: configuring"
[Mon May 11 18:44:09 2009] "got configuration data from file </etc/scribed/default.conf>"
[Mon May 11 18:44:09 2009] "CATEGORY : default"
[Mon May 11 18:44:09 2009] "Creating default store"
[Mon May 11 18:44:09 2009] "configured <1> stores"
[Mon May 11 18:44:09 2009] "STATUS: "
[Mon May 11 18:44:09 2009] "STATUS: ALIVE"
[Mon May 11 18:44:09 2009] "Starting scribe server on port 9103"
Thrift: Mon May 11 18:44:09 2009 libevent 1.1a method epoll
....
[Mon May 11 18:44:12 2009] "store thread starting"
[Mon May 11 18:44:12 2009] "[apache.access] Opened file </var/log/scribe/apache.access/apache.access_00000> for writing"
[Mon May 11 18:44:12 2009] "[apache.access] Opened file </tmp/apache.access/apache.access_00000> for writing"
[Mon May 11 18:44:12 2009] "[apache.access] Changing state from <DISCONNECTED> to <SENDING_BUFFER>"
[Mon May 11 18:44:12 2009] "[apache.access] successfully read <0> entries from file </tmp/apache.access/apache.access_00000>"
[Mon May 11 18:44:12 2009] "[apache.access] No more buffer files to send, switching to streaming mode"
[Mon May 11 18:44:12 2009] "[apache.access] Changing state from <SENDING_BUFFER> to <STREAMING>"
*** stack smashing detected ***: scribed terminated
Aborted
[root@scribe ~]#
I haven't yet run into any stack corruptions when running Scribe, so I'm not sure what could be causing this. Can you tell me a little more about your setup? How many categories do you have and how many messages per second are you pushing? And is cpu/memory usage reasonable when this is running?
Also, can you try to compile with symbols (configure --disable-opt) and then run scribe from a debugger to get the line number where this error is detected? Of course, the line number where a stack corruption is detected isn't necessarily the same line where the corruption occurred, but it should provide a pretty good hint.
-Anthony
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Scribe is still in our development environment so the load is trivial.
We have one Scribe server (configuration in previous post) and two clients (configuration below). Each client is running Scribe and relays messages to the server (I haven't seen the clients crash yet). We have ~15 categories and do about ~60 messages/second.
CPU and memory look fine on the clients and the server.
I'm not sure what could be causing this. It looks like everything you are doing is correct. I would try upgrading to a more recent version of libevent. I am currently using libevent-1.4.3.
Let us know how it goes.
-Anthony
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
It seems to only occur during "store thread starting". If the server can get through starting all the store threads it seems to run fine. It seems random on when it will fail and when it works.
I'm using thrift revision 770888, fb303 revision 770888 (from thrift apache project) and scribe 2.01. If possible can you tell me what versions of thrift/fb303 you're using.
Thanks again,
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
It sounds like libevent 1.4.4-stable fixed a stack corruption bug. Can you double check to make sure that you definitely linked with this version of libevent when rebuilding scribe?
While I have tested recent revisions of thrift trunk, most of our stress testing(and production traffic) uses an older revision of thrift. The version of fb303 should not be an issue.
-Anthony
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Working on this project again, sorry for the very late response.
In http://gist.github.com/112436 you can see that the libevent 1.4.4-stable is displayed on start: "Thrift: Fri May 15 14:38:37 2009 libevent 1.4.4-stable method epoll"
I'm having the same issue on Fedora 11 (boost 1.37 / libevent 1.4.5).
I just commented on the bug you filed and suggested contacting the developers of libevent since this is a known issue in earlier versions of libevent.
But as a sanity check, you should probably double check that you are definitely not building or linking with the earlier version of libevent that was once installed on your system. Even if Thrift is printing out the correct version, it is still possible you are compiling with the old version and linking with the newer one. I would wipe the old version from your system and make sure that you have the correct libevent*.so and event.h. Then you should make clean and remake Thrift, and then do the same for Scribe.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I'm positive its the never version of libevent. The Fedora 11 build was built completely separate from the CentOS version and doesn't come with the older version of libevent. Regardless:
I have used Scribe in production with a version of Thrift that was probably pretty close to the original version that was open sourced.
Keep in mind that to use an older version of thrift, you will need to make some small changes to Scribe. I believe you would just need to replace all occurrences of 'apache::thrift' in Scribe's src with 'facebook::thrift'.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I built Scribe with a 18 month old version of thrift and I ran into the same issue. I'm guessing this is something I'm doing wrong, but I've had to give up for now (rsyslog works fine for us).
Anyway, I wanted to thank you for your help and I'll try again in the next release or two.
Thanks,
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Silas, Sorry I wasn't able to figure out your stack smashing issues earlier. These issues are difficult to debug as they could be coming from any layer of code.
But I just ran across an old thread on the Thrift Jira page that may fix the problem you're running into:
When I apply the patch listed below Scribe will compile fine and works for a while, but eventually I get a "stack smashing detected" error.
Does anyone one if there is something in later (> 1.33.1) versions of Boost (or possibly libevent 1.1a) which would fix this error?
And has anyone successfully statically compiled Scribe on CentOS 5?
I'd really like to get Scribe working as an RPM so other CentOS 5 shops can use it.
Patch: http://code.google.com/p/silassewell/source/browse/trunk/projects/packages/rpms/scribe/scribe.2.01.patch
Spec/Init/etc..: http://silassewell.googlecode.com/svn/trunk/2009/05/05/scribe-for-el5/scribe/
SRPMs for Thrift/fb303/Scribe: http://www.silassewell.com/blog/2009/05/07/scribe-scalable-real-time-log-aggregation-for-centos-5-rhel-5/
Logs:
[root@scribe ~]# scribed -c /etc/scribed/default.conf
[Mon May 11 18:44:09 2009] "STATUS: STARTING"
[Mon May 11 18:44:09 2009] "STATUS: configuring"
[Mon May 11 18:44:09 2009] "got configuration data from file </etc/scribed/default.conf>"
[Mon May 11 18:44:09 2009] "CATEGORY : default"
[Mon May 11 18:44:09 2009] "Creating default store"
[Mon May 11 18:44:09 2009] "configured <1> stores"
[Mon May 11 18:44:09 2009] "STATUS: "
[Mon May 11 18:44:09 2009] "STATUS: ALIVE"
[Mon May 11 18:44:09 2009] "Starting scribe server on port 9103"
Thrift: Mon May 11 18:44:09 2009 libevent 1.1a method epoll
....
[Mon May 11 18:44:12 2009] "store thread starting"
[Mon May 11 18:44:12 2009] "[apache.access] Opened file </var/log/scribe/apache.access/apache.access_00000> for writing"
[Mon May 11 18:44:12 2009] "[apache.access] Opened file </tmp/apache.access/apache.access_00000> for writing"
[Mon May 11 18:44:12 2009] "[apache.access] Changing state from <DISCONNECTED> to <SENDING_BUFFER>"
[Mon May 11 18:44:12 2009] "[apache.access] successfully read <0> entries from file </tmp/apache.access/apache.access_00000>"
[Mon May 11 18:44:12 2009] "[apache.access] No more buffer files to send, switching to streaming mode"
[Mon May 11 18:44:12 2009] "[apache.access] Changing state from <SENDING_BUFFER> to <STREAMING>"
*** stack smashing detected ***: scribed terminated
Aborted
[root@scribe ~]#
/etc/scribed/default.conf:
port=9103
max_msg_per_second=10000
check_interval=3
use_conn_pool=yes
<store>
category=default
type=buffer
target_write_size=20480
max_write_interval=1
buffer_send_rate=2
retry_interval=30
retry_interval_range=10
<primary>
type=file
fs_type=std
file_path=/var/log/scribe
base_filename=thisisoverwritten
max_size=1073741824
</primary>
<secondary>
type=file
fs_type=std
file_path=/tmp
base_filename=thisisoverwritten
max_size=1073741824
</secondary>
</store>
I'd be happy to do some debugging for a developer if you point me in the right directory.
Thanks,
I haven't yet run into any stack corruptions when running Scribe, so I'm not sure what could be causing this. Can you tell me a little more about your setup? How many categories do you have and how many messages per second are you pushing? And is cpu/memory usage reasonable when this is running?
Also, can you try to compile with symbols (configure --disable-opt) and then run scribe from a debugger to get the line number where this error is detected? Of course, the line number where a stack corruption is detected isn't necessarily the same line where the corruption occurred, but it should provide a pretty good hint.
-Anthony
Scribe is still in our development environment so the load is trivial.
We have one Scribe server (configuration in previous post) and two clients (configuration below). Each client is running Scribe and relays messages to the server (I haven't seen the clients crash yet). We have ~15 categories and do about ~60 messages/second.
CPU and memory look fine on the clients and the server.
gdb info: http://gist.github.com/110302
client config: http://gist.github.com/110293
Let me know if there is something I can do to get more/better information.
Thanks,
I'm not sure what could be causing this. It looks like everything you are doing is correct. I would try upgrading to a more recent version of libevent. I am currently using libevent-1.4.3.
Let us know how it goes.
-Anthony
I updated to libevent 1.4.4-stable and it seems like the issue was resolved.
Thanks for your help!
Nevermind, it just didn't happen the first couple of times.
Here is another gdb dump: http://gist.github.com/112436
It seems to only occur during "store thread starting". If the server can get through starting all the store threads it seems to run fine. It seems random on when it will fail and when it works.
I'm using thrift revision 770888, fb303 revision 770888 (from thrift apache project) and scribe 2.01. If possible can you tell me what versions of thrift/fb303 you're using.
Thanks again,
Again, gdb info with thrift symbols: http://gist.github.com/110309
It sounds like libevent 1.4.4-stable fixed a stack corruption bug. Can you double check to make sure that you definitely linked with this version of libevent when rebuilding scribe?
While I have tested recent revisions of thrift trunk, most of our stress testing(and production traffic) uses an older revision of thrift. The version of fb303 should not be an issue.
-Anthony
Working on this project again, sorry for the very late response.
In http://gist.github.com/112436 you can see that the libevent 1.4.4-stable is displayed on start: "Thrift: Fri May 15 14:38:37 2009 libevent 1.4.4-stable method epoll"
I'm having the same issue on Fedora 11 (boost 1.37 / libevent 1.4.5).
I created a bug with the Fedora 11 gdb logs:
https://sourceforge.net/tracker/?func=detail&aid=2826310&group_id=240708&atid=1113651
Thanks for all your help!
I just commented on the bug you filed and suggested contacting the developers of libevent since this is a known issue in earlier versions of libevent.
But as a sanity check, you should probably double check that you are definitely not building or linking with the earlier version of libevent that was once installed on your system. Even if Thrift is printing out the correct version, it is still possible you are compiling with the old version and linking with the newer one. I would wipe the old version from your system and make sure that you have the correct libevent*.so and event.h. Then you should make clean and remake Thrift, and then do the same for Scribe.
I'm positive its the never version of libevent. The Fedora 11 build was built completely separate from the CentOS version and doesn't come with the older version of libevent. Regardless:
# ldd /usr/sbin/scribed
linux-vdso.so.1 => (0x00007fff2edff000)
libboost_system.so.4 => /usr/lib64/libboost_system.so.4 (0x0000003434000000)
libboost_filesystem.so.4 => /usr/lib64/libboost_filesystem.so.4 (0x0000003432800000)
libfb303.so => /usr/lib64/libfb303.so (0x0000003433800000)
libthrift.so.0 => /usr/lib64/libthrift.so.0 (0x0000003432c00000)
libthriftnb.so.0 => /usr/lib64/libthriftnb.so.0 (0x0000003433c00000)
libevent-1.4.so.2 => /usr/lib64/libevent-1.4.so.2 (0x0000003433000000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00000037c6400000)
libscribe.so => /usr/lib64/libscribe.so (0x00007f95208bd000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000037ca800000)
libm.so.6 => /lib64/libm.so.6 (0x00000037c6800000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000037c9400000)
libc.so.6 => /lib64/libc.so.6 (0x00000037c5c00000)
librt.so.1 => /lib64/librt.so.1 (0x00000037c8400000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00000037c8c00000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00000037c9000000)
/lib64/ld-linux-x86-64.so.2 (0x00000037c5800000)
# yum whatprovides \*libevent-1.4.so.2
updates/filelists_db
libevent-1.4.5-2.fc11.x86_64 : Abstract asynchronous event notification library
Repo : fedora
Matched from:
Filename : /usr/lib64/libevent-1.4.so.2
Right now I'm blaming TNonblockingServer (maybe something like http://issues.apache.org/jira/browse/THRIFT-536\).
Regardless I'm going to try to get the oldest version of thrift working with scribe as possible.
Any chance you can give me the closest revision of the public thrift repository you're using in your environment?
Thanks for your help.
I have used Scribe in production with a version of Thrift that was probably pretty close to the original version that was open sourced.
Keep in mind that to use an older version of thrift, you will need to make some small changes to Scribe. I believe you would just need to replace all occurrences of 'apache::thrift' in Scribe's src with 'facebook::thrift'.
I built Scribe with a 18 month old version of thrift and I ran into the same issue. I'm guessing this is something I'm doing wrong, but I've had to give up for now (rsyslog works fine for us).
Anyway, I wanted to thank you for your help and I'll try again in the next release or two.
Thanks,
Sorry we couldn't figure this out. You should also contact the maintainers of libevent and see if they are aware of this issue.
I will let you know if I hear of anyone else running into the same problem.
Silas, Sorry I wasn't able to figure out your stack smashing issues earlier. These issues are difficult to debug as they could be coming from any layer of code.
But I just ran across an old thread on the Thrift Jira page that may fix the problem you're running into:
http://issues.apache.org/jira/browse/THRIFT-510
And please post any future replies or discussion to Scribe's new home and discussion group:
http://github.com/facebook/scribe
http://groups.google.com/group/scribe-server
-Anthony