From: Ian H. <li...@ho...> - 2008-11-21 01:16:08
Joydeep Sen Sarma wrote:
> Dhruba can shed more light on Chukwa - I haven't looked at it myself.
>
> Does your mail imply that the log files are being tailed and written out to HDFS periodically (by an application outside Scribe)? If so, this is not so different from what we do (except there's more code to deal with catch-ups etc.).

We're not using Scribe at the moment (it and Chukwa weren't around when we started), but I am investigating moving to a more widely used solution.

And yes, we're using something called 'logtail' (http://www.drxyzzy.org/ntlog/logtail.c), which remembers where it was the last time it was run, does a seek() into the logfile, and continues from that point. There is a bit of log-rotation logic so it doesn't miss anything if it was stopped. This stream is then piped into a piece of code that runs on the webserver and writes it into HDFS, switching the filename every 15 minutes (making sure it is unique, etc.).

The downside is that we create files like crazy on HDFS: 4 x the number of webservers per hour, each about 30-180 MB (which is small, IMHO, for a file in HDFS) - hence why we are looking at doing something else ;-)
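For readers who want to reproduce this setup, a rough sketch of the offset-remembering tail plus 15-minute file bucketing described above is below. The state-file path, bucket layout and function names are illustrative assumptions, not the actual code in use; the real tail is the C logtail linked above, and the HDFS writer is a separate piece.

    import os
    import time

    STATE_FILE = "/var/run/weblog.offset"   # hypothetical location of the saved offset

    def tail_since_last_run(log_path):
        """Return bytes appended to log_path since the previous run,
        remembering the file offset between runs the way logtail does."""
        last_offset = 0
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                last_offset = int(f.read().strip() or 0)

        # If the file shrank, assume it was rotated and start over from 0.
        if os.path.getsize(log_path) < last_offset:
            last_offset = 0

        with open(log_path, "rb") as f:
            f.seek(last_offset)              # continue from where the last run stopped
            data = f.read()

        with open(STATE_FILE, "w") as f:
            f.write(str(last_offset + len(data)))
        return data

    def bucket_filename(host, now=None):
        """Unique-per-host filename that switches every 15 minutes."""
        window = int((now or time.time()) // 900) * 900
        return "/logs/%s/access_%s" % (host, time.strftime("%Y%m%d_%H%M", time.gmtime(window)))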
From: Ian H. <li...@ho...> - 2008-11-21 01:01:10
Hi Joydeep,

We are currently just doing the naive approach: writing log files directly into Hadoop as individual files, rotating them every 15 minutes to avoid the append problem. We use logtail on the client side to decouple the system, and a map/reduce job then aggregates the info 10-30 minutes later.

I was wondering if you had seen the recent contribution to Hadoop called 'Chukwa', and what your thoughts were on it.

Personally I'm looking at Scribe (and Chukwa) for realtime logging and decision systems.
From: Joydeep S. S. <js...@fa...> - 2008-11-21 00:49:58
Dhruba can shed more light on Chukwa - I haven't looked at it myself.

Does your mail imply that the log files are being tailed and written out to HDFS periodically (by an application outside Scribe)? If so, this is not so different from what we do (except there's more code to deal with catch-ups etc.).
From: Joydeep S. S. <js...@fa...> - 2008-11-21 00:14:09
Hi folks,

I can shed some light on Scribe and HDFS/Hadoop integration at FB:

- When we (actually Avinash - who's leading the Cassandra project now) started out, we investigated writing log files from Scribe directly to HDFS (using the libhdfs C++ API). However, there were a few issues with this approach that steered us in a different direction:

  o HDFS uptime: there have been periods of sustained downtime and we can't rule that out in the future. There are many reasons - software upgrades being the most common. Buffering data in Scribe for such long periods didn't seem like a very good route.

  o Lack of append support in HDFS in the early days.

  o Desire to build loosely coupled systems (otherwise we would have to upgrade Scribe servers with a new libhdfs every time we had a software upgrade on HDFS).

  o Flexibility in transforming data while copying into HDFS (more on this later).

- Currently we have an rsync-like model to pull data from Scribe to HDFS:

  o Scribe writes data to NetApp filers. These filers are high-speed buffers for the most part.

  o We have 'copier' jobs that 'pull' data from Scribe output locations in these filers to HDFS. They maintain file offsets for copied data in a registry, so the jobs can be invoked periodically and copying can be continuous.

  o 'Copier' jobs can run in continuous mode, or can be invoked to copy (or re-copy) data from older dates (this can be important if incorrect data was logged or data shows up late).

  o 'Copier' jobs are map-only jobs in Hadoop - this means that we can increase the copy parallelism if required. For example, if we are falling behind, or HDFS was down for a long time and there's a lot of accumulated data, the copiers will dial up the parallelism (up to a maximum, so as not to trip up the filers completely).

- Data 'copied' into HDFS is eventually 'loaded' into Hive (this is our open-source data warehousing layer on top of Hadoop). Usually this loading is a nightly process, but in a small number of cases we load data at hourly granularity for semi-real-time applications. Application processing over Scribe log sets typically uses Hive QL.

- One interesting angle is 'close of books'. Scribe itself does not provide any hard guarantees on when all data for a given date will have been logged. However, several applications (especially revenue-sensitive ones) need a hard deadline ("invoke me when all data for a given day has been logged"). For such applications, the loading process typically waits until 2am or so at night (on day N+1) and then scans data from days N-1, N and N+1 to find all the relevant data for day N (using the unix timestamps that are typically logged with the data). This is the data that's loaded into date partition N for the relevant Hive table. Clearly the 2am boundary is arbitrary, and we will move towards more heuristic-based ways of determining when data for a given date is (almost) complete.

- We have instances of text, JSON and Thrift data sets logged via Scribe. For the case of Thrift (particularly when there's a Thrift file with heterogeneous records) we do some transforms in the copying process to make the subsequent loading easier. Thrift data also shows up in TFileTransport format - and this cannot be parallel-processed by Hadoop natively (although it wouldn't be so hard to arrange that as well) - so we always convert Thrift data into SequenceFiles as it's copied into Hadoop.

There are several pieces here that are not open-sourced and, depending on community interest, can be made available: the Scribe-to-HDFS copier code, for one. TFileTransport's Java implementation is also not open-sourced (since there is constant talk of superseding it with newer, better transports).

Please let us know if there are more questions and we would be happy to answer.

Joydeep

------ Forwarded Message
From: Johan Oskarsson <jo...@os...>
Date: Wed, 19 Nov 2008 09:36:38 -0800
To: <scr...@li...>
Subject: [Scribeserver-users] Scribe and Hadoop

I understand Scribe is being used to put logs into the Hadoop HDFS at Facebook. I'd love to hear more about how that works and how to replicate the setup you guys have.

/Johan

------ End of Forwarded Message
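To make the 'close of books' step concrete, the day-N selection over the N-1/N/N+1 scan boils down to a timestamp filter like the sketch below. The record layout and the way records reach the filter are illustrative assumptions, not the actual loader code.

    DAY = 86400  # seconds in a day

    def close_books(records, day_n_start):
        """Pick out the records that belong to calendar day N from a scan over
        the files copied for days N-1, N and N+1, keyed on the unix timestamp
        that is logged with each record."""
        day_n_end = day_n_start + DAY
        for ts, payload in records:
            if day_n_start <= ts < day_n_end:
                yield ts, payload

    # Example: records gathered from three days' worth of copied files.
    day_n = 1227052800  # 2008-11-19 00:00:00 UTC
    scanned = [
        (day_n - 10, "logged late, belongs to day N-1"),
        (day_n + 100, "belongs to day N"),
        (day_n + DAY + 5, "belongs to day N+1"),
    ]
    partition_n = list(close_books(scanned, day_n))  # keeps only the middle record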
From: Johan O. <jo...@os...> - 2008-11-19 18:04:32
I understand Scribe is being used to put logs into the Hadoop HDFS at Facebook. I'd love to hear more about how that works and how to replicate the setup you guys have.

/Johan
From: Anthony G. <an...@fa...> - 2008-11-11 20:57:30
Yes, when rotate_period is 'daily', the filename will include the date. How about you try it out?

-Anthony

On 11/11/08 9:56 AM, "Alex Loddengaard" <al...@cl...> wrote:
> I'm a little confused about rotate_period. Will setting this to daily change the filename format that Scribe creates? So instead of category_00001, I would see something like category_todaysdate_00001?
From: Alex L. <al...@cl...> - 2008-11-11 17:56:19
Great information, Anthony. Thanks!

I'm a little confused about rotate_period. Will setting this to daily change the filename format that Scribe creates? So instead of category_00001, I would see something like category_todaysdate_00001?

Thanks again.

Alex
From: Anthony G. <an...@fa...> - 2008-11-10 21:11:07
After writing to category_99999, Scribe will write to category_100000. You can test this out yourself by creating a file named category_99999 and then logging until it rotates to a new file. The file numbers go as high as INT_MAX.

Also, I recommend setting a File Store's rotate_period to "daily" so that every day the file number will reset with a new date. And you can configure the File Store's max_size to determine how often Scribe will write to a new file. By setting either of these configuration options, it should be easier to manage your log files.

-Anthony

On 11/9/08 3:27 PM, "Alex Loddengaard" <al...@cl...> wrote:
> Scribe outputs files of the form "category/category_XXXXX." What happens when "XXXXX" is "99999"?
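To illustrate, a minimal File Store stanza using these two options might look like the sketch below. The port, path and size are placeholders, and the full option list lives on the Scribe wiki rather than here.

    port=1463
    max_msg_per_second=2000000
    check_interval=3

    <store>
    category=default
    type=file
    fs_type=std
    file_path=/var/log/scribe
    # daily rotation resets the file number each day and puts the date in the filename
    rotate_period=daily
    # start a new file once the current one reaches roughly 100 MB
    max_size=100000000
    </store>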
From: Alex L. <al...@cl...> - 2008-11-09 22:27:48
Scribe outputs files of the form "category/category_XXXXX." What happens when "XXXXX" is "99999"?

Thanks.

Alex
From: Anthony G. <an...@fa...> - 2008-11-06 22:40:43
Alex,

See the Scribe wiki for more information about configuration options: http://scribeserver.wiki.sourceforge.net/

If your Scribe instance is using up a ton of memory, this is most likely due to too many simultaneous connections trying to log large amounts of data. Scribe is built using a Thrift nonblocking server, and Thrift will attempt to allocate a buffer for every open connection. (Thrift developers are currently working on a feature to help manage the number of Thrift connections, but this has not yet been released.)

The only way to reduce this memory usage is to either limit the number of simultaneous connections or reduce the amount of data sent in each call to Scribe. I'm not sure how you have configured Scribe, but turning on 'use_conn_pool=yes' may help. (You would need to use connection pooling on all the Scribe servers that are sending messages to the Scribe server that is using too much memory.) Feel free to send me a copy of the Scribe configuration(s) you are using and I will take a look.

-Anthony
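Connection pooling is set on the network store of each sending server; a sketch of such a stanza, with a placeholder hostname and port, would be roughly:

    <store>
    category=default
    type=network
    remote_host=central-scribe.example.com
    remote_port=1463
    # share a pooled connection to the downstream server instead of one per store
    use_conn_pool=yes
    </store>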
From: Alex L. <al...@cl...> - 2008-11-06 22:03:22
Hey Anthony,

Thanks for the info! It turns out that our downed nodes were a result of something entirely other than Scribe. Scribe had nothing to do with it, but at the time we thought it did.

Thanks again! This new wiki looks great.

Alex
From: Alex L. <al...@cl...> - 2008-10-31 20:51:00
First off, thanks for contributing Scribe! It's an awesome service, and we already have it up and running to collect Hadoop log messages.

Can you provide a list of configuration options along with descriptions for each? The example configurations are not entirely self-explanatory.

Also, what's a typical Scribe server's memory usage? A few of our nodes just went down due to swap and memory being entirely used. I blame this insane memory usage on a bad configuration file, but I was hoping someone else could comment on other possible reasons why Scribe would consume so much memory. I'm nearly 100% certain that no other process was consuming a large amount of memory.

Thanks!

Alex
From: Anthony G. <an...@fa...> - 2008-10-23 19:24:52
Welcome to the Scribe Users mailing list. This list is for all users of Scribe.