Re: [Exist-development] new functionality: Execution Pipeline

2010/1/25 Thomas White <tho...@gm...>:
> Adam,
>
> From my point of view, there are some important differences. Scheduling
> functions can be used to execute a function asynchronously if the execution
> time is set to, say, 1 second after the current time, but this is where the
> similarity ends.
>
> 1. So far all eXist functions are executed synchronously (except scheduled
> jobs and triggers). If I need to get data from, say, 25 or 250 remote sources,
> at the moment we have to do it one data chunk at a time, one after
> another. What if we need to fetch 5000 RSS feeds?
> We do need asynchronous commands.
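
To illustrate the difference between fetching sources one after another and fetching them concurrently, here is a minimal sketch in Python rather than XQuery. It is only an illustration of the idea, not eXist code: `fetch` is a hypothetical stand-in that sleeps instead of making a real HTTP request, and `fetch_all`, the source names and the worker count are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def fetch(source):
    """Hypothetical stand-in for a remote fetch; a real version would do an HTTP GET."""
    time.sleep(0.05)  # simulated network latency
    return f"data from {source}"


def fetch_all(sources, max_workers=25):
    """Fetch every source concurrently and return a {source: result} dict."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; each future is mapped back to its source.
        futures = {pool.submit(fetch, s): s for s in sources}
        # Collect results in completion order, not submission order.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    sources = [f"feed-{i}" for i in range(250)]
    start = time.perf_counter()
    results = fetch_all(sources)
    elapsed = time.perf_counter() - start
    print(len(results), "sources fetched in", round(elapsed, 2), "s")
```

With 25 workers, the 250 simulated fetches overlap instead of queueing behind each other, which is the behaviour the request above is asking for.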

Ah okay, I think I understand now. Yes, it seems like an interesting
feature to add to eXist.

I guess we could add this onto the end of the roadmap if the other
developers agree; however, there is some work involved in this and I
am sure we are all pretty busy.

So how soon do you need this functionality?

> 2. execute-before function - the ability to conditionally execute a job with
> a variable delay.
> 3. callback function - the ability to call specific code after the job is
> done.
> 4. The ability to group jobs in batches: batch functions and a batch callback
> function can provide a very powerful way to perform an asynchronous final
> operation on a group of asynchronous functions.
>
> So far we have been thinking synchronously only - we need all data on the
> server and we can produce the whole page at once. But it gets much
> more interesting and much more powerful when we think asynchronously on the
> client - we can create and/or partially update many areas on the screen
> simultaneously. Take, for example, use case 3, federated search on a web client.
> Imagine an application where the web client has to display results from 25
> servers, each of which has a different latency. Whenever any of the data
> becomes available, it is displayed without a page refresh.
>
> Some will say this is just AJAX in action, but I will say it goes to a very
> different level. Why? Because we have progress updates during the request,
> we have batch operations, and especially the ability to cancel a whole batch of
> requests when needed.
>
> Let's take an application where the user will see results from 50 servers on
> the same screen asynchronously. It is an application that gets quotes for car
> insurance, and it takes between 5 and 45 seconds for different quotes to be
> completed.
>
> Case 1. Traditional AJAX approach - Every quote area on the screen has its
> own AJAX request for a specific quote provider. When the user clicks on the
> button, then:
>
> The browser opens 50 TCP connections to the server until all quotes are received
> (Windows limits up to 2 simultaneous TCP connections to a domain, which can
> introduce additional delay).
> On the server, 50 synchronous TCP connections are opened to the quote providers.
> Whenever a data chunk is received, two TCP ports will be released on the
> server.
> On the server, this user so far will occupy 100 TCP ports =
> 50 (browser - server) + 50 (server - quote providers).
>
> After 10 seconds we have received data for, say, 10 of the quotes, and the user
> decides to amend something and presses the "Get Quotes" button again. Then:
>
> The browser opens 50 new synchronous requests.
> On the server, this user will now occupy 180 TCP ports = 50 (new browser
> requests) + 40 (incomplete old browser requests) + 50 (new server - quote
> provider connections) + 40 (incomplete old server - quote provider connections).
> There is nothing to cancel the old 40 connections to the browser and the 40
> incomplete server - quote provider connections except a connection timeout.
> As a result, the server can run out of TCP ports very quickly, and the data
> will be delivered slowly, especially if more users click the quote button
> early. It gets worse very quickly when users press the quote
> button prematurely or refresh the page. Scalability is
> severely affected by user behaviour and by the number of users.
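
The port arithmetic in Case 1 can be checked with a few lines of Python. This is only a worked example of the numbers quoted above; the function name `ports_in_use` and its parameters are made up for the illustration, and it assumes, as the text does, that every unfinished quote keeps holding both of its server-side connections until a timeout.

```python
def ports_in_use(requests_per_click, completed_before_reclick):
    """Return (ports after the first click, ports after a premature second click).

    Each quote holds two server-side connections: browser-server and
    server-provider. Nothing cancels the unfinished ones from the first click.
    """
    per_click = 2 * requests_per_click
    stale = 2 * (requests_per_click - completed_before_reclick)
    return per_click, per_click + stale


first, second = ports_in_use(50, 10)
print(first, second)  # 100 180
```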
>
> Case 2. We have a batch of asynchronous quote requests on the server.
> There will be no long-waiting requests on TCP connections on the client.
> The "Get Quotes" button will call an XQuery and quickly receive the batchID and
> an initial estimated delay. For simplicity, let's say the client will call
> getBatchStatus every second, closing the TCP connection after receiving the
> data. When any of the jobs is complete, a quick call to fetch the data is
> made. Then:
>
> On the browser there are no long-waiting open TCP connections. We have a
> quick fetch of the status and one call to fetch received data every second,
> and then all TCP ports are closed.
> On the server, we have 50 (server - quote providers) TCP connections + 1
> every second from the browser, closed immediately, + 1 connection for the
> received data (all quotes delivered in one call), closed immediately.
>
> Now when the user clicks the "Get Quotes" button early, we first
> cancel all incomplete calls to quote providers on the server by calling
> closeAll, releasing all TCP ports, and then we make the 50 new requests. The
> result:
>
> On the browser we still have a call or two every second. No hanging TCP ports,
> no timeouts.
> On the server we have exactly the same 50 (server - quote providers) TCP
> connections + 1 or 2 every second from the browser.
> The scalability of the server is not affected by user behaviour at all,
> and it can handle many more users.
>
>
> 5. Use case 4, federated search on the server, is I believe very
> important if we want to query more than one server in real time.
> This scenario can do a pretty good job for a while, until real eXist
> clustering is ready.
>
> I hope this answers your question.
>
> Regards,
> Thomas
>
> ------
>
> Thomas White
>
> Mobile:+44 7711 922 966
> Skype: thomaswhite
> gTalk: thomas.0007
> Linked-In:http://www.linkedin.com/in/thomaswhite0007
> facebook: http://www.facebook.com/thomas.0007
>
>
>
>
> 2010/1/25 Adam Retter <ad...@ex...>:
>> Can you not already do most (if not all) of this by Scheduling XQuery
>> jobs with eXist's Scheduler?
>>
>>
>> 2010/1/25 Thomas White <tho...@gm...>:
>>> I would like to propose a new functionality that I believe could be very
>>> beneficial for eXist users:
>>>
>>> Asynchronous Execution Pipeline
>>>
>>> This is a mechanism for executing a number of asynchronous jobs
>>> simultaneously. It is very useful for executing long-running jobs or in
>>> cases where it is impossible to predict how long an operation will take.
>>> Every job will run as a separate thread, and the jobID and the
>>> estimated delay will be returned immediately to the caller.
>>>
>>> Use cases:
>>>
>>> 1. Executing long running queries
>>>
>>> A callback function will be used to store the result at a location
>>> determined by the function-parameters.
>>> A client periodically checking the status of this job will take the next
>>> action.
>>>
>>> 2. Fetching data from (large) number of remote URLs
>>>
>>> An XQuery or a scheduled job creates XX execution pipeline entries for
>>> each remote server.
>>> Callback functions are used to store the results at a location determined
>>> by the function-parameters.
>>> The batch callback function will combine the results and trigger the next
>>> action.
>>>
>>> 3. Federated search, on a web client
>>>
>>> A web client sends a search request to a local XQuery, which creates XX
>>> execution pipeline entries for each remote server and returns a batch-id
>>> to the web client.
>>> The web client periodically queries the status of the jobs with this
>>> batch-id, and when any of the jobs has status 'completed', the web client
>>> gets the result for that job and displays it on the screen asynchronously.
>>>
>>> 4. Federated search, on the server
>>>
>>> A web client sends a search request to a local XQuery, which creates XX
>>> execution pipeline entries for each remote server and returns a batch-id
>>> to the web client.
>>> Every job callback function will save the result at a location determined
>>> by the function-parameters. The batch callback function will combine the
>>> results.
>>> The web client periodically queries the status of this batch, and when the
>>> batch is completed, the web client gets the result and displays the combined
>>> result set on the screen asynchronously.
>>>
>>> 5. Data Replication
>>>
>>> An XQuery or a scheduled job creates XX execution pipeline entries for
>>> each remote server.
>>> The execute-before function will identify what needs to be replicated.
>>> The main function does the replication.
>>> The batch callback function moves the replication marker.
>>>
>>> A call to the Execution Pipeline:
>>>     execution-pipeline:addJob( function, function-parameters, pipeline-parameters )
>>> returning:
>>>     handlerID, estimated-delay, function-parameters
>>>
>>>
>>> To get the result we need to call another function:
>>>     execution-pipeline:getJobResults( handlerID, autoClose )
>>> returning either:
>>>     the result data set. If autoClose is true, then close the job and
>>>     release all used resources.
>>> or
>>>     same handlerID, new-estimated-delay, function-parameters
>>> or
>>>     unknown-handlerID error
>>>
>>> execution-pipeline:getJobStatus( handlerID )
>>> returns
>>>         status of the job, function-parameters for this job
>>>
>>> execution-pipeline:getBatchStatus(  batch-ID )
>>> returns
>>>         the status for all jobs from a particular batch ID.
>>>
>>>
>>> execution-pipeline:getStatus(  )
>>> returns
>>>         the status for all jobs.
>>>
>>>
>>> execution-pipeline:closeJob( handlerID )
>>> execution-pipeline:closeBatch( batchID )
>>> execution-pipeline:closeAll( )
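
To make the shape of the proposed calls more concrete, here is a toy in-memory model in Python. It is a sketch under stated assumptions, not an eXist implementation: the class `ExecutionPipeline`, the snake_case method names, the status strings and the use of a thread pool are all hypothetical stand-ins for the proposed `execution-pipeline:*` functions. Estimated delays, callbacks and `getStatus()` are left out to keep it short.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class ExecutionPipeline:
    """Toy model of the proposed addJob / getJobStatus / close* calls."""

    def __init__(self, max_workers=8):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._ids = itertools.count(1)
        self._jobs = {}  # handler_id -> (future, batch_id)
        self._lock = threading.Lock()

    @staticmethod
    def _status(future):
        if future.cancelled():
            return "cancelled"
        if not future.done():
            return "pending"  # queued or still running
        return "failed" if future.exception() else "completed"

    def add_job(self, function, function_parameters, batch_id=None):
        """Start the job in the background and return its handler ID at once."""
        with self._lock:
            handler_id = next(self._ids)
            future = self._pool.submit(function, **function_parameters)
            self._jobs[handler_id] = (future, batch_id)
        return handler_id

    def get_job_status(self, handler_id):
        with self._lock:
            future, _ = self._jobs[handler_id]  # KeyError models unknown-handlerID
        return self._status(future)

    def get_batch_status(self, batch_id):
        with self._lock:
            return {h: self._status(f)
                    for h, (f, b) in self._jobs.items() if b == batch_id}

    def get_job_results(self, handler_id, auto_close=False):
        """Return the result if the job completed, otherwise None."""
        with self._lock:
            future, _ = self._jobs[handler_id]
            if self._status(future) != "completed":
                return None
            if auto_close:
                del self._jobs[handler_id]
        return future.result()

    def close_job(self, handler_id):
        with self._lock:
            future, _ = self._jobs.pop(handler_id)
        future.cancel()  # only prevents jobs that have not started yet

    def close_batch(self, batch_id):
        for handler_id in list(self.get_batch_status(batch_id)):
            self.close_job(handler_id)

    def close_all(self):
        with self._lock:
            jobs, self._jobs = self._jobs, {}
        for future, _ in jobs.values():
            future.cancel()
```

A caller would use `add_job(fn, {"x": 21}, batch_id="b1")`, poll `get_batch_status("b1")`, and fetch with `get_job_results(handler_id, auto_close=True)`. Note that in this sketch closing a job cannot interrupt work that is already running; a real implementation would need a cooperative cancellation mechanism for that.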
>>>
>>>
>>> function-parameters:
>>>
>>> job-statistic-id: used to keep the average execution time for this
>>> function. average time = (previous-average-time + last-execution-time)/2.
>>> A URL with specific parameters could be used as an ID.
>>> execute-before function: when provided, it will be called before calling
>>> the main function for this job. If the result is 0, proceed with the main
>>> function; otherwise use the result as the number of milliseconds to put this
>>> job to sleep before trying again.
>>> callback function: when provided, callback-function will be called as
>>> callback-function( handlerID, result, function-parameters ). If it
>>> returns true(), the job will be closed.
>>> any other parameters that may be used by the callback function.
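
The function-parameters above can be sketched as a small job runner. This is a hedged illustration in Python, not part of the proposal: `run_job`, `update_average` and the `averages` dict are hypothetical names, and seeding the average with the first measured time is an assumption, since the proposal does not say what the initial value should be.

```python
import time

averages = {}  # job-statistic-id -> average execution time in seconds


def update_average(stat_id, last_time):
    """average = (previous average + last execution time) / 2, as proposed."""
    previous = averages.get(stat_id)
    # Assumption: the first run simply seeds the average.
    averages[stat_id] = last_time if previous is None else (previous + last_time) / 2
    return averages[stat_id]


def run_job(handler_id, main, params, execute_before=None, callback=None, stat_id=None):
    """Run one job; return True if the callback asked for the job to be closed."""
    # execute-before: 0 means proceed, anything else is a delay in milliseconds.
    while execute_before is not None:
        delay_ms = execute_before(params)
        if delay_ms == 0:
            break
        time.sleep(delay_ms / 1000)

    start = time.perf_counter()
    result = main(params)
    if stat_id is not None:
        update_average(stat_id, time.perf_counter() - start)

    # callback( handlerID, result, function-parameters ); true() closes the job.
    if callback is not None:
        return bool(callback(handler_id, result, params))
    return False
```

Because each update halves the weight of all earlier runs, this formula behaves as an exponential moving average with a factor of 0.5 rather than an arithmetic mean over all runs, so the estimate follows recent execution times closely.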
>>>
>>> pipeline-parameters:
>>>
>>> batch-ID: used to group jobs into a batch.
>>> batch-callback-function: called when all jobs from the batch are
>>> completed.
>>> any other parameters that may be used by the callback function.
>>>
>>> Any comments?
>>>
>>> Thomas
>>>
>>>
>>> ------
>>>
>>> Thomas White
>>>
>>> Mobile:+44 7711 922 966
>>> Skype: thomaswhite
>>> gTalk: thomas.0007
>>> Linked-In:http://www.linkedin.com/in/thomaswhite0007
>>> facebook: http://www.facebook.com/thomas.0007
>>>
>>>
>>
>>
>>
>> --
>> Adam Retter
>>
>> eXist Developer
>> { United Kingdom }
>> ad...@ex...
>> irc://irc.freenode.net/existdb
>>
>
>

-- 
Adam Retter

eXist Developer
{ United Kingdom }
ad...@ex...
irc://irc.freenode.net/existdb
