Data & Result Slice Life Cycle

Doron Tsur crimsonsky

Out of the many challenges we face in this project (networking, coding, getting familiar with new technologies, etc.), one stands at the core: the ability to divide, distribute, and process the data the algorithm uses. To understand how this works, we first need to understand two basic concepts of the Yael cloud platform: the Data Slice and the Result Slice.

Data and result slices in a nutshell

A Data Slice is a package of input data and metadata that is sent from the Lead Server to a Worker Server, where a specific algorithm processes it into a Result Slice. The Result Slice is then sent back to the Lead Server and stored in the DB.

Data dividing - Multiplexing

The Lead Server (LS) holds, or has access to, all of the algorithm's input data. One of its main tasks is to divide this input data into manageable slices and send them to the Worker Servers (WS) for processing. Since the algorithm is user generated, the LS's main limitation is that it doesn't "know" the input data: it has no context in which it could divide the data, so this knowledge needs to come from the user.

The user is required to create a class that divides the input data. This class provides us with a byte buffer that we call the Raw Data Slice.

Since we cannot predict what data structures a user-generated algorithm will use, we expect the user's algorithm to be able to convert a Raw Data Slice from its byte-buffer form into whatever class or data structure the algorithm works with.
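As a concrete illustration of the two directions described above, here is a minimal sketch in Python. The class and function names (`UserMultiplexer`, `decode_raw_slice`) and the choice of packing floats as little-endian doubles are assumptions for the example, not the platform's actual API.

```python
import struct

class UserMultiplexer:
    """Divides the algorithm's input (here, a list of floats) into
    Raw Data Slices: plain byte buffers the Lead Server can ship around."""

    def __init__(self, values, slice_len):
        self.values = values
        self.slice_len = slice_len

    def raw_slices(self):
        for i in range(0, len(self.values), self.slice_len):
            chunk = self.values[i:i + self.slice_len]
            # Pack the chunk as little-endian doubles -> one Raw Data Slice.
            yield struct.pack("<%dd" % len(chunk), *chunk)

def decode_raw_slice(buf):
    """Worker-side conversion: turns the byte buffer back into the
    algorithm's own data structure (a list of floats in this sketch)."""
    return list(struct.unpack("<%dd" % (len(buf) // 8), buf))
```

The user writes both halves: the Lead Server only ever sees opaque byte buffers, while the algorithm on the Worker Server sees its native data structure.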

All Raw Data Slices generated for a specific algorithm must share some basic qualities:
1. Size: if the algorithm requires fixed-size input, all Raw Data Slices generated for it will be of that fixed size; if the algorithm accepts dynamically sized input, there is no restriction on slice size.
2. Self-containment: a Raw Data Slice must contain, at the very least, all of the input data required for one full iteration of its designated algorithm. An algorithm may iterate several times on the input data contained within a single Raw Data Slice, but it is forbidden for an algorithm to require more than one Raw Data Slice for a single iteration.

Data Slice Creation

The system Multiplexer creates Data Slices and stores them in the Data Slice queue for the DnC to distribute.

Each Data Slice contains a single Raw Data Slice in the form of a byte buffer, along with any metadata required to transmit the data to a Worker Server and process it there.

Slice metadata includes:
- Slice ID: a unique numeric designation for each slice created from the same input data.
- Trio ID: a unique numeric designation for each algorithm present on the Lead Server.
- Slice Size: the size of the Data Slice in bytes.

Any Data Slice can be uniquely identified by the pair {Slice ID, Trio ID}.
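The metadata listed above can be sketched as a small value type. The field names below are illustrative assumptions based on this description, not the platform's actual definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataSlice:
    slice_id: int    # unique per slice created from the same input data
    trio_id: int     # unique per algorithm on the Lead Server
    raw_data: bytes  # the Raw Data Slice as a byte buffer

    @property
    def slice_size(self) -> int:
        # Slice Size metadata: Data Slice payload size in bytes.
        return len(self.raw_data)

    @property
    def key(self):
        # The unique identifying pair {Slice ID, Trio ID}.
        return (self.slice_id, self.trio_id)
```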

In addition to the Data Slice, the Multiplexer also creates an Assembly Object.
This object contains the data required to reassemble the Result Slice that will be produced when this Data Slice is processed.
The Assembly Object is stored in the DB and linked to a specific {Slice ID, Trio ID} pair.
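The link between an Assembly Object and its identifying pair can be sketched as a keyed store. The dict-backed `AssemblyStore` below is a stand-in for the DB; its name and methods are assumptions for illustration.

```python
class AssemblyStore:
    """Stand-in for the DB table linking Assembly Objects to
    their {Slice ID, Trio ID} pair."""

    def __init__(self):
        self._db = {}

    def save(self, slice_id, trio_id, assembly_obj):
        # Link the Assembly Object to the identifying pair.
        self._db[(slice_id, trio_id)] = assembly_obj

    def load(self, slice_id, trio_id):
        # Loaded on demand later, during de-multiplexing.
        return self._db[(slice_id, trio_id)]
```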

Distribution

The DnC waits for the Data Slice queue to start filling with Data Slices.

Each slice is sent to one of the available Worker Servers linked to the system. The destination Worker Server is chosen at random, weighted by each Worker Server's rank; a Worker Server's rank is determined by the DnC Manager.
The DnC records which Data Slice was sent to which server and at what time.
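Rank-weighted random selection can be sketched as follows. This assumes ranks are positive weights (higher rank means a higher chance of being picked); the function name and the dict shape are illustrative, not the DnC's actual interface.

```python
import random

def pick_worker(ranks, rng=random):
    """Choose a destination Worker Server at random, weighted by rank.

    ranks: dict mapping worker-server id -> rank (a positive weight
    assigned by the DnC Manager)."""
    workers = list(ranks)
    weights = [ranks[w] for w in workers]
    return rng.choices(workers, weights=weights, k=1)[0]
```

With ranks {"ws1": 3, "ws2": 1}, ws1 is picked roughly three times as often as ws2 over many draws.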

Processing a Data Slice

Upon receiving a new Data Slice, the Worker Server stores it in its own incoming queue.
When the Data Slice reaches the top of the queue, the Algorithm Runner extracts the Raw Data Slice from it and converts it into whatever data structure the algorithm uses.

The algorithm then runs over the data, eventually producing some form of result data. This data is saved as a byte buffer, which we call a Raw Result Slice.

The Algorithm Runner creates a Result Slice by copying the metadata from the Data Slice and adding the Raw Result Slice as a byte buffer.

The Result Slice is then sent back to the Lead Server for storage and assembly.
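The Worker Server steps above can be sketched end to end. Here a Data Slice and a Result Slice are modeled as plain `(slice_id, trio_id, payload)` tuples, and the "user algorithm" simply sums a buffer of doubles; all of these choices are assumptions for the example.

```python
import struct

def run_algorithm(data_slice):
    """Algorithm Runner sketch: decode the Raw Data Slice, run one
    iteration of the user algorithm, and wrap the Raw Result Slice
    with the metadata copied from the Data Slice."""
    slice_id, trio_id, raw = data_slice
    # Convert the Raw Data Slice into the algorithm's data structure.
    values = struct.unpack("<%dd" % (len(raw) // 8), raw)
    # The sample "user algorithm": sum the values.
    result = sum(values)
    # Save the result data as a byte buffer -> the Raw Result Slice.
    raw_result = struct.pack("<d", result)
    # Result Slice = copied metadata + Raw Result Slice.
    return (slice_id, trio_id, raw_result)
```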

Collection

The DnC receives Result Slices from the Worker Servers and stores them in the DB and in the result queue.

Assembling - De-Multiplexing

The system De-Multiplexer reads Result Slices from the queue, extracts the Raw Result Slice, and passes it to the user de-multiplexer. On demand, it also loads the Assembly Object that matches the Result Slice and passes it along as well.

At this point the user de-multiplexer has everything it needs to begin assembling the complete result, which marks the end of the Data Slice life cycle.
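The de-multiplexing loop described above can be sketched as follows. Result Slices are modeled as `(slice_id, trio_id, raw_result)` tuples and the DB as a plain dict keyed by the identifying pair; the function names are illustrative assumptions.

```python
from collections import deque

def demultiplex(result_queue, assembly_db, user_demux):
    """System De-Multiplexer sketch: pop Result Slices off the queue,
    load the matching Assembly Object on demand, and hand both to the
    user de-multiplexer for final assembly."""
    while result_queue:
        slice_id, trio_id, raw_result = result_queue.popleft()
        assembly = assembly_db.get((slice_id, trio_id))
        user_demux(raw_result, assembly)
```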


Related

Wiki: Topic List
Wiki: Yael Cloud platform - Spec