From: Rob R. <rr...@mc...> - 2003-03-28 21:31:19

On Fri, 28 Mar 2003, Bob Arctor wrote:
> apart from moshe's ideas i imagine both routines working together this way
> (or at least i think it's a goal of the code merging) ...
>
> On Friday 28 March 2003 01:57, Rob Ross wrote:
> > You should be aware that PVFS doesn't keep mmap()ed files consistent
> > across different nodes. This isn't an access mode that we're interested
> > in, and support for this type of consistency is nontrivial, so that won't
> > happen any time soon (in PVFS1 or PVFS2). Sorry!
>
> but oM needs it. this is a whole different story, and i think you should
> not worry about it. for me the most important part is that both projects
> could be merged together, so some code would not be duplicated, some
> include files shared, and then maybe such code will arise.

That's a nice idea Bob, and there are certainly areas where the two projects
overlap, but I think that there are enough differences between the projects
in terms of end goals, architecture, etc. that this would be a *lot* of
trouble, and no *fun* for anyone!

> also the load balancer from oM could 'suggest' how pvfsd should stripe
> data...

i think that it would be cool for the load balancer to have the ability to
suggest data distributions to the client-side pvfs software. we should talk
about an api for this.
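just to make the idea concrete, here's a rough sketch of what such a hint
interface might look like on the client side. nothing like this exists in
pvfs today; the struct and function names below are made up purely for
illustration, and the real api would have to be hashed out on the list.

/* hypothetical distribution-hint interface -- illustration only */

#include <stdint.h>

/* how the load balancer would like new files under a given
 * directory to be striped across the i/o servers */
struct pvfs_dist_hint {
    uint32_t stripe_size;   /* bytes per stripe unit */
    uint32_t server_count;  /* number of i/o servers to stripe over */
    int32_t  first_server;  /* preferred starting server, -1 = don't care */
};

/* called by the load balancer (or any external tool); the client-side
 * library would apply the hint to files subsequently created under 'path'.
 * returns 0 on success, -1 on error.
 *
 * example use: pvfs_set_dist_hint("/pvfs/scratch", &hint);
 */
int pvfs_set_dist_hint(const char *path, const struct pvfs_dist_hint *hint);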
> the goal is to have memory (oM's memsorter job), shared memory (pvfs
> striper job) and files (pvfs again) distributed across the nodes, so that
> if a process executes it can migrate (continue execution and take the
> results of calculations - mmapped data - memsorter's job, and shared mem
> and files - pvfs job) across the nodes instead of swapping out the data
> and retrieving it from either the network or swap.

again, this shared writable mmap thing just isn't going to happen in the
context of pvfs. sorry.
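just so we're talking about the same thing, below is the access pattern i
mean. it's plain posix code, nothing pvfs-specific; the path is only a
placeholder and it assumes the file already exists and is at least 4096
bytes long. this is exactly the usage pvfs won't keep coherent across nodes.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/pvfs/shared.dat", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* node A does this ... */
    strcpy(p, "hello from node A");
    msync(p, 4096, MS_SYNC);   /* flushes this node's mapping, but ... */

    /* ... a process on node B with its own mapping of the same file may
     * never observe the update.  explicit read()/write() through the
     * file system is the supported way to share data between nodes. */

    munmap(p, 4096);
    close(fd);
    return 0;
}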
> this way you can e.g. fork processes until they exceed available memory,
> then - instead of swapping - a process should migrate to the next node,
> forking processes there. pvfs's job is to stripe filesystem data (and
> shared memory) so that it follows the processes - to avoid the situation
> of having data only on the 'master' node and retrieving everything via
> the network, while memory (and filesystems + shared memory) on the next
> node is free.

providing a shared global address space is not something that we are
interested in as part of this project. if anything, we're looking at even
*more* loose consistency models than before! software distributed shared
memory is a hard, hard problem. i don't want to try to tackle that. other
groups have; maybe there's a third piece to this problem?

> also imagine one big process, e.g. parsing a file. the process is 100M
> big, the file is 100M big. machines have 50M of ram and 50M of disk. so
> the process starts parsing on the first node - then migrates to the
> second node and continues. a simple 'grep' routine doesn't need more ram
> (of course this is just an example; nodes can also have ram available
> for results) so it will not transfer any data via the network, it'll
> just migrate and continue on the next node - and the nodes don't have to
> exchange any data (except when mmapping data or striping data).
> of course more complicated software will exchange data, but the network
> bandwidth saved on simple processes will let their performance increase.
>
> > I'm not sure why you would want to migrate PVFS servers; surely given
> > their need for local data access it makes the most sense to leave them
> > local to the data? Or am I missing the point?
>
> they can access metadata via the /mfs/ filesystem, it's a cluster.

ah. no, if you have everyone just start accessing the metadata through some
shared file system then everything is going to break, because you need a
consistent shared file system to do that in the first place...and nfs
doesn't do that...

> this is headed toward migration of a process to an 'available' (less
> loaded) node. i think there will be no need to migrate _servers_ - just
> the possibility to trigger migration of filesystem metadata to machines
> chosen by the load balancer code (so the metadata will be at the place of
> task execution, and will not migrate across the network). this might be
> simpler than it looks: just move the data to the node requesting it, if
> it's only one process which open()'ed the file across the cluster.
> otherwise copy - if the open was read-only - or leave it if it's rw.
> the copying job should be in the buffer management system.

you can't copy a data file just because someone opened it in read mode (if
you want to maintain any semblance of consistency semantics); someone else
might write to it.

it would be nice for processes doing heavy metadata operations to be where
the metadata is; that's a good idea. but i think that it might be easier to
move the process to the metadata than to try the other approach. this
indicates that one might want to be able to obtain an address for that as
well. in the context of pvfs2 this will be more interesting because metadata
will be distributed.

good stuff,

rob