From: <ha...@no...> - 2002-03-19 17:31:15
> > > Clubmask batch spooling can run scripts via BProc
> >
> > I must correct myself (I looked at the docs in between) - it runs
> > scripts on the master only, right?
>
> Well, the scripts start off there -- but it doesn't need to be a script
> that is run; you can specify a binary that uses all of the bproc
> commands to do the process invocation.

OK, I looked at your docs for the third time and again got more out of them... :-)

There are two types of scripts one could care about:

1) Scripts preparing environments for parallel jobs (e.g. using MPI). Mostly written by the cluster administrator, these scripts are more or less part of the computing system and can contain things like 'getnodes' and 'bpsh'. Predefined example scripts play this role in Clubmask. They are part of "Parallel Environment" definitions in Grid Engine. They should execute on the master in BProc-based spooling systems.

2) Scripts for non-parallel jobs. Each such script requires one processor only, does some housekeeping on entry and exit, and most likely runs a few heavy executables (or just one) to do the hard work. Instead of message passing inside MPI, these jobs read and write files and are synchronized using job dependencies (start job 11 when jobs 1-10 are finished). These jobs are written by users and are expected to be the same across various implementations of batch spooling systems.

Some sites seem to care about 1) and MPI (or PVM) only. But in some areas (e.g. our speech recognizer training) problems are best solved using 2). This is where things start to conflict:

- I want to use BProc because it makes cluster administration easy
- I want to let my users install some standard spooling system at home, on laptops etc., read standard documentation, prepare standard job scripts, learn and debug
- I want them to carry unchanged scripts to the cluster and just see the job done much more quickly (well, and also finish debugging)

Standard spooling systems like GE or PBS expect job scripts to be executed on slave nodes. BProc does not quite like that. The best solution I see so far is to mark heavy executables with a prefix which expands to nothing on laptops and home computers and expands to 'bpsh right-node' on the cluster (where all these scripts would execute on the master, off-loading heavy executables to nodes).

It is probably a good compromise. But I am not exactly happy to have to teach users that they should mark heavy executables, with the risk of master node overload when they do not. Moving the user's script as a whole is still tempting...

Best Regards

Vaclav
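A minimal sketch of such a prefix wrapper, assuming a hypothetical wrapper name (run-heavy) and a hypothetical JOB_NODE variable set by the spooling system; on the cluster it off-loads the command with bpsh, anywhere else it simply runs it locally:

    #!/bin/sh
    # run-heavy -- hypothetical prefix for heavy executables in job scripts.
    # On the cluster (bpsh available and JOB_NODE assigned by the spooler)
    # the command is off-loaded to a slave node; otherwise it runs locally.
    if command -v bpsh >/dev/null 2>&1 && [ -n "$JOB_NODE" ]; then
        exec bpsh "$JOB_NODE" "$@"
    else
        exec "$@"
    fi

A job script would then call e.g. "run-heavy /path/to/heavy-binary args..." and stay unchanged between a home machine and the cluster.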
From: <ha...@no...> - 2002-03-19 16:06:13
> A question about that. Let's suppose I want to use a bproc based cluster
> just to run parallel processes, like if it was an SMP box. No need of
> load balancing and so on. Really I just need a local queue system on
> the master node, isn't it? Just maintain a queue of programs, written
> in MPI or just with bproc to do the spawns, and there it is.

But even here you might want more than one simple queue. Maybe some jobs do not scale up well and you will prefer to run two jobs using half of your nodes each. Maybe you want a few nodes for software tests and the rest for long production runs.

So you have jobs which are rectangles of NODES x TIME and want to place them ideally on the resource rectangle you have (finish all jobs in the shortest time possible) - this is where Maui backfill is handy, even if it just runs on the master and decides where and when to bpsh jobs.

Vaclav
From: <ha...@no...> - 2002-03-19 15:39:39
> PBS is not a good match for a BProc based system. It comes with a lot
> of baggage for managing remote machines that you just don't need with
> BProc.

That should be true for any serious scheduler from pre-BProc times - they just had to do it somehow :-)

> We have a student working on an entirely new (and very simple)
> scheduler for BProc based systems.

I had my own simple (pre-BProc) scheduler but gave up further maintenance when I needed better job dependencies and a multiuser environment. I found two free systems with job dependencies: PBS and GE (Grid Engine). PBS was not open source enough for me, so my current bet is GE. I plan to port GE to BProc and expect this to be relatively easy. I am still frustrated by the fact that to get, say, twice the functionality I adopted GE, which is four orders of magnitude bigger, but I am getting used to this. If GE works nicely with BProc, is it a viable option for your site?

> As far as scripts go, I try to discourage people from trying to run
> scripts on nodes. It's really not designed for it. That being said,
> there are some facilities for running scripts. There is a gross hack
> to make #! style execs work with execmove (bpsh). There's also an
> Aexecve() hook which uses the ghost process on the front end to exec()
> non-existent binaries. There's no caching of binaries on the slave
> nodes though.

I am trying to avoid scripts on nodes, but in batch spooling they are handy - though they just prepare the environment for one heavy executable which does the real work. If I got the implications right, such scripts (being sent to nodes by bpsh) could work as long as they use absolute pathnames for executables (because for relative ones the shell looks around and gets mad)? (If we ever manage to provide all the other things the shell might want to touch - like .profile or .*rc)

Putting all this together, probably the best approach is to let batch spooled scripts execute on the master and only migrate selected binaries by prefixing their command lines with a special command; this command can look at environment variables set by the spooling system, find something like the queue name, and bpsh the executable to a node?

> There's no caching of binaries on the slave
> ...
> > NFS is flaky and does not move data as quickly as BProc?
>
> BProc should always be faster for the reason you mention.

But with a hypothetical clever, solid networked filesystem (caching in RAM and maybe on local hard disk, streaming all data needed by exec) the speed of BProc would be the same or even lower if BProc does not cache executables? And furthermore I guess BProc has to get the whole executable into RAM and move it, while after a local exec() just the pages actually needed (visited by program execution) are demand loaded? (Which is probably a small difference and contradicts my idea of a clever filesystem streaming the whole executable to the node doing a local exec.) I know I am comparing a working BProc with a non-existent super-NFS, I just wanted to make sure I got it right.

Best Regards

Vaclav
From: Nicholas H. <he...@se...> - 2002-03-19 15:36:57
On Tue, 19 Mar 2002 ha...@no... wrote:

> > Clubmask batch spooling can run scripts via BProc
>
> I must correct myself (I looked at the docs in between) - it runs
> scripts on the master only, right?

Well, the scripts start off there -- but it doesn't need to be a script that is run; you can specify a binary that uses all of the bproc commands to do the process invocation. We also have LAM/MPI support that uses bproc for its communication.

Nic

Nicholas Henke
Undergraduate - Engineering 2002
--
Senior Architect and Developer
Liniac Project - University of Pennsylvania
http://clubmask.sourceforge.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There's nothing like good food, good beer, and a bad girl.
From: J.A. M. <jam...@ab...> - 2002-03-19 15:15:33
On 2002.03.19 Erik Arjan Hendriks wrote:
>
> PBS is not a good match for a BProc based system. It comes with a lot
> of baggage for managing remote machines that you just don't need with
> BProc. We don't use PBS (haven't even tried) or any other existing ...
>
> > The reason for all this is that I'd like to use Grid Engine with
> > BProc, though I will also consider Clubmask (and Clustermatic? does it
> > have batch spooling?).
>
> No spooler in clustermatic yet.

A question about that. Let's suppose I want to use a bproc based cluster just to run parallel processes, like if it was an SMP box. No need of load balancing and so on. Really I just need a local queue system on the master node, isn't it? Just maintain a queue of programs, written in MPI or just with bproc to do the spawns, and there it is.

Ah, and what about mpirun/mpiexec? If I just run my MPI program and it does an rfork() or the like on init, there is no need for scripts that do the spawning via [rs]sh, which looks like the main reason for mpirun. Missed something?

--
J.A. Magallon                        # Let the source be with you...
mailto:jam...@ab...
Mandrake Linux release 8.2 (Bluebird) for i586
Linux werewolf 2.4.19-pre3-jam3 #1 SMP Fri Mar 15 01:16:08 CET 2002 i686
From: Erik A. H. <er...@he...> - 2002-03-19 12:33:14
On Mon, Mar 18, 2002 at 06:32:57PM +0100, ha...@no... wrote:
> Clubmask batch spooling can run scripts via BProc and commercial PBS
> also can. BProc 3.1.6+ can exec() on the slave getting the image from
> the master, but this most likely is not the PBS way of running scripts.
> We are somewhere near to scripts transparently executing on slaves, but
> not yet there.
>
> How exactly is a script run? What's where during script execution?
>
> I can imagine these scenarios:
>
> 1) Slave nodes have NFS mounted not only /home but also /bin, /usr/bin
>    etc. The execution server is on the master. The script is moved to
>    the slave somehow, like
>
>       bpsh NODE bash script
>
>    and whenever bash executes a command from the script, it
>
>    a) just gets the binary over the NFS mount and does a normal local
>       exec()
>
>    b) uses NFS just to look around; when it comes to exec(), the binary
>       is fetched from the master via BProc transparently to bash, which
>       does not know that exec() does something unusual
>
> 2) The execution server is already BProc-moved to the slave. Everything
>    is NFS-mounted as in 1), the server runs on the slave and gets
>    everything via NFS.
>
> 3) The execution server is on the master, scripts are run on the master
>    via a modified bash which BProc-moves just certain heavy-duty
>    executables to slaves. Just /home is NFS-mounted.
>
> I do not quite believe that /bin and /usr/bin are NFS mounted on all
> slaves as in 1) and 2) and I do not believe that there is a modified
> bash as in 3). So Clubmask and PBS probably use some method which is
> beyond my imagination. Please tell me what it is.

PBS is not a good match for a BProc based system. It comes with a lot of baggage for managing remote machines that you just don't need with BProc. We don't use PBS (haven't even tried) or any other existing scheduler on our systems here. The scheduler has been one of the holes in our environment. We have a student working on an entirely new (and very simple) scheduler for BProc based systems. I believe Scyld has one too, although it's not open source.

As far as scripts go, I try to discourage people from trying to run scripts on nodes. It's really not designed for it. That being said, there are some facilities for running scripts. There is a gross hack to make #! style execs work with execmove (bpsh). There's also an Aexecve() hook which uses the ghost process on the front end to exec() non-existent binaries. There's no caching of binaries on the slave nodes though.

> I also wonder about a performance comparison of BProc and NFS. For a
> dynamically linked executable, the overhead of 1)a) and 1)b) should be
> comparable:
>
> 1)a) executable is moved via BProc
>      libraries are cached on the slave
>
> 1)b) executable got via NFS
>      libraries got via NFS, next time probably cached in the filesystem
>      cache
>
> so the only advantage of BProc would be the common PID space. However
> many docs imply a BProc move is better. So what is wrong with the
> comparison above? NFS is flaky and does not move data as quickly as
> BProc? Something else?

BProc should always be faster for the reason you mention. The number I like is that 3ms is pretty much the baseline overhead for a BProc move on Myrinet (i.e. time to send your process size + 3ms).

> The reason for all this is that I'd like to use Grid Engine with
> BProc, though I will also consider Clubmask (and Clustermatic? does it
> have batch spooling?).

No spooler in Clustermatic yet.

- Erik
--
Erik Arjan Hendriks              Printed On 100 Percent Recycled Electrons
er...@he...                      Contents may settle during shipment
From: <ha...@no...> - 2002-03-19 11:10:50
Is anybody working on a parallel GNU make using BProc to offload processes to slaves?

It should not be hard; GNU make is prepared for this and at least two open-source projects go this way (just using a different transport):

[1] qmake in Grid Engine - allocates a certain number of nodes (like PVM or MPI does), then works just like GNU "make -j N" but transports processes to the nodes

[2] ANTS uses a normal "make -j N", but the Makefile is modified - commands are prefixed with "rant", which acts like "bpsh FREENODE"

Of course there are problems, e.g.:

- parallel compilation can be slow because the normal file cache cannot help as much as it can on one node
- the Makefile should operate just on files in the NFS-mounted /home
- certain Makefiles do not work with "make -j N"
- flaky NFS can make make's decisions, which are based on the existence of files just created elsewhere, problematic
- there are better ways of resource management, see my (hanzl's) posts in [3]

but I would still have good applications for it. Even hand-allocation of free nodes like "bpmake --nodes 1,3,7-10 -- MAKE_OPTIONS" would often be useful on small clusters and easy to install because it requires BProc and nothing else.

I suppose a student given the sources of bpsh and qmake would be able to write bpmake - I will try to start this project here if nobody else has done it yet.

Best Regards

Vaclav Hanzl

References:
[1] http://gridengine.sunsource.net/unbranded-source/browse/~checkout~/gridengine/doc/htmlman/htmlman1/qmake.html
[2] http://unthought.net/antsd/
[3] http://gridengine.sunsource.net/servlets/BrowseList?listName=dev&by=thread&from=85&to=85&first=1&count=7&JServSessionIdservlets=pdmvxixv01
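In the spirit of [2], a minimal sketch of such a prefix command (the names bprun and BP_NODES are hypothetical, and the node selection is deliberately naive):

    #!/bin/sh
    # bprun -- hypothetical "rant"-style prefix for parallel make.
    # BP_NODES holds a space-separated list of slave node numbers; one is
    # picked per invocation and the command is run there via bpsh.
    if [ -z "$BP_NODES" ]; then
        exec "$@"                     # no node list: run locally
    fi
    count=$(set -- $BP_NODES; echo $#)
    index=$(( $$ % count + 1 ))       # crude spread by PID
    node=$(set -- $BP_NODES; shift $((index - 1)); echo $1)
    exec bpsh "$node" "$@"

With Makefile commands prefixed by bprun (e.g. CC="bprun gcc"), an ordinary "make -j N" then fans the compilations out over the listed nodes.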
From: <ha...@no...> - 2002-03-19 09:29:59
> Clubmask batch spooling can run scripts via BProc

I must correct myself (I looked at the docs in between) - it runs scripts on the master only, right?

Vaclav
From: <ha...@no...> - 2002-03-18 17:24:18
Clubmask batch spooling can run scripts via BProc and commercial PBS also can. BProc 3.1.6+ can exec() on the slave getting the image from the master, but this most likely is not the PBS way of running scripts. We are somewhere near to scripts transparently executing on slaves, but not yet there.

How exactly is a script run? What's where during script execution?

I can imagine these scenarios:

1) Slave nodes have NFS mounted not only /home but also /bin, /usr/bin etc. The execution server is on the master. The script is moved to the slave somehow, like

      bpsh NODE bash script

   and whenever bash executes a command from the script, it

   a) just gets the binary over the NFS mount and does a normal local exec()

   b) uses NFS just to look around; when it comes to exec(), the binary is fetched from the master via BProc transparently to bash, which does not know that exec() does something unusual

2) The execution server is already BProc-moved to the slave. Everything is NFS-mounted as in 1), the server runs on the slave and gets everything via NFS.

3) The execution server is on the master, scripts are run on the master via a modified bash which BProc-moves just certain heavy-duty executables to slaves. Just /home is NFS-mounted.

I do not quite believe that /bin and /usr/bin are NFS mounted on all slaves as in 1) and 2) and I do not believe that there is a modified bash as in 3). So Clubmask and PBS probably use some method which is beyond my imagination. Please tell me what it is.

I also wonder about a performance comparison of BProc and NFS. For a dynamically linked executable, the overhead of 1)a) and 1)b) should be comparable:

1)a) executable is moved via BProc
     libraries are cached on the slave

1)b) executable got via NFS
     libraries got via NFS, next time probably cached in the filesystem cache

so the only advantage of BProc would be the common PID space. However many docs imply a BProc move is better. So what is wrong with the comparison above? NFS is flaky and does not move data as quickly as BProc? Something else?

(Sorry for these amateur questions. I promise to learn quickly... ;-)

The reason for all this is that I'd like to use Grid Engine with BProc, though I will also consider Clubmask (and Clustermatic? does it have batch spooling?).

Best Regards

Vaclav Hanzl
From: Jag <ag...@li...> - 2002-03-18 15:00:16
On Mon, 18 Mar 2002, Erik Arjan Hendriks wrote:

> On Mon, Mar 18, 2002 at 11:50:55AM +0100, ha...@no... wrote:
> > I was quite surprised to find out that BProc lives at:
> >
> > http://bproc.sourceforge.net
> >
> > and is active and maintained and has docs and a mailing list etc. On
> > the other hand, the assumption that the Scyld website would be the
> > first place to point me to BProc news was quite false.
> >
> > BProc seems to be mostly a one-man show by Erik Arjan Hendriks and he
> > is not working for Scyld anymore and might not be on good terms
> > with Scyld management. Sorry to touch these personal things but I
> > think they have large technical implications and therefore should be
> > known to cluster developers.
>
> I left Scyld in December of 2002. The technical implications of my

I think that was supposed to be 2000, not 2002...
From: Erik A. H. <er...@he...> - 2002-03-18 14:51:56
On Mon, Mar 18, 2002 at 11:50:55AM +0100, ha...@no... wrote:
> I was quite surprised to find out that BProc lives at:
>
> http://bproc.sourceforge.net
>
> and is active and maintained and has docs and a mailing list etc. On
> the other hand, the assumption that the Scyld website would be the
> first place to point me to BProc news was quite false.
>
> BProc seems to be mostly a one-man show by Erik Arjan Hendriks and he
> is not working for Scyld anymore and might not be on good terms
> with Scyld management. Sorry to touch these personal things but I
> think they have large technical implications and therefore should be
> known to cluster developers.

I left Scyld in December of 2002. The technical implications of my departure have been very good for BProc. My current employer (Los Alamos National Lab) is funding further BProc development. Working on BProc and related things is my full-time job. Since I left Scyld, Cray has also funded some of the work adding debugger support to BProc.

I haven't brought BProc under LANL's umbrella the way I did with Scyld. My experience with Scyld taught me to never let your management get confused about who owns what.

While Scyld uses BProc, they did not support very much development. A lot of good integration work has been done by Scyld but the BProc (2.x) version they are currently distributing is remarkably similar to what I had when I started there. The changes consist almost entirely of bug fixes and hooks to facilitate cluster management. I understand that a small startup isn't going to fund a large development effort but the amount of time available for improvement was very disappointing. That contributed to my decision to leave but it wasn't the primary reason. Since then the DOE has made an attempt to fund work at Scyld but they seemed... umm... uninterested.

There has been MUCH more work improving BProc during the last year than during the one before it. During the first 9 months after I left, the process management was almost entirely rewritten, full ptrace (gdb, strace) support was added and it was ported to Linux 2.4. Since then there have been smaller added features, a LOT of testing and a port to PowerPC.

Anyway, the bottom line is, my leaving Scyld has been very good for BProc development. Oh, and yes, I am on bad terms with their management, although I'm still on good terms with the technical people who were there when I was there.

> (Sorry for this non-technical entry on bproc-users, I hope I'll be
> less off-topic next time.)

No problem.

- Erik
--
Erik Arjan Hendriks              Printed On 100 Percent Recycled Electrons
er...@he...                      Contents may settle during shipment
From: <ha...@no...> - 2002-03-18 10:42:19
I was quite surprised to find out that BProc lives at:

http://bproc.sourceforge.net

and is active and maintained and has docs and a mailing list etc. On the other hand, the assumption that the Scyld website would be the first place to point me to BProc news was quite false.

BProc seems to be mostly a one-man show by Erik Arjan Hendriks and he is not working for Scyld anymore and might not be on good terms with Scyld management. Sorry to touch these personal things but I think they have large technical implications and therefore should be known to cluster developers.

BProc is a GPL project and is used by several independent groups now. Good. Please help me make the picture clear by completing the list. So far I know of these projects using BProc:

1) Scyld Beowulf, of course
   http://www.scyld.com

2) LANL Clustermatic (where Erik's most recent email is)
   http://www.clustermatic.org/

3) Clubmask and Maui
   http://clubmask.sourceforge.net/
   http://www.supercluster.org/maui/body.html

If you know others, please let me know.

My apologies to anybody offended by this message, especially to Scyld and Erik Hendriks. Please correct any false statements I made - the Internet was my only information source. The picture I put together from various small pieces was quite a big surprise for me, so I had to post this.

(Sorry for this non-technical entry on bproc-users, I hope I'll be less off-topic next time.)

Best Regards

Vaclav Hanzl
From: Erik A. H. <er...@he...> - 2002-03-14 00:00:03
bproc 3.1.9 and beoboot lanl.1.2 are available in the usual place:

http://sourceforge.net/project/showfiles.php?group_id=24453

======================================================================
Release notes and change log for BProc:
======================================================================

3.1.9
---------------------------------------------------------------------

This release is just bug fixes. See the change log for details. The basic infrastructure is surviving a much harsher stress test than before (ptree.c), so with a little luck this release will be somewhat better than the last few.

More x86 FPU bogosity. This release addresses further problems with FPU migration on x86. It turns out that it's possible to load a clean FPU state from a P3 on a P1 without taking a trap. However, math on the P1 after doing that produces incorrect results. Nice work, Intel. VMADump now tries to avoid touching an unused FPU during migration. This fixes the case where an application is started on the front end and immediately migrated to a remote node where it runs to completion. There will still be problems if an application which has used the FPU tries to migrate between FPU architectures.

Changes from 3.1.8 to 3.1.9
* Added a patch for Linux version 2.4.18.
* Changed VMADump FPU handling on x86 so that a process which has not used its FPU will not generate a clean FPU state before sending. This way the FPU state will be generated on the remote machine.
* Fixed VMADump so that no pages are ever stored for VM_IO regions.
* Fixed a bug that caused zombies to persist after a successful wait() call on slave nodes.
* Fixed the vrfork path through the move code. A kernel oops was possible because of a misplaced bit of TCP work-around code.
* Fixed a kernel oops with kernel_thread() on slave nodes (as caused by NFS mounts, etc.).
* Fixed a master daemon process accounting bug. It failed to note parent process IDs for remote forks.
* Fixed a master daemon process accounting bug. It failed to clear a pending request on one of the move error paths.
* Fixed problems in bproc_unmasq that could lead to slave node crashes.
* Fixed a race condition in move that could lead to lost parent exit messages. That could lead to process child counting problems later on.

======================================================================
Release notes and change log for Beoboot LANL:
======================================================================

beoboot-lanl 1.2
-----------------------------------------------------

This version should be used with BProc version 3.1.6+

There are some MONTE_PROTECTED related cleanups in monte which require that you patch the phase 1 beoboot kernel. This is necessary because the kernel normally throws away the information from the real mode code after reading it. It used to be possible to just find it at 90000h but boot loaders have begun putting that information at other addresses.

beoserv requires some calls only present in BProc 3.1.6+ now.

The worst of the BProc dependencies have been removed. The beoboot script now does the link with bpslave when you generate the boot images. Beoboot does not require a rebuild every time BProc is updated, although boot images will have to be recreated.

Support for linking in the mon daemon (from supermon) has also been added. A supermon supporting this should be released soon.

Changes from lanl 1.1 to lanl 1.2
* Fixed a two kernel monte issue with protected mode operation and successfully finding the real mode setup code. MONTE_PROTECTED unfortunately requires a kernel patch in the first kernel now.
* Removed most of the BProc dependencies.
* Made kver statically linked to reduce the problems.
* Added chkswap improvements from Rick Niles <ni...@sc...>
* Added in some script updates from Scyld.
* Added in supermon support at image build time.
* Reworked make files to allow for no two kernel monte support (i.e. on ppc).
From: Erik A. H. <er...@he...> - 2002-03-08 03:41:13
On Thu, Mar 07, 2002 at 07:20:28PM -0500, Grant Taylor wrote:
> So all my icache flushing troubles appear to have been bugs in the
> platform code. With these corrected, it now works.

Cool.

> Anyway, it's a little frantic here now, but after things settle down
> I'll put together a clean patch for mips. In the meantime if anyone
> really wants vmadump on mips they can pester me for an ugly hack of a
> diff...

I'd be interested in adding a MIPS patch to vmadump even if there was no MIPS support in the rest of BProc yet.

- Erik
--
Erik Arjan Hendriks              Printed On 100 Percent Recycled Electrons
er...@he...                      Contents may settle during shipment
From: Grant T. <gt...@sw...> - 2002-03-08 00:20:39
So all my icache flushing troubles appear to have been bugs in the platform code. With these corrected, it now works. (At least until I enable HIGHMEM, which seems to cause *other* platform-specific bugs to bite. Serves us right for picking a 6 month old all-new CPU design.)

Anyway, it's a little frantic here now, but after things settle down I'll put together a clean patch for mips. In the meantime if anyone really wants vmadump on mips they can pester me for an ugly hack of a diff...

--
Grant Taylor - x285 - http://pasta/~gtaylor/
Starent Networks - +1.978.851.1185
From: Grant T. <gt...@sw...> - 2002-03-04 17:48:28
>>>>> Erik Arjan Hendriks <er...@he...> writes:
>> Yes, I got all this working flawlessly on uniprocessor by mangling
>> all the entry and exit points to do full saves and restores. It's
>> ugly and inefficient, but until everything actually works this will
>> have to do.

> Since you don't seem to be porting BProc, would you mind telling me
> what the application is? Embedded something or other?

Yes, we're making a telco router sort of thing. Formally, it's a "PDSN", the gateway you connect through when using a cellphone for internet access. In reality, it speaks a PPP-over-GRE-ish protocol over T1s or ATM to the base stations, and stock IP-over-Ethernet or various tunnelling formats out the other end.

Internally it's approximately a 50-odd node Linux cluster with hardware assist to divvy the packets up. When a card gets pulled, the simplest thing is to use vmadump to migrate the various processes to another card, rather than to transfer state by hand the hard way. Since the card is going away, most of the bproc/mosix-like "stub"-based process location transparency stuff is useful...

We're currently investigating our icache flushing implementation; clearly the thing's got a bug since it always blows up on what should be valid addresses ;(

--
Grant Taylor - x285 - http://pasta/~gtaylor/
Starent Networks - +1.978.851.1185
From: Erik A. H. <er...@he...> - 2002-03-04 05:53:14
On Sun, Mar 03, 2002 at 07:30:19PM -0500, Grant Taylor wrote:
> FP doesn't work on this CPU, anyway. I'd need to make the in-kernel
> fp simulator dump state!

The simulated FPU state should still be in the thread_struct more or less like a real FPU.

> The flush_icache_range() implementation is sb1_flush_icache_range() in
> arch/mips/mm/sb1.c. The comments there and in the other flush flavors
> speak of Kseg0 addresses, which is what made me think it expected
> non-userspace addresses.

KSEG0 is the address that physical memory is mapped at in the kernel's virtual address space. KSEG0 addresses seem to mean "virtual" to me. I think I said "user" addresses when I meant "virtual" addresses. All these flush routines take virtual addresses. User/kernel space isn't a meaningful distinction except with respect to trap handling. That shouldn't be an issue though. It's very unlikely that a page won't be present when flush_icache_range is called since we just got finished writing to that page.

> Shouldn't calling flush_icache_page() on page table entries do the
> right thing either way?

Yeah, it should. Make sure you use the virtual address though. No manual page table walking should be required.

> Or even with normal stuff. GDB disassembles on our target at the rate
> of one instruction every several seconds!

Neato.

> Yes, I got all this working flawlessly on uniprocessor by mangling all
> the entry and exit points to do full saves and restores. It's ugly
> and inefficient, but until everything actually works this will have to
> do.

Since you don't seem to be porting BProc, would you mind telling me what the application is? Embedded something or other?

- Erik
--
Erik Arjan Hendriks              Printed On 100 Percent Recycled Electrons
er...@he...                      Contents may settle during shipment
From: Grant T. <gt...@sw...> - 2002-03-04 00:30:27
>>>>> Erik Arjan Hendriks <er...@he...> writes:
> The alpha has the same problem. See the big (and ugly) syscall entry
> code there for an example of what I did. Basically the first bit of
> the syscall handler is a hunk of asm code that saves what doesn't
> normally get saved. The alpha code is basically copied from context
> switch and fork. It seems (from a glance at MIPS fork code) that this
> should be easily accomplished with the "save_static_function" macro.

Maybe. But there are still registers not included in that or SAVE_SOME. As far as I can tell they're just saved and restored ad hoc by whatever functions run in the kernel, so the only place they're easy to get at is when you're at the bottom of the stack about to return from the syscall or context switch.

> It looks like FP will still be an issue on that platform.

This shouldn't matter for my application, where all freezes and thaws are done in the middle of functions that have no fp. FP doesn't work on this CPU, anyway. I'd need to make the in-kernel fp simulator dump state!

> Hrm. Well, it works fine (and is required) on PPC. I'm quite
> certain the icache flush functions take user addresses - see
> kernel/module.c and kernel/ptrace.c. I'm basically doing the same
> thing there that module.c does to make the I and D caches
> consistent. If I was flushing the wrong addresses, the PPC port
> should be broken too.
>
> Since it's blowing up only on SMP and (I presume) you can successfully
> load modules, I believe something else is wrong.

Well, I can buy that. We use a static kernel; I have no idea if module loading works.

The flush_icache_range() implementation is sb1_flush_icache_range() in arch/mips/mm/sb1.c. The comments there and in the other flush flavors speak of Kseg0 addresses, which is what made me think it expected non-userspace addresses. Shouldn't calling flush_icache_page() on page table entries do the right thing either way?

Since doing this (or indeed, doing flush_icache_all()) gives the symptoms of not flushing properly, and flush_icache_range() panics on SMP, I'm beginning to wonder about the platform icache flushing code. Hmm...

> It's probably a waste of time. GDB seems to get confused fairly
> easily when messing with weird stuff.

Or even with normal stuff. GDB disassembles on our target at the rate of one instruction every several seconds!

> Try taking a closer look at how you're restoring the values that
> aren't saved in the default syscall entry. The switch back to
> user space needs to restore these values properly, which means the
> return from syscall needs to do some magic there. Failing to restore
> some of that will likely lead to a user space crash. I saw that on
> alpha.

Yes, I got all this working flawlessly on uniprocessor by mangling all the entry and exit points to do full saves and restores. It's ugly and inefficient, but until everything actually works this will have to do.

--
Grant Taylor - x285 - http://pasta/~gtaylor/
Starent Networks - +1.978.851.1185
From: Erik A. H. <er...@he...> - 2002-03-03 23:46:11
On Sun, Mar 03, 2002 at 05:59:48PM -0500, Grant Taylor wrote:
> I've made vmadump go on the sb1 embedded mips cpu, at least mostly.
> Currently it only works in non-SMP mode...
>
> The main thing for the port, and something which I'm not sure how
> you'll want to deal with, is the fact that on MIPS, kernel entry
> points save only some registers (as there are plenty of registers, and
> only a few are really used for syscalls, this saves a little time).
> At key times later (context switch, etc) there is code to save the
> rest of them. Unfortunately it isn't at all easy to get at the
> unsaved registers, so I ended up saving them all always. The slightly
> better choice is to have kernel entry magically recognize the vmadump
> syscall number and save them all just for it.
>
> Anyway, this somewhat precludes the current vmadump-as-a-module
> arrangement. There might be other platforms with this property that
> vmadump will need to deal with someday, so it's worth pondering a bit.

The alpha has the same problem. See the big (and ugly) syscall entry code there for an example of what I did. Basically the first bit of the syscall handler is a hunk of asm code that saves what doesn't normally get saved. The alpha code is basically copied from context switch and fork. It seems (from a glance at MIPS fork code) that this should be easily accomplished with the "save_static_function" macro.

It looks like FP will still be an issue on that platform.

> Another key thing, and an apparent general bug, is the (somewhat new)
> flush_icache_range() call in load_map, which explodes every time.
> This appears to be passing in a userspace address, while flush_icache
> functions seem to expect a real address. I replaced this with a
> get_user_pages(), flush_icache_page() incantation:
[snip]
> Curiously, the bad flush_icache thing didn't fail in uniprocessor
> mode. I don't know why this would be so; it seems to me that flushing
> effectively random addresses would be poor either way. Evidently it's
> a nonfatal failure on my platform in uniprocessor mode.
>
> Regardless, now thawing under SMP never panics the kernel. Sometimes,
> thawing even works completely. However, it usually still fails, and
> in a funky way. The thawed process merely segfaults immediately or
> shortly after resurrection. Occasionally it (vmadtest) will even
> print some of its .+'s as it does things and then segv. What's
> really interesting is that often *my shell* will exit for no reason
> shortly after a segv'd thawed process dies.

Hrm. Well, it works fine (and is required) on PPC. I'm quite certain the icache flush functions take user addresses - see kernel/module.c and kernel/ptrace.c. I'm basically doing the same thing there that module.c does to make the I and D caches consistent. If I was flushing the wrong addresses, the PPC port should be broken too.

Since it's blowing up only on SMP and (I presume) you can successfully load modules, I believe something else is wrong.

An immediate segfault in user space is the symptom I saw of failing to sync the I and D caches on PPC. PPC faulted in user space if I didn't set the permissions on the segments properly before doing a flush_icache_range. rwx should be loose enough permission-wise for anything though.

* snip *

> If I run the thaw under GDB, then the GDB tends to bizarrely exit,
> so I can't really poke around easily to verify registers and memory
> contents ;(

Can you get a core dump out of it? If it's really a user mode trap and exit, that should be no problem. GDB might be happier looking at the core file. I doubt the core file will show you anything other than "all is well". It's probably a waste of time. GDB seems to get confused fairly easily when messing with weird stuff.

> All in all, I'm inclined to think that the icache stuff is still
> wrong; it seems to be time-dependent as to whether or not the child
> will work, which I could see if there were dcache/icache interactions
> giving my core old code to run in the newly undumped process.
> Interestingly, using a flush_icache_all() at the end of the page
> loading behaves exactly the same.
>
> Can anyone offer any suggestions? I'm a bit puzzled...

Try taking a closer look at how you're restoring the values that aren't saved in the default syscall entry. The switch back to user space needs to restore these values properly, which means the return from syscall needs to do some magic there. Failing to restore some of that will likely lead to a user space crash. I saw that on alpha.

The other thing I would double check is that you're not restoring any important control registers from the dump file. You usually have to be careful about that to avoid opening up security holes or putting the CPU in some bogus state once it tries to return to user mode.

> * Signal state can apparently be shared with another process, so
> vmadump may have a bug here if something clones keeping signals and
> then calls the vmadump thaw syscall before exec. There's a static
> function which breaks the sharing of task->sig in fs/exec.c.

VMADump should just overwrite the signal handler state for both processes in that case. VMADump behavior for processes that are sharing resources (files, sig handlers, memory) is mostly undefined at this point.

- Erik
--
Erik Arjan Hendriks              Printed On 100 Percent Recycled Electrons
er...@he...                      Contents may settle during shipment
From: Grant T. <gt...@sw...> - 2002-03-03 22:59:55
I've made vmadump go on the sb1 embedded mips cpu, at least mostly. Currently it only works in non-SMP mode...

The main thing for the port, and something which I'm not sure how you'll want to deal with, is the fact that on MIPS, kernel entry points save only some registers (as there are plenty of registers, and only a few are really used for syscalls, this saves a little time). At key times later (context switch, etc) there is code to save the rest of them. Unfortunately it isn't at all easy to get at the unsaved registers, so I ended up saving them all always. The slightly better choice is to have kernel entry magically recognize the vmadump syscall number and save them all just for it.

Anyway, this somewhat precludes the current vmadump-as-a-module arrangement. There might be other platforms with this property that vmadump will need to deal with someday, so it's worth pondering a bit.

Another key thing, and an apparent general bug, is the (somewhat new) flush_icache_range() call in load_map, which explodes every time. This appears to be passing in a userspace address, while flush_icache functions seem to expect a real address. I replaced this with a get_user_pages(), flush_icache_page() incantation:

    // bproc flushed a user address: flush_icache_range(page.start, page.start + PAGE_SIZE);
    {
        struct page *pages[1];
        struct vm_area_struct *vmas[1];
        int i;

        pages[0] = NULL;
        vmas[0] = NULL;

        /* It really seems like there should be a lock held over the
           get_user_pages() to flush_icache_page()? */
        i = get_user_pages(current, current->mm, page.start,
                           1, 0, 1, pages, vmas);
        if (i == 1 && vmas[0] && pages[0]) {
            flush_icache_page(vmas[0], pages[0]);
        } else {
            printk("vmadump: trouble finding user page at 0x%x for icache flush!\n",
                   page.start);
        }
    }

Curiously, the bad flush_icache thing didn't fail in uniprocessor mode. I don't know why this would be so; it seems to me that flushing effectively random addresses would be poor either way. Evidently it's a nonfatal failure on my platform in uniprocessor mode.

Regardless, now thawing under SMP never panics the kernel. Sometimes, thawing even works completely. However, it usually still fails, and in a funky way. The thawed process merely segfaults immediately or shortly after resurrection. Occasionally it (vmadtest) will even print some of its .+'s as it does things and then segv. What's really interesting is that often *my shell* will exit for no reason shortly after a segv'd thawed process dies.

The shell is of course the parent process of the "vmadtest -u", but what in the thaw process makes the parent exit later I just don't understand. It seems like any shared-with-parent memory mappings would have been lost long ago when the "vmadtest -u" was exec'd; likewise for any inherited signal state information*. And I'm pretty sure my registers are coming back at me. If I run the thaw under GDB, then GDB tends to bizarrely exit, so I can't really poke around easily to verify registers and memory contents ;(

All in all, I'm inclined to think that the icache stuff is still wrong; it seems to be time-dependent as to whether or not the child will work, which I could see if there were dcache/icache interactions giving my core old code to run in the newly undumped process. Interestingly, using a flush_icache_all() at the end of the page loading behaves exactly the same.

Can anyone offer any suggestions? I'm a bit puzzled...

* Signal state can apparently be shared with another process, so vmadump may have a bug here if something clones keeping signals and then calls the vmadump thaw syscall before exec. There's a static function which breaks the sharing of task->sig in fs/exec.c.

--
Grant Taylor - gtaylor<at>picante.com - http://www.picante.com/~gtaylor/
Linux Printing Website and HOWTO: http://www.linuxprinting.org/
From: Nicholas H. <he...@se...> - 2002-02-27 18:27:52
Must get some sleep -- that is a bit obvious :-) BTW -- I should be working on testing 3.1.7 soon here.

On Wed, 27 Feb 2002, Erik Arjan Hendriks wrote:

> On Wed, Feb 27, 2002 at 02:35:08AM -0500, Nicholas Henke wrote:
> > On Tue, 26 Feb 2002, Jag wrote:
> >
> > > On Tue, 26 Feb 2002, he...@se... wrote:
> > >
> > > > When I have changed the owner of a node to a particular user, any
> > > > user is still able to bpsh to that node -- is this a known issue?
> > > > What can I do to help debug it?
> > >
> > > What version of BProc are you using? Assuming you're using a 3.x
> > > version of BProc, you'll also want to use bpctl to change the mode
> > > (similar to unix file modes, except the only one implemented is x/1)
> >
> > I am using 3.1.6 --- but the bpctl changes are not affecting the mode.
> > I have tried bpctl a-x, or bpctl 110. None of these seem to affect the
> > mode or access control.
>
> The syntax you're looking for is:
>
>   bpctl -S slavenumbers -m 110
>
> The -S has to come before -m. Also, it only understands the octal
> notation, not the a-x type notation. The only defined bits are the
> execute bits. Trying to set the others will produce an error.
>
> - Erik
> --
> Erik Arjan Hendriks            Printed On 100 Percent Recycled Electrons
> er...@he...                    Contents may settle during shipment
From: Erik A. H. <er...@he...> - 2002-02-27 18:18:40
On Wed, Feb 27, 2002 at 02:35:08AM -0500, Nicholas Henke wrote:
> On Tue, 26 Feb 2002, Jag wrote:
>
> > On Tue, 26 Feb 2002, he...@se... wrote:
> >
> > > When I have changed the owner of a node to a particular user, any
> > > user is still able to bpsh to that node -- is this a known issue?
> > > What can I do to help debug it?
> >
> > What version of BProc are you using? Assuming you're using a 3.x
> > version of BProc, you'll also want to use bpctl to change the mode
> > (similar to unix file modes, except the only one implemented is x/1)
>
> I am using 3.1.6 --- but the bpctl changes are not affecting the mode.
> I have tried bpctl a-x, or bpctl 110. None of these seem to affect the
> mode or access control.

The syntax you're looking for is:

  bpctl -S slavenumbers -m 110

The -S has to come before -m. Also, it only understands the octal notation, not the a-x type notation. The only defined bits are the execute bits. Trying to set the others will produce an error.

- Erik
--
Erik Arjan Hendriks              Printed On 100 Percent Recycled Electrons
er...@he...                      Contents may settle during shipment
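A short usage sketch based on the syntax above (the node number is arbitrary; only the execute bits are defined, so the mode is given in octal, owner/group/other as with unix file modes):

    # let only the node's owner start processes on node 5
    bpctl -S 5 -m 100

    # owner and group, but not others
    bpctl -S 5 -m 110

    # open the node up to everyone again
    bpctl -S 5 -m 111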
From: Nicholas H. <he...@se...> - 2002-02-27 07:37:42
On Tue, 26 Feb 2002, Jag wrote:

> On Tue, 26 Feb 2002, he...@se... wrote:
>
> > When I have changed the owner of a node to a particular user, any user
> > is still able to bpsh to that node -- is this a known issue? What can
> > I do to help debug it?
>
> What version of BProc are you using? Assuming you're using a 3.x
> version of BProc, you'll also want to use bpctl to change the mode
> (similar to unix file modes, except the only one implemented is x/1)

I am using 3.1.6 --- but the bpctl changes are not affecting the mode. I have tried bpctl a-x, or bpctl 110. None of these seem to affect the mode or access control.

Nic

> If you're using 2.2, then it sounds like there's something wrong.
From: Jag <ag...@li...> - 2002-02-27 04:47:10
On Tue, 26 Feb 2002, he...@se... wrote:

> When I have changed the owner of a node to a particular user, any user
> is still able to bpsh to that node -- is this a known issue? What can I
> do to help debug it?

What version of BProc are you using? Assuming you're using a 3.x version of BProc, you'll also want to use bpctl to change the mode (similar to unix file modes, except the only one implemented is x/1).

If you're using 2.2, then it sounds like there's something wrong.
From: <he...@se...> - 2002-02-26 21:29:42
When I have changed the owner of a node to a particular user, any user is still able to bpsh to that node -- is this a known issue? What can I do to help debug it?

Nic

Nicholas Henke
Undergraduate - Engineering 2002
--
Senior Architect and Developer
Liniac Project - University of Pennsylvania
http://clubmask.sourceforge.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There's nothing like good food, good beer, and a bad girl.