From: FUJITA T. <fuj...@la...> - 2005-03-01 07:19:55
Hi,

I would like to announce the iSCSI Enterprise Target (IET) software, which is open-source software for building iSCSI storage systems. It can provide disk volumes to iSCSI initiators backed by any kind of file (regular files, block devices, virtual block devices such as RAID and LVM volumes, etc.). The project was started about one year ago by forking the Ardis target implementation (http://www.ardistech.com/iscsi/). The source code and further information are available from:

http://iscsitarget.sourceforge.net/

The user-space daemon handles authentication, and the kernel threads take care of network and disk I/O requests from initiators by using the VFS interface. The kernel-space code is not intrusive; it doesn't touch other parts of the kernel. The code is already stable and usable, and more than 130 people currently subscribe to the project's mailing list. The developers aim for inclusion into the mainline kernel.

The latest code against 2.6.11-rc5 for review can be found at:

http://zaal.org/iscsi/iet/0.4.6/r996.tar.gz

Could you please review the code? Any comments are greatly appreciated.
From: Arjan v. de V. <ar...@in...> - 2005-03-01 08:40:53
On Tue, 2005-03-01 at 16:19 +0900, FUJITA Tomonori wrote:
> The user-space daemon handles authentication and the kernel threads
> take care of network and disk I/O requests from initiators by using
> the VFS interface. The kernel-space code is not intrusive. It doesn't
> touch other parts of the kernel. The code is already stable and
> usable. More than 130 people currently subscribe to the project's
> mailing list. The developers aim for inclusion into the mainline
> kernel.
>
> Could you please review the code? Any comments are greatly
> appreciated.

Can you explain why the target has to be inside the kernel and can't be a pure userspace daemon?
From: Libor V. <lv...@te...> - 2005-03-01 10:48:41
Arjan van de Ven wrote:
> On Tue, 2005-03-01 at 19:22 +0900, FUJITA Tomonori wrote:
> > From: Arjan van de Ven <ar...@in...>
> > Subject: [Iscsitarget-devel] Re: [ANNOUNCE] iSCSI enterprise target software
> > Date: Tue, 01 Mar 2005 10:46:03 +0100
> >
> > > fsync() or msync()? I would imagine the target mmapping its backend in
> > > userspace and using msync() to kick off IO. At which point it's not that
> > > much different from the control you do of the pagecache from inside the
> > > kernel...
> >
> > Can we avoid calling mmap() and munmap() repeatedly with a large disk?
>
> My server has 512 GB of address space with 2.6.9/2.6.10, and a lot more
> than that with the 2.6.11 kernel (4-level page tables rock). So the answer
> would be yes.
>
> (And on old servers without 64 bit, you indeed need to mmap/munmap
> lazily to create a window, but I suspect that the 3 GB of address space
> you have there can be managed smartly to minimize the number of unmaps
> if you really try.)

I don't know in detail what you are talking about (whether the whole disk must fit in the address space), but please consider that we're speaking about TBs (a 10-20 TB RAID is quite cheap nowadays with 400 GB SATA disks).

--
Best regards,
Libor Vanek
From: Arjan v. de V. <ar...@in...> - 2005-03-01 10:51:55
On Tue, 2005-03-01 at 11:48 +0100, Libor Vanek wrote:
> I don't know in detail what you are talking about (whether the whole disk
> must fit in the address space), but please consider that we're speaking
> about TBs (a 10-20 TB RAID is quite cheap nowadays with 400 GB SATA disks).

So? If you need one map/unmap per terabyte, the cost of that is like zero.
From: FUJITA T. <fuj...@la...> - 2005-03-01 09:36:06
From: Arjan van de Ven <ar...@in...>
Subject: Re: [ANNOUNCE] iSCSI enterprise target software
Date: Tue, 01 Mar 2005 09:40:38 +0100

> > Could you please review the code? Any comments are greatly
> > appreciated.
>
> Can you explain why the target has to be inside the kernel and can't be
> a pure userspace daemon?

o synchronization

Suppose that a target runs in user space and an initiator sends two WRITE commands (A and B) with the SIMPLE task attribute.

The target can write A and B simultaneously. Before the target sends the response for A, A must be committed to disk (that is, some dirty page cache must be committed). So the target calls fsync(). That commits A to disk; moreover, it also commits B to disk unnecessarily. This really hurts performance. The current code uses the sync_page_range function instead.

o disk drive cache

When the target calls fsync(), dirty page cache is supposed to be committed to disk. However, if the disk drive uses a write-back policy, it is not; the data is still in the disk drive's cache. There is no system call to control the disk drive cache, so the target (in user space) cannot make good use of it.

The current code also assumes the disk drive uses a write-through policy, because there is no handy VFS interface for controlling the disk drive cache. I think that there is some room for further improvement in the Linux kernel for storage systems. If the kernel maintainers add new system calls to do the above jobs for storage systems, we can implement good iSCSI target software running in user space.

The last reason is user-space cost such as memory copies. With 1 Gbps Ethernet, it is not critical. However, with 10G, I expect it will be. I've been setting up 10G experimental infrastructure to evaluate iSCSI performance.
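(For illustration, a hedged kernel-side sketch of the range-limited commit described above. This is not the actual IET code; the helper name commit_write_range() is invented, and the exact sync_page_range() prototype may differ between 2.6.x releases.)

/*
 * Hedged sketch (not the actual IET code): write back and wait on only
 * the byte range touched by one SCSI WRITE, instead of calling fsync()
 * and flushing every dirty page of the backing file, so an unrelated
 * in-flight command's dirty pages are left alone.
 */
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

static int commit_write_range(struct file *filp, loff_t pos, size_t count)
{
	struct address_space *mapping = filp->f_mapping;

	if (!count)
		return 0;

	/* writes back and waits on [pos, pos + count) only */
	return sync_page_range(mapping->host, mapping, pos, count);
}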
From: Arjan v. de V. <ar...@in...> - 2005-03-01 09:46:22
On Tue, 2005-03-01 at 18:35 +0900, FUJITA Tomonori wrote:
> From: Arjan van de Ven <ar...@in...>
> Subject: Re: [ANNOUNCE] iSCSI enterprise target software
> Date: Tue, 01 Mar 2005 09:40:38 +0100
>
> > > Could you please review the code? Any comments are greatly
> > > appreciated.
> >
> > Can you explain why the target has to be inside the kernel and can't be
> > a pure userspace daemon?
>
> o synchronization
>
> Suppose that a target runs in user space and an initiator sends two
> WRITE commands (A and B) with the SIMPLE task attribute.
>
> The target can write A and B simultaneously. Before the target sends
> the response for A, A must be committed to disk (that is, some dirty
> page cache must be committed). So the target calls fsync(). That
> commits A to disk; moreover, it also commits B to disk unnecessarily.
> This really hurts performance.

fsync() or msync()? I would imagine the target mmapping its backend in userspace and using msync() to kick off IO. At which point it's not that much different from the control you do of the pagecache from inside the kernel...

> o disk drive cache
>
> When the target calls fsync(), dirty page cache is supposed to be
> committed to disk. However, if the disk drive uses a write-back policy,
> it is not; the data is still in the disk drive's cache. There is no
> system call to control the disk drive cache, so the target (in user
> space) cannot make good use of it.

fsync() (and I suppose msync()) nowadays sends a "flush cache" command to the physical disk as well. This is new since 2.6.9 or so.

> The current code also assumes the disk drive uses a write-through
> policy, because there is no handy VFS interface for controlling the
> disk drive cache. I think that there is some room for further
> improvement in the Linux kernel for storage systems.

That's already present since 2.6.9...

> The last reason is user-space cost such as memory copies. With 1 Gbps
> Ethernet, it is not critical. However, with 10G, I expect it will be.
> I've been setting up 10G experimental infrastructure to evaluate iSCSI
> performance.

If you use the mmap (not write/read) approach, this copy isn't there.
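(A minimal userspace sketch of the mmap()/msync() approach described above; handle_scsi_write() is an invented name, and the backing store is assumed to have been mmap()ed once with MAP_SHARED at start-up.)

/* Hedged sketch: dirty the shared mapping for one WRITE command and
 * flush only that byte range, instead of fsync()ing the whole file.
 * msync() requires a page-aligned start address, hence the rounding. */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int handle_scsi_write(char *backing, const void *data,
                             off_t offset, size_t len)
{
        long pagesz = sysconf(_SC_PAGESIZE);
        char *dst = backing + offset;
        char *start = (char *)((unsigned long)dst & ~(unsigned long)(pagesz - 1));
        size_t span = (size_t)(dst + len - start);

        memcpy(dst, data, len);             /* data lands in the page cache */
        return msync(start, span, MS_SYNC); /* flush just this range */
}

/* Set-up, done once:
 *   backing = mmap(NULL, volume_size, PROT_READ | PROT_WRITE,
 *                  MAP_SHARED, backing_fd, 0);
 */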
From: FUJITA T. <fuj...@la...> - 2005-03-01 10:23:06
From: Arjan van de Ven <ar...@in...>
Subject: [Iscsitarget-devel] Re: [ANNOUNCE] iSCSI enterprise target software
Date: Tue, 01 Mar 2005 10:46:03 +0100

> fsync() or msync()? I would imagine the target mmapping its backend in
> userspace and using msync() to kick off IO. At which point it's not that
> much different from the control you do of the pagecache from inside the
> kernel...

Can we avoid calling mmap() and munmap() repeatedly with a large disk?

> > When the target calls fsync(), dirty page cache is supposed to be
> > committed to disk. However, if the disk drive uses a write-back policy,
> > it is not; the data is still in the disk drive's cache. There is no
> > system call to control the disk drive cache, so the target (in user
> > space) cannot make good use of it.
>
> fsync() (and I suppose msync()) nowadays sends a "flush cache" command to
> the physical disk as well. This is new since 2.6.9 or so.
>
> > The current code also assumes the disk drive uses a write-through
> > policy, because there is no handy VFS interface for controlling the
> > disk drive cache. I think that there is some room for further
> > improvement in the Linux kernel for storage systems.
>
> That's already present since 2.6.9...

Thanks a lot. I hadn't noticed these changes. I'll look at the code later.
From: Arjan v. de V. <ar...@in...> - 2005-03-01 10:33:48
On Tue, 2005-03-01 at 19:22 +0900, FUJITA Tomonori wrote:
> From: Arjan van de Ven <ar...@in...>
> Subject: [Iscsitarget-devel] Re: [ANNOUNCE] iSCSI enterprise target software
> Date: Tue, 01 Mar 2005 10:46:03 +0100
>
> > fsync() or msync()? I would imagine the target mmapping its backend in
> > userspace and using msync() to kick off IO. At which point it's not that
> > much different from the control you do of the pagecache from inside the
> > kernel...
>
> Can we avoid calling mmap() and munmap() repeatedly with a large disk?

My server has 512 GB of address space with 2.6.9/2.6.10, and a lot more than that with the 2.6.11 kernel (4-level page tables rock). So the answer would be yes.

(And on old servers without 64 bit, you indeed need to mmap/munmap lazily to create a window, but I suspect that the 3 GB of address space you have there can be managed smartly to minimize the number of unmaps if you really try.)
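(A hedged sketch of the lazy mmap window suggested above for 32-bit hosts; the 1 GB window size and the replacement policy are invented for the example, and an access that straddles the window edge is not handled.)

/* Hedged sketch of a lazy mmap "window" over a large backing store on a
 * 32-bit host: keep one big mapping and only remap when an access falls
 * outside it, so munmap() happens rarely for reasonably local workloads.
 * On 32-bit, build with -D_FILE_OFFSET_BITS=64 so off_t can address a
 * multi-TB volume. */
#include <stddef.h>
#include <sys/mman.h>

#define WINDOW_SIZE (1UL << 30)         /* 1 GB window, arbitrary choice */

struct mmap_window {
        int    fd;                      /* backing file or block device */
        char  *base;                    /* current mapping, or NULL */
        off_t  start;                   /* volume offset of window start */
};

static void *window_addr(struct mmap_window *w, off_t offset)
{
        if (w->base == NULL || offset < w->start ||
            offset >= w->start + (off_t)WINDOW_SIZE) {
                if (w->base != NULL)
                        munmap(w->base, WINDOW_SIZE);
                w->start = offset & ~((off_t)WINDOW_SIZE - 1);
                w->base = mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                               MAP_SHARED, w->fd, w->start);
                if (w->base == MAP_FAILED) {
                        w->base = NULL;
                        return NULL;
                }
        }
        return w->base + (offset - w->start);
}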
From: Arjan v. de V. <ar...@in...> - 2005-03-01 10:46:42
On Tue, 2005-03-01 at 11:33 +0100, Arjan van de Ven wrote:
> On Tue, 2005-03-01 at 19:22 +0900, FUJITA Tomonori wrote:
> > From: Arjan van de Ven <ar...@in...>
> > Subject: [Iscsitarget-devel] Re: [ANNOUNCE] iSCSI enterprise target software
> > Date: Tue, 01 Mar 2005 10:46:03 +0100
> >
> > > fsync() or msync()? I would imagine the target mmapping its backend in
> > > userspace and using msync() to kick off IO. At which point it's not that
> > > much different from the control you do of the pagecache from inside the
> > > kernel...
> >
> > Can we avoid calling mmap() and munmap() repeatedly with a large disk?
>
> My server has 512 GB of address space with 2.6.9/2.6.10, and a lot more
> than that with the 2.6.11 kernel (4-level page tables rock). So the answer
> would be yes.
>
> (And on old servers without 64 bit, you indeed need to mmap/munmap
> lazily to create a window, but I suspect that the 3 GB of address space
> you have there can be managed smartly to minimize the number of unmaps
> if you really try.)

Note that on 32-bit servers the kernel side needs to do kmap() on the pages anyway, and a kmap/kunmap series is very much equivalent to an mmap/munmap series in lots of ways, so I doubt doing it in kernel space has many additional savings.
From: FUJITA T. <fuj...@la...> - 2005-03-01 11:23:27
From: Arjan van de Ven <ar...@in...>
Subject: Re: [Iscsitarget-devel] Re: [ANNOUNCE] iSCSI enterprise target software
Date: Tue, 01 Mar 2005 11:46:32 +0100

> Note that on 32-bit servers the kernel side needs to do kmap() on the
> pages anyway, and a kmap/kunmap series is very much equivalent to an
> mmap/munmap series in lots of ways, so I doubt doing it in kernel space
> has many additional savings.

The code uses the VFS interface, and kmap_atomic() is used instead of kmap(). kmap_atomic() is much faster than kmap().
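(A hedged sketch of what that looks like, using the 2.6-era two-argument kmap_atomic() form; the helper name is invented, and no sleeping is allowed while the atomic mapping is held.)

/* Hedged sketch: copy received data into a (possibly highmem) page-cache
 * page using kmap_atomic().  KM_USER0 is the 2.6-era per-CPU slot
 * argument; do not sleep between kmap_atomic() and kunmap_atomic(). */
#include <linux/highmem.h>
#include <linux/string.h>

static void copy_into_page(struct page *page, const void *src,
                           unsigned int offset, unsigned int len)
{
	char *dst = kmap_atomic(page, KM_USER0);

	memcpy(dst + offset, src, len);
	kunmap_atomic(dst, KM_USER0);
}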
From: Bryan H. <hb...@us...> - 2005-03-01 18:24:24
One thing that's implicit in your reasons for wanting to be in the kernel is that you've chosen to exploit the kernel's page cache. As a user of the page cache, you have more control from inside the kernel than from user space. The page cache was designed to be fundamentally invisible to user space.

A pure user-space implementation of an iSCSI target would use process virtual memory for a cache and manage it itself. It would access the storage with direct I/O.

It looks to me like this is aimed at a single-application Linux system (the whole system is just an iSCSI target), which means there's not much need for a kernel to manage shared resources.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
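(A hedged sketch of that pure user-space arrangement: open the backing store with O_DIRECT so the page cache is bypassed, and manage an application cache privately. The helper name is invented, and the 512-byte alignment is only an example; the real requirement depends on the device and filesystem.)

/* Hedged sketch: O_DIRECT backing store plus an application-managed
 * cache buffer.  O_DIRECT transfers need aligned buffers, offsets and
 * lengths; 512 bytes is used here as an example alignment. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static int open_backing_store(const char *path, void **cache_buf,
                              size_t cache_bytes)
{
        int fd = open(path, O_RDWR | O_DIRECT);

        if (fd < 0)
                return -1;
        if (posix_memalign(cache_buf, 512, cache_bytes) != 0) {
                close(fd);
                return -1;
        }
        return fd;
}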
From: Arjan v. de V. <ar...@in...> - 2005-03-01 18:37:22
On Tue, 2005-03-01 at 10:24 -0800, Bryan Henderson wrote:
> One thing that's implicit in your reasons for wanting to be in the kernel
> is that you've chosen to exploit the kernel's page cache. As a user of
> the page cache, you have more control from inside the kernel than from
> user space. The page cache was designed to be fundamentally invisible to
> user space.
>
> A pure user-space implementation of an iSCSI target would use process
> virtual memory for a cache and manage it itself. It would access the
> storage with direct I/O.

Why would it use direct I/O? Direct I/O would be really stupid for such a thing to use, since that means there's no caching going on *at all*.

You want to *use* the kernel pagecache as much as you can. You do so by using mmap and such, and msync to force content to disk. That uses the kernel pagecache to the maximum extent, while not having to bother with knowing the intimate details of the implementation thereof, which a kernel-side implementation would be involved in. (If it wasn't, and only used high-level functions, then you might as well do the same in userspace after all.)
From: Ming Z. <mi...@el...> - 2005-03-01 18:48:55
On Tue, 2005-03-01 at 13:37, Arjan van de Ven wrote:
> On Tue, 2005-03-01 at 10:24 -0800, Bryan Henderson wrote:
> > One thing that's implicit in your reasons for wanting to be in the kernel
> > is that you've chosen to exploit the kernel's page cache. As a user of
> > the page cache, you have more control from inside the kernel than from
> > user space. The page cache was designed to be fundamentally invisible to
> > user space.
> >
> > A pure user-space implementation of an iSCSI target would use process
> > virtual memory for a cache and manage it itself. It would access the
> > storage with direct I/O.
>
> Why would it use direct I/O? Direct I/O would be really stupid for such
> a thing to use, since that means there's no caching going on *at all*.

What Bryan suggests is a privately owned and managed user-space cache, so for that, disk writes should be real write-through. It is hard to beat Linux kernel cache performance, though.

> You want to *use* the kernel pagecache as much as you can. You do so by
> using mmap and such, and msync to force content to disk. That uses the
> kernel pagecache to the maximum extent, while not having to bother with
> knowing the intimate details of the implementation thereof, which a
> kernel-side implementation would be involved in. (If it wasn't, and only
> used high-level functions, then you might as well do the same in
> userspace after all.)
From: Bryan H. <hb...@us...> - 2005-03-01 21:04:47
> It is hard to beat Linux kernel [page] cache performance, though.

It's quite easy to beat it for particular applications. You can use special knowledge about the workload to drop pages that won't be accessed soon in favor of pages that will, not clean a page that's just going to get discarded or overwritten soon, allocate less space to less important data, and on and on.

And that's pretty much the whole argument for direct I/O. Sometimes the code above the filesystem layer is better at caching.

Of course, in this thread we're not talking about beating the page cache -- we're just talking about matching it, while reaping other benefits of user space code vs kernel code.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
From: Ming Z. <mi...@el...> - 2005-03-01 21:15:30
On Tue, 2005-03-01 at 16:04, Bryan Henderson wrote:
> > It is hard to beat Linux kernel [page] cache performance, though.
>
> It's quite easy to beat it for particular applications. You can use
> special knowledge about the workload to drop pages that won't be accessed
> soon in favor of pages that will, not clean a page that's just going to
> get discarded or overwritten soon, allocate less space to less important
> data, and on and on.

You are talking about application-aware caching/prefetching stuff. But I prefer modifying the kernel page cache a little bit while making use of most of the code there.

> And that's pretty much the whole argument for direct I/O. Sometimes the
> code above the filesystem layer is better at caching.
>
> Of course, in this thread we're not talking about beating the page cache
> -- we're just talking about matching it, while reaping other benefits of
> user space code vs kernel code.

Yes, we went too far.
From: Bryan H. <hb...@us...> - 2005-03-02 18:20:42
> You are talking about application-aware caching/prefetching stuff. But I
> prefer modifying the kernel page cache a little bit while making use of
> most of the code there.

That's a powerful argument for using the page cache, and further, for using it from within the kernel. I once started a project to port a particular filesystem driver from the kernel to user space. It was to be used in a Linux system whose sole purpose was to export one filesystem via NFS. I was looking for engineering ease. Everything about porting the driver to user space was almost trivial except for duplicating the page cache, and that was enough work to call into question the whole strategy (I never went far enough to actually make a decision).

But I'm sure there are cases where the tradeoff works.

--
Bryan Henderson                          San Jose California
IBM Almaden Research Center              Filesystems
From: Ming Z. <mi...@el...> - 2005-03-02 19:34:25
On Wed, 2005-03-02 at 13:20, Bryan Henderson wrote:
> > You are talking about application-aware caching/prefetching stuff. But I
> > prefer modifying the kernel page cache a little bit while making use of
> > most of the code there.
>
> That's a powerful argument for using the page cache, and further, for
> using it from within the kernel. I once started a project to port a
> particular filesystem driver from the kernel to user space. It was to be
> used in a Linux system whose sole purpose was to export one filesystem via
> NFS. I was looking for engineering ease. Everything about porting the
> driver to user space was almost trivial except for duplicating the page
> cache, and that was enough work to call into question the whole strategy
> (I never went far enough to actually make a decision).

I have tried several times before to implement my own cache structures in user space / kernel space. Every time I thought I could do it better than before, but frankly, there are always corner cases that make it perform poorly.

> But I'm sure there are cases where the tradeoff works.

Yes, I am sure about this as well.
From: Arjan v. de V. <ar...@in...> - 2005-03-01 21:16:45
On Tue, 2005-03-01 at 13:04 -0800, Bryan Henderson wrote:
> > It is hard to beat Linux kernel [page] cache performance, though.
>
> It's quite easy to beat it for particular applications. You can use
> special knowledge about the workload to drop pages that won't be accessed
> soon in favor of pages that will, not clean a page that's just going to
> get discarded or overwritten soon, allocate less space to less important
> data, and on and on.

Except that in iSCSI a big chunk of the access patterns are *external*; e.g. the real smarts are on that other machine on the network, not in the iSCSI server.

(And some of the stuff you describe is available via the pagecache too, via madvise and friends.)
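(For reference, a hedged sketch of such madvise() hints, assuming the target has mapped its volume and has some external knowledge of which regions are hot and which are streamed once; the helper name is invented and addresses passed to madvise() must be page-aligned.)

/* Hedged sketch: pass workload knowledge to the page cache via madvise()
 * instead of implementing a private replacement policy. */
#include <sys/mman.h>

static void hint_access_pattern(void *hot, size_t hot_len,
                                void *cold, size_t cold_len)
{
        madvise(hot, hot_len, MADV_WILLNEED);   /* likely to be re-read soon */
        madvise(cold, cold_len, MADV_DONTNEED); /* drop; won't be needed again */
}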
From: Bryan H. <hb...@us...> - 2005-03-02 18:27:26
> Except that in iSCSI a big chunk of the access patterns are *external*;
> e.g. the real smarts are on that other machine on the network, not in the
> iSCSI server.

We strayed a little from the topic; I don't claim that a private user-space cache is better than the page cache for an iSCSI server. My only point is that some of the reasons given for the kernel being a better place for iSCSI server code than user space are reasons only if you assume using the page cache in both cases.

Of course, one can always argue first that the page cache is better for an iSCSI server than any other kind of cache, and therefore that the kernel is a better place than user space for the server code. We just hadn't gone there yet.

--
Bryan Henderson                          San Jose California
IBM Almaden Research Center              Filesystems
From: FUJITA T. <fuj...@la...> - 2005-03-02 03:30:28
From: Bryan Henderson <hb...@us...>
Subject: Re: [ANNOUNCE] iSCSI enterprise target software
Date: Tue, 1 Mar 2005 13:04:58 -0800

> > It is hard to beat Linux kernel [page] cache performance, though.
>
> It's quite easy to beat it for particular applications. You can use
> special knowledge about the workload to drop pages that won't be accessed
> soon in favor of pages that will, not clean a page that's just going to
> get discarded or overwritten soon, allocate less space to less important
> data, and on and on.

Yes. The page-cache replacement policy has a big impact on performance, and a smart storage system tries to learn how initiators use its disk volumes and adjusts the replacement policy accordingly. For example, if an initiator uses a file system, page cache holding metadata blocks can be more important than page cache holding file-data blocks. mlock and madvise may help; however, storage people need more functionality for this issue, I guess.

As Arjan said, not all page cache is identical on some architectures. So target software needs to control the way page cache is allocated to get the best performance out of such architectures. Can user-mode target software do that? (Sorry, I've not used such architectures.)

Another possible reason why kernel-space target software is preferable is handling hardware. For example, NVRAM is very useful and widely used for storage systems. I have no experience with NVRAM cards; however, after a scan of drivers/block/umem.c (Micro Memory's NVRAM card driver), I think that you need to modify the interrupt handler or use bio to get the best performance out of it (although user-mode target software can write, read, and mmap it like a normal block device). User-mode target software cannot handle this issue well.

Note that I'm not trying to push highly specialized functionality, such as a new page-cache replacement policy for storage systems, into the mainline kernel. As I said before, our code doesn't touch other parts of the kernel.

I consider our project a kind of platform for building iSCSI storage systems. I think that the basic iSCSI functionality that our code provides satisfies the majority. In addition, industry and academic people can modify our code and the Linux kernel to add what they want. They can build remote mirroring systems, failover storage systems, etc., possibly by using their own hardware. We may benefit from some of them.

Thus, I think that kernel-mode iSCSI target functionality, which can control all the system resources and provide the maximum flexibility and performance, is a better approach.
From: Andi K. <ak...@mu...> - 2005-03-01 20:38:39
Arjan van de Ven <ar...@in...> writes:
>
> You want to *use* the kernel pagecache as much as you can. You do so by
> using mmap and such, and msync to force content to disk. That uses the

Last time I checked you couldn't mmap block devices. Has this changed now? Could be a problem for an iSCSI target.

I remember there used to be a hack in 2.2 to map them to a pseudo fs to allow mmapping, but that's not very nice and would require another step by the administrator.

Also, using mmap would imply the server only works on 64-bit systems, and may even there have uncomfortable limits. One issue is that the kernel currently doesn't garbage collect page tables, so e.g. when you map a 10TB volume this way and the user accesses it randomly, you will eventually have quite a lot of page tables filling up your RAM. And those will not go away.

My overall feeling is that mmap is not a good idea for this.

-Andi
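(A back-of-the-envelope illustration of the page-table point, assuming x86-64 with 4 KB pages and 8-byte PTEs: a 10 TB mapping accessed randomly eventually faults in about 10 TB / 4 KB ≈ 2.7 billion page-table entries, i.e. roughly 20 GB of bottom-level page tables alone, plus the higher levels; without page-table reclaim, that memory stays allocated until the mapping is torn down.)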
From: Ming Z. <mi...@el...> - 2005-03-01 20:49:56
On Tue, 2005-03-01 at 15:38, Andi Kleen wrote:
> Arjan van de Ven <ar...@in...> writes:
> >
> > You want to *use* the kernel pagecache as much as you can. You do so by
> > using mmap and such, and msync to force content to disk. That uses the
>
> Last time I checked you couldn't mmap block devices. Has this changed
> now? Could be a problem for an iSCSI target.

We definitely need to support exporting any block device, like LV or MD, or just regular hdX or sdX. We also need support for acting as an iSCSI bridge, which means it can export real SCSI devices.

Ming

> I remember there used to be a hack in 2.2 to map them to a pseudo fs
> to allow mmapping, but that's not very nice and would require
> another step by the administrator.
>
> Also, using mmap would imply the server only works on 64-bit systems,
> and may even there have uncomfortable limits. One issue is that
> the kernel currently doesn't garbage collect page tables, so
> e.g. when you map a 10TB volume this way and the user accesses
> it randomly, you will eventually have quite a lot of page tables
> filling up your RAM. And those will not go away.
>
> My overall feeling is that mmap is not a good idea for this.
>
> -Andi
From: Christoph H. <hc...@in...> - 2005-03-01 22:19:40
On Tue, Mar 01, 2005 at 09:38:34PM +0100, Andi Kleen wrote:
> Arjan van de Ven <ar...@in...> writes:
> >
> > You want to *use* the kernel pagecache as much as you can. You do so by
> > using mmap and such, and msync to force content to disk. That uses the
>
> Last time I checked you couldn't mmap block devices. Has this changed
> now? Could be a problem for an iSCSI target.

Since 2.4.10 you can mmap block devices.
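(A hedged sketch of doing exactly that: mapping a block device node directly, with no filesystem in between. The device path in the usage comment is only an example, error handling is minimal, and on a 32-bit host a full-size mapping of a large device would of course fail; see the windowing discussion above.)

/* Hedged sketch: mmap() a block device directly (possible since 2.4.10).
 * BLKGETSIZE64 reports the device size in bytes. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/fs.h>            /* BLKGETSIZE64 */

static void *map_block_device(const char *dev, unsigned long long *size)
{
        void *p;
        int fd = open(dev, O_RDWR);

        if (fd < 0)
                return NULL;
        if (ioctl(fd, BLKGETSIZE64, size) < 0) {
                close(fd);
                return NULL;
        }
        p = mmap(NULL, *size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);               /* the mapping keeps its own reference */
        return p == MAP_FAILED ? NULL : p;
}

/* e.g. map_block_device("/dev/sdb", &bytes) -- example device name only */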
From: Bryan H. <hb...@us...> - 2005-03-01 20:53:08
> You want to *use* the kernel pagecache as much as you can.

No, I really don't. Not always. I can think of only 2 reasons to maximize my use of the kernel pagecache: 1) saves me duplicating code; 2) allows me to share resources (memory and disk bandwidth come to mind) with others in the same Linux system fairly. There are many cases where those two benefits are outweighed by the benefits of using some other cache. If you're thinking of other benefits of using the pagecache, let's hear them.
From: Arjan v. de V. <ar...@in...> - 2005-03-01 20:58:14
On Tue, 2005-03-01 at 12:53 -0800, Bryan Henderson wrote:
> > You want to *use* the kernel pagecache as much as you can.
>
> No, I really don't. Not always. I can think of only 2 reasons to
> maximize my use of the kernel pagecache: 1) saves me duplicating code;
> 2) allows me to share resources (memory and disk bandwidth come to mind)
> with others in the same Linux system fairly. There are many cases where
> those two benefits are outweighed by the benefits of using some other
> cache. If you're thinking of other benefits of using the pagecache,
> let's hear them.

The page cache is capable of using more RAM than apps can on some architectures. The page cache knows about NUMA and other topological issues. For IO, the pagecache has highly tuned algorithms in the 2.6 kernel that throttle writeout based on per-spindle congestion (assuming DM/MD RAID).

You can implement most of that yourself, sure. But why duplicate and tune it for all those different systems out there if the kernel already did that work?