From: Ross S. W. W. <RW...@me...> - 2012-02-13 19:52:56
Yucong Sun (叶雨飞) [mailto:sun...@gm...] wrote:
> On Mon, Feb 13, 2012 at 10:50 AM, Ross S. W. Walker
> <RW...@me...> wrote:
> > Yucong Sun (叶雨飞) [mailto:sun...@gm...] wrote:
> >>
> >> What is your setup? (Processors, NICs, HDDs, controllers, RAID, etc.)
> >>
> >> A normal Linux server with a 256MB BBU hardware RAID 10 on a
> >> PERC 6/i as the disk backend (system on another disk), running
> >> ietd trunk. There are 8 workers constantly writing highly
> >> volatile data to random locations (meaning the data normally
> >> gets overwritten right after the last write).
> >
> > The 6/i isn't as good as the 6/e, but if space is tight...
> >
> > How many disks was that RAID 10?
>
> 4 disks
>
> > What type of disks was that RAID 10?
>
> 500GB SATA, and it's doing 700 IOPS under the same workload
> without any tweaking.

It is impossible for 4 7200 RPM disks to perform 700 IOPS in a RAID 10. This is of course for random IO; it makes no sense to measure sequential IO in IOPS, as that is throughput and is measured in bytes/sec.

SATA disks have an average seek of 8-12ms, and 7200 RPM drives have an average rotational latency of 4ms, so each IO takes 12-16ms to seek and rotate. This means each SATA disk can do 62-84 IOPS. In a perfectly designed RAID 10 each disk can read independently, giving 248-336 IOPS reading, but writes only scale with the number of mirror pairs, which means 124-168 write IOPS. I suspect the PERCs don't do independent reads, as that takes more logic, which means more $$$, so I'd bet your array can only handle 124-168 IOPS for both reads and writes.

> > What size is that RAID 10?
>
> 1TB, so holding it all in memory is not feasible.

Does each client read 1TB of data all the time? No, only the current active working set matters. So, say it's MySQL, and you've figured out that the max table size is X, the min is Y, and the average join touches 4 tables; then (((X + Y) / 2) * 4) is the client's working set. Say you have 8 clients; multiply that by 8.
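The spindle arithmetic above can be sanity-checked in a few lines; a minimal sketch using the quoted seek and rotational-latency figures, not measurements from this array:

```python
# Back-of-envelope random-IO estimate for 7200 RPM SATA disks in a
# 4-disk RAID 10, using the 8-12 ms seek and 4 ms average rotational
# latency figures quoted in the thread.

def disk_iops(seek_ms, rotational_ms):
    """Random IOPS one spindle sustains: one IO per seek plus average rotation."""
    return 1000.0 / (seek_ms + rotational_ms)

def raid10_iops(per_disk, disks, mirror_width=2):
    """Reads scale with spindles (independent seeks); writes with mirror pairs."""
    reads = per_disk * disks
    writes = per_disk * (disks // mirror_width)
    return reads, writes

best = disk_iops(8, 4)     # ~83 IOPS per disk (8 ms seek)
worst = disk_iops(12, 4)   # 62.5 IOPS per disk (12 ms seek)

r_lo, w_lo = raid10_iops(worst, 4)
r_hi, w_hi = raid10_iops(best, 4)
print(f"reads:  {r_lo:.0f}-{r_hi:.0f} IOPS")   # ~250-333
print(f"writes: {w_lo:.0f}-{w_hi:.0f} IOPS")   # ~125-167
```

Either way, 700 sustained random write IOPS from this array only makes sense if the controller's BBU cache is absorbing and coalescing the writes.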
Set read-ahead on the block devices so that working set can be pulled into memory as quickly as possible without the reads impacting each other or the background writes.

> >> What Linux kernel?
> >>
> >> 2.6.29, Ubuntu LTS
> >>
> >> What would you like to achieve?
> >>
> >> At first I was just using WT mode and relying on the RAID
> >> card's write buffer, but I want to use 2GB of RAM as a
> >> secondary write cache. From what I've read (kernel code and
> >> the documents I could find), the page cache is just what I
> >> need, except for one thing: I can't control the page flush.
> >> Ideally I want all writes to be best-effort, only using
> >> available bandwidth unless there's a buffer underrun. I
> >> realize that is probably hard, but doable I'm sure; no one
> >> seems to care enough to implement it.

> > This document is good:
> >
> > http://www.westnet.com/~gsmith/content/linux-pdflush.htm
> >
> > Let's take a look at the tunables:
> >
> > dirty_bytes/dirty_ratio: total amount of dirty memory allowed
> > before the dirtying process is blocked for flushing.
> >
> > - I would keep this high, say 50% of total memory, because if
> > it is hit the results could be unpredictable for IET: maybe
> > all targets get blocked, maybe none. More investigation is
> > needed here.
>
> Exactly, this controls the upper limit to prevent a serious
> underrun. I plan to set it to 2G.

What this tells the kernel is: if this limit is reached, block the process until the dirty pages are flushed. All target threads will probably be included in this calculation. Make sure it isn't hit.

> > dirty_background_bytes/dirty_background_ratio: total amount of
> > dirty memory before a background flush operation is started.
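The two dirty-memory tunables above map to concrete sysctl values; a sketch, assuming 4 GiB of host RAM alongside the 256 MiB controller cache from the thread (both sizes are illustrative, adjust to your hardware):

```python
# Compute the vm.dirty_* settings discussed above and print them in
# /etc/sysctl.conf form. Host RAM size is an assumption for illustration.
GiB = 1 << 30
MiB = 1 << 20

host_ram = 4 * GiB            # assumed host memory
controller_cache = 256 * MiB  # BBU write cache on the RAID controller

settings = {
    # Hard ceiling at ~50% of RAM: writers block here, so in normal
    # operation it should never actually be reached.
    "vm.dirty_bytes": host_ram // 2,
    # Kick off background writeback once a controller-cache-sized
    # chunk is dirty, so each flush fits inside the BBU cache.
    "vm.dirty_background_bytes": controller_cache,
}

for key, value in sorted(settings.items()):
    print(f"{key} = {value}")
```

Note that the `_bytes` and `_ratio` forms of each knob are mutually exclusive: writing one zeroes the other, so pick one form and stick with it.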
> > - Keep this small, say 256MB or 512MB, to make sure the
> > controller can swallow it in a single operation so the
> > flushes are tiny blips on the radar.
>
> The controller has a 256MB write buffer, so I guess I should set
> it to 256M here. What's weird is that in reality the controller
> doesn't just swallow it into the write cache; that's why I'm
> seeing huge IOs blocking all other activity for at least 1 second
> when the page flush happens.

Then tune it down until it doesn't.

> > dirty_expire_centisecs: total time a page can be dirty before
> > it is flushed.
> >
> > - Keep this small, say 1-3 seconds, for data reliability in the
> > face of an accidental power-off or kernel panic. This one
> > tells you the recovery point of the volume, very important.
>
> I don't actually care too much about preserving the data, since
> it isn't very important, just highly volatile; it's mostly page
> swap data anyway, and RAID 10 is just to reduce downtime.
> Keeping this small basically forces Linux to flush pages; I think
> I can set it to at least minutes. I will experiment with that.

Even if the data isn't essential, don't do minutes; the default 30 seconds should be good enough. If the data is only swap then I would do sparse or flat files on top of a file system and let the file system worry about how best to handle the page cache.

> > Now the real benefit of page-cached IET volumes is read data
> > caching: you want to get your whole workload into memory as
> > fast as you can, then operate completely out of that, so only
> > writes go to disk. Of course this won't be completely possible,
> > but with enough RAM reads can be reduced so they fit nicely
> > between the flush operations.
>
> I see what you mean; that's exactly how it operates now. But in
> WT mode, the write operation will not succeed until the disk
> layer confirms it, right? That could unnecessarily delay things.
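Sizing "your whole workload" comes back to the working-set rule quoted earlier in the thread: per client, ((max table + min table) / 2) times the average join width, scaled by the number of clients. A sketch with made-up table sizes:

```python
# Estimate the RAM needed to hold every client's active working set,
# per the sizing rule in the thread. All sizes below are hypothetical.
GiB = 1 << 30
MiB = 1 << 20

def working_set(max_table, min_table, avg_join_tables, clients):
    """Per-client set = average table size * tables per join; scale by clients."""
    per_client = ((max_table + min_table) / 2) * avg_join_tables
    return per_client * clients

total = working_set(max_table=2 * GiB, min_table=64 * MiB,
                    avg_join_tables=4, clients=8)
print(f"target cache size: {total / GiB:.1f} GiB")  # 33.0 GiB
```

Whatever the number comes out to, that is the read cache you want resident before the volume settles into a writes-mostly pattern.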
If the disks are slow it will delay.

> And by the way, what about SYNC operations that IET receives? I
> know upper-layer metadata operations will sync before proceeding;
> how does that work in the iSCSI world?

When IET gets a sync it flushes the whole target disk's page cache, which could mean the whole RAID 10 for some devices. Another way is to make a big XFS file system with sparse files for each client and serve those sparse files up over IET. Then XFS can take care of the page cache corner cases, and if there is a flush it only flushes that file.

-Ross