From: Ying X. <yin...@wi...> - 2013-10-25 10:11:37
On 10/25/2013 03:48 PM, jason wrote:
> Hi Ying,
>
> On Friday, October 25, 2013, Ying Xue wrote:
>
> On 10/25/2013 11:20 AM, jason wrote:
> > Hi Ying,
> > As I understand it, a memory barrier is used so that other CPUs
> > see the same order of two memory operations (one before the
> > barrier and one after it) as the local CPU does. So if one decides
> > to use a memory barrier, there must be at least two memory
> > operations involved. In your patch, the operation on 'blocked' is
> > a single store of a value to a memory location; there is no
> > subsequent memory operation that must be seen in a strict order
> > (i.e. must take effect on memory AFTER the 'blocked' flag is set)
> > by all other CPUs. So I agree with Jon here: there is no need to
> > use atomics, memory barriers or locks.
> >
>
> Please carefully read my mail before you post your comments.
>
> I have already emphasized that there is no required data dependency
> in our case. But I listed two exceptions: one is that some older
> processors have weak or even no cache coherence systems. In
> particular, you can look through the memory barrier implementations
> in all archs.
>
>
> Sorry for my carelessness...
>
> Yes, I got your point, and you are absolutely correct. Many
> architectures do need a memory barrier to force a cache update, but
> in most cases you do not need to issue one explicitly. Many kinds of
> implicit memory barriers appear in the surrounding code, such as a
> rtnl_lock()/rtnl_unlock() pair... If you can find one, your explicit
> memory barrier can be omitted, improving performance; see the sketch
> just below.
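>
> For instance, a minimal sketch of what I mean (a hypothetical writer
> path; rtnl_lock()/rtnl_unlock() are the real kernel APIs, the rest
> is illustrative):
>
>     /* rtnl_unlock() implies release semantics: the store to
>      * b->blocked is visible to other CPUs before the lock is seen
>      * as released, so no explicit smp_mb() is needed here.
>      */
>     rtnl_lock();
>     b->blocked = 1;   /* plain store, ordered by the lock pair */
>     rtnl_unlock();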
>
>
> Another case is the compiler optimization factor.
>
>
> Yes, a memory barrier can suppress compiler optimization on the
> reader side. But, as you know, the volatile keyword is enough for
> that job.
>
Thanks for your confirmation.
Regards,
Ying
>
>
>
> //Ying
>
> > On Friday, October 25, 2013, Ying Xue wrote:
> >
> > On 10/21/2013 05:14 AM, Jon Maloy wrote:
> > > On 09/30/2013 10:38 AM, Ying Xue wrote:
> > >> The 'blocked' flag in struct tipc_bearer is currently protected
> > >> by a big spinlock aggregated into tipc_bearer. We want to reduce
> > >> all dependencies to this lock, as part of our effort to make the
> > >> locking policy simpler and more manageable.
> > >>
> > >> Furthermore, the functions tipc_disc_recv_msg() and
> > >> disc_send_msg() access the flag without taking this lock at all,
> > >> implying a risk that they may occasionally obtain a wrong value
> > >> and drop sending/receiving of discovery messages. Since
> > >> discovery messages are timer driven and best-effort, the latter
> > >> problem is not fatal, but it is clearly avoidable.
> > >>
> > >> By converting the type of the flag from int to atomic_t we
> > >> eliminate both the spinlock dependency and the current race
> > >> condition.
> > >>
> > >> We also make two small code changes:
> > >>
> > >> 1) The functions tipc_continue() and bearer_blocked() now
> > >> become one-liners doing a single access to the flag. So we
> > >> just remove those redundant functions.
> > >>
> > >> 2) The initialization of the flag is moved from the
> > >> enable_media() functions in ib_media.c and eth_media.c
> > >> respectively, to the generic tipc_enable_bearer() function, so
> > >> that it is done in one place only.
> > >
> > > Maybe it is just me, but I still fail to see the benefit (or
> > > harm) of this change.
> > >
> > > As I understand it, the time lines for the current
> > > implementation and the one you are suggesting look like this:
> > >
> > > Without atomic_t:
> > > ------------------------
> > > 1: Device driver detects loss of carrier or similar.
> > > 2: It issues a call (work_queue?) to all registered users.
> > > 3: Until the call has reached the users, they (including TIPC)
> > > will continue to send packets, although the driver can only
> > > drop them.
> > > 4: The field bearer->blocked is set as one single write, on both
> > > 32-bit and 64-bit CPUs, since it is one word and a small
> > > constant value (1 or 0) that is written.
> > > 5: Other cores happening to work on the same cache line at that
> > > moment will notice nothing; they will continue to send the
> > > packet they were working on, seeing the previous value of the
> > > field.
> > > 6: The cache line is flushed and reloaded as soon as the next
> > > packet is sent or received on these cores. (I don't think a
> > > single 64-byte cache line will survive between two
> > > transmissions. There are way too many data fields involved.)
> > > 7: The next time the cache line is reloaded it has the correct
> > > value of 'blocked', since the cache coherency HW support ensures
> > > that the cache is updated. The next packet is not sent.
> > >
> > >
> > > With atomic_t:
> > > ------------------
> > > 1: Device driver detects loss of carrier or similar.
> > > 2: It issues a call (work_queue?) to all registered users.
> > > 3: Until the call has reached the users, they (including TIPC)
> > > will continue to send packets, although the driver can only
> > > drop them.
> > > 4: The field bearer->blocked is set as a single, atomic write,
> > > which it would be anyway, just as in the previous case.
> > > 5: Other cores happening to read the same cache line at that
> > > moment will now see the cache line invalidated, so they will
> > > have to reread it with the new, correct value. The 'volatile'
> > > keyword ensures that. As I understand it, GCC even puts memory
> > > barriers around volatile variables, but I fail to see the
> > > relevance here: we don't make any other changes to the bearer
> > > struct anyway, so ordering is not an issue.
> > > 6: The cache line now has the correct read value. Packet is not
> > > sent.
> > >
> > > What have we won? That TIPC may have time to send one more
> > > packet per core than it would otherwise, but nothing can stop
> > > TIPC and others from sending a small number of packets during
> > > the interval between carrier loss detection and the update of
> > > the 'blocked' flag. Whether the cache line is invalidated
> > > immediately or at the next reload makes a microscopic difference
> > > in this context.
> > >
> > > Admittedly, the harm done is not big either. The memory barriers
> > > put in by GCC around the volatile read do not cost much. But why
> > > make such a change if we cannot motivate it?
> > >
> > First of all, thank you for such detailed comments. In fact, the
> > things you are concerned about below often confuse me too. So I
> > spent some time trying to understand the memory cache systems of
> > modern processors, as well as the effects the various Linux memory
> > barrier primitives have on memory cache coherence. That's why I am
> > late in responding to you. But it's really so complex that I am
> > afraid I still don't entirely understand it, although I spent
> > several days on research. Anyway, I am still going to answer your
> > questions :)
> >
> > Nowadays most modern processors have cache coherence protocols
> > ensuring that all CPUs maintain a consistent view of data. The
> > most famous coherence protocol is called MESI. For example, if the
> > blocked variable is modified by core 0 and only written back to
> > its local cache, the other cores snoop the change made by core 0
> > on the internal high-speed bus and invalidate the local cacheline
> > where the blocked variable resides. When the other cores then read
> > blocked, they get the up-to-date value from core 0's cacheline.
> > This whole process is guaranteed by MESI. However, why do we still
> > need memory barriers if the processor HW has cache coherence
> > capability? Because these CPUs cannot ensure that the order in
> > which memory accesses occur on core 0 is the same order in which
> > other cores observe them, even though they can guarantee that the
> > data itself is consistent. But in our case, as there is no
> > required data dependency, memory barriers seem useless.
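> >
> > To make the ordering point concrete, here is a purely illustrative
> > sketch (the variables data/ready are hypothetical; smp_wmb(),
> > smp_rmb() and cpu_relax() are the real kernel primitives):
> >
> >     /* CPU 0 (writer) */
> >     data = 42;            /* payload first */
> >     smp_wmb();            /* order the two stores for other CPUs */
> >     ready = 1;            /* flag second */
> >
> >     /* CPU 1 (reader) */
> >     while (!ready)
> >             cpu_relax();  /* spin until the flag is observed */
> >     smp_rmb();            /* pairs with the writer's smp_wmb() */
> >     BUG_ON(data != 42);   /* holds only thanks to the barrier pair */
> >
> > Coherence keeps each variable individually consistent, but only
> > the barrier pair keeps their order consistent. A lone flag like
> > blocked has no second access to order against, which is exactly
> > the point.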
> >
> > Unfortunately, life is not so easy! There are still some old CPUs
> > that lack cache coherence mechanisms, meaning that manually
> > flushing the cache becomes mandatory for them. In this situation,
> > without memory barriers, the rate of reading stale data might rise
> > significantly. Even if many packets are involved (i.e. many memory
> > areas would be accessed via the CPU's local cache), in the worst
> > case that does not mean the cacheline storing the stale blocked
> > value is evicted within a short time. Therefore, it becomes
> > unpredictable when the stale blocked value would be updated, which
> > may be unacceptable.
> >
> > If we search for barrier.h files in the kernel source tree, we get
> > the results below:
> >
> > */net-next-2.6$ find ./ -name barrier.h
> > ./arch/metag/include/asm/barrier.h
> > ./arch/sparc/include/asm/barrier.h
> > ./arch/h8300/include/asm/barrier.h
> > ./arch/x86/um/asm/barrier.h
> > ./arch/x86/include/asm/barrier.h
> > ./arch/unicore32/include/asm/barrier.h
> > ./arch/xtensa/include/asm/barrier.h
> > ./arch/frv/include/asm/barrier.h
> > ./arch/m68k/include/asm/barrier.h
> > ./arch/hexagon/include/asm/barrier.h
> > ./arch/mips/include/asm/barrier.h
> > ./arch/arm64/include/asm/barrier.h
> > ./arch/parisc/include/asm/barrier.h
> > ./arch/s390/include/asm/barrier.h
> > ./arch/ia64/include/asm/barrier.h
> > ./arch/m32r/include/asm/barrier.h
> > ./arch/powerpc/include/asm/barrier.h
> > ./arch/alpha/include/asm/barrier.h
> > ./arch/arm/include/asm/barrier.h
> > ./arch/avr32/include/asm/barrier.h
> > ./arch/sh/include/asm/barrier.h
> > ./arch/microblaze/include/asm/barrier.h
> > ./arch/mn10300/include/asm/barrier.h
> > ./arch/tile/include/asm/barrier.h
> > ./arch/arc/include/asm/barrier.h
> > ./arch/score/include/asm/barrier.h
> > ./arch/blackfin/include/asm/barrier.h
> > ./arch/cris/include/asm/barrier.h
> > ./include/asm-generic/barrier.h
> >
> > Different architectures have different implementations. Even
> > within the same arch, different HW versions have different
> > implementations. If we take a closer look at them, we find that
> > many of the implementations forcibly flush/sync the cache in their
> > memory barrier operations.
> >
> > Besides the effect on the CPU cache, memory barriers can also
> > suppress compiler optimization. Here I will just state my
> > understanding of the volatile keyword, so please imagine the
> > scenario below:
> >
> > CPU 0                                  CPU 1
> > -----------------------------------    ---------------
> > T0: read blocked (blocked is 0 now)
> >                                        T1: blocked = 1
> > T2: read blocked (blocked is still 0)
> >
> > Because the compiler optimizes accesses to blocked, without the
> > volatile keyword decorating the variable, at T0 CPU 0 reads
> > blocked from main memory/cache and loads it into one of its
> > registers. At T1, blocked is set to 1 by CPU 1. But as CPU 0 is
> > unaware of the change, it takes the value from its register
> > instead of main memory or cache when it reads blocked again at
> > T2. We don't know when the value in the register will be
> > discarded, which means we don't know when CPU 0 would read the
> > true value. This may also be unacceptable for us.
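> >
> > A small sketch of that scenario (the polling loop is hypothetical;
> > ACCESS_ONCE() was the kernel's macro for this at the time, and
> > cpu_relax() is real):
> >
> >     /* Without volatile semantics, GCC may hoist the load out of
> >      * the loop and spin forever on a register copy of blocked:
> >      */
> >     while (!b->blocked)
> >             cpu_relax();
> >
> >     /* Forcing a fresh load from memory on every iteration: */
> >     while (!ACCESS_ONCE(b->blocked))
> >             cpu_relax();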
> >
> > Therefore, even without memory barriers, I believe everything
> > works fine most of the time. But to be more cautious, I still
> > prefer to apply some constraints to ensure blocked is always in
> > sync.
> >
> > Of course, my understanding above may be incomplete or incorrect.
> > If so, please correct me.
> >
> >
> > Regards,
> > Ying
> >
> >
> > > Regards
> > > ///jon
> > >
> > >
> > >>
> > >> Signed-off-by: Ying Xue <yin...@wi...>
> > >> ---
> > >> net/tipc/bcast.c | 4 ++--
> > >> net/tipc/bearer.c | 33 ++++-----------------------------
> > >> net/tipc/bearer.h | 5 +----
> > >> net/tipc/discover.c | 5 +++--
> > >> net/tipc/eth_media.c | 9 ++++-----
> > >> net/tipc/ib_media.c | 9 ++++-----
> > >> net/tipc/link.c | 12 ++++++------
> > >> 7 files changed, 24 insertions(+), 53 deletions(-)
> > >>
> > >> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
> > >> index 716de1a..c78f30c 100644
> > >> --- a/net/tipc/bcast.c
> > >> +++ b/net/tipc/bcast.c
> > >> @@ -615,8 +615,8 @@ static int tipc_bcbearer_send(struct sk_buff *buf, struct tipc_bearer *unused1,
> > >> if (!p)
> > >> break; /* No more bearers to try */
> > >>
> > >> - if (tipc_bearer_blocked(p)) {
> > >> - if (!s || tipc_bearer_blocked(s))
> > >> + if (atomic_read(&p->blocked)) {
> > >> + if (!s || atomic_read(&s->blocked))
> > >> continue; /* Can't use either bearer */
> > >> b = s;
> > >> }
> > >> diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
> > >> index 3f9707a..17f9824 100644
> > >> --- a/net/tipc/bearer.c
> > >> +++ b/net/tipc/bearer.c
> > >> @@ -275,31 +275,6 @@ void tipc_bearer_remove_dest(struct tipc_bearer *b_ptr, u32 dest)
> > >> tipc_disc_remove_dest(b_ptr->link_req);
> > >> }
> > >>
> > >> -/*
> > >> - * Interrupt enabling new requests after bearer blocking:
> > >> - * See bearer_send().
> > >> - */
> > >> -void tipc_continue(struct tipc_bearer *b)
> > >> -{
> > >> - spin_lock_bh(&b->lock);
> > >> - b->blocked = 0;
> > >> - spin_unlock_bh(&b->lock);
> > >> -}
> > >> -
> > >> -/*
> > >> - * tipc_bearer_blocked - determines if bearer is currently blocked
> > >> - */
> > >> -int tipc_bearer_blocked(struct tipc_bearer *b)
> > >> -{
> > >> - int res;
> > >> -
> > >> - spin_lock_bh(&b->lock);
> > >> - res = b->blocked;
> > >> - spin_unlock_bh(&b->lock);
> > >> -
> > >> - return res;
> > >> -}
> > >> -
> > >> /**
> > >> * tipc_enable_bearer - enable bearer with the given name
> > >> */
> > >> @@ -387,6 +362,8 @@ restart:
> > >>
> > >> b_ptr = &tipc_bearers[bearer_id];
> > >> strcpy(b_ptr->name, name);
> > >> + atomic_set(&b_ptr->blocked, 0);
> > >> + smp_mb();
> > >> res = m_ptr->enable_media(b_ptr);
> > >> if (res) {
> > >> pr_warn("Bearer <%s> rejected, enable failure (%d)\n",
> > >> @@ -429,8 +406,8 @@ int tipc_block_bearer(struct tipc_bearer *b_ptr)
> > >>
> > >> read_lock_bh(&tipc_net_lock);
> > >> pr_info("Blocking bearer <%s>\n", b_ptr->name);
> > >> + atomic_add_unless(&b_ptr->blocked, 1, 1);
> > >> spin_lock_bh(&b_ptr->lock);
> > >> - b_ptr->blocked = 1;
> > >> list_for_each_entry_safe(l_ptr, temp_l_ptr, &b_ptr->links, link_list) {
> > >> struct tipc_node *n_ptr = l_ptr->owner;
> > >>
> > >> @@ -455,8 +432,8 @@ static void bearer_disable(struct tipc_bearer *b_ptr)
> > >> struct tipc_link_req *temp_req;
> > >>
> >
> >
> >
> >
> > --
> > Yours,
> > Jason
>
>
>
> --
> Yours,
> Jason