NFS/RDMA ONC Transport

Tracker: Bugs

5 server crash in put_page - ID: 1613201
Last Update: Comment added ( jlentini )

Reported by Vu Pham <vu at mellanox.com>:

I got these errors in the server's /var/log/messages and then the server
stopped responding to login, I/O...; however, the server is still up and
ipoib is still working.
--
Dec 8 06:38:21 ibd201 kernel: RIP: 0010:[<ffffffff8025dff7>]
[<ffffffff8025dff7>] put_page+0x17/0x40
Dec 8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08 EFLAGS:
00010246
Dec 8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: 0000000000000001
RCX: 000000000003ffff
Dec 8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: 0000000000000001
RDI: ffff8102274e92f8
Dec 8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: 0000000000000034
R09: 0000000000000000
Dec 8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: 0000000000000000
R12: ffff81020ef96800
Dec 8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: 0000000000000000
R15: ffff8102053ee890
Dec 8 06:38:21 ibd201 kernel: FS: 00002ad76b8acb00(0000)
GS:ffff81022066eb40(0000) knlGS:0000000000000000
Dec 8 06:38:21 ibd201 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Dec 8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: 000000021c22b000
CR4: 00000000000006e0
Dec 8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo
ffff810219dde000, task ffff81020d87f0c0)
Dec 8 06:38:21 ibd201 kernel: Stack: ffffffff8835e547 ffff81020ef96968
ffff81020ef96800 ffff81020ef96958
Dec 8 06:38:21 ibd201 kernel: ffffffff88360c72 000000010395dc90
ffffffff80424e05 0000000000000000
Dec 8 06:38:21 ibd201 kernel: 0000000000200200 000000010395dc90
ffffffff80239b90 ffff81020d87f0c0
Dec 8 06:38:21 ibd201 kernel: Call Trace:
Dec 8 06:38:21 ibd201 kernel: [<ffffffff8835e547>]
:sunrpc:svc_rdma_put_context+0x37/0xd0
Dec 8 06:38:21 ibd201 kernel: [<ffffffff88360c72>]
:sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0
Dec 8 06:38:21 ibd201 kernel: [<ffffffff80424e05>]
schedule_timeout+0x95/0xb0
Dec 8 06:38:21 ibd201 kernel: [<ffffffff80239b90>]
process_timeout+0x0/0x10
Dec 8 06:38:21 ibd201 kernel: [<ffffffff80423c2d>]
wait_for_completion_timeout+0xcd/0x150
Dec 8 06:38:21 ibd201 kernel: [<ffffffff80228db0>]
default_wake_function+0x0/0x10
Dec 8 06:38:21 ibd201 kernel: [<ffffffff881c1402>]
:ib_mthca:mthca_cmd_post+0x232/0x260
Dec 8 06:38:21 ibd201 kernel: [<ffffffff80228db0>]
default_wake_function+0x0/0x10
Dec 8 06:38:21 ibd201 kernel: [<ffffffff802fac39>] __next_cpu+0x19/0x30
Dec 8 06:38:21 ibd201 kernel: [<ffffffff80227dae>]
find_busiest_group+0x24e/0x6d0
Dec 8 06:38:21 ibd201 kernel: [<ffffffff80424772>]
thread_return+0x0/0xde
Dec 8 06:38:21 ibd201 kernel: [<ffffffff804263f8>]
_spin_unlock_irqrestore+0x8/0x10
Dec 8 06:38:21 ibd201 kernel: [<ffffffff8023a331>]
try_to_del_timer_sync+0x51/0x60
Dec 8 06:38:21 ibd201 kernel: [<ffffffff8023a34c>]
del_timer_sync+0xc/0x20
Dec 8 06:38:21 ibd201 kernel: [<ffffffff80424e05>]
schedule_timeout+0x95/0xb0
Dec 8 06:38:21 ibd201 kernel: [<ffffffff883559e6>]
:sunrpc:svc_recv+0x416/0x510
Dec 8 06:38:21 ibd201 kernel: [<ffffffff80228db0>]
default_wake_function+0x0/0x10
Dec 8 06:38:21 ibd201 kernel: [<ffffffff80228db0>]
default_wake_function+0x0/0x10
Dec 8 06:38:21 ibd201 kernel: [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
Dec 8 06:38:21 ibd201 kernel: [<ffffffff883a9651>]
:nfsd:nfsd+0x111/0x380
Dec 8 06:38:21 ibd201 kernel: [<ffffffff8020ab9c>] child_rip+0xa/0x12
Dec 8 06:38:21 ibd201 kernel: [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
Dec 8 06:38:21 ibd201 kernel: [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
Dec 8 06:38:21 ibd201 kernel: [<ffffffff8020ab92>] child_rip+0x0/0x12
Dec 8 06:38:21 ibd201 kernel:
Dec 8 06:38:21 ibd201 kernel:
Dec 8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 f0 ff 4f
08 0f 94 c0 84 c0 74
Dec 8 06:38:21 ibd201 kernel: RIP [<ffffffff8025dff7>]
put_page+0x17/0x40
Dec 8 06:38:21 ibd201 kernel: RSP <ffff810219ddfb08>


James Lentini ( jlentini ) - 2006-12-11 15:02

5

Closed

None

Nobody/Anonymous

None

None

Public


Comments ( 3 )

Date: 2007-05-29 20:31
Sender: jlentini (Project Admin)


This appears to have been fixed a long
time ago (see Tom Tucker's patch below).


Date: 2006-12-12 21:59
Sender: jlentini (Project Admin)


This was the result of attempting to free the same context twice.
Tom Tucker sent this fix:

---

 net/sunrpc/svc_rdma_recvfrom.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svc_rdma_recvfrom.c b/net/sunrpc/svc_rdma_recvfrom.c
index ec62000..059f5ff 100644
--- a/net/sunrpc/svc_rdma_recvfrom.c
+++ b/net/sunrpc/svc_rdma_recvfrom.c
@@ -527,6 +527,7 @@ int svc_rdma_recvfrom(struct svc_rqst *r
 		/* Close the transport */
 		set_bit(SK_CLOSE, &xprt->sk_flags);
 		svc_rdma_put_context(ctxt, 1);
+		ctxt = NULL;
 		goto poll_dto_q;
 	}
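
For readers unfamiliar with this class of bug, here is a minimal user-space
sketch of why clearing the pointer after the put helps. It is not the
svc_rdma code: struct ctxt, put_context() and recv_error_path() are
hypothetical names, and the assumption that a later shared path releases any
context still referenced is mine, since the code after poll_dto_q is not
shown in this report.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-request context (not the kernel struct). */
struct ctxt {
    int released;   /* set once the context has been put */
};

/* Release a context; returns -1 if it was already released (double put). */
static int put_context(struct ctxt *c)
{
    if (c->released)
        return -1;
    c->released = 1;
    return 0;
}

/*
 * Models an error path that puts the context and then jumps to a shared
 * path which puts whatever context is still referenced (an assumption).
 * Returns 0 on a clean run, -1 if the same context was put twice.
 */
static int recv_error_path(int clear_after_put)
{
    struct ctxt storage = { 0 };
    struct ctxt *c = &storage;
    int rc;

    /* Error path: close the transport and drop the context. */
    rc = put_context(c);
    if (clear_after_put)
        c = NULL;               /* the one-line fix: forget the pointer */

    /* Shared later path: releases any context still referenced. */
    if (c != NULL)
        rc = put_context(c);    /* second put on the same context */

    return rc;
}
```

In this model recv_error_path(0) reports the double put and
recv_error_path(1) does not, which mirrors the effect of the added
ctxt = NULL line.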




Date: 2006-12-12 18:59
Sender: jlentini (Project Admin)


This bug is on:
2.6.18.5
NFS-RDMA release 7
Dual woodcrest xeon based CPUs

put_page objdump
================

0000000000000324 <put_page>:
put_page():
include/asm/bitops.h:230
324: 8b 07 mov (%rdi),%eax
include/asm/bitops.h:229
326: f6 c4 40 test $0x40,%ah
329: 74 05 je 330 <put_page+0xc>
mm/swap.c:51
32b: e9 87 fd ff ff jmpq b7 <put_compound_page>
include/linux/mm.h:300
330: 8b 47 08 mov 0x8(%rdi),%eax
333: 85 c0 test %eax,%eax
335: 75 0a jne 341 <put_page+0x1d>
337: 0f 0b ud2a
339: 68 00 00 00 00 pushq $0x0
33a: R_X86_64_32S .rodata.str1.1+0xa
33e: c2 2c 01 retq $0x12c
include/asm/atomic.h:135
341: f0 ff 4f 08 lock decl 0x8(%rdi)
345: 0f 94 c0 sete %al
include/linux/mm.h:299
348: 84 c0 test %al,%al
34a: 74 05 je 351 <put_page+0x2d>
mm/swap.c:53
34c: e9 00 00 00 00 jmpq 351 <put_page+0x2d>
34d: R_X86_64_PC32
__page_cache_release+0xfffffffffffffffc
351: c3 retq
=============
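
As a reading aid for the disassembly above, the following user-space sketch
models the visible control flow: the count at offset 0x8 is tested, a zero
count hits the ud2a trap, and otherwise the count is atomically decremented
and checked for zero. The names page_model and put_page_model() are
hypothetical, C11 atomics stand in for the kernel's atomic ops, and the
compound-page branch at 0x32b is left out.

```c
#include <stdatomic.h>

/* Hypothetical model of a page with a reference count. */
struct page_model {
    atomic_int count;
};

enum put_result {
    PUT_BUG,         /* count was already zero: the kernel traps (ud2a) */
    PUT_STILL_HELD,  /* other references remain */
    PUT_LAST_REF     /* last reference dropped: page would be released */
};

static enum put_result put_page_model(struct page_model *p)
{
    /* Corresponds to: mov 0x8(%rdi),%eax; test %eax,%eax; jne; ud2a */
    if (atomic_load(&p->count) == 0)
        return PUT_BUG;

    /* Corresponds to: lock decl 0x8(%rdi); sete %al */
    if (atomic_fetch_sub(&p->count, 1) == 1)
        return PUT_LAST_REF;

    return PUT_STILL_HELD;
}
```

Putting a page whose count has already reached zero returns PUT_BUG in this
model, which is the situation a second release of the same context's pages
would produce.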

This has been seen in several contexts:

- openSM restart
- I/O after a timeout (I had stopped doing I/O or accessing the mounted
directory since last night. This morning I just tried to *ls* the mounted
directory and got this error.)
- I/O in large volumes (I just ran iozone with a 9 GB file size; both
client and server machines have 8 GB of memory, dual woodcrest xeon CPUs,
2.6.18.5 kernel, nfsrdma release 7). After this happened, other nfsrdma
clients could still do I/O to the server.


Attached File

No Files Currently Attached

Changes ( 2 )

Field Old Value Date By
status_id Open 2007-05-29 20:31 jlentini
close_date - 2007-05-29 20:31 jlentini