From: Jon M. <jon...@er...> - 2016-09-14 14:39:12
I am working on it. Until now, I didn’t know how to get an already applied patch applied “further back” in the stable line, but after consulting Greg KH (who is responsible for the stable line) I think I have the answer.

BR
///jon

From: Arndt, Jonas [mailto:jon...@hp...]
Sent: Wednesday, 14 September, 2016 08:57
To: Jon Maloy <jon...@er...>; Parthasarathy Bhuvaragan <par...@er...>
Cc: tip...@li...; Ying Xue <yin...@wi...>
Subject: Re: [tipc-discussion] [Kernel oops in 4.4.18]

Hi Guys,

Do you know if c7cad0d6f70cd4ce86 will also show up in 4.4?

Thanks,
// Jonas

On 09/12/2016 10:31 AM, Jonas Arndt wrote:

Jon,

It works great with c7cad0d6f70cd4ce86 and the following from stable-queue/queue-4.4:

0086-tipc-fix-nullptr-crash-during-subscription-cancel.patch
0134-tipc-fix-an-infoleak-in-tipc_nl_compat_link_dump.patch
0135-tipc-fix-nl-compat-regression-for-link-statistics.patch

(not sure if 134 & 135 are needed, but they were there so I applied them)

Thanks!
// Jonas

On 09/12/2016 08:40 AM, Jon Maloy wrote:

Hi,

What you are seeing is a typical symptom of a problem that was fixed in commit c7cad0d6f70cd4ce86 (“tipc: move linearization of buffers to generic code”). For some unknown reason this fix doesn’t seem to have made it back to 4.4.20 (the code I was checking), and then probably not to 4.4.18 either.

BR
///jon

From: Arndt, Jonas [mailto:jon...@hp...]
Sent: Sunday, 11 September, 2016 20:12
To: Parthasarathy Bhuvaragan <par...@er...>
Cc: Jon Maloy <jon...@er...>; tip...@li...; Ying Xue <yin...@wi...>
Subject: Re: [tipc-discussion] [Kernel oops in 4.4.18]

On 09/02/2016 02:02 AM, Parthasarathy Bhuvaragan wrote:

Hi,

You need this fix:
https://sourceforge.net/p/tipc/mailman/message/34768934/

But it won't apply cleanly, so you need this entire series to fix all issues related to the topology server:
https://sourceforge.net/p/tipc/mailman/message/34768927/

They were too intrusive to be pushed to net, hence they were pushed to net-next and were merged in 4.7.

/Partha

Partha,

Thanks for this. While it appears TIPC is up and my nodes can join the cluster, they cannot leave and come back. What I get in the syslog on the node that tries to join is:

2016-09-11T14:14:33.394673-06:00 rack13-ctrl4 kernel: [ 880.688856] Dropping name table update (0) of {1651649891, 1819082752, 0} from <1.1.1> key=402710022
2016-09-11T14:14:33.394688-06:00 rack13-ctrl4 kernel: [ 880.688862] Dropping name table update (0) of {4029808599, 2711729614, 1639218685} from <1.1.1> key=18102394
2016-09-11T14:14:33.394690-06:00 rack13-ctrl4 kernel: [ 880.688865] Dropping name table update (0) of {134218495, 4278191616, 100669184} from <1.1.1> key=0
2016-09-11T14:14:33.394692-06:00 rack13-ctrl4 kernel: [ 880.688868] Dropping name table update (0) of {0, 0, 0} from <1.1.1> key=0
2016-09-11T14:14:33.394693-06:00 rack13-ctrl4 kernel: [ 880.688870] Dropping name table update (0) of {0, 0, 0} from <1.1.1> key=0
2016-09-11T14:14:33.394694-06:00 rack13-ctrl4 kernel: [ 880.688872] Dropping name table update (0) of {0, 0, 0} from <1.1.1> key=0
2016-09-11T14:14:33.394696-06:00 rack13-ctrl4 kernel: [ 880.688875] Dropping name table update (0) of {0, 0, 0} from <1.1.1> key=0
2016-09-11T14:14:33.394697-06:00 rack13-ctrl4 kernel: [ 880.688877] Dropping name table update (0) of {0, 0, 0} from <1.1.1> key=0
2016-09-11T14:14:33.394699-06:00 rack13-ctrl4 kernel: [ 880.688879] Dropping name table update (0) of {0, 0, 16463} from <1.1.1> key=4294915584
2016-09-11T14:14:33.394700-06:00 rack13-ctrl4 kernel: [ 880.688882] Dropping name table update (0) of {0, 0, 0} from <1.1.1> key=0

We are running an HA cluster called OpenSAF that uses the TIPC protocol. I have also traced the traffic (tshark). I am not sure I can attach a file here, though, as that is what made my earlier mail not go through (MIME key attachment).
Please advise if you want to look at the trace.

I also tried:

0086-tipc-fix-nullptr-crash-during-subscription-cancel.patch
0134-tipc-fix-an-infoleak-in-tipc_nl_compat_link_dump.patch
0135-tipc-fix-nl-compat-regression-for-link-statistics.patch

from the stable-queue and got the same result. With the 4.5 kernel it works great.

Thanks,
// Jonas

On 09/01/2016 08:43 PM, Jon Maloy wrote:

Hi Jonas,

I don’t think there is any such thing as a “long-term” kernel from the community viewpoint. But distros such as SLES or Ubuntu use this term, so I suspect that is what you mean. I believe the latest versions of both of those are based on 4.4. I honestly don’t know how often and on which criteria those distros pick upgrades from the upstream kernel, but if this is a serious problem we certainly have to push them to adopt a fix for this.

I believe Partha will recognize this bug, and can tell whether there is a fix for it or not. If so, he can also tell what has happened to it. If this is a distro-specific problem we need to know which one you are using.

Regards
///jon

From: Arndt, Jonas [mailto:jon...@hp...]
Sent: Thursday, 01 September, 2016 14:11
To: Jon Maloy <jon...@er...>
Subject: Fwd: [tipc-discussion] [Kernel oops in 4.4.18]

Jon,

Sorry for reaching out to you directly. I have posted to the mailing list multiple times and I don't understand why it is getting stuck. I am a subscriber and got an email indicating that I can post.

Cheers,
// Jonas

-------- Forwarded Message --------
Subject: [tipc-discussion] [Kernel oops in 4.4.18]
Date: Wed, 31 Aug 2016 09:11:42 -0600
From: Jonas Arndt <jon...@hp...>
To: tip...@li...

Resending as it appears it didn't show up on the mailing list. Sorry for any duplicates....

Hi Guys,

My apologies if this has been covered before.
I am getting this kernel NULL pointer dereference when trying TIPC with the 4.4.18 kernel (running OpenSAF). It works fine with 4.5.x. There seem to have been a number of patches applied to net/tipc between the versions. Why are they not back-ported to 4.4.x? Isn't that a longterm kernel?

Thanks,
// Jonas

================================================================================
2016-08-17T09:19:49.656792-06:00 rack13-ctrl2 kernel: [ 302.348407] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
2016-08-17T09:19:49.656808-06:00 rack13-ctrl2 kernel: [ 302.348474] IP: [<ffffffffa0702749>] tipc_nametbl_subscribe+0x19/0x180 [tipc]
2016-08-17T09:19:49.656810-06:00 rack13-ctrl2 kernel: [ 302.348540] PGD 0
2016-08-17T09:19:49.656812-06:00 rack13-ctrl2 kernel: [ 302.348559] Oops: 0000 1 SMP
2016-08-17T09:19:49.656814-06:00 rack13-ctrl2 kernel: [ 302.348585] Modules linked in: tipc rpcsec_gss_krb5 nfsv4 dns_resolver ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables openvswitch nf_defrag_ipv6 nf_conntrack libcrc32c crc32c_generic nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass mgag200 ttm crc32_pclmul drm_kms_helper drm hmac drbg fb_sys_fops ansi_cprng syscopyarea aesni_intel sysfillrect aes_x86_64 sysimgblt lrw gf128mul glue_helper ablk_helper cryptd ipmi_si iTCO_wdt hpilo evdev pcspkr wmi ipmi_msghandler iTCO_vendor_support hpwdt acpi_power_meter button sb_edac ioatdma lpc_ich edac_core pcc_cpufreq mfd_core acpi_cpufreq processor autofs4 ext4 crc16 mbcache jbd2 dm_mod sg sd_mod ata_generic pata_acpi crc32c_intel psmouse ata_piix libata uhci_hcd ehci_pci ehci_hcd igb scsi_mod i2c_algo_bit i2c_core usbcore usb_common ixgbe dca mdio ptp pps_core thermal
2016-08-17T09:19:49.656817-06:00 rack13-ctrl2 kernel: [ 302.349237] CPU: 16 PID: 98 Comm: kworker/u130:0 Not tainted 4.4.18-tipc #1
2016-08-17T09:19:49.656843-06:00 rack13-ctrl2 kernel: [ 302.349278] Hardware name: HP ProLiant SL210t Gen8/, BIOS P83 11/01/2014
2016-08-17T09:19:49.656846-06:00 rack13-ctrl2 kernel: [ 302.349321] Workqueue: tipc_rcv tipc_recv_work [tipc]
2016-08-17T09:19:49.656848-06:00 rack13-ctrl2 kernel: [ 302.349354] task: ffff881ff93a5640 ti: ffff881ff93b0000 task.ti: ffff881ff93b0000
2016-08-17T09:19:49.656850-06:00 rack13-ctrl2 kernel: [ 302.349395] RIP: 0010:[<ffffffffa0702749>] [<ffffffffa0702749>] tipc_nametbl_subscribe+0x19/0x180 [tipc]
2016-08-17T09:19:49.656852-06:00 rack13-ctrl2 kernel: [ 302.349464] RSP: 0018:ffff881ff93b3cc0 EFLAGS: 00010286
2016-08-17T09:19:49.656853-06:00 rack13-ctrl2 kernel: [ 302.349494] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000180200017
2016-08-17T09:19:49.656855-06:00 rack13-ctrl2 kernel: [ 302.349534] RDX: 0000000180200018 RSI: 0000000000000200 RDI: 0000000000000000
2016-08-17T09:19:49.656857-06:00 rack13-ctrl2 kernel: [ 302.349573] RBP: ffff881ff93b3d00 R08: 00000000f7970601 R09: 0000000180200017
2016-08-17T09:19:49.656858-06:00 rack13-ctrl2 kernel: [ 302.349613] R10: ffffea003fde5c00 R11: ffff880ff7970600 R12: 0000000000000000
2016-08-17T09:19:49.656859-06:00 rack13-ctrl2 kernel: [ 302.349652] R13: ffff881ff54ac0a0 R14: ffff880fee6edd00 R15: ffff880ff7970200
2016-08-17T09:19:49.656860-06:00 rack13-ctrl2 kernel: [ 302.349692] FS: 0000000000000000(0000) GS:ffff88203f880000(0000) knlGS:0000000000000000
2016-08-17T09:19:49.656860-06:00 rack13-ctrl2 kernel: [ 302.349736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2016-08-17T09:19:49.656861-06:00 rack13-ctrl2 kernel: [ 302.349785] CR2: 0000000000000018 CR3: 0000000001a09000 CR4: 00000000001406e0
2016-08-17T09:19:49.656863-06:00 rack13-ctrl2 kernel: [ 302.349833] Stack:
2016-08-17T09:19:49.656865-06:00 rack13-ctrl2 kernel: [ 302.349853] ffffffff811a5d1b ffff880ff7970200 ffff880ff80f6000 0000000000000000
2016-08-17T09:19:49.656865-06:00 rack13-ctrl2 kernel: [ 302.349915] ffff880ff87898c0 ffff881ff54ac0a0 ffff880fee6edd00 ffff880ff7970200
2016-08-17T09:19:49.656866-06:00 rack13-ctrl2 kernel: [ 302.349976] ffff881ff93b3d48 ffffffffa070143a ffff881ff93b3d48 ffff880ff87898c8
2016-08-17T09:19:49.656867-06:00 rack13-ctrl2 kernel: [ 302.350037] Call Trace:
2016-08-17T09:19:49.656868-06:00 rack13-ctrl2 kernel: [ 302.350069] [<ffffffff811a5d1b>] ? kfree+0x13b/0x150
2016-08-17T09:19:49.656870-06:00 rack13-ctrl2 kernel: [ 302.350114] [<ffffffffa070143a>] tipc_subscrb_rcv_cb+0xfa/0x370 [tipc]
2016-08-17T09:19:49.656872-06:00 rack13-ctrl2 kernel: [ 302.350165] [<ffffffffa070d43f>] tipc_receive_from_sock+0xaf/0x100 [tipc]
2016-08-17T09:19:49.656874-06:00 rack13-ctrl2 kernel: [ 302.350219] [<ffffffffa070d61b>] tipc_recv_work+0x2b/0x60 [tipc]
2016-08-17T09:19:49.656874-06:00 rack13-ctrl2 kernel: [ 302.350266] [<ffffffff8107bad8>] process_one_work+0x158/0x420
2016-08-17T09:19:49.656875-06:00 rack13-ctrl2 kernel: [ 302.350310] [<ffffffff8107c529>] worker_thread+0x69/0x480
2016-08-17T09:19:49.656876-06:00 rack13-ctrl2 kernel: [ 302.350351] [<ffffffff8107c4c0>] ? rescuer_thread+0x310/0x310
2016-08-17T09:19:49.656877-06:00 rack13-ctrl2 kernel: [ 302.350395] [<ffffffff810818cb>] kthread+0xdb/0x100
2016-08-17T09:19:49.656879-06:00 rack13-ctrl2 kernel: [ 302.350434] [<ffffffff810817f0>] ? kthread_park+0x60/0x60
2016-08-17T09:19:49.656880-06:00 rack13-ctrl2 kernel: [ 302.350487] [<ffffffff815575cf>] ret_from_fork+0x3f/0x70
2016-08-17T09:19:49.656881-06:00 rack13-ctrl2 kernel: [ 302.350528] [<ffffffff810817f0>] ? kthread_park+0x60/0x60
2016-08-17T09:19:49.656882-06:00 rack13-ctrl2 kernel: [ 302.350567] Code: 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 18 <48> 8b 47 18 8b 5f 08 48 8b 90 e0 12 00 00 8b 05 27 ff 00 00 83
2016-08-17T09:19:49.656883-06:00 rack13-ctrl2 kernel: [ 302.350870] RIP [<ffffffffa0702749>] tipc_nametbl_subscribe+0x19/0x180 [tipc]
2016-08-17T09:19:49.656884-06:00 rack13-ctrl2 kernel: [ 302.352594] RSP <ffff881ff93b3cc0>
2016-08-17T09:19:49.656886-06:00 rack13-ctrl2 kernel: [ 302.354220] CR2: 0000000000000018
2016-08-17T09:19:49.656888-06:00 rack13-ctrl2 kernel: [ 302.355816] --[ end trace 3bc92e0fb0a9c178 ]--
2016-08-17T09:19:49.656894-06:00 rack13-ctrl2 kernel: [ 302.362309] BUG: unable to handle kernel paging request at ffffffffffffffd8
2016-08-17T09:19:49.670952-06:00 rack13-ctrl2 osafntfd[1776]: Started
2016-08-17T09:19:57.670994-06:00 rack13-ctrl2 osafntfd[1776]: MDTM:TIPC Failed to connect to topology server in mdtm_check_for_endianness err :Connection timed out
2016-08-17T09:19:57.671340-06:00 rack13-ctrl2 osafntfd[1776]: ER ncs_core_agents_startup FAILED
2016-08-17T09:19:57.671695-06:00 rack13-ctrl2 osafntfd[1776]: ncs_sel_obj_rmv_ind: recv failed - Socket operation on non-socket, raise_obj: 0 rmv_obj: 0
2016-08-17T09:19:57.671935-06:00 rack13-ctrl2 osafntfd[1776]: osaf_abort(-1) called from 0x7f3fca8d8938 with errno=88
2016-08-17T09:19:57.693637-06:00 rack13-ctrl2 osafclmd[1783]: Started
2016-08-17T09:20:05.695009-06:00 rack13-ctrl2 osafclmd[1783]: MDTM:TIPC Failed to connect to topology server in mdtm_check_for_endianness err :Connection timed out
2016-08-17T09:20:05.695408-06:00 rack13-ctrl2 osafclmd[1783]: ER clms_init failed
2016-08-17T09:20:05.695678-06:00 rack13-ctrl2 osafclmd[1783]: ER Failed, exiting...
2016-08-17T09:20:05.695932-06:00 rack13-ctrl2 opensafd[1699]: ER Failed #012 DESC:CLMD
2016-08-17T09:20:05.696303-06:00 rack13-ctrl2 opensafd[1699]: ER Going for recovery
====================================================================================
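
When comparing traces across kernel versions, it can help to pull the faulting symbol out of the oops mechanically. The following is a minimal Python sketch, assuming the `IP: [<addr>] symbol+0xoff/0xsize [module]` layout seen in the 4.4.18 log above; the name `parse_oops_ip` and the regular expression are illustrative and are not part of any kernel or TIPC tooling.

```python
import re

# Matches the "IP:" (or bare "RIP") line layout seen in the 4.4-era oops above:
#   IP: [<ffffffffa0702749>] tipc_nametbl_subscribe+0x19/0x180 [tipc]
# The pattern is an assumption based on this one log; other kernel versions
# may print the instruction pointer differently.
_IP_RE = re.compile(
    r"\bR?IP:?\s+\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<symbol>[A-Za-z0-9_.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_oops_ip(log_text):
    """Return (symbol, offset, size, module) for the first IP line, or None."""
    m = _IP_RE.search(log_text)
    if m is None:
        return None
    return (
        m.group("symbol"),
        int(m.group("offset"), 16),  # byte offset of the fault inside the function
        int(m.group("size"), 16),    # total size of the function in bytes
        m.group("module"),           # None when the symbol is built into the kernel
    )


sample = (
    "kernel: [ 302.348474] IP: [<ffffffffa0702749>] "
    "tipc_nametbl_subscribe+0x19/0x180 [tipc]"
)
print(parse_oops_ip(sample))
```

For the trace above this reports `tipc_nametbl_subscribe` in module `tipc` at offset 0x19 (25) of 0x180 (384) bytes, i.e. the fault occurs almost immediately after function entry.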