Skip to content

Limit netlink communication length #8

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
dmah42 opened this issue Feb 3, 2014 · 0 comments
Open

Limit netlink communication length #8

dmah42 opened this issue Feb 3, 2014 · 0 comments

Comments

@dmah42
Copy link
Collaborator

dmah42 commented Feb 3, 2014

In some cases, where lots of connections are present, netlink can run out of buffer space. By providing a max number of sockets to return, and a start connection id, the userspace can manage the netlink buffer better, and can keep receiving messages until none are returned (or the kernel can send a special end-of-list message).

rapier1 pushed a commit that referenced this issue Apr 24, 2014
…king

Daniel Borkmann reported a VM_BUG_ON assertion failing:

  ------------[ cut here ]------------
  kernel BUG at mm/mlock.c:528!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ccm arc4 iwldvm [...]
   video
  CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
  Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
  task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
  RIP: 0010:[<ffffffff81171ad0>]  [<ffffffff81171ad0>] munlock_vma_pages_range+0x2e0/0x2f0
  Call Trace:
    do_munmap+0x18f/0x3b0
    vm_munmap+0x41/0x60
    SyS_munmap+0x22/0x30
    system_call_fastpath+0x1a/0x1f
  RIP   munlock_vma_pages_range+0x2e0/0x2f0
  ---[ end trace a0088dcf07ae10f2 ]---

because munlock_vma_pages_range() thinks it's unexpectedly in the middle
of a THP page.  This can be reproduced with default config since 3.11
kernels.  A reproducer can be found in the kernel's selftest directory
for networking by running ./psock_tpacket.

The problem is that an order=2 compound page (allocated by
alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped
by packet_mmap()) and mistaken for a THP page and assumed to be order=9.

The checks for THP in munlock came with commit ff6a6da ("mm:
accelerate munlock() treatment of THP pages"), i.e.  since 3.9, but did
not trigger a bug.  It just makes munlock_vma_pages_range() skip such
compound pages until the next 512-pages-aligned page, when it encounters
a head page.  This is however not a problem for vma's where mlocking has
no effect anyway, but it can distort the accounting.

Since commit 7225522 ("mm: munlock: batch non-THP page isolation
and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
PageTransHuge() check.

This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
list of flags that make vma's non-mlockable and non-mergeable.  The
reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
already on the VM_SPECIAL list, and both are intended for non-LRU pages
where mlocking makes no sense anyway.  Related Lkml discussion can be
found in [2].

 [1] tools/testing/selftests/net/psock_tpacket
 [2] https://lkml.org/lkml/2014/1/10/427

Signed-off-by: Vlastimil Babka <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reported-by: Daniel Borkmann <[email protected]>
Tested-by: Daniel Borkmann <[email protected]>
Cc: Thomas Hellstrom <[email protected]>
Cc: John David Anglin <[email protected]>
Cc: HATAYAMA Daisuke <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Carsten Otte <[email protected]>
Cc: Jared Hulbert <[email protected]>
Tested-by: Hannes Frederic Sowa <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: <[email protected]> [3.11.x+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
rapier1 pushed a commit that referenced this issue Apr 24, 2014
vmxnet3's netpoll driver is incorrectly coded.  It directly calls
vmxnet3_do_poll, which is the driver internal napi poll routine.  As the netpoll
controller method doesn't block real napi polls in any way, there is a potential
for race conditions in which the netpoll controller method and the napi poll
method run concurrently.  The result is data corruption causing panics such as this
one recently observed:
PID: 1371   TASK: ffff88023762caa0  CPU: 1   COMMAND: "rs:main Q:Reg"
 #0 [ffff88023abd5780] machine_kexec at ffffffff81038f3b
 #1 [ffff88023abd57e0] crash_kexec at ffffffff810c5d92
 #2 [ffff88023abd58b0] oops_end at ffffffff8152b570
 #3 [ffff88023abd58e0] die at ffffffff81010e0b
 #4 [ffff88023abd5910] do_trap at ffffffff8152add4
 #5 [ffff88023abd5970] do_invalid_op at ffffffff8100cf95
 #6 [ffff88023abd5a10] invalid_op at ffffffff8100bf9b
    [exception RIP: vmxnet3_rq_rx_complete+1968]
    RIP: ffffffffa00f1e80  RSP: ffff88023abd5ac8  RFLAGS: 00010086
    RAX: 0000000000000000  RBX: ffff88023b5dcee0  RCX: 00000000000000c0
    RDX: 0000000000000000  RSI: 00000000000005f2  RDI: ffff88023b5dcee0
    RBP: ffff88023abd5b48   R8: 0000000000000000   R9: ffff88023a3b6048
    R10: 0000000000000000  R11: 0000000000000002  R12: ffff8802398d4cd8
    R13: ffff88023af35140  R14: ffff88023b60c890  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff88023abd5b50] vmxnet3_do_poll at ffffffffa00f204a [vmxnet3]
 #8 [ffff88023abd5b80] vmxnet3_netpoll at ffffffffa00f209c [vmxnet3]
 #9 [ffff88023abd5ba0] netpoll_poll_dev at ffffffff81472bb7

The fix is to do as other drivers do, and have the poll controller call the top
half interrupt handler, which schedules a napi poll properly to recieve frames

Tested by myself, successfully.

Signed-off-by: Neil Horman <[email protected]>
CC: Shreyas Bhatewara <[email protected]>
CC: "VMware, Inc." <[email protected]>
CC: "David S. Miller" <[email protected]>
CC: [email protected]
Reviewed-by: Shreyas N Bhatewara <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
rapier1 pushed a commit that referenced this issue Oct 17, 2014
If recovery failed ath10k returned 0 (success) and
mac80211 continued to call other driver callbacks.
This caused null dereference.

This is how the failure looked like:

ath10k: ctl_resp never came in (-110)
ath10k: failed to connect to HTC: -110
ath10k: could not init core (-110)
BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffffa0b355c1>] ath10k_ce_send+0x1d/0x15d [ath10k_pci]
PGD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: ath10k_pci ath10k_core ath5k ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nf_nat_ipv4 ]
CPU: 1 PID: 36 Comm: kworker/1:1 Tainted: G        WC   3.13.0-rc8-wl-ath+ #8
Hardware name: To be filled by O.E.M. To be filled by O.E.M./HURONRIVER, BIOS 4.6.5 05/02/2012
Workqueue: events ieee80211_restart_work [mac80211]
task: ffff880215b521c0 ti: ffff880215e18000 task.ti: ffff880215e18000
RIP: 0010:[<ffffffffa0b355c1>]  [<ffffffffa0b355c1>] ath10k_ce_send+0x1d/0x15d [ath10k_pci]
RSP: 0018:ffff880215e19af8  EFLAGS: 00010292
RAX: ffff880215e19b10 RBX: 0000000000000000 RCX: 0000000000000018
RDX: 00000000d9ccf800 RSI: ffff8800c965ad00 RDI: 0000000000000000
RBP: ffff880215e19b58 R08: 0000000000000002 R09: 0000000000000000
R10: ffffffff812e1a23 R11: 0000000000000292 R12: 0000000000000018
R13: 0000000000000000 R14: 0000000000000002 R15: ffff88021562d700
FS:  0000000000000000(0000) GS:ffff88021fa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000001a0d000 CR4: 00000000000407e0
Stack:
 d9ccf8000d47df40 ffffffffa0b367a0 ffff880215e19b10 0000000000000010
 ffff880215e19b68 ffff880215e19b28 0000000000000018 ffff8800c965ad00
 0000000000000018 0000000000000000 0000000000000002 ffff88021562d700
Call Trace:
 [<ffffffffa0b3251d>] ath10k_pci_hif_send_head+0xa7/0xcb [ath10k_pci]
 [<ffffffffa0b16cbe>] ath10k_htc_send+0x23d/0x2d0 [ath10k_core]
 [<ffffffffa0b1a169>] ath10k_wmi_cmd_send_nowait+0x5d/0x85 [ath10k_core]
 [<ffffffffa0b1aaef>] ath10k_wmi_cmd_send+0x62/0x115 [ath10k_core]
 [<ffffffff814e8abd>] ? __netdev_alloc_skb+0x4b/0x9b
 [<ffffffffa0b1c438>] ath10k_wmi_vdev_set_param+0x91/0xa3 [ath10k_core]
 [<ffffffffa0b0e0d5>] ath10k_mac_set_rts+0x3e/0x40 [ath10k_core]
 [<ffffffffa0b0e1d0>] ath10k_set_frag_threshold+0x5e/0x9c [ath10k_core]
 [<ffffffffa09d60eb>] ieee80211_reconfig+0x12a/0x7b3 [mac80211]
 [<ffffffff815a8069>] ? mutex_unlock+0x9/0xb
 [<ffffffffa09b3a40>] ieee80211_restart_work+0x5e/0x68 [mac80211]
 [<ffffffff810c01d0>] process_one_work+0x1d7/0x2fc
 [<ffffffff810c0166>] ? process_one_work+0x16d/0x2fc
 [<ffffffff810c06c8>] worker_thread+0x12e/0x1fb
 [<ffffffff810c059a>] ? rescuer_thread+0x27b/0x27b
 [<ffffffff810c5aee>] kthread+0xb5/0xbd
 [<ffffffff815a9220>] ? _raw_spin_unlock_irq+0x28/0x42
 [<ffffffff810c5a39>] ? __kthread_parkme+0x5c/0x5c
 [<ffffffff815ae04c>] ret_from_fork+0x7c/0xb0
 [<ffffffff810c5a39>] ? __kthread_parkme+0x5c/0x5c
Code: df ff d0 48 83 c4 18 5b 41 5c 41 5d 5d c3 55 48 89 e5 41 57 41 56 45 89 c6 41 55 41 54 41 89 cc 53 48 89
RIP  [<ffffffffa0b355c1>] ath10k_ce_send+0x1d/0x15d [ath10k_pci]
 RSP <ffff880215e19af8>
CR2: 0000000000000000

Reported-By: Ben Greear <[email protected]>
Signed-off-by: Michal Kazior <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
rapier1 pushed a commit that referenced this issue Oct 17, 2014
When plugging a specific micro SD card at MMC socket of a custom i.MX28 board,
we get the following kernel warning:

WARNING: CPU: 0 PID: 30 at drivers/mmc/host/mxs-mmc.c:342 mxs_mmc_start_cmd+0x34c/0x378()
Modules linked in:
CPU: 0 PID: 30 Comm: kworker/u2:1 Not tainted 3.14.0-rc5 #8
Workqueue: kmmcd mmc_rescan
[<c0015420>] (unwind_backtrace) from [<c0012cb0>] (show_stack+0x10/0x14)
[<c0012cb0>] (show_stack) from [<c001daf8>] (warn_slowpath_common+0x6c/0x8c)
[<c001daf8>] (warn_slowpath_common) from [<c001db34>] (warn_slowpath_null+0x1c/0x24)
[<c001db34>] (warn_slowpath_null) from [<c0349478>] (mxs_mmc_start_cmd+0x34c/0x378)
[<c0349478>] (mxs_mmc_start_cmd) from [<c0338fa0>] (mmc_start_request+0xc4/0xf4)
[<c0338fa0>] (mmc_start_request) from [<c03390b4>] (mmc_wait_for_req+0x50/0x164)
[<c03390b4>] (mmc_wait_for_req) from [<c03405b8>] (mmc_app_send_scr+0x158/0x1c8)
[<c03405b8>] (mmc_app_send_scr) from [<c033ee1c>] (mmc_sd_setup_card+0x80/0x3c8)
[<c033ee1c>] (mmc_sd_setup_card) from [<c033f788>] (mmc_sd_init_card+0x124/0x66c)
[<c033f788>] (mmc_sd_init_card) from [<c033fd7c>] (mmc_attach_sd+0xac/0x174)
[<c033fd7c>] (mmc_attach_sd) from [<c033a658>] (mmc_rescan+0x25c/0x2d8)
[<c033a658>] (mmc_rescan) from [<c003597c>] (process_one_work+0x1b4/0x4ec)
[<c003597c>] (process_one_work) from [<c0035de4>] (worker_thread+0x130/0x464)
[<c0035de4>] (worker_thread) from [<c003c824>] (kthread+0xb4/0xd0)
[<c003c824>] (kthread) from [<c000f420>] (ret_from_fork+0x14/0x34)

The error is due to an invalid value in CSD register of a specific 2GB
micro SD card. The CSD version of this card is 1.0 but the TACC field
has the invalid value 0.

cid:0000005553442020000000000000583f
csd:00000032535a83bfedb7ffbf1680003f
date:08/2005
erase_size:512
fwrev:0x0
hwrev:0x0
manfid:0x000000
name:USD
oemid:0x0000
preferred_erase_size:4194304
scr:0225000000000000
serial:0x00000000
type:SD

Since the kernel is making use of this TACC field to calculate the SD
card timeout, an invalid value 0 leads to a warning at
mxs_ns_to_ssp_ticks() and later the following misleading error message
appears in a loop:

mxs-mmc 80010000.ssp: card claims to support voltages below defined range
mxs-mmc 80010000.ssp: no support for card's volts
mmc0: error -22 whilst initialising MMC card

This error is only found on this 2GB SD card on mxs platform.
On x86 this card works without any problems.

The following patch based on the work of Peter Chan and Otavio Salvador.
It catches the case that the determined timeout is still 0 and sets it
to a valid value.

Successful tested on a i.MX28 board.

Signed-off-by: Stefan Wahren <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
Signed-off-by: Chris Ball <[email protected]>
rapier1 pushed a commit that referenced this issue Oct 17, 2014
All tests should pass with and without JIT.

Example output:
test_bpf: #0 TAX 35 16 16 PASS
test_bpf: #1 TXA 7 7 7 PASS
test_bpf: #2 ADD_SUB_MUL_K 10 PASS
test_bpf: #3 DIV_KX 33 PASS
test_bpf: #4 AND_OR_LSH_K 10 10 PASS
test_bpf: #5 LD_IND 8 8 8 PASS
test_bpf: #6 LD_ABS 8 8 8 PASS
test_bpf: #7 LD_ABS_LL 13 14 PASS
test_bpf: #8 LD_IND_LL 12 12 12 PASS
test_bpf: #9 LD_ABS_NET 10 12 PASS
test_bpf: #10 LD_IND_NET 11 12 12 PASS
...

Numbers are times in nsec per filter for given input data.

Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
rapier1 pushed a commit that referenced this issue Oct 17, 2014
This patch tries to fix this crash:

 #5 [ffff88003c1cd690] do_invalid_op at ffffffff810166d5
 #6 [ffff88003c1cd730] invalid_op at ffffffff8159b2de
    [exception RIP: ocfs2_direct_IO_get_blocks+359]
    RIP: ffffffffa05dfa27  RSP: ffff88003c1cd7e8  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff88003c1cdaa8  RCX: 0000000000000000
    RDX: 000000000000000c  RSI: ffff880027a95000  RDI: ffff88003c79b540
    RBP: ffff88003c1cd858   R8: 0000000000000000   R9: ffffffff815f6ba0
    R10: 00000000000001c9  R11: 00000000000001c9  R12: ffff88002d271500
    R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000001000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff88003c1cd860] do_direct_IO at ffffffff811cd31b
 #8 [ffff88003c1cd950] direct_IO_iovec at ffffffff811cde9c
 #9 [ffff88003c1cd9b0] do_blockdev_direct_IO at ffffffff811ce764
#10 [ffff88003c1cdb80] __blockdev_direct_IO at ffffffff811ce7cc
#11 [ffff88003c1cdbb0] ocfs2_direct_IO at ffffffffa05df756 [ocfs2]
#12 [ffff88003c1cdbe0] generic_file_direct_write_iter at ffffffff8112f935
#13 [ffff88003c1cdc40] ocfs2_file_write_iter at ffffffffa0600ccc [ocfs2]
#14 [ffff88003c1cdd50] do_aio_write at ffffffff8119126c
#15 [ffff88003c1cddc0] aio_rw_vect_retry at ffffffff811d9bb4
#16 [ffff88003c1cddf0] aio_run_iocb at ffffffff811db880
#17 [ffff88003c1cde30] io_submit_one at ffffffff811dc238
#18 [ffff88003c1cde80] do_io_submit at ffffffff811dc437
#19 [ffff88003c1cdf70] sys_io_submit at ffffffff811dc530
#20 [ffff88003c1cdf80] system_call_fastpath at ffffffff8159a159

It crashes at
        BUG_ON(create && (ext_flags & OCFS2_EXT_REFCOUNTED));
in ocfs2_direct_IO_get_blocks.

ocfs2_direct_IO_get_blocks is expecting the OCFS2_EXT_REFCOUNTED be removed in
ocfs2_prepare_inode_for_write() if it was there. But no cluster lock is taken
during the time before (or inside) ocfs2_prepare_inode_for_write() and after
ocfs2_direct_IO_get_blocks().

It can happen in this case:

Node A(which crashes)				Node B
------------------------                 ---------------------------
ocfs2_file_aio_write
  ocfs2_prepare_inode_for_write
    ocfs2_inode_lock
    ...
    ocfs2_inode_unlock
  #no refcount found
....					ocfs2_reflink
                                          ocfs2_inode_lock
                                          ...
                                          ocfs2_inode_unlock
                                          #now, refcount flag set on extent

                                        ...
                                        flush change to disk

ocfs2_direct_IO_get_blocks
  ocfs2_get_clusters
    #extent map miss
    #buffer_head miss
    read extents from disk
  found refcount flag on extent
  crash..

Fix:
Take rw_lock in ocfs2_reflink path

Signed-off-by: Wengang Wang <[email protected]>
Reviewed-by: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
rapier1 pushed a commit that referenced this issue Oct 17, 2014
When performing a consuming read, the ring buffer swaps out a
page from the ring buffer with a empty page and this page that
was swapped out becomes the new reader page. The reader page
is owned by the reader and since it was swapped out of the ring
buffer, writers do not have access to it (there's an exception
to that rule, but it's out of scope for this commit).

When reading the "trace" file, it is a non consuming read, which
means that the data in the ring buffer will not be modified.
When the trace file is opened, a ring buffer iterator is allocated
and writes to the ring buffer are disabled, such that the iterator
will not have issues iterating over the data.

Although the ring buffer disabled writes, it does not disable other
reads, or even consuming reads. If a consuming read happens, then
the iterator is reset and starts reading from the beginning again.

My tests would sometimes trigger this bug on my i386 box:

WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
Modules linked in:
CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
 00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
 f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
 ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
Call Trace:
 [<c18796b3>] dump_stack+0x4b/0x75
 [<c103a0e3>] warn_slowpath_common+0x7e/0x95
 [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
 [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
 [<c103a185>] warn_slowpath_fmt+0x33/0x35
 [<c10bd85a>] __trace_find_cmdline+0x66/0xaa^M
 [<c10bed04>] trace_find_cmdline+0x40/0x64
 [<c10c3c16>] trace_print_context+0x27/0xec
 [<c10c4360>] ? trace_seq_printf+0x37/0x5b
 [<c10c0b15>] print_trace_line+0x319/0x39b
 [<c10ba3fb>] ? ring_buffer_read+0x47/0x50
 [<c10c13b1>] s_show+0x192/0x1ab
 [<c10bfd9a>] ? s_next+0x5a/0x7c
 [<c112e76e>] seq_read+0x267/0x34c
 [<c1115a25>] vfs_read+0x8c/0xef
 [<c112e507>] ? seq_lseek+0x154/0x154
 [<c1115ba2>] SyS_read+0x54/0x7f
 [<c188488e>] syscall_call+0x7/0xb
---[ end trace 3f507febd6b4cc83 ]---
>>>> ##### CPU 1 buffer started ####

Which was the __trace_find_cmdline() function complaining about the pid
in the event record being negative.

After adding more test cases, this would trigger more often. Strangely
enough, it would never trigger on a single test, but instead would trigger
only when running all the tests. I believe that was the case because it
required one of the tests to be shutting down via delayed instances while
a new test started up.

After spending several days debugging this, I found that it was caused by
the iterator becoming corrupted. Debugging further, I found out why
the iterator became corrupted. It happened with the rb_iter_reset().

As consuming reads may not read the full reader page, and only part
of it, there's a "read" field to know where the last read took place.
The iterator, must also start at the read position. In the rb_iter_reset()
code, if the reader page was disconnected from the ring buffer, the iterator
would start at the head page within the ring buffer (where writes still
happen). But the mistake there was that it still used the "read" field
to start the iterator on the head page, where it should always start
at zero because readers never read from within the ring buffer where
writes occur.

I originally wrote a patch to have it set the iter->head to 0 instead
of iter->head_page->read, but then I questioned why it wasn't always
setting the iter to point to the reader page, as the reader page is
still valid.  The list_empty(reader_page->list) just means that it was
successful in swapping out. But the reader_page may still have data.

There was a bug report a long time ago that was not reproducible that
had something about trace_pipe (consuming read) not matching trace
(iterator read). This may explain why that happened.

Anyway, the correct answer to this bug is to always use the reader page
an not reset the iterator to inside the writable ring buffer.

Cc: [email protected] # 2.6.28+
Fixes: d769041 "ring_buffer: implement new locking"
Signed-off-by: Steven Rostedt <[email protected]>
rapier1 pushed a commit that referenced this issue Oct 22, 2014
For commit ocfs2 journal, ocfs2 journal thread will acquire the mutex
osb->journal->j_trans_barrier and wake up jbd2 commit thread, then it
will wait until jbd2 commit thread done. In order journal mode, jbd2
needs flushing dirty data pages first, and this needs get page lock.
So osb->journal->j_trans_barrier should be got before page lock.

But ocfs2_write_zero_page() and ocfs2_write_begin_inline() obey this
locking order, and this will cause deadlock and hung the whole cluster.

One deadlock catched is the following:

PID: 13449  TASK: ffff8802e2f08180  CPU: 31  COMMAND: "oracle"
 #0 [ffff8802ee3f79b0] __schedule at ffffffff8150a524
 #1 [ffff8802ee3f7a58] schedule at ffffffff8150acbf
 #2 [ffff8802ee3f7a68] rwsem_down_failed_common at ffffffff8150cb85
 #3 [ffff8802ee3f7ad8] rwsem_down_read_failed at ffffffff8150cc55
 #4 [ffff8802ee3f7ae8] call_rwsem_down_read_failed at ffffffff812617a4
 #5 [ffff8802ee3f7b50] ocfs2_start_trans at ffffffffa0498919 [ocfs2]
 #6 [ffff8802ee3f7ba0] ocfs2_zero_start_ordered_transaction at ffffffffa048b2b8 [ocfs2]
 #7 [ffff8802ee3f7bf0] ocfs2_write_zero_page at ffffffffa048e9bd [ocfs2]
 #8 [ffff8802ee3f7c80] ocfs2_zero_extend_range at ffffffffa048ec83 [ocfs2]
 #9 [ffff8802ee3f7ce0] ocfs2_zero_extend at ffffffffa048edfd [ocfs2]
 #10 [ffff8802ee3f7d50] ocfs2_extend_file at ffffffffa049079e [ocfs2]
 #11 [ffff8802ee3f7da0] ocfs2_setattr at ffffffffa04910ed [ocfs2]
 #12 [ffff8802ee3f7e70] notify_change at ffffffff81187d29
 #13 [ffff8802ee3f7ee0] do_truncate at ffffffff8116bbc1
 #14 [ffff8802ee3f7f50] sys_ftruncate at ffffffff8116bcbd
 #15 [ffff8802ee3f7f80] system_call_fastpath at ffffffff81515142
    RIP: 00007f8de750c6f7  RSP: 00007fffe786e478  RFLAGS: 00000206
    RAX: 000000000000004d  RBX: ffffffff81515142  RCX: 0000000000000000
    RDX: 0000000000000200  RSI: 0000000000028400  RDI: 000000000000000d
    RBP: 00007fffe786e040   R8: 0000000000000000   R9: 000000000000000d
    R10: 0000000000000000  R11: 0000000000000206  R12: 000000000000000d
    R13: 00007fffe786e710  R14: 00007f8de70f8340  R15: 0000000000028400
    ORIG_RAX: 000000000000004d  CS: 0033  SS: 002b

crash64> bt
PID: 7610   TASK: ffff88100fd56140  CPU: 1   COMMAND: "ocfs2cmt"
 #0 [ffff88100f4d1c50] __schedule at ffffffff8150a524
 #1 [ffff88100f4d1cf8] schedule at ffffffff8150acbf
 #2 [ffff88100f4d1d08] jbd2_log_wait_commit at ffffffffa01274fd [jbd2]
 #3 [ffff88100f4d1d98] jbd2_journal_flush at ffffffffa01280b4 [jbd2]
 #4 [ffff88100f4d1dd8] ocfs2_commit_cache at ffffffffa0499b14 [ocfs2]
 #5 [ffff88100f4d1e38] ocfs2_commit_thread at ffffffffa0499d38 [ocfs2]
 #6 [ffff88100f4d1ee8] kthread at ffffffff81090db6
 #7 [ffff88100f4d1f48] kernel_thread_helper at ffffffff81516284

crash64> bt
PID: 7609   TASK: ffff88100f2d4480  CPU: 0   COMMAND: "jbd2/dm-20-86"
 #0 [ffff88100def3920] __schedule at ffffffff8150a524
 #1 [ffff88100def39c8] schedule at ffffffff8150acbf
 #2 [ffff88100def39d8] io_schedule at ffffffff8150ad6c
 #3 [ffff88100def39f8] sleep_on_page at ffffffff8111069e
 #4 [ffff88100def3a08] __wait_on_bit_lock at ffffffff8150b30a
 #5 [ffff88100def3a58] __lock_page at ffffffff81110687
 #6 [ffff88100def3ab8] write_cache_pages at ffffffff8111b752
 #7 [ffff88100def3be8] generic_writepages at ffffffff8111b901
 #8 [ffff88100def3c48] journal_submit_data_buffers at ffffffffa0120f67 [jbd2]
 #9 [ffff88100def3cf8] jbd2_journal_commit_transaction at ffffffffa0121372[jbd2]
 #10 [ffff88100def3e68] kjournald2 at ffffffffa0127a86 [jbd2]
 #11 [ffff88100def3ee8] kthread at ffffffff81090db6
 #12 [ffff88100def3f48] kernel_thread_helper at ffffffff81516284

Signed-off-by: Junxiao Bi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Alex Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
rapier1 pushed a commit that referenced this issue Nov 12, 2014
This patch wires up the new syscall sys_bpf() on powerpc.

Passes the tests in samples/bpf:

    #0 add+sub+mul OK
    #1 unreachable OK
    #2 unreachable2 OK
    #3 out of range jump OK
    #4 out of range jump2 OK
    #5 test1 ld_imm64 OK
    #6 test2 ld_imm64 OK
    #7 test3 ld_imm64 OK
    #8 test4 ld_imm64 OK
    #9 test5 ld_imm64 OK
    #10 no bpf_exit OK
    #11 loop (back-edge) OK
    #12 loop2 (back-edge) OK
    #13 conditional loop OK
    #14 read uninitialized register OK
    #15 read invalid register OK
    #16 program doesn't init R0 before exit OK
    #17 stack out of bounds OK
    #18 invalid call insn1 OK
    #19 invalid call insn2 OK
    #20 invalid function call OK
    #21 uninitialized stack1 OK
    #22 uninitialized stack2 OK
    #23 check valid spill/fill OK
    #24 check corrupted spill/fill OK
    #25 invalid src register in STX OK
    #26 invalid dst register in STX OK
    #27 invalid dst register in ST OK
    #28 invalid src register in LDX OK
    #29 invalid dst register in LDX OK
    #30 junk insn OK
    #31 junk insn2 OK
    #32 junk insn3 OK
    #33 junk insn4 OK
    #34 junk insn5 OK
    #35 misaligned read from stack OK
    #36 invalid map_fd for function call OK
    #37 don't check return value before access OK
    #38 access memory with incorrect alignment OK
    #39 sometimes access memory with incorrect alignment OK
    #40 jump test 1 OK
    #41 jump test 2 OK
    #42 jump test 3 OK
    #43 jump test 4 OK

Signed-off-by: Pranith Kumar <[email protected]>
[mpe: test using samples/bpf]
Signed-off-by: Michael Ellerman <[email protected]>
rapier1 pushed a commit that referenced this issue Nov 12, 2014
File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.

This bash command reproduces problem:

$ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
	   echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done

	divide error: 0000 [#1] SMP
	Modules linked in:
	CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
	task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
	RIP: 0010:[<ffffffff81074191>]  [<ffffffff81074191>] task_scan_min+0x21/0x50
	RSP: 0000:ffff880037a6bce0  EFLAGS: 00010246
	RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
	RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
	RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
	R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
	R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
	FS:  00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
	CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
	Stack:
	 ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
	 ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
	 ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
	Call Trace:
	 [<ffffffff810741d1>] task_scan_max+0x11/0x40
	 [<ffffffff81077ef7>] task_numa_fault+0x1f7/0xae0
	 [<ffffffff8115a896>] ? migrate_misplaced_page+0x276/0x300
	 [<ffffffff81134a4d>] handle_mm_fault+0x62d/0xba0
	 [<ffffffff8103e2f1>] __do_page_fault+0x191/0x510
	 [<ffffffff81030122>] ? native_smp_send_reschedule+0x42/0x60
	 [<ffffffff8106dc00>] ? check_preempt_curr+0x80/0xa0
	 [<ffffffff8107092c>] ? wake_up_new_task+0x11c/0x1a0
	 [<ffffffff8104887d>] ? do_fork+0x14d/0x340
	 [<ffffffff811799bb>] ? get_unused_fd_flags+0x2b/0x30
	 [<ffffffff811799df>] ? __fd_install+0x1f/0x60
	 [<ffffffff8103e67c>] do_page_fault+0xc/0x10
	 [<ffffffff8150d322>] page_fault+0x22/0x30
	RIP  [<ffffffff81074191>] task_scan_min+0x21/0x50
	RSP <ffff880037a6bce0>
	---[ end trace 9a826d16936c04de ]---

Also fix race in task_scan_min (it depends on compiler behaviour).

Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Aaron Tomlin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Dario Faggioli <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Rik van Riel <[email protected]>
Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
Signed-off-by: Ingo Molnar <[email protected]>
rapier1 pushed a commit that referenced this issue Dec 8, 2014
The function chandef_to_chanspec() failed when converting a
chandef with bandwidth set to NL80211_CHAN_WIDTH_20_NOHT. This
was reported by user running the device in AP mode.

------------[ cut here ]------------
WARNING: CPU: 0 PID: 304 at
	drivers/net/wireless/brcm80211/brcmfmac/wl_cfg80211.c:381
		chandef_to_chanspec.isra.11+0x158/0x184()

Modules linked in:

CPU: 0 PID: 304 Comm: hostapd Not tainted 3.16.0-rc7-abb+g64aa90f #8

[<c0014bb4>] (unwind_backtrace) from [<c0012314>] (show_stack+0x10/0x14)
[<c0012314>] (show_stack) from [<c001d3f8>] (warn_slowpath_common+0x6c/0x8c)
[<c001d3f8>] (warn_slowpath_common) from [<c001d4b4>] (warn_slowpath_null+0x1c/0x24)
[<c001d4b4>] (warn_slowpath_null) from [<c03449a4>] (chandef_to_chanspec.isra.11+0x158/0x184)
[<c03449a4>] (chandef_to_chanspec.isra.11) from [<c0348e00>] (brcmf_cfg80211_start_ap+0x1e4/0x614)
[<c0348e00>] (brcmf_cfg80211_start_ap) from [<c04d1468>] (nl80211_start_ap+0x288/0x414)
[<c04d1468>] (nl80211_start_ap) from [<c043d144>] (genl_rcv_msg+0x21c/0x38c)
[<c043d144>] (genl_rcv_msg) from [<c043c740>] (netlink_rcv_skb+0xac/0xc0)
[<c043c740>] (netlink_rcv_skb) from [<c043cf14>] (genl_rcv+0x20/0x34)
[<c043cf14>] (genl_rcv) from [<c043c0a0>] (netlink_unicast+0x150/0x20c)
[<c043c0a0>] (netlink_unicast) from [<c043c4b8>] (netlink_sendmsg+0x2b8/0x398)
[<c043c4b8>] (netlink_sendmsg) from [<c04066a4>] (sock_sendmsg+0x84/0xa8)
[<c04066a4>] (sock_sendmsg) from [<c0407c5c>] (___sys_sendmsg.part.29+0x268/0x278)
[<c0407c5c>] (___sys_sendmsg.part.29) from [<c0408bdc>] (__sys_sendmsg+0x4c/0x7c)
[<c0408bdc>] (__sys_sendmsg) from [<c000ec60>] (ret_fast_syscall+0x0/0x44)
---[ end trace 965ee2158c9905a2 ]---

Cc: [email protected] # v3.17
Reported-by: Pontus Fuchs <[email protected]>
Reviewed-by: Hante Meuleman <[email protected]>
Reviewed-by: Daniel (Deognyoun) Kim <[email protected]>
Reviewed-by: Franky (Zhenhui) Lin <[email protected]>
Reviewed-by: Pieter-Paul Giesberts <[email protected]>
Signed-off-by: Arend van Spriel <[email protected]>
Signed-off-by: John W. Linville <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant