- 22 Apr, 2013 1 commit
-
-
Rusty Russell authored
Signed-off-by:
Rusty Russell <rusty@rustcorp.com.au> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 21 Apr, 2013 1 commit
-
-
Paul E. McKenney authored
The following RCU splat indicates lack of RCU protection:
  [ 953.267649] ===============================
  [ 953.267652] [ INFO: suspicious RCU usage. ]
  [ 953.267657] 3.9.0-0.rc6.git2.4.fc19.ppc64p7 #1 Not tainted
  [ 953.267661] -------------------------------
  [ 953.267664] include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage!
  [ 953.267669]
  [ 953.267669] other info that might help us debug this:
  [ 953.267669]
  [ 953.267675]
  [ 953.267675] rcu_scheduler_active = 1, debug_locks = 0
  [ 953.267680] 1 lock held by glxgears/1289:
  [ 953.267683] #0: (&sig->cred_guard_mutex){+.+.+.}, at: [<c00000000027f884>] .prepare_bprm_creds+0x34/0xa0
  [ 953.267700]
  [ 953.267700] stack backtrace:
  [ 953.267704] Call Trace:
  [ 953.267709] [c0000001f0d1b6e0] [c000000000016e30] .show_stack+0x130/0x200 (unreliable)
  [ 953.267717] [c0000001f0d1b7b0] [c0000000001267f8] .lockdep_rcu_suspicious+0x138/0x180
  [ 953.267724] [c0000001f0d1b840] [c0000000001d43a4] .perf_event_comm+0x4c4/0x690
  [ 953.267731] [c0000001f0d1b950] [c00000000027f6e4] .set_task_comm+0x84/0x1f0
  [ 953.267737] [c0000001f0d1b9f0] [c000000000280414] .setup_new_exec+0x94/0x220
  [ 953.267744] [c0000001f0d1ba70] [c0000000002f665c] .load_elf_binary+0x58c/0x19b0
  ...
This commit therefore adds the required RCU read-side critical section to perf_event_comm(). Reported-by:
Adam Jackson <ajax@redhat.com> Signed-off-by:
Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: a.p.zijlstra@chello.nl Cc: paulus@samba.org Cc: acme@ghostprotocols.net Link: http://lkml.kernel.org/r/20130419190124.GA8638@linux.vnet.ibm.com Signed-off-by:
Ingo Molnar <mingo@kernel.org> Tested-by:
Gustavo Luiz Duarte <gusld@br.ibm.com>
-
- 18 Apr, 2013 2 commits
-
-
Linus Torvalds authored
This reverts commit 3a366e61. Wanlong Gao reports that it causes a kernel panic on his machine several minutes after boot. Reverting it removes the panic. Jens says: "It's not quite clear why that is yet, so I think we should just revert the commit for 3.9 final (which I'm assuming is pretty close). The wifi is crap at the LSF hotel, so sending this email instead of queueing up a revert and pull request." Reported-by:
Wanlong Gao <gaowanlong@cn.fujitsu.com> Requested-by:
Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Masami Hiramatsu authored
Fix a double locking bug caused when debug.kprobe-optimization=0. While proc_kprobes_optimization_handler holds kprobe_mutex, wait_for_kprobe_optimizer locks it again, and that causes a double lock. To fix the bug, this introduces a different mutex for protecting the sysctl parameter and locks it in proc_kprobes_optimization_handler. Of course, since we need to lock kprobe_mutex when touching kprobes resources, that is done in *optimize_all_kprobes(). This bug was introduced by commit ad72b3be ("kprobes: fix wait_for_kprobe_optimizer()") Signed-off-by:
Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by:
Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 17 Apr, 2013 4 commits
-
-
Emese Revfy authored
This fixes a kernel memory contents leak via the tkill and tgkill syscalls for compat processes. This is visible in the siginfo_t->_sifields._rt.si_sigval.sival_ptr field when handling signals delivered from tkill. The place of the infoleak:
  int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
  {
          ...
          put_user_ex(ptr_to_compat(from->si_ptr), &to->si_ptr);
          ...
  }
Signed-off-by:
Emese Revfy <re.emese@gmail.com> Reviewed-by:
PaX Team <pageexec@freemail.hu> Signed-off-by:
Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Yinghai Lu authored
We can extend kexec-tools to support multiple "Crash kernel" in /proc/iomem instead. So we can use "Crash kernel" instead of "Crash kernel low" in /proc/iomem. Suggested-by:
Vivek Goyal <vgoyal@redhat.com> Signed-off-by:
Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1366089828-19692-3-git-send-email-yinghai@kernel.org Acked-by:
Vivek Goyal <vgoyal@redhat.com> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
Yinghai Lu authored
Per hpa, use crashkernel=X,high crashkernel=Y,low instead of crashkernel_high=X crashkernel_low=Y, since that form is extensible.
-v2: according to Vivek, change delimiter to ;
-v3: let high and low only handle the simple form, which conforms to the description in kernel-parameters.txt; still keep crashkernel=X overriding any crashkernel=X,high crashkernel=Y,low
-v4: update get_last_crashkernel returning and add stricter checking in parse_crashkernel_simple(), found by HATAYAMA.
-v5: change delimiter back to , according to HPA. Also separate parse_suffix from parse_simple according to Vivek, so we can avoid @pos in that path.
-v6: tighten the checking of crashkernel=X,highblahblah,high, found by HATAYAMA.
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Signed-off-by:
Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1366089828-19692-5-git-send-email-yinghai@kernel.org Acked-by:
Vivek Goyal <vgoyal@redhat.com> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
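The strict suffix check mentioned in the -v6 note of the entry above can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel's parser: `parse_size_with_suffix` is a hypothetical name, `strtoull()` plus a hand-written K/M/G scale stands in for the kernel's size parsing, and the argument is assumed to start just past `crashkernel=`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for parsing "X,high" or "Y,low".
 * Returns 0 and stores the size in bytes on success, -1 if the size
 * or the suffix is malformed.
 */
static int parse_size_with_suffix(const char *arg, const char *suffix,
                                  unsigned long long *size)
{
    char *cur;
    unsigned long long val = strtoull(arg, &cur, 0);

    if (cur == arg)
        return -1;                      /* no number present */

    /* Optional K/M/G scale; each case falls through to the next. */
    switch (*cur) {
    case 'G': case 'g': val <<= 10;     /* fall through */
    case 'M': case 'm': val <<= 10;     /* fall through */
    case 'K': case 'k': val <<= 10; cur++; break;
    default: break;
    }

    /* The suffix must match exactly... */
    if (strncmp(cur, suffix, strlen(suffix)) != 0)
        return -1;
    cur += strlen(suffix);

    /* ...and must end the option, so ",highblahblah" is rejected. */
    if (*cur != ' ' && *cur != '\0')
        return -1;

    *size = val;
    return 0;
}
```

With this sketch, `256M,high` is accepted for the suffix `,high`, while `256M,highblahblah,high` is rejected because the text after the matched suffix is neither a space nor the end of the string.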
-
Yinghai Lu authored
Vivek found that old kexec-tools does not work with the new kernel anymore. So change crashkernel= back to the old behavior, and add crashkernel_high= to let the user decide if the buffer could be above 4G; new kexec-tools will also be needed.
-v2: let crashkernel=X override crashkernel_high=; update the description to say that _high will be ignored by crashkernel=X
-v3: update description about kernel-parameters.txt according to Vivek.
Signed-off-by:
Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1366089828-19692-4-git-send-email-yinghai@kernel.org Acked-by:
Vivek Goyal <vgoyal@redhat.com> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
- 15 Apr, 2013 4 commits
-
-
Tommi Rantala authored
Trinity discovered that we fail to check all 64 bits of attr.config passed by user space, resulting in an out-of-bounds access of the perf_swevent_enabled array in sw_perf_event_destroy(). Introduced in commit b0a873eb ("perf: Register PMU implementations"). Signed-off-by:
Tommi Rantala <tt.rantala@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Link: http://lkml.kernel.org/r/1365882554-30259-1-git-send-email-tt.rantala@gmail.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
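The class of bug described in the entry above, a bounds check that looks at only part of a 64-bit value, can be illustrated with a small userspace sketch. It is not the kernel patch: `SW_EVENT_MAX`, `config_ok_truncated` and `config_ok_full` are hypothetical names, and the table size is an assumed value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed table size; stands in for the length of an array indexed by config. */
#define SW_EVENT_MAX 9

/* Flawed check: only the low 32 bits of config take part in the comparison. */
static bool config_ok_truncated(uint64_t config)
{
    uint32_t low = (uint32_t)config;    /* upper 32 bits silently dropped */
    return low < SW_EVENT_MAX;
}

/* Full-width check: all 64 bits are compared against the bound. */
static bool config_ok_full(uint64_t config)
{
    return config < SW_EVENT_MAX;
}
```

A value such as `0x100000001` passes the truncated check (its low half is 1) but would index far outside the table if the full 64-bit value were later used as the index; the full-width check rejects it.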
-
Andy Lutomirski authored
Changing uid/gid/projid mappings doesn't change your id within the namespace; it reconfigures the namespace. Unprivileged programs should *not* be able to write these files. (We're also checking the privileges on the wrong task.) Given the write-once nature of these files and the other security checks, this is likely impossible to usefully exploit. Signed-off-by:
Andy Lutomirski <luto@amacapital.net>
-
Andy Lutomirski authored
Signed-off-by:
Andy Lutomirski <luto@amacapital.net>
-
Eric W. Biederman authored
When we require privilege for setting /proc/<pid>/uid_map or /proc/<pid>/gid_map, no longer allow an unprivileged user to open the file and pass it to a privileged program to write to the file. Instead, when privilege is required, require both the opener and the writer to have the necessary capabilities. I have tested this code and verified that setting /proc/<pid>/uid_map fails when an unprivileged user opens the file and a privileged user attempts to set the mapping, that unprivileged users can still map their own id, and that privileged users can still set up an arbitrary mapping. Reported-by:
Andy Lutomirski <luto@amacapital.net> Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by:
Andy Lutomirski <luto@amacapital.net>
-
- 14 Apr, 2013 1 commit
-
-
Linus Torvalds authored
Nothing is using it yet, but this will allow us to delay the open-time checks to use time, without breaking the normal UNIX permission semantics where permissions are determined by the opener (and the file descriptor can then be passed to a different process, or the process can drop capabilities). Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 12 Apr, 2013 4 commits
-
-
Steven Rostedt (Red Hat) authored
As ftrace_filter_lseek is now used with ftrace_pid_fops, it needs to be moved out of the #ifdef CONFIG_DYNAMIC_FTRACE section as the ftrace_pid_fops is defined when DYNAMIC_FTRACE is not. Cc: stable@vger.kernel.org Cc: Namhyung Kim <namhyung@kernel.org> Signed-off-by:
Steven Rostedt <rostedt@goodmis.org>
-
Namhyung Kim authored
Currently the set_ftrace_pid and set_graph_function files use seq_lseek for their fops. However, seq_open() is called only for FMODE_READ in fops->open(), so if a user tries to seek in one of those files after opening it for writing, the seek sees a NULL seq_file and the kernel panics. It can be easily reproduced with the following commands:
  $ cd /sys/kernel/debug/tracing
  $ echo 1234 | sudo tee -a set_ftrace_pid
In this example, GNU coreutils' tee opens the file with fopen(, "a") and then fopen() internally calls lseek(). Link: http://lkml.kernel.org/r/1365663302-2170-1-git-send-email-namhyung@kernel.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: stable@vger.kernel.org Signed-off-by:
Namhyung Kim <namhyung@kernel.org> Signed-off-by:
Steven Rostedt <rostedt@goodmis.org>
-
Thomas Gleixner authored
The smpboot threads rely on the park/unpark mechanism which binds per cpu threads on a particular core. Though the functionality is racy:
  CPU0                  CPU1                    CPU2
  unpark(T)                                     wake_up_process(T)
    clear(SHOULD_PARK)  T runs
                        leave parkme() due to !SHOULD_PARK
    bind_to(CPU2)       BUG_ON(wrong CPU)
We cannot let the tasks move themselves to the target CPU as one of those tasks is actually the migration thread itself, which requires that it starts running on the target cpu right away. The solution to this problem is to prevent wakeups in park mode which are not from unpark(). That way we can guarantee that the association of the task to the target cpu is working correctly. Add a new task state (TASK_PARKED) which prevents other wakeups and use this state explicitly for the unpark wakeup.
Peter noticed: Also, since the task state is visible to userspace and all the parked tasks are still in the PID space, it's a good hint in ps and friends that these tasks aren't really there for the moment.
The migration thread has another related issue.
  CPU0                   CPU1
  Bring up CPU2
  create_thread(T)
  park(T)
   wait_for_completion()
                         parkme()
                         complete()
  sched_set_stop_task()
                         schedule(TASK_PARKED)
The sched_set_stop_task() call is issued while the task is on the runqueue of CPU1 and that confuses the hell out of the stop_task class on that cpu. So we need the same synchronization before sched_set_stop_task(). Reported-by:
Dave Jones <davej@redhat.com> Reported-and-tested-by:
Dave Hansen <dave@sr71.net> Reported-and-tested-by:
Borislav Petkov <bp@alien8.de> Acked-by:
Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: dhillf@gmail.com Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos Signed-off-by:
Thomas Gleixner <tglx@linutronix.de>
-
Wei Yongjun authored
Fix to return -ENOMEM in the allocation error case instead of 0 (if pmu_bus_running == 1), as done elsewhere in this function. Signed-off-by:
Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: a.p.zijlstra@chello.nl Cc: paulus@samba.org Cc: acme@ghostprotocols.net Link: http://lkml.kernel.org/r/CAPgLHd8j_fWcgqe%3DKLWjpBj%2B%3Do0Pw6Z-SEq%3DNTPU08c2w1tngQ@mail.gmail.com [ Tweaked the error code setting placement and the changelog. ] Signed-off-by:
Ingo Molnar <mingo@kernel.org>
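The error-path pattern behind the entry above can be sketched as follows. This is a hypothetical userspace routine, not the perf code: `register_thing`, `bus_running` and `fail_alloc` are made-up names, and `fail_alloc` only simulates an allocation failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical registration routine; fail_alloc simulates a failed allocation. */
static int register_thing(int bus_running, int fail_alloc)
{
    int ret = -ENOMEM;
    void *ctx;

    if (bus_running)
        ret = 0;            /* an earlier step succeeded and overwrote ret */

    ctx = fail_alloc ? NULL : malloc(64);
    if (!ctx) {
        ret = -ENOMEM;      /* without this line, ret could still be 0 here */
        goto out;
    }

    free(ctx);
    ret = 0;
out:
    return ret;
}
```

The point of the sketch is that a shared `ret` variable can carry a stale success value into a later failure branch, so the failure branch sets the error code explicitly before jumping to the common exit.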
-
- 09 Apr, 2013 1 commit
-
-
Namhyung Kim authored
On the failure path, stat->start and stat->pages will refer to the same page, so it will attempt to free the same page again and cause a kernel panic. Link: http://lkml.kernel.org/r/1364820385-32027-1-git-send-email-namhyung@kernel.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: stable@vger.kernel.org Signed-off-by:
Namhyung Kim <namhyung@kernel.org> Signed-off-by:
Steven Rostedt <rostedt@goodmis.org>
-
- 08 Apr, 2013 10 commits
-
-
Huacai Chen authored
As commit 40dc166c (PM / Core: Introduce struct syscore_ops for core subsystems PM) says, syscore_ops operations should be carried out with one CPU on-line and interrupts disabled. However, after commit f96972f2 (kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()), syscore_shutdown() is called before disable_nonboot_cpus(), which breaks that rule. We have a MIPS machine with an 8259A PIC, and there is an external timer (HPET) linked to the 8259A. Since the 8259A has been shut down too early (by syscore_shutdown()), disable_nonboot_cpus() runs without a timer interrupt, so it hangs and the reboot fails. This patch calls syscore_shutdown() a little later (after disable_nonboot_cpus()) to avoid the reboot failure; this is the same way poweroff does it. For consistency, add disable_nonboot_cpus() to kernel_halt(). Signed-off-by:
Huacai Chen <chenhc@lemote.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-
Steven Rostedt (Red Hat) authored
The function tracing control loop used by perf spits out a warning if the called function is not a control function. This is because the control function references a per cpu allocated data structure on struct ftrace_ops that is not allocated for other types of functions. Commit 0a016409 "ftrace: Optimize the function tracer list loop" added an optimization to all function tracing loops to optimize for a single registered ops. Unfortunately, this allows for a slight race when tracing starts or ends, where the stub function might be called after the current registered ops is removed. In this case we get the following dump:
  root# perf stat -e ftrace:function sleep 1
  [ 74.339105] WARNING: at include/linux/ftrace.h:209 ftrace_ops_control_func+0xde/0xf0()
  [ 74.349522] Hardware name: PRIMERGY RX200 S6
  [ 74.357149] Modules linked in: sg igb iTCO_wdt ptp pps_core iTCO_vendor_support i7core_edac dca lpc_ich i2c_i801 coretemp edac_core crc32c_intel mfd_core ghash_clmulni_intel dm_multipath acpi_power_meter pcspkr microcode vhost_net tun macvtap macvlan nfsd kvm_intel kvm auth_rpcgss nfs_acl lockd sunrpc uinput xfs libcrc32c sd_mod crc_t10dif sr_mod cdrom mgag200 i2c_algo_bit drm_kms_helper ttm qla2xxx mptsas ahci drm libahci scsi_transport_sas mptscsih libata scsi_transport_fc i2c_core mptbase scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
  [ 74.446233] Pid: 1377, comm: perf Tainted: G W 3.9.0-rc1 #1
  [ 74.453458] Call Trace:
  [ 74.456233] [<ffffffff81062e3f>] warn_slowpath_common+0x7f/0xc0
  [ 74.462997] [<ffffffff810fbc60>] ? rcu_note_context_switch+0xa0/0xa0
  [ 74.470272] [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
  [ 74.478117] [<ffffffff81062e9a>] warn_slowpath_null+0x1a/0x20
  [ 74.484681] [<ffffffff81102ede>] ftrace_ops_control_func+0xde/0xf0
  [ 74.491760] [<ffffffff8162f400>] ftrace_call+0x5/0x2f
  [ 74.497511] [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
  [ 74.503486] [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
  [ 74.509500] [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
  [ 74.516088] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
  [ 74.522268] [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
  [ 74.528837] [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
  [ 74.536696] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
  [ 74.542878] [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
  [ 74.548869] [<ffffffff81105c67>] unregister_ftrace_function+0x27/0x50
  [ 74.556243] [<ffffffff8111eadf>] perf_ftrace_event_register+0x9f/0x140
  [ 74.563709] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
  [ 74.569887] [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
  [ 74.575898] [<ffffffff8111e94e>] perf_trace_destroy+0x2e/0x50
  [ 74.582505] [<ffffffff81127ba9>] tp_perf_event_destroy+0x9/0x10
  [ 74.589298] [<ffffffff811295d0>] free_event+0x70/0x1a0
  [ 74.595208] [<ffffffff8112a579>] perf_event_release_kernel+0x69/0xa0
  [ 74.602460] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
  [ 74.608667] [<ffffffff8112a640>] put_event+0x90/0xc0
  [ 74.614373] [<ffffffff8112a740>] perf_release+0x10/0x20
  [ 74.620367] [<ffffffff811a3044>] __fput+0xf4/0x280
  [ 74.625894] [<ffffffff811a31de>] ____fput+0xe/0x10
  [ 74.631387] [<ffffffff81083697>] task_work_run+0xa7/0xe0
  [ 74.637452] [<ffffffff81014981>] do_notify_resume+0x71/0xb0
  [ 74.643843] [<ffffffff8162fa92>] int_signal+0x12/0x17
To fix this a new ftrace_ops flag is added that denotes the ftrace_list_end ftrace_ops stub as just that, a stub. This flag is now checked in the control loop and the function is not called if the flag is set. Thanks to Jovi for not just reporting the bug, but also pointing out where the bug was in the code. Link: http://lkml.kernel.org/r/514A8855.7090402@redhat.com Link: http://lkml.kernel.org/r/1364377499-1900-15-git-send-email-jovi.zhangwei@huawei.com Tested-by:
WANG Chao <chaowang@redhat.com> Reported-by:
WANG Chao <chaowang@redhat.com> Reported-by:
zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by:
Steven Rostedt <rostedt@goodmis.org>
-
Jan Kiszka authored
If we reenable ftrace via sysctl, we currently set ftrace_trace_function based on the previous simplistic algorithm. This is inconsistent with what update_ftrace_function does, so it is better to call that helper instead. Link: http://lkml.kernel.org/r/5151D26F.1070702@siemens.com Cc: stable@vger.kernel.org Signed-off-by:
Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by:
Steven Rostedt <rostedt@goodmis.org>
-
Steven Rostedt (Red Hat) authored
The commit 34600f0e "tracing: Fix race with max_tr and changing tracers" fixed the updating of the main buffers with the race of changing tracers, but left out the fix to the updating of just a per cpu buffer. Signed-off-by:
Steven Rostedt <rostedt@goodmis.org>
-
Stanislaw Gruszka authored
Recent commit 6fac4829 ("cputime: Use accessors to read task cputime stats") introduced a bug, where we account many times the cputime of the first thread, instead of cputimes of all the different threads. Signed-off-by:
Stanislaw Gruszka <sgruszka@redhat.com> Acked-by:
Frederic Weisbecker <fweisbec@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130404085740.GA2495@redhat.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Chen Gang authored
For NUL terminated string we always need to set '\0' at the end. Signed-off-by:
Chen Gang <gang.chen@asianux.com> Cc: rostedt@goodmis.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/516243B7.9020405@asianux.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Chen Gang authored
For NUL terminated string we always need to set '\0' at the end. Signed-off-by:
Chen Gang <gang.chen@asianux.com> Cc: rostedt@goodmis.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/51624254.30301@asianux.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Chen Gang authored
For NUL terminated string, always make sure that there's '\0' at the end. In our case we need a return value, so still use strncpy() and fix up the tail explicitly. (strlcpy() returns the size, not the pointer) Signed-off-by:
Chen Gang <gang.chen@asianux.com> Cc: a.p.zijlstra@chello.nl <a.p.zijlstra@chello.nl> Cc: paulus@samba.org <paulus@samba.org> Cc: acme@ghostprotocols.net <acme@ghostprotocols.net> Link: http://lkml.kernel.org/r/51623E0B.7070101@asianux.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
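The strncpy() pattern described in the entry above (keep the pointer return value, then fix up the tail explicitly) can be sketched like this. `copy_terminated` is a hypothetical helper name, and the sketch assumes the destination capacity is at least 1.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy src into dst (capacity size, which must be > 0) and always
 * NUL-terminate. Returns dst, mirroring strncpy()'s return value.
 */
static char *copy_terminated(char *dst, const char *src, size_t size)
{
    char *ret = strncpy(dst, src, size - 1);

    dst[size - 1] = '\0';   /* strncpy() does not terminate on truncation */
    return ret;
}
```

Copying `"abcdefgh"` into a 4-byte buffer with this helper leaves `"abc"` in the buffer, where a bare `strncpy(dst, src, 4)` would leave it unterminated.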
-
libin authored
Commit 201c373e ("sched/debug: Limit sd->*_idx range on sysctl") was an incomplete bug fix. This patch fixes sd->*_idx limit range to [0 ~ CPU_LOAD_IDX_MAX-1] avoiding array overflow caused by setting sd->*_idx to CPU_LOAD_IDX_MAX on sysctl. Signed-off-by:
Libin <huawei.libin@huawei.com> Cc: <jiang.liu@huawei.com> Cc: <guohanjun@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/51626610.2040607@huawei.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
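The off-by-one described in the entry above, where an index equal to the array length was still accepted, can be shown with a hand-rolled range check. This is only a sketch: the value of `CPU_LOAD_IDX_MAX` is assumed for illustration, `set_load_idx` is a hypothetical helper, and the in-kernel sysctl range mechanism is not reproduced here.

```c
#include <assert.h>
#include <errno.h>

/* Assumed for illustration; the real value is defined by the scheduler. */
#define CPU_LOAD_IDX_MAX 5

/*
 * Accept only indexes in [0, CPU_LOAD_IDX_MAX - 1], the valid range for
 * an array with CPU_LOAD_IDX_MAX elements. On rejection *idx is untouched.
 */
static int set_load_idx(int *idx, int value)
{
    if (value < 0 || value > CPU_LOAD_IDX_MAX - 1)
        return -EINVAL;
    *idx = value;
    return 0;
}
```

An upper limit of `CPU_LOAD_IDX_MAX` instead of `CPU_LOAD_IDX_MAX - 1` would let the value 5 through, which is one past the last valid element of a five-element array.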
-
Thomas Gleixner authored
The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value.
  CPU0                        CPU1
  sched_clock_local()         sched_clock_remote(CPU0)
  ...
                              remote_clock = scd[CPU0]->clock
                                  read_low32bit(scd[CPU0]->clock)
  cmpxchg64(scd->clock,...)
                                  read_high32bit(scd[CPU0]->clock)
While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is that CPU1 will see the sched_clock on CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1's sched_clock happens in the context of a realtime thread, which then gets charged 4 seconds of RT runtime, which causes the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint:
  <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
  <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ...
  irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0
  <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80
What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_clock_remote(). The time jump gets propagated to CPU0 via sched_clock_remote() and stays stale on both cores for ~4 seconds. There are only two other possibilities which could cause a stale sched clock:
  1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value.
  2) sched_clock() which reads the TSC returns a sporadic wrong value.
#1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale. #2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Reported-by:
Siegfried Wulsch <Siegfried.Wulsch@rovema.de> Acked-by:
Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
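The "~4 seconds" figure in the entry above follows from the arithmetic of a torn 64-bit read: the low 32 bits of a nanosecond counter wrap every 2^32 ns, which is about 4.29 s. The sketch below models the torn read in plain userspace C; `torn_read` is a hypothetical helper and does not reproduce the kernel's actual fix.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine the low half of one 64-bit value with the high half of another,
 * modelling a reader whose two 32-bit loads straddle an update.
 */
static uint64_t torn_read(uint64_t before, uint64_t after)
{
    uint32_t low  = (uint32_t)before;           /* loaded before the update */
    uint32_t high = (uint32_t)(after >> 32);    /* loaded after the update */

    return ((uint64_t)high << 32) | low;
}
```

If the clock moves from `0x1FFFFFFFF` to `0x200000000` between the two loads, the reader sees `0x2FFFFFFFF`, which is `0xFFFFFFFF` ns (roughly 4.29 s) ahead of the true value. When the update does not cross the 32-bit boundary the high half is unchanged and the reader simply sees the old value, which matches the statement that the update must go across the 32bit boundary.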
-
- 31 Mar, 2013 1 commit
-
-
Paul Walmsley authored
This reverts commit 6aa97070. Commit 6aa97070 ("lockdep: check that no locks held at freeze time") causes problems with NFS root filesystems. The failures were noticed on OMAP2 and 3 boards during kernel init:
  [ BUG: swapper/0/1 still has locks held! ]
  3.9.0-rc3-00344-ga937536b #1 Not tainted
  -------------------------------------
  1 lock held by swapper/0/1:
  #0: (&type->s_umount_key#13/1){+.+.+.}, at: [<c011e84c>] sget+0x248/0x574
  stack backtrace:
  rpc_wait_bit_killable
  __wait_on_bit
  out_of_line_wait_on_bit
  __rpc_execute
  rpc_run_task
  rpc_call_sync
  nfs_proc_get_root
  nfs_get_root
  nfs_fs_mount_common
  nfs_try_mount
  nfs_fs_mount
  mount_fs
  vfs_kern_mount
  do_mount
  sys_mount
  do_mount_root
  mount_root
  prepare_namespace
  kernel_init_freeable
  kernel_init
Although the rootfs mounts, the system is unstable. Here's a transcript from a PM test: http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt Here's what the test log should look like: http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt Mailing list discussion is here: http://lkml.org/lkml/2013/3/4/221 Deal with this for v3.9 by reverting the problem commit, until folks can figure out the right long-term course of action. Signed-off-by:
Paul Walmsley <paul@pwsan.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: Jeff Layton <jlayton@redhat.com> Cc: Shawn Guo <shawn.guo@linaro.org> Cc: <maciej.rutecki@gmail.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ben Chan <benchan@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 27 Mar, 2013 2 commits
-
-
Eric W. Biederman authored
Only allow unprivileged mounts of proc and sysfs if they are already mounted when the user namespace is created. proc and sysfs are interesting because they have content that is per namespace, and so fresh mounts are needed when new namespaces are created while at the same time proc and sysfs have content that is shared between every instance. Respect the policy of who may see the shared content of proc and sysfs by only allowing new mounts if there was an existing mount at the time the user namespace was created. In practice there are only two interesting cases: proc and sysfs are mounted at their usual places, proc and sysfs are not mounted at all (some form of mount namespace jail). Cc: stable@vger.kernel.org Acked-by:
Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Guarantee that the policy of which files may be accessed that is established by setting the root directory will not be violated by user namespaces by verifying that the root directory points to the root of the mount namespace at the time of user namespace creation. Changing the root is a privileged operation, and as a matter of policy it serves to limit unprivileged processes to files below the current root directory. For reasons of simplicity and comprehensibility the privilege to change the root directory is gated solely on the CAP_SYS_CHROOT capability in the user namespace. Therefore when creating a user namespace we must ensure that the policy of which files may be accessed can not be violated by changing the root directory. Anyone who runs a process in a chroot and would like to use user namespaces can set up the same view of filesystems with a mount namespace instead, with the result that this is not a practical limitation for using user namespaces. Cc: stable@vger.kernel.org Acked-by:
Serge Hallyn <serge.hallyn@canonical.com> Reported-by:
Andy Lutomirski <luto@amacapital.net> Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com>
-
- 26 Mar, 2013 2 commits
-
-
Michael Bohan authored
The current code makes the assumption that a cpu_base lock won't be held if the CPU corresponding to that cpu_base is offline, which isn't always true. If a hrtimer is not queued, then it will not be migrated by migrate_hrtimers() when a CPU is offlined. Therefore, the hrtimer's cpu_base may still point to a CPU which has subsequently gone offline if the timer wasn't enqueued at the time the CPU went down. Normally this wouldn't be a problem, but a cpu_base's lock is blindly reinitialized each time a CPU is brought up. If a CPU is brought online during the period that another thread is performing a hrtimer operation on a stale hrtimer, then the lock will be reinitialized under its feet, and a SPIN_BUG() like the following will be observed:
  <0>[ 28.082085] BUG: spinlock already unlocked on CPU#0, swapper/0/0
  <0>[ 28.087078] lock: 0xc4780b40, value 0x0 .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
  <4>[ 42.451150] [<c0014398>] (unwind_backtrace+0x0/0x120) from [<c0269220>] (do_raw_spin_unlock+0x44/0xdc)
  <4>[ 42.460430] [<c0269220>] (do_raw_spin_unlock+0x44/0xdc) from [<c071b5bc>] (_raw_spin_unlock+0x8/0x30)
  <4>[ 42.469632] [<c071b5bc>] (_raw_spin_unlock+0x8/0x30) from [<c00a9ce0>] (__hrtimer_start_range_ns+0x1e4/0x4f8)
  <4>[ 42.479521] [<c00a9ce0>] (__hrtimer_start_range_ns+0x1e4/0x4f8) from [<c00aa014>] (hrtimer_start+0x20/0x28)
  <4>[ 42.489247] [<c00aa014>] (hrtimer_start+0x20/0x28) from [<c00e6190>] (rcu_idle_enter_common+0x1ac/0x320)
  <4>[ 42.498709] [<c00e6190>] (rcu_idle_enter_common+0x1ac/0x320) from [<c00e6440>] (rcu_idle_enter+0xa0/0xb8)
  <4>[ 42.508259] [<c00e6440>] (rcu_idle_enter+0xa0/0xb8) from [<c000f268>] (cpu_idle+0x24/0xf0)
  <4>[ 42.516503] [<c000f268>] (cpu_idle+0x24/0xf0) from [<c06ed3c0>] (rest_init+0x88/0xa0)
  <4>[ 42.524319] [<c06ed3c0>] (rest_init+0x88/0xa0) from [<c0c00978>] (start_kernel+0x3d0/0x434)
As an example, this particular crash occurred when hrtimer_start() was executed on CPU #0. The code locked the hrtimer's current cpu_base corresponding to CPU #1. CPU #0 then tried to switch the hrtimer's cpu_base to an optimal CPU which was online. In this case, it selected the cpu_base corresponding to CPU #3. Before it could proceed, CPU #1 came online and reinitialized the spinlock corresponding to its cpu_base. Thus now CPU #0 held a lock which was reinitialized. When CPU #0 finally ended up unlocking the old cpu_base corresponding to CPU #1 so that it could switch to CPU #3, we hit this SPIN_BUG() above while in switch_hrtimer_base().
  CPU #0                            CPU #1
  ----                              ----
  ...                               <offline>
  hrtimer_start()
  lock_hrtimer_base(base #1)
  ...                               init_hrtimers_cpu()
  switch_hrtimer_base()             ...
  ...                               raw_spin_lock_init(&cpu_base->lock)
  raw_spin_unlock(&cpu_base->lock)  ...
  <spin_bug>
Solve this by statically initializing the lock. Signed-off-by:
Michael Bohan <mbohan@codeaurora.org> Link: http://lkml.kernel.org/r/1363745965-23475-1-git-send-email-mbohan@codeaurora.org Cc: stable@vger.kernel.org Signed-off-by:
Thomas Gleixner <tglx@linutronix.de>
-
Eric W. Biederman authored
When a multi-threaded init exits and the initial thread is not the last thread to exit the initial thread hangs around as a zombie until the last thread exits. In that case zap_pid_ns_processes needs to wait until there are only 2 hashed pids in the pid namespace not one. v2. Replace thread_pid_vnr(me) == 1 with the test thread_group_leader(me) as suggested by Oleg. Cc: stable@vger.kernel.org Cc: Oleg Nesterov <oleg@redhat.com> Reported-by:
Caj Larsson <caj@omnicloud.com> Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com>
-
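The wait condition described in the zap_pid_ns_processes commit above can be sketched as a tiny user-space model. The function names are hypothetical and the kernel's bookkeeping is reduced to a single counter; the point is only the choice between 1 and 2 remaining hashed pids.

```c
#include <stdbool.h>

/*
 * Number of hashed pids expected to remain once everything else in the
 * namespace has exited (hypothetical helper). If the exiting thread is
 * the thread group leader, only its own pid is left. Otherwise the
 * zombie leader's pid is still hashed as well.
 */
int toy_init_pids(bool caller_is_group_leader)
{
    return caller_is_group_leader ? 1 : 2;
}

/* True once the namespace has drained far enough to stop waiting. */
bool toy_namespace_drained(int nr_hashed, bool caller_is_group_leader)
{
    return nr_hashed == toy_init_pids(caller_is_group_leader);
}
```

With a threshold fixed at 1, a non-leader thread of a multi-threaded init would never see the count it waits for, because the zombie leader's pid remains hashed.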
- 22 Mar, 2013 2 commits
-
-
Oleg Nesterov authored
David said: Commit 6c0c0d4d ("poweroff: fix bug in orderly_poweroff()") apparently fixes one bug in orderly_poweroff(), but introduces another. The comments on orderly_poweroff() claim it can be called from any context - and indeed we call it from interrupt context in arch/powerpc/platforms/pseries/ras.c for example. But since that commit this is no longer safe, since call_usermodehelper_fns() is not safe in interrupt context without the UMH_NO_WAIT option. orderly_poweroff() can be used from any context but UMH_WAIT_EXEC is sleepable. Move the "force" logic into __orderly_poweroff() and change orderly_poweroff() to use the global poweroff_work which simply calls __orderly_poweroff(). While at it, remove the unneeded "int argc" and change argv_split() to use GFP_KERNEL. We use the global "bool poweroff_force" to pass the argument, this can obviously affect the previous request if it is pending/running. So we only allow the "false => true" transition assuming that the pending "true" should succeed anyway. If schedule_work() fails after that we know that work->func() was not called yet, it must see the new value. This means that orderly_poweroff() becomes async even if we do not run the command and always succeeds, schedule_work() can only fail if the work is already pending. We can export __orderly_poweroff() and change the non-atomic callers which want the old semantics. Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Reported-by:
Benjamin Herrenschmidt <benh@kernel.crashing.org> Reported-by:
David Gibson <david@gibson.dropbear.id.au> Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Feng Hong <hongfeng@marvell.com> Cc: Kees Cook <keescook@chromium.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
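The "false => true" rule for the global force flag in the orderly_poweroff() commit above can be modelled in user space. This is a sketch under stated assumptions: all `toy_` names are hypothetical, the work item is a single pending bit, and the "worker" is run by hand instead of by a workqueue.

```c
#include <stdbool.h>

bool toy_poweroff_force;      /* models the global "bool poweroff_force" */
bool toy_work_pending;        /* models poweroff_work being queued       */
int  toy_last_run_force = -1; /* what the worker saw (-1: never ran)     */

/* Models schedule_work(): fails only if the work is already pending. */
bool toy_schedule_work(void)
{
    if (toy_work_pending)
        return false;
    toy_work_pending = true;
    return true;
}

/* Models the work function running in process context. */
void toy_run_pending_work(void)
{
    if (!toy_work_pending)
        return;
    toy_work_pending = false;
    toy_last_run_force = toy_poweroff_force;
}

/* Async request: only the false => true transition is allowed. */
int toy_orderly_poweroff(bool force)
{
    if (force)
        toy_poweroff_force = true; /* never override a pending "true" */
    toy_schedule_work();           /* failure means already pending   */
    return 0;
}
```

A later `toy_orderly_poweroff(false)` cannot downgrade an earlier forced request, and a request made while the work is still pending is picked up because the worker reads the flag only when it runs.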
-
Frederic Weisbecker authored
wake_up_klogd() is useless when CONFIG_PRINTK=n because neither printk() nor printk_sched() is in use and there is actually no waiter on the log_wait waitqueue. It should be a stub in this case for users like bust_spinlocks(). Otherwise this results in this warning when CONFIG_PRINTK=n and CONFIG_IRQ_WORK=n: kernel/built-in.o In function `wake_up_klogd': (.text.wake_up_klogd+0xb4): undefined reference to `irq_work_queue' To fix this, provide an off-case for wake_up_klogd() when CONFIG_PRINTK=n. There is much more from console_unlock() and other console related code in printk.c that should be moved under CONFIG_PRINTK. But for now, focus on a minimal fix as we passed the merge window already. [akpm@linux-foundation.org: include printk.h in bust_spinlocks.c] Signed-off-by:
Frederic Weisbecker <fweisbec@gmail.com> Reported-by:
James Hogan <james.hogan@imgtec.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
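The "off-case" pattern from the wake_up_klogd() commit above looks roughly like the following user-space sketch. `TOY_CONFIG_PRINTK` and every `toy_` name are hypothetical stand-ins, and the caller is a simplified imitation of bust_spinlocks(), not its real body.

```c
/* Define TOY_CONFIG_PRINTK at build time to model CONFIG_PRINTK=y. */
#ifdef TOY_CONFIG_PRINTK
int toy_klogd_wakeups;

/* Real implementation: would wake whoever waits on the log. */
void toy_wake_up_klogd(void)
{
    toy_klogd_wakeups++;
}
#else
/*
 * Off-case: an empty inline stub. Callers need no #ifdef of their own,
 * and no symbol from the disabled subsystem is referenced at link time.
 */
static inline void toy_wake_up_klogd(void)
{
}
#endif

int toy_oops_in_progress;

/* Simplified caller that uses the helper unconditionally. */
void toy_bust_spinlocks(int yes)
{
    if (yes) {
        ++toy_oops_in_progress;
    } else if (--toy_oops_in_progress == 0) {
        toy_wake_up_klogd();
    }
}
```

The design choice is to keep the conditional compilation next to the helper's definition, so that each caller compiles and links the same way in both configurations.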
-
- 21 Mar, 2013 2 commits
-
-
Stephane Eranian authored
This patch fixes a flaw in perf_output_space(). In case the size of the space needed is bigger than the actual buffer size, there may be situations where the function would return true (i.e., there is space) when it should not. This happens when head > offset due to rounding in the masking logic. The problem can be tested by activating BTS on Intel processors. A BTS record can be as big as 16 pages. The following command fails: $ perf record -m 4 -c 1 -e branches:u my_test_program You will get a buffer corruption with this. Perf report won't be able to parse the perf.data. The fix is to first check that the requested space is smaller than the buffer size. If so, then the masking logic will work fine. If not, then there is no chance the record can be saved and it will be gracefully handled by upper code layers. [ In v2, we also make the logic for the writable flag more explicit by renaming it to rb->overwrite because it tells whether or not the buffer can overwrite its tail (suggested by PeterZ). ] Signed-off-by:
Stephane Eranian <eranian@google.com> Acked-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: fweisbec@gmail.com Link: http://lkml.kernel.org/r/20130318133327.GA3056@quad Signed-off-by:
Ingo Molnar <mingo@kernel.org>
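The masking flaw described in the perf_output_space() commit above can be reproduced with plain integers. The following is a user-space sketch with hypothetical `toy_` names; it assumes a power-of-two buffer size and leaves out the overwrite-mode shortcut mentioned in the commit message.

```c
#include <stdbool.h>

/*
 * Masking-only check. "size" must be a power of two. "offset" is where
 * the record would start, "head" where it would end, and "tail" is the
 * reader position.
 */
bool toy_output_space_buggy(unsigned long size, unsigned long tail,
                            unsigned long offset, unsigned long head)
{
    unsigned long mask = size - 1;

    offset = (offset - tail) & mask;
    head   = (head - tail) & mask;

    /* A record that wraps past the tail ends before it starts. */
    return head >= offset;
}

/* Same check, but oversized records are rejected before masking. */
bool toy_output_space_fixed(unsigned long size, unsigned long tail,
                            unsigned long offset, unsigned long head)
{
    if (head - offset > size)
        return false; /* the record can never fit in this buffer */

    return toy_output_space_buggy(size, tail, offset, head);
}
```

For a 16-byte buffer and a 20-byte record starting at offset 0, the masked head becomes 4, so the masking-only version reports free space even though the record is larger than the whole buffer. The size check catches that case first.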
-
Tejun Heo authored
try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding rq lock crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130318192234.GD3042@htj.dyndns.org Signed-off-by:
Ingo Molnar <mingo@kernel.org>
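The difference between crashing and warning once, as argued in the try_to_wake_up_local() commit above, can be sketched in user space. The helper below is a hypothetical function-based stand-in for the kernel's WARN_ON_ONCE() macro, and the runqueue check is reduced to comparing two integers.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Reports a violated rule the first time only and returns the
 * condition so the caller can bail out (hypothetical helper).
 */
bool toy_warn_on_once(bool cond, bool *warned, const char *what)
{
    if (cond && !*warned) {
        *warned = true;
        fprintf(stderr, "WARNING: %s\n", what);
    }
    return cond;
}

/*
 * Models a local wakeup: refuses, with a one-time warning, to wake a
 * task that sits on a different runqueue instead of halting the system.
 * Returns true if the wakeup was performed.
 */
bool toy_wake_up_local(int task_rq, int this_rq)
{
    static bool warned;

    if (toy_warn_on_once(task_rq != this_rq, &warned,
                         "task is not on the local runqueue"))
        return false;

    return true;
}
```

A skipped wakeup may stall some work for a while, but the system keeps running and the warning still records that the rule was broken.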
-
- 18 Mar, 2013 2 commits
-
-
Namhyung Kim authored
perf_event_task_event() iterates the pmu list and generates events for each eligible pmu context. But if task_event has task_ctx like in EXIT it'll generate events even though the pmu doesn't have an eligible one. Fix it by moving the code to proper places.

Before this patch:

    $ perf record -n true
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.006 MB perf.data (~248 samples) ]

    $ perf report -D | tail
    Aggregated stats:
        TOTAL events: 73
        MMAP events: 67
        COMM events: 2
        EXIT events: 4
    cycles stats:
        TOTAL events: 73
        MMAP events: 67
        COMM events: 2
        EXIT events: 4

After this patch:

    $ perf report -D | tail
    Aggregated stats:
        TOTAL events: 70
        MMAP events: 67
        COMM events: 2
        EXIT events: 1
    cycles stats:
        TOTAL events: 70
        MMAP events: 67
        COMM events: 2
        EXIT events: 1

Signed-off-by:
Namhyung Kim <namhyung@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1363332433-7637-1-git-send-email-namhyung@kernel.org Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Namhyung Kim authored
When cpu/task clock events are initialized, their sampling frequencies are converted to have a fixed value. However, the conversion did not update hwc->last_period, which was set to 1 for initial sampling frequency calibration. Because this hwc->last_period value is used as a period in perf_swevent_hrtimer(), every recorded sample will have an incorrect period of 1.

    $ perf record -e task-clock noploop 1
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.158 MB perf.data (~6919 samples) ]

    $ perf report -n --show-total-period --stdio
    # Samples: 4K of event 'task-clock'
    # Event count (approx.): 4000
    #
    # Overhead       Samples        Period  Command  Shared Object  Symbol
    # ........  ............  ............  .......  .............  ..................
    #
        99.95%          3998          3998  noploop  noploop        [.] main
         0.03%             1             1  noploop  libc-2.15.so   [.] init_cacheinfo
         0.03%             1             1  noploop  ld-2.15.so     [.] open_verify

Note that it doesn't affect the non-sampling event so that the perf stat still gets correct value with or without this patch.

    $ perf stat -e task-clock noploop 1

    Performance counter stats for 'noploop 1':

        1000.272525 task-clock # 1.000 CPUs utilized

        1.000560605 seconds time elapsed

Signed-off-by:
Namhyung Kim <namhyung@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1363574507-18808-1-git-send-email-namhyung@kernel.org Signed-off-by:
Ingo Molnar <mingo@kernel.org>
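The stale last_period described in the clock-event commit above can be shown with a short user-space model. The struct and function names are hypothetical, and the 4000 Hz figure used in the note below is only an assumption chosen to match the roughly 4K samples per second in the report output.

```c
#define TOY_NSEC_PER_SEC 1000000000ULL

/* Hypothetical subset of the per-event hardware state. */
struct toy_hw_event {
    unsigned long long sample_period; /* fixed period in nanoseconds */
    unsigned long long last_period;   /* period reported per sample  */
};

/* Buggy init: converts the frequency but leaves last_period at 1. */
void toy_init_clock_event_buggy(struct toy_hw_event *hwc,
                                unsigned long long freq)
{
    hwc->last_period = 1; /* left over from frequency calibration */
    hwc->sample_period = TOY_NSEC_PER_SEC / freq; /* freq must be > 0 */
}

/* Fixed init: last_period follows the newly computed fixed period. */
void toy_init_clock_event_fixed(struct toy_hw_event *hwc,
                                unsigned long long freq)
{
    toy_init_clock_event_buggy(hwc, freq);
    hwc->last_period = hwc->sample_period;
}
```

At 4000 Hz the fixed period is 250000 ns. The buggy path still reports a period of 1 for every sample, which is why the Period column in the report equals the sample count instead of the elapsed time.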
-
- 15 Mar, 2013 1 commit
-
-
Steven Rostedt (Red Hat) authored
The latency tracers require the buffers to be in overwrite mode, otherwise they get screwed up. Force the buffers to stay in overwrite mode when latency tracers are enabled. Added a flag_changed() method to the tracer structure to allow the tracers to see what flags are being changed, and also be able to prevent the change from happening. Cc: stable@vger.kernel.org Signed-off-by:
Steven Rostedt <rostedt@goodmis.org>
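A veto-style callback like the flag_changed() method described in the tracing commit above might look like this in a user-space model. The struct layout, the flag value, and all `toy_` names are assumptions made for illustration, not the kernel's definitions.

```c
#include <stdbool.h>

#define TOY_FLAG_OVERWRITE 0x1u /* hypothetical flag bit */

struct toy_tracer {
    bool enabled;
    /* Optional hook: a nonzero return vetoes the flag change. */
    int (*flag_changed)(struct toy_tracer *t, unsigned int mask, bool set);
};

/* Hook for latency tracers: refuse to leave overwrite mode while enabled. */
int toy_keep_overwrite(struct toy_tracer *t, unsigned int mask, bool set)
{
    if (t->enabled && (mask & TOY_FLAG_OVERWRITE) && !set)
        return -1;
    return 0;
}

/* Core path: consult the tracer before applying the change. */
int toy_set_flag(struct toy_tracer *t, unsigned int *flags,
                 unsigned int mask, bool set)
{
    if (t->flag_changed && t->flag_changed(t, mask, set) != 0)
        return -1; /* the tracer rejected the change */

    if (set)
        *flags |= mask;
    else
        *flags &= ~mask;
    return 0;
}
```

While the tracer is enabled, an attempt to clear the overwrite bit fails and the flags stay untouched. Once the tracer is disabled, the same call succeeds.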
-