Commit graph

11028 commits

Author SHA1 Message Date
Jens Axboe
18ce3751cc Properly notify block layer of sync writes
fsync_buffers_list() and sync_dirty_buffer() both issue async writes and
then immediately wait on them. Conceptually, that makes them sync writes
and we should treat them as such so that the IO schedulers can handle
them appropriately.

This patch fixes a write starvation issue that Lin Ming reported, where
xx is stuck for more than 2 minutes because of a large number of
synchronous IO in the system:

INFO: task kjournald:20558 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
kjournald     D ffff810010820978  6712 20558      2
ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
Call Trace:
[<ffffffff803ba6f2>] kobject_get+0x12/0x17
[<ffffffff80247537>] getnstimeofday+0x2f/0x83
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d195>] io_schedule+0x5d/0x9f
[<ffffffff8029c1e7>] sync_buffer+0x3b/0x3f
[<ffffffff8066d3f0>] __wait_on_bit+0x40/0x6f
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d48b>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff80243909>] wake_bit_function+0x0/0x23
[<ffffffff8029e3ad>] sync_dirty_buffer+0x98/0xcb
[<ffffffff8030056b>] journal_commit_transaction+0x97d/0xcb6
[<ffffffff8023a676>] lock_timer_base+0x26/0x4b
[<ffffffff8030300a>] kjournald+0xc1/0x1fb
[<ffffffff802438db>] autoremove_wake_function+0x0/0x2e
[<ffffffff80302f49>] kjournald+0x0/0x1fb
[<ffffffff802437bb>] kthread+0x47/0x74
[<ffffffff8022de51>] schedule_tail+0x28/0x5d
[<ffffffff8020cac8>] child_rip+0xa/0x12
[<ffffffff80243774>] kthread+0x0/0x74
[<ffffffff8020cabe>] child_rip+0x0/0x12

Lin Ming confirms that this patch fixes the issue. I've run tests with
it for the past week and no ill effects have been observed, so I'm
proposing it for inclusion into 2.6.26.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-07-01 09:07:34 +02:00
Michael Neuling
f3e909c275 powerpc: Update for VSX core file and ptrace
This correctly hooks the VSX dump into Roland McGrath core file
infrastructure.  It adds the VSX dump information as an additional elf
note in the core file (after talking more to the tool chain/gdb guys).
This also ensures the formats are consistent between signals, ptrace
and core files.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-07-01 14:47:09 +10:00
Adrian Bunk
9b4a8dd2e9 drivers/macintosh: Various cleanups
This contains the following cleanups:
- make the following needlessly global code static:
  - adb.c: adb_controller
  - adb.c: adb_init()
  - adbhid.c: adb_to_linux_keycodes[]  (also make it const)
  - via-pmu68k.c: backlight_level
  - via-pmu68k.c: backlight_enabled
- remove the following unused code:
  - via-pmu68k.c: sleep_notifier_list

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-07-01 11:28:06 +10:00
Linus Torvalds
e1441b9a41 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: fix locking in force-feedback core
  Input: add KEY_MEDIA_REPEAT definition
2008-06-30 08:58:09 -07:00
Bastien Nocera
4bbff7e408 Input: add KEY_MEDIA_REPEAT definition
This patch adds the Repeat key to the input layer. The usage
in the HUT is 0xBC (listed under "15.7 Transport Controls").

Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
2008-06-30 09:25:12 -04:00
Paul Mackerras
e9a4b6a3f6 Merge branch 'linux-2.6' 2008-06-30 10:16:50 +10:00
Linus Torvalds
0acbbee440 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
  dock: bay: Don't call acpi_walk_namespace() when ACPI is disabled.
  ACPI: don't walk tables if ACPI was disabled
  thermal: Create CONFIG_THERMAL_HWMON=n
2008-06-29 12:22:30 -07:00
Linus Torvalds
535e49f48e Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-fixes:
  kbuild: fix a.out.h export to userspace with O= build.
2008-06-29 12:21:56 -07:00
Linus Torvalds
a4480ac4f9 Merge branch 'audit.b52' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b52' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] remove useless argument type in audit_filter_user()
  [PATCH] audit: fix kernel-doc parameter notation
  [PATCH] kernel/audit.c: nlh->nlmsg_type is gotten more than once
2008-06-29 12:15:10 -07:00
Linus Torvalds
4f46accee4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  [patch 2/3] vfs: dcache cleanups
  [patch 1/3] vfs: dcache sparse fixes
  [patch 3/3] vfs: make d_path() consistent across mount operations
  [patch 4/4] flock: remove unused fields from file_lock_operations
  [patch 3/4] vfs: fix ERR_PTR abuse in generic_readlink
  [patch 2/4] fs: make struct file arg to d_path const
  [patch 1/4] vfs: path_{get,put}() cleanups
  [patch for 2.6.26 4/4] vfs: utimensat(): fix write access check for futimens()
  [patch for 2.6.26 3/4] vfs: utimensat(): fix error checking for {UTIME_NOW,UTIME_OMIT} case
  [patch for 2.6.26 1/4] vfs: utimensat(): ignore tv_sec if tv_nsec == UTIME_OMIT or UTIME_NOW
  [patch for 2.6.26 2/4] vfs: utimensat(): be consistent with utime() for immutable and append-only files
  [PATCH] fix cgroup-inflicted breakage in block_dev.c
2008-06-29 12:14:37 -07:00
Eli Cohen
251a4b320f net/inet_lro: remove setting skb->ip_summed when not LRO-able
When an SKB cannot be chained to a session, the current code attempts
to "restore" its ip_summed field from lro_mgr->ip_summed. However,
lro_mgr->ip_summed does not hold the original value; in fact, we'd
better not touch skb->ip_summed since it is not modified by the code
in the path leading to a failure to chain it.  Also use a cleaer
comment to the describe the ip_summed field of struct net_lro_mgr.

Issue raised by Or Gerlitz <ogerlitz@voltaire.com>

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 20:09:00 -07:00
Adrian Bunk
c88e6f51c2 include/linux/netdevice.h: don't export MAX_HEADER to userspace
Due to the CONFIG_'s the value is anyway not correct in userspace.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 19:54:54 -07:00
David Woodhouse
b660398101 kbuild: fix a.out.h export to userspace with O= build.
We need to check for existence of the a.out.h header in the source tree,
not the object tree, if we want it to get the right answer with O=.

Signed-off-by: David Woodhouse <david.woodhouse@intel.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2008-06-27 23:13:54 +02:00
Peter Zijlstra
2398f2c6d3 sched: update shares on wakeup
We found that the affine wakeup code needs rather accurate load figures
to be effective. The trouble is that updating the load figures is fairly
expensive with group scheduling. Therefore ratelimit the updating.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:45 +02:00
Peter Zijlstra
b6a86c746f sched: fix sched_domain aggregation
Keeping the aggregate on the first cpu of the sched domain has two problems:
 - it could collide between different sched domains on different cpus
 - it could slow things down because of the remote accesses

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:32 +02:00
Peter Zijlstra
c09595f63b sched: revert revert of: fair-group: SMP-nice for group scheduling
Try again..

Initial commit: 18d95a2832
Revert: 6363ca57c7

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27 14:31:29 +02:00
Steven Whitehouse
b2cad26cfc [GFS2] Remove obsolete conversion deadlock avoidance code
This is only used by GFS1 so can be removed.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-27 09:39:47 +01:00
Steven Whitehouse
1bdad60633 [GFS2] Remove remote lock dropping code
There are several reasons why this is undesirable:

 1. It never happens during normal operation anyway
 2. If it does happen it causes performance to be very, very poor
 3. It isn't likely to solve the original problem (memory shortage
    on remote DLM node) it was supposed to solve
 4. It uses a bunch of arbitrary constants which are unlikely to be
    correct for any particular situation and for which the tuning seems
    to be a black art.
 5. In an N node cluster, only 1/N of the dropped locked will actually
    contribute to solving the problem on average.

So all in all we are better off without it. This also makes merging
the lock_dlm module into GFS2 a bit easier.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-27 09:39:44 +01:00
Jens Axboe
15c8b6c1aa on_each_cpu(): kill unused 'retry' parameter
It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.

Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-06-26 11:24:38 +02:00
Jens Axboe
8691e5a8f6 smp_call_function: get rid of the unused nonatomic/retry argument
It's never used and the comments refer to nonatomic and retry
interchangably. So get rid of it.

Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-06-26 11:24:35 +02:00
Jens Axboe
3d44223327 Add generic helpers for arch IPI function calls
This adds kernel/smp.c which contains helpers for IPI function calls. In
addition to supporting the existing smp_call_function() in a more efficient
manner, it also adds a more scalable variant called smp_call_function_single()
for calling a given function on a single CPU only.

The core of this is based on the x86-64 patch from Nick Piggin, lots of
changes since then. "Alan D. Brunelle" <Alan.Brunelle@hp.com> has
contributed lots of fixes and suggestions as well. Also thanks to
Paul E. McKenney <paulmck@linux.vnet.ibm.com> for reviewing RCU usage
and getting rid of the data allocation fallback deadlock.

Acked-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-06-26 11:21:34 +02:00
Ingo Molnar
9a13150109 Merge commit 'v2.6.26-rc8' into core/rcu 2008-06-26 09:24:23 +02:00
Rene Herman
16d7523973 thermal: Create CONFIG_THERMAL_HWMON=n
A bug in libsensors <= 2.10.6 is exposed
when this new hwmon I/F is enabled.
Create CONFIG_THERMAL_HWMON=n
until some time after libsensors 2.10.7 ships
so those users can run the latest kernel.

libsensors 3.x is already fixed -- those users
can use CONFIG_THERMAL_HWMON=y now.

Signed-off-by: Rene Herman <rene.herman@gmail.com>
Acked-by: Mark M. Hoffman <mhoffman@lightlink.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-06-25 19:25:42 -04:00
Ingo Molnar
cbd6712406 Merge branch 'linus' into x86/irq 2008-06-25 12:30:49 +02:00
Ingo Molnar
28f73e51d0 Merge branch 'linus' into x86/delay
Conflicts:

	arch/x86/kernel/tsc_32.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-25 12:30:10 +02:00
Ingo Molnar
97e6722b8d Merge branch 'linus' into tracing/ftrace 2008-06-25 12:27:56 +02:00
Ingo Molnar
773dc8eaca Merge branch 'linus' into sched/new-API-sched_setscheduler 2008-06-25 12:27:05 +02:00
Ingo Molnar
f57aec5a87 Merge branch 'linus' into sched/devel
Conflicts:

	kernel/sched_rt.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-25 12:26:59 +02:00
Ingo Molnar
ace7f1b796 Merge branch 'linus' into core/softirq 2008-06-25 12:23:59 +02:00
Ingo Molnar
d02859ecb3 Merge commit 'v2.6.26-rc8' into x86/xen
Conflicts:

	arch/x86/xen/enlighten.c
	arch/x86/xen/mmu.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-25 12:16:51 +02:00
Peng Haitao
d8de72473e [PATCH] remove useless argument type in audit_filter_user()
The second argument "type" is not used in audit_filter_user(), so I think that type can be removed. If I'm wrong, please tell me.

Signed-off-by: Peng Haitao <penght@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-24 23:36:35 -04:00
Alok Kataria
f3f3149f35 x86: use cpu_khz for loops_per_jiffy calculation, cleanup
As suggested by Ingo, remove all references to tsc from init/calibrate.c

TSC is x86 specific, and using tsc in variable names in a generic file should
be avoided. lpj_tsc is now called lpj_fine, since it is related to fine tuning
of lpj value. Also tsc_rate_*  is called timer_rate_*

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Daniel Hecht <dhecht@vmware.com>
Cc: Tim Mann <mann@vmware.com>
Cc: Zach Amsden <zach@vmware.com>
Cc: Sahil Rihan <srihan@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-24 13:53:46 +02:00
Li Zefan
a033c332e0 lockdep: remove duplicate definition of STATIC_LOCKDEP_MAP_INIT
STATIC_LOCKDEP_MAP_INIT is defined twice in lockdep.h. I guess
it's a copy & paste.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-24 13:43:07 +02:00
Marcelo Tosatti
06e0564566 KVM: close timer injection race window in __vcpu_run
If a timer fires after kvm_inject_pending_timer_irqs() but before
local_irq_disable() the code will enter guest mode and only inject such
timer interrupt the next time an unrelated event causes an exit.

It would be simpler if the timer->pending irq conversion could be done
with IRQ's disabled, so that the above problem cannot happen.

For now introduce a new vcpu requests bit to cancel guest entry.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-06-24 12:16:59 +03:00
Rusty Russell
961ccddd59 sched: add new API sched_setscheduler_nocheck: add a flag to control access checks
Hidehiro Kawai noticed that sched_setscheduler() can fail in
stop_machine: it calls sched_setscheduler() from insmod, which can
have CAP_SYS_MODULE without CAP_SYS_NICE.

Two cases could have failed, so are changed to sched_setscheduler_nocheck:
  kernel/softirq.c:cpu_callback()
	- CPU hotplug callback
  kernel/stop_machine.c:__stop_machine_run()
	- Called from various places, including modprobe()

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23 22:57:56 +02:00
Alok Kataria
3da757daf8 x86: use cpu_khz for loops_per_jiffy calculation
On the x86 platform we can use the value of tsc_khz computed during tsc
calibration to calculate the loops_per_jiffy value. Its very important
to keep the error in lpj values to minimum as any error in that may
result in kernel panic in check_timer. In virtualization environment, On
a highly overloaded host the guest delay calibration may sometimes
result in errors beyond the ~50% that timer_irq_works can handle,
resulting in the guest panicking.

Does some formating changes to lpj_setup code to now have a single
printk to print the bogomips value.

We do this only for the boot processor because the AP's can have
different base frequencies or the BIOS might boot a AP at a different
frequency.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Daniel Hecht <dhecht@vmware.com>
Cc: Tim Mann <mann@vmware.com>
Cc: Zach Amsden <zach@vmware.com>
Cc: Sahil Rihan <srihan@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23 22:51:33 +02:00
Abhishek Sagar
ecea656d1d ftrace: freeze kprobe'd records
Let records identified as being kprobe'd be marked as "frozen". The trouble
with records which have a kprobe installed on their mcount call-site is
that they don't get updated. So if such a function which is currently being
traced gets its tracing disabled due to a new filter rule (or because it
was added to the notrace list) then it won't be updated and continue being
traced. This patch allows scanning of all frozen records during tracing to
check if they should be traced.

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23 22:10:58 +02:00
Abhishek Sagar
785656a41f kprobes: enable clean usage of get_kprobe
Allow clean use of get_kprobe() outside of core kprobe code. Ftrace makes use
of get_kprobe to identify probes installed on mcount call-sites.

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: jkenisto@us.ibm.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23 22:10:57 +02:00
Abhishek Sagar
395a59d0f8 ftrace: store mcount address in rec->ip
Record the address of the mcount call-site. Currently all archs except sparc64
record the address of the instruction following the mcount call-site. Some
general cleanups are entailed. Storing mcount addresses in rec->ip enables
looking them up in the kprobe hash table later on to check if they're kprobe'd.

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: davem@davemloft.net
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23 22:10:56 +02:00
Alan Cox
36c7343b4e tty_driver: Update required method documentation
Some of the requirement rules are now more relaxed. Also correct a
contradiction in the previous update

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-23 10:36:47 -07:00
Denis V. Lunev
f9f48ec72b [patch 4/4] flock: remove unused fields from file_lock_operations
fl_insert and fl_remove are not used right now in the kernel. Remove them.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 11:52:30 -04:00
Jan Engelhardt
20d4fdc1a7 [patch 2/4] fs: make struct file arg to d_path const
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 11:52:30 -04:00
Ingo Molnar
1de8644cc7 Merge branch 'linus' into sched/devel 2008-06-23 11:30:23 +02:00
Ingo Molnar
1e74f9cbbb Merge branch 'linus' into core/rcu 2008-06-23 11:29:11 +02:00
Ingo Molnar
f34bfb1bee Merge branch 'linus' into tracing/ftrace 2008-06-23 11:11:42 +02:00
Ingo Molnar
a60b33cf59 Merge branch 'linus' into core/softirq 2008-06-23 10:52:59 +02:00
Bernhard Walle
71c2742f5e Add return value to reserve_bootmem_node()
This patch changes the function reserve_bootmem_node() from void to int,
returning -ENOMEM if the allocation fails.

This fixes a build problem on x86 with CONFIG_KEXEC=y and
CONFIG_NEED_MULTIPLE_NODES=y

Signed-off-by: Bernhard Walle <bwalle@suse.de>
Reported-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-21 11:25:10 -07:00
Jonathan Corbet
0b28067688 Add cycle_kernel_lock()
A number of driver functions are so obviously trivial that they do not need
the big kernel lock - at least not overtly.  It turns out that the
acquisition of the BKL in driver open() functions can perform a sort of
poor-hacker's serialization function, delaying the open operation until the
driver is certain to have completed its initialization.  Add a simple
cycle_kernel_lock() function for these cases to make it clear that there is
no need to *hold* the BKL, just to be sure that we can acquire it.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2008-06-20 14:05:53 -06:00
Huang, Ying
443cd507ce lockdep: add lock_class information to lock_chain and output it
This patch records array of lock_class into lock_chain, and export
lock_chain information via /proc/lockdep_chains.

It is based on x86/master branch of git-x86 tree, and has been tested
on x86_64 platform.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-20 12:21:33 +02:00
Linus Torvalds
3506ba7b08 Merge branch 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6
* 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6:
  agp/intel: cleanup some serious whitespace badness
  [AGP] intel_agp: Add support for Intel 4 series chipsets
  [AGP] intel_agp: extra stolen mem size available for IGD_GM chipset
  agp: more boolean conversions.
  drivers/char/agp - use bool
  agp: two-stage page destruction issue
  agp/via: fixup pci ids
2008-06-18 21:52:35 -07:00
Dave Airlie
9516b030b4 agp: more boolean conversions.
Signed-off-by: Dave Airlie <airlied@redhat.com>
2008-06-19 10:42:17 +10:00
Joe Perches
c725801292 drivers/char/agp - use bool
Use boolean in AGP instead of having own TRUE/FALSE

--
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2008-06-19 10:04:20 +10:00
Linus Torvalds
d83b14c0db Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (43 commits)
  netlink: genl: fix circular locking
  Revert "mac80211: Use skb_header_cloned() on TX path."
  af_unix: fix 'poll for write'/ connected DGRAM sockets
  tun: Proper handling of IPv6 header in tun driver when TUN_NO_PI is set
  atl1: relax eeprom mac address error check
  net/enc28j60: low power mode
  net/enc28j60: section fix
  sky2: 88E8040T pci device id
  netxen: download firmware in pci probe
  netxen: cleanup debug messages
  netxen: remove global physical_port array
  netxen: fix portnum for hp mezz cards
  ibm_newemac: select CRC32 in Kconfig
  xfrm: fix fragmentation for ipv4 xfrm tunnel
  netfilter: nf_conntrack_h323: fix module unload crash
  netfilter: nf_conntrack_h323: fix memory leak in module initialization error path
  netfilter: nf_nat: fix RCU races
  atm: [he] send idle cells instead of unassigned when in SDH mode
  atm: [he] limit queries to the device's register space
  atm: [br2864] fix routed vcmux support
  ...
2008-06-18 11:48:40 -07:00
Jiri Slaby
e17ba73b0e x86, generic: mark early_printk as asmlinkage
It's not explicitly marked as asmlinkage, but invoked from x86_32
startup code with parameters on stack.

No other architectures define early_printk and none of them are affected
by this change, since defines asmlinkage as empty token.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-18 13:11:01 +02:00
YOSHIFUJI Hideaki
2b4743bd6b ipv6 sit: Avoid extra need for compat layer in PRL management.
We've introduced extra need of compat layer for ip_tunnel_prl{}
for PRL (Potential Router List) management.  Though compat_ioctl
is still missing in ipv4/ipv6, let's make the interface more
straight-forward and eliminate extra need for nasty compat layer
anyway since the interface is new for 2.6.26.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-16 16:48:20 -07:00
David Woodhouse
e53d6a1527 Export <linux/a.out.h> to userspace again.
This seems to have been removed accidentally in commit
ed7b1889da ("Unexport asm/page.h"), but
wasn't supposed to have been -- the original patch at
http://lkml.org/lkml/2007/10/30/144 just moved it from $(header-y) to
$(unifdef-y)

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-16 10:20:58 -07:00
David Woodhouse
9a8ea36967 Remove #ifdef CONFIG_ARCH_SUPPORTS_AOUT from <linux/a.out.h>
This file is only included where it makes sense now, so there's no need
for the CONFIG_ARCH_SUPPORTS_AOUT conditional -- and that conditional is
bad, because we want to export <linux/a.out.h> to userspace.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-16 10:20:58 -07:00
Ingo Molnar
d939d2851f Merge branch 'linus' into x86/irq 2008-06-16 11:27:45 +02:00
Ingo Molnar
9583f3d9c0 Merge branch 'linus' into core/softirq 2008-06-16 11:24:17 +02:00
Ingo Molnar
766d02786e Merge branch 'linus' into core/rcu 2008-06-16 11:23:36 +02:00
Ingo Molnar
688d22e23a Merge branch 'linus' into x86/xen 2008-06-16 11:21:27 +02:00
Ingo Molnar
e765ee90da Merge branch 'linus' into tracing/ftrace 2008-06-16 11:15:58 +02:00
Ingo Molnar
f9e8e07e07 Merge branch 'linus' into sched-devel 2008-06-16 11:15:21 +02:00
Linus Torvalds
0269c5c6d9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
  PCI: fixup write combine comment in pci_mmap_resource
  x86: PAT export resource_wc in pci sysfs
  x86, pci-dma.c: don't always add __GFP_NORETRY to gfp
  suspend-vs-iommu: prevent suspend if we could not resume
  x86: pci-dma.c: use __GFP_NO_OOM instead of __GFP_NORETRY
  pci, x86: add workaround for bug in ASUS A7V600 BIOS (rev 1005)
  PCI: use dev_to_node in pci_call_probe
  PCI: Correct last two HP entries in the bfsort whitelist
2008-06-14 13:32:56 -07:00
Linus Torvalds
51558576ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  tcp: Revert 'process defer accept as established' changes.
  ipv6: Fix duplicate initialization of rawv6_prot.destroy
  bnx2x: Updating the Maintainer
  net: Eliminate flush_scheduled_work() calls while RTNL is held.
  drivers/net/r6040.c: correct bad use of round_jiffies()
  fec_mpc52xx: MPC52xx_MESSAGES_DEFAULT: 2nd NETIF_MSG_IFDOWN => IFUP
  ipg: fix receivemode IPG_RM_RECEIVEMULTICAST{,HASH} in ipg_nic_set_multicast_list()
  netfilter: nf_conntrack: fix ctnetlink related crash in nf_nat_setup_info()
  netfilter: Make nflog quiet when no one listen in userspace.
  ipv6: Fail with appropriate error code when setting not-applicable sockopt.
  ipv6: Check IPV6_MULTICAST_LOOP option value.
  ipv6: Check the hop limit setting in ancillary data.
  ipv6 route: Fix route lifetime in netlink message.
  ipv6 mcast: Check address family of gf_group in getsockopt(MS_FILTER).
  dccp: Bug in initial acknowledgment number assignment
  dccp ccid-3: X truncated due to type conversion
  dccp ccid-3: TFRC reverse-lookup Bug-Fix
  dccp ccid-2: Bug-Fix - Ack Vectors need to be ignored on request sockets
  dccp: Fix sparse warnings
  dccp ccid-3: Bug-Fix - Zero RTT is possible
2008-06-13 07:34:47 -07:00
Ben Hutchings
c50cbb05a0 cpu topology: always define CPU topology information
This can result in an empty topology directory in sysfs, and requires
in-kernel users to protect all uses with #ifdef - see
<http://marc.info/?l=linux-netdev&m=120639033904472&w=2>.

The documentation of CPU topology specifies what the defaults should be if
only partial information is available from the hardware.  So we can
provide these defaults as a fallback.

This patch:

- Adds default definitions of the 4 topology macros to <linux/topology.h>
- Changes drivers/base/topology.c to use the topology macros unconditionally
  and to cope with definitions that aren't lvalues
- Updates documentation accordingly

[ From: Andrew Morton <akpm@linux-foundation.org>
  - fold now-duplicated code
  - fix layout
]

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: John Hawkes <hawkes@sgi.com>
Cc: Zhang, Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-13 10:09:46 +02:00
Dave Hansen
2165009bdf pagemap: pass mm into pagewalkers
We need this at least for huge page detection for now, because powerpc
needs the vm_area_struct to be able to determine whether a virtual address
is referring to a huge page (its pmd_huge() doesn't work).

It might also come in handy for some of the other users.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12 18:05:41 -07:00
Mike Miller
24aac480e7 cciss: add new hardware support
Add support for the next generation of HP Smart Array SAS/SATA
controllers.  Shipping date is late Fall 2008.

Bump the driver version to 3.6.20 to reflect the new hardware support from
patch 1 of this set.

Signed-off-by: Mike Miller <mike.miller@hp.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12 18:05:40 -07:00
Ben Nizette
57d3c64fd8 proc_fs.h: move struct mm_struct forward-declaration
Move the forward-declaration of struct mm_struct a little way up
proc_fs.h.  This fixes a bunch of "'struct mm_struct' declared inside
parameter list" warnings with CONFIG_PROC_FS=n

Signed-off-by: Ben Nizette <bn@niasdigital.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12 18:05:40 -07:00
David S. Miller
ec0a196626 tcp: Revert 'process defer accept as established' changes.
This reverts two changesets, ec3c0982a2
("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
the follow-on bug fix 9ae27e0adb
("tcp: Fix slab corruption with ipv6 and tcp6fuzz").

This change causes several problems, first reported by Ingo Molnar
as a distcc-over-loopback regression where connections were getting
stuck.

Ilpo Järvinen first spotted the locking problems.  The new function
added by this code, tcp_defer_accept_check(), only has the
child socket locked, yet it is modifying state of the parent
listening socket.

Fixing that is non-trivial at best, because we can't simply just grab
the parent listening socket lock at this point, because it would
create an ABBA deadlock.  The normal ordering is parent listening
socket --> child socket, but this code path would require the
reverse lock ordering.

Next is a problem noticed by Vitaliy Gusev, he noted:

----------------------------------------
>--- a/net/ipv4/tcp_timer.c
>+++ b/net/ipv4/tcp_timer.c
>@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
> 		goto death;
> 	}
>
>+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
>+		tcp_send_active_reset(sk, GFP_ATOMIC);
>+		goto death;

Here socket sk is not attached to listening socket's request queue. tcp_done()
will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
release this sk) as socket is not DEAD. Therefore socket sk will be lost for
freeing.
----------------------------------------

Finally, Alexey Kuznetsov argues that there might not even be any
real value or advantage to these new semantics even if we fix all
of the bugs:

----------------------------------------
Hiding from accept() sockets with only out-of-order data only
is the only thing which is impossible with old approach. Is this really
so valuable? My opinion: no, this is nothing but a new loophole
to consume memory without control.
----------------------------------------

So revert this thing for now.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-12 16:34:35 -07:00
Jesse Barnes
883eed1b3e Merge branch 'pci-for-jesse' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip into for-linus 2008-06-12 13:51:05 -07:00
Linus Torvalds
1b3cba8e60 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: 64-bit: fix arithmetics overflow
  sched: fair group: fix overflow(was: fix divide by zero)
  sched: fix TASK_WAKEKILL vs SIGKILL race
2008-06-12 12:55:18 -07:00
Linus Torvalds
dc10885d68 Merge branch 'core/iter-div' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core/iter-div' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  always_inline timespec_add_ns
  add an inlined version of iter_div_u64_rem
  common implementation of iterative div/mod
2008-06-12 07:47:44 -07:00
Jeremy Fitzhardinge
9412e28649 always_inline timespec_add_ns
timespec_add_ns is used from the x86-64 vdso, which cannot call out to
other kernel code.  Make sure that timespec_add_ns is always inlined
(and only uses always_inlined functions) to make sure there are no
unexpected calls.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12 10:48:00 +02:00
Jeremy Fitzhardinge
d5e181f78a add an inlined version of iter_div_u64_rem
iter_div_u64_rem is used in the x86-64 vdso, which cannot call other
kernel code.  For this case, provide the always_inlined version,
__iter_div_u64_rem.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12 10:47:58 +02:00
Jeremy Fitzhardinge
f595ec964d common implementation of iterative div/mod
We have a few instances of the open-coded iterative div/mod loop, used
when we don't expcet the dividend to be much bigger than the divisor.
Unfortunately modern gcc's have the tendency to strength "reduce" this
into a full mod operation, which isn't necessarily any faster, and
even if it were, doesn't exist if gcc implements it in libgcc.

The workaround is to put a dummy asm statement in the loop to prevent
gcc from performing the transformation.

This patch creates a single implementation of this loop, and uses it
to replace the open-coded versions I know about.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Robert Hancock <hancockr@shaw.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12 10:47:56 +02:00
venkatesh.pallipadi@intel.com
45aec1ae72 x86: PAT export resource_wc in pci sysfs
For the ranges with IORESOURCE_PREFETCH, export a new resource_wc interface in
pci /sysfs along with resource (which is uncached).

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-12 10:12:42 +02:00
Linus Torvalds
da50ccc6a0 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (23 commits)
  ACPICA: fix stray va_end() caused by mis-merge
  ACPI: Reject below-freezing temperatures as invalid critical temperatures
  ACPICA: Fix for access to deleted object <regression>
  ACPICA: Fix to make _SST method optional
  ACPICA: Fix for Load operator, load table at the namespace root
  ACPICA: Ignore ACPI table signature for Load() operator
  ACPICA: Fix to allow zero-length ASL field declarations
  ACPI: use memory_read_from_buffer()
  bay: exit if notify handler cannot be installed
  dock.c remove trailing printk whitespace
  proper prototype for acpi_processor_tstate_has_changed()
  ACPI: handle invalid ACPI SLIT table
  PNPACPI: use _CRS IRQ descriptor length for _SRS
  pnpacpi: fix shareable IRQ encode/decode
  pnpacpi: fix IRQ flag decoding
  MAINTAINERS: update ACPI homepage
  ACPI 2.6.26-rc2: Add missing newline to DSDT/SSDT warning message
  ACPI: EC: Use msleep instead of udelay while waiting for event.
  thinkpad-acpi: fix LED handling on older ThinkPads
  thinkpad-acpi: fix initialization error paths
  ...
2008-06-11 17:16:32 -07:00
Bjorn Helgaas
e9fe9e1881 pnpacpi: fix IRQ flag decoding
When decoding IRQ trigger mode and polarity, it is not enough to mask by
IORESOURCE_BITS because there are now additional bits defined.  For
example, if IORESOURCE_IRQ_SHAREABLE was set, we failed to set *triggering
and *polarity at all.

I can't point to a failure that this patch fixes, but
bugs in this area have caused problems when resuming after
suspend, for example:

    http://bugzilla.kernel.org/show_bug.cgi?id=6316
    http://bugzilla.kernel.org/show_bug.cgi?id=9487
    https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/152187

This is based on a patch by Tom Jaeger:
    http://bugzilla.kernel.org/show_bug.cgi?id=9487#c32

[rene.herman@keyaccess.nl: fix comment]
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-06-11 19:13:46 -04:00
Venkatesh Pallipadi
dcb84f335b cpuidle acpi driver: fix oops on AC<->DC
cpuidle and acpi driver interaction bug with the way cpuidle_register_driver()
is called. Due to this bug, there will be oops on
AC<->DC on some systems, where they support C-states in one DC and not in AC.

The current code does
ON BOOT:
	Look at CST and other C-state info to see whether more than C1 is
	supported. If it is, then acpi processor_idle does a
	cpuidle_register_driver() call, which internally enables the device.

ON CST change notification (AC<->DC) and on suspend-resume:
	acpi driver temporarily disables device, updates the device with
	any new C-states, and reenables the device.

The problem is is on boot, there are no C2, C3 states supported and we skip
the register. Later on AC<->DC, we may get a CST notification and we try
to reevaluate CST and enabled the device, without actually registering it.
This causes breakage as we try to create /sys fs sub directory, without the
parent directory which is created at register time.

Thanks to Sanjeev for reporting the problem here.
http://bugzilla.kernel.org/show_bug.cgi?id=10394

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-06-11 19:13:45 -04:00
Linus Torvalds
a4df1ac12d Merge branch 'kvm-updates-2.6.26' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm
* 'kvm-updates-2.6.26' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm:
  KVM: MMU: Fix is_empty_shadow_page() check
  KVM: MMU: Fix printk() format string
  KVM: IOAPIC: only set remote_irr if interrupt was injected
  KVM: MMU: reschedule during shadow teardown
  KVM: VMX: Clear CR4.VMXE in hardware_disable
  KVM: migrate PIT timer
  KVM: ppc: Report bad GFNs
  KVM: ppc: Use a read lock around MMU operations, and release it on error
  KVM: ppc: Remove unmatched kunmap() call
  KVM: ppc: add lwzx/stwz emulation
  KVM: ppc: Remove duplicate function
  KVM: s390: Fix race condition in kvm_s390_handle_wait
  KVM: s390: Send program check on access error
  KVM: s390: fix interrupt delivery
  KVM: s390: handle machine checks when guest is running
  KVM: s390: fix locking order problem in enable_sie
  KVM: s390: use yield instead of schedule to implement diag 0x44
  KVM: x86 emulator: fix hypercall return value on AMD
  KVM: ia64: fix zero extending for mmio ld1/2/4 emulation in KVM
2008-06-11 10:35:44 -07:00
Linus Torvalds
f7f866eed0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (42 commits)
  net: Fix routing tables with id > 255 for legacy software
  sky2: Hold RTNL while calling dev_close()
  s2io iomem annotations
  atl1: fix suspend regression
  qeth: start dev queue after tx drop error
  qeth: Prepare-function to call s390dbf was wrong
  qeth: reduce number of kernel messages
  qeth: Use ccw_device_get_id().
  qeth: layer 3 Oops in ip event handler
  virtio: use callback on empty in virtio_net
  virtio: virtio_net free transmit skbs in a timer
  virtio: Fix typo in virtio_net_hdr comments
  virtio_net: Fix skb->csum_start computation
  ehea: set mac address fix
  sfc: Recover from RX queue flush failure
  add missing lance_* exports
  ixgbe: fix typo
  forcedeth: msi interrupts
  ipsec: pfkey should ignore events when no listeners
  pppoe: Unshare skb before anything else
  ...
2008-06-11 08:39:51 -07:00
David S. Miller
513fd370e6 Merge branch 'davem-fixes' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6 2008-06-10 16:21:55 -07:00
Krzysztof Piotr Oledzki
709772e6e0 net: Fix routing tables with id > 255 for legacy software
Most legacy software do not like tables > 255 as rtm_table is u8
so tb_id is sent &0xff and it is possible to mismatch for example
table 510 with table 254 (main).

This patch introduces RT_TABLE_COMPAT=252 so the code uses it if
tb_id > 255. It makes such old applications happy, new
ones are still able to use RTA_TABLE to get a proper table id.

Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-10 15:44:49 -07:00
Mark McLoughlin
2506ece0c0 virtio: Fix typo in virtio_net_hdr comments
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-10 18:20:30 -04:00
Arnaldo Carvalho de Melo
ce4a7d0d48 inet{6}_request_sock: Init ->opt and ->pktopts in the constructor
Wei Yongjun noticed that we may call reqsk_free on request sock objects where
the opt fields may not be initialized, fix it by introducing inet_reqsk_alloc
where we initialize ->opt to NULL and set ->pktopts to NULL in
inet6_reqsk_alloc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-10 12:39:35 -07:00
Adrian Bunk
b76916462d ide: remove the ide_etrax100 chipset type
I forgot to remove the ide_etrax100 chipset type when removing the
ETRAX_IDE driver.

Reported-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-06-10 20:56:36 +02:00
David Rientjes
9985b0bab3 sched: prevent bound kthreads from changing cpus_allowed
Kthreads that have called kthread_bind() are bound to specific cpus, so
other tasks should not be able to change their cpus_allowed from under
them.  Otherwise, it is possible to move kthreads, such as the migration
or software watchdog threads, so they are not allowed access to the cpu
they work on.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 12:26:16 +02:00
Abhishek Sagar
0eb967012e ftrace: prevent freeing of all failed updates
Prevent freeing of records which cause problems and correspond to function from
core kernel text. A new flag, FTRACE_FL_CONVERTED is used to mark a record
as "converted". All other records are patched lazily to NOPs. Failed records
now also remain on frace_hash table. Each invocation of ftrace_record_ip now
checks whether the traced function has ever been recorded (including past
failures) and doesn't re-record it again.

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 11:56:57 +02:00
Oleg Nesterov
16882c1e96 sched: fix TASK_WAKEKILL vs SIGKILL race
schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case,
this allows us to do

	current->state = TASK_INTERRUPTIBLE;
	schedule();

without fear to sleep with pending signal.

However, the code like

	current->state = TASK_KILLABLE;
	schedule();

is not right, schedule() doesn't take TASK_WAKEKILL into account. This means
that mutex_lock_killable(), wait_for_completion_killable(), down_killable(),
schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has
no effect).

Introduce the new helper, signal_pending_state(), and change schedule() to
use it. Hopefully it will have more users, that is why the task's state is
passed separately.

Note this "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
This is needed to preserve the current behaviour (ptrace_notify). I hope
this check will be removed soon, but this (afaics good) change needs the
separate discussion.

The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)",
basically the same that schedule() does now. However, this patch of course
bloats schedule().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 11:37:25 +02:00
Yinghai Lu
cc1a9d86ce mm, x86: shrink_active_range() should check all
Now we are using register_e820_active_regions() instead of
add_active_range() directly. So end_pfn could be different between the
value in early_node_map to node_end_pfn.

So we need to make shrink_active_range() smarter.

shrink_active_range() is a generic MM function in mm/page_alloc.c but
it is only used on 32-bit x86. Should we move it back to some file in
arch/x86?

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 11:31:44 +02:00
Adrian Bunk
585c5434f0 include/linux/ssb/ssb_driver_gige.h typo fix
This patch fixes a typo in the name of a config variable.

Reported-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Reviewed-by: Michael Buesch <mb@bu3sch.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-06-09 15:53:37 -04:00
Linus Torvalds
64a3dcd5d3 Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
  [POWERPC] ehea: Remove dependency on MEMORY_HOTPLUG
  [POWERPC] Make walk_memory_resource available with MEMORY_HOTPLUG=n
  [POWERPC] Use dev_set_name in pci_64.c
  [POWERPC] Fix incorrect enabling of VMX when building signal or user context
  [POWERPC] boot/Makefile CONFIG_ variable fixes
2008-06-09 10:23:29 -07:00
Russ Anderson
dfa7e20cc0 mm: Minor clean-up of page flags in mm/page_alloc.c
Minor source code cleanup of page flags in mm/page_alloc.c.
Move the definition of the groups of bits to page-flags.h.

The purpose of this clean up is that the next patch will
conditionally add a page flag to the groups.  Doing that
in a header file is cleaner than adding #ifdefs to the
C code.

Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-09 10:22:24 -07:00
Paul Mackerras
8a3e1c670e Merge branch 'merge'
Conflicts:

	arch/powerpc/sysdev/fsl_soc.c
2008-06-09 12:19:41 +10:00
Nathan Lynch
0d5799449f [POWERPC] Make walk_memory_resource available with MEMORY_HOTPLUG=n
The ehea driver was recently changed[1] to use walk_memory_resource() to
detect the system's memory layout.  However, walk_memory_resource() is
available only when memory hotplug is enabled.  So CONFIG_EHEA was
made to depend on MEMORY_HOTPLUG [2], but it is inappropriate for a
network driver to have such a dependency.

Make the declaration of walk_memory_resource() and its powerpc
implementation (ehea is powerpc-specific) unconditionally available.

[1] 48cfb14f8b
    "ehea: Add DLPAR memory remove support"

[2] fb7b6ca2b6
    "ehea: Add dependency to Kconfig"

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-06-09 11:32:41 +10:00
Adrian Bunk
f751aa125d fat_valid_media() isn't for userspace
Commit 73f20e58b1 ("FAT_VALID_MEDIA():
remove pointless test") wrongly added the new fat_valid_media() function
to the userspace-visible part of include/linux/msdos_fs.h

Move it to the part of include/linux/msdos_fs.h that is not exported to
userspace.

Reported-by: Onur Küçük <onur@pardus.org.tr>
Reported-by: S.Çağlar Onur <caglar@pardus.org.tr>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-08 11:58:43 -07:00
Linus Torvalds
5f0e62c3e1 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: enable barriers by default
  jbd2: Fix barrier fallback code to re-lock the buffer head
  ext4: Display the journal_async_commit mount option in /proc/mounts
  jbd2: If a journal checksum error is detected, propagate the error to ext4
  jbd2: Fix memory leak when verifying checksums in the journal
  ext4: fix online resize bug
  ext4: Fix uninit block group initialization with FLEX_BG
  ext4: Fix use of uninitialized data with debug enabled.
2008-06-06 15:30:53 -07:00
Linus Torvalds
156a9ea43a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6:
  capabilities: remain source compatible with 32-bit raw legacy capability support.
  LSM: remove stale web site from MAINTAINERS
2008-06-06 11:31:55 -07:00
Jeff Layton
979b0fea2d vm: add kzalloc_node() inline
To get zeroed out memory from a particular NUMA node.  To be used by
sunrpc.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:14 -07:00
Nadia Derbey
68aa0a206a ipc: restore MSGPOOL original value
When posting:

	[PATCH 1/8] Scaling msgmni to the amount of lowmem

(see http://article.gmane.org/gmane.linux.kernel/637849/) I changed the
MSGPOOL value to make it fit what is said in the man pages (i.e.  a size
in bytes).

But Michael Kerrisk rightly complained that this change could affect the
ABI.  So I'm posting this patch to make MSGPOOL expressed back in Kbytes.
Michael, on his side, has fixed the man page.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Pierre Peiffer <peifferp@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:12 -07:00
Akinobu Mita
93b071139a introduce memory_read_from_buffer()
This patch introduces memory_read_from_buffer().

The only difference between memory_read_from_buffer() and
simple_read_from_buffer() is which address space the function copies to.

simple_read_from_buffer copies to user space memory.
memory_read_from_buffer copies to normal memory.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Doug Warzecha <Douglas_Warzecha@dell.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Matt Domsch <Matt_Domsch@dell.com>
Cc: Abhay Salunke <Abhay_Salunke@dell.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Markus Rechberger <markus.rechberger@amd.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Bob Moore <robert.moore@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Len Brown <lenb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
Cc: Michael Holzheu <holzheu@de.ibm.com>
Cc: Brian King <brking@us.ibm.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Andrew Vasquez <linux-driver@qlogic.com>
Cc: Seokmann Ju <seokmann.ju@qlogic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:11 -07:00
Harvey Harrison
3527fb326f lib: export bitrev16
Bluetooth will be able to use this.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Dave Young <hidave.darkstar@gmail.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:10 -07:00
David Woodhouse
44d1b980c7 Fix various old email addresses for dwmw2
Although if people have questions about ARCnet, perhaps it's _better_
for them to be mailing dwmw2@cam.ac.uk about it...

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:10 -07:00
Marcelo Tosatti
2f5997140f KVM: migrate PIT timer
Migrate the PIT timer to the physical CPU which vcpu0 is scheduled on,
similarly to what is done for the LAPIC timers, otherwise PIT interrupts
will be delayed until an unrelated event causes an exit.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-06-06 21:25:51 +03:00
Gregory Haskins
1f11eb6a8b sched: fix cpupri hotplug support
The RT folks over at RedHat found an issue w.r.t. hotplug support which
was traced to problems with the cpupri infrastructure in the scheduler:

https://bugzilla.redhat.com/show_bug.cgi?id=449676

This bug affects 23-rt12+, 24-rtX, 25-rtX, and sched-devel.  This patch
applies to 25.4-rt4, though it should trivially apply to most cpupri enabled
kernels mentioned above.

It turned out that the issue was that offline cpus could get inadvertently
registered with cpupri so that they were erroneously selected during
migration decisions.  The end result would be an OOPS as the offline cpu
had tasks routed to it.

This patch generalizes the old join/leave domain interface into an
online/offline interface, and adjusts the root-domain/hotplug code to
utilize it.

I was able to easily reproduce the issue prior to this patch, and am no
longer able to reproduce it after this patch.  I can offline cpus
indefinately and everything seems to be in working order.

Thanks to Arnaldo (acme), Thomas, and Peter for doing the legwork to point
me in the right direction.  Also thank you to Peter for reviewing the
early iterations of this patch.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:42 +02:00
Richard Kennedy
c7aceaba04 sched: reorder task_struct to reduce padding on 64bit builds
This patch removes 24 bytes of padding and allows 1 extra object per
slab on my fedora based config.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:35 +02:00
Ingo Molnar
554ec22f07 namespacecheck: more sched.c fixes
[ Stephen Rothwell <sfr@canb.auug.org.au>: build fix ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:33 +02:00
Max Krasnyansky
1840475676 genirq: Expose default irq affinity mask (take 3)
Current IRQ affinity interface does not provide a way to set affinity
for the IRQs that will be allocated/activated in the future.
This patch creates /proc/irq/default_smp_affinity that lets users set
default affinity mask for the newly allocated IRQs. Changing the default
does not affect affinity masks for the currently active IRQs, they
have to be changed explicitly.

Updated based on Paul J's comments and added some more documentation.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: a.p.zijlstra@chello.nl
Cc: tglx@linutronix.de
Cc: rdunlap@xenotime.net
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-05 15:18:30 +02:00
David Woodhouse
39028ec69b V4L/DVB (7166): [v4l] Add new user class controls and deprecate others
These were removed in commit 26d507fcfe:

> -#define V4L2_CID_HCENTER               (V4L2_CID_BASE+22)
> -#define V4L2_CID_VCENTER               (V4L2_CID_BASE+23)
> -#define V4L2_CID_LASTP1                        (V4L2_CID_BASE+24) /*
> last CID + 1 */
> +
> +/* Deprecated, use V4L2_CID_PAN_RESET and V4L2_CID_TILT_RESET */
> +#define V4L2_CID_HCENTER_DEPRECATED    (V4L2_CID_BASE+22)
> +#define V4L2_CID_VCENTER_DEPRECATED    (V4L2_CID_BASE+23)

But there was no warning in Documentation/feature-removal-schedule.txt
and I'm receiving reports that it's breaking userspace apps (the
gstreamer-v4l2 plugin breaks in Fedora rawhide). You can't just pull
things from the published userspace API like that.

Please can we revert the addition of _DEPRECATED to these ioctl
definitions. Perhaps we can add a runtime warning if they actually get
used? Or a compile-time warning if we can manage that?

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
2008-06-05 06:35:47 -03:00
Linus Torvalds
3e387fcdc4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (56 commits)
  l2tp: Fix possible oops if transmitting or receiving when tunnel goes down
  tcp: Fix for race due to temporary drop of the socket lock in skb_splice_bits.
  tcp: Increment OUTRSTS in tcp_send_active_reset()
  raw: Raw socket leak.
  lt2p: Fix possible WARN_ON from socket code when UDP socket is closed
  USB ID for Philips CPWUA054/00 Wireless USB Adapter 11g
  ssb: Fix context assertion in ssb_pcicore_dev_irqvecs_enable
  libertas: fix command size for CMD_802_11_SUBSCRIBE_EVENT
  ipw2200: expire and use oldest BSS on adhoc create
  airo warning fix
  b43legacy: Fix controller restart crash
  sctp: Fix ECN markings for IPv6
  sctp: Flush the queue only once during fast retransmit.
  sctp: Start T3-RTX timer when fast retransmitting lowest TSN
  sctp: Correctly implement Fast Recovery cwnd manipulations.
  sctp: Move sctp_v4_dst_saddr out of loop
  sctp: retran_path update bug fix
  tcp: fix skb vs fack_count out-of-sync condition
  sunhme: Cleanup use of deprecated calls to save_and_cli and restore_flags.
  xfrm: xfrm_algo: correct usage of RIPEMD-160
  ...
2008-06-04 17:39:33 -07:00
Linus Torvalds
246dd412d3 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
  libata-sff: Fix oops reported in kerneloops.org for pnp devices with no ctl
  libata: kill unused constants
  sata_mv: PHY_MODE4 cleanups
  [libata] ata_piix: more acer short cable quirks
  [libata] ACPI: Properly handle bay devices in dock stations
2008-06-04 08:36:56 -07:00
Alan Cox
a57c1bade5 libata-sff: Fix oops reported in kerneloops.org for pnp devices with no ctl
- Make ata_sff_altstatus private so nobody uses it by mistake
- Drop the 400nS delay from it

Add

ata_sff_irq_status	-	encapsulates the IRQ check logic

This function keeps the existing behaviour for altstatus using devices. I
actually suspect the logic was wrong before the changes but -rc isn't the
time to play with that

ata_sff_sync		-	ensure writes hit the device

Really we want an io* operation for 'is posted' eg ioisposted(ioaddr) so
that we can fix the nasty delay this causes on most systems.

- ata_sff_pause		-	400nS delay

Ensure the command hit the device and delay 400nS

- ata_sff_dma_pause

Ensure the I/O hit the device and enforce an HDMA1:0 transition delay.
Requires altstatus register exists, BUG if not so we don't risk
corruption in MWDMA modes. (UDMA the checksum will save your backside in
theory)

The only other complication then is devices with their own handlers.
rb532 can use dma_pause but scc needs to access its own altstatus
register for internal errata workarounds so directly call the drivers own
altstatus function.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-04 06:40:41 -04:00
Tejun Heo
4f0ebe3cc5 libata: kill unused constants
Kill a few unused constants.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-04 06:29:14 -04:00
Thomas Graf
ab32cd793d route: Remove unused ifa_anycast field
The field was supposed to allow the creation of an anycast route by
assigning an anycast address to an address prefix. It was never
implemented so this field is unused and serves no purpose. Remove it.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-03 16:37:33 -07:00
Thomas Graf
1f9d11c7c9 route: Mark unused routing attributes as such
Also removes an unused policy entry for an attribute which is
only used in kernel->user direction.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-03 16:36:27 -07:00
Thomas Graf
51b77cae0d route: Mark unused route cache flags as such.
Also removes an obsolete check for the unused flag RTCF_MASQ.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-03 16:36:01 -07:00
Alan Cox
64e9159f5d serial_core: uart_set_ldisc infrastructure
The tty layer provides a callback that is used when the line discipline
is changed. Some hardware uses this to configure hardware specific
features such as IrDA mode on serial ports. Unfortunately the serial
layer does not provide this feature or pass it down to drivers.

Blackfin used to hack around this by rewriting the tty ops, but those are
now properly shared and const so the hack fails. Instead provide the
proper operations.

This change plus a follow up from the Blackfin guys is needed to avoid
blackfin losing features in this release.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-03 08:20:17 -07:00
Anton Vorontsov
63e14626ed mmc_spi: mmc_spi.h should include linux/interrupts.h
Since mmc_spi.h uses irqreturn_t type, it should include appropriate
header, otherwise build will break if users didn't include it (some of
them do not use interrupts).

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-02 15:27:10 -07:00
Steven Rostedt
ad90c0e3ce ftrace: user update and disable dynamic ftrace daemon
In dynamic ftrace, the mcount function starts off pointing to a stub
function that just returns.

On start up, the call to the stub is modified to point to a "record_ip"
function. The job of the record_ip function is to add the function to
a pre-allocated hash list. If the function is already there, it simply is
ignored, otherwise it is added to the list.

Later, a ftraced daemon wakes up and calls kstop_machine if any functions
have been recorded, and changes the calls to the recorded functions to
a simple nop.  If no functions were recorded, the daemon goes back to sleep.

The daemon wakes up once a second to see if it needs to update any newly
recorded functions into nops.  Usually it does not, but if a lot of code
has been executed for the first time in the kernel, the ftraced daemon
will call kstop_machine to update those into nops.

The problem currently is that there's no way to stop the daemon from doing
this, and it can cause unneeded latencies (800us which for some is bothersome).

This patch adds a new file /debugfs/tracing/ftraced_enabled. If the daemon
is active, reading this will return "enabled\n" and "disabled\n" when the
daemon is not running. To disable the daemon, the user can echo "0" or
"disable" into this file, and "1" or "enable" to re-enable the daemon.

Since the daemon is used to convert the functions into nops to increase
the performance of the system, I also added that anytime something is
written into the ftraced_enabled file, kstop_machine will run if there
are new functions that have been detected that need to be converted.

This way the user can disable the daemon but still be able to control the
conversion of the mcount calls to nops by simply,

  "echo 0 > /debugfs/tracing/ftraced_enabled"

when they need to do more conversions.

To see the number of converted functions:

  "cat /debugfs/tracing/dyn_ftrace_total_info"

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-02 12:50:04 +02:00
Andrew G. Morgan
ca05a99a54 capabilities: remain source compatible with 32-bit raw legacy capability support.
Source code out there hard-codes a notion of what the
_LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the
raw capability system calls capget() and capset().  Its unfortunate, but
true.

Since the confusing header file has been in a released kernel, there is
software that is erroneously using 64-bit capabilities with the semantics
of 32-bit compatibilities.  These recently compiled programs may suffer
corruption of their memory when sys_getcap() overwrites more memory than
they are coded to expect, and the raising of added capabilities when using
sys_capset().

As such, this patch does a number of things to clean up the situation
for all. It

  1. forces the _LINUX_CAPABILITY_VERSION define to always retain its
     legacy value.

  2. adopts a new #define strategy for the kernel's internal
     implementation of the preferred magic.

  3. deprecates v2 capability magic in favor of a new (v3) magic
     number. The functionality of v3 is entirely equivalent to v2,
     the only difference being that the v2 magic causes the kernel
     to log a "deprecated" warning so the admin can find applications
     that may be using v2 inappropriately.

[User space code continues to be encouraged to use the libcap API which
protects the application from details like this.  libcap-2.10 is the first
to support v3 capabilities.]

Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518.
Thanks to Bojan Smojver for the report.

[akpm@linux-foundation.org: s/depreciate/deprecate/g]
[akpm@linux-foundation.org: be robust about put_user size]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Bojan Smojver <bojan@rexursive.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2008-05-31 16:36:16 -07:00
Linus Torvalds
ab8cd81830 Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  lguest: notify on empty
  virtio: force callback on empty.
  virtio_blk: fix endianess annotations
  virtio_config: fix len calculation of config elements
  virtio_net: another race with virtio_net and enable_cb
  virtio: An entropy device, as suggested by hpa.
  virtio_blk: allow read-only disks
  lguest: fix ugly <NULL> in /proc/interrupts
  virtio: set device index in common code.
  virtio: virtio_pci should not set bus_id.
  virtio: bus_id for devices should contain 'virtio'
  Fix crash in virtio_blk during modprobe ; rmmod ; modprobe
  lguest: use ioremap_cache, not ioremap
2008-05-30 10:20:03 -07:00
Linus Torvalds
7536d7be7b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: rename SW_RADIO to SW_RFKILL_ALL
  Input: gtco - fix double kfree in error handling path
  Input: pxa27x_keypad - miscellaneous fixes
  Input: atkbd - mark keyboard as disabled when suspending/unloading
  Input: apanel - remove duplicate include
  Input: wm9713 - support five wire panels
  Input: wm97xx-core - fix race on PHY init
  Input: wm97xx-core - fix driver name
  Input: wm97xx-core - report a phys for WM97xx touchscreens
  Input: i8042 - make sure Dritek quirk is invoked at resume
  Input: i8042 - add Dritek quirk for Acer TravelMate 660
2008-05-30 10:17:19 -07:00
Henrique de Moraes Holschuh
5adad01339 Input: rename SW_RADIO to SW_RFKILL_ALL
The SW_RADIO code for EV_SW events has a name that is not descriptive
enough of its intended function, and could induce someone to think
KEY_RADIO is its EV_KEY counterpart, which is false.

Rename it to SW_RFKILL_ALL, and document what this event is for.  Keep
the old name around, to avoid userspace ABI breaks.

The SW_RFKILL_ALL event is meant to be used by rfkill master switches.  It
is not bound to a particular radio switch type, and usually applies to all
types.  It is semantically tied to master rfkill switches that enable or
disable every radio in a system.

Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
2008-05-30 10:40:46 -04:00
Rusty Russell
b4f68be6c5 virtio: force callback on empty.
virtio allows drivers to suppress callbacks (ie. interrupts) for
efficiency (no locking, it's just an optimization).

There's a similar mechanism for the host to suppress notifications
coming from the guest: in that case, we ignore the suppression if the
ring is completely full.

It turns out that life is simpler if the host similarly ignores
callback suppression when the ring is completely empty: the network
driver wants to free up old packets in a timely manner, and otherwise
has to use a timer to poll.

We have to remove the code which ignores interrupts when the driver
has disabled them (again, it had no locking and hence was unreliable
anyway).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-30 15:09:46 +10:00
Christian Borntraeger
7757f09c70 virtio_blk: fix endianess annotations
Since commit 72e61eb40b (virtio: change config
to guest endian) config space is no longer fixed endian.

Lets change the virtio_blk_config variables.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-30 15:09:45 +10:00
Christian Borntraeger
7f31fe0500 virtio_config: fix len calculation of config elements
Rusty,

This patch is a prereq for the virtio_blk blocksize patch, please apply it
first.

Adding an u32 value to the virtio_blk_config unconvered a small bug the config
space defintions:
v is a pointer, to we have to use sizeof(*v) instead of sizeof(v).

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-30 15:09:45 +10:00
Rusty Russell
f7f510ec19 virtio: An entropy device, as suggested by hpa.
Note that by itself, having a "hardware" random generator does very
little: you should probably run "rngd" in your guest to feed this into
the kernel entropy pool.

Included:
	virtio_rng: dont use vmalloced addresses for virtio

	If virtio_rng is build as a module, random_data is an address
	in vmalloc space. As virtio expects guest real addresses, this
	can cause any kind of funny behaviour, so lets allocate
	random_data dynamically with kmalloc.

	Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-30 15:09:44 +10:00
Christian Borntraeger
3ef5360954 virtio_blk: allow read-only disks
Hello Rusty,

sometimes it is useful to share a disk (e.g. usr). To avoid file system
corruption, the disk should be mounted read-only in that case. This patch
adds a new feature flag, that allows the host to specify, if the disk should
be considered read-only.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-30 15:09:44 +10:00
Linus Torvalds
916941b2bf Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6:
  driver-core: prepare for 2.6.27 api change by adding dev_set_name
2008-05-29 21:29:39 -07:00
Stephen Rothwell
413c239fad driver-core: prepare for 2.6.27 api change by adding dev_set_name
Create the dev_set_name function now so that various subsystems can
start changing over to it before other changes in 2.6.27 will make it
compulsory.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-05-29 21:10:01 -07:00
Linus Torvalds
a7f75d3bed Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: re-tune NUMA topologies
  sched: stop wake_affine from causing serious imbalance
  sched: fix sched_clock_cpu()
  revert ("sched: fair-group: SMP-nice for group scheduling")
  sched: cleanup
  show_schedstat(): fix memleak
  sched: unite unlikely pairs in rt_policy() and schedule_debug()
  revert ("sched: fair: weight calculations")
2008-05-29 09:26:17 -07:00
Ingo Molnar
6715930654 Merge commit 'linus/master' into sched-fixes-for-linus 2008-05-29 16:05:05 +02:00
Ingo Molnar
ea3f01f8af sched: re-tune NUMA topologies
improve the sysbench ramp-up phase and its peak throughput on
a 16way NUMA box, by turning on WAKE_AFFINE:

             tip/sched   tip/sched+wake-affine
-------------------------------------------------
    1:             700              830    +15.65%
    2:            1465             1391    -5.28%
    4:            3017             3105    +2.81%
    8:            5100             6021    +15.30%
   16:           10725            10745    +0.19%
   32:           10135            10150    +0.16%
   64:            9338             9240    -1.06%
  128:            8599             8252    -4.21%
  256:            8475             8144    -4.07%
-------------------------------------------------
  SUM:           57558            57882    +0.56%

this change also improves lat_ctx from 6.69 usecs to 1.11 usec:

  $ ./lat_ctx -s 0 2
  "size=0k ovr=1.19
  2 1.11

  $ ./lat_ctx -s 0 2
  "size=0k ovr=1.22
  2 6.69

in sysbench it's an overall win with some weakness at the lots-of-clients
side. That happens because we now under-balance this workload
a bit. To counter that effect, turn on NEWIDLE:

              wake-idle          wake-idle+newidle
 -------------------------------------------------
     1:             830              834    +0.43%
     2:            1391             1401    +0.65%
     4:            3105             3091    -0.43%
     8:            6021             6046    +0.42%
    16:           10745            10736    -0.08%
    32:           10150            10206    +0.55%
    64:            9240             9533    +3.08%
   128:            8252             8355    +1.24%
   256:            8144             8384    +2.87%
 -------------------------------------------------
   SUM:           57882            58591    +1.21%

as a bonus this not only improves the many-clients case but
also improves the (more important) rampup phase.

sysbench is a workload that quickly breaks down if the
scheduler over-balances, so since it showed an improvement
under NEWIDLE this change is definitely good.
2008-05-29 14:46:30 +02:00
Ingo Molnar
6363ca57c7 revert ("sched: fair-group: SMP-nice for group scheduling")
Yanmin Zhang reported:

Comparing with 2.6.25, volanoMark has big regression with kernel 2.6.26-rc1.
It's about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito.

With bisect, I located the following patch:

| 18d95a2832 is first bad commit
| commit 18d95a2832
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date:   Sat Apr 19 19:45:00 2008 +0200
|
|     sched: fair-group: SMP-nice for group scheduling

Revert it so that we get v2.6.25 behavior.

Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:28:57 +02:00
Jens Axboe
64565911cd block: make blktrace use per-cpu buffers for message notes
Currently it uses a single static char array, but that risks
being corrupted when multiple users issue message notes at the
same time. Make the buffers dynamically allocated when the trace
is setup and make them per-cpu instead.

The default max message size of 1k is also very large, the
interface is mainly for small text notes. So shrink it to 128 bytes.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Alan D. Brunelle
9d5f09a424 Added in MESSAGE notes for blktraces
Allows messages to be inserted into blktrace streams.

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Ingo Molnar
b1829d2705 ftrace: fix merge
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-28 01:25:04 +02:00
Jeremy Fitzhardinge
0e91398f2a xen: implement save/restore
This patch implements Xen save/restore and migration.

Saving is triggered via xenbus, which is polled in
drivers/xen/manage.c.  When a suspend request comes in, the kernel
prepares itself for saving by:

1 - Freeze all processes.  This is primarily to prevent any
    partially-completed pagetable updates from confusing the suspend
    process.  If CONFIG_PREEMPT isn't defined, then this isn't necessary.

2 - Suspend xenbus and other devices

3 - Stop_machine, to make sure all the other vcpus are quiescent.  The
    Xen tools require the domain to run its save off vcpu0.

4 - Within the stop_machine state, it pins any unpinned pgds (under
    construction or destruction), performs canonicalizes various other
    pieces of state (mostly converting mfns to pfns), and finally

5 - Suspend the domain

Restore reverses the steps used to save the domain, ending when all
the frozen processes are thawed.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-27 10:11:38 +02:00
Markus Armbruster
9e124fe16f xen: Enable console tty by default in domU if it's not a dummy
Without console= arguments on the kernel command line, the first
console to register becomes enabled and the preferred console (the one
behind /dev/console).  This is normally tty (assuming
CONFIG_VT_CONSOLE is enabled, which it commonly is).

This is okay as long tty is a useful console.  But unless we have the
PV framebuffer, and it is enabled for this domain, tty0 in domU is
merely a dummy.  In that case, we want the preferred console to be the
Xen console hvc0, and we want it without having to fiddle with the
kernel command line.  Commit b8c2d3dfbc
did that for us.

Since we now have the PV framebuffer, we want to enable and prefer tty
again, but only when PVFB is enabled.  But even then we still want to
enable the Xen console as well.

Problem: when tty registers, we can't yet know whether the PVFB is
enabled.  By the time we can know (xenstore is up), the console setup
game is over.

Solution: enable console tty by default, but keep hvc as the preferred
console.  Change the preferred console to tty when PVFB probes
successfully, unless we've been given console kernel parameters.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-27 10:11:36 +02:00
Mark Brown
43f83a8f99 Input: wm9713 - support five wire panels
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
2008-05-27 01:37:26 -04:00
Steven Rostedt
41c52c0db9 ftrace: set_ftrace_notrace feature
While debugging latencies in the RT kernel, I found that it would be nice
to be able to filter away functions from the trace than just to filter
on functions.

I added a new interface to the debugfs tracing directory called

  set_ftrace_notrace

When dynamic frace is enabled, this lets you filter away functions that will
not be recorded in the trace. It is similar to adding 'notrace' to those
functions but by doing it without recompiling the kernel.

Here's how set_ftrace_filter and set_ftrace_notrace interact. Remember, if
set_ftrace_filter is set, it removes all functions from the trace execpt for
those listed in the set_ftrace_filter. set_ftrace_notrace will prevent those
functions from being traced.

If you were to set one function in both set_ftrace_filter and
set_ftrace_notrace and that function was the same, then you would end up
with an empty trace.

the set of functions to trace is:

  set_ftrace_filter == empty then

     all functions not in set_ftrace_notrace

  else

     set of the set_ftrace_filter and not in set of set_ftrace_notrace.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-26 22:51:37 +02:00
Oleg Nesterov
cbaffba12c posix timers: discard SI_TIMER signals on exec
Based on Roland's patch. This approach was suggested by Austin Clements
from the very beginning, and then by Linus.

As Austin pointed out, the execing task can be killed by SI_TIMER signal
because exec flushes the signal handlers, but doesn't discard the pending
signals generated by posix timers. Perhaps not a bug, but people find this
surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-26 10:37:07 -07:00
Linus Torvalds
84a881657d Merge branch 'i2c-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6
* 'i2c-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6:
  i2c: Align i2c_device_id
  tuner: Do not alter i2c_client.name
2008-05-26 10:24:06 -07:00
Linus Torvalds
c5e6fd28e5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (52 commits)
  vlan: Use bitmask of feature flags instead of seperate feature bits
  fmvj18x_cs: add NextCom NC5310 rev B support
  xirc2ps_cs: re-initialize the multicast address in do_reset
  3C509: rx_bytes should not be increased when alloc_skb failed
  NETFRONT: Use __skb_queue_purge()
  VIRTIO: Use __skb_queue_purge()
  phylib: do EXPORT_SYMBOL on get_phy_id
  netlink: Fix nla_parse_nested_compat() to call nla_parse() directly
  WAN: protect HDLC proto list while insmod/rmmod
  drivers/net/fs_enet: remove null pointer dereference
  S2io: Version update for napi and MSI-X patches
  S2io: Added napi support when MSIX is enabled.
  S2io: Move all the transmit completions to a single msi-x (alarm) vector
  drivers/net/ehea - remove unnecessary memset after kzalloc
  au1000_eth: remove useless check
  Blackfin EMAC Driver: Removed duplicated include <linux/ethtool.h>
  cpmac bugfixes and enhancements
  e1000e: use resource_size_t, not unsigned long, for phys addrs
  net/usb: add support for Apple USB Ethernet Adapter
  uli526x: add support for netpoll
  ...
2008-05-26 10:14:02 -07:00
Theodore Ts'o
624080eded jbd2: If a journal checksum error is detected, propagate the error to ext4
If a journal checksum error is detected, the ext4 filesystem will call
ext4_error(), and the mount will either continue, become a read-only
mount, or cause a kernel panic based on the superblock flags
indicating the user's preference of what to do in case of filesystem
corruption being detected.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-06-06 17:50:40 -04:00
Jiri Slaby
2548baa07d i2c: Align i2c_device_id
Align i2c_device_id.driver_data to 8 bytes to not fail on crossbuilds.

(Added in d2653e92732bd3911feff6bee5e23dbf959381db.)

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
2008-05-26 16:08:40 +02:00
Paul Jackson
c801ed3860 x86 boot: simplify pageblock_bits enum declaration
The use of #defines with '##' pre-processor concatenation is a useful
way to form several symbol names with a common pattern.  But when there
is just a single name obtained from that #define, it's just obfuscation.
Better to just write the plain symbol name, as is.

The following patch is a result of my wasting ten minutes looking through
the kernel to figure out what 'PB_migrate_end' meant, and forgetting what
I came to do, by the time I figured out that the #define PB_range macro
defined it.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-25 10:55:11 +02:00
Paul Jackson
e9197bf011 x86 boot: remove some unused extern function declarations
Remove three extern declarations for routines
that don't exist.  Fix a typo in a comment.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-25 10:55:10 +02:00
Carlos R. Mafra
962cf36c5b Remove argument from open_softirq which is always NULL
As git-grep shows, open_softirq() is always called with the last argument
being NULL

block/blk-core.c:       open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL);
kernel/hrtimer.c:       open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq, NULL);
kernel/rcuclassic.c:    open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
kernel/rcupreempt.c:    open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
kernel/sched.c: open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
kernel/softirq.c:       open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
kernel/softirq.c:       open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
kernel/timer.c: open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);
net/core/dev.c: open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
net/core/dev.c: open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);

This observation has already been made by Matthew Wilcox in June 2002
(http://www.cs.helsinki.fi/linux/linux-kernel/2002-25/0687.html)

"I notice that none of the current softirq routines use the data element
passed to them."

and the situation hasn't changed since them. So it appears we can safely
remove that extra argument to save 128 (54) bytes of kernel data (text).

Signed-off-by: Carlos R. Mafra <crmafra@ift.unesp.br>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-25 07:43:15 +02:00
Jan Beulich
63687a528c x86: move tracedata to RODATA
.. allowing it to be write-protected just as other read-only data
under CONFIG_DEBUG_RODATA.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-25 07:09:47 +02:00
Thomas Gleixner
42fdfa238a namespacecheck: more kernel/printk.c fixes
[ Stephen Rothwell <sfr@canb.auug.org.au>: build fix ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-24 23:14:51 +02:00
Fernando Luis Vazquez Cao
12d15f0d51 for_each_online_pgdat(): kerneldoc fix
for_each_pgdat() was renamed to for_each_online_pgdat() and kerneldoc
comments should be updated accordingly.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:13 -07:00
David Brownell
6ea0205b56 gpio: build fixes
This fixes various gpio-related build errors (mostly potential)
reported in part by Russell King and Uwe Kleine-König.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Arnaud Patard <arnaud.patard@rtp-net.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:13 -07:00
Ben Dooks
cdc83ae245 SM501: reverse FPEN/VBIASEN flags behaviour
To keep backwards compatibility, reverse the meanings of these flags so
that when they are not set, the driver uses the original behvaiour.

Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Cc: Arnaud Patard <arnaud.patard@rtp-net.org>
Acked-by: Krzysztof Helt <krzysztof.h1@wp.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:12 -07:00
NeilBrown
dfc7064500 md: restart recovery cleanly after device failure.
When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.

For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.

We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
  - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
    which causes the recovery not be be checkpointed.
  - We remove spares and then re-added them which loses important state
    information.

The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed.  If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error.  So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.

Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded).  Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.

Issue:  If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available,  do we want to:
 1/ complete the current rebuild, then go back and rebuild the new spare or
 2/ restart the rebuild from the start and rebuild both devices in
    parallel.

Both options can be argued for.  The code currently takes option 2 as
  a/ this requires least code change
  b/ this results in a minimally-degraded array in minimal time.

Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:10 -07:00
Bernd Schubert
90b08710e4 md: allow parallel resync of md-devices.
In some configurations, a raid6 resync can be limited by CPU speed
(Calculating P and Q and moving data) rather than by device speed.  In
these cases there is nothing to be gained byt serialising resync of arrays
that share a device, and doing the resync in parallel can provide benefit.
 So add a sysfs tunable to flag an array as being allowed to resync in
parallel with other arrays that use (a different part of) the same device.

Signed-off-by: Bernd Schubert <bs@q-leap.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:10 -07:00
Christoph Hellwig
6bcfd60186 md: kill file_path wrapper
Kill the trivial and rather pointless file_path wrapper around d_path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:09 -07:00
Adrian Bunk
03de250a26 md: proper extern for mdp_major
This patch adds a proper extern for mdp_major in include/linux/raid/md.h

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:09 -07:00
Alan Cox
80119ef5c8 mm: fix atomic_t overflow in vm
The atomic_t type is 32bit but a 64bit system can have more than 2^32
pages of virtual address space available.  Without this we overflow on
ludicrously large mappings

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:09 -07:00
maximilian attems
6c7c6afbb8 types.h: don't expose struct ustat to userspace
<linux/types.h> can't be used together with <sys/ustat.h> because they
both define struct ustat:

    $ cat test.c
    #include <sys/ustat.h>
    #include <linux/types.h>
    $ gcc -c test.c
    In file included from test.c:2:
    /usr/include/linux/types.h:165: error: redefinition of 'struct ustat'

has been reported a while ago to debian, but seems to have been
lost in cat fighting: http://bugs.debian.org/429064

Signed-off-by: maximilian attems <max@stro.at>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:09 -07:00
Ignacio García Pérez
4b6f6ce97e serial: support for InstaShield IS-400 four port RS-232 PCI card
Add support for the InstaShield IS-400 four port RS-232 PCI card.

Signed-off-by: Ignacio García Pérez <iggarpe@t2i.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:09 -07:00
Darrick J. Wong
b8fdaf5a05 i5k_amb: support Intel 5400 chipset
Minor rework to support the Intel 5400 chipset.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Cc: "Mark M. Hoffman" <mhoffman@lightlink.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:08 -07:00
Pekka Paalanen
a50445d76c mmiotrace: rename kmmio_probe::user_data to :private.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-24 11:27:41 +02:00
Pekka Paalanen
dee310d0ad x86 mmiotrace: use resource_size_t for phys addresses
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-24 11:27:36 +02:00
Pekka Paalanen
970e6fa038 mmiotrace: code style cleanups
From c2da03771e29159627c5c7b9509ec70bce9f91ee Mon Sep 17 00:00:00 2001
From: Pekka Paalanen <pq@iki.fi>
Date: Mon, 28 Apr 2008 21:25:22 +0300

Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-24 11:27:28 +02:00
Pekka Paalanen
138295373c ftrace: mmiotrace update, #2
another weekend, another patch. This should apply on top of my previous patch
from March 23rd.

Summary of changes:
- Print PCI device list in output header
- work around recursive probe hits on SMP
- refactor dis/arm_kmmio_fault_page() and add check for page levels
- remove un/reference_kmmio(), the die notifier hook is registered
permanently into the list
- explicitly check for single stepping in die notifier callback

I have tested this version on my UP Athlon64 desktop with Nouveau, and
SMP Core 2 Duo laptop with the proprietary nvidia driver. Both systems
are 64-bit. One previously unknown bug crept into daylight: the ftrace
framework's output routines print the first entry last after buffer has
wrapped around.

The most important regressions compared to non-ftrace mmiotrace at this
time are:
- failure of trace_pipe file
- illegal lines in output file
- unaware of losing data due to buffer full

Personally I'd like to see these three solved before submitting to
mainline. Other issues may come up once we know when we lose events.

Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-24 11:25:16 +02:00
Pekka Paalanen
bd8ac686c7 ftrace: mmiotrace, updates
here is a patch that makes mmiotrace work almost well within the tracing
framework. The patch applies on top of my previous patch. I have my own
output formatting in place now.

Summary of changes:
- fix the NULL dereference that was due to not calling tracing_reset()
- add print_line() callback into struct tracer
- implement print_line() for mmiotrace, producing up-to-spec text
- add my output header, but that is not really called in the right place
- rewrote the main structs in mmiotrace
- added two new trace entry types: TRACE_MMIO_RW and TRACE_MMIO_MAP
- made some functions in trace.c non-static
- check current==NULL in tracing_generic_entry_update()
- fix(?) comparison in trace_seq_printf()

Things seem to work fine except a few issues. Markers (text lines injected
into mmiotrace log) are missing, I did not feel hacking them in before we
have variable length entries. My output header is printed only for 'trace'
file, but not 'trace_pipe'. For some reason, despite my quick fix,
iter->trace is NULL in print_trace_line() when called from 'trace_pipe'
file, which means I don't get proper output formatting.

I only tried by loading nouveau.ko, which just detects the card, and that
is traced fine. I didn't try further. Map, two reads and unmap. Works
perfectly.

I am missing the information about overflows, I'd prefer to have a
counter for lost events. I didn't try, but I guess currently there is no
way of knowning when it overflows?

So, not too far from being fully operational, it seems :-)
And looking at the diffstat, there also is some 700-900 lines of user space
code that just became obsolete.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-24 11:24:53 +02:00
Pekka Paalanen
f984b51e07 ftrace: add mmiotrace plugin
On Sat, 22 Mar 2008 13:07:47 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> > > i'd suggest the following: pull x86.git and sched-devel.git into a
> > > single tree [the two will combine without rejects]. Then try to add a
> > > kernel/tracing/trace_mmiotrace.c ftrace plugin. The trace_sysprof.c
> > > plugin might be a good example.
> >
> > I did this and now I have mmiotrace enabled/disabled via the tracing
> > framework (what do we call this, since ftrace is one of the tracers?).
>
> cool! could you send the patches for that? (even if they are not fully
> functional yet)

Patch attached in the end. Nice to see how much code disappeared. I tried
to mark all the features I had to break with XXX-comments.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-24 11:22:43 +02:00
Pekka Paalanen
d61fc44853 x86: mmiotrace, preview 2
Kconfig.debug, Makefile and testmmiotrace.c style fixes.
Use real mutex instead of mutex.
Fix failure path in register probe func.
kmmio: RCU read-locked over single stepping.
Generate mapping id's.
Make mmio-mod.c built-in and rewrite its locking.
Add debugfs file to enable/disable mmiotracing.
kmmio: use irqsave spinlocks.
Lots of cleanups in mmio-mod.c
Marker file moved from /proc into debugfs.
Call mmiotrace entrypoints directly from ioremap.c.

Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-24 11:22:24 +02:00
Pekka Paalanen
0fd0e3da45 x86: mmiotrace full patch, preview 1
kmmio.c handles the list of mmio probes with callbacks, list of traced
pages, and attaching into the page fault handler and die notifier. It
arms, traps and disarms the given pages, this is the core of mmiotrace.

mmio-mod.c is a user interface, hooking into ioremap functions and
registering the mmio probes. It also decodes the required information
from trapped mmio accesses via the pre and post callbacks in each probe.
Currently, hooking into ioremap functions works by redefining the symbols
of the target (binary) kernel module, so that it calls the traced
versions of the functions.

The most notable changes done since the last discussion are:
- kmmio.c is a built-in, not part of the module
- direct call from fault.c to kmmio.c, removing all dynamic hooks
- prepare for unregistering probes at any time
- make kmmio re-initializable and accessible to more than one user
- rewrite kmmio locking to remove all spinlocks from page fault path

Can I abuse call_rcu() like I do in kmmio.c:unregister_kmmio_probe()
or is there a better way?

The function called via call_rcu() itself calls call_rcu() again,
will this work or break? There I need a second grace period for RCU
after the first grace period for page faults.

Mmiotrace itself (mmio-mod.c) is still a module, I am going to attack
that next. At some point I will start looking into how to make mmiotrace
a tracer component of ftrace (thanks for the hint, Ingo). Ftrace should
make the user space part of mmiotracing as simple as
'cat /debug/trace/mmio > dump.txt'.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-24 11:22:12 +02:00
Pekka Paalanen
63ffa3e456 x86 mmiotrace: comment about user space ABI
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-24 11:21:48 +02:00
Pekka Paalanen
8b7d89d02e x86: mmiotrace - trace memory mapped IO
Mmiotrace is a tool for trapping memory mapped IO (MMIO) accesses within
the kernel. It is used for debugging and especially for reverse
engineering evil binary drivers.

Mmiotrace works by wrapping the ioremap family of kernel functions and
marking the returned pages as not present. Access to the IO memory
triggers a page fault, which will be handled by mmiotrace's custom page
fault handler. This will single-step the faulted instruction with the
MMIO page marked as present. Access logs are directed to user space via
relay and debug_fs.

This page fault approach is necessary, because binary drivers have
readl/writel etc. calls inlined and therefore extremely difficult to
trap with with e.g. kprobes.

This patch depends on the custom page fault handlers patch.

Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-24 11:21:14 +02:00
Ingo Molnar
489f139614 ftrace: fix build bug
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 22:37:04 +02:00
Ingo Molnar
d49dbf33f0 ftrace: fix include file dependency
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 22:36:51 +02:00
Ingo Molnar
74f4e369fc ftrace: stacktrace fix
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 22:34:56 +02:00
Mathieu Desnoyers
5b82a1b08a Port ftrace to markers
Porting ftrace to the marker infrastructure.

Don't need to chain to the wakeup tracer from the sched tracer, because markers
support multiple probes connected.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 22:29:25 +02:00
Mathieu Desnoyers
0aa977f592 Markers - define non optimized marker
To support the forthcoming "immediate values" marker optimization, we must have
a way to declare markers in few code paths that does not use instruction
modification based enable. This will be the case of printk(), some traps and
eventually lockdep instrumentation.

Changelog :
- Fix reversed boolean logic of "generic".

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 22:26:03 +02:00
Mathieu Desnoyers
dc102a8fae Markers - remove extra format argument
Denys Vlasenko <vda.linux@googlemail.com> :

> Not in this patch, but I noticed:
>
> #define __trace_mark(name, call_private, format, args...)               \
>         do {                                                            \
>                 static const char __mstrtab_##name[]                    \
>                 __attribute__((section("__markers_strings")))           \
>                 = #name "\0" format;                                    \
>                 static struct marker __mark_##name                      \
>                 __attribute__((section("__markers"), aligned(8))) =     \
>                 { __mstrtab_##name, &__mstrtab_##name[sizeof(#name)],   \
>                 0, 0, marker_probe_cb,                                  \
>                 { __mark_empty_function, NULL}, NULL };                 \
>                 __mark_check_format(format, ## args);                   \
>                 if (unlikely(__mark_##name.state)) {                    \
>                         (*__mark_##name.call)                           \
>                                 (&__mark_##name, call_private,          \
>                                 format, ## args);                       \
>                 }                                                       \
>         } while (0)
>
> In this call:
>
>                         (*__mark_##name.call)                           \
>                                 (&__mark_##name, call_private,          \
>                                 format, ## args);                       \
>
> you make gcc allocate duplicate format string. You can use
> &__mstrtab_##name[sizeof(#name)] instead since it holds the same string,
> or drop ", format," above and "const char *fmt" from here:
>
>         void (*call)(const struct marker *mdata,        /* Probe wrapper */
>                 void *call_private, const char *fmt, ...);
>
> since mdata->format is the same and all callees which need it can take it there.

Very good point. I actually thought about dropping it, since it would
remove an unnecessary argument from the stack. And actually, since I now
have the marker_probe_cb sitting between the marker site and the
callbacks, there is no API change required. Thanks :)

Mathieu

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 22:25:27 +02:00
Steven Rostedt
3eefae994d ftrace: limit trace entries
Currently there is no protection from the root user to use up all of
memory for trace buffers. If the root user allocates too many entries,
the OOM killer might start kill off all tasks.

This patch adds an algorith to check the following condition:

 pages_requested > (freeable_memory + current_trace_buffer_pages) / 4

If the above is met then the allocation fails. The above prevents more
than 1/4th of freeable memory from being used by trace buffers.

To determine the freeable_memory, I made determine_dirtyable_memory in
mm/page-writeback.c global.

Special thanks goes to Peter Zijlstra for suggesting the above calculation.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 22:05:14 +02:00
Ingo Molnar
88a4216c3e ftrace: sched special
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:08:47 +02:00
Ingo Molnar
1a3c303433 ftrace: fix __trace_special()
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:07:20 +02:00
Ingo Molnar
017730c112 ftrace: fix wakeups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:05:02 +02:00
Ingo Molnar
4e65551905 ftrace: sched tracer, trace full rbtree
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:04:44 +02:00
Ingo Molnar
8ac0fca4cc ftrace: sched tracer fix
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:04:28 +02:00
Ingo Molnar
86387f7ee5 ftrace: add stack tracing
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 21:04:20 +02:00
Ingo Molnar
aeaee8a2c9 ftrace: build fix
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:55:33 +02:00
Steven Rostedt
4eebcc81a3 ftrace: disable tracing on failure
Since ftrace touches practically every function. If we detect any
anomaly, we want to fully disable ftrace. This patch adds code
to try shutdown ftrace as much as possible without doing any more
harm is something is detected not quite correct.

This only kills ftrace, this patch does have checks for other parts of
the tracer (irqsoff, wakeup, etc.).

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:54:16 +02:00
Steven Rostedt
37ad508419 ftrace - fix dynamic ftrace memory leak
The ftrace dynamic function update allocates a record to store the
instruction pointers that are being modified. If the modified
instruction pointer fails to update, then the record is marked as
failed and nothing more is done.

Worse, if the modification fails, but the record ip function is still
called, it will allocate a new record and try again. In just a matter
of time, will this cause a serious memory leak and crash the system.

This patch plugs this memory leak. When a record fails, it is
included back into the pool of records to be used. Now a record may
fail over and over again, but the number of allocated records will
not increase.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:54:04 +02:00
Steven Rostedt
77a2b37d22 ftrace: startup tester on dynamic tracing.
This patch adds a startup self test on dynamic code modification
and filters. The test filters on a specific function, makes sure that
no other function is traced, exectutes the function, then makes sure that
the function is traced.

This patch also fixes a slight bug with the ftrace selftest, where
tracer_enabled was not being set.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:41:06 +02:00
Ingo Molnar
c7aafc5497 ftrace: cleanups
factor out code and clean it up.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:40:46 +02:00
Steven Rostedt
e1c08bdd9f ftrace: force recording
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:40:29 +02:00
Ingo Molnar
f43fdad862 ftrace: fix kexec
disable the tracer while kexec pulls the rug from under the old
kernel.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:39:05 +02:00
Steven Rostedt
5072c59fd4 ftrace: add filter select functions to trace
This patch adds two files to the debugfs system:

 /debugfs/tracing/available_filter_functions

and

 /debugfs/tracing/set_ftrace_filter

The available_filter_functions lists all functions that has been
recorded by the ftraced that has called the ftrace_record_ip function.
This is to allow users to see what functions have been converted
to nops and can be enabled for tracing.

To enable functions, simply echo the names (whitespace delimited)
into set_ftrace_filter. Simple wildcards are also allowed.

echo 'scheduler' > /debugfs/tracing/set_ftrace_filter

Will have only the scheduler be activated when tracing is enabled.

echo 'sched_*' > /debugfs/tracing/set_ftrace_filter

Will have only the functions starting with 'sched_' be activated.

echo '*lock' > /debugfs/tracing/set_ftrace_filter

Will have only functions ending with 'lock' be activated.

echo '*lock*' > /debugfs/tracing/set_ftrace_filter

Will have only functions with 'lock' in its name be activated.

Note: 'sched*lock' will not work. The only wildcards that are
allowed is an asterisk and the beginning and or end of the string
passed in.

Multiple names can be passed in with whitespace delimited:

echo 'scheduler *lock *acpi*' > /debugfs/tracing/set_ftrace_filter

is also the same as:

echo 'scheduler' > /debugfs/tracing/set_ftrace_filter
echo '*lock' >> /debugfs/tracing/set_ftrace_filter
echo '*acpi*' >> /debugfs/tracing/set_ftrace_filter

Appending does just that. It appends to the list.

To disable all filters simply echo an empty line in:

echo > /debugfs/tracing/set_ftrace_filter

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:38:41 +02:00
Steven Rostedt
d61f82d066 ftrace: use dynamic patching for updating mcount calls
This patch replaces the indirect call to the mcount function
pointer with a direct call that will be patched by the
dynamic ftrace routines.

On boot up, the mcount function calls the ftace_stub function.
When the dynamic ftrace code is initialized, the ftrace_stub
is replaced with a call to the ftrace_record_ip, which records
the instruction pointers of the locations that call it.

Later, the ftraced daemon will call kstop_machine and patch all
the locations to nops.

When a ftrace is enabled, the original calls to mcount will now
be set top call ftrace_caller, which will do a direct call
to the registered ftrace function. This direct call is also patched
when the function that should be called is updated.

All patching is performed by a kstop_machine routine to prevent any
type of race conditions that is associated with modifying code
on the fly.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:33:47 +02:00
Steven Rostedt
3c1720f00b ftrace: move memory management out of arch code
This patch moves the memory management of the ftrace
records out of the arch code and into the generic code
making the arch code simpler.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:33:35 +02:00
Steven Rostedt
b0fc494fae ftrace: add ftrace_enabled sysctl to disable mcount function
This patch adds back the sysctl ftrace_enabled. This time it is
defaulted to on, if DYNAMIC_FTRACE is configured. When ftrace_enabled
is disabled, the ftrace function is set to the stub return.

If DYNAMIC_FTRACE is also configured, on ftrace_enabled = 0,
the registered ftrace functions will all be set to jmps, but no more
new calls to ftrace recording (used to find the ftrace calling sites)
will be called.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:33:19 +02:00
Steven Rostedt
3d0833953e ftrace: dynamic enabling/disabling of function calls
This patch adds a feature to dynamically replace the ftrace code
with the jmps to allow a kernel with ftrace configured to run
as fast as it can without it configured.

The way this works, is on bootup (if ftrace is enabled), a ftrace
function is registered to record the instruction pointer of all
places that call the function.

Later, if there's still any code to patch, a kthread is awoken
(rate limited to at most once a second) that performs a stop_machine,
and replaces all the code that was called with a jmp over the call
to ftrace. It only replaces what was found the previous time. Typically
the system reaches equilibrium quickly after bootup and there's no code
patching needed at all.

e.g.

  call ftrace  /* 5 bytes */

is replaced with

  jmp 3f  /* jmp is 2 bytes and we jump 3 forward */
3:

When we want to enable ftrace for function tracing, the IP recording
is removed, and stop_machine is called again to replace all the locations
of that were recorded back to the call of ftrace.  When it is disabled,
we replace the code back to the jmp.

Allocation is done by the kthread. If the ftrace recording function is
called, and we don't have any record slots available, then we simply
skip that call. Once a second a new page (if needed) is allocated for
recording new ftrace function calls.  A large batch is allocated at
boot up to get most of the calls there.

Because we do this via stop_machine, we don't have to worry about another
CPU executing a ftrace call as we modify it. But we do need to worry
about NMI's so all functions that might be called via nmi must be
annotated with notrace_nmi. When this code is configured in, the NMI code
will not call notrace.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:33:09 +02:00
Steven Rostedt
6cd8a4bb2f ftrace: trace preempt off critical timings
Add preempt off timings. A lot of kernel core code is taken from the RT patch
latency trace that was written by Ingo Molnar.

This adds "preemptoff" and "preemptirqsoff" to /debugfs/tracing/available_tracers

Now instead of just tracing irqs off, preemption off can be selected
to be recorded.

When this is selected, it shares the same files as irqs off timings.
One can either trace preemption off, irqs off, or one or the other off.

By echoing "preemptoff" into /debugfs/tracing/current_tracer, recording
of preempt off only is performed. "irqsoff" will only record the time
irqs are disabled, but "preemptirqsoff" will take the total time irqs
or preemption are disabled. Runtime switching of these options is now
supported by simpling echoing in the appropriate trace name into
/debugfs/tracing/current_tracer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:32:54 +02:00
Steven Rostedt
81d68a96a3 ftrace: trace irq disabled critical timings
This patch adds latency tracing for critical timings
(how long interrupts are disabled for).

 "irqsoff" is added to /debugfs/tracing/available_tracers

Note:
  tracing_max_latency
    also holds the max latency for irqsoff (in usecs).
   (default to large number so one must start latency tracing)

  tracing_thresh
    threshold (in usecs) to always print out if irqs off
    is detected to be longer than stated here.
    If irq_thresh is non-zero, then max_irq_latency
    is ignored.

Here's an example of a trace with ftrace_enabled = 0

=======
preemption latency trace v1.1.5 on 2.6.24-rc7
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--------------------------------------------------------------------
 latency: 100 us, #3/3, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: _spin_lock_irqsave+0x2a/0xb7
 => ended at:   _spin_unlock_irqrestore+0x32/0x5f

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0     1d.s3    0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
 swapper-0     1d.s3  100us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1d.s3  100us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)

vim:ft=help
=======

And this is a trace with ftrace_enabled == 1

=======
preemption latency trace v1.1.5 on 2.6.24-rc7
--------------------------------------------------------------------
 latency: 102 us, #12/12, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: _spin_lock_irqsave+0x2a/0xb7
 => ended at:   _spin_unlock_irqrestore+0x32/0x5f

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0     1dNs3    0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
 swapper-0     1dNs3   46us : e1000_read_phy_reg+0x16/0x225 [e1000] (e1000_update_stats+0x5e2/0x64c [e1000])
 swapper-0     1dNs3   46us : e1000_swfw_sync_acquire+0x10/0x99 [e1000] (e1000_read_phy_reg+0x49/0x225 [e1000])
 swapper-0     1dNs3   46us : e1000_get_hw_eeprom_semaphore+0x12/0xa6 [e1000] (e1000_swfw_sync_acquire+0x36/0x99 [e1000])
 swapper-0     1dNs3   47us : __const_udelay+0x9/0x47 (e1000_read_phy_reg+0x116/0x225 [e1000])
 swapper-0     1dNs3   47us+: __delay+0x9/0x50 (__const_udelay+0x45/0x47)
 swapper-0     1dNs3   97us : preempt_schedule+0xc/0x84 (__delay+0x4e/0x50)
 swapper-0     1dNs3   98us : e1000_swfw_sync_release+0xc/0x55 [e1000] (e1000_read_phy_reg+0x211/0x225 [e1000])
 swapper-0     1dNs3   99us+: e1000_put_hw_eeprom_semaphore+0x9/0x35 [e1000] (e1000_swfw_sync_release+0x50/0x55 [e1000])
 swapper-0     1dNs3  101us : _spin_unlock_irqrestore+0xe/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1dNs3  102us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1dNs3  102us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)

vim:ft=help
=======

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-23 20:32:46 +02:00