This way there is no need to add explicit checks in every
for_each_shadow_entry user.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The performance counter MSRs are different for AMD and Intel CPUs and they
are chosen mainly by the CPUID vendor string. This patch catches writes to
all addresses (regardless of VMX/SVM path) and handles them in the generic
MSR handler routine. Writing a 0 into the event select register is something
we perfectly emulate ;-), so don't print out a warning to dmesg in this
case.
This fixes booting a 64bit Windows guest with an AMD CPUID on an Intel host.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
- Fail early in case gfn_to_pfn returns is_error_pfn.
- For the pre pte write case, avoid spurious "gva is valid but spte is notrap"
messages (the emulation code does the guest write first, so this particular
case is OK).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Under testing, count_writable_mappings returns a value that is 2 integers
larger than what count_rmaps returns.
Suspicion is that either of the two functions is counting a duplicate (either
positively or negatively).
Modifying check_writable_mappings_rmap to check for rmap existance on
all present MMU pages fails to trigger an error, which should keep Avi
happy.
Also introduce mmu_spte_walk to invoke a callback on all present sptes visible
to the current vcpu, might be useful in the future.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Hiding some of the last largepage / level interaction (which is useful
for gbpages and for zero based levels).
Also merge the PT_PAGE_TABLE_LEVEL clearing loop in unlink_children.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Archs are free to use vcpu_id as they see fit. For x86 it is used as
vcpu's apic id. New ioctl is added to configure boot vcpu id that was
assumed to be 0 till now.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We use shadow_pte and spte inconsistently, switch to the shorter spelling.
Rename set_shadow_pte() to __set_spte() to avoid a conflict with the
existing set_spte(), and to indicate its lowlevelness.
Signed-off-by: Avi Kivity <avi@redhat.com>
Since the guest and host ptes can have wildly different format, adjust
the pte accessor names to indicate on which type of pte they operate on.
No functional changes.
Signed-off-by: Avi Kivity <avi@redhat.com>
is_dirty_pte() is used on guest ptes, not shadow ptes, so it needs to avoid
shadow_dirty_mask and use PT_DIRTY_MASK instead.
Misdetecting dirty pages could lead to unnecessarily setting the dirty bit
under EPT.
Signed-off-by: Avi Kivity <avi@redhat.com>
"Unrestricted Guest" feature is added in the VMX specification.
Intel Westmere and onwards processors will support this feature.
It allows kvm guests to run real mode and unpaged mode
code natively in the VMX mode when EPT is turned on. With the
unrestricted guest there is no need to emulate the guest real mode code
in the vm86 container or in the emulator. Also the guest big real mode
code works like native.
The attached patch enhances KVM to use the unrestricted guest feature
if available on the processor. It also adds a new kernel/module
parameter to disable the unrestricted guest feature at the boot time.
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Protect irq injection/acking data structures with a separate irq_lock
mutex. This fixes the following deadlock:
CPU A CPU B
kvm_vm_ioctl_deassign_dev_irq()
mutex_lock(&kvm->lock); worker_thread()
-> kvm_deassign_irq() -> kvm_assigned_dev_interrupt_work_handler()
-> deassign_host_irq() mutex_lock(&kvm->lock);
-> cancel_work_sync() [blocked]
[gleb: fix ia64 path]
Reported-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
None of the interface services the LAPIC emulation provides need to be
exported to modules, and kvm_lapic_get_base is even totally unused
today.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead of reloading the pdptrs on every entry and exit (vmcs writes on vmx,
guest memory access on svm) extract them on demand.
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead of reading the PDPTRs from memory after every exit (which is slow
and wrong, as the PDPTRs are stored on the cpu), sync the PDPTRs from
memory to the VMCS before entry, and from the VMCS to memory after exit.
Do the same for cr3.
Signed-off-by: Avi Kivity <avi@redhat.com>
vmx_set_cr3() will call vmx_tlb_flush(), which will flush the ept context.
So there is no need to call ept_sync_context() explicitly.
Signed-off-by: Avi Kivity <avi@redhat.com>
We currently publish the i8254 resources to the pio_bus before the devices
are fully initialized. Since we hold the pit_lock, its probably not
a real issue. But lets clean this up anyway.
Reported-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
We modernize the io_device code so that we use container_of() instead of
dev->private, and move the vtable to a separate ops structure
(theoretically allows better caching for multiple instances of the same
ops structure)
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Since AMD does not support sysenter in 64bit mode, the VMCB fields storing
the MSRs are truncated to 32bit upon VMRUN/#VMEXIT. So store the values
in a separate 64bit storage to avoid truncation.
[andre: fix amd->amd migration]
Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The in-kernel speaker emulation is only a dummy and also unneeded from
the performance point of view. Rather, it takes user space support to
generate sound output on the host, e.g. console beeps.
To allow this, introduce KVM_CREATE_PIT2 which controls in-kernel
speaker port emulation via a flag passed along the new IOCTL. It also
leaves room for future extensions of the PIT configuration interface.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM provides a complete virtual system environment for guests, including
support for injecting interrupts modeled after the real exception/interrupt
facilities present on the native platform (such as the IDT on x86).
Virtual interrupts can come from a variety of sources (emulated devices,
pass-through devices, etc) but all must be injected to the guest via
the KVM infrastructure. This patch adds a new mechanism to inject a specific
interrupt to a guest using a decoupled eventfd mechnanism: Any legal signal
on the irqfd (using eventfd semantics from either userspace or kernel) will
translate into an injected interrupt in the guest at the next available
interrupt window.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The problem exists only on VMX. Also currently we skip this step if
there is pending exception. The patch fixes this too.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use proper foo-y style list additions to cleanup all the conditionals,
move module selection after compound object selection and remove the
superflous comment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
If we run out of cpuid entries for extended request types
we should return -E2BIG, just like we do for the standard
request types.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Replace 0xc0010010 with MSR_K8_SYSCFG and 0xc0010015 with MSR_K7_HWCR.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The related MSRs are emulated. MCE capability is exported via
extension KVM_CAP_MCE and ioctl KVM_X86_GET_MCE_CAP_SUPPORTED. A new
vcpu ioctl command KVM_X86_SETUP_MCE is used to setup MCE emulation
such as the mcg_cap. MCE is injected via vcpu ioctl command
KVM_X86_SET_MCE. Extended machine-check state (MCG_EXT_P) and CMCI are
not implemented.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use standard msr-index.h's MSR declaration.
MSR_IA32_TSC is better than MSR_IA32_TIME_STAMP_COUNTER as it also solves
80 column issue.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When reinjecting a software interrupt or exception, use the correct
instruction length provided by the hardware instead of a hardcoded 1.
Fixes problems running the suse 9.1 livecd boot loader.
Problem introduced by commit f0a3602c20 ("KVM: Move interrupt injection
logic to x86.c").
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2.6.31-rc7 does not boot on vSMP systems:
[ 8.501108] CPU31: Thermal monitoring enabled (TM1)
[ 8.501127] CPU 31 MCA banks SHD:2 SHD:3 SHD:5 SHD:6 SHD:8
[ 8.650254] CPU31: Intel(R) Xeon(R) CPU E5540 @ 2.53GHz stepping 04
[ 8.710324] Brought up 32 CPUs
[ 8.713916] Total of 32 processors activated (162314.96 BogoMIPS).
[ 8.721489] ERROR: parent span is not a superset of domain->span
[ 8.727686] ERROR: domain->groups does not contain CPU0
[ 8.733091] ERROR: groups don't span domain->span
[ 8.737975] ERROR: domain->cpu_power not set
[ 8.742416]
Ravikiran Thirumalai bisected it to:
| commit 2759c3287d
| x86: don't call read_apic_id if !cpu_has_apic
The problem is that on vSMP systems the CPUID derived
initial-APICIDs are overlapping - so we need to fall
back on hard_smp_processor_id() which reads the local
APIC.
Both come from the hardware (influenced by firmware
though) so it's a tough call which one to trust.
Doing the quirk expresses the vSMP property properly
and also does not affect other systems, so we go for
this solution instead of a revert.
Reported-and-Tested-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shai Fultheim <shai@scalex86.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <4A944D3C.5030100@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Initialize cx before calling xen_cpuid(), in order to suppress the
"may be used uninitialized in this function" warning.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Xen always runs on CPUs which properly support WP enforcement in
privileged mode, so there's no need to test for it.
This also works around a crash reported by Arnd Hannemann, though I
think its just a band-aid for that case.
Reported-by: Arnd Hannemann <hannemann@nets.rwth-aachen.de>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clockevent: Prevent dead lock on clockevents_lock
timers: Drop write permission on /proc/timer_list
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Fix build with older binutils and consolidate linker script
x86: Fix an incorrect argument of reserve_bootmem()
x86: add vmlinux.lds to targets in arch/x86/boot/compressed/Makefile
xen: rearrange things to fix stackprotector
x86: make sure load_percpu_segment has no stackprotector
i386: Fix section mismatches for init code with !HOTPLUG_CPU
x86, pat: Allow ISA memory range uncacheable mapping requests
binutils prior to 2.17 can't deal with the currently possible
situation of a new segment following the per-CPU segment, but
that new segment being empty - objcopy misplaces the .bss (and
perhaps also the .brk) sections outside of any segment.
However, the current ordering of sections really just appears
to be the effect of cumulative unrelated changes; re-ordering
things allows to easily guarantee that the segment following
the per-CPU one is non-empty, and at once eliminates the need
for the bogus data.init2 segment.
Once touching this code, also use the various data section
helper macros from include/asm-generic/vmlinux.lds.h.
-v2: fix !SMP builds.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: <sam@ravnborg.org>
LKML-Reference: <4A94085D02000078000119A5@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This line looks suspicious, because if this is true, then the
'flags' parameter of function reserve_bootmem_generic() will be
unused when !CONFIG_NUMA. I don't think this is what we want.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: akpm@linux-foundation.org
LKML-Reference: <20090821083709.5098.52505.sendpatchset@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As noted in 83d349f35e ("x86: don't send
an IPI to the empty set of CPU's"), some APIC's will be very unhappy
with an empty destination mask. That commit added a WARN_ON() for that
case, and avoided the resulting problem, but didn't fix the underlying
reason for why those empty mask cases happened.
This fixes that, by checking the result of 'cpumask_andnot()' of the
current CPU actually has any other CPU's left in the set of CPU's to be
sent a TLB flush, and not calling down to the IPI code if the mask is
empty.
The reason this started happening at all is that we started passing just
the CPU mask pointers around in commit 4595f9620 ("x86: change
flush_tlb_others to take a const struct cpumask"), and when we did that,
the cpumask was no longer thread-local.
Before that commit, flush_tlb_mm() used to create it's own copy of
'mm->cpu_vm_mask' and pass that copy down to the low-level flush
routines after having tested that it was not empty. But after changing
it to just pass down the CPU mask pointer, the lower level TLB flush
routines would now get a pointer to that 'mm->cpu_vm_mask', and that
could still change - and become empty - after the test due to other
CPU's having flushed their own TLB's.
See
http://bugzilla.kernel.org/show_bug.cgi?id=13933
for details.
Tested-by: Thomas Björnell <thomas.bjornell@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The default_send_IPI_mask_logical() function uses the "flat" APIC mode
to send an IPI to a set of CPU's at once, but if that set happens to be
empty, some older local APIC's will apparently be rather unhappy. So
just warn if a caller gives us an empty mask, and ignore it.
This fixes a regression in 2.6.30.x, due to commit 4595f9620 ("x86:
change flush_tlb_others to take a const struct cpumask"), documented
here:
http://bugzilla.kernel.org/show_bug.cgi?id=13933
which causes a silent lock-up. It only seems to happen on PPro, P2, P3
and Athlon XP cores. Most developers sadly (or not so sadly, if you're
a developer..) have more modern CPU's. Also, on x86-64 we don't use the
flat APIC mode, so it would never trigger there even if the APIC didn't
like sending an empty IPI mask.
Reported-by: Pavel Vilim <wylda@volny.cz>
Reported-and-tested-by: Thomas Björnell <thomas.bjornell@gmail.com>
Reported-and-tested-by: Martin Rogge <marogge@onlinehome.de>
Cc: Mike Travis <travis@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The absence of vmlinux.lds here keeps .vmlinux.lds.cmd from being
included, which in turn leads to it and all its dependents always
getting rebuilt independent of whether they are already up-to-date.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4A8D84670200007800010D31@vpn.id2.novell.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make sure the stack-protector segment registers are properly set up
before calling any functions which may have stack-protection compiled
into them.
[ Impact: prevent Xen early-boot crash when stack-protector is enabled ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
load_percpu_segment() is used to set up the per-cpu segment registers,
which are also used for -fstack-protector. Make sure that the
load_percpu_segment() function doesn't have stackprotector enabled.
[ Impact: allow percpu setup before calling stack-protected functions ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Currently clockevents_notify() is called with interrupts enabled at
some places and interrupts disabled at some other places.
This results in a deadlock in this scenario.
cpu A holds clockevents_lock in clockevents_notify() with irqs enabled
cpu B waits for clockevents_lock in clockevents_notify() with irqs disabled
cpu C doing set_mtrr() which will try to rendezvous of all the cpus.
This will result in C and A come to the rendezvous point and waiting
for B. B is stuck forever waiting for the spinlock and thus not
reaching the rendezvous point.
Fix the clockevents code so that clockevents_lock is taken with
interrupts disabled and thus avoid the above deadlock.
Also call lapic_timer_propagate_broadcast() on the destination cpu so
that we avoid calling smp_call_function() in the clockevents notifier
chain.
This issue left us wondering if we need to change the MTRR rendezvous
logic to use stop machine logic (instead of smp_call_function) or add
a check in spinlock debug code to see if there are other spinlocks
which gets taken under both interrupts enabled/disabled conditions.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: "Pallipadi Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: "Brown Len" <len.brown@intel.com>
LKML-Reference: <1250544899.2709.210.camel@sbs-t61.sc.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: use the right flag for get_vm_area()
percpu, sparc64: fix sparse possible cpu map handling
init: set nr_cpu_ids before setup_per_cpu_areas()
Commit 0e83815be7 changed the
section the initial_code variable gets allocated in, in an
attempt to address a section conflict warning. This, however
created a new section conflict when building without
HOTPLUG_CPU. The apparently only (reasonable) way to address
this is to always use __REFDATA.
Once at it, also fix a second section mismatch when not using
HOTPLUG_CPU.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Robert Richter <robert.richter@amd.com>
LKML-Reference: <4A8AE7CD020000780001054B@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Max Vozeler reported:
> Bug 13877 - bogl-term broken with CONFIG_X86_PAT=y, works with =n
>
> strace of bogl-term:
> 814 mmap2(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0)
> = -1 EAGAIN (Resource temporarily unavailable)
> 814 write(2, "bogl: mmaping /dev/fb0: Resource temporarily unavailable\n",
> 57) = 57
PAT code maps the ISA memory range as WB in the PAT attribute, so that
fixed range MTRR registers define the actual memory type (UC/WC/WT etc).
But the upper level is_new_memtype_allowed() API checks are failing,
as the request here is for UC and the return tracked type is WB (Tracked type is
WB as MTRR type for this legacy range potentially will be different for each
4k page).
Fix is_new_memtype_allowed() by always succeeding the ISA address range
checks, as the null PAT (WB) and def MTRR fixed range register settings
satisfy the memory type needs of the applications that map the ISA address
range.
Reported-and-Tested-by: Max Vozeler <xam@debian.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
An older test-box started hanging at the following point during
bootup:
[ 0.022996] Mount-cache hash table entries: 512
[ 0.024996] Initializing cgroup subsys debug
[ 0.025996] Initializing cgroup subsys cpuacct
[ 0.026995] Initializing cgroup subsys devices
[ 0.027995] Initializing cgroup subsys freezer
[ 0.028995] mce: CPU supports 5 MCE banks
I've bisected it down to commit 4efc0670 ("x86, mce: use 64bit
machine check code on 32bit"), which utilizes the MCE code on
32-bit systems too.
The problem is caused by this detail in my config:
# CONFIG_CPU_SUP_INTEL is not set
This disables the quirks in mce_cpu_quirks() but still enables
MCE support - which then hangs due to the missing quirk
workaround needed on this CPU:
if (c->x86 == 6 && c->x86_model < 0x1A && banks > 0)
mce_banks[0].init = 0;
The safe solution is to not initialize MCEs if we dont know on
what CPU we are running (or if that CPU's support code got
disabled in the config).
Also be a bit more defensive on 32-bit systems: dont do a
boot-time dump of pending MCEs not just on the specific system
that we found a problem with (Pentium-M), but earlier ones as
well.
Now this problem is probably not common and disabling CPU
support is rare - but still being more defensive in something
we turned on for a wide range of CPUs is prudent.
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
LKML-Reference: Message-ID: <4A88E3E4.40506@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On my legacy Pentium M laptop (Acer Extensa 2900) I get bogus MCE on a cold
boot with CONFIG_X86_NEW_MCE enabled, i.e. (after decoding it with mcelog):
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 1 MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-1 UNKNOWN Error
STATUS f200000000000195 MCGSTATUS 0
[ The other STATUS values observed: f2000000000001b5 (... UNKNOWN error)
and f200000000000115 (... READ Error).
To verify that this is not a CONFIG_X86_NEW_MCE bug I also modified
the CONFIG_X86_OLD_MCE code (which doesn't log any MCEs) to dump
content of STATUS MSR before it is cleared during initialization. ]
Since the bogus MCE results in a kernel taint (which in turn disables
lockdep support) don't log boot MCEs on Pentium M (model == 13) CPUs
by default ("mce=bootlog" boot parameter can be be used to get the old
behavior).
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The function uv_acpi_madt_oem_check() has been marked __init,
the struct apic_x2apic_uv_x has been marked __refdata.
The aim is to address the following section mismatch messages:
WARNING: arch/x86/kernel/apic/built-in.o(.data+0x1368): Section mismatch in reference from the variable apic_x2apic_uv_x to the function .cpuinit.text:uv_wakeup_secondary()
The variable apic_x2apic_uv_x references
the function __cpuinit uv_wakeup_secondary()
If the reference is valid then annotate the
variable with __init* or __refdata (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
WARNING: arch/x86/kernel/built-in.o(.data+0x68e8): Section mismatch in reference from the variable apic_x2apic_uv_x to the function .cpuinit.text:uv_wakeup_secondary()
The variable apic_x2apic_uv_x references
the function __cpuinit uv_wakeup_secondary()
If the reference is valid then annotate the
variable with __init* or __refdata (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
WARNING: arch/x86/built-in.o(.text+0x7b36f): Section mismatch in reference from the function uv_acpi_madt_oem_check() to the function .init.text:early_ioremap()
The function uv_acpi_madt_oem_check() references
the function __init early_ioremap().
This is often because uv_acpi_madt_oem_check lacks a __init
annotation or the annotation of early_ioremap is wrong.
WARNING: arch/x86/built-in.o(.text+0x7b38d): Section mismatch in reference from the function uv_acpi_madt_oem_check() to the function .init.text:early_iounmap()
The function uv_acpi_madt_oem_check() references
the function __init early_iounmap().
This is often because uv_acpi_madt_oem_check lacks a __init
annotation or the annotation of early_iounmap is wrong.
WARNING: arch/x86/built-in.o(.data+0x8668): Section mismatch in reference from the variable apic_x2apic_uv_x to the function .cpuinit.text:uv_wakeup_secondary()
The variable apic_x2apic_uv_x references
the function __cpuinit uv_wakeup_secondary()
If the reference is valid then annotate the
variable with __init* or __refdata (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
Signed-off-by: Leonardo Potenza <lpotenza@inwind.it>
LKML-Reference: <200908161855.48302.lpotenza@inwind.it>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The SGI UV Broadcast Assist Unit is used to send TLB shootdown
messages to remote nodes of the system. The header of the
message must contain the subnode id of the block in the
receiving hub that handles such messages. It should always be
0x10, the id of the "LB" block.
It had previously been documented as a "must be zero" field.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
LKML-Reference: <E1Mc1x7-0005Ce-6t@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
percpu code has been assuming num_possible_cpus() == nr_cpu_ids which
is incorrect if cpu_possible_map contains holes. This causes percpu
code to access beyond allocated memories and vmalloc areas. On a
sparc64 machine with cpus 0 and 2 (u60), this triggers the following
warning or fails boot.
WARNING: at /devel/tj/os/work/mm/vmalloc.c:106 vmap_page_range_noflush+0x1f0/0x240()
Modules linked in:
Call Trace:
[00000000004b17d0] vmap_page_range_noflush+0x1f0/0x240
[00000000004b1840] map_vm_area+0x20/0x60
[00000000004b1950] __vmalloc_area_node+0xd0/0x160
[0000000000593434] deflate_init+0x14/0xe0
[0000000000583b94] __crypto_alloc_tfm+0xd4/0x1e0
[00000000005844f0] crypto_alloc_base+0x50/0xa0
[000000000058b898] alg_test_comp+0x18/0x80
[000000000058dad4] alg_test+0x54/0x180
[000000000058af00] cryptomgr_test+0x40/0x60
[0000000000473098] kthread+0x58/0x80
[000000000042b590] kernel_thread+0x30/0x60
[0000000000472fd0] kthreadd+0xf0/0x160
---[ end trace 429b268a213317ba ]---
This patch fixes generic percpu functions and sparc64
setup_per_cpu_areas() so that they handle sparse cpu_possible_map
properly.
Please note that on x86, cpu_possible_map() doesn't contain holes and
thus num_possible_cpus() == nr_cpu_ids and this patch doesn't cause
any behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_counter: Report the cloning task as parent on perf_counter_fork()
perf_counter: Fix an ipi-deadlock
perf: Rework/fix the whole read vs group stuff
perf_counter: Fix swcounter context invariance
perf report: Don't show unresolved DSOs and symbols when -S/-d is used
perf tools: Add a general option to enable raw sample records
perf tools: Add a per tracepoint counter attribute to get raw sample
perf_counter: Provide hw_perf_counter_setup_online() APIs
perf list: Fix large list output by using the pager
perf_counter, x86: Fix/improve apic fallback
perf record: Add missing -C option support for specifying profile cpu
perf tools: Fix dso__new handle() to handle deleted DSOs
perf tools: Fix fallback to cplus_demangle() when bfd_demangle() is not available
perf report: Show the tid too in -D
perf record: Fix .tid and .pid fill-in when synthesizing events
perf_counter, x86: Fix generic cache events on P6-mobile CPUs
perf_counter, x86: Fix lapic printk message
Johannes Stezenbach reported that his Pentium-M based
laptop does not have the local APIC enabled by default,
and hence perfcounters do not get initialized.
Add a fallback for this case: allow non-sampled counters
and return with an error on sampled counters. This allows
'perf stat' to work out of box - and allows 'perf top'
and 'perf record' to fall back on a hrtimer based sampling
method.
( Passing 'lapic' on the boot line will allow hardware
sampling to occur - but if the APIC is disabled
permanently by the hardware then this fallback still
allows more systems to use perfcounters. )
Also decouple perfcounter support from X86_LOCAL_APIC.
-v2: fix typo breaking counters on all other systems ...
Reported-by: Johannes Stezenbach <js@sig21.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Kernel is broken for x86 CPUs without CPUID since 2.6.28. It
crashes with NULL pointer dereference in identify_cpu():
766 generic_identify(c);
767
768--> if (this_cpu->c_identify)
769 this_cpu->c_identify(c);
this_cpu is NULL. This is because it's only initialized in
get_cpu_vendor() function, which is not called if the CPU has
no CPUID instruction.
Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
LKML-Reference: <200908112000.15993.linux@rainbow-software.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Due to an erratum with certain AMD Athlon 64 processors, the
BIOS may need to force enable the LAHF_LM capability.
Unfortunately, in at least one case, the BIOS does this even
for processors that do not support the functionality.
Add a specific check that will clear the feature bit for
processors known not to support the LAHF/SAHF instructions.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Acked-by: Borislav Petkov <petkovbb@googlemail.com>
LKML-Reference: <4A80A5AD.2000209@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Johannes Stezenbach reported that 'perf stat' does not count
cache-miss and cache-references events on his Pentium-M based
laptop.
This is because we left them blank in p6_perfmon_event_map[],
fill them in.
Reported-by: Johannes Stezenbach <js@sig21.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
My Latitude d630 seems to be handling thermal events in SMI by
lowering the max frequency of the CPU till it cools down but
still leaks the "everything is normal" events.
This spams the console and with high priority printks.
Adjust therm_throt driver to only print messages about the fact
that temperatire returned back to normal when leaving the
throttling state.
Also lower the severity of "back to normal" message from
KERN_CRIT to KERN_INFO.
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Acked-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <20090810051513.0558F526EC9@mailhub.coreip.homeip.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reboot does not work on my MacBook Pro 13 inch (MacBookPro5,5)
too. It seems all unibody MacBook and MacBookPro require
PCI reboot handling, i guess.
Following model/machine ID list shows unibody MacBook/Pro have
the 5 series of model number:
http://www.everymac.com/systems/by_capability/macs-by-machine-model-machine-id.html
Signed-off-by: Shunichi Fuji <palglowr@gmail.com>
Cc: Ozan Çağlayan <ozan@pardus.org.tr>
LKML-Reference: <30046e3b0908101134p6487ddbftd8776e4ddef204be@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Wei Chong Tan reported a fast-PIT-calibration corner-case:
| pit_expect_msb() is vulnerable to SMI disturbance corner case
| in some platforms which causes /proc/cpuinfo to show wrong
| CPU MHz value when quick_pit_calibrate() jumps to success
| section.
I think that the real issue isn't even an SMI - but the fact
that in the very last iteration of the loop, there's no
serializing instruction _after_ the last 'rdtsc'. So even in
the absense of SMI's, we do have a situation where the cycle
counter was read without proper serialization.
The last check should be done outside the outer loop, since
_inside_ the outer loop, we'll be testing that the PIT has
the right MSB value has the right value in the next iteration.
So only the _last_ iteration is special, because that's the one
that will not check the PIT MSB value any more, and because the
final 'get_cycles()' isn't serialized.
In other words:
- I'd like to move the PIT MSB check to after the last
iteration, rather than in every iteration
- I think we should comment on the fact that it's also a
serializing instruction and so 'fences in' the TSC read.
Here's a suggested replacement.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: "Tan, Wei Chong" <wei.chong.tan@intel.com>
Tested-by: "Tan, Wei Chong" <wei.chong.tan@intel.com>
LKML-Reference: <B28277FD4E0F9247A3D55704C440A140D5D683F3@pgsmsx504.gar.corp.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'kvm-updates/2.6.31' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Avoid redelivery of edge interrupt before next edge
KVM: MMU: limit rmap chain length
KVM: ia64: fix build failures due to ia64/unsigned long mismatches
KVM: Make KVM_HPAGES_PER_HPAGE unsigned long to avoid build error on powerpc
KVM: fix ack not being delivered when msi present
KVM: s390: fix wait_queue handling
KVM: VMX: Fix locking imbalance on emulation failure
KVM: VMX: Fix locking order in handle_invalid_guest_state
KVM: MMU: handle n_free_mmu_pages > n_alloc_mmu_pages in kvm_mmu_change_mmu_pages
KVM: SVM: force new asid on vcpu migration
KVM: x86: verify MTRR/PAT validity
KVM: PIT: fix kpit_elapsed division by zero
KVM: Fix KVM_GET_MSR_INDEX_LIST
If the vendor name (from c16) can be longer than 100 bytes (or missing a
terminating null), then the null is written past the end of vendor[].
Found with Parfait, http://research.sun.com/projects/parfait/
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Ying <ying.huang@intel.com>
MacBookPro5,1 is not able to reboot unless reboot=pci is set.
This patch forces it through a DMI quirk specific to this
device.
Signed-off-by: Ozan Çağlayan <ozan@pardus.org.tr>
LKML-Reference: <1249403971-6543-1-git-send-email-ozan@pardus.org.tr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Otherwise the host can spend too long traversing an rmap chain, which
happens under a spinlock.
Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We have to disable preemption and IRQs on every exit from
handle_invalid_guest_state, otherwise we generate at least a
preempt_disable imbalance.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Release and re-acquire preemption and IRQ lock in the same order as
vcpu_enter_guest does.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_mmu_change_mmu_pages mishandles the case where n_alloc_mmu_pages is
smaller then n_free_mmu_pages, by not checking if the result of
the subtraction is negative.
Its a valid condition which can happen if a large number of pages has
been recently freed.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If a migrated vcpu matches the asid_generation value of the target pcpu,
there will be no TLB flush via TLB_CONTROL_FLUSH_ALL_ASID.
The check for vcpu.cpu in pre_svm_run is meaningless since svm_vcpu_load
already updated it on schedule in.
Such vcpu will VMRUN with stale TLB entries.
Based on original patch from Joerg Roedel (http://patchwork.kernel.org/patch/10021/)
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Do not allow invalid memory types in MTRR/PAT (generating a #GP
otherwise).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Fix division by zero triggered by latch count command on uninitialized
counter.
Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
So far, KVM copied the emulated_msrs (only MSR_IA32_MISC_ENABLE) to a
wrong address in user space due to broken pointer arithmetic. This
caused subtle corruption up there (missing MSR_IA32_MISC_ENABLE had
probably no practical relevance). Moreover, the size check for the
user-provided kvm_msr_list forgot about emulated MSRs.
Cc: stable@kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
With CONFIG_STACK_PROTECTOR turned on, VMI doesn't boot with
more than one processor. The problem is with the gs value not
being initialized correctly when registering the secondary
processor for VMI's case.
The patch below initializes the gs value for the AP to
__KERNEL_STACK_CANARY. Without this the secondary processor
keeps on taking a GP on every gs access.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: <stable@kernel.org> # for v2.6.30.x
LKML-Reference: <1249425262.18955.40.camel@ank32.eng.vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Work around compilation warning in arch/x86/kernel/apm_32.c
x86, UV: Complete IRQ interrupt migration in arch_enable_uv_irq()
x86, 32-bit: Fix double accounting in reserve_top_address()
x86: Don't use current_cpu_data in x2apic phys_pkg_id
x86, UV: Fix UV apic mode
x86, UV: Fix macros for accessing large node numbers
x86, UV: Delete mapping of MMR rangs mapped by BIOS
x86, UV: Handle missing blade-local memory correctly
x86: fix assembly constraints in native_save_fl()
x86, msr: execute on the correct CPU subset
x86: Fix assert syntax in vmlinux.lds.S
x86: Make 64-bit efi_ioremap use ioremap on MMIO regions
x86: Add quirk to make Apple MacBook5,2 use reboot=pci
x86: Fix CPA memtype reserving in the set_pages_array*() cases
x86, pat: Fix set_memory_wc related corruption
x86: fix section mismatch for i386 init code
The following fix was initially inspired by David Howells fix
few days back:
http://lkml.org/lkml/2009/7/9/109
However, Ingo disapproves such fixes as it's dangerous (it can
hide future, relevant warnings) - in something as
performance-uncritical.
So, initialize 'err' to '0' to work around a GCC false positive
warning:
http://lkml.org/lkml/2009/7/18/89
Signed-off-by: Subrata Modak<subrata@linux.vnet.ibm.com>
Cc: Sachin P Sant <sachinp@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
LKML-Reference: <20090721023226.31855.67236.sendpatchset@subratamodak.linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In uv_setup_irq(), the call to create_irq() initially assigns
IRQ vectors to cpu 0. The subsequent call to
assign_irq_vector() in arch_enable_uv_irq() migrates the IRQ to
another cpu and frees the cpu 0 vector - at least it will be
freed as soon as the "IRQ move" completes.
arch_enable_uv_irq() needs to send a cleanup IPI to complete
the IRQ move. Otherwise, assignment of GRU interrupts on large
systems (>200 cpus) will exhaust the cpu 0 interrupt vectors
and initialization of the GRU driver will fail.
Signed-off-by: Jack Steiner <steiner@sgi.com>
LKML-Reference: <20090720142840.GA8885@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With VMALLOC_END included in the calculation of MAXMEM (as of
2.6.28) it is no longer correct to also bump __VMALLOC_RESERVE
in reserve_top_address(). Doing so results in needlessly small
lowmem.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4A71DD2A020000780000D482@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change SGI UV default apicid mode to "physical". This is
required to match settings in the UV hub chip.
Signed-off-by: Jack Steiner <steiner@sgi.com>
LKML-Reference: <20090727143856.GA8905@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The UV chipset automatically supplies the upper bits on nodes
being referenced by MMR accesses. These bit can be deleted from
the hub addressing macros.
Signed-off-by: Jack Steiner <steiner@sgi.com>
LKML-Reference: <20090727143808.GA8076@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The UV BIOS has added additional MMR ranges that are mapped via
EFI virtual mode mappings. These ranges should be deleted from
ranges mapped by uv_system_init().
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: linux-mm@kvack.org
LKML-Reference: <20090727143656.GA7698@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
UV blades may not have any blade-local memory. Add a field
(nid) to the UV blade structure to indicates whether the node
has local memory. This is needed by the GRU driver (pushed
separately).
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: linux-mm@kvack.org
LKML-Reference: <20090727143507.GA7006@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From Gabe Black in bugzilla 13888:
native_save_fl is implemented as follows:
11static inline unsigned long native_save_fl(void)
12{
13 unsigned long flags;
14
15 asm volatile("# __raw_save_flags\n\t"
16 "pushf ; pop %0"
17 : "=g" (flags)
18 : /* no input */
19 : "memory");
20
21 return flags;
22}
If gcc chooses to put flags on the stack, for instance because this is
inlined into a larger function with more register pressure, the offset
of the flags variable from the stack pointer will change when the
pushf is performed. gcc doesn't attempt to understand that fact, and
address used for pop will still be the same. It will write to
somewhere near flags on the stack but not actually into it and
overwrite some other value.
I saw this happen in the ide_device_add_all function when running in a
simulator I work on. I'm assuming that some quirk of how the simulated
hardware is set up caused the code path this is on to be executed when
it normally wouldn't.
A simple fix might be to change "=g" to "=r".
Reported-by: Gabe Black <spamforgabe@umich.edu>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Stable Team <stable@kernel.org>
Make rdmsr_on_cpus/wrmsr_on_cpus execute on the current CPU only if it
is in the supplied bitmask.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Older versions of binutils did not accept the naked "ASSERT" syntax;
it is considered an expression whose value needs to be assigned to
something.
Reported-tested-and-fixed-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>