I have had four seperate system lockups attributable to this exact problem
in two days of testing. Instead of trying to handle all the weird end
cases and wrap, how about changing it to look for exactly what we appear
to want.
The following patch removes a couple races in setup_APIC_timer. One occurs
when the HPET advances the COUNTER past the T0_CMP value between the time
the T0_CMP was originally read and when COUNTER is read. This results in
a delay waiting for the counter to wrap. The other results from the counter
wrapping.
This change takes a snapshot of T0_CMP at the beginning of the loop and
simply loops until T0_CMP has changed (a tick has happened).
<later>
I have one small concern about the patch. I am not sure it meets the intent
as well as it should. I think we are trying to match APIC timer interrupts up
with the hpet counter increment. The event which appears to be disturbing
this loop in our test environment is the NMI watchdog. What we believe has
been happening with the existing code is the setup_APIC_timer loop has read
the CMP value, and the NMI watchdog code fires for the first time. This
results in a series of icache miss slowdowns and by the time we get back to
things it has wrapped.
I think this code is trying to get the CMP as close to the counter value as
possible. If that is the intent, maybe we should really be testing against a
"window" around the CMP. Something like COUNTER = CMP+/2. It appears COUNTER
should get advanced every 89nSec (IIRC). The above seems like an unreasonably
small window, but may be necessary. Without documentation, I am not sure of
the original intent with this code.
In summary, this code fixes my boot hangs, but since I am not certain of the
intent of the existing code, I am not certain this has not introduced new bugs
or unexpected behaviors.
Signed-off-by: Robin Holt <holt@sgi.com>
Acked-by: Andi Kleen <ak@suse.de>
Cc: Vojtech Pavlik <vojtech@suse.cz>
Cc: "Aaron Durbin" <adurbin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.
Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Use 64bit TSC calculations to avoid handling overflow
- Use 32bit unsigned arithmetic for the APIC timer. This
way overflows are handled correctly.
- Fix exit check of loop to account for apic timer counting down
Signed-off-by: dpreed@reed.com
Signed-off-by: Andi Kleen <ak@suse.de>
apic_wait_icr_idle looks like this:
static __inline__ void apic_wait_icr_idle(void)
{
while (apic_read(APIC_ICR) & APIC_ICR_BUSY)
cpu_relax();
}
The busy loop in this function would not be problematic if the
corresponding status bit in the ICR were always updated, but that does
not seem to be the case under certain crash scenarios. Kdump uses an IPI
to stop the other CPUs in the event of a crash, but when any of the
other CPUs are locked-up inside the NMI handler the CPU that sends the
IPI will end up looping forever in the ICR check, effectively
hard-locking the whole system.
Quoting from Intel's "MultiProcessor Specification" (Version 1.4), B-3:
"A local APIC unit indicates successful dispatch of an IPI by
resetting the Delivery Status bit in the Interrupt Command
Register (ICR). The operating system polls the delivery status
bit after sending an INIT or STARTUP IPI until the command has
been dispatched.
A period of 20 microseconds should be sufficient for IPI dispatch
to complete under normal operating conditions. If the IPI is not
successfully dispatched, the operating system can abort the
command. Alternatively, the operating system can retry the IPI by
writing the lower 32-bit double word of the ICR. This “time-out”
mechanism can be implemented through an external interrupt, if
interrupts are enabled on the processor, or through execution of
an instruction or time-stamp counter spin loop."
Intel's documentation suggests the implementation of a time-out
mechanism, which, by the way, is already being open-coded in some parts
of the kernel that tinker with ICR.
Create a apic_wait_icr_idle replacement that implements the time-out
mechanism and that can be used to solve the aforementioned problem.
AK: moved both functions out of line
AK: Added improved loop from Keith Owens
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
Ray Lee reported, that on an UP kernel with "noapic" command line option
set, the box locks hard during boot.
Adding some debug printks revealed, that the last action on the box
before stalling was "Send IPI" - a debug printk which was put into
smp_send_timer_broadcast_ipi().
It seems that send_IPI_mask(mask, LOCAL_TIMER_VECTOR) fails when
"noapic" is set on the command line on an UP kernel.
Aside of that it does not make much sense to trigger an interrupt
instead of calling the function directly on the CPU which gets the
PIT/HPET interrupt in case of broadcasting.
Reported-by: Ray Lee <ray-lk@madrabbit.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ray Lee <ray-lk@madrabbit.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Needed for any architecture that claims ARCH_APICTIMER_STOPS_ON_C3,
not just i386.
I'm hoping Thomas will clean this up a bit later..
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch converts x86_64 to use the GENERIC_TIME infrastructure and adds
clocksource structures for both TSC and HPET (ACPI PM is shared w/ i386).
[akpm@osdl.org: fix printk timestamps]
[akpm@osdl.org: fix printk ckeanups]
[akpm@osdl.org: hpet build fix]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for supporting generic timekeeping, this patch cleans up
x86-64's use of vxtime.hpet_address, changing it to just hpet_address as is
also used in i386. This is necessary since the vxtime structure will be going
away.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Read/Write APIC_LVTPC and APIC_LVTTHMR only,
if get_maxlvt() returns certain values.
This is done like everywhere else in i386/kernel/apic.c,
so I guess its correct.
Suspends/Resumes to disk fine and eleminates an smp_error_interrupt()
here on a K8.
AK: ported to x86-64 too
Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Insert the Local APIC and IO APIC(s) into the resource tree. It allows the
APIC resources to be visible within /proc/iomem. The patch also takes into
account IO APIC(s) mapped in the PCI space by deferring the insertion until
after PCI has allocated its necessary resources.
Signed-off-by: Aaron Durbin <adurbin@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@muc.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
smp_apic_timer_interrupt() needs to stack the pt_regs* for profile_tick.
If any other of those APIC interrupt handlers want to run get_irq_regs() then
their C entrypoint handlers will need the same treatment.
Cc: Andi Kleen <ak@muc.de>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Acked-by: Andrew Vasquez <andrew.vasquez@qlogic.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Maintain a per-CPU global "struct pt_regs *" variable which can be used instead
of passing regs around manually through all ~1800 interrupt handlers in the
Linux kernel.
The regs pointer is used in few places, but it potentially costs both stack
space and code to pass it around. On the FRV arch, removing the regs parameter
from all the genirq function results in a 20% speed up of the IRQ exit path
(ie: from leaving timer_interrupt() to leaving do_IRQ()).
Where appropriate, an arch may override the generic storage facility and do
something different with the variable. On FRV, for instance, the address is
maintained in GR28 at all times inside the kernel as part of general exception
handling.
Having looked over the code, it appears that the parameter may be handed down
through up to twenty or so layers of functions. Consider a USB character
device attached to a USB hub, attached to a USB controller that posts its
interrupts through a cascaded auxiliary interrupt controller. A character
device driver may want to pass regs to the sysrq handler through the input
layer which adds another few layers of parameter passing.
I've build this code with allyesconfig for x86_64 and i386. I've runtested the
main part of the code on FRV and i386, though I can't test most of the drivers.
I've also done partial conversion for powerpc and MIPS - these at least compile
with minimal configurations.
This will affect all archs. Mostly the changes should be relatively easy.
Take do_IRQ(), store the regs pointer at the beginning, saving the old one:
struct pt_regs *old_regs = set_irq_regs(regs);
And put the old one back at the end:
set_irq_regs(old_regs);
Don't pass regs through to generic_handle_irq() or __do_IRQ().
In timer_interrupt(), this sort of change will be necessary:
- update_process_times(user_mode(regs));
- profile_tick(CPU_PROFILING, regs);
+ update_process_times(user_mode(get_irq_regs()));
+ profile_tick(CPU_PROFILING);
I'd like to move update_process_times()'s use of get_irq_regs() into itself,
except that i386, alone of the archs, uses something other than user_mode().
Some notes on the interrupt handling in the drivers:
(*) input_dev() is now gone entirely. The regs pointer is no longer stored in
the input_dev struct.
(*) finish_unlinks() in drivers/usb/host/ohci-q.c needs checking. It does
something different depending on whether it's been supplied with a regs
pointer or not.
(*) Various IRQ handler function pointers have been moved to type
irq_handler_t.
Signed-Off-By: David Howells <dhowells@redhat.com>
(cherry picked from 1b16e7ac850969f38b375e511e3fa2f474a33867 commit)
Commit 54dbc0c9eb is causing various
people's machines to fail to map PCI resources.
Revert it in preparation for addressing the show-APICs-in-/proc/iomem
requirement in a different manner.
Cc: Aaron Durbin <adurbin@google.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch places the IOAPIC(s) and the Local APIC specified by ACPI
tables into the resource map. The APICs will then be visible within
/proc/iomem
Signed-off-by: Aaron Durbin <adurbin@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Lockdep can call the dwarf2 unwinder early, and the dwarf2 code
uses safe_smp_processor_id which tries to access the local APIC page.
But that doesn't work before the APIC code has set up its fixmap.
Check for this case and always return boot cpu then.
Cc: jbeulich@novell.com
Cc: mingo@elte.hu
Signed-off-by: Andi Kleen <ak@suse.de>
The combination of "local_save_flags" and "local_irq_disable" seems to be
equivalent to "local_irq_save" (see code snips below). Consequently, replace
occurrences of local_save_flags+local_irq_disable with local_irq_save.
* local_irq_save
#define raw_local_irq_save(flags) \
do { (flags) = __raw_local_irq_save(); } while (0)
static inline unsigned long __raw_local_irq_save(void)
{
unsigned long flags = __raw_local_save_flags();
raw_local_irq_disable();
return flags;
}
* local_save_flags
#define raw_local_save_flags(flags) \
do { (flags) = __raw_local_save_flags(); } while (0)
Signed-off-by: Fernando Vazquez <fernando@intellilink.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
Instead of hackish manual parsing
Requires earlier i386 patchkit, but also fixes i386 early_printk again.
I removed some obsolete really early parameters which didn't do anything useful.
Also made a few parameters that needed it early (mostly oops printing setup)
Also removed one panic check that wasn't visible without
early console anyways (the early console is now initialized after that
panic)
This cleans up a lot of code.
Signed-off-by: Andi Kleen <ak@suse.de>
PIC mode is an outdated way to drive the APICs that was used on
some early MP boards. It is not supported in the ACPI model.
It is unlikely to be ever configured by any x86-64 system
Remove it thus.
Signed-off-by: Andi Kleen <ak@suse.de>
IO-APIC or local APIC can only be disabled at runtime anyways and
Kconfig has forced these options on for a long time now.
The Kconfigs are kept only now for the benefit of the shared acpi
boot.c code.
Signed-off-by: Andi Kleen <ak@suse.de>
A few trivial spelling and grammar mistakes picked up in
"arch/x86_64/aperture.c", "arch/x86_64/crash.c" and
"arch/x86_64/apic.c". I think all are correct fixes but am ever aware
of my fallibility :o) This is my first patch submission so all
feedback is appreciated, esp. WRT CCing to Linus, Andi and
trivial@kernel.org, is this correct? And which is the most appropriate
kernel version to diff against? If any.
Should apply cleanly to 2.6.18-rc1
Signed-off-by: Adam Henley <adamazing@gmail.com>
Signed-off-by: Andi Kleen <ak@suse.de>
- adam
This patch includes the changes to make the nmi watchdog on x86_64 SMP
aware. A bunch of code was moved around to make it simpler to read. In
addition, it is now possible to determine if a particular NMI was the result
of the watchdog or not. This feature allows the kernel to filter out
unknown NMIs easier.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Appended patch fixes the "APIC error on CPUX: 00(40)" observed during bootup.
From SDM Vol-3A "Valid Interrupt Vectors" section:
"When an illegal vector value (0-15) is written to an LVT entry
and the delivery mode is Fixed, the APIC may signal an illegal
vector error, with out regard to whether the mask bit is set
or whether an interrupt is actually seen on input."
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add support for extended APIC LVT found in future AMD processors.
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rename oem_force_hpet_timer to apic_is_clustered_box, to give the
function a better fitting name - it really isn't at all about HPET.
Signed-off-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The boot cmdline is parsed in parse_early_param() and
parse_args(,unknown_bootoption).
And __setup() is used in obsolete_checksetup().
start_kernel()
-> parse_args()
-> unknown_bootoption()
-> obsolete_checksetup()
If __setup()'s callback (->setup_func()) returns 1 in
obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
handled.
If ->setup_func() returns 0, obsolete_checksetup() tries other
->setup_func(). If all ->setup_func() that matched a parameter returns 0,
a parameter is seted to argv_init[].
Then, when runing /sbin/init or init=app, argv_init[] is passed to the app.
If the app doesn't ignore those arguments, it will warning and exit.
This patch fixes a wrong usage of it, however fixes obvious one only.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o check_timer() routine fails while second kernel is booting after a crash
on an opetron box. Problem happens because timer vector (0x31) seems to be
locked.
o After a system crash, it is not safe to service interrupts any more, hence
interrupts are disabled. This leads to pending interrupts at LAPIC. LAPIC
sends these interrupts to the CPU during early boot of second kernel. Other
pending interrupts are discarded saying unexpected trap but timer interrupt
is serviced and CPU does not issue an LAPIC EOI because it think this
interrupt came from i8259 and sends ack to 8259. This leads to vector 0x31
locking as LAPIC does not clear respective ISR and keeps on waiting for
EOI.
o This patch issues extra EOI for the pending interrupts who have ISR set.
o Though today only timer seems to be the special case because in early
boot it thinks interrupts are coming from i8259 and uses
mask_and_ack_8259A() as ack handler and does not issue LAPIC EOI. But
probably doing it in generic manner for all vectors makes sense.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This reverts commit 13a229abc2.
Quoth Andi:
"After some consideration and feedback from various people it turns
out this wasn't that good an idea. It has some problems and needs
more work. Since it was only an optimization anyways it's best to
just back it out again for now."
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Big Unisys systems have multiple clusters too, but they have an
synchronized TSC.
I'm using the SMBIOS to check for vendor == IBM.
Cc: Chris McDermott <lcm@us.ibm.com>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@unisys.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[description from AK]
The IBM Summit 3 chipset doesn't implement the HPET timer replacement
option. Since the current Linux code relies on it use a mixed mode with
both PIT for the interrupt and HPET counters for the time keeping. That
was already implemented, but didn't work properly because it was still
using the last interrupt offset in HPET. This resulted in x460 not
booting. Fix this up by using the free running HPET counter.
Shouldn't affect any other machine because they either use full HPET mode
or no HPET at all.
TBD needs a similar 32bit fix.
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Bob Picco <bob.picco@hp.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It's bad juju to touch the APIC when it hasn't been enabled.
I also moved ack_bad_irq for x86-64 out of line following i386.
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On some broken motherboards (at least one NForce3 based AMD64 laptop)
the PIT timer runs at a incorrect frequency. This patch adds a new
option "apicpmtimer" that allows to use the APIC timer and calibrate it
using the PMTimer. It requires the earlier patch that allows to run the
main timer from the APIC.
Specifying apicpmtimer implies apicmaintimer.
The option defaults to off for now.
I tested it on a few systems and the resulting APIC timer frequencies
were usually a bit off, but always <1%, which should be tolerable.
TBD figure out heuristic to enable this automatically on the affected
systems TBD perhaps do it on all NForce3s or using DMI?
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Another piece from the no-idle-tick patch.
This can be enabled with the "apicmaintimer" option.
This is mainly useful when the PIT/HPET interrupt is unreliable.
Note there are some systems that are known to stop the APIC
timer in C3. For those it will never work, but this case
should be automatically detected.
It also only works with PM timer right now. When HPET is used
the way the main timer handler computes the delay doesn't work.
It should be a bit more efficient because there is one less
regular interrupt to process on the boot processor.
Requires earlier bugfix from Venkatesh
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove support for obsolete hardware and cleanup.
- Remove checks for non integrated APICs
- Replace apic_write_around with apic_write.
- Remove apic_read_around
- Remove APIC version reads used by old workarounds
- Remove old workaround for Simics
- Fix indentation
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Apic id is in most significant 8 bits of APIC_ID register. Current code
is trying to write apic id to least significant 8 bits. This patch fixes
it.
o This fix enables booting uni kdump capture kernel on a cpu with non-zero
apic id.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adds a new notifier chain that is called with IDLE_START
when a CPU goes idle and IDLE_END when it goes out of idle.
The context can be idle thread or interrupt context.
Since we cannot rely on MONITOR/MWAIT existing the idle
end check currently has to be done in all interrupt
handlers.
They were originally inspired by the similar s390 implementation.
They have a variety of applications:
- They will be needed for CONFIG_NO_IDLE_HZ
- They can be used for oprofile to fix up the missing time
in idle when performance counters don't tick.
- They can be used for better C state management in ACPI
- They could be used for microstate accounting.
This is just infrastructure so far, no users.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Whenever we see that a CPU is capable of C3 (during ACPI cstate init), we
disable local APIC timer and switch to using a broadcast from external timer
interrupt (IRQ 0).
Patch below adds the code for x86_64.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the finer control of local APIC timer. We cannot provide a sub-jiffy
control like this when we use broadcast from external timer in place of
local APIC. Instead of removing this only on systems that may end up using
broadcast from external timer (due to C3), I am going the
"I'm feeling lucky" way to remove this fully. Basically, I am not sure about
usefulness of this code today. Few other architectures also don't seem to
support this today.
If you are using profiling and fine grained control and don't like this going
away in normal case, yell at me right now.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
MC4_MISC - DRAM Errors Threshold Register realized under AMD K8 Rev F.
This register is used to count correctable and uncorrectable ECC errors that occur during DRAM read operations.
The user may interface through sysfs files in order to change the threshold configuration.
bank%d/error_count - reads current error count, write to clear.
bank%d/interrupt_enable - set/clear interrupt enable.
bank%d/threshold_limit - read/write the threshold limit.
APIC vector 0xF9 in hw_irq.h.
5 software defined bank ids in mce.h.
new apic.c function to setup threshold apic lvt.
defaults to interrupt off, count enabled, and threshold limit max.
sysfs interface created on /sys/devices/system/threshold.
AK: added some ifdefs to make it compile on UP
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... and with that all instances in arch/x86_64 are gone.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>