check_tsc_sync_source() depends on being called with irqs disabled (it
checks whether the TSC is coherent across two specific CPUs). This is
incidentally true during bootup, but not during cpu hotplug __cpu_up().
This got found via smp_processor_id() debugging.
disable irqs explicitly and remove the unconditional enabling of
interrupts. Add touch_nmi_watchdog() to the cpu_online_map busy loop.
this bug is present both on i386 and on x86_64.
Reported-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch resolves the issue found here:
http://bugme.osdl.org/show_bug.cgi?id=7426
The basic summary is:
Currently we register most of i386/x86_64 clocksources at module_init
time. Then we enable clocksource selection at late_initcall time. This
causes some problems for drivers that use gettimeofday for init
calibration routines (specifically the es1968 driver in this case),
where durring module_init, the only clocksource available is the low-res
jiffies clocksource. This may cause slight calibration errors, due to
the small sampling time used.
It should be noted that drivers that require fine grained time may not
function on architectures that do not have better then jiffies
resolution timekeeping (there are a few). However, this does not
discount the reasonable need for such fine-grained timekeeping at init
time.
Thus the solution here is to register clocksources earlier (ideally when
the hardware is being initialized), and then we enable clocksource
selection at fs_initcall (before device_initcall).
This patch should probably get some testing time in -mm, since
clocksource selection is one of the most important issues for correct
timekeeping, and I've only been able to test this on a few of my own
boxes.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When removing set_native_irq I missed the fact that it was
called in a couple of places that were compiled even when
SMP support is disabled. And since the irq_desc[].affinity
field only exists in SMP things broke.
Thanks to Simon Arlott <simon@arlott.org> for spotting this.
There are a couple of ways to fix this but the simplest one
is to just remove the assignments. The affinity field is only
used to display a value to the user, and nothing on either i386
or x86_64 reads it or depends on it being any particlua value,
so skipping the assignment is safe. The assignment that
is being removed is just for the initial affinity value before
the user explicitly sets it. The irq_desc array initializes
this field to CPU_MASK_ALL so the field is initialized to
a reasonable value in the SMP case without being set.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The problem: After moving an interrupt when is it safe to teardown
the data structures for receiving the interrupt at the old location?
With a normal pci device it is possible to issue a read to a device
to flush all posted writes. This does not work for the oldest ioapics
because they are on a 3-wire apic bus which is a completely different
data path. For some more modern ioapics when everything is using
front side bus delivery you can flush interrupts by simply issuing a
read to the ioapic. For other modern ioapics emperical testing has
shown that this does not work.
So it appears the only reliable way to know the last of the irqs from an
ioapic have been received from before the ioapic was reprogrammed is to
received the first irq from the ioapic from after it was reprogrammed.
Once we know the last irq message has been received from an ioapic
into a local apic we then need to know that irq message has been
processed through the local apics.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For the ISA irqs we reserve 16 vectors. This patch adds constants for
those vectors and modifies the code to use them. Making the code a
little clearer and making it possible to move these vectors in the future.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code in io_apic.c and in i8259.c currently hardcode the same
vector for the timer interrupt so there is no reason for a special
assignment for the timer as the setup for the i8259 already takes care
of this.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently assign_irq_vector works mostly by side effect and returns
the results of it's changes to the caller. Which makes for a lot of
arguments to pass/return and confusion as to what to do if you need
the status but you aren't calling assign_irq_vector.
This patch stops returning values from assign_irq_vector that can be
retrieved just as easily by examining irq_cfg, and modifies the
callers to retrive those values from irq_cfg when they need them.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the io_apic.c has several parallel arrays for different
kinds of data that can be know about an irq. The parallel arrays
make the code harder to maintain and make it difficult to remove
the static limits on the number of the number of irqs.
This patch pushes irq_data and irq_vector into a irq_cfg array and
updates the code to use it.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NR_IRQ_VECTORS is currently a compatiblity define set to NR_IRQs.
This patch updates the users of NR_IRQ_VECTORS to use NR_IRQs instead
so that NR_IRQ_VECTORS can be removed.
There is still shared code with arch/i386 that uses NR_IRQ_VECTORS
so we can't remove the #define just yet :(
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we have an irq that comes from multiple io_apic pins the FINAL action
(which is io_apic_sync or nothing) needs to be called for every entry or
else if the two pins come from different io_apics we may not wait until
after the action happens on the io_apic.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For some reason the code has been picking TARGET_CPUS when asked to
set the affinity to an empty set of cpus. That is just silly it's
extra work. Instead if there are no cpus to set the affinity to we
should just give up immediately. That is simpler and a little more
intuitive.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we have two routines that do practically the same thing
setup_IO_APIC_irq and io_apic_set_pci_routing. This patch makes
setup_IO_APIC_irq the common factor of these two previous routines.
For setup_IO_APIC_irq all that was needed was to pass the trigger
and polarity to make the code a proper subset of io_apic_set_pci_routing.
Hopefully consolidating these two routines will improve maintenance
there were several differences that simply appear to be one routine
or the other getting it wrong.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch replaces all instances of "set_native_irq_info(irq, mask)"
with "irq_desc[irq].affinity = mask". The latter form is clearer
uses fewer abstractions, and makes access to this field uniform
accross different architectures.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By precomputing old_mask I remove an extra if statement, remove an
indentation level and make the code slightly easier to read.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
Documentation/kernel-docs.txt update.
arch/cris: typo in KERN_INFO
Storage class should be before const qualifier
kernel/printk.c: comment fix
update I/O sched Kconfig help texts - CFQ is now default, not AS.
Remove duplicate listing of Cris arch from README
kbuild: more doc. cleanups
doc: make doc. for maxcpus= more visible
drivers/net/eexpress.c: remove duplicate comment
add a help text for BLK_DEV_GENERIC
correct a dead URL in the IP_MULTICAST help text
fix the BAYCOM_SER_HDX help text
fix SCSI_SCAN_ASYNC help text
trivial documentation patch for platform.txt
Fix typos concerning hierarchy
Fix comment typo "spin_lock_irqrestore".
Fix misspellings of "agressive".
drivers/scsi/a100u2w.c: trivial typo patch
Correct trivial typo in log2.h.
Remove useless FIND_FIRST_BIT() macro from cardbus.c.
...
Now that disable_irq() defaults to delayed-disable semantics, the IRQ_DISABLED
flag is not needed anymore.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Never mask interrupts immediately upon request. Disabling interrupts in
high-performance codepaths is rare, and on the other hand this change could
recover lost edges (or even other types of lost interrupts) by conservatively
only masking interrupts after they happen. (NOTE: with this change the
highlevel irq-disable code still soft-disables this IRQ line - and if such an
interrupt happens then the IRQ flow handler keeps the IRQ masked.)
Mark i8529A controllers as 'never loses an edge'.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch converts x86_64 to use the GENERIC_TIME infrastructure and adds
clocksource structures for both TSC and HPET (ACPI PM is shared w/ i386).
[akpm@osdl.org: fix printk timestamps]
[akpm@osdl.org: fix printk ckeanups]
[akpm@osdl.org: hpet build fix]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for the x86_64 generic time conversion, this patch splits out
TSC and HPET related code from arch/x86_64/kernel/time.c into respective
hpet.c and tsc.c files.
[akpm@osdl.org: fix printk timestamps]
[akpm@osdl.org: cleanup]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for supporting generic timekeeping, this patch cleans up
x86-64's use of vxtime.hpet_address, changing it to just hpet_address as is
also used in i386. This is necessary since the vxtime structure will be going
away.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
make the TSC synchronization code more robust, and unify it between x86_64 and
i386.
The biggest change is the removal of the 'fix up TSCs' code on x86_64 and
i386, in some rare cases it was /causing/ time-warps on SMP systems.
The new code only checks for TSC asynchronity - and if it can prove a
time-warp (if it can observe the TSC going backwards when going from one CPU
to another within a critical section), then the TSC clock-source is turned
off.
The TSC synchronization-checking code also got moved into a separate file.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.
I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.
So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Basically everything was done but I removed all element initializers from the
trailing entries to make it clear the entire last entry should be zero filled.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just various new acronyms. The new popcnt bit is in the middle
of Intel space. This looks a little weird, but I've been assured
it's ok.
Also I fixed RDTSCP for i386 which was at the wrong place.
For i386 and x86-64.
Signed-off-by: Andi Kleen <ak@suse.de>
Occasionally the kernel has bugs that result in no irq being found for a
given cpu vector. If we acknowledge the irq the system has a good chance
of continuing even though we dropped an irq message. If we continue to
simply print a message and not acknowledge the irq the system is likely to
become non-responsive shortly there after.
AK: Fixed compilation for UP kernels
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: "Luigi Genoni" <luigi.genoni@pirelli.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
I've seen my box paralyzed by an endless spew of
rtc: lost some interrupts at 1024Hz.
messages on the serial console. What seems to be happening is that
something real causes an interrupt to be lost and triggers the
message. But then printing the message to the serial console (from
the hpet interrupt handler) takes more than 1/1024th of a second, and
then some more interrupts are lost, so the message triggers again....
Fix this by adding a printk_ratelimit() before printing the warning.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
On the Unisys ES7000/ONE system, we encountered a problem where performing
a kexec reboot or dump on any cell other than cell 0 causes the system
timer to stop working, resulting in a hang during timer calibration in the
new kernel.
We traced the problem to one line of code in disable_IO_APIC(), which needs
to restore the timer's IO-APIC configuration before rebooting. The code is
currently using the 4-bit physical destination field, rather than using the
8-bit logical destination field, and it cuts off the upper 4 bits of the
timer's APIC ID. If we change this to use the logical destination field,
the timer works and we can kexec on the upper cells. This was tested on
two different cells (0 and 2) in an ES7000/ONE system.
For reference, the relevant Intel xAPIC spec is kept at
ftp://download.intel.com/design/chipsets/e8501/datashts/30962001.pdf,
specifically on page 334.
Signed-off-by: Benjamin M Romer <benjamin.romer@unisys.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During kernel bootup, a new T60 laptop (CoreDuo, 32-bit) hangs about
10%-20% of the time in acpi_init():
Calling initcall 0xc055ce1a: topology_init+0x0/0x2f()
Calling initcall 0xc055d75e: mtrr_init_finialize+0x0/0x2c()
Calling initcall 0xc05664f3: param_sysfs_init+0x0/0x175()
Calling initcall 0xc014cb65: pm_sysrq_init+0x0/0x17()
Calling initcall 0xc0569f99: init_bio+0x0/0xf4()
Calling initcall 0xc056b865: genhd_device_init+0x0/0x50()
Calling initcall 0xc056c4bd: fbmem_init+0x0/0x87()
Calling initcall 0xc056dd74: acpi_init+0x0/0x1ee()
It's a hard hang that not even an NMI could punch through! Frustratingly,
adding printks or function tracing to the ACPI code made the hangs go away
...
After some time an additional detail emerged: disabling the NMI watchdog
made these occasional hangs go away.
So i spent the better part of today trying to debug this and trying out
various theories when i finally found the likely reason for the hang: if
acpi_ns_initialize_devices() executes an _INI AML method and an NMI
happens to hit that AML execution in the wrong moment, the machine would
hang. (my theory is that this must be some sort of chipset setup method
doing stores to chipset mmio registers?)
Unfortunately given the characteristics of the hang it was sheer
impossible to figure out which of the numerous AML methods is impacted
by this problem.
As a workaround i wrote an interface to disable chipset-based NMIs while
executing _INI sections - and indeed this fixed the hang. I did a
boot-loop of 100 separate reboots and none hung - while without the patch
it would hang every 5-10 attempts. Out of caution i did not touch the
nmi_watchdog=2 case (it's not related to the chipset anyway and didnt
hang).
I implemented this for both x86_64 and i686, tested the i686 laptop both
with nmi_watchdog=1 [which triggered the hangs] and nmi_watchdog=2, and
tested an Athlon64 box with the 64-bit kernel as well. Everything builds
and works with the patch applied.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- set bad_dma_address explicitly to 0x0
- reserve 32 pages from bad_dma_address and up
- WARN_ON() a driver feeding us bad_dma_address
Thanks to Leo Duran <leo.duran@amd.com> for the suggestion.
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Leo Duran <leo.duran@amd.com>
Cc: Job Mason <jdmason@kudzu.us>
Should be harmless because there is normally no memory there, but
technically it was incorrect.
Pointed out by Leo Duran
Signed-off-by: Andi Kleen <ak@suse.de>
Initialize FS and GS to __KERNEL_DS as well. The actual value of them is not
important, but it is important to reload them in protected mode. At this time,
they still retain the real mode values from initial boot. VT disallows
execution of code under such conditions, which means hardware virtualization
can not be used to boot the kernel on Intel platforms, making the boot time
painfully slow.
This requires moving the GS load before the load of GS_BASE, so just move
all the segments loads there to keep them together in the code.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
The symbol is needed to manipulate page tables, and modules shouldn't
do that.
Leftover from 2.4, but no in tree module should need it now.
Signed-off-by: Andi Kleen <ak@suse.de>
This means if an illegal value is set for the segment registers there
ptrace will error out now with an errno instead of silently ignoring
it.
Signed-off-by: Andi Kleen <ak@suse.de>
Add failsafe mechanism to HPET/TSC clock calibration.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Updated to include failsafe mechanism & additional community feedback.
Patch built on latest 2.6.20-rc4-mm1 tree.
Signed-off-by: Andi Kleen <ak@suse.de>
When a machine check event is detected (including a AMD RevF threshold
overflow event) allow to run a "trigger" program. This allows user space
to react to such events sooner.
The trigger is configured using a new trigger entry in the
machinecheck sysfs interface. It is currently shared between
all CPUs.
I also fixed the AMD threshold handler to run the machine
check polling code immediately to actually log any events
that might have caused the threshold interrupt.
Also added some documentation for the mce sysfs interface.
Signed-off-by: Andi Kleen <ak@suse.de>
while debugging an unrelated problem in Xen, I noticed odd reads from
non-existent MSRs. Having now found time to look why these happen, I
came up with below patch, which
- prevents accessing MCi_MISCj with j > 0 when the block pointer in
MCi_MISC0 is zero
- accesses only contiguous MCi_MISCj until a non-implemented one is
found
- doesn't touch unimplemented blocks in mce_threshold_interrupt at all
- gives names to two bits previously derived from MASK_VALID_HI (it
took me some time to understand the code without this)
The first three items, besides being apparently closer to the spec, should
namely help cutting down on the time mce_threshold_interrupt() takes.
Signed-off-by: Andi Kleen <ak@suse.de>
P6 CPUs and Core/Core 2 CPUs which has 'architectural perf mon' feature,
only supports write of low 32 bits in Performance Monitoring Counters.
Bits 32..39 are sign extended based on bit 31 and bits 40..63 are reserved
and should be zero.
This patch:
Change x86_64 nmi handler to handle this case cleanly.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
This is a tiny cleanup to increase readability
Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>