This needs to be a unique interrupt source because we do
not have a register or similar to poll to make sure the
IRQ is really for us. We do not have any dev_id to pass
in anyways, and the generic IRQ layer is now enforcing
that when SA_SHIRQ is specified, dev_id must be non-NULL.
Signed-off-by: David S. Miller <davem@davemloft.net>
The idea is to fully construct the device register and
interrupt values into these of_device objects, and convert
all of SBUS, EBUS, ISA drivers to use this new stuff.
Much ideas and code taken from Ben H.'s powerpc work.
Signed-off-by: David S. Miller <davem@davemloft.net>
Totally unused.
We need to traverse the list of global IRQ translaters,
so storing it in the per-bus structures was useless.
Signed-off-by: David S. Miller <davem@davemloft.net>
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6:
[PATCH] i386: export memory more than 4G through /proc/iomem
[PATCH] 64bit Resource: finally enable 64bit resource sizes
[PATCH] 64bit Resource: convert a few remaining drivers to use resource_size_t where needed
[PATCH] 64bit resource: change pnp core to use resource_size_t
[PATCH] 64bit resource: change pci core and arch code to use resource_size_t
[PATCH] 64bit resource: change resource core to use resource_size_t
[PATCH] 64bit resource: introduce resource_size_t for the start and end of struct resource
[PATCH] 64bit resource: fix up printks for resources in misc drivers
[PATCH] 64bit resource: fix up printks for resources in arch and core code
[PATCH] 64bit resource: fix up printks for resources in pcmcia drivers
[PATCH] 64bit resource: fix up printks for resources in video drivers
[PATCH] 64bit resource: fix up printks for resources in ide drivers
[PATCH] 64bit resource: fix up printks for resources in mtd drivers
[PATCH] 64bit resource: fix up printks for resources in pci core and hotplug drivers
[PATCH] 64bit resource: fix up printks for resources in networks drivers
[PATCH] 64bit resource: fix up printks for resources in sound drivers
[PATCH] 64bit resource: C99 changes for struct resource declarations
Fixed up trivial conflict in drivers/ide/pci/cmd64x.c (the printk that
was changed by the 64-bit resources had been deleted in the meantime ;)
Consolidation: remove the irq_affinity[NR_IRQS] array and move it into the
irq_desc[NR_IRQS].affinity field.
[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch-queue improves the generic IRQ layer to be truly generic, by adding
various abstractions and features to it, without impacting existing
functionality.
While the queue can be best described as "fix and improve everything in the
generic IRQ layer that we could think of", and thus it consists of many
smaller features and lots of cleanups, the one feature that stands out most is
the new 'irq chip' abstraction.
The irq-chip abstraction is about describing and coding and IRQ controller
driver by mapping its raw hardware capabilities [and quirks, if needed] in a
straightforward way, without having to think about "IRQ flow"
(level/edge/etc.) type of details.
This stands in contrast with the current 'irq-type' model of genirq
architectures, which 'mixes' raw hardware capabilities with 'flow' details.
The patchset supports both types of irq controller designs at once, and
converts i386 and x86_64 to the new irq-chip design.
As a bonus side-effect of the irq-chip approach, chained interrupt controllers
(master/slave PIC constructs, etc.) are now supported by design as well.
The end result of this patchset intends to be simpler architecture-level code
and more consolidation between architectures.
We reused many bits of code and many concepts from Russell King's ARM IRQ
layer, the merging of which was one of the motivations for this patchset.
This patch:
rename desc->handler to desc->chip.
Originally i did not want to do this, because it's a big patch. But having
both "desc->handler", "desc->handle_irq" and "action->handler" caused a
large degree of confusion and made the code appear alot less clean than it
truly is.
I have also attempted a dual approach as well by introducing a
desc->chip alias - but that just wasnt robust enough and broke
frequently.
So lets get over with this quickly. The conversion was done automatically
via scripts and converts all the code in the kernel.
This renaming patch is the first one amongst the patches, so that the
remaining patches can stay flexible and can be merged and split up
without having some big monolithic patch act as a merge barrier.
[akpm@osdl.org: build fix]
[akpm@osdl.org: another build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With Goto-san's patch, we can add new pgdat/node at runtime. I'm now
considering node-hot-add with cpu + memory on ACPI.
I found acpi container, which describes node, could evaluate cpu before
memory. This means cpu-hot-add occurs before memory hot add.
In most part, cpu-hot-add doesn't depend on node hot add. But register_cpu(),
which creates symbolic link from node to cpu, requires that node should be
onlined before register_cpu(). When a node is onlined, its pgdat should be
there.
This patch-set holds off creating symbolic link from node to cpu
until node is onlined.
This removes node arguments from register_cpu().
Now, register_cpu() requires 'struct node' as its argument. But the array of
struct node is now unified in driver/base/node.c now (By Goto's node hotplug
patch). We can get struct node in generic way. So, this argument is not
necessary now.
This patch also guarantees add cpu under node only when node is onlined. It
is necessary for node-hot-add vs. cpu-hot-add patch following this.
Moreover, register_cpu calculates cpu->node_id by cpu_to_node() without regard
to its 'struct node *root' argument. This patch removes it.
Also modify callers of register_cpu()/unregister_cpu, whose args are changed
by register-cpu-remove-node-struct patch.
[Brice.Goglin@ens-lyon.org: fix it]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Brice Goglin <Brice.Goglin@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Based on a patch series originally from Vivek Goyal <vgoyal@in.ibm.com>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
I severely apologize, I was still learning how to program
in C when I wrote this stuff 10 years ago...
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparcspkr and power drivers are converted, to make sure it works.
Eventually the SBUS device layer will use this as a sub-class.
I really cannot cut loose on that bit until sparc32 is given the
same infrastructure.
Signed-off-by: David S. Miller <davem@davemloft.net>
Import some more stuff from powerpc.
Add of_device_is_compatible(), and of_find_compatible_node().
Export some more of the other routines to modules.
Signed-off-by: David S. Miller <davem@davemloft.net>
One thing this change pointed out was that we really should
pull the "get 'local-mac-address' property" logic into a helper
function all the network drivers can call.
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise the in-kernel PROM device tree isn't built yet,
and therefore the present cpu bits don't get set properly.
Signed-off-by: David S. Miller <davem@davemloft.net>
On some sun4v systems, after netboot the ethernet controller and it's
DMA mappings can be left active. The net result is that the kernel
can end up using memory the ethernet controller will continue to DMA
into, resulting in corruption.
To deal with this, we are more careful about importing IOMMU
translations which OBP has left in the IO-TLB. If the mapping maps
into an area the firmware claimed was free and available memory for
the kernel to use, we demap instead of import that IOMMU entry.
This is going to cause the network chip to take a PCI master abort on
the next DMA it attempts, if it has been left going like this. All
tests show that this is handled properly by the PCI layer and the e1000
drivers.
Signed-off-by: David S. Miller <davem@davemloft.net>
The basic framework is based on the PowerPC OF code.
This code even tries to get the device addressing components
correct in the full path names.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the long overdue conversion of sparc64 over to
the generic IRQ layer.
The kernel image is slightly larger, but the BSS is ~60K
smaller due to the reduced size of struct ino_bucket.
A lot of IRQ implementation details, including ino_bucket,
were moved out of asm-sparc64/irq.h and are now private to
arch/sparc64/kernel/irq.c, and most of the code in irq.c
totally disappeared.
One thing that's different at the moment is IRQ distribution,
we do it at enable_irq() time. If the cpu mask is ALL then
we round-robin using a global rotating cpu counter, else
we pick the first cpu in the mask to support single cpu
targetting. This is similar to what powerpc's XICS IRQ
support code does.
This works fine on my UP SB1000, and the SMP build goes
fine and runs on that machine, but lots of testing on
different setups is needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Inspired by PowerPC XICS interrupt support code.
All IRQs are virtualized in order to keep NR_IRQS from needing
to be too large. Interrupts on sparc64 are arbitrary 11-bit
values, but we don't need to define NR_IRQS to 2048 if we
virtualize the IRQs.
As PCI and SBUS controller drivers build device IRQs, we divy
out virtual IRQ numbers incrementally starting at 1. Zero is
a special virtual IRQ used for the timer interrupt.
So device drivers all see virtual IRQs, and all the normal
interfaces such as request_irq(), enable_irq(), etc. translate
that into a real IRQ number in order to configure the IRQ.
At this point knowledge of the struct ino_bucket is almost
entirely contained within arch/sparc64/kernel/irq.c There are
a few small bits in the PCI controller drivers that need to
be swept away before we can remove ino_bucket's definition
out of asm-sparc64/irq.h and privately into kernel/irq.c
Signed-off-by: David S. Miller <davem@davemloft.net>
And reuse that struct member for virt_irq, which will
be used in future changesets for the implementation of
mapping between real and virtual IRQ numbers.
This nicely kills off a ton of SBUS and PCI controller
PIL assignment code which is no longer necessary.
Signed-off-by: David S. Miller <davem@davemloft.net>
Only pil0_dummy_bucket had a pil of zero and we just killed that
off, so we can delete all special case code that used bp->pil==0
as a way to identify a dummy bucket.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the first in a series of cleanups that will hopefully
allow a seamless attempt at using the generic IRQ handling
infrastructure in the Linux kernel.
Define PIL_DEVICE_IRQ and vector all device interrupts through
there.
Get rid of the ugly pil0_dummy_{bucket,desc}, instead vector
the timer interrupt directly to a specific handler since the
timer interrupt is the only event that will be signaled on
PIL 14.
The irq_worklist is now in the per-cpu trap_block[].
Signed-off-by: David S. Miller <davem@davemloft.net>
Doing PCI config space accesses to non-present PCI slots
can result in fatal JBUS errors if the PCI config access
hypervisor call is performed on cpus other than the boot
cpu.
PCI config space accesses to present PCI slots works just
fine.
Recursively traverse the OBP device tree under the PCI
controller node and record all present device IDs into
a small hash table.
Avoid the hypervisor call for any PCI config space access
attempt for a device not recorded in the hash table.
Signed-off-by: David S. Miller <davem@davemloft.net>
Uses of smp_processor_id() get pushed earlier and earlier in
the start_kernel() sequence. So just get it working before
we call start_kernel() to avoid all possible problems.
Signed-off-by: David S. Miller <davem@davemloft.net>
Using asm-generic/dma-mapping.h does not work because pushing
the call down to pci_alloc_coherent() causes the gfp_t argument
of dma_alloc_coherent() to be ignored.
Fix this by implementing things directly, and adding a gfp_t
argument we can use in the internal call down to the PCI DMA
implementation of pci_alloc_coherent().
This fixes massive memory corruption when using the sound driver
layer, which passes things like __GFP_COMP down into these
routines and (correctly) expects that to work.
Signed-off-by: David S. Miller <davem@davemloft.net>
For sparc32 we need R_SPARC_UA32 relocation support, for
sparc64 we need the handle R_SPARC_DISP32 relocations.
Based upon reports and initial patch by Martin Habets.
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Morton pointed out that compiler might not inline the functions
marked for inline in kprobes. There-by allowing the insertion of probes
on these kprobes routines, which might cause recursion.
This patch removes all such inline and adds them to kprobes section
there by disallowing probes on all such routines. Some of the routines
can even still be inlined, since these routines gets executed after the
kprobes had done necessay setup for reentrancy.
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While cleaning up parisc_ksyms.c earlier, I noticed that strpbrk wasn't
being exported from lib/string.c. Investigating further, I noticed a
changeset that removed its export and added it to _ksyms.c on a few more
architectures. The justification was that "other arches do it."
I think this is wrong, since no architecture currently defines
__HAVE_ARCH_STRPBRK, there's no reason for any of them to be exporting it
themselves. Therefore, consolidate the export to lib/string.c.
Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.
We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.
This patch replaces for_each_cpu with for_each_possible_cpu.
for sparc64.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1) Take doc-book function comment from i386 implementation.
2) cacheline align call_lock, taken from powerpc
3) Need memory barrier after setting call_data
4) Remove timeout
Signed-off-by: David S. Miller <davem@davemloft.net>
GDB uses a PTRACE_PEEKUSR call with offset 0 to see
if a thread is alive, so provide a success return for
this particular special case.
Signed-off-by: David S. Miller <davem@davemloft.net>
switch_mm() changes the mm state and does a tsb_context_switch()
first, then we do the cpu register state switch which changes
current_thread_info() and current().
So it's safer to check the PGD physical address stored in the
trap block (which will be updated by the tsb_context_switch() in
switch_mm()) than current->active_mm.
Technically we should never run here in between those two
updates, because interrupts are disabled during the entire
context switch operation. But some day we might like to leave
interrupts enabled during the context switch and this change
allows that to happen without any surprises.
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel's implementation of notifier chains is unsafe. There is no
protection against entries being added to or removed from a chain while the
chain is in use. The issues were discussed in this thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2
We noticed that notifier chains in the kernel fall into two basic usage
classes:
"Blocking" chains are always called from a process context
and the callout routines are allowed to sleep;
"Atomic" chains can be called from an atomic context and
the callout routines are not allowed to sleep.
We decided to codify this distinction and make it part of the API. Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name). New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain. The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.
With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed. For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections. (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)
There are some limitations, which should not be too hard to live with. For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem. Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain. (This did happen in a couple of places and the code
had to be changed to avoid it.)
Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization. Instead we use RCU. The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent that calling a chain.
Here is the list of chains that we adjusted and their classifications. None
of them use the raw API, so for the moment it is only a placeholder.
ATOMIC CHAINS
-------------
arch/i386/kernel/traps.c: i386die_chain
arch/ia64/kernel/traps.c: ia64die_chain
arch/powerpc/kernel/traps.c: powerpc_die_chain
arch/sparc64/kernel/traps.c: sparc64die_chain
arch/x86_64/kernel/traps.c: die_chain
drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list
kernel/panic.c: panic_notifier_list
kernel/profile.c: task_free_notifier
net/bluetooth/hci_core.c: hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain
net/ipv6/addrconf.c: inet6addr_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain
net/netlink/af_netlink.c: netlink_chain
BLOCKING CHAINS
---------------
arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain
arch/s390/kernel/process.c: idle_chain
arch/x86_64/kernel/process.c idle_notifier
drivers/base/memory.c: memory_chain
drivers/cpufreq/cpufreq.c cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c cpufreq_transition_notifier_list
drivers/macintosh/adb.c: adb_client_list
drivers/macintosh/via-pmu.c sleep_notifier_list
drivers/macintosh/via-pmu68k.c sleep_notifier_list
drivers/macintosh/windfarm_core.c wf_client_list
drivers/usb/core/notify.c usb_notifier_list
drivers/video/fbmem.c fb_notifier_list
kernel/cpu.c cpu_chain
kernel/module.c module_notify_list
kernel/profile.c munmap_notifier
kernel/profile.c task_exit_notifier
kernel/sys.c reboot_notifier_list
net/core/dev.c netdev_chain
net/decnet/dn_dev.c: dnaddr_chain
net/ipv4/devinet.c: inetaddr_chain
It's possible that some of these classifications are wrong. If they are,
please let us know or submit a patch to fix them. Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)
The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.
[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Provide proper kprobes fault handling, if a user-specified pre/post handlers
tries to access user address space, through copy_from_user(), get_user() etc.
The user-specified fault handler gets called only if the fault occurs while
executing user-specified handlers. In such a case user-specified handler is
allowed to fix it first, later if the user-specifed fault handler does not fix
it, we try to fix it by calling fix_exception().
The user-specified handler will not be called if the fault happens when single
stepping the original instruction, instead we reset the current probe and
allow the system page fault handler to fix it up.
I could not test this patch for sparc64.
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently kprobe handler traps only happen in kernel space, so function
kprobe_exceptions_notify should skip traps which happen in user space.
This patch modifies this, and it is based on 2.6.16-rc4.
Signed-off-by: bibo mao <bibo.mao@intel.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
Cc: <hiramatu@sdl.hitachi.co.jp>
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Create compat_sys_adjtimex and use it an all appropriate places.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We had a copy of the compatibility version of struct timex in each 64 bit
architecture. This patch just creates a global one and replaces all the
usages of the old ones.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kyle McMartin <kyle@parisc-linux.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When we stop allocating percpu memory for not-possible CPUs we must not touch
the percpu data for not-possible CPUs at all. The correct way of doing this
is to test cpu_possible() or to use for_each_cpu().
This patch is a kernel-wide sweep of all instances of NR_CPUS. I found very
few instances of this bug, if any. But the patch converts lots of open-coded
test to use the preferred helper macros.
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Kyle McMartin <kyle@parisc-linux.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Christian Zankel <chris@zankel.net>
Cc: Philippe Elie <phil.el@wanadoo.fr>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Jens Axboe <axboe@suse.de>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
init/do_mounts_rd.c depends upon CONFIG_BLK_DEV_RAM, not CONFIG_BLK_DEV_INITRD.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We only need to write an invalid tag every 16 bytes,
so taking advantage of this can save many instructions
compared to the simple memset() call we make now.
A prefetching implementation is implemented for sun4u
and a block-init store version if implemented for Niagara.
The next trick is to be able to perform an init and
a copy_tsb() in parallel when growing a TSB table.
Signed-off-by: David S. Miller <davem@davemloft.net>
Put it one page below the top of the 32-bit address space.
This gives us ~16MB more address space to work with.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently allocations are very constrained for 32-bit processes.
It grows down-up from 0x70000000 to 0xf0000000 which gives about
2GB of stack + dynamic mmap() space.
So support the top-down method, and we need to override the
generic helper function in order to deal with D-cache coloring.
With these changes I was able to squeeze out a mmap() just over
3.6GB in size in a 32-bit process.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is good for up to %50 performance improvement of some test cases.
The problem has been the race conditions, and hopefully I've plugged
them all up here.
1) There was a serious race in switch_mm() wrt. lazy TLB
switching to and from kernel threads.
We could erroneously skip a tsb_context_switch() and thus
use a stale TSB across a TSB grow event.
There is a big comment now in that function describing
exactly how it can happen.
2) All code paths that do something with the TSB need to be
guarded with the mm->context.lock spinlock. This makes
page table flushing paths properly synchronize with both
TSB growing and TLB context changes.
3) TSB growing events are moved to the end of successful fault
processing. Previously it was in update_mmu_cache() but
that is deadlock prone. At the end of do_sparc64_fault()
we hold no spinlocks that could deadlock the TSB grow
sequence. We also have dropped the address space semaphore.
While we're here, add prefetching to the copy_tsb() routine
and put it in assembler into the tsb.S file. This piece of
code is quite time critical.
There are some small negative side effects to this code which
can be improved upon. In particular we grab the mm->context.lock
even for the tsb insert done by update_mmu_cache() now and that's
a bit excessive. We can get rid of that locking, and the same
lock taking in flush_tsb_user(), by disabling PSTATE_IE around
the whole operation including the capturing of the tsb pointer
and tsb_nentries value. That would work because anyone growing
the TSB won't free up the old TSB until all cpus respond to the
TSB change cross call.
I'm not quite so confident in that optimization to put it in
right now, but eventually we might be able to and the description
is here for reference.
This code seems very solid now. It passes several parallel GCC
bootstrap builds, and our favorite "nut cruncher" stress test which is
a full "make -j8192" build of a "make allmodconfig" kernel. That puts
about 256 processes on each cpu's run queue, makes lots of process cpu
migrations occur, causes lots of page table and TLB flushing activity,
incurs many context version number changes, and it swaps the machine
real far out to disk even though there is 16GB of ram on this test
system. :-)
Signed-off-by: David S. Miller <davem@davemloft.net>
Report 'sun4v' when appropriate in /proc/cpuinfo
Remove all the verifications of the OBP version string. Just
make sure it's there, and report it raw in the bootup logs and
via /proc/cpuinfo.
Signed-off-by: David S. Miller <davem@davemloft.net>
The mapping is a simple "(cpuid >> 2) == core" for now.
Later we'll add more sophisticated code that will walk
the sun4v machine description and figure this out from
there.
We should also add core mappings for jaguar and panther
processors.
Signed-off-by: David S. Miller <davem@davemloft.net>
This has been pending for a long time, and the fact
that we waste a ton of ram on some configurations
kind of pushed things over the edge.
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't piggy back the SMP receive signal code to do the
context version change handling.
Instead allocate another fixed PIL number for this
asynchronous cross-call. We can't use smp_call_function()
because this thing is invoked with interrupts disabled
and a few spinlocks held.
Also, fix smp_call_function_mask() to count "cpus" correctly.
There is no guarentee that the local cpu is in the mask
yet that is exactly what this code was assuming.
Signed-off-by: David S. Miller <davem@davemloft.net>
this patch converts arch/sparc64 to kzalloc usage.
Crosscompile tested with allyesconfig.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't try to avoid putting non-base page sized entries
into the user TSB. It actually costs us more to check
this than it helps.
Eventually we'll have a multiple TSB scheme for user
processes. Once a process starts using larger pages,
we'll allocate and use such a TSB.
Signed-off-by: David S. Miller <davem@davemloft.net>
This cpu mondo sending interface isn't all that easy to
use correctly...
We were clearing out the wrong bits from the "mask" after getting
something other than EOK from the hypervisor.
It turns out the hypervisor can just be resent the same cpu_list[]
array, with the 0xffff "done" entries still in there, and it will do
the right thing.
So don't update or try to rebuild the cpu_list[] array to condense it.
This requires the "forward_progress" check to be done slightly
differently, but this new scheme is less bug prone than what we were
doing before.
Signed-off-by: David S. Miller <davem@davemloft.net>
We were clobbering a base register before we were done
using it. Fix a comment typo while we're here.
Signed-off-by: David S. Miller <davem@davemloft.net>
The UltraSPARC T1 manual recommends this because the chip
could instruction prefetch into the VA hole, and this would
also make decoding certain kinds of memory access traps
more difficult (because the chip sign extends certain pieces
of trap state).
Signed-off-by: David S. Miller <davem@davemloft.net>
First of all, use the known _PAGE_EXEC_{4U,4V} value instead
of loading _PAGE_EXEC from memory. We either know which one
to use by context, or we can code patch the test.
Next, we need to check executability of a PTE in the generic
TSB miss handler.
Signed-off-by: David S. Miller <davem@davemloft.net>
There were several bugs in the SUN4V cpu mondo dispatch code.
In fact, if we ever got a EWOULDBLOCK or other error from
the hypervisor call, we'd potentially send a cpu mondo multiple
times to the same cpu and even worse we could loop until the
timeout resending the same mondo over and over to such cpus.
So let's bulletproof this thing as follows:
1) Implement cpu_mondo_send() and cpu_state() hypervisor calls
in arch/sparc64/kernel/entry.S, add prototypes to asm/hypervisor.h
2) Don't build and update the cpulist using inline functions, this
was causing the cpu mask to not get updated in the caller.
3) Disable interrupts during the entire mondo send, otherwise our
cpu list and/or mondo block could get overwritten if we take
an interrupt and do a cpu mondo send on the current cpu.
4) Check for all possible error return types from the cpu_mondo_send()
hypervisor call. In particular:
HV_EOK) Our work is done, all cpus have received the mondo.
HV_CPUERROR) One or more of the cpus in the cpu list we passed
to the hypervisor are in error state. Use cpu_state()
calls over the entries in the cpu list to see which
ones. Record them in "error_mask" and report this
after we are done sending the mondo to cpus which are
not in error state.
HV_EWOULDBLOCK) We need to keep trying.
Any other error we consider fatal, we report the event and exit
immediately.
5) We only timeout if forward progress is not made. Forward progress
is defined as having at least one cpu get the mondo successfully
in a given cpu_mondo_send() call. Otherwise we bump a counter
and delay a little. If the counter hits a limit, we signal an
error and report the event.
Also, smp_call_function_mask() error handling reports the number
of cpus incorrectly.
Signed-off-by: David S. Miller <davem@davemloft.net>
1) We must flush the TLB, duh.
2) Even if the sw context was seen to be valid, the local cpu's
hw context can be out of date, so reload it unconditionally.
Signed-off-by: David S. Miller <davem@davemloft.net>
Check TLB flush hypervisor calls for errors and report them.
Pass HV_MMU_ALL always for now, we can add back the optimization
to avoid the I-TLB flush later.
Always explicitly page align the virtual address arguments.
Signed-off-by: David S. Miller <davem@davemloft.net>
The context allocation scheme we use depends upon there being a 1<-->1
mapping from cpu to physical TLB for correctness. Chips like Niagara
break this assumption.
So what we do is notify all cpus with a cross call when the context
version number changes, and if necessary this makes them allocate
a valid context for the address space they are running at the time.
Stress tested with make -j1024, make -j2048, and make -j4096 kernel
builds on a 32-strand, 8 core, T2000 with 16GB of ram.
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise with too much stuff enabled in the kernel config
we can end up with an unaligned trap table.
Signed-off-by: David S. Miller <davem@davemloft.net>
If we take a window fault, on SUN4V set %gl to zero before we
turn PSTATE_IE back on in %pstate. Otherwise if we take an
interrupt we'll end up with corrupt register state.
Signed-off-by: David S. Miller <davem@davemloft.net>
It can map all of the linear kernel mappings with zero TSB hash
conflicts for systems with 16GB or less ram. In such cases, on
SUN4V, once we load up this TSB the first time with all the
mappings, we never take a linear kernel mapping TLB miss ever
again, the hypervisor handles them all.
Signed-off-by: David S. Miller <davem@davemloft.net>
We use a bitmap, one bit for every 256MB of memory. If the
bit is set we can use a 256MB PTE for linear mappings, else
we have to use a 4MB PTE.
SUN4V support is there, and we can very easily add support
for Panther cpu 256MB PTEs in the future.
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to turn off the "polling nrflag" bit when we sleep
the cpu like this, so that we'll get a cross-cpu interrupt
to wake the processor up from the yield.
We also have to disable PSTATE_IE in %pstate around the yield
call and recheck need_resched() in order to avoid any races.
Signed-off-by: David S. Miller <davem@davemloft.net>
Set, but never used.
We used to use this for dynamic IRQ retargetting, but that
code died a long time ago.
Signed-off-by: David S. Miller <davem@davemloft.net>
It's extremely noisy and causes much grief on slow
consoles with large numbers of cpus.
We'll have to provide this some saner way in order
to re-enable this.
Signed-off-by: David S. Miller <davem@davemloft.net>
We're about to seriously die in these cases so it is important
that the messages make it to the console.
Signed-off-by: David S. Miller <davem@davemloft.net>
Another case where we have to force ourselves into global register
level one. Also make sure the arguments passed to sun4v_do_mna() are
correct.
This area actually needs some more work, for example spill fixup is
not necessarily going to do the right thing for this case.
Signed-off-by: David S. Miller <davem@davemloft.net>
Just like kvmap_dtlb_longpath we have to force the
global register level to one in order to mimick the
PSTATE_MG --> PSTATE_AG trasition done on SUN4U.
Signed-off-by: David S. Miller <davem@davemloft.net>
The SUN4V convention with non-shared TSBs is that the context
bit of the TAG is clear. So we have to choose an "invalid"
bit and initialize new TSBs appropriately. Otherwise a zero
TAG looks "valid".
Make sure, for the window fixup cases, that we use the right
global registers and that we don't potentially trample on
the live global registers in etrap/rtrap handling (%g2 and
%g6) and that we put the missing virtual address properly
in %g5.
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Add error return checking for TLB load hypervisor
calls.
2) Don't fallthru to dtlb tsb miss handler from itlb tsb
miss handler, oops.
3) On window fixups, propagate fault information to fixup
handler correctly.
Signed-off-by: David S. Miller <davem@davemloft.net>
This gives more consistent bogomips and delay() semantics,
especially on sun4v. It gives weird looking values though...
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to use the real hardware processor ID when
targetting interrupts, not the "define to 0" thing
the uniprocessor build gives us.
Also, fill in the Node-ID and Agent-ID fields properly
on sun4u/Safari.
Signed-off-by: David S. Miller <davem@davemloft.net>
If the top-level cnode had multi entries in it's "reg"
property, we'd fail. The buffer wasn't large enough in
such cases.
Signed-off-by: David S. Miller <davem@davemloft.net>
The sibling cpu bringup is extremely fragile. We can only
perform the most basic calls until we take over the trap
table from the firmware/hypervisor on the new cpu.
This means no accesses to %g4, %g5, %g6 since those can't be
TLB translated without our trap handlers.
In order to achieve this:
1) Change sun4v_init_mondo_queues() so that it can operate in
several modes.
It can allocate the queues, or install them in the current
processor, or both.
The boot cpu does both in it's call early on.
Later, the boot cpu allocates the sibling cpu queue, starts
the sibling cpu, then the sibling cpu loads them in.
2) init_cur_cpu_trap() is changed to take the current_thread_info()
as an argument instead of reading %g6 directly on the current
cpu.
3) Create a trampoline stack for the sibling cpus. We do our basic
kernel calls using this stack, which is locked into the kernel
image, then go to our proper thread stack after taking over the
trap table.
4) While we are in this delicate startup state, we put 0xdeadbeef
into %g4/%g5/%g6 in order to catch accidental accesses.
5) On the final prom_set_trap_table*() call, we put &init_thread_union
into %g6. This is a hack to make prom_world(0) work. All that
wants to do is restore the %asi register using
get_thread_current_ds().
Longer term we should just do the OBP calls to set the trap table by
hand just like we do for everything else. This would avoid that silly
prom_world(0) issue, then we can remove the init_thread_union hack.
Signed-off-by: David S. Miller <davem@davemloft.net>
For 32 cpus and a slow console, it just wedges the
machine especially with DETECT_SOFTLOCKUP enabled.
Signed-off-by: David S. Miller <davem@davemloft.net>
The whole algorithm was wrong. What we need to do is:
1) Walk each PCI bus above this device on the path to the
PCI controller nexus, and for each:
a) If interrupt-map exists, apply it, record IRQ controller node
b) Else, swivel interrupt number using PCI_SLOT(), use PCI bus
parent OBP node as controller node
c) Walk up to "controller node" until we hit the first PCI bus
in this domain, or "controller node" is the PCI controller
OBP node
2) If we walked to PCI controller OBP node, we're done.
3) Else, apply PCI controller interrupt-map to interrupt.
There is some stuff that needs to be checked out for ebus and
isa, but the PCI part is good to go.
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to set the global register set _AND_ disable
PSTATE_IE in %pstate. The original patch sequence was
leaving PSTATE_IE enabled when returning to kernel mode,
oops.
This fixes the random register corruption being seen
on SUN4V.
Signed-off-by: David S. Miller <davem@davemloft.net>
Forgot to multiply by 8 * 1024, oops. Correct the size constant when
the virtual-dma arena is 2GB in size, it should bet 256 not 128.
Finally, log some info about the TSB at probe time.
Signed-off-by: David S. Miller <davem@davemloft.net>
For SUN4V, we were clobbering %o5 to do the hypervisor call.
This clobbers the saved %pstate value and we end up writing
garbage into that register as a result. Oops.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use prom_startcpu_cpuid() on SUN4V instead of prom_startcpu().
We should really test for "SUNW,start-cpu-by-cpuid" presence
and use it if present even on SUN4U.
Signed-off-by: David S. Miller <davem@davemloft.net>
When crawling up the PCI bus chain, stop at the first node
that has an interrupt-map property before we hit the root.
Also, if we use a bus interrupt-{map,mask} do not forget to
update the 'intmask' pointer as we do for the 'intmap' pointer.
Signed-off-by: David S. Miller <davem@davemloft.net>
On SUN4V, force IRQ state to idle in enable_irq(). However,
I'm still not sure this is %100 correct.
Call add_interrupt_randomness() on SUN4V too.
Signed-off-by: David S. Miller <davem@davemloft.net>
On the PBM's first bus number, only allow device 0, function 0, to be
poked at with PCI config space accesses.
For some reason, this single device responds to all device numbers.
Also, reduce the verbiage of the debugging log printk's for PCI cfg
space accesses in the SUN4V PCI controller driver, so that it doesn't
overwhelm the slow SUN4V hypervisor console.
Signed-off-by: David S. Miller <davem@davemloft.net>
We should dynamically allocate the per-cpu pglist not use
an in-kernel-image datum, since __pa() does not work on
such addresses.
Also, consistently use "u32" for devhandle.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add udelay to polling console write loop, and increment
the loop limit.
Name the device "ttyHV" and pass that to add_preferred_console()
when we're using hypervisor console.
Kill sunhv_console_setup(), it's empty.
Handle the case where we don't want to use hypervisor console.
(ie. we have a head attached to a sun4v machine)
Signed-off-by: David S. Miller <davem@davemloft.net>
Get bus range from child of PCI controller root nexus.
This is actually a hack, but the PCI-E bridge sitting
at the top of the PCI tree responds to PCI config cycles
for every device number, so best to just ignore it for now.
Preliminary PCI irq routing, needs lots of work.
Signed-off-by: David S. Miller <davem@davemloft.net>
Clear top 8-bits of physical addresses in "ranges" property.
This gives the actual physical address.
Detect PBM-A vs. PBM-B by checking bit 0x40 of the devhandle.
Signed-off-by: David S. Miller <davem@davemloft.net>
PCI cfg space is accessed transparently through the Hypervisor and not
through direct cpu PIO operations.
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to use bootmem during init_IRQ and page alloc
for sibling cpu calls.
Also, fix incorrect hypervisor call return value
checks in the hypervisor SMP cpu mondo send code.
Signed-off-by: David S. Miller <davem@davemloft.net>
Yes, you heard it right, they changed the PTE layout for
SUN4V. Ho hum...
This is the simple and inefficient way to support this.
It'll get optimized, don't worry.
Signed-off-by: David S. Miller <davem@davemloft.net>
Code patching did not sign extend negative branch
offsets correctly.
Kernel TLB miss path needs patching and %g4 register
preservation in order to handle SUN4V correctly.
Signed-off-by: David S. Miller <davem@davemloft.net>
prom_sun4v_name should be "sun4v" not "SUNW,sun4v"
Also, this is too early to make use of the
.sun4v_Xinsn_patch code patching, so just check
things manually.
This gets us at least to prom_init() on Niagara.
Signed-off-by: David S. Miller <davem@davemloft.net>
There was also a bug in sun4v_itlb_miss, it loaded the
MMU Fault Status base into %g3 instead of %g2.
This pointed out a fast path for TSB miss processing,
since we have %g2 with the MMU Fault Status base, we
can use that to quickly load up the PGD phys address.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is where the virtual address of the fault status
area belongs.
To set it up we don't make a hypervisor call, instead
we call OBP's SUNW,set-trap-table with the real address
of the fault status area as the second argument. And
right before that call we write the virtual address into
ASI_SCRATCHPAD vaddr 0x0.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add assembler file for PCI hypervisor calls.
Setup basic skeleton of SUN4V PCI controller driver.
Add 32-bit devhandle to PBM struct, as this is needed for
hypervisor calls.
Signed-off-by: David S. Miller <davem@davemloft.net>
Abstract out IOMMU operations so that we can have a different
set of calls on sun4v, which needs to do things through
hypervisor calls.
Signed-off-by: David S. Miller <davem@davemloft.net>
When we register a TSB with the hypervisor, so that it or hardware can
handle TLB misses and do the TSB walk for us, the hypervisor traps
down to these trap when it incurs a TSB miss.
Processing is simple, we load the missing virtual address and context,
and do a full page table walk.
Signed-off-by: David S. Miller <davem@davemloft.net>
We look for "SUNW,sun4v" in the 'compatible' property
of the root OBP device tree node.
Protect every %ver register access, to make sure it is
not touched on sun4v, as %ver is hyperprivileged there.
Lock kernel TLB entries using hypervisor calls instead of
calls into OBP.
Signed-off-by: David S. Miller <davem@davemloft.net>
Technically the hypervisor call supports sending in a list
of all cpus to get the cross-call, but I only pass in one
cpu at a time for now.
The multi-cpu support is there, just ifdef'd out so it's easy to
enable or delete it later.
Signed-off-by: David S. Miller <davem@davemloft.net>
Sun4v has 4 interrupt queues: cpu, device, resumable errors,
and non-resumable errors. A set of head/tail offset pointers
help maintain a work queue in physical memory. The entries
are 64-bytes in size.
Each queue is allocated then registered with the hypervisor
as we bring cpus up.
The two error queues each get a kernel side buffer that we
use to quickly empty the main interrupt queue before we
call up to C code to log the event and possibly take evasive
action.
Signed-off-by: David S. Miller <davem@davemloft.net>
Happily we have no D-cache aliasing issues on these
chips, so the implementation is very straightforward.
Add a stub in bootup which will be where the patching
calls will be made for niagara/sun4v/hypervisor.
Signed-off-by: David S. Miller <davem@davemloft.net>
Things are a little tricky because, unlike sun4u, we have
to:
1) do a hypervisor trap to do the TLB load.
2) do the TSB lookup calculations by hand
Signed-off-by: David S. Miller <davem@davemloft.net>
If we're just switching between different alternate global
sets, nop it out on sun4v. Also, get rid of all of the
alternate global save/restore in the OBP CIF trampoline code.
Signed-off-by: David S. Miller <davem@davemloft.net>
They are totally unnecessary because:
1) Interrupts are already disabled when switch_to()
runs.
2) We don't use hard-coded alternate globals any longer.
This found a case in rtrap, which still assumed alternate
global %g6 was current_thread_info(), and that is fixed
by this changeset as well.
Signed-off-by: David S. Miller <davem@davemloft.net>
As we save trap state onto the stack, the store buffer fills up
mid-way through and we stall for several cycles as the store buffer
trickles out to the L2 cache. Meanwhile we can do some privileged
register reads and other calculations, essentially for free.
Signed-off-by: David S. Miller <davem@davemloft.net>
And more consistently check cheetah{,_plus} instead
of assuming anything not spitfire is cheetah{,_plus}.
Signed-off-by: David S. Miller <davem@davemloft.net>
When saving and restoing trap state, do the window spill/fill
handling inline so that we never trap deeper than 2 trap levels.
This is important for chips like Niagara.
The window fixup code is massively simplified, and many more
improvements are now possible.
Signed-off-by: David S. Miller <davem@davemloft.net>
On uniprocessor, it's always zero for optimize that.
On SMP, the jmpl to the stub kills the return address stack in the cpu
branch prediction logic, so expand the code sequence inline and use a
code patching section to fix things up. This also always better and
explicit register selection, which will be taken advantage of in a
future changeset.
The hard_smp_processor_id() function is big, so do not inline it.
Fix up tests for Jalapeno to also test for Serrano chips too. These
tests want "jbus Ultra-IIIi" cases to match, so that is what we should
test for.
Signed-off-by: David S. Miller <davem@davemloft.net>
The are distrupting, which by the sparc v9 definition means they
can only occur when interrupts are enabled in the %pstate register.
This never occurs in any of the trap handling code running at
trap levels > 0.
So just mark it as an unexpected trap.
This allows us to kill off the cee_stuff member of struct thread_info.
Signed-off-by: David S. Miller <davem@davemloft.net>
This way we don't need to lock the TSB into the TLB.
The trick is that every TSB load/store is registered into
a special instruction patch section. The default uses
virtual addresses, and the patch instructions use physical
address load/stores.
We can't do this on all chips because only cheetah+ and later
have the physical variant of the atomic quad load.
Signed-off-by: David S. Miller <davem@davemloft.net>
If we are returning back to kernel mode, %g4 could be live
(for example, in the case where we window spill in the etrap
code). So do not change it's value if going back to kernel.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we use %g5 itself as a temporary, it can get clobbered
if we take an interrupt mid-stream and thus cause end up with
the final %g5 value too early as a result of rtrap processing.
Set %g5 at the very end, atomically, to avoid this problem.
Signed-off-by: David S. Miller <davem@davemloft.net>
%g6 is not necessarily set to current_thread_info()
at sparc64_realfault_common. So store the fault
code and address after we invoke etrap and %g6 is
properly set up.
Signed-off-by: David S. Miller <davem@davemloft.net>
Just flip the bit off of whatever it's currently set to.
PSTATE_IE is guarenteed to be enabled when we get here.
Signed-off-by: David S. Miller <davem@davemloft.net>
It is totally unnecessary complexity. After we take over
the trap table, we handle all PROM tlb misses fully.
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of the trap code was still assuming that alternate
global %g6 was hard coded with current_thread_info().
Let's just consistently flush at KERNBASE when we need
a pipeline synchronization. That's locked into the TLB
and will always work.
Signed-off-by: David S. Miller <davem@davemloft.net>
As the RSS grows, grow the TSB in order to reduce the likelyhood
of hash collisions and thus poor hit rates in the TSB.
This definitely needs some serious tuning.
Signed-off-by: David S. Miller <davem@davemloft.net>
This also cleans up tsb_context_switch(). The assembler
routine is now __tsb_context_switch() and the former is
an inline function that picks out the bits from the mm_struct
and passes it into the assembler code as arguments.
setup_tsb_parms() computes the locked TLB entry to map the
TSB. Later when we support using the physical address quad
load instructions of Cheetah+ and later, we'll simply use
the physical address for the TSB register value and set
the map virtual and PTE both to zero.
Signed-off-by: David S. Miller <davem@davemloft.net>
Move {init_new,destroy}_context() out of line.
Do not put huge pages into the TSB, only base page size translations.
There are some clever things we could do here, but for now let's be
correct instead of fancy.
Signed-off-by: David S. Miller <davem@davemloft.net>
UltraSPARC has special sets of global registers which are switched to
for certain trap types. There is one set for MMU related traps, one
set of Interrupt Vector processing, and another set (called the
Alternate globals) for all other trap types.
For what seems like forever we've hard coded the values in some of
these trap registers. Some examples include:
1) Interrupt Vector global %g6 holds current processors interrupt
work struct where received interrupts are managed for IRQ handler
dispatch.
2) MMU global %g7 holds the base of the page tables of the currently
active address space.
3) Alternate global %g6 held the current_thread_info() value.
Such hardcoding has resulted in some serious issues in many areas.
There are some code sequences where having another register available
would help clean up the implementation. Taking traps such as
cross-calls from the OBP firmware requires some trick code sequences
wherein we have to save away and restore all of the special sets of
global registers when we enter/exit OBP.
We were also using the IMMU TSB register on SMP to hold the per-cpu
area base address, which doesn't work any longer now that we actually
use the TSB facility of the cpu.
The implementation is pretty straight forward. One tricky bit is
getting the current processor ID as that is different on different cpu
variants. We use a stub with a fancy calling convention which we
patch at boot time. The calling convention is that the stub is
branched to and the (PC - 4) to return to is in register %g1. The cpu
number is left in %g6. This stub can be invoked by using the
__GET_CPUID macro.
We use an array of per-cpu trap state to store the current thread and
physical address of the current address space's page tables. The
TRAP_LOAD_THREAD_REG loads %g6 with the current thread from this
table, it uses __GET_CPUID and also clobbers %g1.
TRAP_LOAD_IRQ_WORK is used by the interrupt vector processing to load
the current processor's IRQ software state into %g6. It also uses
__GET_CPUID and clobbers %g1.
Finally, TRAP_LOAD_PGD_PHYS loads the physical address base of the
current address space's page tables into %g7, it clobbers %g1 and uses
__GET_CPUID.
Many refinements are possible, as well as some tuning, with this stuff
in place.
Signed-off-by: David S. Miller <davem@davemloft.net>
Taking a nod from the powerpc port.
With the per-cpu caching of both the page allocator and SLAB, the
pgtable quicklist scheme becomes relatively silly and primitive.
Signed-off-by: David S. Miller <davem@davemloft.net>
We now use the TSB hardware assist features of the UltraSPARC
MMUs.
SMP is currently knowingly broken, we need to find another place
to store the per-cpu base pointers. We hid them away in the TSB
base register, and that obviously will not work any more :-)
Another known broken case is non-8KB base page size.
Also noticed that flush_tlb_all() is not referenced anywhere, only
the internal __flush_tlb_all() (local cpu only) is used by the
sparc64 port, so we can get rid of flush_tlb_all().
The kernel gets it's own 8KB TSB (swapper_tsb) and each address space
gets it's own private 8K TSB. Later we can add code to dynamically
increase the size of per-process TSB as the RSS grows. An 8KB TSB is
good enough for up to about a 4MB RSS, after which the TSB starts to
incur many capacity and conflict misses.
We even accumulate OBP translations into the kernel TSB.
Another area for refinement is large page size support. We could use
a secondary address space TSB to handle those.
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch "[SPARC64]: Get rid of fast IRQ feature"
moved the the code from arch/sparc64/kernel/entry.S:
lduba [%g7] ASI_PHYS_BYPASS_EC_E, %g5
or %g5, AUXIO_AUX1_FTCNT, %g5
stba %g5, [%g7] ASI_PHYS_BYPASS_EC_E
andn %g5, AUXIO_AUX1_FTCNT, %g5
stba %g5, [%g7] ASI_PHYS_BYPASS_EC_E
to arch/sparc64/kernel/irq.c:
val = readb(auxio_register);
val |= AUXIO_AUX1_FTCNT;
writeb(val, auxio_register);
val &= AUXIO_AUX1_FTCNT;
writeb(val, auxio_register);
This looks like it it missing a bitwise not, which is reintroduced
by this patch.
Due to lack of a floppy device, I could not test it, but it looks
evident.
Signed-off-by: Bernhard R Link <brlink@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We must use the "a" (allocate) attribute every time we
emit an entry into the __ex_table section.
For consistency, use "a" instead of #alloc which is some
Solaris compat cruft GNU as provides on Sparc.
Signed-off-by: David S. Miller <davem@davemloft.net>
The change to kernel/sched.c's init code to use for_each_cpu()
requires that the cpu_possible_map be setup much earlier.
Set it up via setup_arch(), constrained to NR_CPUS, and later
constrain it to max_cpus in smp_prepare_cpus().
This fixes SMP booting on sparc64.
Signed-off-by: David S. Miller <davem@davemloft.net>
The sparc64 64 bit syscall table seems to be broken as it has
compat_sys_newfstatat in its syscall table instead of sys_newfstatat.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also, the Solaris syscall table is sized differrently,
and does not go beyond entry 255, so trim off the excess
entries.
Signed-off-by: David S. Miller <davem@davemloft.net>
This also includes by necessity _TIF_RESTORE_SIGMASK support,
which actually resulted in a lot of cleanups.
The sparc signal handling code is quite a mess and I should
clean it up some day.
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is a series of patches which introduce in total 13 new system calls
which take a file descriptor/filename pair instead of a single file
name. These functions, openat etc, have been discussed on numerous
occasions. They are needed to implement race-free filesystem traversal,
they are necessary to implement a virtual per-thread current working
directory (think multi-threaded backup software), etc.
We have in glibc today implementations of the interfaces which use the
/proc/self/fd magic. But this code is rather expensive. Here are some
results (similar to what Jim Meyering posted before).
The test creates a deep directory hierarchy on a tmpfs filesystem. Then
rm -fr is used to remove all directories. Without syscall support I get
this:
real 0m31.921s
user 0m0.688s
sys 0m31.234s
With syscall support the results are much better:
real 0m20.699s
user 0m0.536s
sys 0m20.149s
The interfaces are for obvious reasons currently not much used. But they'll
be used. coreutils (and Jeff's posixutils) are already using them.
Furthermore, code like ftw/fts in libc (maybe even glob) will also start using
them. I expect a patch to make follow soon. Every program which is walking
the filesystem tree will benefit.
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ftp.linux.org.uk>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: Eddie C. Dost <ecd@brainaid.de>
I have the following patch for serial console over the RSC
(remote system controller) on my E250 machine. It basically adds
support for input-device=rsc and output-device=rsc from OBP, and
allows 115200,8,n,1,- serial mode setting.
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure a consistent value is read from the STICK register by ensuring
that both high and low are read without high changing due to a roll
over of the low register.
Various Debian/SPARC users (myself include) have noticed problems with
Hummingbird based systems. The symptoms are that the system time is
seen to jump forward 3 days, 6 hours, 11 minutes give or take a few
seconds. In many cases the system then hangs some time afterwards.
I've spotted a race condition in the code to read the STICK register.
I could not work out why 3d, 6h, 11m is important but guess that it is
due to the 2^32 jump of STICK (forwards on one read and then the next
read will seem to be backwards) during a timer interrupt. I'm guessing
that a change of -2^32 will get converted to a large unsigned
increment after the arithmetic manipulation between STICK,
nanoseconds, jiffies etc.
I did a test where I modified __hbird_read_stick to artificially
inject rollover faults forcefully every few seconds. With this I saw
the clock jump over 6 times in 12 hours compared to once every month
or so.
Signed-off-by: Richard Mortimer <richm@oldelvet.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
arch: Use <linux/capability.h> where capable() is used.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is a window where a probe gets removed right after the probe is hit
on some different cpu. In this case probe handlers can't find a matching
probe instance related to break address. In this case we need to read the
original instruction at break address to see if that is not a break/int3
instruction and recover safely.
Previous code had a bug where we were not checking for the above race in
case of reentrant probes and the below patch fixes this race.
Tested on IA64, Powerpc, x86_64.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently arch_remove_kprobes() is only implemented/required for x86_64 and
powerpc. All other architecture like IA64, i386 and sparc64 implementes a
dummy function which is being called from arch independent kprobes.c file.
This patch removes the dummy functions and replaces it with
#define arch_remove_kprobe(p, s) do { } while(0)
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since Kprobes runtime exception handlers is now lock free as this code path is
now using RCU to walk through the list, there is no need for the
register/unregister{_kprobe} to use spin_{lock/unlock}_isr{save/restore}. The
serialization during registration/unregistration is now possible using just a
mutex.
In the above process, this patch also fixes a minor memory leak for x86_64 and
powerpc.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that all these entries in the arch ioctl32.c files are gone [1], we can
build fs/compat_ioctl.c as a normal object and kill tons of cruft. We need a
special do_ioctl32_pointer handler for s390 so the compat_ptr call is done.
This is not needed but harmless on all other architectures. Also remove some
superflous includes in fs/compat_ioctl.c
Tested on ppc64.
[1] parisc still had it's PPP handler left, which is not fully correct
for ppp and besides that ppp uses the generic SIOCPRIV ioctl so it'd
kick in for all netdevice users. We can introduce a proper handler
in one of the next patch series by adding a compat_ioctl method to
struct net_device but for now let's just kill it - parisc doesn't
compile in mainline anyway and I don't want this to block this
patchset.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The comment in compat.c is wrong, every architecture provides a
get_compat_sigevent() for the IPC compat code already.
This basically moves the x86_64 version to common code and removes all the
others.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
)
From: Adrian Bunk <bunk@stusta.de>
- create one common dump_thread() prototype in kernel.h
- dump_thread() is only used in fs/binfmt_aout.c and can therefore be
removed on all architectures where CONFIG_BINFMT_AOUT is not
available
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't clobber register %l0 while checking TI_SYS_NOERROR value in
syscall return path. This bug was introduced by:
db7d9a4eb7
Problem narrowed down by Luis F. Ortiz and Richard Mortimer.
I tried using %l2 as suggested by Luis and that works for me.
Looking at the code I wonder if it makes sense to simplify the code
a little bit. The following works for me but I'm not sure how to
exercise the "NOERROR" codepath.
Signed-off-by: David S. Miller <davem@davemloft.net>
The ptrace_get_task_struct() helper that I added as part of the ptrace
consolidation is useful in variety of places that currently opencode it.
Switch them to the common helpers.
Add a ptrace_traceme() helper that needs to be explicitly called, and simplify
the ptrace_get_task_struct() interface. We don't need the request argument
now, and we return the task_struct directly, using ERR_PTR() for error
returns. It's a bit more code in the callers, but we have two sane routines
that do one thing well now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It's definition is wrong (-1 means "no limit" not 999),
only the Sparc SunOS/Solaris compat code uses it, so
let's just kill it off completely from limits.h and
all referencing code.
Noticed by Ulrich Drepper.
Signed-off-by: David S. Miller <davem@davemloft.net>
When multiple probes are registered at the same address and if due to some
recursion (probe getting triggered within a probe handler), we skip calling
pre_handlers and just increment nmissed field.
The below patch make sure it walks the list for multiple probes case.
Without the below patch we get incorrect results of nmissed count for
multiple probe case.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Earlier I unifdefed PageCompound, so that snd_pcm_mmap_control_nopage and
others can give out a 0-order component of a higher-order page, which won't
be mistakenly freed when zap_pte_range unmaps it. But many Bad page states
reported a PG_reserved was freed after all: I had missed that we need to
say __GFP_COMP to get compound page behaviour.
Some of these higher-order pages are allocated by snd_malloc_pages, some by
snd_malloc_dev_pages; or if SBUS, by sbus_alloc_consistent - but that has
no gfp arg, so add __GFP_COMP into its sparc32/64 implementations.
I'm still rather puzzled that DRM seems not to need a similar change.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds a new function, sbusfb_compat_ioctl() to
drivers/video/sbuslib.c and uses it as compat_ioctl in all sbus fb
drivers
This remove the last per-arch compat ioctl bits in
arch/sparc64/kernel/ioctl32.c so it would be nice if people could test
if this actually copiles and works and if yes apply it :)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Noticed by Tom 'spot' Callaway.
Even on uniprocessor we always reported the number of physical
cpus in the system via /proc/cpuinfo. But when this got changed
to use num_possible_cpus() it always reads as "1" on uniprocessor.
This change was unintentional.
So scan the firmware device tree and count the number of cpu
nodes, and report that, as we always did.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use ARRAY_SIZE macro instead of sizeof(x)/sizeof(x[0]) and remove a
duplicate of ARRAY_SIZE which is never used anyways.
Signed-off-by: Tobias Klauser <tklauser@nuerscht.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make some changes to the NEED_RESCHED and POLLING_NRFLAG to reduce
confusion, and make their semantics rigid. Improves efficiency of
resched_task and some cpu_idle routines.
* In resched_task:
- TIF_NEED_RESCHED is only cleared with the task's runqueue lock held,
and as we hold it during resched_task, then there is no need for an
atomic test and set there. The only other time this should be set is
when the task's quantum expires, in the timer interrupt - this is
protected against because the rq lock is irq-safe.
- If TIF_NEED_RESCHED is set, then we don't need to do anything. It
won't get unset until the task get's schedule()d off.
- If we are running on the same CPU as the task we resched, then set
TIF_NEED_RESCHED and no further action is required.
- If we are running on another CPU, and TIF_POLLING_NRFLAG is *not* set
after TIF_NEED_RESCHED has been set, then we need to send an IPI.
Using these rules, we are able to remove the test and set operation in
resched_task, and make clear the previously vague semantics of
POLLING_NRFLAG.
* In idle routines:
- Enter cpu_idle with preempt disabled. When the need_resched() condition
becomes true, explicitly call schedule(). This makes things a bit clearer
(IMO), but haven't updated all architectures yet.
- Many do a test and clear of TIF_NEED_RESCHED for some reason. According
to the resched_task rules, this isn't needed (and actually breaks the
assumption that TIF_NEED_RESCHED is only cleared with the runqueue lock
held). So remove that. Generally one less locked memory op when switching
to the idle thread.
- Many idle routines clear TIF_POLLING_NRFLAG, and only set it in the inner
most polling idle loops. The above resched_task semantics allow it to be
set until before the last time need_resched() is checked before going into
a halt requiring interrupt wakeup.
Many idle routines simply never enter such a halt, and so POLLING_NRFLAG
can be always left set, completely eliminating resched IPIs when rescheduling
the idle task.
POLLING_NRFLAG width can be increased, to reduce the chance of resched IPIs.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Run idle threads with preempt disabled.
Also corrected a bugs in arm26's cpu_idle (make it actually call schedule()).
How did it ever work before?
Might fix the CPU hotplugging hang which Nigel Cunningham noted.
We think the bug hits if the idle thread is preempted after checking
need_resched() and before going to sleep, then the CPU offlined.
After calling stop_machine_run, the CPU eventually returns from preemption and
into the idle thread and goes to sleep. The CPU will continue executing
previous idle and have no chance to call play_dead.
By disabling preemption until we are ready to explicitly schedule, this bug is
fixed and the idle threads generally become more robust.
From: alexs <ashepard@u.washington.edu>
PPC build fix
From: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
MIPS build fix
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some architectures define and use this type in their compat_ioctl code, but
all of them can easily use the identical ioctl_trans_handler_t type that is
defined in common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
drivers/drm/ now implements proper ->compat_ioctl methods, so this isn't
needed anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
all ioctls are 32bit compat clean, so the driver can use ->compat_ioctl
and ->unlocked_ioctl easily.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
implement a compat_ioctl handle in the driver instead of having table
entries in sparc64 ioctl32.c (I plan to get rid of the arch ioctl32.c
file eventually)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
all the ioctls in the driver are 32bit compat clean and don't need BKL,
so we can switch it to ->unlocked_ioctl and ->compat_ioctl trivially.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Would you mind applying the following patch that kills those two + the
m68k and Documentation/ references?
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
all these are handled by fs/compat_ioctls.c already.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
I don't know if we ever implemented this, but the only user in any 2.6
tree are the compat ioctls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The old keyboard driver is gone in 2.6, so the only user left are the
compat ioctls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The old sound drivers are gone in 2.6, so the only user left are the
compat ioctls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
this inline routine in arch/sparc64/kernel/ioctl32.c is completely
unused and superceeded by compat_alloc_user_space()
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
It only serves to generate false-positive buildcheck warnings.
Just set it initially to tick_operations which uses the v9
%tick register which every sparc64 processor has.
Signed-off-by: David S. Miller <davem@davemloft.net>
It isn't needed any longer, as noted by Hugh Dickins.
We still need the flush routines, due to the one remaining
call site in hugetlb_prefault_arch_hook(). That can be
eliminated at some later point, however.
Signed-off-by: David S. Miller <davem@davemloft.net>
sparc64 is unique among architectures in taking the page_table_lock in
its context switch (well, cris does too, but erroneously, and it's not
yet SMP anyway).
This seems to be a private affair between switch_mm and activate_mm,
using page_table_lock as a per-mm lock, without any relation to its uses
elsewhere. That's fine, but comment it as such; and unlock sooner in
switch_mm, more like in activate_mm (preemption is disabled here).
There is a block of "if (0)"ed code in smp_flush_tlb_pending which would
have liked to rely on the page_table_lock, in switch_mm and elsewhere;
but its comment explains how dup_mmap's flush_tlb_mm defeated it. And
though that could have been changed at any time over the past few years,
now the chance vanishes as we push the page_table_lock downwards, and
perhaps split it per page table page. Just delete that block of code.
Which leaves the mysterious spin_unlock_wait(&oldmm->page_table_lock)
in kernel/fork.c copy_mm. Textual analysis (supported by Nick Piggin)
suggests that the comment was written by DaveM, and that it relates to
the defeated approach in the sparc64 smp_flush_tlb_pending. Just delete
this block too.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sparc64 prom_callback and new_setup_frame32 each operates on a user page
table without holding lock, and no doubt they've good reason. But I'd
feel more confident if they were to do a "pte = *ptep" and then operate
on pte, rather than re-evaluating *ptep.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the arch/ part of the big kfree cleanup patch.
Remove pointless checks for NULL prior to calling kfree() in arch/.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Grant Grundler <grundler@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reorganize the preempt_disable/enable calls to eliminate the extra preempt
depth. Changes based on Paul McKenney's review suggestions for the kprobes
RCU changeset.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Changes to the arch kprobes infrastructure to take advantage of the locking
changes introduced by usage of RCU for synchronization. All handlers are now
run without any locks held, so they have to be re-entrant or provide their own
synchronization.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sparc64 changes to track kprobe execution on a per-cpu basis. We now track
the kprobe state machine independently on each cpu using an arch specific
kprobe control block.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following set of patches are aimed at improving kprobes scalability. We
currently serialize kprobe registration, unregistration and handler execution
using a single spinlock - kprobe_lock.
With these changes, kprobe handlers can run without any locks held. It also
allows for simultaneous kprobe handler executions on different processors as
we now track kprobe execution on a per processor basis. It is now necessary
that the handlers be re-entrant since handlers can run concurrently on
multiple processors.
All changes have been tested on i386, ia64, ppc64 and x86_64, while sparc64
has been compile tested only.
The patches can be viewed as 3 logical chunks:
patch 1: Reorder preempt_(dis/en)able calls
patches 2-7: Introduce per_cpu data areas to track kprobe execution
patches 8-9: Use RCU to synchronize kprobe (un)registration and handler
execution.
Thanks to Maneesh Soni, James Keniston and Anil Keshavamurthy for their
review and suggestions. Thanks again to Anil, Hien Nguyen and Kevin Stafford
for testing the patches.
This patch:
Reorder preempt_disable/enable() calls in arch kprobes files in preparation to
introduce locking changes. No functional changes introduced by this patch.
Signed-off-by: Ananth N Mavinakayahanalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Define jiffies_64 in kernel/timer.c rather than having 24 duplicated
defines in each architecture.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
TIOCSTART and TIOCSTOP are defined in asm/ioctls.h and asm/termios.h by
various architectures but not actually implemented anywhere but in the IRIX
compatibility layer, so remove their COMPATIBLE_IOCTL from parisc, ppc64
and sparc64.
Move the TIOCSLTC COMPATIBLE_IOCTL to common code, guided by an ifdef to
only show up on architectures that support it (same as the code handling it
in tty_ioctl.c), aswell as it's brother TIOCGLTC that wasn't handled so
far.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but
that's not so good if an mm_counter_t is a special type. And how is rss
initialized? By set_mm_counter, all over the place. Come on, we just need to
initialize them both at once by set_mm_counter in mm_init (which follows the
memcpy when forking).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Doing a "SUNW,stop-self" firmware call on the other cpus is not the
correct thing to do when dropping into the firmware for a halt,
reboot, or power-off.
For now, just do nothing to quiet the other cpus, as the system should
be quiescent enough. Later we may decide to implement smp_send_stop()
like the other SMP platforms do.
Based upon a report from Christopher Zimmermann.
Signed-off-by: David S. Miller <davem@davemloft.net>
The hairy fast allocator in the sparc64 PCI IOMMU code
has a hard limit of 256 pages. Certain devices can
exceed this when performing very large I/Os.
So replace with a more simple allocator, based largely
upon the arch/ppc64/kernel/iommu.c code.
Signed-off-by: David S. Miller <davem@davemloft.net>
All the PCI controller drivers were doing the same thing
setting up the IOMMU software state, put it all in one spot.
Signed-off-by: David S. Miller <davem@davemloft.net>
The sequence to move over to the Linux trap tables from
the firmware ones needs to be more air tight. It turns
out that to be %100 safe we do need to be able to translate
OBP mappings in our TLB miss handlers early.
In order not to eat up a lot of kernel image memory with
static page tables, just use the translations array in
the OBP TLB miss handlers. That solves the bulk of the
problem.
Furthermore, to make sure the OBP TLB miss path will work
even before the fixed MMU globals are loaded, explicitly
load %g1 to TLB_SFSR at the beginning of the i-TLB and
d-TLB miss handlers.
To ease the OBP TLB miss walking of the prom_trans[] array,
we sort it then delete all of the non-OBP entries in there
(for example, there are entries for the kernel image itself
which we're not interested in at all).
We also save about 32K of kernel image size with this change.
Not a bad side effect :-)
There are still some reasons why trampoline.S can't use the
setup_trap_table() yet. The most noteworthy are:
1) OBP boots secondary processors with non-bias'd stack for
some reason. This is easily fixed by using a small bootup
stack in the kernel image explicitly for this purpose.
2) Doing a firmware call via the normal C call prom_set_trap_table()
goes through the whole OBP enter/exit sequence that saves and
restores OBP and Linux kernel state in the MMUs. This path
unfortunately does a "flush %g6" while loading up the OBP locked
TLB entries for the firmware call.
If we setup the %g6 in the trampoline.S code properly, that
is in the PAGE_OFFSET linear mapping, but we're not on the
kernel trap table yet so those addresses won't translate properly.
One idea is to do a by-hand firmware call like we do in the
early bootup code and elsewhere here in trampoline.S But this
fails as well, as aparently the secondary processors are not
booted with OBP's special locked TLB entries loaded. These
are necessary for the firwmare to processes TLB misses correctly
up until the point where we take over the trap table.
This does need to be resolved at some point.
Signed-off-by: David S. Miller <davem@davemloft.net>
We were not doing alignment properly when remapping the kernel image.
What we want is a 4MB aligned physical address to map at KERNBASE.
Mistakedly we were 4MB aligning the virtual address where the kernel
initially sits, that's wrong.
Instead, we should PAGE align the virtual address, then 4MB align the
physical address result the prom gives to us.
Signed-off-by: David S. Miller <davem@davemloft.net>
On the boot processor, we need to do the move onto the Linux trap
table a little bit differently else we'll take unhandlable faults in
the firmware address space.
Previously we would do the following:
1) Disable PSTATE_IE in %pstate.
2) Set %tba by hand to sparc64_ttable_tl0
3) Initialize alternate, mmu, and interrupt global
trap registers.
4) Call prom_set_traptable()
That doesn't work very well actually with the way we boot the kernel
VM these days. It worked by luck on many systems because the firmware
accesses for the prom_set_traptable() call happened to be loaded into
the TLB already, something we cannot assume.
So the new scheme is this:
1) Clear PSTATE_IE in %pstate and set %pil to 15
2) Call prom_set_traptable()
3) Initialize alternate, mmu, and interrupt global
trap registers.
and this works quite well. This sequence has been moved into a
callable function in assembler named setup-trap_table(). The idea is
that eventually trampoline.S can use this code as well. That isn't
possible currently due to some complications, but eventually we should
be able to do it.
Thanks to Meelis Roos for the Ultra5 boot failure report.
Signed-off-by: David S. Miller <davem@davemloft.net>
irq.c is missing the inclusion of asm/io.h, which causes
readb() and writeb() the be undefined.
Signed-off-by: Sven Hartge <hartge@ds9.argh.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to use stricter memory barriers around the block
load and store instructions we use to save and restore the
FPU register file.
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of code patching to handle the page size fields in
the context registers, just use variables from which we get
the proper values.
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Use cpudata cache line sizes, not magic constants.
2) Align start address in cheetah case so we do not get
unaligned address traps. (pgrep was good at triggering
this, via /proc/${pid}/cmdline accesses)
Signed-off-by: David S. Miller <davem@davemloft.net>
Delete all of the code working with sp_banks[] and replace
with clean acquisition and sorting of physical memory
parameters from the firmware.
Signed-off-by: David S. Miller <davem@davemloft.net>
We were not calling kernel_mna_trap_fault() correctly.
Instead of being fancy, just return 0 vs. -EFAULT from
the assembler stubs, and handle that return value as
appropriate.
Create an "__retl_efault" stub for assembler exception
table entries and use it where possible.
Signed-off-by: David S. Miller <davem@davemloft.net>
The funny "range" exception table entries we had were only
used by the compat layer socketcall assembly, and it wasn't
even needed there.
For free we now get proper exception table sorting and fast
binary searching.
Signed-off-by: David S. Miller <davem@davemloft.net>
Also, the us3_cpufreq driver can work on Ultra-IV and IV+.
They use the SAFARI bus register to control the clock divider
just like Ultra-III and III+ do.
Signed-off-by: David S. Miller <davem@davemloft.net>
At boot time, determine the D-cache, I-cache and E-cache size and
line-size. Use them in cache flushes when appropriate.
This change was motivated by discovering that the D-cache on
UltraSparc-IIIi and later are 64K not 32K, and the flushes done by the
Cheetah error handlers were assuming a 32K size.
There are still some pieces of code that are hard coding things and
will need to be fixed up at some point.
While we're here, fix the D-cache and I-cache parity error handlers
to run with interrupts disabled, and when the trap occurs at trap
level > 1 log the event via a counter displayed in /proc/cpuinfo.
Signed-off-by: David S. Miller <davem@davemloft.net>
The trick is that we do the kernel linear mapping TLB miss starting
with an instruction sequence like this:
ba,pt %xcc, kvmap_load
xor %g2, %g4, %g5
succeeded by an instruction sequence which performs a full page table
walk starting at swapper_pg_dir.
We first take over the trap table from the firmware. Then, using this
constant PTE generation for the linear mapping area above, we build
the kernel page tables for the linear mapping.
After this is setup, we patch that branch above into a "nop", which
will cause TLB misses to fall through to the full page table walk.
With this, the page unmapping for CONFIG_DEBUG_PAGEALLOC is trivial.
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of all of this cpu-specific code to remap the kernel
to the correct location, use portable firmware calls to do
this instead.
What we do now is the following in position independant
assembler:
chosen_node = prom_finddevice("/chosen");
prom_mmu_ihandle_cache = prom_getint(chosen_node, "mmu");
vaddr = 4MB_ALIGN(current_text_addr());
prom_translate(vaddr, &paddr_high, &paddr_low, &mode);
prom_boot_mapping_mode = mode;
prom_boot_mapping_phys_high = paddr_high;
prom_boot_mapping_phys_low = paddr_low;
prom_map(-1, 8 * 1024 * 1024, KERNBASE, paddr_low);
and that replaces the massive amount of by-hand TLB probing and
programming we used to do here.
The new code should also handle properly the case where the kernel
is mapped at the correct address already (think: future kexec
support).
Consequently, the bulk of remap_kernel() dies as does the entirety
of arch/sparc64/prom/map.S
We try to share some strings in the PROM library with the ones used
at bootup, and while we're here mark input strings to oplib.h routines
with "const" when appropriate.
There are many more simplifications now possible. For one thing, we
can consolidate the two copies we now have of a lot of cpu setup code
sitting in head.S and trampoline.S.
This is a significant step towards CONFIG_DEBUG_PAGEALLOC support.
Signed-off-by: David S. Miller <davem@davemloft.net>
This was kind of ugly, and actually buggy. The bug was that
we didn't handle a machine with memory starting > 4GB. If
the 'prompmd' was allocated in physical memory > 4GB we'd
croak because the obp_iaddr_patch and obp_daddr_patch things
only supported a 32-bit physical address.
So fix this by just loading the appropriate values from two
variables in the kernel image, which is locked into the TLB
and thus accesses to them can't cause a recursive TLB miss.
Signed-off-by: David S. Miller <davem@davemloft.net>
Arrange the modules, OBP, and vmalloc areas such that a range
verification can be done quite minimally.
Signed-off-by: David S. Miller <davem@davemloft.net>
This showed that arch/sparc64/kernel/ptrace.c was not getting
the define properly, and thus the code protected by this ifdef
was never actually compiled before. So fix that too.
Signed-off-by: David S. Miller <davem@davemloft.net>
Because we use byte loads/stores to cons up the value
in and out of registers, we can't expect the ASI endianness
setting to take care of this for us. So do it by hand.
This case is triggered by drivers/block/aoe/aoecmd.c in the
ataid_complete() function where it goes:
/* word 100: number lba48 sectors */
ssize = le64_to_cpup((__le64 *) &id[100<<1]);
This &id[100<<1] address is 4 byte, rather than 8 byte aligned,
thus triggering the unaligned exception.
Signed-off-by: David S. Miller <davem@davemloft.net>
Several implementations were essentialy a common piece of C code using
the cmpxchg() macro. Put the implementation in one spot that everyone
can share, and convert sparc64 over to using this.
Alpha is the lone arch-specific implementation, which codes up a
special fast path for the common case in order to avoid GP reloading
which a pure C version would require.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch (written by me and also containing many suggestions of Arjan van
de Ven) does a major cleanup of the spinlock code. It does the following
things:
- consolidates and enhances the spinlock/rwlock debugging code
- simplifies the asm/spinlock.h files
- encapsulates the raw spinlock type and moves generic spinlock
features (such as ->break_lock) into the generic code.
- cleans up the spinlock code hierarchy to get rid of the spaghetti.
Most notably there's now only a single variant of the debugging code,
located in lib/spinlock_debug.c. (previously we had one SMP debugging
variant per architecture, plus a separate generic one for UP builds)
Also, i've enhanced the rwlock debugging facility, it will now track
write-owners. There is new spinlock-owner/CPU-tracking on SMP builds too.
All locks have lockup detection now, which will work for both soft and hard
spin/rwlock lockups.
The arch-level include files now only contain the minimally necessary
subset of the spinlock code - all the rest that can be generalized now
lives in the generic headers:
include/asm-i386/spinlock_types.h | 16
include/asm-x86_64/spinlock_types.h | 16
I have also split up the various spinlock variants into separate files,
making it easier to see which does what. The new layout is:
SMP | UP
----------------------------|-----------------------------------
asm/spinlock_types_smp.h | linux/spinlock_types_up.h
linux/spinlock_types.h | linux/spinlock_types.h
asm/spinlock_smp.h | linux/spinlock_up.h
linux/spinlock_api_smp.h | linux/spinlock_api_up.h
linux/spinlock.h | linux/spinlock.h
/*
* here's the role of the various spinlock/rwlock related include files:
*
* on SMP builds:
*
* asm/spinlock_types.h: contains the raw_spinlock_t/raw_rwlock_t and the
* initializers
*
* linux/spinlock_types.h:
* defines the generic type and initializers
*
* asm/spinlock.h: contains the __raw_spin_*()/etc. lowlevel
* implementations, mostly inline assembly code
*
* (also included on UP-debug builds:)
*
* linux/spinlock_api_smp.h:
* contains the prototypes for the _spin_*() APIs.
*
* linux/spinlock.h: builds the final spin_*() APIs.
*
* on UP builds:
*
* linux/spinlock_type_up.h:
* contains the generic, simplified UP spinlock type.
* (which is an empty structure on non-debug builds)
*
* linux/spinlock_types.h:
* defines the generic type and initializers
*
* linux/spinlock_up.h:
* contains the __raw_spin_*()/etc. version of UP
* builds. (which are NOPs on non-debug, non-preempt
* builds)
*
* (included on UP-non-debug builds:)
*
* linux/spinlock_api_up.h:
* builds the _spin_*() APIs.
*
* linux/spinlock.h: builds the final spin_*() APIs.
*/
All SMP and UP architectures are converted by this patch.
arm, i386, ia64, ppc, ppc64, s390/s390x, x64 was build-tested via
crosscompilers. m32r, mips, sh, sparc, have not been tested yet, but should
be mostly fine.
From: Grant Grundler <grundler@parisc-linux.org>
Booted and lightly tested on a500-44 (64-bit, SMP kernel, dual CPU).
Builds 32-bit SMP kernel (not booted or tested). I did not try to build
non-SMP kernels. That should be trivial to fix up later if necessary.
I converted bit ops atomic_hash lock to raw_spinlock_t. Doing so avoids
some ugly nesting of linux/*.h and asm/*.h files. Those particular locks
are well tested and contained entirely inside arch specific code. I do NOT
expect any new issues to arise with them.
If someone does ever need to use debug/metrics with them, then they will
need to unravel this hairball between spinlocks, atomic ops, and bit ops
that exist only because parisc has exactly one atomic instruction: LDCW
(load and clear word).
From: "Luck, Tony" <tony.luck@intel.com>
ia64 fix
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjanv@infradead.org>
Signed-off-by: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se>
Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There were three changes necessary in order to allow
sparc64 to use setup-res.c:
1) Sparc64 roots the PCI I/O and MEM address space using
parent resources contained in the PCI controller structure.
I'm actually surprised no other platforms do this, especially
ones like Alpha and PPC{,64}. These resources get linked into the
iomem/ioport tree when PCI controllers are probed.
So the hierarchy looks like this:
iomem --|
PCI controller 1 MEM space --|
device 1
device 2
etc.
PCI controller 2 MEM space --|
...
ioport --|
PCI controller 1 IO space --|
...
PCI controller 2 IO space --|
...
You get the idea. The drivers/pci/setup-res.c code allocates
using plain iomem_space and ioport_space as the root, so that
wouldn't work with the above setup.
So I added a pcibios_select_root() that is used to handle this.
It uses the PCI controller struct's io_space and mem_space on
sparc64, and io{port,mem}_resource on every other platform to
keep current behavior.
2) quirk_io_region() is buggy. It takes in raw BUS view addresses
and tries to use them as a PCI resource.
pci_claim_resource() expects the resource to be fully formed when
it gets called. The sparc64 implementation would do the translation
but that's absolutely wrong, because if the same resource gets
released then re-claimed we'll adjust things twice.
So I fixed up quirk_io_region() to do the proper pcibios_bus_to_resource()
conversion before passing it on to pci_claim_resource().
3) I was mistakedly __init'ing the function methods the PCI controller
drivers provide on sparc64 to implement some parts of these
routines. This was, of course, easy to fix.
So we end up with the following, and that nasty SPARC64 makefile
ifdef in drivers/pci/Makefile is finally zapped.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Some PCI devices (e.g. 3c905B, 3c556B) lose all configuration
(including BARs) when transitioning from D3hot->D0. This leaves such
a device in an inaccessible state. The patch below causes the BARs
to be restored when enabling such a device, so that its driver will
be able to access it.
The patch also adds pci_restore_bars as a new global symbol, and adds a
correpsonding EXPORT_SYMBOL_GPL for that.
Some firmware (e.g. Thinkpad T21) leaves devices in D3hot after a
(re)boot. Most drivers call pci_enable_device very early, so devices
left in D3hot that lose configuration during the D3hot->D0 transition
will be inaccessible to their drivers.
Drivers could be modified to account for this, but it would
be difficult to know which drivers need modification. This is
especially true since often many devices are covered by the same
driver. It likely would be necessary to replicate code across dozens
of drivers.
The patch below should trigger only when transitioning from D3hot->D0
(or at boot), and only for devices that have the "no soft reset" bit
cleared in the PM control register. I believe it is safe to include
this patch as part of the PCI infrastructure.
The cleanest implementation of pci_restore_bars was to call
pci_update_resource. Unfortunately, that does not currently exist
for the sparc64 architecture. The patch below includes a null
implemenation of pci_update_resource for sparc64.
Some have expressed interest in making general use of the the
pci_restore_bars function, so that has been exported to GPL licensed
modules.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Since GCC has to emit a call and a delay slot to the
out-of-line "membar" routines in arch/sparc64/lib/mb.S
it is much better to just do the necessary predicted
branch inline instead as:
ba,pt %xcc, 1f
membar #whatever
1:
instead of the current:
call membar_foo
dslot
because this way GCC is not required to allocate a stack
frame if the function can be a leaf function.
This also makes this bug fix easier to backport to 2.4.x
Signed-off-by: David S. Miller <davem@davemloft.net>