Move "debug during resume from s2ram" into the variable we already use
for real-mode flags to simplify code. It also closes nasty trap for
the user in acpi_sleep_setup; order of parameters actually mattered there,
acpi_sleep=s3_bios,s3_mode doing something different from
acpi_sleep=s3_mode,s3_bios.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a feature allowing the user to make the system beep during a resume from
suspend to RAM, on x86_64 and i386.
This is useful for the users with broken resume from RAM, so that they can
verify if the control reaches the kernel after a wake-up event.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce the pm_power_off_prepare() callback that can be registered by the
interested platforms in analogy with pm_idle() and pm_power_off(), used for
preparing the system to power off (needed by ACPI).
This allows us to drop acpi_sysclass and device_acpi that are only defined in
order to register the ACPI power off preparation callback, which is needed by
pm_power_off() registered in a much different way.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it possible to register hibernation and suspend notifiers, so that
subsystems can perform hibernation-related or suspend-related operations that
should not be carried out by device drivers' .suspend() and .resume()
routines.
[akpm@linux-foundation.org: build fixes]
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kernel threads should not have TIF_FREEZE set when user space processes are
being frozen, since otherwise some of them might be frozen prematurely.
To prevent this from happening we can (1) make exit_mm() unset TIF_FREEZE
unconditionally just after clearing tsk->mm and (2) make try_to_freeze_tasks()
check if p->mm is different from zero and PF_BORROWED_MM is unset in p->flags
when user space processes are to be frozen.
Namely, when user space processes are being frozen, we only should set
TIF_FREEZE for tasks that have p->mm different from NULL and don't have
PF_BORROWED_MM set in p->flags. For this reason task_lock() must be used to
prevent try_to_freeze_tasks() from racing with use_mm()/unuse_mm(), in which
p->mm and p->flags.PF_BORROWED_MM are changed under task_lock(p). Also, we
need to prevent the following scenario from happening:
* daemonize() is called by a task spawned from a user space code path
* freezer checks if the task has p->mm set and the result is positive
* task enters exit_mm() and clears its TIF_FREEZE
* freezer sets TIF_FREEZE for the task
* task calls try_to_freeze() and goes to the refrigerator, which is wrong at
that point
This requires us to acquire task_lock(p) before p->flags.PF_BORROWED_MM and
p->mm are examined and release it after TIF_FREEZE is set for p (or it turns
out that TIF_FREEZE should not be set).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At least on some machines it is necessary to prepare the ACPI firmware for the
restoration of the system memory state from the hibernation image if the
"platform" mode of hibernation has been used. Namely, in that cases we need
to disable the GPEs before replacing the "boot" kernel with the "frozen"
kernel (cf. http://bugzilla.kernel.org/show_bug.cgi?id=7887). After the
restore they will be re-enabled by hibernation_ops->finish(), but if the
restore fails, they have to be re-enabled by the restore code explicitly.
For this purpose we can introduce two additional hibernation operations,
called pre_restore() and restore_cleanup() and call them from the restore code
path. Still, they should be called if the "platform" mode of hibernation has
been used, so we need to pass the information about the hibernation mode from
the "frozen" kernel to the "boot" kernel in the image header.
Apparently, we can't drop the disabling of GPEs before the restore because of
Bug #7887 . We also can't do it unconditionally, because the GPEs wouldn't
have been enabled after a successful restore if the suspend had been done in
the 'shutdown' or 'reboot' mode.
In principle we could (and probably should) unconditionally disable the GPEs
before each snapshot creation *and* before the restore, but then we'd have to
unconditionally enable them after the snapshot creation as well as after the
restore (or restore failure) Still, for this purpose we'd need to modify
acpi_enter_sleep_state_prep() and acpi_leave_sleep_state() and we'd have to
introduce some mechanism synchronizing the disablind/enabling of the GPEs with
the device drivers' .suspend()/.resume() routines and with
disable_/enable_nonboot_cpus(). However, this would have affected the
suspend (ie. s2ram) code as well as the hibernation, which I'd like to avoid
in this patch series.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_zeroed_user_highpage() has no in-tree users and it is not exported.
As it is not exported, it can simply be removed.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch completes Linus's wish that the fault return codes be made into
bit flags, which I agree makes everything nicer. This requires requires
all handle_mm_fault callers to be modified (possibly the modifications
should go further and do things like fault accounting in handle_mm_fault --
however that would be for another patch).
[akpm@linux-foundation.org: fix alpha build]
[akpm@linux-foundation.org: fix s390 build]
[akpm@linux-foundation.org: fix sparc build]
[akpm@linux-foundation.org: fix sparc64 build]
[akpm@linux-foundation.org: fix ia64 build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Still apparently needs some ARM and PPC loving - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change ->fault prototype. We now return an int, which contains
VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
FAULT_RET_ code tells the VM whether a page was found, whether it has been
locked, and potentially other things. This is not quite the way he wanted
it yet, but that's changed in the next patch (which requires changes to
arch code).
This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
that a page is locked which requires filemap_nopage to go away (because we
can no longer remain backward compatible without that flag), but we were
going to do that anyway.
struct fault_data is renamed to struct vm_fault as Linus asked. address
is now a void __user * that we should firmly encourage drivers not to use
without really good reason.
The page is now returned via a page pointer in the vm_fault struct.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
the virtual address -> file offset differently from linear mappings.
->populate is a layering violation because the filesystem/pagecache code
should need to know anything about the virtual memory mapping. The hitch here
is that the ->nopage handler didn't pass down enough information (ie. pgoff).
But it is more logical to pass pgoff rather than have the ->nopage function
calculate it itself anyway (because that's a similar layering violation).
Having the populate handler install the pte itself is likewise a nasty thing
to be doing.
This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn. Most of the old mechanism is still in place
so there is a lot of duplication and nice cleanups that can be removed if
everyone switches over.
The rationale for doing this in the first place is that nonlinear mappings are
subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
to duplicate the synchronisation logic rather than just consolidate the two.
After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache. Seems like a fringe functionality anyway.
NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
users have hit mainline yet.
[akpm@linux-foundation.org: cleanup]
[randy.dunlap@oracle.com: doc. fixes for readahead]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://linux-nfs.org/~bfields/linux:
locks: fix vfs_test_lock() comment
locks: make posix_test_lock() interface more consistent
nfs: disable leases over NFS
gfs2: stop giving out non-cluster-coherent leases
locks: export setlease to filesystems
locks: provide a file lease method enabling cluster-coherent leases
locks: rename lease functions to reflect locks.c conventions
locks: share more common lease code
locks: clean up lease_alloc()
locks: convert an -EINVAL return to a BUG
leases: minor break_lease() comment clarification
Since posix_test_lock(), like fcntl() and ->lock(), indicates absence or
presence of a conflict lock by setting fl_type to, respectively, F_UNLCK
or something other than F_UNLCK, the return value is no longer needed.
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Currently leases are only kept locally, so there's no way for a distributed
filesystem to enforce them against multiple clients. We're particularly
interested in the case of nfsd exporting a cluster filesystem, in which
case nfsd needs cluster-coherent leases in order to implement delegations
correctly.
Also add some documentation.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
We've been using the convention that vfs_foo is the function that calls
a filesystem-specific foo method if it exists, or falls back on a
generic method if it doesn't; thus vfs_foo is what is called when some
other part of the kernel (normally lockd or nfsd) wants to get a lock,
whereas foo is what filesystems call to use the underlying local
functionality as part of their lock implementation.
So rename setlease to vfs_setlease (which will call a
filesystem-specific setlease after a later patch) and __setlease to
setlease.
Also, vfs_setlease need only be GPL-exported as long as it's only needed
by lockd and nfsd.
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
This interface allows the ability to write the majority of a driver in
userspace with only a very small shell of a driver in the kernel itself.
It uses a char device and sysfs to interact with a userspace process to
process interrupts and control memory accesses.
See the docbook documentation for more details on how to use this
interface.
From: Hans J. Koch <hjk@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benedikt Spranger <b.spranger@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This defines a dev_vdbg() call, which is enabled with -DVERBOSE_DEBUG.
When enabled, dev_vdbg() acts just like dev_dbg(). When disabled, it is a
NOP ... just like dev_dbg() without -DDEBUG. The specific code was moved
out of a USB patch, but lots of drivers have similar support.
That is, code can now be written to use an additional level of debug
output, selected at compile time. Many driver authors have found this
idiom to be very useful. A typical usage model is for "normal" debug
messages to focus on fault paths and not be very "chatty", so that those
messages can be left on during normal operation without much of a
performance or syslog load. On the other hand "verbose" messages would be
noisy enough that they wouldn't normally be enabled; they might even affect
timings enough to change system or driver behavior.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch (as933) removes the deprecated dpm_runtime_suspend() and
dpm_runtime_resume() routines from the PM core. The only user of
those routines is the PCMCIA ds driver; local replacements are added.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This allows the uevent file to handle any type of uevent action to be
triggered by userspace instead of just the "add" uevent.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The TSEC/eTSEC can detect the interface to the PHY automatically,
but it isn't able to detect whether the RGMII connection needs internal
delay. So we need to detect that change in the device tree, propagate
it to the platform data, and then check it if we're in RGMII. This fixes
a bug on the 8641D HPCN board where the Vitesse PHY doesn't use the delay
for RGMII.
Signed-off-by: Andy Fleming <afleming@freescale.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hskinnemoen/avr32-2.6:
[AVR32] Initialize phy_mask for both macb devices
[AVR32] Fix atomic_add_unless() and atomic_sub_unless()
[AVR32] Correct misspelled CONFIG_BLK_DEV_INITRD variable.
[AVR32] Fix build error in parse_tag_rdimg()
[AVR32] Don't wire up macb0 unless SW6 is in default position
[AVR32] Wire up SSC platform device 0 as TX on ATSTK1000 board
[AVR32] Add Atmel SSC driver platform device to AT32AP architecture
[AVR32] Remove optimization of unaligned word loads
[AVR32] Make STK1000 mux settings configurable
[AVR32] CPU frequency scaling for AT32AP
[AVR32] Split SM device into PM, RTC, WDT and EIC
[AVR32] faster avr32 unaligned access
These functions depend on "result" being initalized to 0, but "result"
is not included as an input constraint to the inline assembly block
following its initialization, only as an output constraint. Thus gcc
thinks it doesn't need to initialize it, so result ends up undefined
if the "unless" condition is true.
This fixes an oops in sunrpc where the faulty atomics caused
rpciod_up() to not start the workqueue as it should.
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
This patch adds register definitions, clocks and IRQs to the platform devices.
Signed-off-by: Hans-Christian Egtvedt <hcegtvedt@atmel.com>
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
If we let unaligned word loads bypass the generic unaligned handling,
gcc may combine it with a swap.b instruction and turn it into a ldwsp
instruction, which does not work with unaligned addresses.
Revert the optimization to prevent the RNDIS driver from crashing.
Hopefully we'll figure something out later (it may be better to do the
optimization in gcc.)
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Split the SM platform device into separate platform devices for PM,
RTC, WDT and EIC. This is more correct according to the documentation
and allows us to simplify the code a little.
Also turn the EIC driver into a real platform driver.
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Acked-by: Hans-Christian Egtvedt <hcegtvedt@atmel.com>
Use a more conventional implementation for unaligned access, and include
an AT32AP-specific optimization: the CPU will handle unaligned words.
The result is always faster and smaller for 8, 16, and 32 bit values.
For 64 bit quantities, it's presumably larger.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
* 'master' of ssh://master.kernel.org/pub/scm/linux/kernel/git/mchehab/v4l-dvb: (126 commits)
V4L/DVB (5847): Clean up schedule_timeout calls in cpia2 and ivtv code
V4L/DVB (5846): Clean up setting state and scheduling timeouts
V4L/DVB (5844): ivtv: add high volume debugging flag
V4L/DVB (5843): ivtv: fix missing signal_pending check.
V4L/DVB (5842): ivtv: Add locking to ensure stream setup is atomic.
V4L/DVB (5841): tveeprom: add support for Philips FQ1216LME MK3 tuner.
V4L/DVB (5840): fix dst and cx24123: tune() callback changed signess for delay
V4L/DVB (5838): dvb-core: Fix signedness warnings (gcc 4.1.1, kernel 2.6.22)
V4L/DVB (5837): stv0299: Fix signedness warning (gcc 4.1.1, kernel 2.6.22)
V4L/DVB (5836): dvb-ttpci: re-initialize aspect ratio and pan scan after arm crash
V4L/DVB (5835): saa7146/dvb-ttpci: Fix signedness warnings (gcc 4.1.1, kernel 2.6.22)
V4L/DVB (5834): dvb-core: fix signedness warnings and const stripping
V4L/DVB (5832): ir-common: optimize bit extract function
V4L/DVB (5831): stradis: use ARRAY_SIZE
V4L/DVB (5829): Firmware extract and loading for opera dvb-usb update
V4L/DVB (5828): Kconfig: Added GemTek USB radio and removed experimental dependency.
V4L/DVB (5826): Usbvision: video mux cleanup
V4L/DVB (5825): Alter the tuner type for the WinTV USB UK PAL model.
V4L/DVB (5824): Usbvision: Hauppauge WinTV USB SECAM_L fix
V4L/DVB (5821): Saa7134: add remote control support for LifeView FlyDVB-S LR300
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: extent macros cleanup
Fix compilation with EXT_DEBUG, also fix leXX_to_cpu conversions.
ext4: remove extra IS_RDONLY() check
ext4: Use is_power_of_2()
Use zero_user_page() in ext4 where possible
ext4: Remove 65000 subdirectory limit
ext4: Expand extra_inodes space per the s_{want,min}_extra_isize fields
ext4: Add nanosecond timestamps
jbd2: Move jbd2-debug file to debugfs
jbd2: Fix CONFIG_JBD_DEBUG ifdef to be CONFIG_JBD2_DEBUG
ext4: Set the journal JBD2_FEATURE_INCOMPAT_64BIT on large devices
ext4: Make extents code sanely handle on-disk corruption
ext4: copy i_flags to inode flags on write
ext4: Enable extents by default
Change on-disk format to support 2^15 uninitialized extents
write support for preallocated blocks
fallocate support in ext4
sys_fallocate() implementation on i386, x86_64 and powerpc
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (24 commits)
[NETFILTER]: xt_connlimit needs to depend on nf_conntrack
[NETFILTER]: ipt_iprange.h must #include <linux/types.h>
[IrDA]: Fix IrDA build failure
[ATM]: nicstar needs virt_to_bus
[NET]: move __dev_addr_discard adjacent to dev_addr_discard for readability
[NET]: merge dev_unicast_discard and dev_mc_discard into one
[NET]: move dev_mc_discard from dev_mcast.c to dev.c
[NETLINK]: negative groups in netlink_setsockopt
[PPPOL2TP]: Reset meta-data in xmit function
[PPPOL2TP]: Fix use-after-free
[PKT_SCHED]: Some typo fixes in net/sched/Kconfig
[XFRM]: Fix crash introduced by struct dst_entry reordering
[TCP]: remove unused argument to cong_avoid op
[ATM]: [idt77252] Rename CONFIG_ATM_IDT77252_SEND_IDLE to not resemble a Kconfig variable
[ATM]: [drivers] ioremap balanced with iounmap
[ATM]: [lanai] sram_test_word() must be __devinit
[ATM]: [nicstar] Replace C code with call to ARRAY_SIZE() macro.
[ATM]: Eliminate dead config variable CONFIG_BR2684_FAST_TRANS.
[ATM]: Replacing kmalloc/memset combination with kzalloc.
[NET]: gen_estimator deadlock fix
...
Move tuner callback function pointers out of struct tuner, into
struct tuner_operations.
Signed-off-by: Michael Krufky <mkrufky@linuxtv.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Individual tuner drivers are now allocating memory themselves for
their own private data structures. This changeset adds a release
callback to the tuner operations, so that newer drivers that may
require more complex data structures may release this private data
themselves.
Signed-off-by: Michael Krufky <mkrufky@linuxtv.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Create private data struct for device specific private data.
Signed-off-by: Michael Krufky <mkrufky@linuxtv.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
[SPARC64]: Set vio->desc_buf to NULL after freeing.
[SPARC]: Mark sparc and sparc64 as not having virt_to_bus
[SPARC64]: Fix reset handling in VNET driver.
[SPARC64]: Handle reset events in vio_link_state_change().
[SPARC64]: Handle LDC resets properly in domain-services driver.
[SPARC64]: Massively simplify VIO device layer and support hot add/remove.
[SPARC64]: Simplify VNET probing.
[SPARC64]: Simplify VDC device probing.
[SPARC64]: Add basic infrastructure for MD add/remove notification.
An experimental patch for Xen allows guests to place their vcpu_info
structs anywhere. We try to use this to place the vcpu_info into the
PDA, which allows direct access.
If this works, then switch to using direct access operations for
irq_enable, disable, save_fl and restore_fl.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Keir Fraser <keir@xensource.com>
The block device frontend driver allows the kernel to access block
devices exported exported by a virtual machine containing a physical
block device driver.
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
This communicates with the machine control software via a registry
residing in a controlling virtual machine. This allows dynamic
creation, destruction and modification of virtual device
configurations (network devices, block devices and CPUS, to name some
examples).
[ Greg, would you mind giving this a review? Thanks -J ]
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Greg KH <greg@kroah.com>
Add Xen 'grant table' driver which allows granting of access to
selected local memory pages by other virtual machines and,
symmetrically, the mapping of remote memory pages which other virtual
machines have granted access to.
This driver is a prerequisite for many of the Xen virtual device
drivers, which grant the 'device driver domain' restricted and
temporary access to only those memory pages that are currently
involved in I/O operations.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Implement a Xen back-end for hvc console.
* * *
Add early printk support via hvc console, enable using
"earlyprintk=xen" on the kernel command line.
From: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Olof Johansson <olof@lixom.net>
This is a fairly straightforward Xen implementation of smp_ops.
Xen has its own IPI mechanisms, and has no dependency on any
APIC-based IPI. The smp_ops hooks and the flush_tlb_others pv_op
allow a Xen guest to avoid all APIC code in arch/i386 (the only apic
operation is a single apic_read for the apic version number).
One subtle point which needs to be addressed is unpinning pagetables
when another cpu may have a lazy tlb reference to the pagetable. Xen
will not allow an in-use pagetable to be unpinned, so we must find any
other cpus with a reference to the pagetable and get them to shoot
down their references.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Add a new definition for PG_owner_priv_1 to define PG_pinned on Xen
pagetable pages.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Xen implements interrupts in terms of event channels. Each guest
domain gets 1024 event channels which can be used for a variety of
purposes, such as Xen timer events, inter-domain events,
inter-processor events (IPI) or for real hardware IRQs.
Within the kernel, we map the event channels to IRQs, and implement
the whole interrupt handling using a Xen irq_chip.
Rather than setting NR_IRQ to 1024 under PARAVIRT in order to
accomodate Xen, we create a dynamic mapping between event channels and
IRQs. Ideally, Linux will eventually move towards dynamically
allocating per-irq structures, and we can use a 1:1 mapping between
event channels and irqs.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric W. Biederman <ebiederm@xmission.com>
This patch is a rollup of all the core pieces of the Xen
implementation, including:
- booting and setup
- pagetable setup
- privileged instructions
- segmentation
- interrupt flags
- upcalls
- multicall batching
BOOTING AND SETUP
The vmlinux image is decorated with ELF notes which tell the Xen
domain builder what the kernel's requirements are; the domain builder
then constructs the address space accordingly and starts the kernel.
Xen has its own entrypoint for the kernel (contained in an ELF note).
The ELF notes are set up by xen-head.S, which is included into head.S.
In principle it could be linked separately, but it seems to provoke
lots of binutils bugs.
Because the domain builder starts the kernel in a fairly sane state
(32-bit protected mode, paging enabled, flat segments set up), there's
not a lot of setup needed before starting the kernel proper. The main
steps are:
1. Install the Xen paravirt_ops, which is simply a matter of a
structure assignment.
2. Set init_mm to use the Xen-supplied pagetables (analogous to the
head.S generated pagetables in a native boot).
3. Reserve address space for Xen, since it takes a chunk at the top
of the address space for its own use.
4. Call start_kernel()
PAGETABLE SETUP
Once we hit the main kernel boot sequence, it will end up calling back
via paravirt_ops to set up various pieces of Xen specific state. One
of the critical things which requires a bit of extra care is the
construction of the initial init_mm pagetable. Because Xen places
tight constraints on pagetables (an active pagetable must always be
valid, and must always be mapped read-only to the guest domain), we
need to be careful when constructing the new pagetable to keep these
constraints in mind. It turns out that the easiest way to do this is
use the initial Xen-provided pagetable as a template, and then just
insert new mappings for memory where a mapping doesn't already exist.
This means that during pagetable setup, it uses a special version of
xen_set_pte which ignores any attempt to remap a read-only page as
read-write (since Xen will map its own initial pagetable as RO), but
lets other changes to the ptes happen, so that things like NX are set
properly.
PRIVILEGED INSTRUCTIONS AND SEGMENTATION
When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
This means that it is more privileged than user-mode in ring 3, but it
still can't run privileged instructions directly. Non-performance
critical instructions are dealt with by taking a privilege exception
and trapping into the hypervisor and emulating the instruction, but
more performance-critical instructions have their own specific
paravirt_ops. In many cases we can avoid having to do any hypercalls
for these instructions, or the Xen implementation is quite different
from the normal native version.
The privileged instructions fall into the broad classes of:
Segmentation: setting up the GDT and the GDT entries, LDT,
TLS and so on. Xen doesn't allow the GDT to be directly
modified; all GDT updates are done via hypercalls where the new
entries can be validated. This is important because Xen uses
segment limits to prevent the guest kernel from damaging the
hypervisor itself.
Traps and exceptions: Xen uses a special format for trap entrypoints,
so when the kernel wants to set an IDT entry, it needs to be
converted to the form Xen expects. Xen sets int 0x80 up specially
so that the trap goes straight from userspace into the guest kernel
without going via the hypervisor. sysenter isn't supported.
Kernel stack: The esp0 entry is extracted from the tss and provided to
Xen.
TLB operations: the various TLB calls are mapped into corresponding
Xen hypercalls.
Control registers: all the control registers are privileged. The most
important is cr3, which points to the base of the current pagetable,
and we handle it specially.
Another instruction we treat specially is CPUID, even though its not
privileged. We want to control what CPU features are visible to the
rest of the kernel, and so CPUID ends up going into a paravirt_op.
Xen implements this mainly to disable the ACPI and APIC subsystems.
INTERRUPT FLAGS
Xen maintains its own separate flag for masking events, which is
contained within the per-cpu vcpu_info structure. Because the guest
kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
ignored (and must be, because even if a guest domain disables
interrupts for itself, it can't disable them overall).
(A note on terminology: "events" and interrupts are effectively
synonymous. However, rather than using an "enable flag", Xen uses a
"mask flag", which blocks event delivery when it is non-zero.)
There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
are implemented to manage the Xen event mask state. The only thing
worth noting is that when events are unmasked, we need to explicitly
see if there's a pending event and call into the hypervisor to make
sure it gets delivered.
UPCALLS
Xen needs a couple of upcall (or callback) functions to be implemented
by each guest. One is the event upcalls, which is how events
(interrupts, effectively) are delivered to the guests. The other is
the failsafe callback, which is used to report errors in either
reloading a segment register, or caused by iret. These are
implemented in i386/kernel/entry.S so they can jump into the normal
iret_exc path when necessary.
MULTICALL BATCHING
Xen provides a multicall mechanism, which allows multiple hypercalls
to be issued at once in order to mitigate the cost of trapping into
the hypervisor. This is particularly useful for context switches,
since the 4-5 hypercalls they would normally need (reload cr3, update
TLS, maybe update LDT) can be reduced to one. This patch implements a
generic batching mechanism for hypercalls, which gets used in many
places in the Xen code.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Ian Pratt <ian.pratt@xensource.com>
Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Cc: Adrian Bunk <bunk@stusta.de>