Commit graph

141 commits

Author SHA1 Message Date
Martin Schwidefsky
9e293b5a70 s390,kvm: provide plumbing for machines checks when running guests
This provides the basic plumbing for handling machine checks when
 running guests
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJZU4QPAAoJEBF7vIC1phx8GZsP/2P4nxWXBj0NS/dNq54/u7HU
 Va/zHIG7nUX81WZi8OCkPRlvb1RlcgNpIdw3Ar+BueFE6/qwVWBSdstVJCg6JSn4
 L8T1srSeV6yQEPq1/I9S8ERYtbC8bOC3dDF6g+KyaKYnICjq5yC01+86MKSVfLTI
 vFMPWY/PPCgECtXHjGpWBW6HjofRH3/H+XQbxaoTUyHKwWKdtvWer9K2V7Mc/Cf8
 XsyLY2Xq0Y5MBsJs+71Qw8+0R041Et5I3H7Od9lIc3SFYNoenQpk5oTtsujMtDG1
 ccMPZKErYI4wHE3Hy1ozK+MdFNbepUk3RBI3oXU25tpFPG3OPuksnOqCVN/iZmm+
 le9RuUi9WOOsuygPj2dsnx5v+aheedEcYWqvQ/qrNlP3pXNcpl+8waM6eke8HyCK
 1JKcqqGKBNX5wKNE9b5sRTHINWK12EVCQyVrgLlZaXoXLa40NpJPjtV27vr3ttVl
 WmGYgwMUTo15Rdr0NSJlXl8iCgIFtWMHvuRhIgp8pBuWWb28zr6aX4w++JPwOOMZ
 e4rzn55giCBDnjjDFQK2Knv5XxwnMKafYMxZXfC8gLr5ELjnI6vZDN+1zhT5L2S9
 uXd8l6rLN2qik57RzPV6YEDS0iybZnx5HF/ZPrNoFigJpdD7/0jFS5K5N0i+AhV5
 UQmGhSGnI7Teguc45mHT
 =CTzL
 -----END PGP SIGNATURE-----

Merge tag 'nmiforkvm' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into features

Pull kvm patches from Christian Borntraeger:
"s390,kvm: provide plumbing for machines checks when running guests"

This provides the basic plumbing for handling machine checks when
running guests
2017-06-28 12:57:47 +02:00
QingFeng Hao
c929500d7a s390/nmi: s390: New low level handling for machine check happening in guest
Add the logic to check if the machine check happens when the guest is
running. If yes, set the exit reason -EINTR in the machine check's
interrupt handler. Refactor s390_do_machine_check to avoid panicing
the host for some kinds of machine checks which happen
when guest is running.
Reinject the instruction processing damage's machine checks including
Delayed Access Exception instead of damaging the host if it happens
in the guest because it could be caused by improper update on TLB entry
or other software case and impacts the guest only.

Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-06-27 16:05:27 +02:00
Martin Schwidefsky
f044f4c588 s390/fpu: export save_fpu_regs for all configs
The save_fpu_regs function is a general API that is supposed to be
usable for modules as well. Remove the #ifdef that hides the symbol
for CONFIG_KVM=n.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-06-13 13:03:43 +02:00
Martin Schwidefsky
23fefe119c s390/kvm: avoid global config of vm.alloc_pgste=1
The system control vm.alloc_pgste is used to control the size of the
page tables, either 2K or 4K. The idea is that a KVM host sets the
vm.alloc_pgste control to 1 which causes *all* new processes to run
with 4K page tables. For a non-kvm system the control should stay off
to save on memory used for page tables.

Trouble is that distributions choose to set the control globally to
be able to run KVM guests. This wastes memory on non-KVM systems.

Introduce the PT_S390_PGSTE ELF segment type to "mark" the qemu
executable with it. All executables with this (empty) segment in
its ELF phdr array will be started with 4K page tables. Any executable
without PT_S390_PGSTE will run with the default 2K page tables.

This removes the need to set vm.alloc_pgste=1 for a KVM host and
minimizes the waste of memory for page tables.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-06-13 13:03:41 +02:00
Christian Borntraeger
c0e7bb38c0 s390/kvm: do not rely on the ILC on kvm host protection fauls
For most cases a protection exception in the host (e.g. copy
on write or dirty tracking) on the sie instruction will indicate
an instruction length of 4. Turns out that there are some corner
cases (e.g. runtime instrumentation) where this is not necessarily
true and the ILC is unpredictable.

Let's replace our 4 byte rewind_pad with 3 byte nops to prepare for
all possible ILCs.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-05-17 12:34:03 +02:00
Martin Schwidefsky
07a63cbe8b s390/cputime: fix incorrect system time
git commit c5328901aa "[S390] entry[64].S improvements" removed
the update of the exit_timer lowcore field from the critical section
cleanup of the .Lsysc_restore/.Lsysc_done and .Lio_restore/.Lio_done
blocks. If the PSW is updated by the critical section cleanup to point to
user space again, the interrupt entry code will do a vtime calculation
after the cleanup completed with an exit_timer value which has *not* been
updated. Due to this incorrect system time deltas are calculated.

If an interrupt occured with an old PSW between .Lsysc_restore/.Lsysc_done
or .Lio_restore/.Lio_done update __LC_EXIT_TIMER with the system entry
time of the interrupt.

Cc: stable@vger.kernel.org # 3.3+
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-05-03 09:08:57 +02:00
Linus Torvalds
76f1948a79 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatch updates from Jiri Kosina:

 - a per-task consistency model is being added for architectures that
   support reliable stack dumping (extending this, currently rather
   trivial set, is currently in the works).

   This extends the nature of the types of patches that can be applied
   by live patching infrastructure. The code stems from the design
   proposal made [1] back in November 2014. It's a hybrid of SUSE's
   kGraft and RH's kpatch, combining advantages of both: it uses
   kGraft's per-task consistency and syscall barrier switching combined
   with kpatch's stack trace switching. There are also a number of
   fallback options which make it quite flexible.

   Most of the heavy lifting done by Josh Poimboeuf with help from
   Miroslav Benes and Petr Mladek

   [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

 - module load time patch optimization from Zhou Chengming

 - a few assorted small fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: add missing printk newlines
  livepatch: Cancel transition a safe way for immediate patches
  livepatch: Reduce the time of finding module symbols
  livepatch: make klp_mutex proper part of API
  livepatch: allow removal of a disabled patch
  livepatch: add /proc/<pid>/patch_state
  livepatch: change to a per-task consistency model
  livepatch: store function sizes
  livepatch: use kstrtobool() in enabled_store()
  livepatch: move patching functions into patch.c
  livepatch: remove unnecessary object loaded check
  livepatch: separate enabled and patched states
  livepatch/s390: add TIF_PATCH_PENDING thread flag
  livepatch/s390: reorganize TIF thread flag bits
  livepatch/powerpc: add TIF_PATCH_PENDING thread flag
  livepatch/x86: add TIF_PATCH_PENDING thread flag
  livepatch: create temporary klp_update_patch_state() stub
  x86/entry: define _TIF_ALLWORK_MASK flags explicitly
  stacktrace/x86: add function for detecting reliable stack traces
2017-05-02 18:24:16 -07:00
Martin Schwidefsky
df26c2e87e s390/cpumf: simplify detection of guest samples
There are three different code levels in regard to the identification
of guest samples. They differ in the way the LPP instruction is used.

1) Old kernels without the LPP instruction. The guest program parameter
   is always zero.
2) Newer kernels load the process pid into the program parameter with LPP.
   The guest program parameter is non-zero if the guest executes in a
   process != idle.
3) The latest kernels load ((1UL << 31) | pid) with LPP to make the value
   non-zero even for the idle task. The guest program parameter is non-zero
   if the guest is running.

All kernels load the process pid to CR4 on context switch. The CPU sampling
code uses the value in CR4 to decide between guest and host samples in case
the guest program parameter is zero. The three cases:

1) CR4==pid, gpp==0
2) CR4==pid, gpp==pid
3) CR4==pid, gpp==((1UL << 31) | pid)

The load-control instruction to load the pid into CR4 is expensive and the
goal is to remove it. To distinguish the host CR4 from the guest pid for
the idle process the maximum value 0xffff for the PASN is used.
This adds a fourth case for a guest OS with an updated kernel:

4) CR4==0xffff, gpp=((1UL << 31) | pid)

The host kernel will have CR4==0xffff and will use (gpp!=0 || CR4!==0xffff)
to identify guest samples. This works nicely with all 4 cases, the only
possible issue would be a guest with an old kernel (gpp==0) and a process
pid of 0xffff. Well, don't do that..

Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-04-05 10:11:38 +02:00
Martin Schwidefsky
cab36c262e s390: use 64-bit lctlg to load task pid to cr4 on context switch
The 32-bit lctl instruction is quite a bit slower than the 64-bit
counter part lctlg. Use the faster instruction.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-04-05 07:35:14 +02:00
Martin Schwidefsky
916cda1aa1 s390: add a system call for guarded storage
This adds a new system call to enable the use of guarded storage for
user space processes. The system call takes two arguments, a command
and pointer to a guarded storage control block:

    s390_guarded_storage(int command, struct gs_cb *gs_cb);

The second argument is relevant only for the GS_SET_BC_CB command.

The commands in detail:

0 - GS_ENABLE
    Enable the guarded storage facility for the current task. The
    initial content of the guarded storage control block will be
    all zeros. After the enablement the user space code can use
    load-guarded-storage-controls instruction (LGSC) to load an
    arbitrary control block. While a task is enabled the kernel
    will save and restore the current content of the guarded
    storage registers on context switch.
1 - GS_DISABLE
    Disables the use of the guarded storage facility for the current
    task. The kernel will cease to save and restore the content of
    the guarded storage registers, the task specific content of
    these registers is lost.
2 - GS_SET_BC_CB
    Set a broadcast guarded storage control block. This is called
    per thread and stores a specific guarded storage control block
    in the task struct of the current task. This control block will
    be used for the broadcast event GS_BROADCAST.
3 - GS_CLEAR_BC_CB
    Clears the broadcast guarded storage control block. The guarded-
    storage control block is removed from the task struct that was
    established by GS_SET_BC_CB.
4 - GS_BROADCAST
    Sends a broadcast to all thread siblings of the current task.
    Every sibling that has established a broadcast guarded storage
    control block will load this control block and will be enabled
    for guarded storage. The broadcast guarded storage control block
    is used up, a second broadcast without a refresh of the stored
    control block with GS_SET_BC_CB will not have any effect.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-03-22 08:14:25 +01:00
Miroslav Benes
2f09ca60a5 livepatch/s390: add TIF_PATCH_PENDING thread flag
Update a task's patch state when returning from a system call or user
space interrupt, or after handling a signal.

This greatly increases the chances of a patch operation succeeding.  If
a task is I/O bound, it can be patched when returning from a system
call.  If a task is CPU bound, it can be patched when returning from an
interrupt.  If a task is sleeping on a to-be-patched function, the user
can send SIGSTOP and SIGCONT to force it to switch.

Since there are two ways the syscall can be restarted on return from a
signal handling process, it is important to clear the flag before
do_signal() is called. Otherwise we could miss the migration if we used
SIGSTOP/SIGCONT procedure or fake signal to migrate patching blocking
tasks. If we place our hook to sysc_work label in entry before
TIF_SIGPENDING is evaluated we kill two birds with one stone. The task
is correctly migrated in all return paths from a syscall.

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:22:40 +01:00
Martin Schwidefsky
d9fcf2a1cb s390: fix in-kernel program checks
A program check inside the kernel takes a slightly different path in
entry.S compare to a normal user fault. A recent change moved the store
of the breaking event address into the path taken for in-kernel program
checks as well, but %r14 has not been setup to point to the correct
location. A wild store is the consequence.

Move the store of the breaking event address to the code path for
user space faults.

Fixes: 34525e1f7e ("s390: store breaking event address only for program checks")
Reported-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-03-01 09:59:27 +01:00
Heiko Carstens
b5a882fcf1 s390: restore address space when returning to user space
Unbalanced set_fs usages (e.g. early exit from a function and a
forgotten set_fs(USER_DS) call) may lead to a situation where the
secondary asce is the kernel space asce when returning to user
space. This would allow user space to modify kernel space at will.

This would only be possible with the above mentioned kernel bug,
however we can detect this and fix the secondary asce before returning
to user space.

Therefore a new TIF_ASCE_SECONDARY which is used within set_fs. When
returning to user space check if TIF_ASCE_SECONDARY is set, which
would indicate a bug. If it is set print a message to the console,
fixup the secondary asce, and then return to user space.

This is similar to what is being discussed for x86 and arm:
"[RFC] syscalls: Restore address limit after a syscall".

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-02-23 10:06:38 +01:00
Heiko Carstens
606aa4aa0b s390: rename CIF_ASCE to CIF_ASCE_PRIMARY
This is just a preparation patch in order to keep the "restore address
space after syscall" patch small.
Rename CIF_ASCE to CIF_ASCE_PRIMARY to be unique and specific when
introducing a second CIF_ASCE_SECONDARY CIF flag.

Suggested-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-02-23 10:06:38 +01:00
Martin Schwidefsky
d24b98e3a9 s390/syscall: fix single stepped system calls
Fix PER tracing of system calls after git commit 34525e1f7e
"s390: store breaking event address only for program checks"
broke it.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-02-20 12:38:01 +01:00
Martin Schwidefsky
57d7f939e7 s390: add no-execute support
Bit 0x100 of a page table, segment table of region table entry
can be used to disallow code execution for the virtual addresses
associated with the entry.

There is one tricky bit, the system call to return from a signal
is part of the signal frame written to the user stack. With a
non-executable stack this would stop working. To avoid breaking
things the protection fault handler checks the opcode that caused
the fault for 0x0a77 (sys_sigreturn) and 0x0aad (sys_rt_sigreturn)
and injects a system call. This is preferable to the alternative
solution with a stub function in the vdso because it works for
vdso=off and statically linked binaries as well.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-02-08 14:13:25 +01:00
Martin Schwidefsky
34525e1f7e s390: store breaking event address only for program checks
The principles of operations specifies that the breaking event address
is stored to the address 0x110 in the prefix page only for program checks.
The last branch in user space is lost as soon as a branch in kernel space
is executed after e.g. an svc. This makes it impossible to accurately
maintain the breaking event address for a user space process.

Simplify the code, just copy the current breaking event address from
0x110 to the task structure for program checks from user space.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-01-31 10:46:53 +01:00
Heiko Carstens
7df1160459 s390: remove unused labels from entry.S
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-12-12 12:04:26 +01:00
Martin Schwidefsky
ce4dda3f02 s390: fix machine check panic stack switch
For system damage machine checks or machine checks due to invalid PSW
fields the system will be stopped. In order to get an oops message out
before killing the system the machine check handler branches to
.Lmcck_panic, switches to the panic stack and then does the usual
machine check handling.

The switch to the panic stack is incomplete, the stack pointer in %r15
is replaced, but the pt_regs pointer in %r11 is not. The result is
a program check which will kill the system in a slightly different way.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-12-07 07:22:13 +01:00
Martin Schwidefsky
61aaef51cc s390: fix kernel oops for CONFIG_MARCH_Z900=y builds
The LAST_BREAK macro in entry.S uses a different instruction sequence
for CONFIG_MARCH_Z900 builds. The branch target offset to skip the
store of the last breaking event address needs to take the different
length of the code block into account.

Fixes: f8fc82b471 ("s390: move sys_call_table and last_break from thread_info to thread_struct")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-11-25 10:07:55 +01:00
Heiko Carstens
3a890380e4 s390/thread_info: get rid of THREAD_ORDER define
We have the s390 specific THREAD_ORDER define and the THREAD_SIZE_ORDER
define which is also used in common code. Both have exactly the same
semantics. Therefore get rid of THREAD_ORDER and always use
THREAD_SIZE_ORDER instead.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-11-23 16:02:21 +01:00
Martin Schwidefsky
ef280c859f s390: move sys_call_table and last_break from thread_info to thread_struct
Move the last two architecture specific fields from the thread_info
structure to the thread_struct. All that is left in thread_info is
the flags field.

Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-11-15 16:48:20 +01:00
Heiko Carstens
d5c352cdd0 s390: move thread_info into task_struct
This is the s390 variant of commit 15f4eae70d ("x86: Move
thread_info into task_struct").

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-11-11 16:37:41 +01:00
Martin Schwidefsky
c360192bf4 s390/preempt: move preempt_count to the lowcore
Convert s390 to use a field in the struct lowcore for the CPU
preemption count. It is a bit cheaper to access a lowcore field
compared to a thread_info variable and it removes the depencency
on a task related structure.

bloat-o-meter on the vmlinux image for the default configuration
(CONFIG_PREEMPT_NONE=y) reports a small reduction in text size:

add/remove: 0/0 grow/shrink: 18/578 up/down: 228/-5448 (-5220)

A larger improvement is achieved with the default configuration
but with CONFIG_PREEMPT=y and CONFIG_DEBUG_PREEMPT=n:

add/remove: 2/6 grow/shrink: 59/4477 up/down: 1618/-228762 (-227144)

Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-11-11 16:37:40 +01:00
Al Viro
711f5df7bf s390: move exports to definitions
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-08-07 23:47:20 -04:00
Heiko Carstens
46210c440c s390: have unique symbol for __switch_to address
After linking there are several symbols for the same address that the
__switch_to symbol points to. E.g.:

000000000089b9c0 T __kprobes_text_start
000000000089b9c0 T __lock_text_end
000000000089b9c0 T __lock_text_start
000000000089b9c0 T __sched_text_end
000000000089b9c0 T __switch_to

When disassembling with "objdump -d" this results in a missing
__switch_to function. It would be named __kprobes_text_start
instead. To unconfuse objdump add a nop in front of the kprobes text
section. That way __switch_to appears again.

Obviously this solution is sort of a hack, since it also depends on
link order if this works or not. However it is the best I can come up
with for now.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-07-04 09:25:22 +02:00
Heiko Carstens
43799597dc s390: remove pointless load within __switch_to
Remove a leftover from the code that transferred a couple of TIF bits
from the previous task to the next task.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-28 09:32:40 +02:00
Martin Schwidefsky
e370e47694 s390: fix floating pointer register corruption (again)
There is a tricky interaction between the machine check handler
and the critical sections of load_fpu_regs and save_fpu_regs
functions. If the machine check interrupts one of the two
functions the critical section cleanup will complete the function
before the machine check handler s390_do_machine_check is called.
Trouble is that the machine check handler needs to validate the
floating point registers *before* and not *after* the completion
of load_fpu_regs/save_fpu_regs.

The simplest solution is to rewind the PSW to the start of the
load_fpu_regs/save_fpu_regs and retry the function after the
return from the machine check handler.

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: <stable@vger.kernel.org> # 4.3+
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-03-10 14:35:42 +01:00
Christian Borntraeger
b1685ab9bd s390/cpumf: Improve guest detection heuristics
commit e22cf8ca6f ("s390/cpumf: rework program parameter setting
to detect guest samples") requires guest changes to get proper
guest/host. We can do better: We can use the primary asn value,
which is set on all Linux variants to compare this with the host
pp value.
We now have the following cases:
1. Guest using PP
host sample:  gpp == 0, asn == hpp --> host
guest sample: gpp != 0 --> guest
2. Guest not using PP
host sample:  gpp == 0, asn == hpp --> host
guest sample: gpp == 0, asn != hpp --> guest

As soon as the host no longer sets CR4, we must back out
this heuristics - let's add a comment in switch_to.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-03-02 06:44:28 -06:00
Martin Schwidefsky
419123f900 s390/spinlock: do not yield to a CPU in udelay/mdelay
It does not make sense to try to relinquish the time slice with diag 0x9c
to a CPU in a state that does not allow to schedule the CPU. The scenario
where this can happen is a CPU waiting in udelay/mdelay while holding a
spin-lock.

Add a CIF bit to tag a CPU in enabled wait and use it to detect that the
yield of a CPU will not be successful and skip the diagnose call.

Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-11-27 09:24:18 +01:00
Heiko Carstens
db7e007fd6 s390/udelay: make udelay have busy loop semantics
When using systemtap it was observed that our udelay implementation is
rather suboptimal if being called from a kprobe handler installed by
systemtap.

The problem observed when a kprobe was installed on lock_acquired().
When the probe was hit the kprobe handler did call udelay, which set
up an (internal) timer and reenabled interrupts (only the clock comparator
interrupt) and waited for the interrupt.
This is an optimization to avoid that the cpu is busy looping while waiting
that enough time passes. The problem is that the interrupt handler still
does call irq_enter()/irq_exit() which then again can lead to a deadlock,
since some accounting functions may take locks as well.

If one of these locks is the same, which caused lock_acquired() to be
called, we have a nice deadlock.

This patch reworks the udelay code for the interrupts disabled case to
immediately leave the low level interrupt handler when the clock
comparator interrupt happens. That way no C code is being called and the
deadlock cannot happen anymore.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-10-14 14:32:13 +02:00
Christian Borntraeger
e22cf8ca6f s390/cpumf: rework program parameter setting to detect guest samples
The program parameter can be used to mark hardware samples with
some token.  Previously, it was used to mark guest samples only.

Improve the program parameter doubleword by combining two parts,
the leftmost LPP part and the rightmost PID part.  Set the PID
part for processes by using the task PID.
To distinguish host and guest samples for the kernel (PID part
is zero), the guest must always set the program paramater to a
non-zero value.  Use the leftmost bit in the LPP part of the
program parameter to be able to detect guest kernel samples.

[brueckner@linux.vnet.ibm.com]: Split __LC_CURRENT and introduced
__LC_LPP. Corrected __LC_CURRENT users and adjusted assembler parts.
And updated the commit message accordingly.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-10-14 14:32:12 +02:00
Hendrik Brueckner
83abeffbd5 s390/entry: add assembler macro to conveniently tests under mask
Various functions in entry.S perform test-under-mask instructions
to test for particular bits in memory.  Because test-under-mask uses
a mask value of one byte, the mask value and the offset into the
memory must be calculated manually.  This easily introduces errors
and is hard to review and read.

Introduce the TSTMSK assembler macro to specify a mask constant and
let the macro calculate the offset and the byte mask to generate a
test-under-mask instruction.  The benefit is that existing symbolic
constants can now be used for tests.  Also the macro checks for
zero mask values and mask values that consist of multiple bytes.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-10-14 14:32:09 +02:00
Hendrik Brueckner
0ac277790e s390/fpu: add static FPU save area for init_task
Previously, the init task did not have an allocated FPU save area and
saving an FPU state was not possible.  Now if the vector extension is
always enabled, provide a static FPU save area to save FPU states of
vector instructions that can be executed quite early.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-10-14 14:32:08 +02:00
Hendrik Brueckner
b5510d9b68 s390/fpu: always enable the vector facility if it is available
If the kernel detects that the s390 hardware supports the vector
facility, it is enabled by default at an early stage.  To force
it off, use the novx kernel parameter.  Note that there is a small
time window, where the vector facility is enabled before it is
forced to be off.

With enabling the vector facility by default, the FPU save and
restore functions can be improved.  They do not longer require
to manage expensive control register updates to enable or disable
the vector enablement control for particular processes.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-10-14 14:32:08 +02:00
Martin Schwidefsky
72d38b1978 s390/vtime: correct scaled cputime of partially idle CPUs
The calculation for the SMT scaling factor for a hardware thread
which has been partially idle needs to disregard the cycles spent
by the other threads of the core while the thread is idle.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-09-30 16:22:38 +02:00
Heiko Carstens
9380cf5a88 s390: fix floating point register corruption
The critical section cleanup code misses to add the offset of the
thread_struct to the task address.
Therefore, if the critical section code gets executed, it may corrupt
the task struct or restore the contents of the floating point registers
from the wrong memory location.
Fixes d0164ee20d "s390/kernel: remove save_fpu_regs() parameter and use
__LC_CURRENT instead".

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-09-17 13:43:41 +02:00
Christian Borntraeger
888d5e9804 KVM: s390: use pid of cpu thread for sampling tagging
Right now we use the address of the sie control block as tag for
the sampling data. This is hard to get for users. Let's just use
the PID of the cpu thread to mark the hardware samples.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-08-03 10:04:59 +02:00
Hendrik Brueckner
d0164ee20d s390/kernel: remove save_fpu_regs() parameter and use __LC_CURRENT instead
All calls to save_fpu_regs() specify the fpu structure of the current task
pointer as parameter.  The task pointer of the current task can also be
retrieved from the CPU lowcore directly.  Remove the parameter definition,
load the __LC_CURRENT task pointer from the CPU lowcore, and rebase the FPU
structure onto the task structure.  Apply the same approach for the
load_fpu_regs() function.

Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-08-03 10:04:37 +02:00
Martin Schwidefsky
2acb94f431 s390/nmi: use the normal asynchronous stack for machine checks
If a machine checks is received while the CPU is in the kernel, only
the s390_do_machine_check function will be called. The call to
s390_handle_mcck is postponed until the CPU returns to user space.
Because of this it is safe to use the asynchronous stack for machine
checks even if the CPU is already handling an interrupt.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-07-22 09:58:04 +02:00
Martin Schwidefsky
a359bb1190 s390/kernel: squeeze a few more cycles out of the system call handler
Reorder the instructions of UPDATE_VTIME to improve superscalar execution,
remove duplicate checks for problem-state from the asynchronous interrupt
handlers, and move the check for problem-state from the synchronous
exit path to the program check path as it is only needed for program
checks inside the kernel.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-07-22 09:58:04 +02:00
Martin Schwidefsky
d0fc41071a s390/kvm: integrate HANDLE_SIE_INTERCEPT into cleanup_critical
Currently there are two mechanisms to deal with cleanup work due to
interrupts. The HANDLE_SIE_INTERCEPT macro is used to undo the changes
required to enter SIE in sie64a. If the SIE instruction causes a program
check, or an asynchronous interrupt is received the HANDLE_SIE_INTERCEPT
code forwards the program execution to sie_exit.

All the other critical sections in entry.S are handled by the code in
cleanup_critical that is called by the SWITCH_ASYNC macro.

Move the sie64a function to the beginning of the critical section and
add the code from HANDLE_SIE_INTERCEPT to cleanup_critical. Add a special
case for the sie64a cleanup to the program check handler.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-07-22 09:58:03 +02:00
Martin Schwidefsky
dcd2a9aaa0 s390/kvm: fix interrupt race with HANDLE_SIE_INTERCEPT
The HANDLE_SIE_INTERCEPT macro is used in the interrupt handlers
and the program check handler to undo a few changes done by sie64a.
Among them are guest vs host LPP, the gmap ASCE vs kernel ASCE and
the bit that indicates that SIE is currently running on the CPU.

There is a race of a voluntary SIE exit vs asynchronous interrupts.
If the CPU completed the SIE instruction and the TM instruction of
the LPP macro at the time it receives an interrupt, the interrupt
handler will run while the LPP, the ASCE and the SIE bit are still
set up for guest execution. This might result in wrong sampling data,
but it will not cause data corruption or lockups.

The critical section in sie64a needs to be enlarged to include all
instructions that undo the changes required for guest execution.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-07-22 09:58:03 +02:00
Hendrik Brueckner
9977e886cb s390/kernel: lazy restore fpu registers
Improve the save and restore behavior of FPU register contents to use the
vector extension within the kernel.

The kernel does not use floating-point or vector registers and, therefore,
saving and restoring the FPU register contents are performed for handling
signals or switching processes only.  To prepare for using vector
instructions and vector registers within the kernel, enhance the save
behavior and implement a lazy restore at return to user space from a
system call or interrupt.

To implement the lazy restore, the save_fpu_regs() sets a CPU information
flag, CIF_FPU, to indicate that the FPU registers must be restored.
Saving and setting CIF_FPU is performed in an atomic fashion to be
interrupt-safe.  When the kernel wants to use the vector extension or
wants to change the FPU register state for a task during signal handling,
the save_fpu_regs() must be called first.  The CIF_FPU flag is also set at
process switch.  At return to user space, the FPU state is restored.  In
particular, the FPU state includes the floating-point or vector register
contents, as well as, vector-enablement and floating-point control.  The
FPU state restore and clearing CIF_FPU is also performed in an atomic
fashion.

For KVM, the restore of the FPU register state is performed when restoring
the general-purpose guest registers before the SIE instructions is started.
Because the path towards the SIE instruction is interruptible, the CIF_FPU
flag must be checked again right before going into SIE.  If set, the guest
registers must be reloaded again by re-entering the outer SIE loop.  This
is the same behavior as if the SIE critical section is interrupted.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-07-22 09:58:01 +02:00
Martin Schwidefsky
3827ec3d8f s390: adapt entry.S to the move of thread_struct
git commit 0c8c0f03e3
"x86/fpu, sched: Dynamically allocate 'struct fpu'"
moved the thread_struct to the end of the task_struct.

This causes some of the offsets used in entry.S to overflow their
instruction operand field. To fix this  use aghi to create a
dedicated pointer for the thread_struct.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-07-20 13:22:18 +02:00
Christian Borntraeger
8e23654687 KVM: s390: make exit_sie_sync more robust
exit_sie_sync is used to kick CPUs out of SIE and prevent reentering at
any point in time. This is used to reload the prefix pages and to
set the IBS stuff in a way that guarantees that after this function
returns we are no longer in SIE. All current users trigger KVM requests.

The request must be set before we block the CPUs to avoid races. Let's
make this implicit by adding the request into a new function
kvm_s390_sync_requests that replaces exit_sie_sync and split out
s390_vcpu_block and s390_vcpu_unblock, that can be used to keep
CPUs out of SIE independent of requests.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
2015-05-08 15:51:14 +02:00
Heiko Carstens
a876cb3f6b s390: remove 31 bit syscalls
Remove the 31 bit syscalls from the syscall table. This is a separate patch
just in case I screwed something up so it can be easily reverted.
However the conversion was done with a script, so everything should be ok.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-03-25 11:49:35 +01:00
Heiko Carstens
4bfc86ce94 s390: remove "64" suffix from a couple of files
Rename a couple of files to get rid of the "64" suffix.
"git blame" will still work.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-03-25 11:49:34 +01:00
Heiko Carstens
5a79859ae0 s390: remove 31 bit support
Remove the 31 bit support in order to reduce maintenance cost and
effectively remove dead code. Since a couple of years there is no
distribution left that comes with a 31 bit kernel.

The 31 bit kernel also has been broken since more than a year before
anybody noticed. In addition I added a removal warning to the kernel
shown at ipl for 5 minutes: a960062e58 ("s390: add 31 bit warning
message") which let everybody know about the plan to remove 31 bit
code. We didn't get any response.

Given that the last 31 bit only machine was introduced in 1999 let's
remove the code.
Anybody with 31 bit user space code can still use the compat mode.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-03-25 11:49:33 +01:00
Martin Schwidefsky
86ed42f401 s390: use local symbol names in entry[64].S
To improve the output of the perf tool hide most of the symbols
from entry[64].S by using the '.L' prefix.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2014-12-08 09:42:38 +01:00