Added transport mode ESP support for starters. I will send more of
these modes and types once i have resolved the tunnel mode isses.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows other in-kernel functions to do SAD lookups.
The only known user at the moment is pktgen.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default all flows in pktgen are randomly selected.
This patch introduces ability to have all defined flows to
be sent sequentially. Robert defined randomness to be the
default behavior.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
Track the extra packet overhead for VLAN tags, MPLS, IPSEC etc
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
The initial rate for STA's using rc80211_simple is set to the last
rate in the rate table. For situations for which the signal is weak,
the rate may be too high for authentication and association. Although
the rc80211_simple module will adjust the speed, the response may not
be fast enough for a successful connection. This modification sets the
initial rate to the lowest supported value.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do same adjustment to SACK fastpath counters provided that
they're valid.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a reference to an existing address is increased or decreased without
hitting zero, the address count is incorrectly adjusted.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the new sch_rr qdisc for multiqueue network device support. Allow
sch_prio and sch_rr to be compiled with or without multiqueue hardware
support.
sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS. This
was done since sch_prio and sch_rr only differ in their dequeue
routine.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the multiqueue hardware device support API to the core network
stack. Allow drivers to allocate multiple queues and manage them at
the netdev level if they choose to do so.
Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a boolean error in the new TX checksum check
that causes bogus TSO packets to be generated.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new UDP_ENCAP_L2TPINUDP encapsulation type for UDP
sockets. When a UDP socket's encap_type is UDP_ENCAP_L2TPINUDP, the
skb is delivered to a function pointed to by the udp_sock's
encap_rcv funcptr. If the skb isn't wanted by L2TP, it returns >0, which
causes it to be passed through to UDP.
Include padding to put the new encap_rcv field on a 4-byte boundary.
Previously, the only user of UDP encap sockets was ESP, so when
CONFIG_XFRM was not defined, some of the encap code was compiled
out. This patch changes that. As a result, udp_encap_rcv() will
now do a little more work when CONFIG_XFRM is not defined.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for configuring secondary unicast addresses on network
devices. To support this devices capable of filtering multiple
unicast addresses need to change their set_multicast_list function
to configure unicast filters as well and assign it to dev->set_rx_mode
instead of dev->set_multicast_list. Other devices are put into promiscous
mode when secondary unicast addresses are present.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use generic net_device address lists for multicast list handling.
Some defines are used to keep drivers working.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce struct dev_addr_list and list maintenance functions
based on dev_mc_list and the related functions. This will be
used by follow-up patches for both multicast and secondary
unicast addresses.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_mc_add/dev_mc_delete take care of uploading the list when
necessary and thats the only interface other code should use.
Also remove two incorrect calls in DECnet.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing model for checksum offload does not correctly handle
devices that can offload IPV4 and IPV6 only. The NETIF_F_HW_CSUM flag
implies device can do any arbitrary protocol.
This patch:
* adds NETIF_F_IPV6_CSUM for those devices
* fixes bnx2 and tg3 devices that need it
* add NETIF_F_IPV6_CSUM to ipv6 output (incl GSO)
* fixes assumptions about NETIF_F_ALL_CSUM in nat
* adjusts bridge union of checksumming computation
Signed-off-by: David S. Miller <davem@davemloft.net>
It is clean-up for XFRM type modules and adds aliases with its
protocol:
ESP, AH, IPCOMP, IPIP and IPv6 for IPsec
ROUTING and DSTOPTS for MIPv6
It is almost the same thing as XFRM mode alias, but it is added
new defines XFRM_PROTO_XXX for preprocessing since some protocols
are defined as enum.
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Acked-by: Ingo Oeser <netdev@axxeo.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes MIPv6 loadable module named "mip6".
Here is a modprobe.conf(5) example to load it automatically
when user application uses XFRM state for MIPv6:
alias xfrm-type-10-43 mip6
alias xfrm-type-10-60 mip6
Some MIPv6 feature is not included by this modular, however,
it should not be affected to other features like either IPsec
or IPv6 with and without the patch.
We may discuss XFRM, MH (RAW socket) and ancillary data/sockopt
separately for future work.
Loadable features:
* MH receiving check (to send ICMP error back)
* RO header parsing and building (i.e. RH2 and HAO in DSTOPTS)
* XFRM policy/state database handling for RO
These are NOT covered as loadable:
* Home Address flags and its rule on source address selection
* XFRM sub policy (depends on its own kernel option)
* XFRM functions to receive RO as IPv6 extension header
* MH sending/receiving through raw socket if user application
opens it (since raw socket allows to do so)
* RH2 sending as ancillary data
* RH2 operation with setsockopt(2)
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kill unnecessary CONFIG_IPV6_MIP6.
o It is redundant for RAW socket to keep MH out with the config then
it can handle any protocol.
o Clean-up at AH.
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a nested compat attribute type that can be used to convert
attributes that contain a structure to nested attributes in a
backwards compatible way.
The attribute looks like this:
struct {
[ compat contents ]
struct rtattr {
.rta_len = total size,
.rta_type = type,
} rta;
struct old_structure struct;
[ nested top-level attribute ]
struct rtattr {
.rta_len = nest size,
.rta_type = type,
} nest_attr;
[ optional 0 .. n nested attributes ]
struct rtattr {
.rta_len = private attribute len,
.rta_type = private attribute typ,
} nested_attr;
struct nested_data data;
};
Since both userspace and kernel deal correctly with attributes that are
larger than expected old versions will just parse the compat part and
ignore the rest.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a nested compat attribute type that can be used to convert
attributes that contain a structure to nested attributes in a
backwards compatible way.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently NAT (and others) that want to modify cloned skbs copy them,
even if in the vast majority of cases its not necessary because the
skb is a clone made by TCP and the portion NAT wants to modify is
actually writable because TCP release the header reference before
cloning.
The problem is that there is no clean way for NAT to find out how
long the writable header area is, so this patch introduces skb->hdr_len
to hold this length. When a headerless skb is cloned skb->hdr_len
is set to the current headroom, for regular clones it is copied from
the original. A new function skb_clone_writable(skb, len) returns
whether the skb is writable up to len bytes from skb->data. To avoid
enlarging the skb the mac_len field is reduced to 16 bit and the
new hdr_len field is put in the remaining 16 bit.
I've done a few rough benchmarks of NAT (not with this exact patch,
but a very similar one). As expected it saves huge amounts of system
time in case of sendfile, bringing it down to basically the same
amount as without NAT, with sendmsg it only helps on loopback,
probably because of the large MTU.
Transmit a 1GB file using sendfile/sendmsg over eth0/lo with and
without NAT:
- sendfile eth0, no NAT: sys 0m0.388s
- sendfile eth0, NAT: sys 0m1.835s
- sendfile eth0: NAT + path: sys 0m0.370s (~ -80%)
- sendfile lo, no NAT: sys 0m0.258s
- sendfile lo, NAT: sys 0m2.609s
- sendfile lo, NAT + patch: sys 0m0.260s (~ -90%)
- sendmsg eth0, no NAT: sys 0m2.508s
- sendmsg eth0, NAT: sys 0m2.539s
- sendmsg eth0, NAT + patch: sys 0m2.445s (no change)
- sendmsg lo, no NAT: sys 0m2.151s
- sendmsg lo, NAT: sys 0m3.557s
- sendmsg lo, NAT + patch: sys 0m2.159s (~ -40%)
I expect other users can see a similar performance improvement,
packet mangling iptables targets, ipip and ip_gre come to mind ..
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes :
- netif_queue_stopped need not be called inside qdisc_restart as
it has been called already in qdisc_run() before the first skb
is sent, and in __qdisc_run() after each intermediate skb is
sent (note : we are the only sender, so the queue cannot get
stopped while the tx lock was got in the ~LLTX case).
- BUG_ON((int) q->q.qlen < 0) was a relic from old times when -1
meant more packets are available, and __qdisc_run used to loop
when qdisc_restart() returned -1. During those days, it was
necessary to make sure that qlen is never less than zero, since
__qdisc_run would get into an infinite loop if no packets are on
the queue and this bug in qdisc was there (and worse - no more
skbs could ever get queue'd as we hold the queue lock too). With
Herbert's recent change to return values, this check is not
required. Hopefully Herbert can validate this change. If at all
this is required, it should be added to skb_dequeue (in failure
case), and not to qdisc_qlen.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
New changes :
- Incorporated Peter Waskiewicz's comments.
- Re-added back one warning message (on driver returning wrong value).
Previous changes :
- Converted to use switch/case code which looks neater.
- "if (ret == NETDEV_TX_LOCKED && lockless)" is buggy, and the lockless
check should be removed, since driver will return NETDEV_TX_LOCKED only
if lockless is true and driver has to do the locking. In the original
code as well as the latest code, this code can result in a bug where
if LLTX is not set for a driver (lockless == 0) but the driver is written
wrongly to do a trylock (despite LLTX being set), the driver returns
LOCKED. But since lockless is zero, the packet is requeue'd instead of
calling collision code which will issue warning and free up the skb.
Instead this skb will be retried with this driver next time, and the same
result will ensue. Removing this check will catch these driver bugs instead
of hiding the problem. I am keeping this change to readability section
since :
a. it is confusing to check two things as it is; and
b. it is difficult to keep this check in the changed 'switch' code.
- Changed some names, like try_get_tx_pkt to dev_dequeue_skb (as that is
the work being done and easier to understand) and do_dev_requeue to
dev_requeue_skb, merged handle_dev_cpu_collision and tx_islocked to
dev_handle_collision (handle_dev_cpu_collision is a small routine with only
one caller, so there is no need to have two separate routines which also
results in getting rid of two macros, etc.
- Removed an XXX comment as it should never fail (I suspect this was related
to batch skb WIP, Jamal ?). Converted some functions to original coding
style of having the return values and the function name on same line, eg
prio2list.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ccid3_hc_tx_send_packet currently returns 0 when the time difference between
current time and t_nom is less than 1000 microseconds.
In this case the packet is sent immediately; but, unlike other packets that can
be emitted on first attempt, it will not have its window counter updated and
its options set as required. This is a bug.
Fix: Require the time difference to be at least 1000 microseconds. The
algorithm then converges: time differences > 1000 microseconds trigger the
timer in dccp_write_xmit; after timer expiry this function is tried again; when
the time difference is less than 1000, the packet will have its options added
and window counter updated as required.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This updates the computation of t_nom and t_last_win_count to use the newer
gettimeofday interface.
Committer note: used ktime_to_timeval to set the 'now' variable to t_ld in
ccid3hctx_no_feedback_timer
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
It had just a slab cache, so, for the sake of simplicity just make
dccp_trfc_lib module init routine create the slab cache, no need for users of
the lib to create a private loss_interval object.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Now ccid3_hc_rx_update_li is ready to be moved to
net/dccp/ccids/lib/loss_interval, it uses the same interface as the other
functions there.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This is a preparatory patch for moving these loss interval functions from
net/dccp/ccids/ccid3.c to net/dccp/ccids/lib/loss_interval.c.
Based on a patch by Ian McDonald.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
When compiling with EXTRA_CFLAGS=-W noticed that tstamp is not initialised
correctly in dccp_li_calc_first_li.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
When compiling with EXTRA_CFLAGS=-W notice that we have signed/unsigned issue
in dccp.h.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Keep track of the number of configured ingress/egress QoS mappings to
avoid iteration while calculating the netlink attribute size.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->priority has only 32 bits and even VLAN uses 32 bit values in its API.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The returned device is unused, return proper error codes instead and avoid
having the ioctl handler guess the error.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move device registration and configuration of the underlying device to a
seperate function.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the checks of the underlying device to a seperate function.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move group allocation to a seperate function to clean up the code a bit
and allocate groups before registering the device. Device registration
is globally visible and causes netlink events, so we shouldn't fail
afterwards.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move some device initialization code to new dev->init callback to make
it shareable with netlink. Additionally this fixes a minor bug, dev->iflink
is set after registration, which causes an incorrect value in the initial
netlink message.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the device lookup and checks to the ioctl handler under the RTNL and
change all name-based interfaces to take a struct net_device * instead.
This allows to use them from a netlink interface, which identifies devices
based on ifindex not name. It also avoids races between the ioctl interface
and the (upcoming) netlink interface since now all changes happen under the
RTNL.
As a nice side effect this greatly simplifies error handling in the helper
functions and fixes a number of incorrect error codes like -EINVAL for
device not found.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add rtnetlink API for creating, changing and deleting software devices.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split up rtnl_setlink into a function performing validation and a function
performing the actual changes. This allows to share the modifcation logic
with rtnl_newlink, which is introduced by the next patch.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
At present, transmission rate information for mac80211 is available only
if verbose debugging is turned on, and then only in the logs. This patch
implements the SIOCGIWRATE ioctl, which adds the current transmission rate to
the output of iwconfig.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the RFCOMM TTY release process so that the TTY is kept
on the list until it is really freed. A new device flag is used to keep
track of released TTYs.
Signed-off-by: Ville Tervo <ville.tervo@nokia.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Currently the code for /proc/net/tcp disable BH while iterating
over the entire established hash table. Even though we call
cond_resched_softirq for each entry, we still won't process
softirq's as regularly as we would otherwise do which results
in poor performance when the system is loaded near capacity.
This anomaly comes from the 2.4 code where this was all in a
single function and the local_bh_disable might have made sense
as a small optimisation.
The cost of each local_bh_disable is so small when compared
against the increased latency in keeping it disabled over a
large but mostly empty TCP established hash table that we
should just move it to the individual read_lock/read_unlock
calls as we do in inet_diag.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Over the years this code has gotten hairier. Resulting in many long
discussions over long summer days and patches that get it wrong.
This patch helps tame that code so normal people will understand it.
Thanks to Thomas Graf, Peter J. waskiewicz Jr, and Patrick McHardy
for their valuable reviews.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enhances TIPC's stream socket send routine so that
it avoids transmitting data in chunks that require fragmentation
and reassembly, thereby improving performance at both the
sending and receiving ends of the connection.
The "maximum packet size" hint that records MTU info allows
the socket to decide how big a chunk it should send; in the
event that the hint has become stale, fragmentation may still
occur, but the data will be passed correctly and the hint will
be updated in time for the following send. Note: The 66060 byte
pseudo-MTU used for intra-node connections requires the send
routine to perform an additional check to ensure it does not
exceed TIPC"s limit of 66000 bytes of user data per chunk.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Jon Paul Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch modifies TIPC's socket API to utilize existing
generic routines to indicate unsupported operations, rather
than adding similar TIPC-specific routines.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Jon Paul Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch simplifies TIPC's Ethernet receive routine to take
advantage of information already present in each incoming sk_buff
indicating whether the packet was explicitly sent to the interface,
has been broadcast to all interfaces, or was picked up because the
interface is in promiscous mode.
This new approach also fixes the problem of TIPC accepting unwanted
traffic through UML's multicast-based Ethernet interfaces (which
deliver traffic in a promiscuous manner even if the interface is
not configured to be promiscuous).
Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Jon Paul Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The core problem is that RFCOMM socket layer ioctl can release
rfcomm_dev struct while RFCOMM TTY layer is still actively using
it. Calling tty_vhangup() is needed for a synchronous hangup before
rfcomm_dev is freed.
Addresses the oops at http://bugzilla.kernel.org/show_bug.cgi?id=7509
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Most drivers must handle fragmented HCI data packets and events. This
patch adds a generic function for their reassembly to the Bluetooth
core layer and thus allows to shrink the complexity of the drivers.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We don't need the BKL when wrapping and unwrapping; and experiments by Avishay
Traeger have found that permitting multiple encryption and decryption
operations to proceed in parallel can provide significant performance
improvements.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Avishay Traeger <atraeger@cs.sunysb.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In addition to binding to a local privileged port the NFS client should
allow binding to a specific local address. This is used by the server
for callbacks. The patch adds the necessary interface.
Signed-off-by: Frank van Maarseveen <frankvm@frankvm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Save the destination address of an incoming request over TCP like is
done already for UDP. It is necessary later for callbacks by the server.
Signed-off-by: Frank van Maarseveen <frankvm@frankvm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cleanup argument passing to functions for creating an RPC transport.
Signed-off-by: Frank van Maarseveen <frankvm@frankvm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A couple of callers just use a stringified IP address for the rpc client's
hostname. Move the logic for constructing this into rpc_create(), so it can
be shared.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up, for consistency. Rename rpcb_getport as rpcb_getport_async, to
match the naming scheme of rpcb_getport_sync.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In preparation for handling NFS mount option parsing in the kernel,
rename rpcb_getport_external as rpcb_get_port_sync, and make it available
always (instead of only when CONFIG_ROOT_NFS is enabled).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This allows NFS mount requests and RPC re-binding to be interruptible if the
server isn't responding.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a refcount in order to ensure that the gss_auth doesn't disappear from
underneath us while we're freeing up GSS contexts.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We should almost always be deferencing the rpc_auth struct by means of the
credential's cr_auth field instead of the rpc_clnt->cl_auth anyway. Fix up
that historical mistake, and remove the macro that propagated it.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RPCSEC_GSS needs to be able to send NULL RPC calls to the server in order
to free up any remaining GSS contexts.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Does a NULL RPC call and returns a pointer to the resulting rpc_task. The
call may be either synchronous or asynchronous.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix a memory leak in gss_create() whereby the rpc credcache was not being
freed if the rpc_mkpipe() call failed.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to set the unix_cred_cache.nextgc on the first call to
unx_create(), which should be when unix_auth.au_count === 1
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The leak only affects the RPCSEC_GSS caches, since they are the only ones
that are dynamically allocated...
Rename the existing rpcauth_free_credcache() to rpcauth_clear_credcache()
in order to better describe its role, then add a new function
rpcauth_destroy_credcache() that actually frees the cache in addition to
clearing it out.
Also move the call to destroy the credcache in gss_destroy() to come before
the rpc upcall pipe is unlinked.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a dentry_ops with a d_delete() method in order to ensure that dentries
are removed as soon as the last reference is gone.
Clean up rpc_depopulate() so that it only removes files that were created
via rpc_populate().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, the downcall queue is tied to the struct gss_auth, which means
that different RPCSEC_GSS pseudoflavours must use different upcall pipes.
Add a list to struct rpc_inode that can be used instead.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It used to be possible for an rpc.gssd daemon to stuff the RPC credential
cache for any rpc client simply by creating RPCSEC_GSS contexts and then
doing downcalls. In practice, no daemons ever made use of this feature.
Remove this feature now, since it will be impossible to figure out which
mechanism a given context actually matches if we enable more
than one gss mechanism to use the same upcall pipe.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cleans up an issue whereby rpcsec_gss uses the rpc_clnt->cl_auth. If we want
to be able to add several rpc_auths to a single rpc_clnt, then this abuse
must go.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Brian Behlendorf writes:
The root cause of the NFS hang we were observing appears to be a rare
deadlock between the kernel provided usermodehelper API and the linux NFS
client. The deadlock can arise because both of these services use the
generic linux work queues. The usermodehelper API run the specified user
application in the context of the work queue. And NFS submits both cleanup
and reconnect work to the generic work queue for handling. Normally this
is fine but a deadlock can result in the following situation.
- NFS client is in a disconnected state
- [events/0] runs a usermodehelper app with an NFS dependent operation,
this triggers an NFS reconnect.
- NFS reconnect happens to be submitted to [events/0] work queue.
- Deadlock, the [events/0] work queue will never process the
reconnect because it is blocked on the previous NFS dependent
operation which will not complete.`
The solution is simply to run reconnect requests on rpciod.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of taking the mutex every time we just need to increment/decrement
rpciod_users, we can optmise by using atomic_inc_not_zero and
atomic_dec_and_test.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The kref now does most of what cl_count + cl_user used to do. The only
remaining role for cl_count is to tell us if we are in a 'shutdown'
phase. We can provide that information using a single bit field instead
of a full atomic counter.
Also rename rpc_destroy_client() to rpc_close_client(), which reflects
better what its role is these days.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace it with explicit calls to rpc_shutdown_client() or
rpc_destroy_client() (for the case of asynchronous calls).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Its use is at best racy, and there is only one user (lockd), which has
additional locking that makes the whole thing redundant.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6: (40 commits)
bonding/bond_main.c: make 2 functions static
ps3: gigabit ethernet driver for PS3, take3
[netdrvr] Fix dependencies for ax88796 ne2k clone driver
eHEA: Capability flag for DLPAR support
Remove sk98lin ethernet driver.
sunhme.c:quattro_pci_find() must be __devinit
bonding / ipv6: no addrconf for slaves separately from master
atl1: remove write-only var in tx handler
macmace: use "unsigned long flags;"
Cleanup usbnet_probe() return value handling
netxen: deinline and sparse fix
eeprom_93cx6: shorten pulse timing to match spec (bis)
phylib: Add Marvell 88E1112 phy id
phylib: cleanup marvell.c a bit
AX88796 network driver
IOC3: Switch to pci refcounting safe APIs
e100: Fix Tyan motherboard e100 not receiving IPMI commands
QE Ethernet driver writes to wrong register to mask interrupts
rrunner.c:rr_init() must be __devinit
tokenring/3c359.c:xl_init() must be __devinit
...
At present, when a device is enslaved to bonding, if ipv6 is
active then addrconf will be initated on the slave (because it is closed
then opened during the enslavement processing). This causes DAD and RS
packets to be sent from the slave. These packets in turn can confuse
switches that perform ipv6 snooping, causing them to incorrectly update
their forwarding tables (if, e.g., the slave being added is an inactve
backup that won't be used right away) and direct traffic away from the
active slave to a backup slave (where the incoming packets will be
dropped).
This patch alters the behavior so that addrconf will only run on
the master device itself. I believe this is logically correct, as it
prevents slaves from having an IPv6 identity independent from the
master. This is consistent with the IPv4 behavior for bonding.
This is accomplished by (a) having bonding set IFF_SLAVE sooner
in the enslavement processing than currently occurs (before open, not
after), and (b) having ipv6 addrconf ignore UP and CHANGE events on
slave devices.
The eql driver also uses the IFF_SLAVE flag. I inspected eql,
and I believe this change is reasonable for its usage of IFF_SLAVE, but
I did not test it.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Cleanup using list_for_each_entry.
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Joe Jezak <josejx@gentoo.org>
Cc: Daniel Drake <dsd@gentoo.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When cleaning up HIDP sessions, we currently close the ACL connection
before deregistering the input device. Closing the ACL connection
schedules a workqueue to remove the associated objects from sysfs, but
the input device still refers to them -- and if the workqueue happens to
run before the input device removal, the kernel will oops when trying to
look up PHYSDEVPATH for the removed input device.
Fix this by deregistering the input device before closing the
connections.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>From my recent patch:
> > #1
> > Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
> > required a work function should always (unconditionally) rearm with
> > delay > 0 - otherwise it would endlessly loop. This patch replaces
> > this function with cancel_delayed_work(). Later kernel versions don't
> > require this, so here it's only for uniformity.
But Oleg Nesterov <oleg@tv-sign.ru> found:
> But 2.6.22 doesn't need this change, why it was merged?
>
> In fact, I suspect this change adds a race,
...
His description was right (thanks), so this patch reverts #1.
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every file should include the headers containing the prototypes for
its global functions.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Choices' index values may be out of range while still encoded in the fixed
length bit-field. This bug may cause access to undefined types (NULL
pointers) and thus crashes (Reported by Zhongling Wen).
This patch also adds checking of decode flag when decoding SEQUENCEs.
Signed-off-by: Jing Min Zhao <zhaojingmin@vivecode.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_clone_fraglist is static so it shouldn't be exported.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP currently permits users to bind to link-local addresses,
but doesn't verify that the scope id specified at bind matches
the interface that the address is configured on. It was report
that this can hang a system.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In-kernel sockets created with sock_create_kern don't usually
have a file and file descriptor allocated to them. As a result,
when SCTP tries to check the non-blocking flag, we Oops when
dereferencing a NULL file pointer.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correctly dereference bytes_copied in sctp_copy_laddrs().
I totally must have spaced when doing this.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
#1
Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
required a work function should always (unconditionally) rearm with
delay > 0 - otherwise it would endlessly loop. This patch replaces
this function with cancel_delayed_work(). Later kernel versions don't
require this, so here it's only for uniformity.
#2
After deleting a timer in cancel_[rearming_]delayed_work() there could
stay a last skb queued in npinfo->txq causing a memory leak after
kfree(npinfo).
Initial patch & testing by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If sky2 device poll routine is called from netpoll_send_skb, it would
deadlock. The netpoll_send_skb held the netif_tx_lock, and the poll
routine could acquire it to clean up skb's. Other drivers might use
same locking model.
The driver is correct, netpoll should not introduce more locking
problems than it causes already. So change the code to drop lock
before calling poll handler.
Signed-off-by: Stephen Hemminger <shemminger@linux.foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_sock_migrate() grabs the socket lock on a newly allocated socket while
holding the socket lock on an old socket. lockdep worries that this might
be a recursive lock attempt.
task/3026 is trying to acquire lock:
(sk_lock-AF_INET){--..}, at: [<ffffffff88105b8c>] sctp_sock_migrate+0x2e3/0x327 [sctp]
but task is already holding lock:
(sk_lock-AF_INET){--..}, at: [<ffffffff8810891f>] sctp_accept+0xdf/0x1e3 [sctp]
This patch tells lockdep that this locking is safe by using
lock_sock_nested().
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Having walked through the entire skbuff, skb_seq_read would leave the
last fragment mapped. As a consequence, the unwary caller would leak
kmaps, and proceed with preempt_count off by one. The only (kind of
non-intuitive) workaround is to use skb_seq_read_abort.
This patch makes sure skb_seq_read always unmaps frag_data after
having cycled through the skb's paged part.
Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This moves the local_irq_enable() call in net_rx_action() to before
calling the CONFIG_NET_DMA's dma_async_memcpy_issue_pending() rather
than after. This shortens the irq disabled window and allows for DMA
drivers that need to do their own irq hold.
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_read_sock() currently assumes that the recv_actor() only returns
number of bytes copied. For network splice receive, we may have to
return an error in some cases. So allow the actor to return a negative
error value.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tipc netlink config handler uses the nlmsg_pid from the
request header as destination for its reply. If the application
initialized nlmsg_pid to 0, the reply is looped back to the kernel,
causing hangup. Fix: use nlmsg_pid of the skb that triggered the
request.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
secmark doesn't depend on CONFIG_NET_SCHED.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bug reported by Haruhito Watanabe <haruhito@sfc.keio.ac.jp>.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no realistic situation to change helper (Who wants IRC helper to
track FTP traffic ?). Moreover, if we want to do that, we need to fix race
issue by nfctnetlink and running helper. That will add overhead to packet
processing. It wouldn't pay. So this rejects the request to change
helper. The requests to add or remove helper are accepted as ever.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jerome Borsboom <j.borsboom@erasmusmc.nl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the split out of the patch that we agreed I should split
out from my last patch. It changes space_left to be computed in the same
way the to variable is. I know we talked about changing space_left to an
int, but I think size_t is more appropriate, since we should never have
negative space in our buffer, and computing using offsetof means space_left
should now never drop below zero.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
I noted the other day while looking at a bug that was ostensibly
in some perl networking library, that we strictly avoid allowing getsockopt
operations to complete if we pass in oversized buffers. This seems to make
libraries like Perl::NET malfunction since it seems to allocate oversized
buffers for use in several operations. It also seems to be out of line with
the way udp, tcp and ip getsockopt routines handle buffer input (since the
*optlen pointer in both an input and an output and gets set to the length
of the data that we copy into the buffer). This patch brings our getsockopt
helpers into line with other protocols, and allows us to accept oversized
buffers for our getsockopt operations. Tested by me with good results.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Return the number of bytes buffered in rxrpc_send_data().
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_vs currently fails to reset its ip_vs_sync_state variable if the
sync thread fails to start properly. The result is that the kernel
will report a running daemon when their actuall is none.
If you issue the following commands:
1. ipvsadm --start-daemon master --mcast-interface bla
2. ipvsadm -L --daemon
3. ipvsadm --stop-daemon master
Assuming that bla is not an actual interface, step 2 should return no
data, but instead returns:
$ ipvsadm -L --daemon
master sync daemon (mcast=bla, syncid=0)
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My IPsec MTU optimization patch introduced a regression in MTU calculation
for non-ESP SAs, the SA's header_len needs to be subtracted from the MTU if
the transform doesn't provide a ->get_mtu() function.
Reported-and-tested-by: Marco Berizzi <pupilla@hotmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a NULL dereference spotted by the Coverity checker.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6f74651ae6 is found guilty
of breaking DSACK counting, which should be done only for the
SACK block reported by the DSACK instead of every SACK block
that is received along with DSACK information.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 164891aadf broke RTT
sampling of congestion control modules. Inaccurate timestamps
could be fed to them without providing any way for them to
identify such cases. Previously RTT sampler was called only if
FLAG_RETRANS_DATA_ACKED was not set filtering inaccurate
timestamps nicely. In addition, the new behavior could give an
invalid timestamp (zero) to RTT sampler if only skbs with
TCPCB_RETRANS were ACKed. This solves both problems.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent patch that added ipv6_hwtype is broken on tuntap tunnels.
Indeed, it's broken on any device that does not pass the ipv6_hwtype
test.
The reason is that the original test only applies to autoconfiguration,
not IPv6 support. IPv6 support is allowed on any device. In fact,
even with the ipv6_hwtype patch applied you can still add IPv6 addresses
to any interface that doesn't pass thw ipv6_hwtype test provided that
they have a sufficiently large MTU. This is a serious problem because
come deregistration time these devices won't be cleaned up properly.
I've gone back and looked at the rationale for the patch. It appears
that the real problem is that we were creating IPv6 devices even if the
MTU was too small. So here's a patch which fixes that and reverts the
ipv6_hwtype stuff.
Thanks to Kanru Chen for reporting this issue.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This flaw does not affect any behavior (currently).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now, when we receive a mtu estimate smaller then minim
threshold in the ICMP message, we disable the path mtu discovery
on the transport. This leads to the never increasing sctp fragmentation
point even when the real path mtu has increased.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Currently, if the socket is owned by the user, we drop the ICMP
message. As a result SCTP forgets that path MTU changed and
never adjusting it's estimate. This causes all subsequent
packets to be fragmented. With this patch, we'll flag the association
that it needs to udpate it's estimate based on the already updated
routing information.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Introduce new function sctp_transport_update_pmtu that updates
the transports and destination caches view of the path mtu.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
If the copy_to_user or copy_user calls fail in sctp_getsockopt_local_addrs(),
the function should free locally allocated storage before returning error.
Spotted by Coverity.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Allow sctp_bindx() to accept multiple address with
unspecified port. In this case, all addresses inherit
the first bound port. We still catch full mis-matches.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
During peeloff of AF_INET6 socket, the inet6_sk(sk)->daddr
wasn't set correctly since the code was assuming IPv4 only.
Now we use a correct call to set the destination address.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Because of the current default of 100, Cubic and BIC perform very
poorly compared to standard Reno.
In the worst case, this change makes Cubic and BIC as aggressive as
Reno. So this change should be very safe.
Signed-off-by: David S. Miller <davem@davemloft.net>
Without FRTO, the tcp_try_to_open is never called with
lost_out > 0 (see tcp_time_to_recover). However, when FRTO is
enabled, the !tp->lost condition is not used until end of FRTO
because that way TCP avoids premature entry to fast recovery
during FRTO.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
mac80211 stops the tx queues during scans. This is wrong with respect
to the master deivce tx queue, since stopping it prevents any probes
from being sent during the scan. Instead, they accumulate in the queue
and are only sent after the scan is finished, which is obviously
wrong.
Signed-off-by: Mattias Nissler <mattias.nissler@gmx.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch fixes a typo in mac80211's debugfs.c.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Fix signedness mixup making mac addresses show up strangely
(like 00:11:22:33:44:ffffffaa) in /sys/class/ieee80211/*/macaddress.
Signed-off-by: David Lamparter <equinox@diac24.net>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Jean II was right: you have to re-charge the final timer when
resending rejected frames. Otherwise it triggers at a wrong time and
can break the currently running communication. Reproducible under
rt-preempt.
Signed-off-by: G. Liakhovetski <gl@dsa-ac.de>
Signed-off-by: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: G. Liakhovetski <gl@dsa-ac.de>
We need to switch to NRM _before_ sending the final packet otherwise
we might hit a race condition where we get the first packet from the
peer while we're still in LAP_XMIT_P.
Signed-off-by: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 options are not very well aligned within the packet and the
format of a CIPSO option is even worse. The result is that the CIPSO
engine in the kernel does a few unaligned accesses when parsing and
validating incoming packets with CIPSO options attached which generate
error messages on certain alignment sensitive platforms. This patch
fixes this by marking these unaligned accesses with the
get_unaliagned() macro.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current NetLabel code has some redundant APIs which allow both
"struct socket" and "struct sock" types to be used; this may have made
sense at some point but it is wasteful now. Remove the functions that
operate on sockets and convert the callers. Not only does this make
the code smaller and more consistent but it pushes the locking burden
up to the caller which can be more intelligent about the locks. Also,
perform the same conversion (socket to sock) on the SELinux/NetLabel
glue code where it make sense.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we create idev before addresses are added, it no longer makes
sense to remove them when addresses are all deleted.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we check for permission before deleting entries from SAD and
SPD, (see security_xfrm_policy_delete() security_xfrm_state_delete())
However we are not checking for authorization when flushing the SPD and
the SAD completely. It was perhaps missed in the original security hooks
patch.
This patch adds a security check when flushing entries from the SAD and
SPD. It runs the entire database and checks each entry for a denial.
If the process attempting the flush is unable to remove all of the
entries a denial is logged the the flush function returns an error
without removing anything.
This is particularly useful when a process may need to create or delete
its own xfrm entries used for things like labeled networking but that
same process should not be able to delete other entries or flush the
entire database.
Signed-off-by: Joy Latten<latten@austin.ibm.com>
Signed-off-by: Eric Paris <eparis@parisplace.org>
Signed-off-by: James Morris <jmorris@namei.org>
cbq and atm destroy their filters twice when destroying inner classes
during qdisc destruction.
Reported-and-tested-by: Strobl Anton <a.strobl@aws-it.at>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When changing the link state from userspace not affecting any other
flags. Two duplicate notification are being sent, once as action
in the NETDEV_UP/NETDEV_DOWN notification chain and a second time
when comparing old and new device flags after the change has been
completed. Although harmless, the duplicates should be avoided.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts changesets:
6aaf47fa48b7b5f487abde34ed91c4fc038410b4
There are still some correctness issues recently
discovered which do not have a known fix that doesn't
involve doing a full hash table scan on port bind.
So revert for now.
Signed-off-by: David S. Miller <davem@davemloft.net>
A recv() on an AF_UNIX, SOCK_STREAM socket can race with a
send()+close() on the peer, causing recv() to return zero, even though
the sent data should be received.
This happens if the send() and the close() is performed between
skb_dequeue() and checking sk->sk_shutdown in unix_stream_recvmsg():
process A skb_dequeue() returns NULL, there's no data in the socket queue
process B new data is inserted onto the queue by unix_stream_sendmsg()
process B sk->sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock()
process A sk->sk_shutdown is checked, unix_release_sock() returns zero
I'm surprised nobody noticed this, it's not hard to trigger. Maybe
it's just (un)luck with the timing.
It's possible to work around this bug in userspace, by retrying the
recv() once in case of a zero return value.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The return value from textsearch_prepare() needs to be checked
by IS_ERR(). Because it returns error code as a pointer.
Cc: "Brian J. Murrell" <netfilter@interlinx.bc.ca>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
check_compat_entry_size_and_hooks iterates over the matches and calls
compat_check_calc_match, which loads the match and calculates the
compat offsets, but unlike the non-compat version, doesn't call
->checkentry yet. On error however it calls cleanup_matches, which in
turn calls ->destroy, which can result in crashes if the destroy
function (validly) expects to only get called after the checkentry
function.
Add a compat_release_match function that only drops the module reference
on error and rename compat_check_calc_match to compat_find_calc_match to
reflect the fact that it doesn't call the checkentry function.
Reported by Jan Engelhardt <jengelh@linux01.gwdg.de>
Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a helper module is unloaded all conntracks refering to it have their
helper pointer NULLed out, leading to lots of races. In most places this
can be fixed by proper use of RCU (they do already check for != NULL,
but in a racy way), additionally nf_conntrack_expect_related needs to
bail out when no helper is present.
Also remove two paranoid BUG_ONs in nf_conntrack_proto_gre that are racy
and not worth fixing.
Signed-off-by: Patrick McHarrdy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ifindex == 0 does not exist and implies we should do a lookup by name if
one was given.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
GCC doesn't like the way Stephen initially did it:
net/ipv4/tcp_probe.c:83: warning: empty declaration
Signed-off-by: David S. Miller <davem@davemloft.net>
LIMIT_NETDEBUG allows the admin to disable some warning messages (echo 0
>/proc/sys/net/core/warnings).
The "TCP: Treason uncloaked!" message can use this facility.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously inet devices were only constructed when addresses are added
(or rarely in ipmr). Therefore the default config values they get are
the ones at the time of these operations.
Now that we're creating inet devices earlier, this changes the
behaviour of default config values in an incompatible way (see bug
#8519).
This patch creates a compromise by setting the default values at the
same point as before but only for those that have not been explicitly
set by the user since the inet device's creation.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously once inetdev_init has been called on a device any changes
made to ipv4_devconf_dflt would have no effect on that device's
configuration.
This creates a problem since we have moved the point where
inetdev_init is called from when an address is added to where the
device is registered.
This patch is the first half of a set that tries to mimic the old
behaviour while still calling inetdev_init.
It propagates any changes to ipv4_devconf_dflt to those devices that
have not had the corresponding attribute set.
The next patch will forcibly set all values at the point where
inetdev_init was previously called.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts the ipv4_devconf config members (everything except
sysctl) to an array. This allows easier manipulation which will be
needed later on to provide better management of default config values.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I made the inetdev_init call work on all devices I incorrectly
left in the panic call as well. It is obviously undesirable to
panic on an allocation failure for a normal network device. This
patch moves the panic call under the loopback if clause.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
A time_wait socket inherits sk_bound_dev_if from the original socket,
but it is not used when sending ACK packets using ip_send_reply.
Fix by passing the oif to ip_send_reply in struct ip_reply_arg and
use it for output routing.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when icmp_errors_use_inbound_ifaddr is set and an ICMP error is
sent after the packet passed through ip_output(), an address from the
outgoing interface is chosen as ICMP source address since skb->dev doesn't
point to the incoming interface anymore.
Fix this by doing an interface lookup on rt->dst.iif and using that device.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This isn't a bug just yet as only TCP uses sk_setup_caps for GSO.
However, if and when UDP or something else starts using it this is
likely to cause a problem if we forget to add software emulation
for it at the same time.
The problem is that right now we translate GSO emulation to the
bitmask NETIF_F_GSO_MASK, which includes every protocol, even
ones that we cannot emulate.
This patch makes it provide only the ones that we can emulate.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code used to ignore GSO completely, passing either way too
small or zero pkts_acked when GSO skb or part of it got ACKed.
In addition, there is no need to calculate the value in the loop
but simple arithmetics after the loop is sufficient. There is
no need to handle SYN case specially because congestion control
modules are not yet initialized when FLAG_SYN_ACKED is set.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent gcc versions emit warnings when unsigned variables are
compared < 0 or >= 0.
Signed-off-by: Bill Nottingham <notting@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
round_jiffies for net dev watchdog timer.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This diff changes the default port range used for outgoing connections,
from "use 32768-61000 in most cases, but use N-4999 on small boxes
(where N is a multiple of 1024, depending on just *how* small the box
is)" to just "use 32768-61000 in all cases".
I don't believe there are any drawbacks to this change, and it keeps
outgoing connection ports farther away from the mess of
IANA-registered ports.
Signed-off-by: Mark Glines <mark@glines.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon an excellent bug report and initial patch by
Frederik Deweerdt.
The UNIX datagram connect code blindly dereferences other->sk_socket
via the call down to the security_unix_may_send() function.
Without locking 'other' that pointer can go NULL via unix_release_sock()
which does sock_orphan() which also marks the socket SOCK_DEAD.
So we have to lock both 'sk' and 'other' yet avoid all kinds of
potential deadlocks (connect to self is OK for datagram sockets and it
is possible for two datagram sockets to perform a simultaneous connect
to each other). So what we do is have a "double lock" function similar
to how we handle this situation in other areas of the kernel. We take
the lock of the socket pointer with the smallest address first in
order to avoid ABBA style deadlocks.
Once we have them both locked, we check to see if SOCK_DEAD is set
for 'other' and if so, drop everything and retry the lookup.
Signed-off-by: David S. Miller <davem@davemloft.net>
The unix_state_*() locking macros imply that there is some
rwlock kind of thing going on, but the implementation is
actually a spinlock which makes the code more confusing than
it needs to be.
So use plain unix_state_lock and unix_state_unlock.
Signed-off-by: David S. Miller <davem@davemloft.net>
The interface for network device VLAN extension was confusing.
The kill_vid function is only really useful for devices that do
hardware filtering. Devices that only do VLAN receiption without
filtering were being forced to provide the hook, and there were
bugs in those devices.
Many drivers had kill_vid routine that called vlan_group_set_device, with
NULL, but that is done already.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Peroidic STP timers don't have to be exact. The hold timer runs at
1HZ, and the hello timer normally runs at 2HZ; save power by aligning
it them to next second.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge cleanup timer is fired 10 times a second for timers that
are at least 15 seconds ahead in time and that are not critical to be
cleaned asap.
This patch calculates the next time to run the timer as the minimum of
all timers or a minimum based on the current state.
Signed-off-by: Baruch Even <baruch@ev-en.org>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function in tcp_probe is printf like, use GCC to check the args.
Sighed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just a fix to correct the number of printl arguments. Now, srtt is
logging correctly.
Signed-off-by: Sangtae Ha <sangtae.ha@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_out_of_resources() and tcp_close() perform the
same checking of number of orphan sockets. Move this
code into common place.
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv6/ip6_fib.c: In function ‘fib6_add_rt2node’:
net/ipv6/ip6_fib.c:661: warning: label ‘out’ defined but not used
Signed-off-by: David S. Miller <davem@davemloft.net>