If one side aborts an established connection, the entry still lingers
for 10s in conntrack for the late packets. Allow to open up the
connection again for the party which sent the RST packet.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the IPV6_FL_A_GET case the hash is checked for flowlabels
with the given label. If it is not found, the lock, protecting
the hash, is dropped to be re-get for writing. After this a
newly allocated entry is inserted, but no checks are performed
to catch a classical SMP race, when the conflicting label may
be inserted on another cpu.
Use the (currently unused) return value from fl_intern() to
return the conflicting entry (if found) and re-check, whether
we can reuse it (IPV6_FL_F_EXCL) or return -EEXISTS.
Also add the comment, about why not re-lookup the current
sock for conflicting flowlabel entry.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This routine scans the ipv6_fl_list whose update is
protected with the socket lock and the ip6_sk_fl_lock.
Since the socket lock is not taken in the lookup, use
the other one.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new flowlabels should be inserted into the sock list
under the ip6_sk_fl_lock. This was lost in one place.
This list is naturally protected with the socket lock, but
the fl6_sock_lookup() is called without it, so another
protection is required.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Macros like SCTP_CHUNKMAP_XXX(chukmap) require chukmap to be an array,
but match_packet() passes a pointer to these macros. Also remove the
ELEMCOUNT macro and fix a bug in SCTP_CHUNKMAP_COPY.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both high-sack detection and new lowest seq variables have
unnecessary zero special case which are now removed by setting
safe initial seqnos.
This also fixes problem which caused zero received_upto being
passed to tcp_mark_lost_retrans which confused after relations
within the marker loop causing incorrect TCPCB_SACKED_RETRANS
clearing. The problem was noticed because of a performance
report from TAKANO Ryousei <takano@axe-inc.co.jp>.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Ryousei Takano <takano-ryousei@aist.go.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at a net driver with the following construct,
if (!netif_carrier_ok(dev))
netif_carrier_on(dev);
it stuck me that the netif_carrier_ok() check was redundant, since
netif_carrier_on() checks bit __LINK_STATE_NOCARRIER anyway. This is
the same reason why netif_queue_stopped() need not be called prior to
netif_wake_queue().
This is true, but there is however an unwanted side effect from assuming
that netif_carrier_on() can be called multiple times: it touches the
watchdog, regardless of pre-existing carrier state.
The fix: move watchdog-up inside the bit-cleared code path.
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix uninitialised variable in ip_frag_reasm(). err should be set to
-ENOMEM if the initial call of skb_clone() fails.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new field to xfrm states called inner_mode. The existing
mode object is renamed to outer_mode.
This is the first part of an attempt to fix inter-family transforms. As it
is we always use the outer family when determining which mode to use. As a
result we may end up shoving IPv4 packets into netfilter6 and vice versa.
What we really want is to use the inner family for the first part of outbound
processing and the outer family for the second part. For inbound processing
we'd use the opposite pairing.
I've also added a check to prevent silly combinations such as transport mode
with inter-family transforms.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Combining RO and AH/ESP/IPCOMP does not make sense. So this patch adds a
check in the state initialisation function to prevent this.
This allows us to safely remove the mode input function of RO since it
can never be called anymore. Indeed, if somehow it does get called we'll
know about it through an OOPS instead of it slipping past silently.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
For IPv4 we were using the bottom route's peer instead of the top one.
This is wrong because the peer is only used by TCP to keep track of
information about the TCP destination address which certainly does not
live in the bottom route.
This patch fixes that which allows us to get rid of the family check
since the bottom route could be IPv6 while the top one must always
be IPv4.
I've also changed the other fields which are IPv4-specific to get the
info from the top route instead of potentially bogus data from the
bottom route.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is convenient to have a pointer from xfrm_state to address-specific
functions such as the output function for a family. Currently the
address-specific policy code calls out to the xfrm state code to get
those pointers when we could get it in an easier way via the state
itself.
This patch adds an xfrm_state_afinfo to xfrm_mode (since they're
address-specific) and changes the policy code to use it. I've also
added an owner field to do reference counting on the module providing
the afinfo even though it isn't strictly necessary today since IPv6
can't be unloaded yet.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently BEET mode does not reinject the packet back into the stack
like tunnel mode does. Since BEET should behave just like tunnel mode
this is incorrect.
This patch fixes this by introducing a flags field to xfrm_mode that
tells the IPsec code whether it should terminate and reinject the packet
back into the stack.
It then sets the flag for BEET and tunnel mode.
I've also added a number of missing BEET checks elsewhere where we check
whether a given mode is a tunnel or not.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The type and mode maps are only used by SAs, not policies. So it makes
sense to move them from xfrm_policy.c into xfrm_state.c. This also allows
us to mark xfrm_get_type/xfrm_put_type/xfrm_get_mode/xfrm_put_mode as
static.
The only other change I've made in the move is to get rid of the casts
on the request_module call for types. They're unnecessary because C
will promote them to ints anyway.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently xfrm_parse_spi requires there to be 16 bytes for AH and ESP.
In contrived cases there may not actually be 16 bytes there since the
respective header sizes are less than that (8 and 12 currently).
This patch changes the test to use the actual header length instead of 16.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not every transform needs to zap ip_summed. For example, a pure tunnel
mode encapsulation does not affect the hardware checksum at all. In fact,
every algorithm (that needs this) other than AH6 already does its own
ip_summed zapping.
This patch moves the zapping into AH6 which is in line with what IPv4 does.
Possible future optimisation: Checksum the data as we copy them in IPComp.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently xfrm6_rcv_spi gets the nexthdr value itself from the packet.
This means that we need to fix up the value in case we have a 4-on-6
tunnel. Moving this logic into the caller simplifies things and allows
us to merge the code with IPv4.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the tunnel parsing for IPv4 out of xfrm4_input and into
xfrm4_tunnel. This change is in line with what IPv6 does and will allow
us to merge the two input functions.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed that my recent patch broke 6-on-4 pure IPsec tunnels (the ones
that are only used for incompressible IPsec packets). Subsequent reviews
show that I broke 6-on-6 pure tunnels more than three years ago and nobody
ever noticed. I suppose every must be testing 6-on-6 IPComp with large
pings which are very compressible :)
This patch fixes both cases.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This functions is never called with NULL or not setup argument,
so the checks inside are redundant.
Also, the return value is always -ENOMEM, so no need in
additional variable for this.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The proposed fix is to delay the reference counter decrement
until the quiescent state pass. This will give sk_clone() a
chance to get the reference on the cloned filter.
Regular sk_filter_uncharge can happen from the sk_free() only
and there's no need in delaying the put - the socket is dead
anyway and is to be release itself.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk_filter_uncharge is called for error handling and
for releasing the former filter, but this will have to
be done in a bit different manner, so cleanup the error
path a bit.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is done merely as a preparation for the fix.
The sk_filter_uncharge() unaccounts the filter memory and calls
the sk_filter_release(), which in turn decrements the refcount
anf frees the filter.
The latter function will be required separately.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Filter is attached in a separate function, so do the
same for filter detaching.
This also removes one variable sock_setsockopt().
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous IW_SCAN_THIS_ESSID patch left a hole allowing scan
requests on interfaces in inappropriate modes.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we now allocate the queues in inet_fragment.c, we
can safely free it in the same place. The ->destructor
callback thus becomes optional for inet_frags.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since this callback is used to check for conflicts in
hashtable when inserting a newly created frag queue, we can
do the same by checking for matching the queue with the
argument, used to create one.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here we need another callback ->match to check whether the
entry found in hash matches the key passed. The key used
is the same as the creation argument for inet_frag_create.
Yet again, this ->match is the same for netfilter and ipv6.
Running a frew steps forward - this callback will later
replace the ->equal one.
Since the inet_frag_find() uses the already consolidated
inet_frag_create() remove the xxx_frag_create from protocol
codes.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This one uses the xxx_frag_intern() and xxx_frag_alloc()
routines, which are already consolidated, so remove them
from protocol code (as promised).
The ->constructor callback is used to init the rest of
the frag queue and it is the same for netfilter and ipv6.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just perform the kzalloc() allocation and setup common
fields in the inet_frag_queue(). Then return the result
to the caller to initialize the rest.
The inet_frag_alloc() may return NULL, so check the
return value before doing the container_of(). This looks
ugly, but the xxx_frag_alloc() will be removed soon.
The xxx_expire() timer callbacks are patches,
because the argument is now the inet_frag_queue, not
the protocol specific queue.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This routine checks for the existence of a given entry
in the hash table and inserts the new one if needed.
The ->equal callback is used to compare two frag_queue-s
together, but this one is temporary and will be removed
later. The netfilter code and the ipv6 one use the same
routine to compare frags.
The inet_frag_intern() always returns non-NULL pointer,
so convert the inet_frag_queue into protocol specific
one (with the container_of) without any checks.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the hash value is already calculated in xxx_find, we can
simply use it later. This is already done in netfilter code,
so make the same in ipv4 and ipv6.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent header_ops change would break the following dead
code in br2684. Maintaining conditonal code in mainline is wrong.
"Do, or do not. There is no 'try.'"
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Ryan Reading <ryanr23@gmail.com>
Every IrCOMM socket is registered with the discovery subsystem, so we don't
need to loop over all of them for every discovery event. We just need to
do it for the registered IrCOMM socket.
Signed-off-by: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not define the sysctl_dccp_sync_ratelimit sysctl variable in the
CONFIG_SYSCTL dependent sysctl.c module - move it to input.c instead.
This fixes the following build bug:
net/built-in.o: In function `dccp_check_seqno':
input.c:(.text+0xbd859): undefined reference to `sysctl_dccp_sync_ratelimit'
distcc[29953] ERROR: compile (null) on localhost failed
make: *** [vmlinux] Error 1
Found via 'make randconfig' build testing.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The icmpv6msg mib statistics is not freed.
This is almost not critical for current kernel, since ipv6
module is unloadable, but this can happen on load error and
will happen every time we stop the network namespace (when
we have one, of course).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A sysctl method was added to enable and disable debugging levels. After
further review, it was decided that there are better approaches to doing this
and the sysctl methodology isn't really desirable. This patch removes the
sysctl code from 9p.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This patch moves transport dynamic registration and matching to the net
module to prevent a bad Kconfig dependency between the net and fs 9p modules.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The 9P2000 protocol requires the authentication and permission checks to be
done in the file server. For that reason every user that accesses the file
server tree has to authenticate and attach to the server separately.
Multiple users can share the same connection to the server.
Currently v9fs does a single attach and executes all I/O operations as a
single user. This makes using v9fs in multiuser environment unsafe as it
depends on the client doing the permission checking.
This patch improves the 9P2000 support by allowing every user to attach
separately. The patch defines three modes of access (new mount option
'access'):
- attach-per-user (access=user) (default mode for 9P2000.u)
If a user tries to access a file served by v9fs for the first time, v9fs
sends an attach command to the server (Tattach) specifying the user. If
the attach succeeds, the user can access the v9fs tree.
As there is no uname->uid (string->integer) mapping yet, this mode works
only with the 9P2000.u dialect.
- allow only one user to access the tree (access=<uid>)
Only the user with uid can access the v9fs tree. Other users that attempt
to access it will get EPERM error.
- do all operations as a single user (access=any) (default for 9P2000)
V9fs does a single attach and all operations are done as a single user.
If this mode is selected, the v9fs behavior is identical with the current
one.
Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This patch abstracts out the interfaces to underlying transports so that
new transports can be added as modules. This should also allow kernel
configuration of transports without ifdef-hell.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Why do we need r/o bind mounts?
This feature allows a read-only view into a read-write filesystem. In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.
This has a number of uses. It allows chroots to have parts of filesystems
writable. It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
somefilesystems. This also replaces patches that vserver has had out of the
tree for several years.
It allows security enhancement by making sure that parts of your filesystem
read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated. I've been using the following script to test that the feature is
working as desired. It takes a directory and makes a regular bind and a r/o
bind mount of it. It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.
This patch:
Some filesystems forego the vfs and may_open() and create their own 'struct
file's.
This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.
Also, rename an existing, static-scope init_file() to a less generic name.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make request_key() and co fundamentally asynchronous to make it easier for
NFS to make use of them. There are now accessor functions that do
asynchronous constructions, a wait function to wait for construction to
complete, and a completion function for the key type to indicate completion
of construction.
Note that the construction queue is now gone. Instead, keys under
construction are linked in to the appropriate keyring in advance, and that
anyone encountering one must wait for it to be complete before they can use
it. This is done automatically for userspace.
The following auxiliary changes are also made:
(1) Key type implementation stuff is split from linux/key.h into
linux/key-type.h.
(2) AF_RXRPC provides a way to allocate null rxrpc-type keys so that AFS does
not need to call key_instantiate_and_link() directly.
(3) Adjust the debugging macros so that they're -Wformat checked even if
they are disabled, and make it so they can be enabled simply by defining
__KDEBUG to be consistent with other code of mine.
(3) Documentation.
[alan@lxorguk.ukuu.org.uk: keys: missing word in documentation]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adrian Bunk points out that "unsafe" was used to mark modules touched by
the deprecated MOD_INC_USE_COUNT interface, which has long gone. It's time
to remove the member from the module structure, as well.
If you want a module which can't unload, don't register an exit function.
(Vlad Yasevich says SCTP is now safe to unload, so just remove the
__unsafe there).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Slab constructors currently have a flags parameter that is never used. And
the order of the arguments is opposite to other slab functions. The object
pointer is placed before the kmem_cache pointer.
Convert
ctor(void *object, struct kmem_cache *s, unsigned long flags)
to
ctor(struct kmem_cache *s, void *object)
throughout the kernel
[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes the problem of associating with wpa_secured hidden
AP. Please try out.
The original author of this patch is Bill Moss <bmoss@clemson.edu>
Signed-off-by: Abhijeet Kolekar <abhijeet.kolekar@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Some AP equipment "in the wild" services multiple SSIDs using the
same BSSID. This patch changes the key of sta_bss_list to include
the SSID as well as the BSSID and the channel so as to prevent one
SSID from eclipsing another SSID with the same BSSID.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Some AP equipment "in the wild" uses the same BSSID on multiple channels
(particularly "a" vs. "b/g"). This patch changes the key of sta_bss_list
to include both the BSSID and the channel so as to prevent a BSSID on
one channel from eclipsing the same BSSID on another channel.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There's no reason to warn about an invalid AID field when the
association was denied.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Michael Wu <flamingice@sourmilk.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Adam Baker reported that the prism2 ioctl removal changed behaviour
in that now the selection order was the other way around as before.
New API is planned but not done yet, so for now just use the first
matching channel in any mode as was previous behaviour with an unset
next_mode.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Michael Wu <flamingice@sourmilk.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The commit 65b6a277 titled "ieee80211: Fix header->qos_ctl endian issue"
*introduced* an endianness bug. Partially revert it.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
sparc64:
net/sunrpc/xprtrdma/verbs.c:1264: warning: long long unsigned int format, u64 arg (arg 3)
net/sunrpc/xprtrdma/verbs.c:1264: warning: long long unsigned int format, u64 arg (arg 4)
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The difference in both functions is in the "id" passed to
the rt6_select, so just pass it as an extra argument from
two outer helpers.
This is minus 60 lines of code and 360 bytes of .text
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pnigh_lookup is used to lookup proxy entries and to
create them in case lookup failed.
However, the "creation" code does not perform the re-lookup
after GFP_KERNEL allocation. This is done because the code
is expected to be protected with the RTNL lock, so add the
assertion (mainly to address future questions from new network
developers like me :) ).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With all the users of the double pointers removed from the IPv6 input path,
this patch converts all occurances of sk_buff ** to sk_buff * in IPv6 input
handlers.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
These ones use the generic data types too, so move
them in one place.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the evictor code is consolidated there is no need in
passing the extra pointer to the xxx_put() functions.
The only place when it made sense was the evictor code itself.
Maybe this change must got with the previous (or with the
next) patch, but I try to make them shorter as much as
possible to simplify the review (but they are still large
anyway), so this change goes in a separate patch.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The evictors collect some statistics for ipv4 and ipv6,
so make it return the number of evicted queues and account
them all at once in the caller.
The XXX_ADD_STATS_BH() macros are just for this case,
but maybe there are places in code, that can make use of
them as well.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make in possible we need to know the exact frag queue
size for inet_frags->mem management and two callbacks:
* to destoy the skb (optional, used in conntracks only)
* to free the queue itself (mandatory, but later I plan to
move the allocation and the destruction of frag_queues
into the common place, so this callback will most likely
be optional too).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This code works with the generic data types as well, so
move this into inet_fragment.c
This move makes it possible to hide the secret_timer
management and the secret_rebuild routine completely in
the inet_fragment.c
Introduce the ->hashfn() callback in inet_frags() to get
the hashfun for a given inet_frag_queue() object.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since now all the xxx_frag_kill functions now work
with the generic inet_frag_queue data type, this can
be moved into a common place.
The xxx_unlink() code is moved as well.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some sysctl variables are used to tune the frag queues
management and it will be useful to work with them in
a common way in the future, so move them into one
structure, moreover they are the same for all the frag
management codes.
I don't place them in the existing inet_frags object,
introduced in the previous patch for two reasons:
1. to keep them in the __read_mostly section;
2. not to export the whole inet_frags objects outside.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some objects that are common in all the places
which are used to keep track of frag queues, they are:
* hash table
* LRU list
* rw lock
* rnd number for hash function
* the number of queues
* the amount of memory occupied by queues
* secret timer
Move all this stuff into one structure (struct inet_frags)
to make it possible use them uniformly in the future. Like
with the previous patch this mostly consists of hunks like
- write_lock(&ipfrag_lock);
+ write_lock(&ip4_frags.lock);
To address the issue with exporting the number of queues and
the amount of memory occupied by queues outside the .c file
they are declared in, I introduce a couple of helpers.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce the struct inet_frag_queue in include/net/inet_frag.h
file and place there all the common fields from three structs:
* struct ipq in ipv4/ip_fragment.c
* struct nf_ct_frag6_queue in nf_conntrack_reasm.c
* struct frag_queue in ipv6/reassembly.c
After this, replace these fields on appropriate structures with
this structure instance and fix the users to use correct names
i.e. hunks like
- atomic_dec(&fq->refcnt);
+ atomic_dec(&fq->q.refcnt);
(these occupy most of the patch)
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Uninline netfilter okfns for those cases where gcc can generate tail-calls.
Before:
text data bss dec hex filename
8994153 1016524 524652 10535329 a0c1a1 vmlinux
After:
text data bss dec hex filename
8992761 1016524 524652 10533937 a0bc31 vmlinux
-------------------------------------------------------
-1392
All cases have been verified to generate tail-calls with and without netfilter.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we don't pass double skb pointers to nf_hook_slow anymore, gcc
can generate tail calls for some of the netfilter hook okfn invocations,
so there is no need to inline the functions anymore. This caused huge
code bloat since we ended up with one inlined version and one out-of-line
version since we pass the address to nf_hook_slow.
Before:
text data bss dec hex filename
8997385 1016524 524652 10538561 a0ce41 vmlinux
After:
text data bss dec hex filename
8994009 1016524 524652 10535185 a0c111 vmlinux
-------------------------------------------------------
-3376
All cases have been verified to generate tail-calls with and without
netfilter. The okfns in ipmr and xfrm4_input still remain inline because
gcc can't generate tail-calls for them.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP packets all have writable heads, that is, even though it's cloned, it is
writable up to the end of the TCP header. This patch makes skb_checksum_help
aware of this fact by using skb_clone_writable and avoiding a copy for TCP.
I've also modified the BUG_ON tests to be unsigned. The only case where this
makes a difference is if csum_start points to a location before skb->data.
Since skb->data should always include the header where the checksum field
is (and all currently callers adhere to that), this change is safe and may
uncover bugs later.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
I got confused by the dual nature of the off variable in the
function pskb_expand_head. The csum_start offset should use
nhead instead of off which can change depending on whether we
are using offsets or pointers.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Coverity checker spotted that we'll leak the storage allocated
to 'listeners' in netlink_kernel_create() when the
if (!nl_table[unit].registered)
check is false.
This patch avoids the leak.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Coverity checker spotted that we have already oops'ed if "dst" was
NULL.
Since "dst" being NULL doesn't seem to be possible at this point this
patch removes the NULL check.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Acked-by: Noriaki TAKAMIYA <takamiya@po.ntts.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces unnecessary uses of skb_copy by pskb_expand_head
on the IPv6 input path.
This allows us to remove the double pointers later.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the same change taht was done to ip_defrag. It
makes ipv6_frag_rcv return the last packet received of a train of fragments
rather than the head of that sequence.
This allows us to get rid of the sk_buff ** argument later.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
With all the users of the double pointers removed, this patch mops up by
finally replacing all occurances of sk_buff ** in the netfilter API by
sk_buff *.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces unnecessary uses of skb_copy, pskb_copy and
skb_realloc_headroom by functions such as skb_make_writable and
pskb_expand_head.
This allows us to remove the double pointers later.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the IPVS-specific version of skb_make_writable and
replaces it with the netfilter one.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all callers of netfilter can guarantee that the skb is not shared,
we no longer have to copy the skb in skb_make_writable.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to the special location of the bridging hook, it should never see a
shared packet anyway (certainly not with any in-kernel code). So it
makes sense to unshare the skb there if necessary as that will greatly
simplify the code below it (in particular, netfilter).
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As it is we always invoke pt_prev before ing_filter, even if there are no
ingress filters attached. This can cause unnecessary cloning in pt_prev.
This patch changes it so that we only invoke pt_prev if there are ingress
filters attached.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that ip_frag always returns the packet given to it on input, we can
change it to return an integer indicating error instead. This patch does
that and updates all its callers accordingly.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a bit of a hack. However it is worth it if you consider that
this is the only reason why we have to carry around the struct sk_buff **
pointers in netfilter.
It makes ip_defrag always return the packet that was given to it on input.
It does this by cloning the packet and replacing its original contents with
the head fragment if necessary.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates a new function skb_morph that's just like skb_clone
except that it lets user provide the spare skb that will be overwritten
by the one that's to be cloned.
This will be used by IP fragment reassembly so that we get back the same
skb that went in last (rather than the head skb that we get now which
requires us to carry around double pointers all over the place).
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates a new function __copy_skb_header to merge the common
code between copy_skb_header and skb_clone. Having two functions which
are largely the same is a source of wasted labour as well as confusion.
In fact the tc_verd stuff is almost certainly a bug since it's treated
differently in skb_clone compared to the callers of copy_skb_header
(skb_copy/pskb_copy/skb_copy_expand).
I've kept that difference in tact with a comment added asking for
clarification.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (131 commits)
NFSv4: Fix a typo in nfs_inode_reclaim_delegation
NFS: Add a boot parameter to disable 64 bit inode numbers
NFS: nfs_refresh_inode should clear cache_validity flags on success
NFS: Fix a connectathon regression in NFSv3 and NFSv4
NFS: Use nfs_refresh_inode() in ops that aren't expected to change the inode
SUNRPC: Don't call xprt_release in call refresh
SUNRPC: Don't call xprt_release() if call_allocate fails
SUNRPC: Fix buggy UDP transmission
[23/37] Clean up duplicate includes in
[2.6 patch] net/sunrpc/rpcb_clnt.c: make struct rpcb_program static
SUNRPC: Use correct type in buffer length calculations
SUNRPC: Fix default hostname created in rpc_create()
nfs: add server port to rpc_pipe info file
NFS: Get rid of some obsolete macros
NFS: Simplify filehandle revalidation
NFS: Ensure that nfs_link() returns a hashed dentry
NFS: Be strict about dentry revalidation when doing exclusive create
NFS: Don't zap the readdir caches upon error
NFS: Remove the redundant nfs_reval_fsid()
NFSv3: Always use directory post-op attributes in nfs3_proc_lookup
...
Fix up trivial conflict due to sock_owned_by_user() cleanup manually in
net/sunrpc/xprtsock.c
* 'nfs-server-stable' of git://linux-nfs.org/~bfields/linux:
knfsd: query filesystem for NFSv4 getattr of FATTR4_MAXNAME
knfsd: nfsv4 delegation recall should take reference on client
knfsd: don't shutdown callbacks until nfsv4 client is freed
knfsd: let nfsd manage timing out its own leases
knfsd: Add source address to sunrpc svc errors
knfsd: 64 bit ino support for NFS server
svcgss: move init code into separate function
knfsd: remove code duplication in nfsd4_setclientid()
nfsd warning fix
knfsd: fix callback rpc cred
knfsd: move nfsv4 slab creation/destruction to module init/exit
knfsd: spawn kernel thread to probe callback channel
knfsd: nfs4 name->id mapping not correctly parsing negative downcall
knfsd: demote some printk()s to dprintk()s
knfsd: cleanup of nfsd4 cmp_* functions
knfsd: delete code made redundant by map_new_errors
nfsd: fix horrible indentation in nfsd_setattr
nfsd: remove unused cache_for_each macro
nfsd: tone down inaccurate dprintk
make sync wakeups affine for cache-cold tasks: if a cache-cold task
is woken up by a sync wakeup then use the opportunity to migrate it
straight away. (the two tasks are 'related' because they communicate)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
all uses of and almost all assignments to lro_desc->tcp_ack assume that it's
net-endian; one converts net-endian to host-endian and sticks it in
lro_desc->tcp_ack.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix networking code kernel-doc for newly added parameters.
Warning(linux-2.6.23-git2//net/core/sock.c:879): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:570): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:594): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:617): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:641): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:667): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:722): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:959): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:1195): No description found for parameter 'dev'
Warning(linux-2.6.23-git2//net/core/dev.c:2105): No description found for parameter 'n'
Warning(linux-2.6.23-git2//net/core/dev.c:3272): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:3445): No description found for parameter 'net'
Warning(linux-2.6.23-git2//include/linux/netdevice.h:1301): No description found for parameter 'cpu'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A number of different drivers incorrect access the kobject name field
directly. This is not correct as the name might not be in the array.
Use the proper accessor function instead.
This changes the uevent buffer functions to use a struct instead of a
long list of parameters. It does no longer require the caller to do the
proper buffer termination and size accounting, which is currently wrong
in some places. It fixes a known bug where parts of the uevent
environment are overwritten because of wrong index calculations.
Many thanks to Mathieu Desnoyers for finding bugs and improving the
error handling.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This addition of lost_retrans_low to tcp_sock might be
unnecessary, it's not clear how often lost_retrans worker is
executed when there wasn't work to do.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Detection implemented with lost_retrans must work also when
fastpath is taken, yet most of the queue is skipped including
(very likely) those retransmitted skb's we're interested in.
This problem appeared when the hints got added, which removed
a need to always walk over the whole write queue head.
Therefore decicion for the lost_retrans worker loop entry must
be separated from the sacktag processing more than it was
necessary before.
It turns out to be problematic to optimize the worker loop
very heavily because ack_seqs of skb may have a number of
discontinuity points. Maybe similar approach as currently is
implemented could be attempted but that's becoming more and
more complex because the trend is towards less skb walking
in sacktag marker. Trying a simple work until all rexmitted
skbs heve been processed approach.
Maybe after(highest_sack_end_seq, tp->high_seq) checking is not
sufficiently accurate and causes entry too often in no-work-to-do
cases. Since that's not known, I've separated solution to that
from this patch.
Noticed because of report against a related problem from TAKANO
Ryousei <takano@axe-inc.co.jp>. He also provided a patch to
that part of the problem. This patch includes solution to it
(though this patch has to use somewhat different placement).
TAKANO's description and patch is available here:
http://marc.info/?l=linux-netdev&m=119149311913288&w=2
...In short, TAKANO's problem is that end_seq the loop is using
not necessarily the largest SACK block's end_seq because the
current ACK may still have higher SACK blocks which are later
by the loop.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both sacked_out and fackets_out are directly known from how
parameter. Since fackets_out is accurate, there's no need for
recounting (sacked_out was previously unnecessarily counted
in the loop anyway).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is necessary for upcoming DSACK bugfix. Reduces sacktag
length which is not very sad thing at all... :-)
Notice that there's a need to handle out-of-mem at caller's
place.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's on the way for future cutting of that function.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This condition (plain R) can arise at least in recovery that
is triggered after tcp_undo_loss. There isn't any reason why
they should not be marked as lost, not marking makes in_flight
estimator to return too large values.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
I was reading tcp_enter_loss while looking for Cedric's bug and
noticed bytes_acked adjustment is missing from FRTO side.
Since bytes_acked will only be used in tcp_cong_avoid, I think
it's safe to assume RTO would be spurious. During FRTO cwnd
will be not controlled by tcp_cong_avoid and if FRTO calls for
conventional recovery, cwnd is adjusted and the result of wrong
assumption is cleared from bytes_acked. If RTO was in fact
spurious, we did normal ABC already and can continue without
any additional adjustments.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
From RFC 3493, Section 5.2:
IPV6_MULTICAST_IF
Set the interface to use for outgoing multicast packets. The
argument is the index of the interface to use. If the
interface index is specified as zero, the system selects the
interface (for example, by looking up the address in a routing
table and using the resulting interface).
This patch adds support for (index == 0) to reset the value to it's
original state, allowing the system to choose the best interface. IPv4
already behaves this way.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch will add MODULE_ALIAS("ip6t_<modulename>") where missing,
otherwise you will get
ip6tables: No chain/target/match by that name
when xt_<modulename> is not already loaded.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
With your description I could reproduce the bug and actually you were
completely right: the code above is incorrect. Somehow I was able to
misread RFC1122 and mixed the roles :-(:
When a connection is >>closed actively<<, it MUST linger in
TIME-WAIT state for a time 2xMSL (Maximum Segment Lifetime).
However, it MAY >>accept<< a new SYN from the remote TCP to
reopen the connection directly from TIME-WAIT state, if it:
[...]
The fix is as follows: if the receiver initiated an active close, then the
sender may reopen the connection - otherwise try to figure out if we hold
a dead connection.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) fibnl needs to be declared outside of config ifdefs,
and also should not be explicitly initialized to NULL
2) nl_fib_input() args are wrong for netlink_kernel_create()
input method
Signed-off-by: David S. Miller <davem@davemloft.net>
As discussed before, this patch provides userland with a way to access
relevant options in Router Advertisements, after they are processed
and validated by the kernel. Extra options are processed in a generic
way; this patch only exports RDNSS options described in RFC5006, but
support to control which options are exported could be easily added.
A new rtnetlink message type is defined, to transport Neighbor
Discovery options, along with optional context information. At the
moment only the address of the router sending an RDNSS option is
included, but additional attributes may be later defined, if needed by
new use cases.
Signed-off-by: Pierre Ynard <linkfanel@yahoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch make processing netlink user -> kernel messages synchronious.
This change was inspired by the talk with Alexey Kuznetsov about current
netlink messages processing. He says that he was badly wrong when introduced
asynchronious user -> kernel communication.
The call netlink_unicast is the only path to send message to the kernel
netlink socket. But, unfortunately, it is also used to send data to the
user.
Before this change the user message has been attached to the socket queue
and sk->sk_data_ready was called. The process has been blocked until all
pending messages were processed. The bad thing is that this processing
may occur in the arbitrary process context.
This patch changes nlk->data_ready callback to get 1 skb and force packet
processing right in the netlink_unicast.
Kernel -> user path in netlink_unicast remains untouched.
EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock
drop, but the process remains in the cycle until the message will be fully
processed. So, there is no need to use this kludges now.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are currently two ways to determine whether the netlink socket is a
kernel one or a user one. This patch creates a single inline call for
this purpose and unifies all the calls in the af_netlink.c
No similar calls are found outside af_netlink.c.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
netlink_sendskb does not use third argument. Clean it and save a couple of
bytes.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code in netfilter/nfnetlink.c and in ./net/netlink/genetlink.c looks
like outdated copy/paste from rtnetlink.c. Push them into sync with the
original.
Changes from v1:
- deleted comment in nfnetlink_rcv_msg by request of Patrick McHardy
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to process outstanding netlink user->kernel packets
during rtnl_unlock now. There is no rtnl_trylock in the rtnetlink_rcv
anymore.
Normal code path is the following:
netlink_sendmsg
netlink_unicast
netlink_sendskb
skb_queue_tail
netlink_data_ready
rtnetlink_rcv
mutex_lock(&rtnl_mutex);
netlink_run_queue(sk, qlen, &rtnetlink_rcv_msg);
mutex_unlock(&rtnl_mutex);
So, it is possible, that packets can be present in the rtnl->sk_receive_queue
during rtnl_unlock, but there is no need to process them at that moment as
rtnetlink_rcv for that packet is pending.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
If kernel_accept() returns an error, it may pass back a pointer to
freed memory (which the caller should ignore). Make it pass back NULL
instead for better safety.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expansion of original idea from Denis V. Lunev <den@openvz.org>
Add robustness and locking to the local_port_range sysctl.
1. Enforce that low < high when setting.
2. Use seqlock to ensure atomic update.
The locking might seem like overkill, but there are
cases where sysadmin might want to change value in the
middle of a DoS attack.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add port randomization rather than a simple fixed rover
for use with SCTP. This makes it act similar to TCP, UDP, DCCP
when allocating ports.
No longer need port_alloc_lock as well (suggestion by Brian Haley).
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fourth parameter of /proc/net/psched is supposed to show the timer
resultion and is used by HTB userspace to calculate the necessary
burst rate. Currently we show the clock resolution, which results in a
too low burst rate when the two differ.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the IPv4 x->type->input functions return the next protocol
instead of setting it directly. This is identical to how we do things in
IPv6 and will help us merge common code on the input path.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the setting of the IP length and checksum fields out of
the transforms and into the xfrmX_output functions. This would help future
efforts in merging the transforms themselves.
It also adds an optimisation to ipcomp due to the fact that the transport
offset is guaranteed to be zero.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the duplicate ipv6_{auth,esp,comp}_hdr structures since
they're identical to the IPv4 versions. Duplicating them would only create
problems for ourselves later when we need to add things like extended
sequence numbers.
I've also added transport header type conversion headers for these types
which are now used by the transforms.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 calling convention for x->mode->output is more general and could
help an eventual protocol-generic x->type->output implementation. This
patch adopts it for IPv4 as well and modifies the IPv4 type output functions
accordingly.
It also rewrites the IPv6 mac/transport header calculation to be based off
the network header where practical.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the calling convention so that on entry from
x->mode->output and before entry into x->type->output skb->data
will point to the payload instead of the IP header.
This is essentially a redistribution of skb_push/skb_pull calls
with the aim of minimising them on the common path of tunnel +
ESP.
It'll also let us use the same calling convention between IPv4
and IPv6 with the next patch.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The beet output function completely kills any extension headers by replacing
them with the IPv6 header. This is because it essentially ignores the
result of ip6_find_1stfragopt by simply acting as if there aren't any
extension headers.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
I pointed this out back when this patch was first proposed but it looks like
it got lost along the way.
The checksum only needs to be ignored for NAT-T in transport mode where
we lose the original inner addresses due to NAT. With BEET the inner
addresses will be intact so the checksum remains valid.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
To judge the timing for DAD, netif_carrier_ok() is used. However,
there is a possibility that dev->qdisc stays noop_qdisc even if
netif_carrier_ok() returns true. In that case, DAD NS is not sent out.
We need to defer the IPv6 device initialization until a valid qdisc
is specified.
Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The unregister_netdevice() and dev_change_net_namespace()
both check for dev->flags to be IFF_UP before calling the
dev_close(), but the dev_close() checks for IFF_UP itself,
so remove those unneeded checks.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Follows own function for each task principle, this is really
somewhat separate task being done in sacktag. Also reduces
indentation.
In addition, added ack_seq local var to break some long
lines & fixed coding style things.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just switch to the consolidated calls.
ipt_recent() has to initialize the private, so use
the __seq_open_private() helper.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This concerns the ipv4 and ipv6 code mostly, but also the netlink
and unix sockets.
The netlink code is an example of how to use the __seq_open_private()
call - it saves the net namespace on this private.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The decryption handlers will skip the frame if the RX_FLAG_DECRYPTED
flag is set, so the early flag setting introduced by Johannes breaks
decryption. To work around this, call the handlers first and then set
the flag.
Signed-off-by: Mattias Nissler <mattias.nissler@gmx.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Problem description by Daniel Drake <dsd@gentoo.org>:
"This sequence of events causes loss of connectivity:
<plug in>
<associate as normal in managed mode>
ifconfig eth7 down
iwconfig eth7 mode monitor
ifconfig eth7 up
ifconfig eth7 down
iwconfig eth7 mode managed
<associate as normal>
At this point you are associated but TX does not work. This is because
the eth7 hard_start_xmit is still ieee80211_monitor_start_xmit."
The problem is caused by ieee80211_if_set_type checking for a non-zero
hard_start_xmit pointer value in order to avoid changing that value for
master devices. The fix is to make that check more explicitly linked to
master devices rather than simply checking if the value has been
previously set.
CC: Daniel Drake <dsd@gentoo.org>
Acked-by: Michael Wu <flamingice@sourmilk.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch releases the lock on the state before calling x->type->output.
It also adds the lock to the spots where they're currently needed.
Most of those places (all except mip6) are expected to disappear with
async crypto.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds locking so that when we're copying non-atomic fields such as
life-time or coaddr to user-space we don't get a partial result.
For af_key I've changed every instance of pfkey_xfrm_state2msg apart from
expiration notification to include the keys and life-times. This is in-line
with XFRM behaviour.
The actual cases affected are:
* pfkey_getspi: No change as we don't have any keys to copy.
* key_notify_sa:
+ ADD/UPD: This wouldn't work otherwise.
+ DEL: It can't hurt.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here's a good example of code duplication leading to code rot. The
notification patch did its own netlink message creation for xfrm states.
It duplicated code that was already in dump_one_state. Guess what, the
next time (and the time after) when someone updated dump_one_state the
notification path got zilch.
This patch moves that code from dump_one_state to copy_to_user_state_extra
and uses it in xfrm_notify_sa too. Unfortunately whoever updates this
still needs to update xfrm_sa_len since the notification path wants to
know the exact size for allocation.
At least I've added a comment saying so and if someone still forgest, we'll
have a WARN_ON telling us so.
I also changed the security size calculation to use xfrm_user_sec_ctx since
that's what we actually put into the skb. However it makes no practical
difference since it has the same size as xfrm_sec_ctx.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves some common code that conceptually belongs to the xfrm core
from af_key/xfrm_user into xfrm_alloc_spi.
In particular, the spin lock on the state is now taken inside xfrm_alloc_spi.
Previously it also protected the construction of the response PF_KEY/XFRM
messages to user-space. This is inconsistent as other identical constructions
are not protected by the state lock. This is bad because they in fact should
be protected but only in certain spots (so as not to hold the lock for too
long which may cause packet drops).
The SPI byte order conversion has also been moved.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no point in waking people up when creating/updating larval states
because they'll just go back to sleep again as larval states by definition
cannot be found by xfrm_state_find.
We should only wake them up when the larvals mature or die.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current the x->mode->output functions store the IPv6 nh pointer in the
skb network header. This is inconvenient because the network header then
has to be fixed up before the packet can leave the IPsec stack. The mac
header field is unused on output so we can use that to store this instead.
This patch does that and removes the network header fix-up in xfrm_output.
It also uses ipv6_hdr where appropriate in the x->type->output functions.
There is also a minor clean-up in esp4 to make it use the same code as
esp6 to help any subsequent effort to merge the two.
Lastly it kills two redundant skb_set_* statements in BEET that were
simply copied over from transport mode.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Constructs of the form
xfrm_state_hold(x);
foo(x);
xfrm_state_put(x);
tend to be broken because foo is either synchronous where this is totally
unnecessary or if foo is asynchronous then the reference count is in the
wrong spot.
In the case of xfrm_secpath_reject, the function is synchronous and therefore
we should just kill the reference count.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The newly created net namespace is set to 0 with memset()
in setup_net(). The setup_net() is also called for the
init_net_ns(), which is zeroed naturally as a global var.
So remove this memset and allocate new nets with the
kmem_cache_zalloc().
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6_fib.c, fib6_clean_node() casts a fib6_walker_t pointer to
a fib6_cleaner_t pointer assuming a struct fib6_walker_t (field 'w')
is the first field in struct fib6_walker_t.
To prevent any future problems that may occur if one day a field
is inadvertently inserted before the 'w' field in struct fib6_cleaner_t,
(and to improve readability), this patch uses the container_of() macro.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the net namespaces many code leaved the __init section,
thus making the kernel occupy more memory than it did before.
Since we have a config option that prohibits the namespace
creation, the functions that initialize/finalize some netns
stuff are simply not needed and can be freed after the boot.
Currently, this is almost not noticeable, since few calls
are no longer in __init, but when the namespaces will be
merged it will be possible to free more code. I propose to
use the __net_init, __net_exit and __net_initdata "attributes"
for functions/variables that are not used if the CONFIG_NET_NS
is not set to save more space in memory.
The exiting functions cannot just reside in the __exit section,
as noticed by David, since the init section will have
references on it and the compilation will fail due to modpost
checks. These references can exist, since the init namespace
never dies and the exit callbacks are never called. So I
introduce the __exit_refok attribute just like it is already
done with the __init_refok.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A net_device struct provides field dev_id. It is used for
unique ipv6 generation in case of shared network cards
(as for the OSA network cards of IBM System z).
If VLAN devices are built on top of such shared network cards,
this dev_id information needs to be transferred to the VLAN device.
Signed-off-by: Ursula Braun <braunu@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The lastused update check in xfrm_output can be done just as well in
the mode output function which is specific to RO.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the only callers of xfrm_replay_notify are in xfrm, we can remove
the export.
This patch also removes xfrm_aevent_doreplay since it's now called in just
one spot.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The replay counter is one of only two remaining things in the output code
that requires a lock on the xfrm state (the other being the crypto). This
patch moves it into the generic xfrm_output so we can remove the lock from
the transforms themselves.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions xfrm_state_check and xfrm_state_check_space are only used by
the output code in xfrm_output.c so we can move them over.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of the code in xfrm4_output_one and xfrm6_output_one are identical so
this patch moves them into a common xfrm_output function which will live
in net/xfrm.
In fact this would seem to fix a bug as on IPv4 we never reset the network
header after a transform which may upset netfilter later on.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>