Add the elastic array of void * pointer to the struct net.
The access rules are simple:
1. register the ops with register_pernet_gen_device to get
the id of your private pointer
2. call net_assign_generic() to put the private data on the
struct net (most preferably this should be done in the
->init callback of the ops registered)
3. do not store any private reference on the net_generic array;
4. do not change this pointer while the net is alive;
5. use the net_generic() to get the pointer.
When adding a new pointer, I copy the old array, replace it
with a new one and schedule the old for kfree after an RCU
grace period.
Since the net_generic explores the net->gen array inside rcu
read section and once set the net->gen->ptr[x] pointer never
changes, this grants us a safe access to generic pointers.
Quoting Paul: "... RCU is protecting -only- the net_generic
structure that net_generic() is traversing, and the [pointer]
returned by net_generic() is protected by a reference counter
in the upper-level struct net."
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make some per-net generic pointers, we need some way to address
them, i.e. - IDs. This is simple IDA-based IDs generator for pernet
subsystems.
Addressing questions about potential checkpoint/restart problems:
these IDs are "lite-offsets" within the net structure and are by no
means supposed to be exported to the userspace.
Since it will be used in the nearest future by devices only (tun,
vlan, tunnels, bridge, etc), I make it resemble the functionality
of register_pernet_device().
The new ids is stored in the *id pointer _before_ calling the init
callback to make this id available in this callback.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This expresses __skb_append in terms of __skb_queue_after, exploiting that
__skb_append(old, new, list) = __skb_queue_after(list, old, new).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel-doc comment for skb_segment is clearly wrong. This states
what it actually does.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SKF_ADF_NLATTR searches for a netlink attribute, which avoids manually
parsing and walking attributes. It takes the offset at which to start
searching in the 'A' register and the attribute type in the 'X' register
and returns the offset in the 'A' register. When the attribute is not
found it returns zero.
A top-level attribute can be located using a filter like this
(example for nfnetlink, using struct nfgenmsg):
...
{
/* A = offset of first attribute */
.code = BPF_LD | BPF_IMM,
.k = sizeof(struct nlmsghdr) + sizeof(struct nfgenmsg)
},
{
/* X = CTA_PROTOINFO */
.code = BPF_LDX | BPF_IMM,
.k = CTA_PROTOINFO,
},
{
/* A = netlink attribute offset */
.code = BPF_LD | BPF_B | BPF_ABS,
.k = SKF_AD_OFF + SKF_AD_NLATTR
},
{
/* Exit if not found */
.code = BPF_JMP | BPF_JEQ | BPF_K,
.k = 0,
.jt = <error>
},
...
A nested attribute below the CTA_PROTOINFO attribute would then
be parsed like this:
...
{
/* A += sizeof(struct nlattr) */
.code = BPF_ALU | BPF_ADD | BPF_K,
.k = sizeof(struct nlattr),
},
{
/* X = CTA_PROTOINFO_TCP */
.code = BPF_LDX | BPF_IMM,
.k = CTA_PROTOINFO_TCP,
},
{
/* A = netlink attribute offset */
.code = BPF_LD | BPF_B | BPF_ABS,
.k = SKF_AD_OFF + SKF_AD_NLATTR
},
...
The data of an attribute can be loaded into 'A' like this:
...
{
/* X = A (attribute offset) */
.code = BPF_MISC | BPF_TAX,
},
{
/* A = skb->data[X + k] */
.code = BPF_LD | BPF_B | BPF_IND,
.k = sizeof(struct nlattr),
},
...
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk_filter function is too big to be inlined. This saves 2296 bytes
of text on allyesconfig.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some minor style cleanups:
* Move __KERNEL__ definitions to one place in filter.h
* Use const for sk_filter_len
* Line wrapping
* Put EXPORT_SYMBOL next to function definition
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Such an accounting would cost us two more dereferences to get the
percpu variable from the struct net, so I make sock_prot_inuse_get
and _add calls work differently depending on CONFIG_NET_NS - without
it old optimized routines are used.
The per-cpu counter for init_net is prepared in core_initcall, so
that even af_inet, that starts as fs_initcall, will already have the
init_net prepared.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This counter is about to become per-proto-and-per-net, so we'll need
two arguments to determine which cell in this "table" to work with.
All the places, but proc already pass proper net to it - proc will be
tuned a bit later.
Some indentation with spaces in proc files is done to keep the file
coding style consistent.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's already some stuff on the struct net, that should better
be folded into netns_core structure. I'm making the per-proto inuse
counter be per-net also, which is also a candidate for this, so
introduce this structure and populate it a bit.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Constructive part of the set is finished here. We have to remove the
pcounter, so start with its init and free functions.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And redirect sock_prot_inuse_add and _get to use one.
As far as the dereferences are concerned. Before the patch we made
1 dereference to proto->inuse.add call, the call itself and then
called the __get_cpu_var() on a static variable. After the patch we
make a direct call, then one dereference to proto->inuse_idx and
then the same __get_cpu_var() on a still static variable. So this
patch doesn't seem to produce performance penalty on SMP.
This is not per-net yet, but I will deliberately make NET_NS=y case
separated from NET_NS=n one, since it'll cost us one-or-two more
dereferences to get the struct net and the inuse counter.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The inuse counters are going to become a per-cpu array. Introduce an
index for this array on the struct proto.
To handle the case of proto register-unregister-register loop the
bitmap is used. All its bits manipulations are protected with
proto_list_lock and a sanity check for the bitmap being exhausted is
also added.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extract hash function for pneigh entries from pneigh_lookup(),
__pneigh_lookup() and pneigh_delete() as pneigh_hash().
Extract core of pneigh_lookup() and __pneigh_lookup() as
__pneigh_lookup_1().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
SIOCADDMULTI/SIOCDELMULTI check whether the driver has a set_multicast_list
method to determine whether it supports multicast. Drivers implementing
secondary unicast support use set_rx_mode however.
Check for both dev->set_multicast_mode and dev->set_rx_mode to determine
multicast capabilities.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce an inline net_eq() to compare two namespaces.
Without CONFIG_NET_NS, since no namespace other than &init_net
exists, it is always 1.
We do not need to convert 1) inline vs inline and
2) inline vs &init_net comparisons.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Introduce neigh_parms/pneigh_entry inlines: neigh_parms_net(), pneigh_net().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Without CONFIG_NET_NS, no namespace other than &init_net exists,
no need to store net in seq_net_private.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
This file displays the registered packet types, but some of them
(packet sockets creates such) can be bound to a net device and showing
them in a wrong namespace is not correct.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Proxy neighbors do not have any reference counting, so any caller
of pneigh_lookup (unless it's a netlink triggered add/del routine)
should _not_ perform any actions on the found proxy entry.
There's one exception from this rule - the ipv6's ndisc_recv_ns()
uses found entry to check the flags for NTF_ROUTER.
This creates a race between the ndisc and pneigh_delete - after
the pneigh is returned to the caller, the nd_tbl.lock is dropped
and the deleting procedure may proceed.
One of the fixes would be to add a reference counting, but this
problem exists for ndisc only. Besides such a patch would be too
big for -rc4.
So I propose to introduce a __pneigh_lookup() which is supposed
to be called with the lock held and use it in ndisc code to check
the flags on alive pneigh entry.
Changes from v2:
As David noticed, Exported the __pneigh_lookup() to ipv6 module.
The checkpatch generates a warning on it, since the EXPORT_SYMBOL
does not follow the symbol itself, but in this file all the
exports come at the end, so I decided no to break this harmony.
Changes from v1:
Fixed comments from YOSHIFUJI - indentation of prototype in header
and the pndisc_check_router() name - and a compilation fix, pointed
by Daniel - the is_routed was (falsely) considered as uninitialized
by gcc.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update: My mailer ate one of Jarek's feedback mails... Fixed the
parameter in netif_set_gso_max_size() to be u32, not u16. Fixed the
whitespace issue due to a patch import botch. Changed the types from
u32 to unsigned int to be more consistent with other variables in the
area. Also brought the patch up to the latest net-2.6.26 tree.
Update: Made gso_max_size container 32 bits, not 16. Moved the
location of gso_max_size within netdev to be less hotpath. Made more
consistent names between the sock and netdev layers, and added a
define for the max GSO size.
Update: Respun for net-2.6.26 tree.
Update: changed max_gso_frame_size and sk_gso_max_size from signed to
unsigned - thanks Stephen!
This patch adds the ability for device drivers to control the size of
the TSO frames being sent to them, per TCP connection. By setting the
netdevice's gso_max_size value, the socket layer will set the GSO
frame size based on that value. This will propogate into the TCP
layer, and send TSO's of that size to the hardware.
This can be desirable to help tune the bursty nature of TSO on a
per-adapter basis, where one may have 1 GbE and 10 GbE devices
coexisting in a system, one running multiqueue and the other not, etc.
This can also be desirable for devices that cannot support full 64 KB
TSO's, but still want to benefit from some level of segmentation
offloading.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
zap_completion_queue() retrieves skbs from completion_queue where they have
zero skb->users counter. Before dev_kfree_skb_any() it should be non-zero
yet, so it's increased now.
Reported-and-tested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
'sync' wakeups are a hint towards the scheduler that (certain)
networking related wakeups likely create coupling between tasks.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a report by Andrew Morton and code analysis done
by Jarek Poplawski.
This reverts 33f807ba0d ("[NETPOLL]:
Kill NETPOLL_RX_DROP, set but never tested.") and
c7b6ea24b4 ("[NETPOLL]: Don't need
rx_flags.").
The rx_flags did get tested for zero vs. non-zero and therefore we do
need those tests and that code which sets NETPOLL_RX_DROP et al.
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some place, that calculate the ARP header length. These
calculations are correct, but
a) some operate with "magic" constants,
b) enlarge the code length (sometimes at the cost of coding style),
c) are not informative from the first glance.
The proposal is to introduce a helper, that includes all the good
sides of these calculations.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_update sends skb from neigh->arp_queue while neigh_timer_handler
has increased skbs refcount and calls solicit with the
skb. neigh_timer_handler should not increase skbs refcount but make a
copy of the skb and do solicit with the copy.
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The seq_file_operations' dev_mc_seq_xxx callbacks do the same thing as
the dev_seq_xxx ones do, but skip the SEQ_START_TOKEN.
So use the existing exported dev_seq_xxx calls and handle the
SEQ_START_TOKEN in the dev_mc_seq_show().
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This staff will be needed for non-netlink kernel sockets, which should
also not pin a namespace like tcp_socket and icmp_socket.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Device inside the namespace can be started and downed. So, active routing
cache should be cleaned up on device stop.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>