These pointers are RCU protected, so proper primitives should be used.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since some qdiscs call qdisc_tree_decrease_qlen() (so qdisc_lookup())
without rtnl_lock(), adding and deleting from a qdisc list needs
additional locking. This patch adds global spinlock qdisc_list_lock
and wrapper functions for modifying the list. It is considered as a
temporary solution until hfsc_dequeue(), netem_dequeue() and
tbf_dequeue() (or qdisc_tree_decrease_qlen()) are redone.
With feedback from Herbert Xu and David S. Miller.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon reports by Denys Fedoryshchenko, and feedback
and help from Jarek Poplawski and Herbert Xu.
We always either:
1) Never made an external reference to this qdisc.
or
2) Did a dev_deactivate() which purged all asynchronous
references.
So do not lock the qdisc when we call qdisc_destroy(),
it's illegal anyways as when we drop the lock this is
free'd memory.
Signed-off-by: David S. Miller <davem@davemloft.net>
We can now kill them synchronously with all of the
previous dev_deactivate() cures.
This makes netdev destruction and shutdown saner as
the qdiscs hold references to the device.
Signed-off-by: David S. Miller <davem@davemloft.net>
The condition under which the previous qdisc has no more references
after we've attached &noop_qdisc is that both RUNNING and SCHED
are both seen clear while holding the root lock.
So just make specifically that check in the polling loop, instead
of this overly complex "check without then check with lock held"
sequence.
Signed-off-by: David S. Miller <davem@davemloft.net>
This new state lets dev_deactivate() mark a qdisc as having been
deactivated.
dev_queue_xmit() and ing_filter() check for this bit and do not
try to process the qdisc if the bit is set.
dev_deactivate() polls the qdisc after setting the bit, waiting
for both __QDISC_STATE_RUNNING and __QDISC_STATE_SCHED to clear.
This isn't perfect yet, but subsequent changesets will make it so.
This part is just one piece of the puzzle.
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon discussions with Jarek P. and Herbert Xu.
First, we're testing the wrong qdisc. We just reset the device
queue qdiscs to &noop_qdisc and checking it's state is completely
pointless here.
We want to wait until the previous qdisc that was sitting at
the ->qdisc pointer is not busy any more. And that would be
->qdisc_sleeping.
Because of how we propagate the samples qdisc pointer down into
qdisc_run and friends via per-cpu ->output_queue and netif_schedule,
we have to wait also for the __QDISC_STATE_SCHED bit to clear as
well.
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a bug report by Jeff Kirsher.
Don't use qdisc_root_lock() in these cases as the root
qdisc could have been changed, and we'd thus lock the
wrong object.
Tested by Emil S Tantilov who confirms that this seems
to fix the problem.
Signed-off-by: David S. Miller <davem@davemloft.net>
When support for multiple TX queues were added, the
netif_tx_lock() routines we converted to iterate over
all TX queues and grab each queue's spinlock.
This causes heartburn for lockdep and it's not a healthy
thing to do with lots of TX queues anyways.
So modify this to use a top-level lock and a "frozen"
state for the individual TX queues.
Signed-off-by: David S. Miller <davem@davemloft.net>
Bug report from Steven Jan Springl:
Issuing the following command causes a kernel oops:
tc qdisc add dev eth0 handle ffff: ingress
The problem mostly stems from all of the special case handling of
ingress qdiscs.
So, to fix this, do the grafting operation the same way we do for TX
qdiscs. Which means that dev_activate() and dev_deactivate() now do
the "qdisc_sleeping <--> qdisc" transitions on dev->rx_queue too.
Future simplifications are possible now, mainly because it is
impossible for dev_queue->{qdisc,qdisc_sleeping} to be NULL. There
are NULL checks all over to handle the ingress qdisc special case
that used to exist before this commit.
Signed-off-by: David S. Miller <davem@davemloft.net>
Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.
I could make at least one BUILD_BUG_ON conversion.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by Dave:
This patch adds a function to get the driver name from a struct net_device,
and consequently uses this in the watchdog timeout handler to print as
part of the message.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit a0c80b80e0.
After discussions with Jamal and Herbert on netdev, we should
provide at least minimal prioritization at the qdisc level
even in multiqueue situations.
Signed-off-by: David S. Miller <davem@davemloft.net>
Removed unused variable 'skb' in the dev_deactivate_queue function
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The stab bits can't be referenced uniless the full
packet scheduler layer is enabled.
Reported by Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add size table functions for qdiscs and calculate packet size in
qdisc_enqueue().
Based on patch by Patrick McHardy
http://marc.info/?l=linux-netdev&m=115201979221729&w=2
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like noop_qdisc, it needs a dummy backpointer and
explicit qdisc->q.lock initialization.
Based upon a report by Stephen Hemminger.
Signed-off-by: David S. Miller <davem@davemloft.net>
Idea is from Patrick McHardy.
Instead of managing the list of qdiscs on the device level, manage it
in the root qdisc of a netdev_queue. This solves all kinds of
visibility issues during qdisc destruction.
The way to iterate over all qdiscs of a netdev_queue is to visit
the netdev_queue->qdisc, and then traverse it's list.
The only special case is to ignore builting qdiscs at the root when
dumping or doing a qdisc_lookup(). That was not needed previously
because builtin qdiscs were not added to the device's qdisc_list.
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of 'pfifo_fast' we have just plain 'fifo_fast'.
No priority queues, just a straight FIFO.
This is necessary in order to legally have a seperate
qdisc per queue in multi-TX-queue setups, and thus get
full parallelization.
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows less strict control of access to the qdisc attached to a
netdev_queue. It is even allowed to enqueue into a qdisc which is
in the process of being destroyed. The RCU handler will toss out
those packets.
We will need this to handle sharing of a qdisc amongst multiple
TX queues. In such a setup the lock has to be shared, so will
be inside of the qdisc itself. At which point the netdev_queue
lock cannot be used to hard synchronize access to the ->qdisc
pointer.
One operation we have to keep inside of qdisc_destroy() is the list
deletion. It is the only piece of state visible after the RCU quiesce
period, so we have to undo it early and under the appropriate locking.
The operations in the RCU handler do not need any looking because the
qdisc tree is no longer visible to anything at that point.
Signed-off-by: David S. Miller <davem@davemloft.net>
We are registering the device, there is no way anyone can get
at this object's qdiscs yet in any meaningful way.
Signed-off-by: David S. Miller <davem@davemloft.net>
When we have shared qdiscs, packets come out of the qdiscs
for multiple transmit queues.
Therefore it doesn't make any sense to schedule the transmit
queue when logically we cannot know ahead of time the TX
queue of the SKB that the qdisc->dequeue() will give us.
Just for sanity I added a BUG check to make sure we never
get into a state where the noop_qdisc is scheduled.
Signed-off-by: David S. Miller <davem@davemloft.net>
When code wants to lock the qdisc tree state, the logic
operation it's doing is locking the top-level qdisc that
sits of the root of the netdev_queue.
Add qdisc_root_lock() to represent this and convert the
easiest cases.
In order for this to work out in all cases, we have to
hook up the noop_qdisc to a dummy netdev_queue.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it is associated with a netdev_queue, but when we have
qdisc sharing that no longer makes any sense.
Signed-off-by: David S. Miller <davem@davemloft.net>
We liberate any dangling gso_skb during qdisc destruction.
It really only matters for the root qdisc. But when qdiscs
can be shared by multiple netdev_queue objects, we can't
have the gso_skb in the netdev_queue any more.
Signed-off-by: David S. Miller <davem@davemloft.net>
This effectively "flips the switch" by making the core networking
and multiqueue-aware drivers use the new TX multiqueue structures.
Non-multiqueue drivers need no changes. The interfaces they use such
as netif_stop_queue() degenerate into an operation on TX queue zero.
So everything "just works" for them.
Code that really wants to do "X" to all TX queues now invokes a
routine that does so, such as netif_tx_wake_all_queues(),
netif_tx_stop_all_queues(), etc.
pktgen and netpoll required a little bit more surgery than the others.
In particular the pktgen changes, whilst functional, could be largely
improved. The initial check in pktgen_xmit() will sometimes check the
wrong queue, which is mostly harmless. The thing to do is probably to
invoke fill_packet() earlier.
The bulk of the netpoll changes is to make the code operate solely on
the TX queue indicated by by the SKB queue mapping.
Setting of the SKB queue mapping is entirely confined inside of
net/core/dev.c:dev_pick_tx(). If we end up needing any kind of
special semantics (drops, for example) it will be implemented here.
Finally, we now have a "real_num_tx_queues" which is where the driver
indicates how many TX queues are actually active.
With IGB changes from Jeff Kirsher.
Signed-off-by: David S. Miller <davem@davemloft.net>
alloc_netdev_mq() now allocates an array of netdev_queue
structures for TX, based upon the queue_count argument.
Furthermore, all accesses to the TX queues are now vectored
through the netdev_get_tx_queue() and netdev_for_each_tx_queue()
interfaces. This makes it easy to grep the tree for all
things that want to get to a TX queue of a net device.
Problem spots which are not really multiqueue aware yet, and
only work with one queue, can easily be spotted by grepping
for all netdev_get_tx_queue() calls that pass in a zero index.
Signed-off-by: David S. Miller <davem@davemloft.net>
Accesses are mostly structured such that when there are multiple TX
queues the code transformations will be a little bit simpler.
Signed-off-by: David S. Miller <davem@davemloft.net>
Only plain netif_schedule() remains taking a net_device, mostly as a
compatability item while we transition the rest of these interfaces.
Everything else calls netif_schedule_queue() or __netif_schedule(),
both of which take a netdev_queue pointer.
Signed-off-by: David S. Miller <davem@davemloft.net>
Every qdisc is assosciated with a queue, and in the case of ingress
qdiscs that will now be netdev->rx_queue so using that queue's lock is
the thing to do.
Signed-off-by: David S. Miller <davem@davemloft.net>
The lock is now an attribute of the device queue.
One thing to notice is that "suspicious" places
emerge which will need specific training about
multiple queue handling. They are so marked with
explicit "netdev->rx_queue" and "netdev->tx_queue"
references.
Signed-off-by: David S. Miller <davem@davemloft.net>
It can be obtained via the netdev_queue. So create a helper routine,
qdisc_dev(), to make the transformations nicer looking.
Now, qdisc_alloc() now no longer needs a net_device pointer argument.
Signed-off-by: David S. Miller <davem@davemloft.net>
A netdev_queue is an entity managed by a qdisc.
Currently there is one RX and one TX queue, and a netdev_queue merely
contains a backpointer to the net_device.
The Qdisc struct is augmented with a netdev_queue pointer as well.
Eventually the 'dev' Qdisc member will go away and we will have the
resulting hierarchy:
net_device --> netdev_queue --> Qdisc
Also, qdisc_alloc() and qdisc_create_dflt() now take a netdev_queue
pointer argument.
Signed-off-by: David S. Miller <davem@davemloft.net>
Note, in the following patch, 'err' is initialized as:
int err = -ENOBUFS;
Signed-off-by: WANG Cong <wcong@critical-links.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WARN_ON_ONCE() gives a stack trace including the full module list.
Having this in the kernel dump for the timeout case in the
generic netdev watchdog will help us see quicker which driver
is involved. It also allows us to collect statistics
and patterns in terms of which drivers have this event occuring.
Suggested by Andrew Morton
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The qdisc_run loop is currently unbounded and runs entirely in a
softirq. This is bad as it may create an unbounded softirq run.
This patch fixes this by calling need_resched and breaking out if
necessary.
It also adds a break out if the jiffies value changes since that would
indicate we've been transmitting for too long which starves other
softirqs.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert packet schedulers to use the netlink API. Unfortunately a gradual
conversion is not possible without breaking compilation in the middle or
adding lots of casts, so this patch converts them all in one step. The
patch has been mostly generated automatically with some minor edits to
at least allow seperate conversion of classifiers and actions.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __acquires() and __releases() annotations to suppress some sparse
warnings.
example of warnings :
net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong
count at exit
net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' -
unexpected unlock
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc_class_ops are const, and Qdisc_ops are mostly read.
Using "const" and "__read_mostly" qualifiers helps to reduce false
sharing.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>