As done two years ago on IP route cache table (commit
22c047ccbc) , we can avoid using one
lock per hash bucket for the huge TCP/DCCP hash tables.
On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for
litle performance differences. (we hit a different cache line for the
rwlock, but then the bucket cache line have a better sharing factor
among cpus, since we dirty it less often). For netstat or ss commands
that want a full scan of hash table, we perform fewer memory accesses.
Using a 'small' table of hashed rwlocks should be more than enough to
provide correct SMP concurrency between different buckets, without
using too much memory. Sizing of this table depends on
num_possible_cpus() and various CONFIG settings.
This patch provides some locking abstraction that may ease a future
work using a different model for TCP/DCCP table.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Assigning initial values of `0' is redundant when loading a new CCID structure,
since in net/dccp/ccid.c the entire CCID structure is zeroed out prior to
initialisation in ccid_new():
struct ccid {
struct ccid_operations *ccid_ops;
char ccid_priv[0];
};
// ...
if (rx) {
memset(ccid + 1, 0, ccid_ops->ccid_hc_rx_obj_size);
if (ccid->ccid_ops->ccid_hc_rx_init != NULL &&
ccid->ccid_ops->ccid_hc_rx_init(ccid, sk) != 0)
goto out_free_ccid;
} else {
memset(ccid + 1, 0, ccid_ops->ccid_hc_tx_obj_size);
/* analogous to the rx case */
}
This patch therefore removes the redundant assignments. Thanks to Arnaldo for
the inspiration.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This fixes `unaligned (read) access' errors of the type
Kernel unaligned access at TPC[100f970c] dccp_parse_options+0x4f4/0x7e0 [dccp]
Kernel unaligned access at TPC[1011f2e4] ccid3_hc_tx_parse_options+0x1ac/0x380 [dccp_ccid3]
Kernel unaligned access at TPC[100f9898] dccp_parse_options+0x680/0x880 [dccp]
by using the get_unaligned macro for parsing options.
Commiter note: Preserved the sparse __be{16,32} annotations.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This adds support for converting the 11 currently defined Reset codes into system
error numbers, which are stored in sk_err for further interpretation.
This makes the externally visible API behaviour similar to TCP, since a client
connecting to a non-existing port will experience ECONNREFUSED.
* Code 0, Unspecified, is interpreted as non-error (0);
* Code 1, Closed (normal termination), also maps into 0;
* Code 2, Aborted, maps into "Connection reset by peer" (ECONNRESET);
* Code 3, No Connection and
Code 7, Connection Refused, map into "Connection refused" (ECONNREFUSED);
* Code 4, Packet Error, maps into "No message of desired type" (ENOMSG);
* Code 5, Option Error, maps into "Illegal byte sequence" (EILSEQ);
* Code 6, Mandatory Error, maps into "Operation not supported on transport endpoint" (EOPNOTSUPP);
* Code 8, Bad Service Code, maps into "Invalid request code" (EBADRQC);
* Code 9, Too Busy, maps into "Too many users" (EUSERS);
* Code 10, Bad Init Cookie, maps into "Invalid request descriptor" (EBADR);
* Code 11, Aggression Penalty, maps into "Quota exceeded" (EDQUOT)
which makes sense in terms of using more than the `fair share' of bandwidth.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This fixes the following problem: client connects to peer which has no DCCP
enabled or loaded; ICMP error messages ("Protocol Unavailable") can be seen
on the wire, but the application hangs. Reason: ICMP packets don't get through
to dccp_v4_err.
When reporting errors, a sequence number check is made for the DCCP packet
that had caused an ICMP error to arrive.
Such checks can not be made if the socket is in state LISTEN, RESPOND (which
in the implementation is the same as LISTEN), or REQUEST, since update_gsr()
has not been called in these states, hence the sequence window is 0..0.
This patch fixes the problem by adding the REQUEST state as another exemption
to the window check. The error reporting now works as expected on connecting.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This fixes a problem when analysing erroneous packets in dccp_v{4,6}_err:
* dccp_hdr_seq currently takes an skb
* however, the transport headers in the skb are shifted, due to the
preceding IPv4/v6 header.
Fixed for v4 and v6 by changing dccp_hdr_seq to take a struct dccp_hdr as
argument. Verified that the correct sequence number is now reported in the
error handler.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Just like UDP.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Leandro Melo de Sales <leandroal@gmail.com>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have this new macro, use it where possible.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
By adding module aliases to inet_diag, tcp_diag and dccp_diag, we let
them load automatically as needed. This makes tools like "ss" run
faster.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not define the sysctl_dccp_sync_ratelimit sysctl variable in the
CONFIG_SYSCTL dependent sysctl.c module - move it to input.c instead.
This fixes the following build bug:
net/built-in.o: In function `dccp_check_seqno':
input.c:(.text+0xbd859): undefined reference to `sysctl_dccp_sync_ratelimit'
distcc[29953] ERROR: compile (null) on localhost failed
make: *** [vmlinux] Error 1
Found via 'make randconfig' build testing.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With all the users of the double pointers removed from the IPv6 input path,
this patch converts all occurances of sk_buff ** to sk_buff * in IPv6 input
handlers.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes two bugs in processing of connection-Requests in
v{4,6}_conn_request:
1. Due to using the variable `reset_code', the Reset code generated
internally by dccp_parse_options() is overwritten with the
initialised value ("Too Busy") of reset_code, which is not what is
intended.
2. When receiving a connection-Request on a multicast or broadcast
address, no Reset should be generated, to avoid storms of such
packets. Instead of jumping to the `drop' label, the
v{4,6}_conn_request functions now return 0. Below is why in my
understanding this is correct:
When the conn_request function returns < 0, then the caller,
dccp_rcv_state_process(), returns 1. In all instances where
dccp_rcv_state_process is called (dccp_v4_do_rcv, dccp_v6_do_rcv,
and dccp_child_process), a return value of != 0 from
dccp_rcv_state_process() means that a Reset is generated.
If on the other hand the conn_request function returns 0, the
packet is discarded and no Reset is generated.
Note: There may be a related problem when sending the Response, due to
the following.
if (dccp_v6_send_response(sk, req, NULL))
goto drop_and_free;
/* ... */
drop_and_free:
return -1;
In this case, if send_response fails due to transmission errors, the
next thing that is generated is a Reset with a code "Too Busy". I
haven't been able to conjure up such a condition, but it might be good
to change the behaviour here also (not done by this patch).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The elapsed time uses u32, but printk was using %d, not %u.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This
* removes a declaration of a non-existent function
__dccp_minisock_init;
* shifts the initialisation function dccp_minisock_init() from
options.c to minisocks.c, where it is more naturally expected to
be.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This replaces several uses of standard arithmetic with the DCCP
sequence number arithmetic functions. The problem here is that the
sequence number wrap-around was not taken into consideration.
* Condition "seqp->ccid2s_seq <= prev->ccid2s_seq" has been replaced
by
dccp_delta_seqno(seqp->ccid2s_seq, prev->ccid2s_seq) >= 0
since if seqp is `before' prev, then the delta_seqno() is positive.
* The test whether sequence numbers `a' and `b' are consecutive has
the form
dccp_delta_seqno(a, b) == 1
* Increment of ccid2hctx_rpseq could be done using dccp_inc_seqno(),
but since here the incremented ccid2hctx_rpseq == seqno, used
assignment instead.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb's passed to ccid2_hc_tx_send_packet() are headerless, the packet
type is decided later, in dccp_write_xmit(). Therefore the first test
of the switch/case block is always true, the others are never reached.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes a test for `val < 1' which would only have been triggered
when val < 0, due to a preceding test for 0. Fixed by using an
unsigned type for cwnd (as in TCP) instead.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes an ugly BUG_ON which has been pointed out by Arnaldo.
Instead of freezing up the machine, a `critical' message is now issued
to the system log.
There is potential of doing this more gracefully (eg. there are a few
internal variables which could be updated despite the lack of memory),
but that requires more complicated changes to the algorithm; thus a
`FIXME' has been added.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch simplifies the interface of ccid2_hc_tx_alloc_seq():
* ccid2_hc_tx_alloc_seq() is always called with an argument of
CCID2_SEQBUF_LEN;
* other code - ccid2_hc_tx_check_sanity() - even depends on the
assumption that ccid2_hc_tx_alloc_seq() has been called with this
particular size;
* passing the `gfp_t' argument to ccid2_hc_tx_alloc_seq() is
redundant with gfp_any().
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This just sets the parameter to bool, since debugging messages are
either on or off.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This enables applications to query the current value of the Maximum
Packet Size via a socket option, suggested as a SHOULD in (RFC 4340,
p. 102).
This socket option is useful to avoid the annoying bail-out via
`-EMSGSIZE'. In particular, as fragmentation is not currently
supported (and its use is partly discouraged in RFC 4340).
With this option, it is possible to size buffers accordingly, e.g.
int buflen = dccp_get_cur_mps(sockfd);
/* or */
if (msgsize > dccp_get_cur_mps(sockfd))
die("message is too large for this path");
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This performs a minor optimisation: when ccid_hc_tx_send_packet
returns a value greater zero, then the same call previously was done
again at the begin of the while loop in dccp_wait_for_ccid.
This patch exploits the available information and schedule-timeouts
directly instead.
Documentation also added.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since DCCP requires to close both ends of a connection simultaneously,
permission to write in state DCCP_CLOSING is removed in dccp_sendmsg():
* if the sending end closed, it would encounter a write error anyhow;
* if the other end has closed the connection, it accepts no more data.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This factors code common to dccp_v{4,6}_ctl_send_reset into a separate function,
and adds support for filling in the Data 1 ... Data 3 fields from RFC 4340, 5.6.
It is useful to have this separate, since the following Reset codes will always
be generated from the control socket rather than via dccp_send_reset:
* Code 3, "No Connection", cf. 8.3.1;
* Code 4, "Packet Error" (identification for Data 1 added);
* Code 5, "Option Error" (identification for Data 1..3 added, will be used later);
* Code 6, "Mandatory Error" (same as Option Error);
* Code 7, "Connection Refused" (what on Earth is the difference to "No Connection"?);
* Code 8, "Bad Service Code";
* Code 9, "Too Busy";
* Code 10, "Bad Init Cookie" (not used).
Code 0 is not recommended by the RFC, the following codes would be used in
dccp_send_reset() instead, since they all relate to an established DCCP connection:
* Code 1, "Closed";
* Code 2, "Aborted";
* Code 11, "Aggression Penalty" (12.3).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This replaces normal addition with mod-48 addition so that sequence number
wraparound is respected.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This implements a SHOULD from RFC 4340, 7.5.4:
"To protect against denial-of-service attacks, DCCP implementations SHOULD
impose a rate limit on DCCP-Syncs sent in response to sequence-invalid packets,
such as not more than eight DCCP-Syncs per second."
The rate-limit is maintained on a per-socket basis. This is a more stringent
policy than enforcing the rate-limit on a per-source-address basis and
protects against attacks with forged source addresses.
Moreover, the mechanism is deliberately kept simple. In contrast to
xrlim_allow(), bursts of Sync packets in reply to sequence-invalid packets
are not supported. This foils such attacks where the receipt of a Sync
triggers further sequence-invalid packets. (I have tested this mechanism against
xrlim_allow algorithm for Syncs, permitting bursts just increases the problems.)
In order to keep flexibility, the timeout parameter can be set via sysctl; and
the whole mechanism can even be disabled (which is however not recommended).
The algorithm in this patch has been improved with regard to wrapping issues
thanks to a suggestion by Arnaldo.
Commiter note: Rate limited the step 6 DCCP_WARN too, as it says we're
sending a sync.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
In this patch, duplicated code is removed for the case when a Reset packet is
sent from a connected socket. This code duplication is between dccp_make_reset
and dccp_transmit_skb, which already contained an (up to now entirely unused)
switch statement to fill in the reset code from the DCCP_SKB_CB.
The only thing that has been removed is the call to dst_clone(dst), since
the queue_xmit functions use sk_dst_cache anyway.
I wasn't sure which purpose inet_sk_rebuild_header served, so I left it in.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This adds fields to support the informational Data 1..3 fields of the
DCCP-Reset packets (RFC 4340, 5.6), and makes minor cosmetic changes
to documentation.
Code which fills in these fields follows in subsequent patches, it is
primarily used for reporting option-processing and feature-negotiation
errors.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This adds a FIXME to signal that the function dccp_send_delayed_ack is nowhere
used in the entire DCCP/CCID code.
Using a delayed Ack timer is suggested in 11.3 of RFC 4340, but it has also
rather subtle implications for the Ack-Ratio-accounting.
CCID2 does not use this (maybe it should).
I think leaving the function in is good, in case someone wants to implement
this.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This moves several instances of testing against NULL into the function which is
used to de-reference the CCID-private data.
Committer note: Made the BUG_ON depend on having CONFIG_IP_DCCP_CCID3_DEBUG, as it
is too much to have this on production code. Also made sure that
the macro is used only after checking if sk_state is not LISTEN,
to make it equivalent to what we had before.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This fixes the code to correspond to RFC 4340, 7.5.4, which states the
exception that a Sync received in state REQUEST generates a Reset (not
a SyncAck).
To achieve this, only a small change is required. Since
dccp_rcv_request_sent_state_process() already uses the correct Reset Code
number 4 ("Packet Error"), we only need to shift the if-statement a few
lines further down.
(To test this case: replace DCCP_PKT_RESPONSE with DCCP_PKT_SYNC
in dccp_make_response.)
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
The parameter `seq' of dccp_send_sync() is in fact an acknowledgement number
and not a sequence number - thus renamed by this patch into `ackno'.
Secondly, a `critical' warning is added when a Sync/SyncAck could not be sent.
Sanity: I have checked all other functions that are called in dccp_transmit_skb,
there are no clashes with the use of dccpd_ack_seq; no other function is
using this slot at the same time.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This updates sequence number checking with regard to RFC 4340, 7.5.4.
Missing in the code was an exception for sequence-invalid Reset packets,
which get a Sync acknowledging GSR, instead of (as usual) P.seqno.
This can lead to an oscillating ping-pong flood of Reset packets.
In fact, it has been observed on the wire as follows:
1. client establishes connection to server;
2. before server can write to client, client crashes without notifying
the server (NB: now no longer possible due to ABORT function);
3. server sends DCCP-Data packet (has no ackno);
4. client generates Reset "No Connection", seqno=0, increments seqno;
5. server replies with Sync, using ackno = P.seqno;
6. client generates Reset "No Connection" with seqno = ackno + 1;
7. goto (5).
The difference is that now in (5) the server uses GSR. This causes the
Reset sent by the client in (6) to become sequence-valid, so that in (7)
the vicious circle is broken; the Reset is then enqueued and causes the
socket to enter TIMEWAIT state.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is in part required by the next patch; it
* replaces 6 instances of `DCCP_SKB_CB(skb)->dccpd_seq' with `seqno';
* replaces 7 instances of `DCCP_SKB_CB(skb)->dccpd_ack_seq' with `ackno';
* replaces 1 use of dccp_inc_seqno() by unfolding `ADD48' macro in place.
No changes in algorithm, all changes are text replacement/substitution.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The third parameter of dccp_sample_rtt now becomes useless and is removed.
Also combined the subtraction of the timestamp echo and the elapsed time.
This is safe, since (a) presence of timestamp echo is tested first and (b)
elapsed time is either present and non-zero or it is not set and equals 0
due to the memset in dccp_parse_options.
To avoid measuring option-processing time, the timestamp for measuring the
initial Request/Response RTT sample is taken directly when the function is
called (the Linux implementation always adds a timestamp on the Request,
so there is no loss in doing this).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides a timesource, conveniently used for DCCP timestamps, which
returns the elapsed time in 10s of microseconds since initialisation.
This makes for a wrap-around time of about 11.9 hours, which should be
sufficient for most applications.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reduces the number of timestamps taken in the receive path
for each packet.
The ccid3_hc_tx_update_x() routine is called in
* the receive path for each CCID3-controlled packet
* for the nofeedback timer (if no feedback arrives during 4 RTT)
Currently, when there is no loss, each packet gets timestamped twice.
The patch resolves this by recycling the first timestamp taken on packet
reception for RTT sampling.
When the no_feedback_timer() is called, then the timestamp argument is
simply set to NULL - so that ccid3_hc_tx_update_x() takes care of the logic.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes /proc/net per network namespace. It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to be handle multiple network namespaces.
Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This trivial patch removes the unneeded pointer newdp, which is never used.
Signed-off-by: Micah Gruber <micah.gruber@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hopefully captured all single statement cases under net/. I'm
not too sure if there is some policy about #includes that are
"guaranteed" (ie., in the current tree) to be available through
some other #included header, so I just added linux/kernel.h to
each changed file that didn't #include it previously.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now to convert the ackvec code to ktime_t so that we can get rid of
dccp_timestamp and the epoch thing in dccp_sock.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code was too complicated, if p > 0 in ccid3_hc_tx_no_feedback_timer the
timestamp was being obtained to be passed to ccid3_hc_tx_update_x, where only
if p > 0 the timestamp was needed, so just leave it to ccid3_hc_tx_update_x to
obtain the timestamp if needed.
This will help in the upcoming changesets where we'll convert t_ld to ktime_t.
We'll eventually try to reuse ktime_get_real() calls again.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a memory leak in net/dccp/feat.c::dccp_feat_empty_confirm(). If we
hit the 'default:' case of the 'switch' statement, then we return without
freeing 'opt', thus leaking 'struct dccp_opt_pend' bytes.
The leak is fixed easily enough by adding a kfree(opt); before the return
statement.
The patch also changes the layout of the 'switch' to be more in line with
CodingStyle.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure that spin_unlock_wait() is properly ordered wrt atomic_inc().
(akpm: can't we convert this code to use rwlocks?)
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.
This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
AFAICT now that jprobe.entry is a void *, JPROBE_ENTRY doesn't do anything
useful - so remove it ..
I've left a do-nothing version so that out-of-tree jprobes code will still
compile without modifications.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on <draft-ietf-ipv6-deprecate-rh0-00.txt>.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ccid3_hc_tx_send_packet currently returns 0 when the time difference between
current time and t_nom is less than 1000 microseconds.
In this case the packet is sent immediately; but, unlike other packets that can
be emitted on first attempt, it will not have its window counter updated and
its options set as required. This is a bug.
Fix: Require the time difference to be at least 1000 microseconds. The
algorithm then converges: time differences > 1000 microseconds trigger the
timer in dccp_write_xmit; after timer expiry this function is tried again; when
the time difference is less than 1000, the packet will have its options added
and window counter updated as required.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This updates the computation of t_nom and t_last_win_count to use the newer
gettimeofday interface.
Committer note: used ktime_to_timeval to set the 'now' variable to t_ld in
ccid3hctx_no_feedback_timer
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
It had just a slab cache, so, for the sake of simplicity just make
dccp_trfc_lib module init routine create the slab cache, no need for users of
the lib to create a private loss_interval object.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Now ccid3_hc_rx_update_li is ready to be moved to
net/dccp/ccids/lib/loss_interval, it uses the same interface as the other
functions there.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This is a preparatory patch for moving these loss interval functions from
net/dccp/ccids/ccid3.c to net/dccp/ccids/lib/loss_interval.c.
Based on a patch by Ian McDonald.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
When compiling with EXTRA_CFLAGS=-W noticed that tstamp is not initialised
correctly in dccp_li_calc_first_li.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
When compiling with EXTRA_CFLAGS=-W notice that we have signed/unsigned issue
in dccp.h.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Recent gcc versions emit warnings when unsigned variables are
compared < 0 or >= 0.
Signed-off-by: Bill Nottingham <notting@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current IPSEC rule resolution behavior we have does not work for a
lot of people, even though technically it's an improvement from the
-EAGAIN buisness we had before.
Right now we'll block until the key manager resolves the route. That
works for simple cases, but many folks would rather packets get
silently dropped until the key manager resolves the IPSEC rules.
We can't tell these folks to "set the socket non-blocking" because
they don't have control over the non-block setting of things like the
sockets used to resolve DNS deep inside of the resolver libraries in
libc.
With that in mind I coded up the patch below with some help from
Herbert Xu which provides packet-drop behavior during larval state
resolution, controllable via sysctl and off by default.
This lays the framework to either:
1) Make this default at some point or...
2) Move this logic into xfrm{4,6}_policy.c and implement the
ARP-like resolution queue we've all been dreaming of.
The idea would be to queue packets to the policy, then
once the larval state is resolved by the key manager we
re-resolve the route and push the packets out. The
packets would timeout if the rule didn't get resolved
in a certain amount of time.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use menuconfigs instead of menus, so the whole menu can be disabled at
once instead of going through all options.
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
SPIN_LOCK_UNLOCKED cleanup,use __SPIN_LOCK_UNLOCKED instead
Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This prints the value of the parsed Elapsed Time when received via a
Timestamp Echo option [RFC 4342, 13.3].
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes an error in the calculation of t_ipi when X converges towards
very low sending rates (between 1 and 64 bytes per second).
Although this case may not sound likely, it can be reproduced by connecting,
hitting enter (1 byte sent) and waiting for some time, during which the
nofeedback timer halves the sending rate until finally it reaches the region
1..64 bytes/sec. Computing X is handled correctly (tested separately); but by
dividing X _before_ entering the calculation of t_ipi, X becomes zero as
a result. This in turn triggers a BUG condition caught in scaled_div().
Fixed by replacing with equivalent statement and explicit typecast for good
measure.
Calculation verified and effect of patch tested - reduced never below 1 byte
per 64 seconds afterwards, i.e. not allowing divide-by-zero.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch follows the following recommendation made in an erratum to RFC 4342:
"Senders MAY additionally make use of other available RTT measurements,
including those from the initial Request-Response packet exchange."
It implements larger initial windows with regard to this inital RTT measurement,
using the mechanism suggested in draft-ietf-dccp-rfc3448bis, section 4.2.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This replaces the existing occurrences of RTT sampling with
the use of the new function dccp_sample_rtt.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recurring problem, in particular in the CCID code, is that RTT samples
from packets with timestamp echo and elapsed time options need to be taken.
This service is provided via a new function dccp_sample_rtt in this patch.
Furthermore, to protect against `insane' RTT samples, the sampled value
is bounded between 100 microseconds and 4 seconds - for which u32 is sufficient.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CCID 3 and TFRC specs (RFC 4342, RFC 3448, draft-3448bis) make frequent
reference to the computation of the RFC-3390 initial sending rate:
1. Initial sending rate when RTT is known (RFC 4342, p. 6)
2. Response to Idle/Application-Limited periods (RFC 4342, 5.1)
This warrants putting the code into its own function, for later code reuse.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This clears the following sparc64 build warnings:
1) warning: format "%ld" expects type "long int", but argument 3 has type "suseconds_t"
2) warning: format "%llu" expects type "long long unsigned int", but argument 3 has type "__u64"
Fixed by using typecast to unsigned. This is argued to be safe, since the quantities, after
de-scaling (factor 2^6) fit all in u32.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a few more fields of interest to /proc/net/dccpprobe, the following output ensues:
1 2 3 4 5 6 7 8 9 10 11
sec.usec src:sport dst:dport size s rtt p X_calc X_recv X t_ipi
Also made the formatting consistent.
Scripts that go with this can be downloaded from http://139.133.210.30/users/gerrit/dccp/dccp_probe/
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds more detail in the wait_for_ccid packet scheduling loop.
In particular, it informs about (i) when delay is used and (ii) why
a packet is discarded.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently debugging output (when configured) is automatically enabled when
DCCP modules are compiled into the kernel rather than built as loadable modules.
This is not necessary, since the module parameters in this case become kernel
commandline parameters, e.g. DCCP or CCID3 debug output can be enabled for a
static build by appending the following at the boot prompt:
dccp.dccp_debug=1 dccp_ccid3.ccid3_debug=1
This patch therefore does away with the more complicated way of always enabling
debug output for static builds
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This:
1. removes a race condition in the access to the scheduled send time t_nom which
results from allowing asynchronous r/w access to t_nom without locks;
2. updates the inter-packet interval t_ipi = s/X when `s' changes, following a
suggestion by Ian McDonald.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a few debugging statements to ccid3.c
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a bug which uses an invalid comparison.
The bug resulted in the use of invalid loss intervals.
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This improves the slow-start phase by using the MSS
(as suggested in RFC 4342, sec. 5) instead of the packet size s.
Also figured out that __u32 is ample resource enough.
After applying, I got the following in the logs:
ccid3_hc_tx_packet_recv: client(f7421700), s=6, MSS=1424, w_init=4380, R_sample=176us, X=24886363
Had the previous variant been used, w_init would have been as low as 24.
Committer note: removed unneeded cast to unsigned long long that was
causing a compiler warning on 64bit architectures.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No code change at all.
This splits ccid3.c into a RX and a TX section, so that the file has an
organisation similar to the other ones (e.g. packet_history.{h,c}).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since CCID3 avoids sending 0-byte data packets (cf. ccid3_hc_tx_send_packet),
testing for zero-payload length, as performed by ccid3_hc_tx_update_s, is
redundant - hence removed by this patch.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>