- 22 Mar, 2012 19 commits
-
-
Alex Elder authored
Make a small change in the code that counts down kvecs consumed by a ceph_tcp_sendmsg() call. Same functionality, just blocked out a little differently. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Move blocks of code out of loops in read_partial_message_section() and read_partial_message(). They were only was getting called at the end of the last iteration of the loop anyway. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Calculate CRC in a separate step from rearranging the byte order of the result, to improve clarity and readability. Use offsetof() to determine the number of bytes to include in the CRC calculation. In read_partial_message(), switch which value gets byte-swapped, since the just-computed CRC is already likely to be in a register. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Change the name (and type) of a few CRC-related Boolean local variables so they contain the word "do", to distingish their purpose from variables used for holding an actual CRC value. Note that in the process of doing this I identified a fairly serious logic error in write_partial_msg_pages(): the value of "do_crc" assigned appears to be the opposite of what it should be. No attempt to fix this is made here; this change preserves the erroneous behavior. The problem I found is documented here: http://tracker.newdream.net/issues/2064 Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Many ceph-related Boolean options offer the ability to both enable and disable a feature. For all those that don't offer this, add a new option so that they do. Note that ceph_show_options()--which reports mount options currently in effect--only reports the option if it is different from the default value. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
This gathers a number of very minor changes: - use %hu when formatting the a socket address's address family - null out the ceph_msgr_wq pointer after the queue has been destroyed - drop a needless cast in ceph_write_space() - add a WARN() call in ceph_state_change() in the event an unrecognized socket state is encountered - rearrange the logic in ceph_con_get() and ceph_con_put() so that: - the reference counts are only atomically read once - the values displayed via dout() calls are known to be meaningful at the time they are formatted Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
There is no real need for ceph_tcp_connect() to return the socket pointer it creates, since it already assigns it to con->sock, which is visible to the caller. Instead, have it return an error code, which tidies things up a bit. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Define a helper function to perform various cleanup operations. Use it both in the exit routine and in the init routine in the event of an error. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
The messenger workqueue has no need to be public. So give it static scope. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Encapsulate the operation of adding a new chunk of data to the next open slot in a ceph_connection's out_kvec array. Also add a "reset" operation to make subsequent add operations start at the beginning of the array again. Use these routines throughout, avoiding duplicate code and ensuring all calls are handled consistently. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
One of the arguments to prepare_write_connect() indicates whether it is being called immediately after a call to prepare_write_banner(). Move the prepare_write_banner() call inside prepare_write_connect(), and reinterpret (and rename) the "after_banner" argument so it indicates that prepare_write_connect() should *make* the call rather than should know it has already been made. This was split out from the next patch to highlight this change in logic. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
ceph_parse_options() takes the address of a pointer as an argument and uses it to return the address of an allocated structure if successful. With this interface is not evident at call sites that the pointer is always initialized. Change the interface to return the address instead (or a pointer-coded error code) to make the validity of the returned pointer obvious. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
This fixes some spots where a type cast to (void *) was used as as a universal type hiding mechanism. Instead, properly cast the type to the intended target type. Signed-off-by:
Alex Elder <elder@newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
This eliminates type casts in some places where they are not required. Signed-off-by:
Alex Elder <elder@newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
A spinlock is used to protect a value used for selecting an array index for a string used for formatting a socket address for human consumption. The index is reset to 0 if it ever reaches the maximum index value. Instead, use an ever-increasing atomic variable as a sequence number, and compute the array index by masking off all but the sequence number's lowest bits. Make the number of entries in the array a power of two to allow the use of such a mask (to avoid jumps in the index value when the sequence number wraps). The length of these strings is somewhat arbitrarily set at 60 bytes. The worst-case length of a string produced is 54 bytes, for an IPv6 address that can't be shortened, e.g.: [1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767 Change it so we arbitrarily use 64 bytes instead; if nothing else it will make the array of these line up better in hex dumps. Rename a few things to reinforce the distinction between the number of strings in the array and the length of individual strings. Signed-off-by:
Alex Elder <elder@newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Rearrange ceph_tcp_connect() a bit, making use of "else" rather than re-testing a value with consecutive "if" statements. Don't record a connection's socket pointer unless the connect operation is successful. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Each messenger allocates a page to be used when writing zeroes out in the event of error or other abnormal condition. Instead, use the kernel ZERO_PAGE() for that purpose. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Xi Wang authored
The existing overflow check (n > ULONG_MAX / b) didn't work, because n = ULONG_MAX / b would both bypass the check and still overflow the allocation size a + n * b. The correct check should be (n > (ULONG_MAX - a) / b). Signed-off-by:
Xi Wang <xi.wang@gmail.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Jim Schutt authored
The Ceph messenger would sometimes queue multiple work items to write data to a socket when the socket buffer was full. Fix this problem by making ceph_write_space() use SOCK_NOSPACE in the same way that net/core/stream.c:sk_stream_write_space() does, i.e., clearing it only when sufficient space is available in the socket buffer. Signed-off-by:
Jim Schutt <jaschut@sandia.gov> Reviewed-by:
Alex Elder <elder@dreamhost.com>
-
- 17 Mar, 2012 2 commits
-
-
Pablo Neira Ayuso authored
Kerin Millar reported hardlockups while running `conntrackd -c' in a busy firewall. That system (with several processors) was acting as backup in a primary-backup setup. After several tries, I found a race condition between the deletion operation of ctnetlink and timeout expiration. This patch fixes this problem. Tested-by:
Kerin Millar <kerframil@gmail.com> Reported-by:
Kerin Millar <kerframil@gmail.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
RongQing.Li authored
ip6_mc_find_dev_rcu() is called with rcu_read_lock(), so don't need to dev_hold(). With dev_hold(), not corresponding dev_put(), will lead to leak. [ bug introduced in 96b52e61 (ipv6: mcast: RCU conversions) ] Signed-off-by:
RongQing.Li <roy.qing.li@gmail.com> Acked-by:
Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 16 Mar, 2012 2 commits
-
-
Eric Dumazet authored
This reverts commit d47a0ac7 (sch_sfq: dont put new flow at the end of flows) As Jesper found out, patch sounded great but has bad side effects. In stress situation, pushing new flows in front of the queue can prevent old flows doing any progress. Packets can stay in SFQ queue for unlimited amount of time. It's possible to add heuristics to limit this problem, but this would add complexity outside of SFQ scope. A more sensible answer to Dave Taht concerns (who reported the issued I tried to solve in original commit) is probably to use a qdisc hierarchy so that high prio packets dont enter a potentially crowded SFQ qdisc. Reported-by:
Jesper Dangaard Brouer <jdb@comx.dk> Cc: Dave Taht <dave.taht@gmail.com> Signed-off-by:
Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
commit 87a11578 ( ipv6: Move xfrm_lookup() call down into icmp6_dst_alloc().) forgot to convert one error path, leading to crashes in mld_sendpack() Many thanks to Dave Jones for providing a very complete bug report. Reported-by:
Dave Jones <davej@redhat.com> Signed-off-by:
Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 11 Mar, 2012 1 commit
-
-
Eric Dumazet authored
commit ea4fc0d6 (ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit()) added a serious regression on synflood handling. Simon Kirby discovered a successful connection was delayed by 20 seconds before being responsive. In my tests, I discovered that xmit frames were lost, and needed ~4 retransmits and a socket dst rebuild before being really sent. In case of syncookie initiated connection, we use a different path to initialize the socket dst, and inet->cork.fl.u.ip4 is left cleared. As ip_queue_xmit() now depends on inet flow being setup, fix this by copying the temp flowi4 we use in cookie_v4_check(). Reported-by:
Simon Kirby <sim@netnation.com> Bisected-by:
Simon Kirby <sim@netnation.com> Signed-off-by:
Eric Dumazet <eric.dumazet@gmail.com> Tested-by:
Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 08 Mar, 2012 4 commits
-
-
Steffen Klassert authored
As we invalidate the inetpeer tree along with the routing cache now, we don't need a genid to reset the redirect handling when the routing cache is flushed. Signed-off-by:
Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Steffen Klassert authored
We initialize the routing metrics with the values cached on the inetpeer in rt_init_metrics(). So if we have the metrics cached on the inetpeer, we ignore the user configured fib_metrics. To fix this issue, we replace the old tree with a fresh initialized inet_peer_base. The old tree is removed later with a delayed work queue. Signed-off-by:
Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Paulius Zaleckas authored
Now we have: eth0: link *down* br0: port 1(eth0) entered *forwarding* state br_log_state(p) should be called *after* p->state is set to BR_STATE_DISABLED. Reported-by:
Zilvinas Valinskas <zilvinas@wilibox.com> Signed-off-by:
Paulius Zaleckas <paulius.zaleckas@gmail.com> Acked-by:
Stephen Hemminger <shemminger@vyatta.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Paulius Zaleckas authored
When br_log_state() is reporting state it should say "entered" istead of "entering" since state at this point is already changed. Signed-off-by:
Paulius Zaleckas <paulius.zaleckas@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 07 Mar, 2012 1 commit
-
-
Jesse Gross authored
When modifying IP addresses or ports on a UDP packet we don't correctly follow the rules for unchecksummed packets. This meant that packets without a checksum can be given a incorrect new checksum and packets with a checksum can become marked as being unchecksummed. This fixes it to handle those requirements. Signed-off-by:
Jesse Gross <jesse@nicira.com>
-
- 06 Mar, 2012 8 commits
-
-
Ben Pfaff authored
When OVS_VPORT_ATTR_NAME is specified and dp_ifindex is nonzero, the logical behavior would be for the vport name lookup scope to be limited to the specified datapath, but in fact the dp_ifindex value was ignored. This commit causes the search scope to be honored. Signed-off-by:
Ben Pfaff <blp@nicira.com> Signed-off-by:
Jesse Gross <jesse@nicira.com>
-
Li Wei authored
When forwarding was set and a new net device is register, we need add this device to the all-router mcast group. Signed-off-by:
Li Wei <lw@cn.fujitsu.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Pablo Neira Ayuso authored
If reliable event delivery is enabled and ctnetlink fails to deliver the destroy event in early_drop, the conntrack subsystem cannot drop any the candidate flow that was planned to be evicted. Reported-by:
Kerin Millar <kerframil@gmail.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
When net.bridge.bridge-nf-filter-vlan-tagged is 0 (default), vlan packets arriving should not be sent to ip(6)tables by bridge netfilter. However, it turns out that we currently always send VLAN packets to netfilter, if .. a), CONFIG_VLAN_8021Q is enabled ; or b), CONFIG_VLAN_8021Q is not set but rx vlan offload is enabled on the bridge port. This is because bridge netfilter treats skb with skb->protocol == ETH_P_IP{V6} as "non-vlan packet". With rx vlan offload on or CONFIG_VLAN_8021Q=y, the vlan header has already been removed here, and we cannot rely on skb->protocol alone. Fix this by only using skb->protocol if the skb has no vlan tag, or if a vlan tag is present and filter-vlan-tagged bridge netfilter sysctl is enabled. We cannot remove the skb->protocol == htons(ETH_P_8021Q) test because the vlan tag is still around in the CONFIG_VLAN_8021Q=n && "ethtool -K $itf rxvlan off" case. reproducer: iptables -t raw -I PREROUTING -i br0 iptables -t raw -I PREROUTING -i br0.1 Then send packets to an ip address configured on br0.1 interface. Even with net.bridge.bridge-nf-filter-vlan-tagged=0, the 1st rule will match instead of the 2nd one. With this patch applied, the 2nd rule will match instead. In the non-local address case, netfilter won't be consulted after this patch unless the sysctl is switched on. Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Pablo Neira Ayuso authored
In adf7ff8, a invalid dereference was added in ebt_make_names. CC [M] net/bridge/netfilter/ebtables.o net/bridge/netfilter/ebtables.c: In function `ebt_make_names': net/bridge/netfilter/ebtables.c:1371:20: warning: `t' may be used uninitialized in this function [-Wuninitialized] Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Pablo Neira Ayuso authored
Since 7d367e06 , ctnetlink_new_conntrack is called without holding the nf_conntrack_lock spinlock. Thus, ctnetlink_parse_nat_setup does not require to release that spinlock anymore in the NAT module autoload case. Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Santosh Nayak authored
user-space ebtables expects 32 bytes-long names, but xt_match names use 29 bytes. We have to copy less 29 bytes and then, make sure we fill the remaining bytes with zeroes. Signed-off-by:
Santosh Nayak <santoshprasadnayak@gmail.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Neal Cardwell authored
This commit fixes tcp_shift_skb_data() so that it does not shift SACKed data below snd_una. This fixes an issue whose symptoms exactly match reports showing tp->sacked_out going negative since 3.3.0-rc4 (see "WARNING: at net/ipv4/tcp_input.c:3418" thread on netdev). Since 2008 (832d11c5) tcp_shift_skb_data() had been shifting SACKed ranges that were below snd_una. It checked that the *end* of the skb it was about to shift from was above snd_una, but did not check that the end of the actual shifted range was above snd_una; this commit adds that check. Shifting SACKed ranges below snd_una is problematic because for such ranges tcp_sacktag_one() short-circuits: it does not declare anything as SACKed and does not increase sacked_out. Before the fixes in commits cc9a672e and daef52ba , shifting SACKed ranges below snd_una happened to work because tcp_shifted_skb() was always (incorrectly) passing in to tcp_sacktag_one() an skb whose end_seq tcp_shift_skb_data() had already guaranteed was beyond snd_una. Hence tcp_sacktag_one() never short-circuited and always increased tp->sacked_out in this case. After those two fixes, my testing has verified that shifting SACKed ranges below snd_una could cause tp->sacked_out to go negative with the following sequence of events: (1) tcp_shift_skb_data() sees an skb whose end_seq is beyond snd_una, then shifts a prefix of that skb that is below snd_una (2) tcp_shifted_skb() increments the packet count of the already-SACKed prev sk_buff (3) tcp_sacktag_one() sees the end of the new SACKed range is below snd_una, so it short-circuits and doesn't increase tp->sacked_out (5) tcp_clean_rtx_queue() sees the SACKed skb has been ACKed, decrements tp->sacked_out by this "inflated" pcount that was missing a matching increase in tp->sacked_out, and hence tp->sacked_out underflows to a u32 like 0xFFFFFFFF, which casted to s32 is negative. (6) this leads to the warnings seen in the recent "WARNING: at net/ipv4/tcp_input.c:3418" thread on the netdev list; e.g.: tcp_input.c:3418 WARN_ON((int)tp->sacked_out < 0); More generally, I think this bug can be tickled in some cases where two or more ACKs from the receiver are lost and then a DSACK arrives that is immediately above an existing SACKed skb in the write queue. This fix changes tcp_shift_skb_data() to abort this sequence at step (1) in the scenario above by noticing that the bytes are below snd_una and not shifting them. Signed-off-by:
Neal Cardwell <ncardwell@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 05 Mar, 2012 3 commits
-
-
Ulrich Weber authored
otherwise source IPv6 address of ICMPV6_MGM_QUERY packet might be random junk if IPv6 is disabled on interface or link-local address is not yet ready (DAD). Signed-off-by:
Ulrich Weber <ulrich.weber@sophos.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
nlmsg_parse() might return an error, so test its return value before potential random memory accesses. Errors introduced in commit 115c9b81 (rtnetlink: Fix problem with buffer allocation) Signed-off-by:
Eric Dumazet <eric.dumazet@gmail.com> Cc: Greg Rose <gregory.v.rose@intel.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Joakim Tjernlund authored
commit bridge: send proper message_age in config BPDU added this gem: bpdu.message_age = (jiffies - root->designated_age) p->designated_age = jiffies + bpdu->message_age; Notice how bpdu->message_age is negated when reassigned to bpdu.message_age. This causes message age to decrease breaking the STP protocol. Signed-off-by:
Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by:
David S. Miller <davem@davemloft.net>
-