- 22 Mar, 2012 19 commits
-
-
Alex Elder authored
If a message queued for send gets revoked, zeroes are sent over the wire instead of any unsent data. This is done by constructing a message and passing it to kernel_sendmsg() via ceph_tcp_sendmsg(). Since we are already working with a page in this case we can use the sendpage interface instead. Create a new ceph_tcp_sendpage() helper that sets up flags to match the way ceph_tcp_sendmsg() does now. Signed-off-by:
Alex Elder <elder@dreamhost.com> Reviewed-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
CRCs are computed for all messages between ceph entities. The CRC computation for the data portion of a message can optionally be disabled using the "nocrc" (common) ceph option. The default is for CRC computation for the data portion to be enabled. Unfortunately, the code that implements this feature interprets the feature flag incorrectly, meaning that by default the CRCs have *not* been computed (or checked) for the data portion of messages unless the "nocrc" option was supplied. Fix this in write_partial_msg_pages() and read_partial_message(). Also rename the flag variable in write_partial_msg_pages() to "no_datacrc" to match the usage elsewhere in the file. This fixes http://tracker.newdream.net/issues/2064 Signed-off-by:
Alex Elder <elder@dreamhost.com> Reviewed-by:
Sage Weil <sage@newdream.net>
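The inverted flag test described in this entry can be illustrated with a small user-space sketch. The struct and function names here (`demo_opts`, `demo_do_datacrc_buggy`, `demo_do_datacrc_fixed`) are hypothetical and are not the kernel's; only the "nocrc" option itself comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the messenger options; only "nocrc" matters here. */
struct demo_opts {
	bool nocrc;	/* true when the "nocrc" option was supplied */
};

/* Buggy reading of the flag: data CRCs are computed only if "nocrc" is set. */
static bool demo_do_datacrc_buggy(const struct demo_opts *opts)
{
	assert(opts != NULL);
	return opts->nocrc;
}

/* Intended reading: data CRCs are computed unless "nocrc" is set. */
static bool demo_do_datacrc_fixed(const struct demo_opts *opts)
{
	assert(opts != NULL);
	return !opts->nocrc;
}
```

With default options (no "nocrc"), the buggy variant skips the data CRC while the fixed variant computes it, which matches the behavior the commit message describes.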
-
Alex Elder authored
Nothing too big here.
  - define the size of the buffer used for consuming ignored incoming data using a symbolic constant
  - simplify the condition determining whether to unmap the page in write_partial_msg_pages(): do it for crc but not if the page is the zero page
Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Make a small change in the code that counts down kvecs consumed by a ceph_tcp_sendmsg() call. Same functionality, just blocked out a little differently. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Move blocks of code out of loops in read_partial_message_section() and read_partial_message(). They were only getting called at the end of the last iteration of the loop anyway. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Calculate CRC in a separate step from rearranging the byte order of the result, to improve clarity and readability. Use offsetof() to determine the number of bytes to include in the CRC calculation. In read_partial_message(), switch which value gets byte-swapped, since the just-computed CRC is already likely to be in a register. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
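The two-step pattern described in this entry (compute the CRC, then fix up the byte order) might look roughly like the following user-space sketch. The header layout, the `demo_` names, the zero seed, and the bitwise CRC-32C routine are all assumptions made for illustration; the kernel uses its own header structures and CRC helpers. Swapping the freshly computed value on the receive side is one reading of the "switch which value gets byte-swapped" remark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout; the final field carries the CRC on the wire. */
struct demo_hdr {
	uint32_t seq;
	uint16_t type;
	uint16_t flags;
	uint32_t crc;	/* stored little-endian */
};

static_assert(offsetof(struct demo_hdr, crc) == 8, "no padding before crc");

/* Bitwise CRC-32C (Castagnoli); seeded by the caller, no final inversion. */
static uint32_t demo_crc32c(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Convert a host-order value to its little-endian representation. */
static uint32_t demo_cpu_to_le32(uint32_t v)
{
	unsigned char b[4] = {
		(unsigned char)v, (unsigned char)(v >> 8),
		(unsigned char)(v >> 16), (unsigned char)(v >> 24),
	};
	uint32_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/* Sender: step 1 computes the CRC, step 2 converts it for the wire. */
static void demo_hdr_set_crc(struct demo_hdr *hdr)
{
	uint32_t crc;

	crc = demo_crc32c(0, hdr, offsetof(struct demo_hdr, crc));
	hdr->crc = demo_cpu_to_le32(crc);
}

/* Receiver: swap the just-computed value and compare with the stored one. */
static int demo_hdr_crc_ok(const struct demo_hdr *hdr)
{
	uint32_t crc = demo_crc32c(0, hdr, offsetof(struct demo_hdr, crc));

	return demo_cpu_to_le32(crc) == hdr->crc;
}
```

Using offsetof() keeps the covered length tied to the position of the CRC field, so the length does not need to be maintained by hand if fields are added before it.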
-
Alex Elder authored
Change the name (and type) of a few CRC-related Boolean local variables so they contain the word "do", to distinguish their purpose from variables used for holding an actual CRC value. Note that in the process of doing this I identified a fairly serious logic error in write_partial_msg_pages(): the value of "do_crc" assigned appears to be the opposite of what it should be. No attempt to fix this is made here; this change preserves the erroneous behavior. The problem I found is documented here: http://tracker.newdream.net/issues/2064 Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
This gathers a number of very minor changes:
  - use %hu when formatting a socket address's address family
  - null out the ceph_msgr_wq pointer after the queue has been destroyed
  - drop a needless cast in ceph_write_space()
  - add a WARN() call in ceph_state_change() in the event an unrecognized socket state is encountered
  - rearrange the logic in ceph_con_get() and ceph_con_put() so that:
      - the reference counts are only atomically read once
      - the values displayed via dout() calls are known to be meaningful at the time they are formatted
Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
There is no real need for ceph_tcp_connect() to return the socket pointer it creates, since it already assigns it to con->sock, which is visible to the caller. Instead, have it return an error code, which tidies things up a bit. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
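As a rough user-space analogue of the interface change in this entry, the sketch below has the connect helper return 0 or a negative errno value and leave the result in the connection structure. The `demo_con` structure and `demo_tcp_connect()` are hypothetical; the kernel code works with a socket pointer rather than a file descriptor. Recording the descriptor only on success is a choice made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical connection object; fd plays the role of con->sock. */
struct demo_con {
	int fd;
};

/* Returns 0 on success or a negative errno; con->fd is set only on success. */
static int demo_tcp_connect(struct demo_con *con,
			    const struct sockaddr *addr, socklen_t len)
{
	int fd;

	assert(con != NULL && addr != NULL);

	fd = socket(addr->sa_family, SOCK_STREAM, 0);
	if (fd < 0)
		return -errno;

	if (connect(fd, addr, len) < 0) {
		int err = -errno;	/* save before close() can change errno */

		close(fd);
		return err;
	}

	con->fd = fd;
	return 0;
}
```

A caller then only needs to test the return value, and can read `con->fd` afterwards if the call succeeded.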
-
Alex Elder authored
Define a helper function to perform various cleanup operations. Use it both in the exit routine and in the init routine in the event of an error. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
The messenger workqueue has no need to be public. So give it static scope. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Encapsulate the operation of adding a new chunk of data to the next open slot in a ceph_connection's out_kvec array. Also add a "reset" operation to make subsequent add operations start at the beginning of the array again. Use these routines throughout, avoiding duplicate code and ensuring all calls are handled consistently. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
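The "add" and "reset" operations described in this entry could be sketched as follows. The structure, field names, and array size are assumptions for illustration (a user-space `demo_kvec` stands in for the kernel's kvec type), not the actual ceph_connection layout.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_KVEC 8		/* hypothetical array size */

/* User-space stand-in for a kernel kvec entry. */
struct demo_kvec {
	void *iov_base;
	size_t iov_len;
};

/* Hypothetical subset of a connection's outgoing state. */
struct demo_con {
	struct demo_kvec out_kvec[DEMO_MAX_KVEC];
	int out_kvec_left;	/* number of entries queued so far */
	size_t out_kvec_bytes;	/* total bytes queued so far */
};

/* Make subsequent adds start at the beginning of the array again. */
static void demo_kvec_reset(struct demo_con *con)
{
	con->out_kvec_left = 0;
	con->out_kvec_bytes = 0;
}

/* Append one chunk of data to the next open slot. */
static void demo_kvec_add(struct demo_con *con, size_t size, void *data)
{
	int index = con->out_kvec_left;

	assert(index < DEMO_MAX_KVEC);	/* no room left would be a bug */

	con->out_kvec[index].iov_base = data;
	con->out_kvec[index].iov_len = size;
	con->out_kvec_left++;
	con->out_kvec_bytes += size;
}
```

Keeping the index and byte-count bookkeeping in one place means callers cannot update one without the other.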
-
Alex Elder authored
One of the arguments to prepare_write_connect() indicates whether it is being called immediately after a call to prepare_write_banner(). Move the prepare_write_banner() call inside prepare_write_connect(), and reinterpret (and rename) the "after_banner" argument so it indicates that prepare_write_connect() should *make* the call rather than should know it has already been made. This was split out from the next patch to highlight this change in logic. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
This fixes some spots where a type cast to (void *) was used as a universal type hiding mechanism. Instead, properly cast the type to the intended target type. Signed-off-by:
Alex Elder <elder@newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
This eliminates type casts in some places where they are not required. Signed-off-by:
Alex Elder <elder@newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
A spinlock is used to protect a value used for selecting an array index for a string used for formatting a socket address for human consumption. The index is reset to 0 if it ever reaches the maximum index value. Instead, use an ever-increasing atomic variable as a sequence number, and compute the array index by masking off all but the sequence number's lowest bits. Make the number of entries in the array a power of two to allow the use of such a mask (to avoid jumps in the index value when the sequence number wraps). The length of these strings is somewhat arbitrarily set at 60 bytes. The worst-case length of a string produced is 54 bytes, for an IPv6 address that can't be shortened, e.g.: [1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767 Change it so we arbitrarily use 64 bytes instead; if nothing else it will make the array of these line up better in hex dumps. Rename a few things to reinforce the distinction between the number of strings in the array and the length of individual strings. Signed-off-by:
Alex Elder <elder@newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
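A minimal user-space sketch of the lock-free index selection described in this entry follows. It uses C11 atomics in place of the kernel's atomic type, and the array size of 32 entries and all `DEMO_`/`demo_` names are assumptions; only the 64-byte string length comes from the commit message.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_ADDR_STR_COUNT_LOG	5	/* assumed: 2^5 = 32 strings */
#define DEMO_ADDR_STR_COUNT	(1 << DEMO_ADDR_STR_COUNT_LOG)
#define DEMO_ADDR_STR_COUNT_MASK	(DEMO_ADDR_STR_COUNT - 1)
#define DEMO_ADDR_STR_LEN	64	/* length of each string buffer */

/* The mask trick only works when the count is a power of two. */
static_assert((DEMO_ADDR_STR_COUNT & DEMO_ADDR_STR_COUNT_MASK) == 0,
	      "string count must be a power of two");

static char demo_addr_str[DEMO_ADDR_STR_COUNT][DEMO_ADDR_STR_LEN];
static atomic_uint demo_addr_str_seq;

/* Pick the next buffer: bump the sequence number, keep only its low bits. */
static char *demo_next_addr_str(void)
{
	unsigned int seq = atomic_fetch_add(&demo_addr_str_seq, 1);

	return demo_addr_str[seq & DEMO_ADDR_STR_COUNT_MASK];
}
```

Because the count divides the range of the counter evenly, the index sequence stays continuous when the unsigned sequence number wraps around.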
-
Alex Elder authored
Rearrange ceph_tcp_connect() a bit, making use of "else" rather than re-testing a value with consecutive "if" statements. Don't record a connection's socket pointer unless the connect operation is successful. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Alex Elder authored
Each messenger allocates a page to be used when writing zeroes out in the event of error or other abnormal condition. Instead, use the kernel ZERO_PAGE() for that purpose. Signed-off-by:
Alex Elder <elder@dreamhost.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Jim Schutt authored
The Ceph messenger would sometimes queue multiple work items to write data to a socket when the socket buffer was full. Fix this problem by making ceph_write_space() use SOCK_NOSPACE in the same way that net/core/stream.c:sk_stream_write_space() does, i.e., clearing it only when sufficient space is available in the socket buffer. Signed-off-by:
Jim Schutt <jaschut@sandia.gov> Reviewed-by:
Alex Elder <elder@dreamhost.com>
-
- 31 Oct, 2011 1 commit
-
-
Paul Gortmaker authored
These files are non modular, but need to export symbols using the macros now living in export.h -- call out the include so that things won't break when we remove the implicit presence of module.h from everywhere. Signed-off-by:
Paul Gortmaker <paul.gortmaker@windriver.com>
-
- 25 Oct, 2011 3 commits
-
-
Noah Watkins authored
Change ceph_parse_ips to take either names given as IP addresses or standard hostnames (e.g. localhost). The DNS lookup is done using the dns_resolver facility similar to its use in AFS, NFS, and CIFS. This patch defines CONFIG_CEPH_LIB_USE_DNS_RESOLVER that controls if this feature is on or off. Signed-off-by:
Noah Watkins <noahwatkins@gmail.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
Any non-masked msg allocation failure should generate a warning and stack trace to the console. All of these need to eventually be replaced by safe preallocation or msgpools. Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
The pool allocation failures are masked by the pool; there is no need to spam the console about them. (That's the whole point of having the pool in the first place.) Mark msg allocations whose failure is safely handled as such. Signed-off-by:
Sage Weil <sage@newdream.net>
-
- 16 Sep, 2011 1 commit
-
-
Jim Schutt authored
Commit 4cf9d544 recorded when an outgoing ceph message was ACKed, in order to avoid unnecessary connection resets when an OSD is busy. However, ack_stamp is uninitialized, so there is a window between when the message is sent and when it is ACKed in which handle_timeout() interprets the uninitialized value as an expired timeout, and resets the connection unnecessarily. Close the window by initializing ack_stamp. Signed-off-by:
Jim Schutt <jaschut@sandia.gov> Signed-off-by:
Sage Weil <sage@newdream.net>
-
- 26 Jul, 2011 1 commit
-
-
Sage Weil authored
Keep track of when an outgoing message is ACKed (i.e., the server fully received it and, presumably, queued it for processing). Time out OSD requests only if it's been too long since they've been received. This prevents timeouts and connection thrashing when the OSDs are simply busy and are throttling the requests they read off the network. Reviewed-by:
Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
- 19 May, 2011 5 commits
-
-
Sage Weil authored
Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
If we get a WAIT as a client something went wrong; error out. And don't fall through to an unrelated case. Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
If there is no get_authorizer method we set the out_kvec to a bogus pointer. The length is also zero in that case, so it doesn't much matter, but it's better not to add the empty item in the first place. Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
If a connection is closed and/or reopened (ceph_con_close, ceph_con_open) it can race with a callback. con_work does various state checks for closed or reopened sockets at the beginning, but drops con->mutex before making callbacks. We need to check for state bit changes after retaking the lock to ensure we restart con_work and execute those CLOSED/OPENING tests or else we may end up operating under stale assumptions. In Jim's case, this was causing 'bad tag' errors. There are four cases where we re-take the con->mutex inside con_work: catch them all and return EAGAIN from try_{read,write} so that we can restart con_work. Reported-by:
Jim Schutt <jaschut@sandia.gov> Tested-by:
Jim Schutt <jaschut@sandia.gov> Signed-off-by:
Sage Weil <sage@newdream.net>
-
- 03 May, 2011 1 commit
-
-
Henry C Chang authored
If memory allocation fails, calling ceph_msg_put() will cause a GPF, since some of the ceph_msg variables are not initialized first. Fix Bug #970. Signed-off-by:
Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by:
Sage Weil <sage@newdream.net>
-
- 04 Mar, 2011 3 commits
-
-
Sage Weil authored
The standby logic used to be pretty dependent on the work requeueing behavior that changed when we switched to WQ_NON_REENTRANT. It was also very fragile. Restructure things so that:
  - We clear WRITE_PENDING when we set STANDBY. This ensures we will requeue work when we wake up later.
  - con_work backs off if STANDBY is set. There is nothing to do if we are in standby.
  - clear_standby() helper is called by both con_send() and con_keepalive(), the two actions that can wake us up again. Move the connect_seq++ logic here.
Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
There was some broken keepalive code using a dead variable. Shift to using the proper bit flag. Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
With commit f363e45f we replaced a bunch of hacky workqueue mutual exclusion logic with the WQ_NON_REENTRANT flag. One piece of fallout is that the exponential backoff breaks in certain cases:
  * con_work attempts to connect.
  * we get an immediate failure, and the socket state change handler queues immediate work.
  * con_work calls con_fault, we decide to back off, but can't queue delayed work.
In this case, we add a BACKOFF bit to make con_work reschedule delayed work next time it runs (which should be immediately). Signed-off-by:
Sage Weil <sage@newdream.net>
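The BACKOFF bit described in this entry amounts to a "set now, test-and-clear later" flag. The sketch below shows that pattern with C11 atomics; the `demo_con` structure, the bit value, and the helper names are hypothetical, and the actual requeueing of delayed work is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_BACKOFF	(1u << 0)	/* hypothetical state bit */

/* Hypothetical subset of the connection state. */
struct demo_con {
	atomic_uint state;
};

/* Fault path: delayed work could not be queued, so leave a note instead. */
static void demo_set_backoff(struct demo_con *con)
{
	assert(con != NULL);
	atomic_fetch_or(&con->state, DEMO_BACKOFF);
}

/*
 * Top of the worker: returns true exactly once per demo_set_backoff(),
 * telling the caller to requeue itself with a delay and return.
 */
static bool demo_test_and_clear_backoff(struct demo_con *con)
{
	unsigned int old;

	assert(con != NULL);
	old = atomic_fetch_and(&con->state, ~DEMO_BACKOFF);
	return (old & DEMO_BACKOFF) != 0;
}
```

Clearing the bit in the same atomic step that tests it keeps two concurrent observers from both acting on the same backoff request.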
-
- 03 Mar, 2011 1 commit
-
-
Sage Weil authored
If we mark the connection CLOSED we will give up trying to reconnect to this server instance. That is appropriate for things like a protocol version mismatch that won't change until the server is restarted, at which point we'll get a new addr and reconnect. An authorization failure like this is probably due to the server not properly rotating its secret keys, however, and should be treated as transient so that the normal backoff and retry behavior kicks in. Signed-off-by:
Sage Weil <sage@newdream.net>
-
- 25 Jan, 2011 2 commits
-
-
Sage Weil authored
Pass errors from writing to the socket up the stack. If we get -EAGAIN, return 0 from the helper to simplify the callers' checks. Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
If we get EAGAIN when trying to read from the socket, it is not an error. Return 0 from the helper in this case to simplify the error handling cases in the caller (indirectly, try_read). Fix try_read to pass any error to its caller (con_work) instead of almost always returning 0. This lets us respond to things like socket disconnects. Signed-off-by:
Sage Weil <sage@newdream.net>
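The convention used by the two entries above (EAGAIN becomes 0, other errors are passed up as negative values) can be sketched in user space as follows. `demo_tcp_recv()` is a hypothetical helper built on recv(); the kernel helpers wrap the in-kernel socket calls instead. One caveat specific to this user-space version: recv() also returns 0 at end of stream, so a real caller would need a separate way to tell the two cases apart.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Returns the number of bytes read, 0 if no data is available right now,
 * or a negative errno value for a real error.
 */
static int demo_tcp_recv(int fd, void *buf, size_t len)
{
	ssize_t r;

	assert(buf != NULL);

	r = recv(fd, buf, len, MSG_DONTWAIT);
	if (r < 0) {
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return 0;	/* not an error: just try again later */
		return -errno;		/* pass real errors up to the caller */
	}
	return (int)r;
}
```

With this shape the caller's loop only has to distinguish "made progress", "nothing to do yet", and "failed".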
-
- 12 Jan, 2011 1 commit
-
-
Tejun Heo authored
The ceph messenger code does a rather complex dance around the multithreaded workqueue to make sure the same work item isn't executed concurrently on different CPUs. This restriction can be provided by the workqueue itself with WQ_NON_REENTRANT. Make ceph_msgr_wq a non-reentrant workqueue with the default concurrency level and remove the QUEUED/BUSY logic.
  * This removes backoff handling in con_work(), but it couldn't reliably block execution of con_work() to begin with: queue_con() can be called after the work started but before BUSY is set. It seems that it was an optimization for a rather cold path and can be safely removed.
  * The number of concurrent work items is bounded by the number of connections, and connections are independent of each other. With the default concurrency level, different connections will be executed independently.
Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Sage Weil <sage@newdream.net> Cc: ceph-devel@vger.kernel.org Signed-off-by:
Sage Weil <sage@newdream.net>
-
- 14 Dec, 2010 1 commit
-
-
Sage Weil authored
create_workqueue() returns NULL on failure. Signed-off-by:
Sage Weil <sage@newdream.net>
-
- 09 Nov, 2010 1 commit
-
-
Sage Weil authored
The alignment used for reading data into or out of pages used to be taken from the data_off field in the message header. This only worked as long as the page alignment matched the object offset, breaking direct io to non-page aligned offsets. Instead, explicitly specify the page alignment next to the page vector in the ceph_msg struct, and use that instead of the message header (which probably shouldn't be trusted). The alloc_msg callback is responsible for filling in this field properly when it sets up the page vector. Signed-off-by:
Sage Weil <sage@newdream.net>
-