- 14 Jan, 2014 3 commits
-
-
Ilya Dryomov authored
The check that makes sure that we have enough memory allocated to read in the entire header of the message in question is currently busted. It compares front_len of the incoming message with the iov_len field of the ceph_msg::front structure, which is used primarily to indicate the amount of data already read in, and not the size of the allocated buffer. Under certain conditions (e.g. a short read from a socket followed by that socket's shutdown and owning ceph_connection reset) this results in a warning similar to

[85688.975866] libceph: get_reply front 198 > preallocated 122 (4#0)

and, through another bug, leads to forever hung tasks and forced reboots. Fix this by comparing front_len with the front_alloc_len field of struct ceph_msg, which stores the actual size of the buffer.

Fixes: http://tracker.ceph.com/issues/5425
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
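A minimal sketch of the corrected check (front_len and front_alloc_len come from the commit text; the surrounding get_reply() context and fallback path are illustrative):

    /* sketch: sanity check in get_reply() */
    if (front_len > req->r_reply->front_alloc_len) {
            /* the old code compared against front.iov_len, which only
             * tracks how much has been read in so far, not capacity */
            pr_warn("get_reply front %d > preallocated %d\n",
                    front_len, req->r_reply->front_alloc_len);
            /* ... fall back to allocating a big-enough message ... */
    }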
-
Ilya Dryomov authored
Rename front local variable to front_len in get_reply() to make its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Rename front_max field of struct ceph_msg to front_alloc_len to make its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
- 31 Dec, 2013 22 commits
-
-
Ilya Dryomov authored
Similar to userspace, don't bail with "parse_ips bad ip ..." if the specified port is port 0; instead, use CEPH_MON_PORT (6789, the default monitor port).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
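A sketch of the parsing fallback (CEPH_MON_PORT is the real default-port constant; the surrounding loop in ceph_parse_ips() is elided):

    /* sketch: accept port 0 as "use the default monitor port" */
    if (port == 0)
            port = CEPH_MON_PORT;   /* 6789 */
    else if (port > 65535)
            goto bad;               /* genuinely bad ports still rejected */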
-
Ilya Dryomov authored
Reflects ceph.git commit 8b38f10bc2ee3643a33ea5f9545ad5c00e4ac5b4.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Reflects ceph.git commit ea3a0bb8b773360d73b8b77fa32115ef091c9857.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
This allows all of the tunables to be overridden by a specific rule.

Reflects ceph.git commits d129e09e57fbc61cfd4f492e3ee77d0750c9d292, 0497db49e5973b50df26251ed0e3f4ac7578e66e.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
The legacy behavior is to make the normal number of tries for the recursive chooseleaf call. The descend_once tunable changed this to making a single try and bail if we get a reject (note that it is impossible to collide in the recursive case).

The new set_chooseleaf_tries lets you select the number of recursive chooseleaf attempts for indep mode, or default to 1. Use the same behavior for firstn, except default to total_tries when the legacy tunables are set (for compatibility). This makes the rule step override the (new) default of 1 recursive attempt, keeping behavior consistent with indep mode.

Reflects ceph.git commit 685c6950ef3df325ef04ce7c986e36ca2514c5f1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
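A sketch of how such a rule step might be consumed in crush_do_rule() — it also covers the companion set_choose_tries step from a later commit below. The variable wiring and fallback logic are assumptions based on the text above:

    /* sketch: rule-step handling in crush_do_rule() (illustrative) */
    int choose_tries = map->choose_total_tries;   /* global tunable */
    int choose_leaf_tries = 0;                    /* 0 = not set by rule */
    int recurse_tries;

    switch (curstep->op) {
    case CRUSH_RULE_SET_CHOOSE_TRIES:       /* "step set_choose_tries N" */
            if (curstep->arg1 > 0)
                    choose_tries = curstep->arg1;
            break;
    case CRUSH_RULE_SET_CHOOSELEAF_TRIES:   /* "step set_chooseleaf_tries N" */
            if (curstep->arg1 > 0)
                    choose_leaf_tries = curstep->arg1;
            break;
    }

    /* firstn falls back to total_tries for legacy compatibility;
     * indep falls back to a single recursive attempt */
    recurse_tries = choose_leaf_tries ? choose_leaf_tries
                                      : (firstn ? choose_tries : 1);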
-
Ilya Dryomov authored
This aligns the internal identifier names with the user-visible names in the decompiled crush map language.

Reflects ceph.git commit caa0e22e15e4226c3671318ba1f61314bf6da2a6.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Since we can specify the recursive retries in a rule, we may as well also specify the non-recursive tries too for completeness.

Reflects ceph.git commit d1b97462cffccc871914859eaee562f2786abfd1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Parameterize the attempts for the _firstn choose method, and apply the rule-specified tries count to firstn mode as well.

Note that we have slightly different behavior here than with indep: if the firstn value is not specified for firstn, we pass through the normal attempt count. This maintains compatibility with legacy behavior. Note that this is usually *not* actually N^2 work, though, because of the descend_once tunable. However, descend_once is unfortunately *not* the same thing as 1 chooseleaf try because it is only checked on a reject but not on a collision. Sigh.

In contrast, for indep, if tries is not specified we default to 1 recursive attempt, because that is simply more sane, and we have the option to do so. The descend_once tunable has no effect for indep.

Reflects ceph.git commit 64aeded50d80942d66a5ec7b604ff2fcbf5d7b63.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Explicitly control the number of sample attempts, and allow the number of tries in the recursive call to be explicitly controlled via the rule. This is important because the amount of time we want to spend looking for a solution may be rule dependent (e.g., higher for the wide indep pool than the rep pools).

(We should do the same for the other tunables, by the way!)

Reflects ceph.git commit c43c893be872f709c787bc57f46c0e97876ff681.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Pass down the parent's 'r' value so that we will sample different values in the recursive call when the parent tries multiple times. This avoids doing useless work (calling multiple times and trying the same values).

Reflects ceph.git commit 2731d3030d7a3e80922b7f1b7756f9a4a124bac5.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
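The core of the change, sketched (parent_r and the loop shape follow the commit text; the surrounding code is elided):

    /* sketch: crush_choose_indep() offsets its sample by the caller's r */
    for (rep = outpos; rep < numrep; rep++) {
            r = rep + parent_r;   /* parent retry => child samples a new value */
            /* ... pick an item using r, recurse passing parent_r = r ... */
    }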
-
Ilya Dryomov authored
Pass numrep (the width of the result) separately from the number of results we want *this* iteration. This makes things less awkward when we do a recursive call (for chooseleaf) and want only one item.

Reflects ceph.git commit 1b567ee08972f268c11b43fc881e57b5984dd08b.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Now that indep is handled by crush_choose_indep, rename crush_choose to crush_choose_firstn and remove all the conditionals. This ends up stripping out *lots* of code.

Note that it *also* makes it obvious that the shenanigans we were playing with r' for uniform buckets were broken for firstn mode. This appears to have happened waaaay back in commit dae8bec9 (or earlier)... 2007.

Reflects ceph.git commit 94350996cb2035850bcbece6a77a9b0394177ec9.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Reflects ceph.git commit 4551fee9ad89d0427ed865d766d0d44004d3e3e1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Reflects ceph.git commit 86e978036a4ecbac4c875e7c00f6c5bbe37282d3.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
For firstn mode, if we fail to make a valid placement choice, we just continue and return a short result to the caller. For indep mode, however, we need to make the position stable, and return an undefined value on failed placements to avoid shifting later results to the left.

Reflects ceph.git commit b1d4dd4eb044875874a1d01c01c7d766db5d0a80.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
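A sketch of the stable-position fixup at the end of the indep pass (CRUSH_ITEM_UNDEF and CRUSH_ITEM_NONE are the crush.h markers; the loop bounds are assumptions):

    /* sketch: indep keeps output slots stable on placement failure */
    for (rep = outpos; rep < endpos; rep++) {
            if (out[rep] == CRUSH_ITEM_UNDEF)
                    out[rep] = CRUSH_ITEM_NONE;   /* report "none" rather
                                                   * than shifting results */
    }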
-
Ilya Dryomov authored
This is only present to size the temporary scratch arrays that we put on the stack. Let the caller allocate them as they wish and remove the limitation.

Reflects ceph.git commit 1cfe140bf2dab99517589a82a916f4c75b9492d1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
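The resulting prototype looks roughly like this (the weight_max parameter comes from the related commit further down; treat the exact parameter list as an approximation):

    /* sketch: the caller now supplies the scratch space */
    int crush_do_rule(const struct crush_map *map,
                      int ruleno, int x,
                      int *result, int result_max,
                      const __u32 *weight, int weight_max,
                      int *scratch);   /* at least 3 * result_max ints */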
-
Ilya Dryomov authored
Reflects ceph.git commit 3cef755428761f2481b1dd0e0fbd0464ac483fc5.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Reflects ceph.git commit e7d47827f0333c96ad43d257607fb92ed4176550.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Reflects ceph.git commit 43a01c9973c4b83f2eaa98be87429941a227ddde.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Pass the size of the weight vector into crush_do_rule() to ensure that we don't access values past the end. This can happen if the caller misbehaves and passes a weight vector that is smaller than max_devices. Currently the monitor tries to prevent that from happening, but this will gracefully tolerate previous bad osdmaps that got into this state. It's also a bit more defensive.

Reflects ceph.git commit 5922e2c2b8335b5e46c9504349c3a55b7434c01a.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
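A plausible home for the defensive check is the is_out() helper in mapper.c (sketch; everything past the bounds check is illustrative):

    /* sketch: items beyond the supplied weight vector are never chosen */
    static int is_out(const struct crush_map *map,
                      const __u32 *weight, int weight_max,
                      int item, int x)
    {
            if (item >= weight_max)
                    return 1;       /* no weight known: treat as out */
            if (weight[item] >= 0x10000)
                    return 0;       /* full weight: always in */
            /* ... otherwise hash (x, item) against the partial weight ... */
            return 1;
    }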
-
Ilya Dryomov authored
This updates ceph_features.h so that it has all feature bits defined in ceph.git. In the interim since the last update, ceph.git crossed the "32 feature bits" point, and the addition of the 33rd bit wasn't handled correctly. The work-around is squashed into this commit and reflects ceph.git commit 053659d05e0349053ef703b414f44965f368b9f0.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
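The 33rd-bit problem, sketched (CEPH_FEATURE_EXPORT_PEER is one of the bits from that update; treat the exact position as an assumption):

    /* (1 << 33) overflows a 32-bit int constant (undefined behavior),
     * so feature bits must be defined as 64-bit values: */
    #define CEPH_FEATURE_EXPORT_PEER   (1ULL << 33)

    /* ...and carried in 64-bit fields (see the prep patch below): */
    struct example_features {     /* hypothetical container */
            u64 supported;        /* was unsigned int/u32 */
    };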
-
Ilya Dryomov authored
In preparation for ceph_features.h update, change all features fields from unsigned int/u32 to u64. (ceph.git has ~40 feature bits at this point.)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
- 13 Dec, 2013 3 commits
-
-
Josh Durgin authored
With the current full handling, there is a race between osds and clients getting the first map marked full. If the osd wins, it will return -ENOSPC to any writes, but the client may already have writes in flight. This results in the client getting the error and propagating it up the stack. For rbd, the block layer turns this into EIO, which can cause corruption in filesystems above it.

To avoid this race, osds are being changed to drop writes that came from clients with an osdmap older than the last osdmap marked full. In order for this to work, clients must resend all writes after they encounter a full -> not full transition in the osdmap. osds will wait for an updated map instead of processing a request from a client with a newer map, so resent writes will not be dropped by the osd unless there is another not full -> full transition.

This approach requires both osds and clients to be fixed to avoid the race. Old clients talking to osds with this fix may hang instead of returning EIO and potentially corrupting an fs. New clients talking to old osds have the same behavior as before if they encounter this race.

Fixes: http://tracker.ceph.com/issues/6938
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
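A sketch of the client-side trigger (ceph_osdmap_flag() and CEPH_OSDMAP_FULL are real names; the kick_requests() wiring is assumed from the description):

    /* sketch: in the osdmap handler, resend writes on full -> not-full */
    bool was_full = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);

    /* ... decode and apply the new (incremental or full) map ... */

    if (was_full && !ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
            kick_requests(osdc, true);   /* force resend of writes */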
-
Josh Durgin authored
The PAUSEWR and PAUSERD flags are meant to stop the cluster from processing writes and reads, respectively. The FULL flag is set when the cluster determines that it is out of space, and will no longer process writes. PAUSEWR and PAUSERD are purely client-side settings already implemented in userspace clients. The osd does nothing special with these flags.

When the FULL flag is set, however, the osd responds to all writes with -ENOSPC. For cephfs, this makes sense, but for rbd the block layer translates this into EIO. If a cluster goes from full to non-full quickly, a filesystem on top of rbd will not behave well, since some writes succeed while others get EIO.

Fix this by blocking any writes when the FULL flag is set in the osd client. This is the same strategy used by userspace, so apply it by default. A follow-on patch makes this configurable.

__map_request() is called to re-target osd requests in case the available osds changed. Add a paused field to a ceph_osd_request, and set it whenever an appropriate osd map flag is set. Avoid queueing paused requests in __map_request(), but force them to be resent if they become unpaused.

Also subscribe to the next osd map from the monitor if any of these flags are set, so paused requests can be unblocked as soon as possible.

Fixes: http://tracker.ceph.com/issues/6079
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
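The pause predicate this implies, sketched (the osdmap flags and request flags are real names; the helper itself is illustrative):

    /* sketch: writes wait on FULL or PAUSEWR, reads wait on PAUSERD */
    static bool req_should_be_paused(struct ceph_osd_client *osdc,
                                     struct ceph_osd_request *req)
    {
            bool pauserd = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD);
            bool pausewr = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR) ||
                           ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);

            return ((req->r_flags & CEPH_OSD_FLAG_READ) && pauserd) ||
                   ((req->r_flags & CEPH_OSD_FLAG_WRITE) && pausewr);
    }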
-
Li Wang authored
Wake up possible waiters, invoke the callback if any, and unregister the request.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
-
- 19 Oct, 2013 1 commit
-
-
Joe Perches authored
There is a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern; extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
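The change is mechanical at each declaration site, e.g. (using a real libceph prototype for illustration):

    /* before */
    extern int ceph_parse_ips(const char *c, const char *end,
                              struct ceph_entity_addr *addr,
                              int max_count, int *count);
    /* after: extern is already implied for function declarations */
    int ceph_parse_ips(const char *c, const char *end,
                       struct ceph_entity_addr *addr,
                       int max_count, int *count);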
-
- 09 Sep, 2013 1 commit
-
-
Josh Durgin authored
Without a way to flush the osd client's notify workqueue, a watch event that is unregistered could continue receiving callbacks indefinitely.

Unregistering the event simply means no new notifies are added to the queue, but there may still be events in the queue that will call the watch callback for the event. If the queue is flushed after the event is unregistered, the caller can be sure no more watch callbacks will occur for the canceled watch.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
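The added helper is essentially a thin wrapper (sketch; assuming the osd client keeps its notify workqueue in a notify_wq field):

    /* sketch: let callers wait out any in-flight watch callbacks */
    void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc)
    {
            flush_workqueue(osdc->notify_wq);
    }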
-
- 04 Sep, 2013 1 commit
-
-
Sage Weil authored
Fix a typo that used the wrong bitmask for the pg.seed calculation. This is normally unnoticed because in most cases pg_num == pgp_num. It is, however, a bug that is easily corrected.

CC: stable@vger.kernel.org
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
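The class of fix, sketched with the ceph_stable_mod() helper (the surrounding function is elided):

    /* sketch: pg_num must pair with pg_num_mask, not pgp_num_mask */
    pg.seed = ceph_stable_mod(pg.seed, pool->pg_num,
                              pool->pg_num_mask);   /* was pgp_num_mask */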
-
- 27 Aug, 2013 3 commits
-
-
Dan Carpenter authored
create_singlethread_workqueue() returns NULL on error, and it doesn't return ERR_PTRs. I tweaked the error handling a little to be consistent with earlier in the function.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
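The corrected pattern (sketch; the workqueue name is illustrative):

    wq = create_singlethread_workqueue("ceph-watch-notify");
    if (!wq)                     /* NULL on failure, never an ERR_PTR */
            goto out_msgpool;    /* unwind like the earlier error paths */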
-
Dan Carpenter authored
There are two places where we read "nr_maps"; if both of them are set to zero, then we would hit a NULL dereference here.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Dan Carpenter authored
We've tried to fix the error paths in this function before, but there is still a hidden goto in the ceph_decode_need() macro which goes to the wrong place. We need to release the "req" and unlock a mutex before returning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
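The "hidden goto" lives in the decode helper itself; its shape (from include/linux/ceph/decode.h) is:

    /* bounds-check helper that jumps to a label in the *caller* */
    #define ceph_decode_need(p, end, n, bad)                  \
            do {                                              \
                    if (!likely(ceph_has_room(p, end, n)))    \
                            goto bad;                         \
            } while (0)

So the "bad:" label in this caller must undo everything taken so far, here releasing "req" and unlocking the mutex before returning.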
-
- 15 Aug, 2013 1 commit
-
-
Li Wang authored
This patch implements fallocate and punch hole support for the Ceph kernel client.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
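The entry point gates the supported modes roughly like this (sketch; FALLOC_FL_* are the standard VFS flags, the function body is illustrative):

    /* sketch: only keep-size and punch-hole are supported */
    static long ceph_fallocate(struct file *file, int mode,
                               loff_t offset, loff_t length)
    {
            if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
                    return -EOPNOTSUPP;
            /* ... zero or punch the affected objects via osd requests ... */
            return 0;
    }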
-
- 10 Aug, 2013 2 commits
-
-
Tejun Heo authored
dbf2576e ("workqueue: make all workqueues non-reentrant") made WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages. This patch doesn't introduce any behavior changes. Signed-off-by:
Tejun Heo <tj@kernel.org> Reviewed-by:
Sage Weil <sage@inktank.com> Cc: ceph-devel@vger.kernel.org
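Each call site loses one flag, e.g. (queue name illustrative):

    /* before */
    wq = alloc_workqueue("ceph-writeback", WQ_NON_REENTRANT, 1);
    /* after: workqueues are non-reentrant by default since dbf2576e */
    wq = alloc_workqueue("ceph-writeback", 0, 1);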
-
majianpeng authored
For a nofail == false request, if __map_request() fails, the caller does the cleanup work, like releasing the related pages. It doesn't make any sense to retry such a request.

CC: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
- 25 Jul, 2013 1 commit
-
-
Eric Dumazet authored
Several call sites use the following hardcoded condition:

    sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)

Let's use a helper, because TCP_NOTSENT_LOWAT support will change this condition for TCP sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
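The helper is a small inline wrapping exactly that condition (shown as described; sk_stream_is_writeable() is the name this commit series introduces):

    /* replaces the hardcoded condition at each call site */
    static inline bool sk_stream_is_writeable(const struct sock *sk)
    {
            return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk);
    }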
-
- 03 Jul, 2013 2 commits
-
-
Yan, Zheng authored
We can't use !req->r_sent to check if an OSD request is sent for the first time, because __cancel_request() zeros req->r_sent when the OSD map changes.

Rather than adding a new variable to struct ceph_osd_request to indicate if it's sent for the first time, we can call the unsafe callback only when an unsafe OSD reply is received. If the OSD's first reply is safe, just skip calling the unsafe callback.

The purpose of the unsafe callback is adding the unsafe request to a list, so that fsync(2) can wait for the safe reply. fsync(2) doesn't need to wait for a write(2) that hasn't returned yet, so it's OK to add the request to the unsafe list when the first OSD reply is received. (ceph_sync_write() returns after receiving the first OSD reply.)

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
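A sketch of the reply-handling shape this describes (CEPH_OSD_FLAG_ONDISK and r_unsafe_callback are real names; the control flow is simplified):

    /* sketch: in handle_reply() */
    if (!(flags & CEPH_OSD_FLAG_ONDISK)) {
            /* unsafe (in-memory) ack: fs can queue the request for fsync */
            if (req->r_unsafe_callback)
                    req->r_unsafe_callback(req, true);
    } else if (req->r_unsafe_callback) {
            /* safe (on-disk) commit: drop it from the unsafe list */
            req->r_unsafe_callback(req, false);
    }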
-
Tyler Hicks authored
A malicious monitor can craft an auth reply message that could cause a NULL function pointer dereference in the client's kernel. To prevent this, the auth_none protocol handler needs an empty ceph_auth_client_ops->build_request() function.

CVE-2013-1059

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Chanam Park <chanam.park@hkpco.kr>
Reviewed-by: Seth Arnold <seth.arnold@canonical.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Cc: stable@vger.kernel.org
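The shape of the fix in net/ceph/auth_none.c, sketched (the ops struct and hook name come from the commit text; the signature is an approximation):

    /* an empty build_request keeps the ops hook non-NULL */
    static int build_request(struct ceph_auth_client *ac,
                             void *buf, void *end)
    {
            return 0;   /* auth_none adds nothing to the request */
    }

    static const struct ceph_auth_client_ops ceph_auth_none_ops = {
            /* ... other handlers ... */
            .build_request = build_request,
    };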
-