1. 18 Feb, 2013 11 commits
  2. 14 Feb, 2013 4 commits
    • libceph: don't require r_num_pages for bio requests · 9cbb1d72
      Alex Elder authored
      There is a check in the completion path for osd requests that
      ensures the number of pages allocated is enough to hold the amount
      of incoming data expected.
      
      For bio requests coming from rbd the "number of pages" is not really
      meaningful (although total length would be).  So stop requiring that
      nr_pages be supplied for bio requests.  This is done by checking
      whether the pages pointer is null before checking the value of
      nr_pages.
      
      Note that this value is passed on to the messenger, but there it's
      only used for debugging--it's never used for validation.
      
      While here, change another spot that used r_pages in a debug message
      inappropriately, and also invalidate the r_con_filling_msg pointer
      after dropping a reference to it.
      
      This resolves:
          http://tracker.ceph.com/issues/3875
      
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
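As a rough illustration of the check described in this commit, the sketch below consults the page count only when a pages pointer is present. The names `osd_req_sketch` and `enough_pages` are hypothetical stand-ins for illustration, not the libceph definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an osd request. */
struct osd_req_sketch {
	void **pages;            /* NULL for bio-backed requests */
	unsigned int num_pages;  /* only meaningful when pages != NULL */
};

/*
 * Return true if the request can hold `want` pages of incoming data.
 * The page count is checked only when a page array was supplied.
 */
bool enough_pages(const struct osd_req_sketch *req, unsigned int want)
{
	assert(req != NULL);
	if (req->pages == NULL)
		return true;	/* bio request: page count is not meaningful */
	return req->num_pages >= want;
}
```

With this shape, a bio-backed request passes the check regardless of the value left in its page count.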
    • rbd: don't take extra bio reference for osd client · 1e32d34c
      Alex Elder authored
      Currently, if the OSD client finds an osd request has had a bio list
      attached to it, it drops a reference to it (or rather, to the first
      entry on that list) when the request is released.
      
      The code that added that reference (i.e., the rbd client) is
      therefore required to take an extra reference to that first bio
      structure.
      
      The osd client doesn't really do anything with the bio pointer other
      than transfer it from the osd request structure to outgoing (for
      writes) and ingoing (for reads) messages.  So it really isn't the
      right place to be taking or dropping references.
      
      Furthermore, the rbd client already holds references to all bio
      structures it passes to the osd client, and holds them until the
      request is completed.  So there's no need for this extra reference
      whatsoever.
      
      So remove the bio_put() call in ceph_osdc_release_request(), as
      well as its matching bio_get() call in rbd_osd_req_create().
      
      This change could lead to a crash if an old libceph.ko were used
      with a new rbd.ko.  Add a compatibility check at rbd initialization
      time to avoid this possibility.
      
      This resolves:
          http://tracker.ceph.com/issues/3798    and
          http://tracker.ceph.com/issues/3799
      
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • libceph: add a compatibility check interface · 72fe25e3
      Alex Elder authored
      An upcoming change implements a semantic change that could lead to
      a crash if an old version of the libceph kernel module is used with
      a new version of the rbd kernel module.
      
      In order to preclude that possibility, this adds a compatibility
      check interface.  If this interface doesn't exist, the modules are
      obviously not compatible.  But if it does exist, this provides a way
      of letting the caller know whether it will operate properly with
      this libceph module.
      
      Perhaps confusingly, it returns false right now.  The semantic
      change mentioned above will make it return true.
      
      This resolves:
          http://tracker.ceph.com/issues/3800
      
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • libceph: fix messenger CONFIG_BLOCK dependencies · 3ebc21f7
      Alex Elder authored
      The ceph messenger has a few spots that are only used when
      bio messages are supported, and that's only when CONFIG_BLOCK
      is defined.  This surrounds a couple of spots with #ifdef's
      that would cause a problem if CONFIG_BLOCK were not present
      in the kernel configuration.
      
      This resolves:
          http://tracker.ceph.com/issues/3976
      
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
  3. 25 Jan, 2013 1 commit
    • libceph: fix undefined behavior when using snprintf() · 1ec3911d
      Cong Ding authored
      
      The variable "str" is used as both the source and destination in
      function snprintf(), which is undefined behavior based on C11. The
      original description in C11 is:
      	"If copying takes place between objects that
      	overlap, the behavior is undefined."
      
      Also, ceph_osdmap_state_str() is meant to return the osdmap state,
      so it should return "doesn't exist" when none of the conditions is
      satisfied.  This patch fixes that as well.
      
      [elder@inktank.com: shortened the commit message]
      Signed-off-by: Cong Ding <dinggnu@gmail.com>
      Reviewed-by: Alex Elder <elder@inktank.com>
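The overlap problem described above can be avoided by never passing the destination buffer as one of its own snprintf() arguments. The sketch below is one way to do that, assuming two hypothetical state bits (`STATE_EXISTS`, `STATE_UP`) and a hypothetical function name `state_str`; it is not the libceph code itself.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical state bits, standing in for the real osd state flags. */
#define STATE_EXISTS 0x1
#define STATE_UP     0x2

/*
 * Format a state bitmask into buf.  Each branch writes a complete
 * string literal, so buf is never both source and destination.
 */
char *state_str(char *buf, size_t len, int state)
{
	assert(buf != NULL);
	if (len == 0)
		return buf;

	if ((state & STATE_EXISTS) && (state & STATE_UP))
		snprintf(buf, len, "exists, up");
	else if (state & STATE_EXISTS)
		snprintf(buf, len, "exists");
	else if (state & STATE_UP)
		snprintf(buf, len, "up");
	else
		snprintf(buf, len, "doesn't exist");

	return buf;
}
```

The final branch also covers the case where none of the conditions holds, so the caller always gets a meaningful string back.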
  4. 17 Jan, 2013 14 commits
    • libceph: pass num_op with ops · ae7ca4a3
      Alex Elder authored
      
      Both ceph_osdc_alloc_request() and ceph_osdc_build_request() are
      provided an array of ceph osd request operations.  Rather than just
      passing the number of operations in the array, the caller is
      required to append an additional zeroed operation structure to
      signal the end of the array.
      
      All callers know the number of operations at the time these
      functions are called, so drop the silly zero entry and supply that
      number directly.  As a result, get_num_ops() is no longer needed.
      This also means that ceph_osdc_alloc_request() never uses its ops
      argument, so that can be dropped.
      
      Also rbd_create_rw_ops() no longer needs to add one to reserve room
      for the additional op.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
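The difference between the two calling conventions in this commit can be sketched as follows. The struct `req_op` and both function names are hypothetical simplifications chosen for illustration; the real operation structure carries far more state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an osd request operation. */
struct req_op {
	int opcode;            /* 0 marks the end of a terminated array */
	unsigned long length;
};

/* Old convention: walk the array until the zeroed terminator. */
size_t count_ops_sentinel(const struct req_op *ops)
{
	size_t n = 0;

	assert(ops != NULL);
	while (ops[n].opcode != 0)
		n++;
	return n;
}

/* New convention: the caller states the count, no terminator needed. */
unsigned long total_length(const struct req_op *ops, size_t num_ops)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < num_ops; i++)
		total += ops[i].length;
	return total;
}
```

Passing the count directly removes both the extra array slot and the scan that a helper in the style of get_num_ops() would otherwise have to perform.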
    • libceph: don't set pages or bio in ceph_osdc_alloc_request() · 54a54007
      Alex Elder authored
      
      Only one of the two callers of ceph_osdc_alloc_request() provides
      page or bio data for its payload.  And essentially all that function
      was doing with those arguments was assigning them to fields in the
      osd request structure.
      
      Simplify ceph_osdc_alloc_request() by having the caller take care of
      making those assignments.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • libceph: don't set flags in ceph_osdc_alloc_request() · d178a9e7
      Alex Elder authored
      
      The only thing ceph_osdc_alloc_request() really does with the
      flags value it is passed is assign it to the newly-created
      osd request structure.  Do that in the caller instead.
      
      Both callers subsequently call ceph_osdc_build_request(), so have
      that function (instead of ceph_osdc_alloc_request()) issue a warning
      if a request comes through with neither the read nor write flags set.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • libceph: drop osdc from ceph_calc_raw_layout() · e75b45cf
      Alex Elder authored
      
      The osdc parameter to ceph_calc_raw_layout() is not used, so get rid
      of it.  Consequently, the corresponding parameter in calc_layout()
      becomes unused, so get rid of that as well.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • libceph: drop snapid in ceph_calc_raw_layout() · 4d6b250b
      Alex Elder authored
      
      A snapshot id must be provided to ceph_calc_raw_layout() even though
      it is not needed at all for calculating the layout.
      
      Where the snapshot id *is* needed is when building the request
      message for an osd operation.
      
      Drop the snapid parameter from ceph_calc_raw_layout() and pass
      that value instead in ceph_osdc_build_request().
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • libceph: pass length to ceph_calc_file_object_mapping() · e8afad65
      Alex Elder authored
      
      ceph_calc_file_object_mapping() takes (among other things) a "file"
      offset and length, and based on the layout, determines the object
      number ("bno") backing the affected portion of the file's data and
      the offset into that object where the desired range begins.  It also
      computes the size that should be used for the request--either the
      amount requested or something less if that would exceed the end of
      the object.
      
      This patch changes the input length parameter in this function so it
      is used only for input.  That is, the argument will be passed by
      value rather than by address, so the value provided won't get
      updated by the function.
      
      The value would only get updated if the length would surpass the
      current object, and in that case the value it got updated to would
      be exactly that returned in *oxlen.
      
      Only one of the two callers is affected by this change.  Update
      ceph_calc_raw_layout() so it records any updated value.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
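The interface shape described in this commit (length passed by value, clamped size returned through a separate output) can be sketched as below. This is a deliberately simplified model that assumes no striping, so each object simply covers `object_size` consecutive bytes of the file; the function name `calc_object_mapping` and its parameter list are hypothetical, and the real mapping also accounts for stripe unit and stripe count.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a file extent (off, len) onto one object, assuming each object
 * holds object_size consecutive bytes.  len is an input only; the
 * possibly shortened length is reported through *oxlen.
 */
void calc_object_mapping(uint64_t object_size, uint64_t off, uint64_t len,
			 uint64_t *bno, uint64_t *oxoff, uint64_t *oxlen)
{
	assert(object_size > 0);

	*bno = off / object_size;       /* object number backing the range */
	*oxoff = off % object_size;     /* offset within that object */
	*oxlen = object_size - *oxoff;  /* room left in the object */
	if (len < *oxlen)
		*oxlen = len;           /* request fits, use it as given */
}
```

A caller that wants the clamped length recorded copies `*oxlen` back into its own variable, which is the role the commit assigns to ceph_calc_raw_layout().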
    • libceph: pass length to ceph_osdc_build_request() · 0120be3c
      Alex Elder authored
      
      The len argument to ceph_osdc_build_request() is set up to be
      passed by address, but that function never updates its value
      so there's no need to do this.  Tighten up the interface by
      passing the length directly.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • libceph: kill op_needs_trail() · 5b9d1b1c
      Alex Elder authored
      
      Since every osd message is now prepared to include trailing data,
      there's no need to check ahead of time whether any operations will
      make use of the trail portion of the message.
      
      We can drop the second argument to get_num_ops(), and as a result we
      can also get rid of op_needs_trail() which is no longer used.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • libceph: always allow trail in osd request · c885837f
      Alex Elder authored
      
      An osd request structure contains an optional trail portion, which
      if present will contain data to be passed in the payload portion of
      the message containing the request.  The trail field is a
      ceph_pagelist pointer, and if null it indicates there is no trail.
      
      A ceph_pagelist structure contains a length field, and it can
      legitimately hold value 0.  Make use of this to change the
      interpretation of the "trail" of an osd request so that every osd
      request has trailing data, it just might have length 0.
      
      This means we change the r_trail field in a ceph_osd_request
      structure from a pointer to a structure that is always initialized.
      
      Note that in ceph_osdc_start_request(), the trail pointer (or now
      address of that structure) is assigned to a ceph message's trail
      field.  Here's why that's still OK (looking at net/ceph/messenger.c):
          - What would have resulted in a null pointer previously will now
            refer to a 0-length page list.  That message trail pointer
            is used in two functions, write_partial_msg_pages() and
            out_msg_pos_next().
          - In write_partial_msg_pages(), a null page list pointer is
            handled the same as a message with 0-length trail, and both
            result in a "in_trail" variable set to false.  The trail
            pointer is only used if in_trail is true.
          - The only other place the message trail pointer is used is
            out_msg_pos_next().  That function is only called by
            write_partial_msg_pages() and only touches the trail pointer
            if the in_trail value it is passed is true.
      Therefore a null ceph_msg->trail pointer is equivalent to a non-null
      pointer referring to a 0-length page list structure.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
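The equivalence argued in this commit (a null trail pointer behaves the same as a pointer to a 0-length page list) can be sketched as below. All type and function names here (`pagelist_sketch`, `request_old`, `request_new`, `has_trail`) are hypothetical simplifications, not the libceph structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified page list: only the length matters here. */
struct pagelist_sketch {
	size_t length;
};

/* Before: optional trail, NULL means "no trail". */
struct request_old {
	struct pagelist_sketch *trail;
};

/* After: trail always present, length 0 means "no trail". */
struct request_new {
	struct pagelist_sketch trail;
};

/* A consumer that treats NULL and 0-length identically. */
bool has_trail(const struct pagelist_sketch *trail)
{
	return trail != NULL && trail->length > 0;
}
```

Because the consumer already folds both cases into one answer, handing it the address of an embedded, zero-length list changes nothing for requests that carry no trailing data.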
    • rbd: drop oid parameters from ceph_osdc_build_request() · af77f26c
      Alex Elder authored
      
      The last two parameters to ceph_osdc_build_request() describe the
      object id, but the values passed always come from the osd request
      structure whose address is also provided.  Get rid of those last
      two parameters.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • libceph: reformat __reset_osd() · c3acb181
      Alex Elder authored
      
      Reformat __reset_osd() into three distinct blocks of code
      handling the three return cases.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • crush: avoid recursion if we have already collided · 7d7c1f61
      Sage Weil authored
      
      This saves us some cycles, but does not affect the placement result at
      all.
      
      This corresponds to ceph.git commit 4abb53d4f.
      Signed-off-by: Sage Weil <sage@inktank.com>
    • libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed · 1604f488
      Jim Schutt authored
      
      Add libceph support for a new CRUSH tunable recently added to Ceph servers.
      
      Consider the CRUSH rule
        step chooseleaf firstn 0 type <node_type>
      
      This rule means that <n> replicas will be chosen in a manner such that
      each chosen leaf's branch will contain a unique instance of <node_type>.
      
      When an object is re-replicated after a leaf failure, if the CRUSH map uses
      a chooseleaf rule the remapped replica ends up under the <node_type> bucket
      that held the failed leaf.  This causes uneven data distribution across the
      storage cluster, to the point that when all the leaves but one fail under a
      particular <node_type> bucket, that remaining leaf holds all the data from
      its failed peers.
      
      This behavior also limits the number of peers that can participate in the
      re-replication of the data held by the failed leaf, which increases the
      time required to re-replicate after a failure.
      
      For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
      inner and outer descents.
      
      If the tree descent down to <node_type> is the outer descent, and the descent
      from <node_type> down to a leaf is the inner descent, the issue is that a
      down leaf is detected on the inner descent, so only the inner descent is
      retried.
      
      In order to disperse re-replicated data as widely as possible across a
      storage cluster after a failure, we want to retry the outer descent. So,
      fix up crush_choose() to allow the inner descent to return immediately on
      choosing a failed leaf.  Wire this up as a new CRUSH tunable.
      
      Note that after this change, for a chooseleaf rule, if the primary OSD
      in a placement group has failed, choosing a replacement may result in
      one of the other OSDs in the PG colliding with the new primary.  That
      OSD's data for the PG will then need to be moved as well.  This
      seems unavoidable but should be relatively rare.
      
      This corresponds to ceph.git commit 88f218181a9e6d2292e2697fc93797d0f6d6e5dc.
      Signed-off-by: Jim Schutt <jaschut@sandia.gov>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • ceph: re-calculate truncate_size for strip object · a41bad1a
      Yan, Zheng authored
      
      Otherwise the osd may truncate the object to a larger size.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
  5. 11 Jan, 2013 2 commits
  6. 10 Jan, 2013 5 commits
    • nfs: fix sunrpc/clnt.c kernel-doc warnings · 7144bca6
      Randy Dunlap authored
      
      Fix new kernel-doc warnings in clnt.c:
      
        Warning(net/sunrpc/clnt.c:561): No description found for parameter 'flavor'
        Warning(net/sunrpc/clnt.c:561): Excess function parameter 'auth' description in 'rpc_clone_client_set_auth'
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: linux-nfs@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipv6: use addrconf_get_prefix_route for prefix route lookup [v2] · 21caa662
      Romain Kuntz authored
      Replace ip6_route_lookup() with addrconf_get_prefix_route() when
      looking up a prefix route. This ensures that the connected prefix
      is looked up in the main table, and avoids the selection of other
      matching routes located in different tables as well as blackhole
      or prohibited entries.
      
      In addition, this fixes an Oops introduced by commit 64c6d08e (ipv6:
      del unreachable route when an addr is deleted on lo), that would occur
      when a blackhole or prohibited entry is selected by ip6_route_lookup().
      Such entries have a NULL rt6i_table argument, which is accessed by
      __ip6_del_rt() when trying to lock rt6i_table->tb6_lock.
      
      The function addrconf_is_prefix_route() is not used anymore and is
      removed.
      
      [v2] Minor indentation cleanup and log updates.
      Signed-off-by: Romain Kuntz <r.kuntz@ipflavors.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: fix the noflags test in addrconf_get_prefix_route · 85da53bf
      Romain Kuntz authored
      
      The test on the flags in addrconf_get_prefix_route() does not make
      much sense: the 'noflags' parameter contains the set of flags that
      must not match with the route flags, so the test must be done
      against 'noflags', and not against 'flags'.
      Signed-off-by: Romain Kuntz <r.kuntz@ipflavors.com>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
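The intended semantics of the two masks can be sketched as below: every bit in `flags` must be set on the route, and no bit in `noflags` may be set. The function name `route_flags_match` and the use of plain integers for the route flags are assumptions made for illustration; this is not the kernel function itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if rt_flags contains all bits of `flags` and none of
 * the bits of `noflags`.  The exclusion test is made against
 * `noflags`, which is the point of the fix described above.
 */
bool route_flags_match(uint32_t rt_flags, uint32_t flags, uint32_t noflags)
{
	if ((rt_flags & flags) != flags)
		return false;	/* a required flag is missing */
	if ((rt_flags & noflags) != 0)
		return false;	/* a forbidden flag is present */
	return true;
}
```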
    • tcp: fix splice() and tcp collapsing interaction · f26845b4
      Eric Dumazet authored
      
      Under unusual circumstances, TCP collapse can split a big GRO TCP packet
      while it's being used in a splice(socket->pipe) operation.
      
      skb_splice_bits() releases the socket lock before calling
      splice_to_pipe().
      
      [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc()
      [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]-
      [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13
      
      To fix this problem, we must eat skbs in tcp_recv_skb().
      
      Remove the inline keyword from tcp_recv_skb() definition since
      it has three call sites.
      Reported-by: Christian Becker <c.becker@traviangames.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: splice: fix an infinite loop in tcp_read_sock() · ff905b1e
      Eric Dumazet authored
      Commit 02275a2e (tcp: don't abort splice() after small transfers)
      added a regression.
      
      [   83.843570] INFO: rcu_sched self-detected stall on CPU
      [   83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132)
      [   83.844582] Task dump for CPU 6:
      [   83.844584] netperf         R  running task        0  8966   8952 0x0000000c
      [   83.844587]  0000000000000000 0000000000000006 0000000000006c6c 0000000000000000
      [   83.844589]  000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10
      [   83.844592]  ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8
      [   83.844594] Call Trace:
      [   83.844596]  [<ffffffff81088679>] ? vprintk_emit+0x1c9/0x4c0
      [   83.844601]  [<ffffffff815ad449>] ? schedule+0x29/0x70
      [   83.844606]  [<ffffffff81537bd2>] ? tcp_splice_data_recv+0x42/0x50
      [   83.844610]  [<ffffffff8153beaa>] ? tcp_read_sock+0xda/0x260
      [   83.844613]  [<ffffffff81537b90>] ? tcp_prequeue_process+0xb0/0xb0
      [   83.844615]  [<ffffffff8153c0f0>] ? tcp_splice_read+0xc0/0x250
      [   83.844618]  [<ffffffff814dc0c2>] ? sock_splice_read+0x22/0x30
      [   83.844622]  [<ffffffff811b820b>] ? do_splice_to+0x7b/0xa0
      [   83.844627]  [<ffffffff811ba4bc>] ? sys_splice+0x59c/0x5d0
      [   83.844630]  [<ffffffff8119745b>] ? putname+0x2b/0x40
      [   83.844633]  [<ffffffff8118bcb4>] ? do_sys_open+0x174/0x1e0
      [   83.844636]  [<ffffffff815b6202>] ? system_call_fastpath+0x16/0x1b
      
      If recv_actor() returns 0, we should stop immediately,
      because looping won't give a chance to drain the pipe.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
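The rule stated in this commit (stop as soon as the actor consumes nothing) can be sketched with a generic read loop. Everything here is a hypothetical userspace model: `read_all`, `recv_actor_t`, `pipe_sketch` and `pipe_actor` are illustration names, and the real code operates on socket buffers rather than a flat byte array.

```c
#include <assert.h>
#include <stddef.h>

/* Consumer callback: returns how many bytes it accepted. */
typedef size_t (*recv_actor_t)(void *ctx, const char *data, size_t len);

/*
 * Feed data to the actor until everything is consumed or the actor
 * accepts nothing.  Without the zero check, a full consumer would
 * make this loop spin forever.
 */
size_t read_all(const char *data, size_t len, recv_actor_t actor, void *ctx)
{
	size_t copied = 0;

	assert(actor != NULL);
	while (copied < len) {
		size_t used = actor(ctx, data + copied, len - copied);

		if (used == 0)
			break;	/* consumer is full: stop immediately */
		copied += used;
	}
	return copied;
}

/* Hypothetical bounded consumer, standing in for a pipe. */
struct pipe_sketch {
	size_t room;	/* bytes the pipe can still accept */
};

size_t pipe_actor(void *ctx, const char *data, size_t len)
{
	struct pipe_sketch *p = ctx;
	size_t n = len < p->room ? len : p->room;

	(void)data;	/* contents are irrelevant to this model */
	p->room -= n;
	return n;
}
```

Once the model pipe is full, `pipe_actor` returns 0 and `read_all` returns the partial count, leaving the caller free to drain the pipe before retrying.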
  7. 09 Jan, 2013 1 commit
  8. 08 Jan, 2013 2 commits
    • SUNRPC: Ensure we release the socket write lock if the rpc_task exits early · 87ed5003
      Trond Myklebust authored
      
      If the rpc_task exits while holding the socket write lock before it has
      allocated an rpc slot, then the usual mechanism for releasing the write
      lock in xprt_release() is defeated.
      
      The problem occurs if the call to xprt_lock_write() initially fails, so
      that the rpc_task is put on the xprt->sending wait queue. If the task
      exits after being assigned the lock by __xprt_lock_write_func, but
      before it has retried the call to xprt_lock_and_alloc_slot(), then
      it calls xprt_release() while holding the write lock, but will
      immediately exit due to the test for task->tk_rqstp != NULL.
      Reported-by: Chris Perl <chris.perl@gmail.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@vger.kernel.org [>= 3.1]
    • s390/irq: remove split irq fields from /proc/stat · 420f42ec
      Heiko Carstens authored
      Now that irq sum accounting for /proc/stat's "intr" line works again we
      have the oddity that the sum field (first field) contains only the sum
      of the second (external irqs) and third field (I/O interrupts).
      The reason for that is that these two fields are already sums of all other
      fields. So if we summed up everything we would count every interrupt
      twice.
      This has been broken since the split interrupt accounting was merged two
      years ago: 052ff461 "[S390] irq: have detailed statistics for interrupt
      types".
      To fix this remove the split interrupt fields from /proc/stat's "intr"
      line again and only have them in /proc/interrupts.
      
      This restores the old behaviour, seems to be the only sane fix and mimics
      a behaviour from other architectures where /proc/interrupts also contains
      more than /proc/stat's "intr" line does.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
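The double counting described in this commit can be shown with a small arithmetic sketch. The field layout is an assumption made for illustration only: the first two values are the external and I/O totals, and the remaining values are the detailed counters those totals were built from.

```c
#include <assert.h>
#include <stddef.h>

/* Add up the first n fields of an "intr"-style line. */
unsigned long sum_fields(const unsigned long *fields, size_t n)
{
	unsigned long total = 0;
	size_t i;

	assert(fields != NULL || n == 0);
	for (i = 0; i < n; i++)
		total += fields[i];
	return total;
}
```

With hypothetical fields `{5, 7, 2, 3, 3, 4}` (external total 5 = 2 + 3, I/O total 7 = 3 + 4), summing only the two totals gives 12, while summing every field gives 24, which counts each interrupt twice.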