1. 06 Feb, 2019 1 commit
    • svcrdma: Remove max_sge check at connect time · e248aa7b
      Chuck Lever authored
      Two and a half years ago, the client was changed to use gathered
      Send for larger inline messages, in commit 655fec69 ("xprtrdma:
      Use gathered Send for large inline messages"). Several fixes were
      required because there are a few in-kernel device drivers whose
      max_sge is 3, and these were broken by the change.
      
      Apparently my memory is going, because some time later, I submitted
      commit 25fd86ec ("svcrdma: Don't overrun the SGE array in
      svc_rdma_send_ctxt"), and after that, commit f3c1fd0e ("svcrdma:
      Reduce max_send_sges"). These too incorrectly assumed in-kernel
      device drivers would have more than a few Send SGEs available.
      
      The fix for the server side is not the same. This is because the
      fundamental problem on the server is that, whether or not the client
      has provisioned a chunk for the RPC reply, the server must squeeze
      even the most complex RPC replies into a single RDMA Send. Failing
      in the send path because of Send SGE exhaustion should never be an
      option.
      
      Therefore, instead of failing when the send path runs out of SGEs,
      switch to using a bounce buffer mechanism to handle RPC replies that
      are too complex for the device to send directly. That allows the
      max_sge check to be removed, enabling drivers with a small max_sge
      to work again.
      Reported-by: Don Dutile <ddutile@redhat.com>
      Fixes: 25fd86ec ("svcrdma: Don't overrun the SGE array in ...")
      Cc: stable@vger.kernel.org
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
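
      A minimal sketch of the pull-up idea described above, assuming an
      already-DMA-mapped bounce buffer; the helper name and the
      sc_bounce_buf field are illustrative, not the actual kernel code:

      /* Copy an xdr_buf into one contiguous buffer so the Reply can be
       * posted with a single Send SGE when the device's SGE limit would
       * otherwise be exceeded.
       */
      static void svc_rdma_pull_up_reply(struct svc_rdma_send_ctxt *ctxt,
                                         const struct xdr_buf *xdr)
      {
              unsigned char *dst = ctxt->sc_bounce_buf; /* hypothetical field */
              unsigned int remaining = xdr->page_len;
              unsigned int page_off = xdr->page_base;
              int i = 0;

              memcpy(dst, xdr->head[0].iov_base, xdr->head[0].iov_len);
              dst += xdr->head[0].iov_len;

              while (remaining) {                       /* copy the page list */
                      unsigned int len = min_t(unsigned int, remaining,
                                               PAGE_SIZE - page_off);

                      memcpy(dst, page_address(xdr->pages[i]) + page_off, len);
                      dst += len;
                      remaining -= len;
                      page_off = 0;
                      i++;
              }

              if (xdr->tail[0].iov_len)
                      memcpy(dst, xdr->tail[0].iov_base, xdr->tail[0].iov_len);
              /* caller posts one SGE covering sc_bounce_buf, xdr->len bytes */
      }
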
  2. 28 Dec, 2018 1 commit
  3. 28 Nov, 2018 1 commit
  4. 09 Aug, 2018 1 commit
    • svcrdma: Avoid releasing a page in svc_xprt_release() · a53d5cb0
      Chuck Lever authored
      
      svc_xprt_release() invokes svc_free_res_pages(), which releases
      pages between rq_respages and rq_next_page.
      
      Historically, the RPC/RDMA transport has set these two pointers to
      be different by one, which means:
      
      - one page gets released when svc_recv returns 0. This normally
      happens whenever one or more RDMA Reads need to be dispatched to
      complete construction of an RPC Call.
      
      - one page gets released after every call to svc_send.
      
      In both cases, this released page is immediately refilled by
      svc_alloc_arg. There does not seem to be a reason for releasing this
      page.
      
      To avoid this unnecessary memory allocator traffic, set rq_next_page
      more carefully.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
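
      A minimal sketch of the accounting rule this change relies on; the
      helper is hypothetical, while rq_respages and rq_next_page are the
      real svc_rqst pointers:

      /* svc_xprt_release() -> svc_free_res_pages() frees every page in
       * [rq_respages, rq_next_page).  Making the two pointers equal means
       * no page is released, so svc_alloc_arg() has nothing to refill.
       */
      static void svc_rdma_release_no_pages(struct svc_rqst *rqstp)
      {
              /* previously: rqstp->rq_next_page = rqstp->rq_respages + 1; */
              rqstp->rq_next_page = rqstp->rq_respages;
      }
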
  5. 24 Jul, 2018 1 commit
  6. 11 May, 2018 13 commits
    • svcrdma: Persistently allocate and DMA-map Send buffers · 99722fe4
      Chuck Lever authored
      
      While sending each RPC Reply, svc_rdma_sendto allocates and DMA-
      maps a separate buffer where the RPC/RDMA transport header is
      constructed. The buffer is unmapped and released in the Send
      completion handler. This is significant per-RPC overhead,
      especially for small RPCs.
      
      Instead, allocate and DMA-map a buffer, and cache it in each
      svc_rdma_send_ctxt. This buffer and its mapping can be re-used
      for each RPC, saving the cost of memory allocation and DMA
      mapping.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
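
      A rough sketch of the per-ctxt setup this describes, assuming
      illustrative field names (sc_xprt_buf, sc_sges); only the core
      calls (ib_dma_map_single, ib_dma_mapping_error) are real APIs:

      static int svc_rdma_init_hdr_buf(struct ib_device *dev, struct ib_pd *pd,
                                       struct svc_rdma_send_ctxt *ctxt,
                                       size_t size)
      {
              /* Allocate and map once, at send_ctxt creation time */
              ctxt->sc_xprt_buf = kzalloc(size, GFP_KERNEL);
              if (!ctxt->sc_xprt_buf)
                      return -ENOMEM;

              ctxt->sc_sges[0].addr = ib_dma_map_single(dev, ctxt->sc_xprt_buf,
                                                        size, DMA_TO_DEVICE);
              if (ib_dma_mapping_error(dev, ctxt->sc_sges[0].addr)) {
                      kfree(ctxt->sc_xprt_buf);
                      return -EIO;
              }
              ctxt->sc_sges[0].lkey = pd->local_dma_lkey;

              /* Every Reply re-uses this SGE; only its length changes */
              return 0;
      }
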
    • svcrdma: Simplify svc_rdma_send() · 3abb03fa
      Chuck Lever authored
      
      Clean up: No current caller of svc_rdma_send() passes in a chained
      WR. The logic that counts the chain length can be replaced with a
      constant (1).
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Remove post_send_wr · 986b7889
      Chuck Lever authored
      
      Clean up: Now that the send_wr is part of the svc_rdma_send_ctxt,
      svc_rdma_post_send_wr is nearly empty.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxt · 25fd86ec
      Chuck Lever authored
      
      Receive buffers are always the same size, but each Send WR has a
      variable number of SGEs, based on the contents of the xdr_buf being
      sent.
      
      While assembling a Send WR, keep track of the number of SGEs so that
      we don't exceed the device's maximum, or walk off the end of the
      Send SGE array.
      
      For now the Send path just fails if it exceeds the maximum.
      
      The current logic in svc_rdma_accept bases the maximum number of
      Send SGEs on the largest NFS request that can be sent or received.
      In the transport layer, the limit is actually based on the
      capabilities of the underlying device, not on properties of the
      Upper Layer Protocol.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
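
      A short sketch of the bounds check, assuming the sc_cur_sge_no and
      sc_max_send_sges counters this series introduces; the helper itself
      is illustrative:

      static int svc_rdma_add_send_sge(struct svcxprt_rdma *rdma,
                                       struct svc_rdma_send_ctxt *ctxt,
                                       u64 dma_addr, u32 len)
      {
              /* Refuse to walk off the end of the Send SGE array */
              if (ctxt->sc_cur_sge_no >= rdma->sc_max_send_sges)
                      return -EIO;

              ctxt->sc_sges[ctxt->sc_cur_sge_no].addr   = dma_addr;
              ctxt->sc_sges[ctxt->sc_cur_sge_no].length = len;
              ctxt->sc_cur_sge_no++;
              ctxt->sc_send_wr.num_sge = ctxt->sc_cur_sge_no;
              return 0;
      }
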
    • svcrdma: Introduce svc_rdma_send_ctxt · 4201c746
      Chuck Lever authored
      
      svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
      free list. This eliminates the overhead of calling kmalloc / kfree,
      both of which grab a globally shared lock that disables interrupts.
      Introduce a replacement to svc_rdma_op_ctxt's that is built
      especially for the svcrdma Send path.
      
      Subsequent patches will take advantage of this new structure by
      allocating real resources which are then cached in these objects.
      The allocations are freed when the transport is torn down.
      
      I've renamed the structure so that static type checking can be used
      to ensure that uses of op_ctxt and send_ctxt are not confused. As an
      additional clean up, structure fields are renamed to conform with
      kernel coding conventions.
      
      Additional clean ups:
      - Handle svc_rdma_send_ctxt_get allocation failure at each call
        site, rather than pre-allocating and hoping we guessed correctly
      - All send_ctxt_put call-sites request page freeing, so remove
        the @free_pages argument
      - All send_ctxt_put call-sites unmap SGEs, so fold that into
        svc_rdma_send_ctxt_put
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
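
      A minimal sketch of the per-xprt free list that replaces
      kmalloc/kfree in the Send path; the list and lock names are
      illustrative:

      static struct svc_rdma_send_ctxt *
      svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
      {
              struct svc_rdma_send_ctxt *ctxt = NULL;

              spin_lock(&rdma->sc_send_lock);
              if (!list_empty(&rdma->sc_send_ctxts)) {
                      ctxt = list_first_entry(&rdma->sc_send_ctxts,
                                              struct svc_rdma_send_ctxt,
                                              sc_list);
                      list_del(&ctxt->sc_list);
              }
              spin_unlock(&rdma->sc_send_lock);
              return ctxt;    /* NULL: each call site handles the failure */
      }

      static void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
                                         struct svc_rdma_send_ctxt *ctxt)
      {
              /* DMA-unmap the SGEs and free pages here, then recycle */
              spin_lock(&rdma->sc_send_lock);
              list_add(&ctxt->sc_list, &rdma->sc_send_ctxts);
              spin_unlock(&rdma->sc_send_lock);
      }
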
    • svcrdma: Clean up Send SGE accounting · 23262790
      Chuck Lever authored
      
      Clean up: Since there's already a svc_rdma_op_ctxt being passed
      around with the running count of mapped SGEs, drop unneeded
      parameters to svc_rdma_post_send_wr().
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Refactor svc_rdma_dma_map_buf · f016f305
      Chuck Lever authored
      
      Clean up: svc_rdma_dma_map_buf does mostly the same thing as
      svc_rdma_dma_map_page, so let's fold these together.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Persistently allocate and DMA-map Receive buffers · 3316f063
      Chuck Lever authored
      
      The current Receive path uses an array of pages which are allocated
      and DMA mapped when each Receive WR is posted, and then handed off
      to the upper layer in rqstp::rq_arg. The page flip releases unused
      pages in the rq_pages pagelist. This mechanism introduces a
      significant amount of overhead.
      
      So instead, kmalloc the Receive buffer, and leave it DMA-mapped
      while the transport remains connected. This confers a number of
      benefits:
      
      * Each Receive WR requires only one receive SGE, no matter how large
        the inline threshold is. This helps the server-side NFS/RDMA
        transport operate on less capable RDMA devices.
      
      * The Receive buffer is left allocated and mapped all the time. This
        relieves svc_rdma_post_recv from the overhead of allocating and
        DMA-mapping a fresh buffer.
      
      * svc_rdma_wc_receive no longer has to DMA unmap the Receive buffer.
        It has to DMA sync only the number of bytes that were received.
      
      * svc_rdma_build_arg_xdr no longer has to free a page in rq_pages
        for each page in the Receive buffer, making it a constant-time
        function.
      
      * The Receive buffer is now plugged directly into the rq_arg's
        head[0] iovec, and can be larger than a page without spilling
        over into rq_arg's page list. This enables simplification of
        the RDMA Read path in subsequent patches.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
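
      A brief sketch of the completion-side consequence described above;
      the ctxt field names are illustrative, ib_dma_sync_single_for_cpu()
      is the real core API:

      static void svc_rdma_sync_recv_buf(struct ib_device *dev,
                                         struct svc_rdma_recv_ctxt *ctxt,
                                         const struct ib_wc *wc)
      {
              /* The buffer stays mapped for the life of the connection;
               * only sync the bytes the device actually wrote.
               */
              ib_dma_sync_single_for_cpu(dev, ctxt->rc_recv_sge.addr,
                                         wc->byte_len, DMA_FROM_DEVICE);
              ctxt->rc_byte_len = wc->byte_len;
              /* rq_arg.head[0] can then point straight at the kmalloc'd buffer */
      }
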
    • svcrdma: Preserve Receive buffer until svc_rdma_sendto · 3a88092e
      Chuck Lever authored
      
      Rather than releasing the incoming svc_rdma_recv_ctxt at the end of
      svc_rdma_recvfrom, hold onto it until svc_rdma_sendto.
      
      This permits the contents of the Receive buffer to be preserved
      through svc_process and then referenced directly in sendto as it
      constructs Write and Reply chunks to return to the client.
      
      The real changes will come in subsequent patches.
      
      Note: I cannot use ->xpo_release_rqst for this purpose because that
      is called _before_ ->xpo_sendto. svc_rdma_sendto uses information in
      the received Call transport header to construct the Reply transport
      header, which is preserved in the RPC's Receive buffer.
      
      The historical comment in svc_send() isn't helpful: it is already
      obvious that ->xpo_release_rqst is being called before ->xpo_sendto,
      but there is no explanation for this ordering going back to the
      beginning of the git era.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Introduce svc_rdma_recv_ctxt · ecf85b23
      Chuck Lever authored
      
      svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
      free list. This eliminates the overhead of calling kmalloc / kfree,
      both of which grab a globally shared lock that disables interrupts.
      To reduce contention further, separate the use of these objects in
      the Receive and Send paths in svcrdma.
      
      Subsequent patches will take advantage of this separation by
      allocating real resources which are then cached in these objects.
      The allocations are freed when the transport is torn down.
      
      I've renamed the structure so that static type checking can be used
      to ensure that uses of op_ctxt and recv_ctxt are not confused. As an
      additional clean up, structure fields are renamed to conform with
      kernel coding conventions.
      
      As a final clean up, helpers related to recv_ctxt are moved closer
      to the functions that use them.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Trace key RDMA API events · bd2abef3
      Chuck Lever authored
      
      This includes:
        * Posting on the Send and Receive queues
        * Send, Receive, Read, and Write completion
        * Connect upcalls
        * QP errors
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
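
      For reference, a minimal example of the TRACE_EVENT shape such
      tracepoints take; the event name and fields here are made up and do
      not match the ones the patch actually adds:

      TRACE_EVENT(svcrdma_wc_send_example,
              TP_PROTO(const struct ib_wc *wc),
              TP_ARGS(wc),
              TP_STRUCT__entry(
                      __field(u32, status)
                      __field(u32, vendor_err)
              ),
              TP_fast_assign(
                      __entry->status = wc->status;
                      __entry->vendor_err = wc->vendor_err;
              ),
              TP_printk("status=%u vendor_err=%u",
                        __entry->status, __entry->vendor_err)
      );
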
    • svcrdma: Trace key RPC/RDMA protocol events · 98895edb
      Chuck Lever authored
      
      This includes:
        * Transport accept and tear-down
        * Decisions about using Write and Reply chunks
        * Each RDMA segment that is handled
        * Whenever an RDMA_ERR is sent
      
      As a clean-up, I've standardized the order of the includes, and
      removed some now redundant dprintk call sites.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  7. 18 Jan, 2018 1 commit
  8. 13 Jul, 2017 1 commit
  9. 28 Jun, 2017 2 commits
  10. 25 Apr, 2017 8 commits
  11. 08 Feb, 2017 2 commits
  12. 30 Nov, 2016 3 commits
    • svcrdma: Further clean-up of svc_rdma_get_inv_rkey() · fafedf81
      Chuck Lever authored
      
      No longer any need for the dprintk().
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Remove BH-disabled spin locking in svc_rdma_send() · e4eb42ce
      Chuck Lever authored
      
      svcrdma's current SQ accounting algorithm takes sc_lock and disables
      bottom-halves while posting all RDMA Read, Write, and Send WRs.
      
      This is relatively heavyweight serialization. And note that Write and
      Send are already fully serialized by the xpt_mutex.
      
      Using a single atomic_t should be all that is necessary to guarantee
      that ib_post_send() is called only when there is enough space on the
      send queue. This is what the other RDMA-enabled storage targets do.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
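
      A small sketch of the atomic_t scheme, assuming an sc_sq_avail
      counter initialized to the send queue depth; the helper itself is
      illustrative:

      static int svc_rdma_post_one_send(struct svcxprt_rdma *rdma,
                                        struct ib_send_wr *wr)
      {
              const struct ib_send_wr *bad_wr;
              int ret;

              /* Reserve a slot; no sc_lock, no BH disabling */
              if (atomic_sub_return(1, &rdma->sc_sq_avail) < 0) {
                      atomic_add(1, &rdma->sc_sq_avail);
                      return -EAGAIN;         /* SQ full: caller waits */
              }

              ret = ib_post_send(rdma->sc_qp, wr, &bad_wr);
              if (ret)
                      atomic_add(1, &rdma->sc_sq_avail);
              return ret;
      }

      /* ...and the Send completion handler does atomic_inc(&rdma->sc_sq_avail) */
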
    • svcrdma: Renovate sendto chunk list parsing · 5fdca653
      Chuck Lever authored
      
      The current sendto code appears to support clients that provide only
      one of a Read list, a Write list, or a Reply chunk. My reading of
      that code is that it doesn't support the following cases:
      
       - Read list + Write list
       - Read list + Reply chunk
       - Write list + Reply chunk
       - Read list + Write list + Reply chunk
      
      The protocol allows more than one Read or Write chunk in those
      lists. Some clients do send a Read list and Reply chunk
      simultaneously. NFSv4 WRITE uses a Read list for the data payload,
      and a Reply chunk because the GETATTR result in the reply can
      contain a large object like an ACL.
      
      Generalize one of the sendto code paths needed to support all of
      the above cases, and attempt to ensure that only one pass is done
      through the RPC Call's transport header to gather chunk list
      information for building the reply.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  13. 23 Sep, 2016 3 commits
    • svcrdma: support Remote Invalidation · 25d55296
      Chuck Lever authored
      
      Support Remote Invalidation. A private message is exchanged with
      the client upon RDMA transport connect that indicates whether
      Send With Invalidation may be used by the server to send RPC
      replies. The invalidate_rkey is arbitrarily chosen from among
      rkeys present in the RPC-over-RDMA header's chunk lists.
      
      Send With Invalidate improves performance only when clients can
      recognize, while processing an RPC reply, that an rkey has already
      been invalidated. That has been submitted as a separate change.
      
      In the future, the RPC-over-RDMA protocol might support Remote
      Invalidation properly. The protocol needs to enable signaling
      between peers to indicate when Remote Invalidation can be used
      for each individual RPC.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
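
      A short sketch of how a Reply Send can carry the invalidation,
      assuming the negotiated-support flag lives in the transport (the
      sc_snd_w_inv name is illustrative); IB_WR_SEND_WITH_INV and
      ex.invalidate_rkey are the real verbs interfaces:

      static void svc_rdma_set_send_opcode(struct svcxprt_rdma *rdma,
                                           struct ib_send_wr *wr, u32 inv_rkey)
      {
              if (rdma->sc_snd_w_inv && inv_rkey) {
                      /* Client agreed at connect time; invalidate one of
                       * the rkeys it advertised in the Call's chunk lists.
                       */
                      wr->opcode = IB_WR_SEND_WITH_INV;
                      wr->ex.invalidate_rkey = inv_rkey;
              } else {
                      wr->opcode = IB_WR_SEND;
              }
              wr->send_flags = IB_SEND_SIGNALED;
      }
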
    • svcrdma: Skip put_page() when send_reply() fails · 9995237b
      Chuck Lever authored
      Message from syslogd@klimt at Aug 18 17:00:37 ...
       kernel:page:ffffea0020639b00 count:0 mapcount:0 mapping:          (null) index:0x0
      Aug 18 17:00:37 klimt kernel: flags: 0x2fffff80000000()
      Aug 18 17:00:37 klimt kernel: page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
      Aug 18 17:00:37 klimt kernel: kernel BUG at /home/cel/src/linux/linux-2.6/include/linux/mm.h:445!
      Aug 18 17:00:37 klimt kernel: RIP: 0010:[<ffffffffa05c21c1>] svc_rdma_sendto+0x641/0x820 [rpcrdma]
      
      send_reply() assigns its page argument as the first page of ctxt. On
      error, send_reply() already invokes svc_rdma_put_context(ctxt, 1);
      which does a put_page() on that very page. No need to do that again
      as svc_rdma_sendto exits.
      
      Fixes: 3e1eeb98 ("svcrdma: Close connection when a send error occurs")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Tail iovec leaves an orphaned DMA mapping · cace564f
      Chuck Lever authored
      The ctxt's count field is overloaded to mean the number of pages in
      the ctxt->page array and the number of SGEs in the ctxt->sge array.
      Typically these two numbers are the same.
      
      However, when an inline RPC reply is constructed from an xdr_buf
      with a tail iovec, the head and tail often occupy the same page,
      but each are DMA mapped independently. In that case, ->count equals
      the number of pages, but it does not equal the number of SGEs.
      There's one more SGE, for the tail iovec. Hence there is one more
      DMA mapping than there are pages in the ctxt->page array.
      
      This isn't a real problem until the server's iommu is enabled. Then
      each RPC reply that has content in that iovec orphans a DMA mapping
      that consists of real resources.
      
      krb5i and krb5p always populate that tail iovec. After a couple
      million sent krb5i/p RPC replies, the NFS server starts behaving
      erratically. Reboot is needed to clear the problem.
      
      Fixes: 9d11b51c ("svcrdma: Fix send_reply() scatter/gather set-up")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  14. 13 May, 2016 1 commit
  15. 01 Mar, 2016 1 commit
    • svcrdma: Use new CQ API for RPC-over-RDMA server send CQs · be99bb11
      Chuck Lever authored
      Calling ib_poll_cq() to sort through WCs during a completion is a
      common pattern amongst RDMA consumers. Since commit 14d3a3b2
      ("IB: add a proper completion queue abstraction"), WC sorting can
      be handled by the IB core.
      
      By converting to this new API, svcrdma is made a better neighbor to
      other RDMA consumers, as it allows the core to schedule the delivery
      of completions more fairly amongst all active consumers.
      
      This new API also aims each completion at a function that is
      specific to the WR's opcode. Thus the ctxt->wr_op field and the
      switch in process_context is replaced by a set of methods that
      handle each completion type.
      
      Because each ib_cqe carries a pointer to a completion method, the
      core can now post operations on a consumer's QP, and handle the
      completions itself.
      
      The server's rdma_stat_sq_poll and rdma_stat_sq_prod metrics are no
      longer updated.
      
      As a clean up, the cq_event_handler, the dto_tasklet, and all
      associated locking are removed, as they are no longer referenced
      or used.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
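
      A compact sketch of the ib_cqe pattern this conversion adopts; the
      ctxt layout and handler body are illustrative, while ib_alloc_cq(),
      struct ib_cqe, and wr_cqe are the real core interfaces:

      static void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)
      {
              /* The core aims the completion at this handler directly */
              struct svc_rdma_send_ctxt *ctxt =
                      container_of(wc->wr_cqe, struct svc_rdma_send_ctxt,
                                   sc_cqe);

              if (wc->status != IB_WC_SUCCESS)
                      pr_warn("svcrdma: Send flushed (ctxt=%p)\n", ctxt);
              /* release ctxt resources, wake SQ waiters, etc. */
      }

      /* At accept time (illustrative):
       *   rdma->sc_sq_cq = ib_alloc_cq(dev, rdma, sq_depth, 0,
       *                                IB_POLL_WORKQUEUE);
       * and before posting each Send:
       *   ctxt->sc_cqe.done       = svc_rdma_wc_send;
       *   ctxt->sc_send_wr.wr_cqe = &ctxt->sc_cqe;
       */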