- 06 Feb, 2019 1 commit
-
-
Chuck Lever authored
Two and a half years ago, the client was changed to use gathered Send for larger inline messages, in commit 655fec69 ("xprtrdma: Use gathered Send for large inline messages"). Several fixes were required because there are a few in-kernel device drivers whose max_sge is 3, and these were broken by the change. Apparently my memory is going, because some time later, I submitted commit 25fd86ec ("svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxt"), and after that, commit f3c1fd0e ("svcrdma: Reduce max_send_sges"). These too incorrectly assumed in-kernel device drivers would have more than a few Send SGEs available. The fix for the server side is not the same. This is because the fundamental problem on the server is that, whether or not the client has provisioned a chunk for the RPC reply, the server must squeeze even the most complex RPC replies into a single RDMA Send. Failing in the send path because of Send SGE exhaustion should never be an option. Therefore, instead of failing when the send path runs out of SGEs, switch to using a bounce buffer mechanism to handle RPC replies that are too complex for the device to send directly. That allows us to remove the max_sge check to enable drivers with small max_sge to work again. Reported-by:
Don Dutile <ddutile@redhat.com>
Fixes: 25fd86ec ("svcrdma: Don't overrun the SGE array in ...")
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
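The pull-up approach described above can be sketched roughly as follows. This is an illustrative sketch only, not the upstream implementation; the helper and field names (svc_rdma_send_pulled_up, sc_xprt_buf, sc_sges, sc_send_wr) are assumptions, and page_base handling is elided. When a reply would need more SGEs than the device offers, the xdr_buf is linearized into the ctxt's already-mapped buffer and sent with a single SGE.

    #include <linux/sunrpc/svc_rdma.h>

    /* Hedged sketch: linearize a too-complex reply into the pre-mapped
     * send buffer so it can go out as a single-SGE Send. Assumes the
     * buffer was sized for the largest inline reply and that page_base
     * is zero (real code must handle it).
     */
    static int svc_rdma_send_pulled_up(struct svcxprt_rdma *rdma,
                                       struct svc_rdma_send_ctxt *ctxt,
                                       struct xdr_buf *xdr)
    {
        unsigned char *dst = ctxt->sc_xprt_buf + ctxt->sc_sges[0].length;
        unsigned int remaining, i;

        memcpy(dst, xdr->head[0].iov_base, xdr->head[0].iov_len);
        dst += xdr->head[0].iov_len;

        for (i = 0, remaining = xdr->page_len; remaining; i++) {
            unsigned int len = min_t(unsigned int, remaining, PAGE_SIZE);

            memcpy(dst, page_address(xdr->pages[i]), len);
            dst += len;
            remaining -= len;
        }

        memcpy(dst, xdr->tail[0].iov_base, xdr->tail[0].iov_len);

        /* The whole reply now lives behind the transport header */
        ctxt->sc_sges[0].length += xdr->len;
        ctxt->sc_send_wr.num_sge = 1;
        return svc_rdma_send(rdma, &ctxt->sc_send_wr);
    }

The copy costs CPU cycles, but it is taken only for replies that would otherwise fail outright on devices with a small Send SGE limit.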
-
- 28 Dec, 2018 1 commit
-
-
Vasily Averin authored
The xpo_prep_reply_hdr callback is not used now. It was defined for the TCP transport only, and it no longer needs to be called indirectly, so move the logic into its caller and remove the unused callback.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- 28 Nov, 2018 1 commit
-
-
Chuck Lever authored
o Select the R_key to invalidate while the CPU cache still contains the received RPC Call transport header, rather than waiting until we're about to send the RPC Reply.
o Choose Send With Invalidate if there is exactly one distinct R_key in the received transport header. If there's more than one, the client will have to perform local invalidation after it has already waited for remote invalidation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
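A minimal sketch of that selection policy, assuming hypothetical field names (rc_inv_rkey, rc_cant_invalidate) and assuming the client has already advertised support for Remote Invalidation; it is not the upstream code:

    #include <linux/sunrpc/svc_rdma.h>

    /* Called for every rdma segment found while parsing the received
     * transport header, i.e. while that header is still cache-hot. */
    static void svc_rdma_note_rkey(struct svc_rdma_recv_ctxt *ctxt, u32 rkey)
    {
        if (!ctxt->rc_inv_rkey)
            ctxt->rc_inv_rkey = rkey;            /* first R_key seen */
        else if (ctxt->rc_inv_rkey != rkey)
            ctxt->rc_cant_invalidate = true;     /* more than one distinct R_key */
    }

    /* Later, while building the Send WR for the RPC Reply: */
    static void svc_rdma_choose_send_opcode(struct svc_rdma_send_ctxt *sctxt,
                                            struct svc_rdma_recv_ctxt *rctxt)
    {
        if (rctxt->rc_inv_rkey && !rctxt->rc_cant_invalidate) {
            sctxt->sc_send_wr.opcode = IB_WR_SEND_WITH_INV;
            sctxt->sc_send_wr.ex.invalidate_rkey = rctxt->rc_inv_rkey;
        } else {
            sctxt->sc_send_wr.opcode = IB_WR_SEND;
        }
    }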
-
- 09 Aug, 2018 1 commit
-
-
Chuck Lever authored
svc_xprt_release() invokes svc_free_res_pages(), which releases pages between rq_respages and rq_next_page. Historically, the RPC/RDMA transport has set these two pointers to be different by one, which means:
- one page gets released when svc_recv returns 0. This normally happens whenever one or more RDMA Reads need to be dispatched to complete construction of an RPC Call.
- one page gets released after every call to svc_send.
In both cases, this released page is immediately refilled by svc_alloc_arg. There does not seem to be a reason for releasing this page. To avoid this unnecessary memory allocator traffic, set rq_next_page more carefully.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- 24 Jul, 2018 1 commit
-
-
Bart Van Assche authored
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL as the third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
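In svcrdma terms the change looks like the following small before/after sketch (the wrapper functions are illustrative, only the ib_post_send() calls matter):

    #include <rdma/ib_verbs.h>

    /* Before: every caller had to declare a bad_wr it never examined */
    static int post_send_old(struct ib_qp *qp, struct ib_send_wr *wr)
    {
        struct ib_send_wr *bad_wr;

        return ib_post_send(qp, wr, &bad_wr);
    }

    /* After: the core accepts NULL when the caller does not care which
     * WR in a chain failed */
    static int post_send_new(struct ib_qp *qp, struct ib_send_wr *wr)
    {
        return ib_post_send(qp, wr, NULL);
    }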
-
- 11 May, 2018 13 commits
-
-
Chuck Lever authored
While sending each RPC Reply, svc_rdma_sendto allocates and DMA-maps a separate buffer where the RPC/RDMA transport header is constructed. The buffer is unmapped and released in the Send completion handler. This is significant per-RPC overhead, especially for small RPCs. Instead, allocate and DMA-map a buffer, and cache it in each svc_rdma_send_ctxt. This buffer and its mapping can be re-used for each RPC, saving the cost of memory allocation and DMA mapping.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
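A hedged sketch of the one-time setup, with illustrative names (sc_xprt_buf, sc_sges, sc_max_req_size); the allocation happens when a send_ctxt is created for the transport, not per Reply:

    #include <linux/sunrpc/svc_rdma.h>
    #include <rdma/ib_verbs.h>

    static struct svc_rdma_send_ctxt *
    svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
    {
        struct ib_device *dev = rdma->sc_cm_id->device;
        struct svc_rdma_send_ctxt *ctxt;
        dma_addr_t addr;
        void *buf;

        ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
        if (!ctxt)
            return NULL;
        buf = kmalloc(rdma->sc_max_req_size, GFP_KERNEL);
        if (!buf)
            goto fail;
        addr = ib_dma_map_single(dev, buf, rdma->sc_max_req_size,
                                 DMA_TO_DEVICE);
        if (ib_dma_mapping_error(dev, addr))
            goto fail_free;

        /* Cached for the life of the ctxt; re-used by every Reply */
        ctxt->sc_xprt_buf = buf;
        ctxt->sc_sges[0].addr = addr;
        ctxt->sc_sges[0].lkey = rdma->sc_pd->local_dma_lkey;
        return ctxt;

    fail_free:
        kfree(buf);
    fail:
        kfree(ctxt);
        return NULL;
    }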
-
Chuck Lever authored
Clean up: No current caller of svc_rdma_send passes in a chained WR. The logic that counts the chain length can be replaced with a constant (1).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
Clean up: Now that the send_wr is part of the svc_rdma_send_ctxt, svc_rdma_post_send_wr is nearly empty. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
Receive buffers are always the same size, but each Send WR has a variable number of SGEs, based on the contents of the xdr_buf being sent. While assembling a Send WR, keep track of the number of SGEs so that we don't exceed the device's maximum, or walk off the end of the Send SGE array. For now the Send path just fails if it exceeds the maximum. The current logic in svc_rdma_accept bases the maximum number of Send SGEs on the largest NFS request that can be sent or received. In the transport layer, the limit is actually based on the capabilities of the underlying device, not on properties of the Upper Layer Protocol. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
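A sketch of the bounds check while mapping each piece of the xdr_buf; sc_cur_sge_no and sc_max_send_sges are illustrative names rather than a verbatim copy of the upstream fields:

    #include <linux/sunrpc/svc_rdma.h>
    #include <rdma/ib_verbs.h>

    static int svc_rdma_dma_map_page(struct svcxprt_rdma *rdma,
                                     struct svc_rdma_send_ctxt *ctxt,
                                     struct page *page,
                                     unsigned long offset, unsigned int len)
    {
        struct ib_device *dev = rdma->sc_cm_id->device;
        dma_addr_t addr;

        if (ctxt->sc_cur_sge_no >= rdma->sc_max_send_sges)
            return -EIO;    /* would overrun the device's SGE limit */

        addr = ib_dma_map_page(dev, page, offset, len, DMA_TO_DEVICE);
        if (ib_dma_mapping_error(dev, addr))
            return -EIO;

        ctxt->sc_sges[ctxt->sc_cur_sge_no].addr = addr;
        ctxt->sc_sges[ctxt->sc_cur_sge_no].length = len;
        ctxt->sc_sges[ctxt->sc_cur_sge_no].lkey = rdma->sc_pd->local_dma_lkey;
        ctxt->sc_cur_sge_no++;
        ctxt->sc_send_wr.num_sge = ctxt->sc_cur_sge_no;
        return 0;
    }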
-
Chuck Lever authored
svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt free list. This eliminates the overhead of calling kmalloc / kfree, both of which grab a globally shared lock that disables interrupts. Introduce a replacement to svc_rdma_op_ctxt's that is built especially for the svcrdma Send path. Subsequent patches will take advantage of this new structure by allocating real resources which are then cached in these objects. The allocations are freed when the transport is torn down. I've renamed the structure so that static type checking can be used to ensure that uses of op_ctxt and send_ctxt are not confused. As an additional clean up, structure fields are renamed to conform with kernel coding conventions. Additional clean ups:
- Handle svc_rdma_send_ctxt_get allocation failure at each call site, rather than pre-allocating and hoping we guessed correctly
- All send_ctxt_put call-sites request page freeing, so remove the @free_pages argument
- All send_ctxt_put call-sites unmap SGEs, so fold that into svc_rdma_send_ctxt_put
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
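The free-list pattern, sketched with illustrative names (sc_send_lock, sc_send_ctxts, sc_list, sc_page_count); not the verbatim upstream code:

    #include <linux/sunrpc/svc_rdma.h>

    struct svc_rdma_send_ctxt *svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
    {
        struct svc_rdma_send_ctxt *ctxt;

        spin_lock(&rdma->sc_send_lock);
        ctxt = list_first_entry_or_null(&rdma->sc_send_ctxts,
                                        struct svc_rdma_send_ctxt, sc_list);
        if (ctxt)
            list_del(&ctxt->sc_list);
        spin_unlock(&rdma->sc_send_lock);

        if (!ctxt)
            return NULL;    /* slow path: caller allocates a fresh one */

        ctxt->sc_send_wr.num_sge = 0;
        ctxt->sc_cur_sge_no = 0;
        ctxt->sc_page_count = 0;
        return ctxt;
    }

    void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
                                struct svc_rdma_send_ctxt *ctxt)
    {
        /* SGE unmapping and page release are folded in here (not shown) */
        spin_lock(&rdma->sc_send_lock);
        list_add(&ctxt->sc_list, &rdma->sc_send_ctxts);
        spin_unlock(&rdma->sc_send_lock);
    }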
-
Chuck Lever authored
Clean up: Since there's already a svc_rdma_op_ctxt being passed around with the running count of mapped SGEs, drop unneeded parameters to svc_rdma_post_send_wr(). Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
Clean up: svc_rdma_dma_map_buf does mostly the same thing as svc_rdma_dma_map_page, so let's fold these together. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
The current Receive path uses an array of pages which are allocated and DMA mapped when each Receive WR is posted, and then handed off to the upper layer in rqstp::rq_arg. The page flip releases unused pages in the rq_pages pagelist. This mechanism introduces a significant amount of overhead. So instead, kmalloc the Receive buffer, and leave it DMA-mapped while the transport remains connected. This confers a number of benefits:
* Each Receive WR requires only one receive SGE, no matter how large the inline threshold is. This helps the server-side NFS/RDMA transport operate on less capable RDMA devices.
* The Receive buffer is left allocated and mapped all the time. This relieves svc_rdma_post_recv from the overhead of allocating and DMA-mapping a fresh buffer.
* svc_rdma_wc_receive no longer has to DMA unmap the Receive buffer. It has to DMA sync only the number of bytes that were received.
* svc_rdma_build_arg_xdr no longer has to free a page in rq_pages for each page in the Receive buffer, making it a constant-time function.
* The Receive buffer is now plugged directly into the rq_arg's head[0] iovec, and can be larger than a page without spilling over into rq_arg's page list. This enables simplification of the RDMA Read path in subsequent patches.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
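A hedged sketch of the one-time Receive buffer setup, with illustrative field names (rc_recv_buf, rc_recv_sge, rc_recv_wr); the buffer stays mapped until the transport is destroyed:

    #include <linux/sunrpc/svc_rdma.h>
    #include <rdma/ib_verbs.h>

    static struct svc_rdma_recv_ctxt *
    svc_rdma_recv_ctxt_alloc(struct svcxprt_rdma *rdma)
    {
        struct ib_device *dev = rdma->sc_cm_id->device;
        struct svc_rdma_recv_ctxt *ctxt;
        void *buf;

        ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
        if (!ctxt)
            return NULL;
        buf = kmalloc(rdma->sc_max_req_size, GFP_KERNEL);
        if (!buf)
            goto fail;

        ctxt->rc_recv_buf = buf;
        ctxt->rc_recv_sge.addr = ib_dma_map_single(dev, buf,
                                                   rdma->sc_max_req_size,
                                                   DMA_FROM_DEVICE);
        if (ib_dma_mapping_error(dev, ctxt->rc_recv_sge.addr))
            goto fail_free;
        ctxt->rc_recv_sge.length = rdma->sc_max_req_size;
        ctxt->rc_recv_sge.lkey = rdma->sc_pd->local_dma_lkey;

        /* One SGE per Receive WR, however large the inline threshold is */
        ctxt->rc_recv_wr.sg_list = &ctxt->rc_recv_sge;
        ctxt->rc_recv_wr.num_sge = 1;
        return ctxt;

    fail_free:
        kfree(buf);
    fail:
        kfree(ctxt);
        return NULL;
    }

On completion, only ib_dma_sync_single_for_cpu() for wc->byte_len bytes is needed before the buffer is handed to the RPC layer.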
-
Chuck Lever authored
Rather than releasing the incoming svc_rdma_recv_ctxt at the end of svc_rdma_recvfrom, hold onto it until svc_rdma_sendto. This permits the contents of the Receive buffer to be preserved through svc_process and then referenced directly in sendto as it constructs Write and Reply chunks to return to the client. The real changes will come in subsequent patches. Note: I cannot use ->xpo_release_rqst for this purpose because that is called _before_ ->xpo_sendto. svc_rdma_sendto uses information in the received Call transport header to construct the Reply transport header, which is preserved in the RPC's Receive buffer. The historical comment in svc_send() isn't helpful: it is already obvious that ->xpo_release_rqst is being called before ->xpo_sendto, but there is no explanation for this ordering going back to the beginning of the git era. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt free list. This eliminates the overhead of calling kmalloc / kfree, both of which grab a globally shared lock that disables interrupts. To reduce contention further, separate the use of these objects in the Receive and Send paths in svcrdma. Subsequent patches will take advantage of this separation by allocating real resources which are then cached in these objects. The allocations are freed when the transport is torn down. I've renamed the structure so that static type checking can be used to ensure that uses of op_ctxt and recv_ctxt are not confused. As an additional clean up, structure fields are renamed to conform with kernel coding conventions. As a final clean up, helpers related to recv_ctxt are moved closer to the functions that use them. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
This includes:
* Posting on the Send and Receive queues
* Send, Receive, Read, and Write completion
* Connect upcalls
* QP errors
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
This includes:
* Transport accept and tear-down
* Decisions about using Write and Reply chunks
* Each RDMA segment that is handled
* Whenever an RDMA_ERR is sent
As a clean-up, I've standardized the order of the includes, and removed some now redundant dprintk call sites.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- 18 Jan, 2018 1 commit
-
-
Chuck Lever authored
This change improves Receive efficiency by posting Receives only on the same CPU that handles Receive completion. Improved latency and throughput have been noted with this change.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
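One plausible shape for this, sketched with illustrative helper names (svc_rdma_post_recv, svc_rdma_queue_recv, svc_rdma_recv_ctxt_put are stand-ins, not the exact upstream calls): the replacement Receive is posted from inside the completion handler, so it runs on the CPU already servicing that CQ.

    #include <linux/sunrpc/svc_rdma.h>
    #include <linux/sunrpc/svc_xprt.h>

    static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
    {
        struct svcxprt_rdma *rdma = cq->cq_context;
        struct svc_rdma_recv_ctxt *ctxt =
            container_of(wc->wr_cqe, struct svc_rdma_recv_ctxt, rc_cqe);

        if (wc->status != IB_WC_SUCCESS)
            goto flushed;

        /* Replenish the Receive Queue on this same CPU, before the
         * received message is handed to the RPC layer. */
        if (svc_rdma_post_recv(rdma))
            goto flushed;

        ctxt->rc_byte_len = wc->byte_len;
        svc_rdma_queue_recv(rdma, ctxt);    /* hand off to svc_recv */
        return;

    flushed:
        svc_rdma_recv_ctxt_put(rdma, ctxt);
        set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
        svc_xprt_enqueue(&rdma->sc_xprt);
    }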
-
- 13 Jul, 2017 1 commit
-
-
Colin Ian King authored
The current check will always be true and will always jump to err1; this looks dubious to me. I believe && should be used instead of ||. Detected by CoverityScan, CID#1450120 ("Logically Dead Code")
Fixes: 107c1d0a ("svcrdma: Avoid Send Queue overflow")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
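The shape of the bug and the fix, as an illustration only; the specific error codes are shown for context and the essential point is the boolean logic:

    #include <linux/types.h>
    #include <linux/errno.h>

    static bool should_drop_connection(int ret)
    {
        /* Buggy form: always true, because ret cannot equal both
         * -E2BIG and -EINVAL at once, so one inequality always holds:
         *
         *     return ret != -E2BIG || ret != -EINVAL;
         */

        /* Fixed form: drop the connection only when ret is neither of
         * the chunk-overrun error codes */
        return ret != -E2BIG && ret != -EINVAL;
    }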
-
- 28 Jun, 2017 2 commits
-
-
Chuck Lever authored
Sanity case: Catch the case where more Work Requests are being posted to the Send Queue than there are Send Queue Entries. This might happen if a client sends a chunk with more segments than there are SQEs for the transport. The server can't send that reply, so the transport will deadlock unless the client drops the RPC. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
The server displays "svcrdma: failed to post Send WR (-107)" in the kernel log when the client disconnects. This could flood the server's log, so remove the message. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- 25 Apr, 2017 8 commits
-
-
Chuck Lever authored
req_maps are no longer used by the send path and can thus be removed. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
Observed at Connectathon 2017. If a client has underestimated the size of a Write or Reply chunk, the Linux server writes as much payload data as it can, then it recognizes there was a problem and closes the connection without sending the transport header. This creates a couple of problems:
<> The client never receives indication of the server-side failure, so it continues to retransmit the bad RPC. Forward progress on the transport is blocked.
<> The reply payload pages are not moved out of the svc_rqst, thus they can be released by the RPC server before the RDMA Writes have completed.
The new rdma_rw-ized helpers return a distinct error code when a Write/Reply chunk overrun occurs, so it's now easy for the caller (svc_rdma_sendto) to recognize this case. Instead of dropping the connection, post an RDMA_ERROR message. The client now sees an RDMA_ERROR and can properly terminate the RPC transaction. As part of the new logic, set up the same delayed release for these payload pages as would have occurred in the normal case.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can be refactored to reduce code duplication and remove C structure-based XDR encoding. It is also relocated to the source file that contains its only caller. This is a refactoring change only.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
The current svcrdma sendto code path posts one RDMA Write WR at a time. Each of these Writes typically carries a small number of pages (for instance, up to 30 pages for mlx4 devices). That means a 1MB NFS READ reply requires 9 ib_post_send() calls for the Write WRs, and one for the Send WR carrying the actual RPC Reply message. Instead, use the new rdma_rw API. The details of Write WR chain construction and memory registration are taken care of in the RDMA core. svcrdma can focus on the details of the RPC-over-RDMA protocol. This gives three main benefits:
1. All Write WRs for one RDMA segment are posted in a single chain. As few as one ib_post_send() for each Write chunk.
2. The Write path can now use FRWR to register the Write buffers. If the device's maximum page list depth is large, this means a single Write WR is needed for each RPC's Write chunk data.
3. The new code introduces support for RPCs that carry both a Write list and a Reply chunk. This combination can be used for an NFSv4 READ where the data payload is large, and thus is removed from the Payload Stream, but the Payload Stream is still larger than the inline threshold.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
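A hedged sketch of pushing one Write chunk segment through the rdma_rw API; the per-chunk context structure and its fields (wi_rw_ctx, wi_cqe) are illustrative stand-ins for what lives in svc_rdma_rw.c, and the caller is assumed to have set wi_cqe.done to the Write completion handler:

    #include <rdma/rw.h>
    #include <linux/sunrpc/svc_rdma.h>

    /* Illustrative per-chunk context; the real structure differs in detail */
    struct svc_rdma_write_info {
        struct rdma_rw_ctx  wi_rw_ctx;
        struct ib_cqe       wi_cqe;
    };

    static int svc_rdma_post_write_segment(struct svcxprt_rdma *rdma,
                                           struct svc_rdma_write_info *info,
                                           struct scatterlist *sgl, u32 nents,
                                           u64 remote_addr, u32 rkey)
    {
        int ret;

        /* The core maps the SG list, registers it if needed, and builds
         * the chained Write WRs for this RDMA segment. */
        ret = rdma_rw_ctx_init(&info->wi_rw_ctx, rdma->sc_qp,
                               rdma->sc_port_num, sgl, nents, 0,
                               remote_addr, rkey, DMA_TO_DEVICE);
        if (ret < 0)
            return ret;

        /* One ib_post_send() for the whole chain */
        ret = rdma_rw_ctx_post(&info->wi_rw_ctx, rdma->sc_qp,
                               rdma->sc_port_num, &info->wi_cqe, NULL);
        if (ret < 0)
            rdma_rw_ctx_destroy(&info->wi_rw_ctx, rdma->sc_qp,
                                rdma->sc_port_num, sgl, nents,
                                DMA_TO_DEVICE);
        return ret;
    }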
-
Chuck Lever authored
Replace C structure-based XDR decoding with more portable code that instead uses pointer arithmetic. This is a refactoring change only. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
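For example, decoding one RDMA segment (handle, length, 64-bit offset) with pointer arithmetic rather than casting the buffer to a wire structure; a sketch in the same spirit as the patch, not a copy of it:

    #include <linux/sunrpc/rpc_rdma.h>

    static __be32 *xdr_decode_rdma_segment(__be32 *p, u32 *handle,
                                           u32 *length, u64 *offset)
    {
        *handle = be32_to_cpup(p++);
        *length = be32_to_cpup(p++);
        *offset = (u64)be32_to_cpup(p++) << 32;   /* high 32 bits */
        *offset |= be32_to_cpup(p++);             /* low 32 bits */
        return p;
    }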
-
Chuck Lever authored
Clean up: extract the logic to save pages under I/O into a helper to add a big documenting comment without adding clutter in the send path. This is a refactoring change only. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
Introduce a helper to DMA-map a reply's transport header before sending it. This will in part replace the map vector cache. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
Clean up: Move the ib_send_wr off the stack, and move common code to post a Send Work Request into a helper. This is a refactoring change only. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- 08 Feb, 2017 2 commits
-
-
Chuck Lever authored
Replace C structure-based XDR decoding with pointer arithmetic. Pointer arithmetic is considered more portable, and is used throughout the kernel's existing XDR encoders. The gcc optimizer generates similar assembler code either way. Byte-swapping before a memory store on x86 typically results in an instruction pipeline stall. Avoid byte-swapping when encoding a new header. svcrdma currently doesn't alter a connection's credit grant value after the connection has been accepted, so it is effectively a constant. Cache the byte-swapped value in a separate field. Christoph suggested pulling the header encoding logic into the only function that uses it. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
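The cached-credit idea, as a two-line sketch (sc_fc_credits and sc_max_requests are used illustratively): swap once at accept time, then store the pre-swapped value while encoding each Reply header.

    #include <linux/sunrpc/svc_rdma.h>

    /* At transport accept time: byte-swap once */
    static void svc_rdma_cache_credits(struct svcxprt_rdma *rdma)
    {
        rdma->sc_fc_credits = cpu_to_be32(rdma->sc_max_requests);
    }

    /* While encoding each Reply's transport header: plain store, no swap */
    static __be32 *svc_rdma_encode_credits(struct svcxprt_rdma *rdma, __be32 *p)
    {
        *p++ = rdma->sc_fc_credits;
        return p;
    }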
-
Chuck Lever authored
Commit 5fdca653 ("svcrdma: Renovate sendto chunk list parsing") missed a spot. svc_rdma_xdr_get_reply_hdr_len() also assumes the Write list has only one Write chunk. There's no harm in making this code more general. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- 30 Nov, 2016 3 commits
-
-
Chuck Lever authored
No longer any need for the dprintk(). Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
svcrdma's current SQ accounting algorithm takes sc_lock and disables bottom-halves while posting all RDMA Read, Write, and Send WRs. This is relatively heavyweight serialization. And note that Write and Send are already fully serialized by the xpt_mutex. Using a single atomic_t should be all that is necessary to guarantee that ib_post_send() is called only when there is enough space on the send queue. This is what the other RDMA-enabled storage targets do. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
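A hedged sketch of the atomic_t accounting (sc_sq_avail and sc_send_wait are illustrative names, and the era-appropriate bad_wr argument is kept): reserve entries before posting, give them back if the post fails or when the WR completes, and sleep only when the Send Queue is actually full.

    #include <linux/sunrpc/svc_rdma.h>
    #include <rdma/ib_verbs.h>

    static int svc_rdma_post_chain(struct svcxprt_rdma *rdma,
                                   struct ib_send_wr *wr, int num_wrs)
    {
        struct ib_send_wr *bad_wr;
        int ret;

        while (atomic_sub_return(num_wrs, &rdma->sc_sq_avail) < 0) {
            atomic_add(num_wrs, &rdma->sc_sq_avail);
            wait_event(rdma->sc_send_wait,
                       atomic_read(&rdma->sc_sq_avail) > num_wrs);
        }

        ret = ib_post_send(rdma->sc_qp, wr, &bad_wr);
        if (ret) {
            atomic_add(num_wrs, &rdma->sc_sq_avail);
            wake_up(&rdma->sc_send_wait);
        }
        return ret;
    }

    /* From the Send/Write completion handlers: */
    static void svc_rdma_sq_put(struct svcxprt_rdma *rdma, int num_wrs)
    {
        atomic_add(num_wrs, &rdma->sc_sq_avail);
        wake_up(&rdma->sc_send_wait);
    }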
-
Chuck Lever authored
The current sendto code appears to support clients that provide only one of a Read list, a Write list, or a Reply chunk. My reading of that code is that it doesn't support the following cases:
- Read list + Write list
- Read list + Reply chunk
- Write list + Reply chunk
- Read list + Write list + Reply chunk
The protocol allows more than one Read or Write chunk in those lists. Some clients do send a Read list and Reply chunk simultaneously. NFSv4 WRITE uses a Read list for the data payload, and a Reply chunk because the GETATTR result in the reply can contain a large object like an ACL. Generalize one of the sendto code paths needed to support all of the above cases, and attempt to ensure that only one pass is done through the RPC Call's transport header to gather chunk list information for building the reply.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- 23 Sep, 2016 3 commits
-
-
Chuck Lever authored
Support Remote Invalidation. A private message is exchanged with the client upon RDMA transport connect that indicates whether Send With Invalidation may be used by the server to send RPC replies. The invalidate_rkey is arbitrarily chosen from among rkeys present in the RPC-over-RDMA header's chunk lists. Send With Invalidate improves performance only when clients can recognize, while processing an RPC reply, that an rkey has already been invalidated. That has been submitted as a separate change. In the future, the RPC-over-RDMA protocol might support Remote Invalidation properly. The protocol needs to enable signaling between peers to indicate when Remote Invalidation can be used for each individual RPC. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
Message from syslogd@klimt at Aug 18 17:00:37 ...
kernel:page:ffffea0020639b00 count:0 mapcount:0 mapping: (null) index:0x0
Aug 18 17:00:37 klimt kernel: flags: 0x2fffff80000000()
Aug 18 17:00:37 klimt kernel: page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
Aug 18 17:00:37 klimt kernel: kernel BUG at /home/cel/src/linux/linux-2.6/include/linux/mm.h:445!
Aug 18 17:00:37 klimt kernel: RIP: 0010:[<ffffffffa05c21c1>] svc_rdma_sendto+0x641/0x820 [rpcrdma]
send_reply() assigns its page argument as the first page of ctxt. On error, send_reply() already invokes svc_rdma_put_context(ctxt, 1); which does a put_page() on that very page. No need to do that again as svc_rdma_sendto exits.
Fixes: 3e1eeb98 ("svcrdma: Close connection when a send error occurs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Chuck Lever authored
The ctxt's count field is overloaded to mean the number of pages in the ctxt->page array and the number of SGEs in the ctxt->sge array. Typically these two numbers are the same. However, when an inline RPC reply is constructed from an xdr_buf with a tail iovec, the head and tail often occupy the same page, but each are DMA mapped independently. In that case, ->count equals the number of pages, but it does not equal the number of SGEs. There's one more SGE, for the tail iovec. Hence there is one more DMA mapping than there are pages in the ctxt->page array. This isn't a real problem until the server's iommu is enabled. Then each RPC reply that has content in that iovec orphans a DMA mapping that consists of real resources. krb5i and krb5p always populate that tail iovec. After a couple million sent krb5i/p RPC replies, the NFS server starts behaving erratically. Reboot is needed to clear the problem. Fixes: 9d11b51c ("svcrdma: Fix send_reply() scatter/gather set-up") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- 13 May, 2016 1 commit
-
-
Chuck Lever authored
Get a fresh op_ctxt in send_reply() instead of in svc_rdma_sendto(). This ensures that svc_rdma_put_context() is invoked only once if send_reply() fails. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- 01 Mar, 2016 1 commit
-
-
Chuck Lever authored
Calling ib_poll_cq() to sort through WCs during a completion is a common pattern amongst RDMA consumers. Since commit 14d3a3b2 ("IB: add a proper completion queue abstraction"), WC sorting can be handled by the IB core. By converting to this new API, svcrdma is made a better neighbor to other RDMA consumers, as it allows the core to schedule the delivery of completions more fairly amongst all active consumers. This new API also aims each completion at a function that is specific to the WR's opcode. Thus the ctxt->wr_op field and the switch in process_context is replaced by a set of methods that handle each completion type. Because each ib_cqe carries a pointer to a completion method, the core can now post operations on a consumer's QP, and handle the completions itself. The server's rdma_stat_sq_poll and rdma_stat_sq_prod metrics are no longer updated. As a clean up, the cq_event_handler, the dto_tasklet, and all associated locking is removed, as they are no longer referenced or used. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
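A hedged sketch of the new-API completion flow: each posted WR carries an ib_cqe whose ->done method the IB core invokes, so there is no CQ polling loop and no switch on a wr_op field. The structure and helper names below are illustrative, and the era-appropriate bad_wr argument is kept.

    #include <linux/sunrpc/svc_rdma.h>
    #include <rdma/ib_verbs.h>

    struct svc_rdma_send_op {
        struct ib_cqe cqe;
        /* pages, SGEs, and other per-WR resources elided */
    };

    static void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)
    {
        struct svc_rdma_send_op *op =
            container_of(wc->wr_cqe, struct svc_rdma_send_op, cqe);

        /* Opcode-specific handling arrives here directly */
        if (wc->status != IB_WC_SUCCESS)
            pr_err("svcrdma: Send completed with status %d\n", wc->status);
        /* release op's resources here */
    }

    static int svc_rdma_post_reply(struct svcxprt_rdma *rdma,
                                   struct svc_rdma_send_op *op,
                                   struct ib_send_wr *wr)
    {
        struct ib_send_wr *bad_wr;

        op->cqe.done = svc_rdma_wc_send;
        wr->wr_cqe = &op->cqe;    /* the core routes the WC to ->done */
        return ib_post_send(rdma->sc_qp, wr, &bad_wr);
    }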
-