- 25 Jan, 2017 4 commits
-
-
Coly Li authored
Commit 8a59f5d2 ("fs/romfs: return f_fsid for statfs(2)") generates a 64bit id from sb->s_bdev->bd_dev. This is only correct when romfs is defined with CONFIG_ROMFS_ON_BLOCK. If romfs is only defined with CONFIG_ROMFS_ON_MTD, sb->s_bdev is NULL, referencing sb->s_bdev->bd_dev will triger an oops. Richard Weinberger points out that when CONFIG_ROMFS_BACKED_BY_BOTH=y, both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD are defined. Therefore when calling huge_encode_dev() to generate a 64bit id, I use the follow order to choose parameter, - CONFIG_ROMFS_ON_BLOCK defined use sb->s_bdev->bd_dev - CONFIG_ROMFS_ON_BLOCK undefined and CONFIG_ROMFS_ON_MTD defined use sb->s_dev when, - both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD undefined leave id as 0 When CONFIG_ROMFS_ON_MTD is defined and sb->s_mtd is not NULL, sb->s_dev is set to a device ID generated by MTD_BLOCK_MAJOR and mtd index, otherwise sb->s_dev is 0. This is a try-best effort to generate a uniq file system ID, if all the above conditions are not meet, f_fsid of this romfs instance will be 0. Generally only one romfs can be built on single MTD block device, this method is enough to identify multiple romfs instances in a computer. Link: http://lkml.kernel.org/r/1482928596-115155-1-git-send-email-colyli@suse.de Signed-off-by:
Coly Li <colyli@suse.de> Reported-by:
Nong Li <nongli1031@gmail.com> Tested-by:
Nong Li <nongli1031@gmail.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Eric Dumazet authored
We have seen proc_pid_readdir() invocations holding cpu for more than 50 ms. Add a cond_resched() to be gentle with other tasks. [akpm@linux-foundation.org: coding style fix] Link: http://lkml.kernel.org/r/1484238380.15816.42.camel@edumazet-glaptop3.roam.corp.google.com Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Andrea Arcangeli authored
With >=32 CPUs the userfaultfd selftest triggered a graceful but unexpected SIGBUS because VM_FAULT_RETRY was returned by handle_userfault() despite the UFFDIO_COPY wasn't completed. This seems caused by rwsem waking the thread blocked in handle_userfault() and we can't run up_read() before the wait_event sequence is complete. Keeping the wait_even sequence identical to the first one, would require running userfaultfd_must_wait() again to know if the loop should be repeated, and it would also require retaking the rwsem and revalidating the whole vma status. It seems simpler to wait the targeted wakeup so that if false wakeups materialize we still wait for our specific wakeup event, unless of course there are signals or the uffd was released. Debug code collecting the stack trace of the wakeup showed this: $ ./userfaultfd 100 99999 nr_pages: 25600, nr_pages_per_cpu: 800 bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2 bounces: 99997, mode: rnd ver poll, Bus error (core dumped) save_stack_trace+0x2b/0x50 try_to_wake_up+0x2a6/0x580 wake_up_q+0x32/0x70 rwsem_wake+0xe0/0x120 call_rwsem_wake+0x1b/0x30 up_write+0x3b/0x40 vm_mmap_pgoff+0x9c/0xc0 SyS_mmap_pgoff+0x1a9/0x240 SyS_mmap+0x22/0x30 entry_SYSCALL_64_fastpath+0x1f/0xbd 0xffffffffffffffff FAULT_FLAG_ALLOW_RETRY missing 70 CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0xb8/0x112 handle_userfault+0x572/0x650 handle_mm_fault+0x12cb/0x1520 __do_page_fault+0x175/0x500 trace_do_page_fault+0x61/0x270 do_async_page_fault+0x19/0x90 async_page_fault+0x25/0x30 This always happens when the main userfault selftest thread is running clone() while glibc runs either mprotect or mmap (both taking mmap_sem down_write()) to allocate the thread stack of the background threads, while locking/userfault threads already run at full throttle and are susceptible to false wakeups that may cause handle_userfault() to return before than expected (which results in graceful SIGBUS at the next attempt). This was reproduced only with >=32 CPUs because the loop to start the thread where clone() is too quick with fewer CPUs, while with 32 CPUs there's already significant activity on ~32 locking and userfault threads when the last background threads are started with clone(). This >=32 CPUs SMP race condition is likely reproducible only with the selftest because of the much heavier userfault load it generates if compared to real apps. We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a patch floating around that provides it also hidden this problem but in reality only is successfully at hiding the problem. False wakeups could still happen again the second time handle_userfault() is invoked, even if it's a so rare race condition that getting false wakeups twice in a row is impossible to reproduce. This full fix is needed for correctness, the only alternative would be to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP support can stick to a strict "one more" VM_FAULT_RETRY logic (no need of returning it infinite times to avoid the SIGBUS). Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com Signed-off-by:
Andrea Arcangeli <aarcange@redhat.com> Reported-by:
Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com> Tested-by:
Mike Kravetz <mike.kravetz@oracle.com> Acked-by:
Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Ross Zwisler authored
As reported by Arnd: https://lkml.org/lkml/2017/1/10/756 Compiling with the following configuration: # CONFIG_EXT2_FS is not set # CONFIG_EXT4_FS is not set # CONFIG_XFS_FS is not set # CONFIG_FS_IOMAP depends on the above filesystems, as is not set CONFIG_FS_DAX=y generates build warnings about unused functions in fs/dax.c: fs/dax.c:878:12: warning: `dax_insert_mapping' defined but not used [-Wunused-function] static int dax_insert_mapping(struct address_space *mapping, ^~~~~~~~~~~~~~~~~~ fs/dax.c:572:12: warning: `copy_user_dax' defined but not used [-Wunused-function] static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size, ^~~~~~~~~~~~~ fs/dax.c:542:12: warning: `dax_load_hole' defined but not used [-Wunused-function] static int dax_load_hole(struct address_space *mapping, void **entry, ^~~~~~~~~~~~~ fs/dax.c:312:14: warning: `grab_mapping_entry' defined but not used [-Wunused-function] static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index, ^~~~~~~~~~~~~~~~~~ Now that the struct buffer_head based DAX fault paths and I/O path have been removed we really depend on iomap support being present for DAX. Make this explicit by selecting FS_IOMAP if we compile in DAX support. This allows us to remove conditional selections of FS_IOMAP when FS_DAX was present for ext2 and ext4, and to remove an #ifdef in fs/dax.c. Link: http://lkml.kernel.org/r/1484087383-29478-1-git-send-email-ross.zwisler@linux.intel.com Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by:
Arnd Bergmann <arnd@arndb.de> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 18 Jan, 2017 6 commits
-
-
Arnd Bergmann authored
A harmless warning just got introduced: fs/xfs/libxfs/xfs_dir2.h:40:8: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers] Removing the 'const' modifier avoids the warning and has no other effect. Fixes: 1fc4d33f ("xfs: replace xfs_mode_to_ftype table with switch statement") Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com>
-
Jeff Layton authored
sparse says: fs/ceph/mds_client.c:291:23: warning: restricted __le32 degrades to integer fs/ceph/mds_client.c:293:28: warning: restricted __le32 degrades to integer fs/ceph/mds_client.c:294:28: warning: restricted __le32 degrades to integer fs/ceph/mds_client.c:296:28: warning: restricted __le32 degrades to integer The op value is __le32, so we need to convert it before comparing it. Cc: stable@vger.kernel.org # needs backporting for < 3.14 Signed-off-by:
Jeff Layton <jlayton@redhat.com> Reviewed-by:
Sage Weil <sage@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Jeff Layton authored
sparse says: fs/ceph/inode.c:308:36: warning: incorrect type in argument 1 (different base types) fs/ceph/inode.c:308:36: expected unsigned int [unsigned] [usertype] a fs/ceph/inode.c:308:36: got restricted __le32 [usertype] frag fs/ceph/inode.c:308:46: warning: incorrect type in argument 2 (different base types) fs/ceph/inode.c:308:46: expected unsigned int [unsigned] [usertype] b fs/ceph/inode.c:308:46: got restricted __le32 [usertype] frag We need to convert these values to host-endian before calling the comparator. Fixes: a407846e ("ceph: don't assume frag tree splits in mds reply are sorted") Signed-off-by:
Jeff Layton <jlayton@redhat.com> Reviewed-by:
Sage Weil <sage@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Jeff Layton authored
sparse says: fs/ceph/dir.c:1248:50: warning: incorrect type in assignment (different base types) fs/ceph/dir.c:1248:50: expected restricted __le32 [usertype] mask fs/ceph/dir.c:1248:50: got int [signed] [assigned] mask Fixes: 200fd27c ("ceph: use lookup request to revalidate dentry") Signed-off-by:
Jeff Layton <jlayton@redhat.com> Reviewed-by:
Sage Weil <sage@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Yan, Zheng authored
Commit 5c341ee3 ("ceph: fix scheduler warning due to nested blocking") causes infinite loop when process is interrupted. Fix it. Signed-off-by:
Yan, Zheng <zyan@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Amir Goldstein authored
ovl_lookup_layer() iterates on path elements of d->name.name but also frees and allocates a new pointer for d->name.name. For the case of lookup in upper layer, the initial d->name.name pointer is stable (dentry->d_name), but for lower layers, the initial d->name.name can be d->redirect, which can be freed during iteration. [SzM] Keep the count of remaining characters in the redirect path and calculate the current position from that. This works becuase only the prefix is modified, the ending always stays the same. Fixes: 02b69b28 ("ovl: lookup redirects") Signed-off-by:
Amir Goldstein <amir73il@gmail.com> Signed-off-by:
Miklos Szeredi <mszeredi@redhat.com>
-
- 17 Jan, 2017 13 commits
-
-
Eric Sandeen authored
The GETNEXTQOTA ioctl takes whatever ID is sent in, and looks for the next active quota for an user equal or higher to that ID. But if we are at the maximum ID and then ask for the "next" one, we may wrap back to zero. In this case, userspace may loop forever, because it will start querying again at zero. We'll fix this in userspace as well, but for the kernel, return -ENOENT if we ask for the next quota ID past UINT_MAX so the caller knows to stop. Signed-off-by:
Eric Sandeen <sandeen@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com>
-
Amir Goldstein authored
Check for invalid file type in xfs_dinode_verify() and fail to load the inode structure from disk. Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Amir Goldstein <amir73il@gmail.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com>
-
Amir Goldstein authored
The helper xfs_dentry_to_name() is used by 2 different classes of callers: Callers that pass zero mode and don't care about the returned name.type field and Callers that pass non zero mode and do care about the name.type field. Change xfs_dentry_to_name() to not take the mode argument and change the call sites of the first class to not pass the mode argument. Create a new helper xfs_dentry_mode_to_name() which does pass the mode argument and returns -EFSCORRUPTED if mode is invalid. Callers that translate non zero mode to on-disk file type now check the return value and will export the error to user instead of staging an invalid file type to be written to directory entry. Signed-off-by:
Amir Goldstein <amir73il@gmail.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com>
-
Amir Goldstein authored
The size of the xfs_mode_to_ftype[] conversion table was too small to handle an invalid value of mode=S_IFMT. Instead of fixing the table size, replace the conversion table with a conversion helper that uses a switch statement. Suggested-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Amir Goldstein <amir73il@gmail.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com>
-
Amir Goldstein authored
xfs_dir2.h dereferences some data types in inline functions and fails to include those type definitions, e.g.: xfs_dir2_data_aoff_t, struct xfs_da_geometry. Signed-off-by:
Amir Goldstein <amir73il@gmail.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com>
-
Amir Goldstein authored
This changes fixes an assertion hit when fuzzing on-disk i_mode values. The easy case to fix is when changing an empty file i_mode to S_IFDIR. In this case, xfs_dinode_verify() detects an illegal zero size for directory and fails to load the inode structure from disk. For the case of non empty file whose i_mode is changed to S_IFDIR, the ASSERT() statement in xfs_dir2_isblock() is replaced with return -EFSCORRUPTED, to avoid interacting with corrupted jusk also when XFS_DEBUG is disabled. Suggested-by:
Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Amir Goldstein <amir73il@gmail.com> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com>
-
Amir Goldstein authored
The ASSERT() condition is the normal case, not the exception, so testing the condition should be likely(), not unlikely(). Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Amir Goldstein <amir73il@gmail.com> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com>
-
Richard Weinberger authored
When replaying the journal it can happen that a journal entry points to a garbage collected node. This is the case when a power-cut occurred between a garbage collect run and a commit. In such a case nodes have to be read using the failable read functions to detect whether the found node matches what we expect. One corner case was forgotten, when the journal contains an entry to remove an inode all xattrs have to be removed too. UBIFS models xattr like directory entries, so the TNC code iterates over all xattrs of the inode and removes them too. This code re-uses the functions for walking directories and calls ubifs_tnc_next_ent(). ubifs_tnc_next_ent() expects to be used only after the journal and aborts when a node does not match the expected result. This behavior can render an UBIFS volume unmountable after a power-cut when xattrs are used. Fix this issue by using failable read functions in ubifs_tnc_next_ent() too when replaying the journal. Cc: stable@vger.kernel.org Fixes: 1e51764a ("UBIFS: add new flash file system") Reported-by:
Rock Lee <rockdotlee@gmail.com> Reviewed-by:
David Gstir <david@sigma-star.at> Signed-off-by:
Richard Weinberger <richard@nod.at>
-
Eric Biggers authored
In several places, ubifs checked for an encryption key before creating a file in an encrypted directory. This was redundant with fscrypt_setup_filename() or ubifs_new_inode(), and in the case of ubifs_link() it broke linking to special files. So remove the extra checks. Signed-off-by:
Eric Biggers <ebiggers@google.com> Signed-off-by:
Richard Weinberger <richard@nod.at>
-
Eric Biggers authored
The ubifs encryption ioctls did not work when called by a 32-bit program on a 64-bit kernel. Since 'struct fscrypt_policy' is not affected by the word size, ubifs just needs to allow these ioctls through, like what ext4 and f2fs do. Signed-off-by:
Eric Biggers <ebiggers@google.com> Signed-off-by:
Richard Weinberger <richard@nod.at>
-
Arnd Bergmann authored
This came up during the v4.10 merge window: warning: (UBIFS_FS_ENCRYPTION) selects FS_ENCRYPTION which has unmet direct dependencies (BLOCK) fs/crypto/crypto.c: In function 'fscrypt_zeroout_range': fs/crypto/crypto.c:355:9: error: implicit declaration of function 'bio_alloc';did you mean 'd_alloc'? [-Werror=implicit-function-declaration] bio = bio_alloc(GFP_NOWAIT, 1); The easiest way out is to limit UBIFS_FS_ENCRYPTION to configurations that also enable BLOCK. Fixes: d475a507 ("ubifs: Add skeleton for fscrypto") Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Signed-off-by:
Richard Weinberger <richard@nod.at>
-
Peter Rosin authored
Without this, I get the following on reboot: UBIFS error (ubi1:0 pid 703): ubifs_load_znode: bad target node (type 1) length (8240) UBIFS error (ubi1:0 pid 703): ubifs_load_znode: have to be in range of 48-4144 UBIFS error (ubi1:0 pid 703): ubifs_load_znode: bad indexing node at LEB 13:11080, error 5 magic 0x6101831 crc 0xb1cb246f node_type 9 (indexing node) group_type 0 (no node group) sqnum 546 len 128 child_cnt 5 level 0 Branches: 0: LEB 14:72088 len 161 key (133, inode) 1: LEB 14:81120 len 160 key (134, inode) 2: LEB 20:26624 len 8240 key (134, data, 0) 3: LEB 14:81280 len 160 key (135, inode) 4: LEB 20:34864 len 8240 key (135, data, 0) UBIFS warning (ubi1:0 pid 703): ubifs_ro_mode.part.0: switched to read-only mode, error -22 CPU: 0 PID: 703 Comm: mount Not tainted 4.9.0-next-20161213+ #1197 Hardware name: Atmel SAMA5 [<c010d2ac>] (unwind_backtrace) from [<c010b250>] (show_stack+0x10/0x14) [<c010b250>] (show_stack) from [<c024df94>] (ubifs_jnl_update+0x2e8/0x614) [<c024df94>] (ubifs_jnl_update) from [<c0254bf8>] (ubifs_mkdir+0x160/0x204) [<c0254bf8>] (ubifs_mkdir) from [<c01a6030>] (vfs_mkdir+0xb0/0x104) [<c01a6030>] (vfs_mkdir) from [<c0286070>] (ovl_create_real+0x118/0x248) [<c0286070>] (ovl_create_real) from [<c0283ed4>] (ovl_fill_super+0x994/0xaf4) [<c0283ed4>] (ovl_fill_super) from [<c019c394>] (mount_nodev+0x44/0x9c) [<c019c394>] (mount_nodev) from [<c019c4ac>] (mount_fs+0x14/0xa4) [<c019c4ac>] (mount_fs) from [<c01b5338>] (vfs_kern_mount+0x4c/0xd4) [<c01b5338>] (vfs_kern_mount) from [<c01b6b80>] (do_mount+0x154/0xac8) [<c01b6b80>] (do_mount) from [<c01b782c>] (SyS_mount+0x74/0x9c) [<c01b782c>] (SyS_mount) from [<c0107f80>] (ret_fast_syscall+0x0/0x3c) UBIFS error (ubi1:0 pid 703): ubifs_mkdir: cannot create directory, error -22 overlayfs: failed to create directory /mnt/ovl/work/work (errno: 22); mounting read-only Fixes: 7799953b ("ubifs: Implement encrypt/decrypt for all IO") Signed-off-by:
Peter Rosin <peda@axentia.se> Tested-by:
Kevin Hilman <khilman@baylibre.com> Signed-off-by:
Richard Weinberger <richard@nod.at>
-
Colin Ian King authored
err is no longer being set on a successful return path, causing a garbage value being returned. Fix this by setting err to zero for the successful return path. Found with static analysis by CoverityScan, CID 1389473 Fixes: 7799953b ("ubifs: Implement encrypt/decrypt for all IO") Signed-off-by:
Colin Ian King <colin.king@canonical.com> Signed-off-by:
Richard Weinberger <richard@nod.at>
-
- 15 Jan, 2017 2 commits
-
-
Dave Kleikamp authored
If the last section of a core file ends with an unmapped or zero page, the size of the file does not correspond with the last dump_skip() call. gdb complains that the file is truncated and can be confusing to users. After all of the vma sections are written, make sure that the file size is no smaller than the current file position. This problem can be demonstrated with gdb's bigcore testcase on the sparc architecture. Signed-off-by:
Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Shaohua Li authored
lockdep reports a warnning. file_start_write/file_end_write only acquire/release the lock for regular files. So checking the files in aio side too. [ 453.532141] ------------[ cut here ]------------ [ 453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670 [ 453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0) [ 453.533011] Modules linked in: [ 453.533011] CPU: 1 PID: 1298 Comm: fio Not tainted 4.9.0+ #964 [ 453.533011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 [ 453.533011] ffff8803a24b7a70 ffffffff8196cffb ffff8803a24b7ae8 0000000000000000 [ 453.533011] ffff8803a24b7ab8 ffffffff81091ee1 ffff8803a5dba700 00000dba00000008 [ 453.533011] ffffed0074496f59 ffff8803a5dbaf54 ffff8803ae0f8488 fffffffffffffdef [ 453.533011] Call Trace: [ 453.533011] [<ffffffff8196cffb>] dump_stack+0x67/0x9c [ 453.533011] [<ffffffff81091ee1>] __warn+0x111/0x130 [ 453.533011] [<ffffffff81091f97>] warn_slowpath_fmt+0x97/0xb0 [ 453.533011] [<ffffffff81091f00>] ? __warn+0x130/0x130 [ 453.533011] [<ffffffff8191b789>] ? blk_finish_plug+0x29/0x60 [ 453.533011] [<ffffffff811205d4>] lock_release+0x434/0x670 [ 453.533011] [<ffffffff8198af94>] ? import_single_range+0xd4/0x110 [ 453.533011] [<ffffffff81322195>] ? rw_verify_area+0x65/0x140 [ 453.533011] [<ffffffff813aa696>] ? aio_write+0x1f6/0x280 [ 453.533011] [<ffffffff813aa6c9>] aio_write+0x229/0x280 [ 453.533011] [<ffffffff813aa4a0>] ? aio_complete+0x640/0x640 [ 453.533011] [<ffffffff8111df20>] ? debug_check_no_locks_freed+0x1a0/0x1a0 [ 453.533011] [<ffffffff8114793a>] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30 [ 453.533011] [<ffffffff81147985>] ? debug_lockdep_rcu_enabled+0x35/0x40 [ 453.533011] [<ffffffff812a92be>] ? __might_fault+0x7e/0xf0 [ 453.533011] [<ffffffff813ac9bc>] do_io_submit+0x94c/0xb10 [ 453.533011] [<ffffffff813ac2ae>] ? do_io_submit+0x23e/0xb10 [ 453.533011] [<ffffffff813ac070>] ? SyS_io_destroy+0x270/0x270 [ 453.533011] [<ffffffff8111d7b3>] ? mark_held_locks+0x23/0xc0 [ 453.533011] [<ffffffff8100201a>] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 453.533011] [<ffffffff813acb90>] SyS_io_submit+0x10/0x20 [ 453.533011] [<ffffffff824f96aa>] entry_SYSCALL_64_fastpath+0x18/0xad [ 453.533011] [<ffffffff81119190>] ? trace_hardirqs_off_caller+0xc0/0x110 [ 453.533011] ---[ end trace b2fbe664d1cc0082 ]--- Cc: Dmitry Monakhov <dmonakhov@openvz.org> Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Shaohua Li <shli@fb.com> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- 13 Jan, 2017 3 commits
-
-
Trond Myklebust authored
If the server reboots multiple times, the client should rely on the server to tell it that it cannot reclaim state as per section 9.6.3.4 in RFC7530 and section 8.4.2.1 in RFC5661. Currently, the client is being to conservative, and is assuming that if the server reboots while state recovery is in progress, then it must ignore state that was not recovered before the reboot. Signed-off-by:
Trond Myklebust <trond.myklebust@primarydata.com>
-
David Sheets authored
Commit bcb6f6d2 ("fuse: use timespec64") introduced clamped nsec values in time_to_jiffies but used the max of nsec and NSEC_PER_SEC - 1 instead of the min. Because of this, dentries would stay in the cache longer than requested and go stale in scenarios that relied on their timely eviction. Fixes: bcb6f6d2 ("fuse: use timespec64") Signed-off-by:
David Sheets <dsheets@docker.com> Signed-off-by:
Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # 4.9
-
Tahsin Erdogan authored
fuse_abort_conn() moves requests from pending list to a temporary list before canceling them. This operation races with request_wait_answer() which also tries to remove the request after it gets a fatal signal. It checks FR_PENDING flag to determine whether the request is still in the pending list. Make fuse_abort_conn() clear FR_PENDING flag so that request_wait_answer() does not remove the request from temporary list. This bug causes an Oops when trying to delete an already deleted list entry in end_requests(). Fixes: ee314a87 ("fuse: abort: no fc->lock needed for request ending") Signed-off-by:
Tahsin Erdogan <tahsin@google.com> Signed-off-by:
Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # 4.2+
-
- 12 Jan, 2017 9 commits
-
-
J. Bruce Fields authored
Oops--in 916d2d84 I moved some constants into an array for convenience, but here I'm accidentally writing to that array. The effect is that if you ever encounter a filesystem lacking support for ACLs or security labels, then all queries of supported attributes will report that attribute as unsupported from then on. Fixes: 916d2d84 "nfsd: clean up supported attribute handling" Signed-off-by:
J. Bruce Fields <bfields@redhat.com>
-
Trond Myklebust authored
Otherwise, the attribute cache remains marked as being expired. Signed-off-by:
Trond Myklebust <trond.myklebust@primarydata.com>
-
Trond Myklebust authored
If the unlink wasn't successful, then the directory has presumably not changed. Signed-off-by:
Trond Myklebust <trond.myklebust@primarydata.com>
-
Trond Myklebust authored
If a file is renamed, but stays in the same directory, we will still receive 2 change_info4 structures describing the change to that directory, but we only want to apply it once. Signed-off-by:
Trond Myklebust <trond.myklebust@primarydata.com>
-
Trond Myklebust authored
We don't want to invalidate the directory attribute and data cache unless we know that a file was created, or the change attribute differs from the one in our cache. Signed-off-by:
Trond Myklebust <trond.myklebust@primarydata.com>
-
Geng, Jichao authored
For no snapshot case, we should use ci->truncate_{seq,size}. Fixes: 5f743e45 ("ceph: record truncate size/seq for snap data writeback") Signed-off-by:
Geng, Jichao <geng.jichao@h3c.com> Signed-off-by:
Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
We should apply the check after getting the initial mdsmap. Fixes: e9e427f0 ("ceph: check availability of mds cluster on mount") Link: http://tracker.ceph.com/issues/18161 Signed-off-by:
Yan, Zheng <zyan@redhat.com>
-
Benjamin Coddington authored
I have reports of a crash that look like __fput() was called twice for a NFSv4.0 file. It seems possible that the state manager could try to reclaim a lock and take a reference on the fl->fl_file at the same time the file is being released if, during the close(), a signal interrupts the wait for outstanding IO while removing locks which then skips the removal of that lock. Since 83bfff23 ("nfs4: have do_vfs_lock take an inode pointer") has removed the need to traverse fl->fl_file->f_inode in nfs4_lock_done(), taking that reference is no longer necessary. Signed-off-by:
Benjamin Coddington <bcodding@redhat.com> Reviewed-by:
Jeff Layton <jlayton@redhat.com> Signed-off-by:
Trond Myklebust <trond.myklebust@primarydata.com>
-
Damien Le Moal authored
All block device data fields and functions returning a number of 512B sectors are by convention named xxx_sectors while names in the form xxx_size are generally used for a number of bytes. The blk_queue_zone_size and bdev_zone_size functions were not following this convention so rename them. No functional change is introduced by this patch. Signed-off-by:
Damien Le Moal <damien.lemoal@wdc.com> Collapsed the two patches, they were nonsensically split and broke bisection. Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- 11 Jan, 2017 3 commits
-
-
Jan Kara authored
Commit 99579cce "xfs: skip dirty pages in ->releasepage()" started to skip dirty pages in xfs_vm_releasepage() which also has the effect that if a dirty page is truncated, it does not get freed by block_invalidatepage() and is lingering in LRU list waiting for reclaim. So a simple loop like: while true; do dd if=/dev/zero of=file bs=1M count=100 rm file done will keep using more and more memory until we hit low watermarks and start pagecache reclaim which will eventually reclaim also the truncate pages. Keeping these truncated (and thus never usable) pages in memory is just a waste of memory, is unnecessarily stressing page cache reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM prematurely. So instead of just skipping dirty pages in xfs_vm_releasepage(), return to old behavior of skipping them only if they have delalloc or unwritten buffers and fix the spurious warnings by warning only if the page is clean. CC: stable@vger.kernel.org CC: Brian Foster <bfoster@redhat.com> CC: Vlastimil Babka <vbabka@suse.cz> Reported-by:
Petr Tůma <petr.tuma@d3s.mff.cuni.cz> Fixes: 99579cce Signed-off-by:
Jan Kara <jack@suse.cz> Reviewed-by:
Brian Foster <bfoster@redhat.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com>
-
Eric Ren authored
The crash happens rather often when we reset some cluster nodes while nodes contend fiercely to do truncate and append. The crash backtrace is below: dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18) ocfs2: End replay journal (node 318952601, slot 2) on device (253,18) ocfs2: Beginning quota recovery on device (253,18) for slot 2 ocfs2: Finishing quota recovery on device (253,18) for slot 2 (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode) (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1 ------------[ cut here ]------------ kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470! invalid opcode: 0000 [#1] SMP Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 Supported: No, Unsupported modules are loaded CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014 task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000 RIP: 0010:[<ffffffffa05c8c30>] [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2] RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282 RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246 RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414 R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448 R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020 FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0 Call Trace: ocfs2_setattr+0x698/0xa90 [ocfs2] notify_change+0x1ae/0x380 do_truncate+0x5e/0x90 do_sys_ftruncate.constprop.11+0x108/0x160 entry_SYSCALL_64_fastpath+0x12/0x6d Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff RIP [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2] It's because ocfs2_inode_lock() get us stale LVB in which the i_size is not equal to the disk i_size. We mistakenly trust the LVB because the underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with DLM_SBF_VALNOTVALID properly for us. But, why? The current code tries to downconvert lock without DLM_LKF_VALBLK flag to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even if the lock resource type needs LVB. This is not the right way for fsdlm. The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node failure happens. The following diagram briefly illustrates how this crash happens: RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB; The 1st round: Node1 Node2 RSB1: PR RSB1(master): NULL->EX ocfs2_downconvert_lock(PR->NULL, set_lvb==0) ocfs2_dlm_lock(no DLM_LKF_VALBLK) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - dlm_lock(no DLM_LKF_VALBLK) convert_lock(overwrite lkb->lkb_exflags with no DLM_LKF_VALBLK) RSB1: NULL RSB1: EX reset Node2 dlm_recover_rsbs() recover_lvb() /* The LVB is not trustable if the node with EX fails and * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1. */ if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to return; * to invalid the LVB here. */ The 2nd round: Node 1 Node2 RSB1(become master from recovery) ocfs2_setattr() ocfs2_inode_lock(NULL->EX) /* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */ ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */ ocfs2_truncate_file() mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */ The fix is quite straightforward. We keep to set DLM_LKF_VALBLK flag for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin is uesed. Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com Signed-off-by:
Eric Ren <zren@suse.com> Reviewed-by:
Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Ross Zwisler authored
Currently dax_mapping_entry_mkclean() fails to clean and write protect the pmd_t of a DAX PMD entry during an *sync operation. This can result in data loss in the following sequence: 1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the pmd_t dirty and writeable 2) fsync, flushing out PMD data and cleaning the radix tree entry. We currently fail to mark the pmd_t as clean and write protected. 3) more mmap writes to the PMD. These don't cause any page faults since the pmd_t is dirty and writeable. The radix tree entry remains clean. 4) fsync, which fails to flush the dirty PMD data because the radix tree entry was clean. 5) crash - dirty data that should have been fsync'd as part of 4) could still have been in the processor cache, and is lost. Fix this by marking the pmd_t clean and write protected in dax_mapping_entry_mkclean(), which is called as part of the fsync operation 2). This will cause the writes in step 3) above to generate page faults where we'll re-dirty the PMD radix tree entry, resulting in flushes in the fsync that happens in step 4). Fixes: 4b4bb46d ("dax: clear dirty entry tags on cache flush") Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-