1. 25 Jan, 2017 4 commits
    • Coly Li's avatar
      romfs: use different way to generate fsid for BLOCK or MTD · f598f82e
      Coly Li authored
      Commit 8a59f5d2 ("fs/romfs: return f_fsid for statfs(2)") generates
      a 64bit id from sb->s_bdev->bd_dev.  This is only correct when romfs is
      defined with CONFIG_ROMFS_ON_BLOCK.  If romfs is only defined with
      CONFIG_ROMFS_ON_MTD, sb->s_bdev is NULL, referencing sb->s_bdev->bd_dev
      will triger an oops.
      
      Richard Weinberger points out that when CONFIG_ROMFS_BACKED_BY_BOTH=y,
      both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD are defined.
      Therefore when calling huge_encode_dev() to generate a 64bit id, I use
      the follow order to choose parameter,
      
      - CONFIG_ROMFS_ON_BLOCK defined
        use sb->s_bdev->bd_dev
      - CONFIG_ROMFS_ON_BLOCK undefined and CONFIG_ROMFS_ON_MTD defined
        use sb->s_dev when,
      - both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD undefined
        leave id as 0
      
      When CONFIG_ROMFS_ON_MTD is defined and sb->s_mtd is not NULL, sb->s_dev
      is set to a device ID generated by MTD_BLOCK_MAJOR and mtd index,
      otherwise sb->s_dev is 0.
      
      This is a try-best effort to generate a uniq file system ID, if all the
      above conditions are not meet, f_fsid of this romfs instance will be 0.
      Generally only one romfs can be built on single MTD block device, this
      method is enough to identify multiple romfs instances in a computer.
      
      Link: http://lkml.kernel.org/r/1482928596-115155-1-git-send-email-colyli@suse.de
      
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Reported-by: default avatarNong Li <nongli1031@gmail.com>
      Tested-by: default avatarNong Li <nongli1031@gmail.com>
      Cc: Richard Weinberger <richard.weinberger@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f598f82e
    • Eric Dumazet's avatar
      proc: add a schedule point in proc_pid_readdir() · 3ba4bcee
      Eric Dumazet authored
      We have seen proc_pid_readdir() invocations holding cpu for more than 50
      ms.  Add a cond_resched() to be gentle with other tasks.
      
      [akpm@linux-foundation.org: coding style fix]
      Link: http://lkml.kernel.org/r/1484238380.15816.42.camel@edumazet-glaptop3.roam.corp.google.com
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3ba4bcee
    • Andrea Arcangeli's avatar
      userfaultfd: fix SIGBUS resulting from false rwsem wakeups · 15a77c6f
      Andrea Arcangeli authored
      With >=32 CPUs the userfaultfd selftest triggered a graceful but
      unexpected SIGBUS because VM_FAULT_RETRY was returned by
      handle_userfault() despite the UFFDIO_COPY wasn't completed.
      
      This seems caused by rwsem waking the thread blocked in
      handle_userfault() and we can't run up_read() before the wait_event
      sequence is complete.
      
      Keeping the wait_even sequence identical to the first one, would require
      running userfaultfd_must_wait() again to know if the loop should be
      repeated, and it would also require retaking the rwsem and revalidating
      the whole vma status.
      
      It seems simpler to wait the targeted wakeup so that if false wakeups
      materialize we still wait for our specific wakeup event, unless of
      course there are signals or the uffd was released.
      
      Debug code collecting the stack trace of the wakeup showed this:
      
        $ ./userfaultfd 100 99999
        nr_pages: 25600, nr_pages_per_cpu: 800
        bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
        bounces: 99997, mode: rnd ver poll, Bus error (core dumped)
      
          save_stack_trace+0x2b/0x50
          try_to_wake_up+0x2a6/0x580
          wake_up_q+0x32/0x70
          rwsem_wake+0xe0/0x120
          call_rwsem_wake+0x1b/0x30
          up_write+0x3b/0x40
          vm_mmap_pgoff+0x9c/0xc0
          SyS_mmap_pgoff+0x1a9/0x240
          SyS_mmap+0x22/0x30
          entry_SYSCALL_64_fastpath+0x1f/0xbd
          0xffffffffffffffff
          FAULT_FLAG_ALLOW_RETRY missing 70
        CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G        W       4.8.0+ #30
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
        Call Trace:
          dump_stack+0xb8/0x112
          handle_userfault+0x572/0x650
          handle_mm_fault+0x12cb/0x1520
          __do_page_fault+0x175/0x500
          trace_do_page_fault+0x61/0x270
          do_async_page_fault+0x19/0x90
          async_page_fault+0x25/0x30
      
      This always happens when the main userfault selftest thread is running
      clone() while glibc runs either mprotect or mmap (both taking mmap_sem
      down_write()) to allocate the thread stack of the background threads,
      while locking/userfault threads already run at full throttle and are
      susceptible to false wakeups that may cause handle_userfault() to return
      before than expected (which results in graceful SIGBUS at the next
      attempt).
      
      This was reproduced only with >=32 CPUs because the loop to start the
      thread where clone() is too quick with fewer CPUs, while with 32 CPUs
      there's already significant activity on ~32 locking and userfault
      threads when the last background threads are started with clone().
      
      This >=32 CPUs SMP race condition is likely reproducible only with the
      selftest because of the much heavier userfault load it generates if
      compared to real apps.
      
      We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
      patch floating around that provides it also hidden this problem but in
      reality only is successfully at hiding the problem.
      
      False wakeups could still happen again the second time
      handle_userfault() is invoked, even if it's a so rare race condition
      that getting false wakeups twice in a row is impossible to reproduce.
      This full fix is needed for correctness, the only alternative would be
      to allow VM_FAULT_RETRY to be returned infinitely.  With this fix the WP
      support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
      of returning it infinite times to avoid the SIGBUS).
      
      Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarShubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
      Tested-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      15a77c6f
    • Ross Zwisler's avatar
      dax: fix build warnings with FS_DAX and !FS_IOMAP · 6affb9d7
      Ross Zwisler authored
      As reported by Arnd:
      
        https://lkml.org/lkml/2017/1/10/756
      
      Compiling with the following configuration:
      
        # CONFIG_EXT2_FS is not set
        # CONFIG_EXT4_FS is not set
        # CONFIG_XFS_FS is not set
        # CONFIG_FS_IOMAP depends on the above filesystems, as is not set
        CONFIG_FS_DAX=y
      
      generates build warnings about unused functions in fs/dax.c:
      
        fs/dax.c:878:12: warning: `dax_insert_mapping' defined but not used [-Wunused-function]
         static int dax_insert_mapping(struct address_space *mapping,
                    ^~~~~~~~~~~~~~~~~~
        fs/dax.c:572:12: warning: `copy_user_dax' defined but not used [-Wunused-function]
         static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
                    ^~~~~~~~~~~~~
        fs/dax.c:542:12: warning: `dax_load_hole' defined but not used [-Wunused-function]
         static int dax_load_hole(struct address_space *mapping, void **entry,
                    ^~~~~~~~~~~~~
        fs/dax.c:312:14: warning: `grab_mapping_entry' defined but not used [-Wunused-function]
         static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
                      ^~~~~~~~~~~~~~~~~~
      
      Now that the struct buffer_head based DAX fault paths and I/O path have
      been removed we really depend on iomap support being present for DAX.
      Make this explicit by selecting FS_IOMAP if we compile in DAX support.
      
      This allows us to remove conditional selections of FS_IOMAP when FS_DAX
      was present for ext2 and ext4, and to remove an #ifdef in fs/dax.c.
      
      Link: http://lkml.kernel.org/r/1484087383-29478-1-git-send-email-ross.zwisler@linux.intel.com
      
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6affb9d7
  2. 18 Jan, 2017 6 commits
  3. 17 Jan, 2017 13 commits
  4. 15 Jan, 2017 2 commits
    • Dave Kleikamp's avatar
      coredump: Ensure proper size of sparse core files · 4d22c75d
      Dave Kleikamp authored
      
      If the last section of a core file ends with an unmapped or zero page,
      the size of the file does not correspond with the last dump_skip() call.
      gdb complains that the file is truncated and can be confusing to users.
      
      After all of the vma sections are written, make sure that the file size
      is no smaller than the current file position.
      
      This problem can be demonstrated with gdb's bigcore testcase on the
      sparc architecture.
      Signed-off-by: default avatarDave Kleikamp <dave.kleikamp@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      4d22c75d
    • Shaohua Li's avatar
      aio: fix lock dep warning · a12f1ae6
      Shaohua Li authored
      
      lockdep reports a warnning. file_start_write/file_end_write only
      acquire/release the lock for regular files. So checking the files in aio
      side too.
      
      [  453.532141] ------------[ cut here ]------------
      [  453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670
      [  453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0)
      [  453.533011] Modules linked in:
      [  453.533011] CPU: 1 PID: 1298 Comm: fio Not tainted 4.9.0+ #964
      [  453.533011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
      [  453.533011]  ffff8803a24b7a70 ffffffff8196cffb ffff8803a24b7ae8 0000000000000000
      [  453.533011]  ffff8803a24b7ab8 ffffffff81091ee1 ffff8803a5dba700 00000dba00000008
      [  453.533011]  ffffed0074496f59 ffff8803a5dbaf54 ffff8803ae0f8488 fffffffffffffdef
      [  453.533011] Call Trace:
      [  453.533011]  [<ffffffff8196cffb>] dump_stack+0x67/0x9c
      [  453.533011]  [<ffffffff81091ee1>] __warn+0x111/0x130
      [  453.533011]  [<ffffffff81091f97>] warn_slowpath_fmt+0x97/0xb0
      [  453.533011]  [<ffffffff81091f00>] ? __warn+0x130/0x130
      [  453.533011]  [<ffffffff8191b789>] ? blk_finish_plug+0x29/0x60
      [  453.533011]  [<ffffffff811205d4>] lock_release+0x434/0x670
      [  453.533011]  [<ffffffff8198af94>] ? import_single_range+0xd4/0x110
      [  453.533011]  [<ffffffff81322195>] ? rw_verify_area+0x65/0x140
      [  453.533011]  [<ffffffff813aa696>] ? aio_write+0x1f6/0x280
      [  453.533011]  [<ffffffff813aa6c9>] aio_write+0x229/0x280
      [  453.533011]  [<ffffffff813aa4a0>] ? aio_complete+0x640/0x640
      [  453.533011]  [<ffffffff8111df20>] ? debug_check_no_locks_freed+0x1a0/0x1a0
      [  453.533011]  [<ffffffff8114793a>] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30
      [  453.533011]  [<ffffffff81147985>] ? debug_lockdep_rcu_enabled+0x35/0x40
      [  453.533011]  [<ffffffff812a92be>] ? __might_fault+0x7e/0xf0
      [  453.533011]  [<ffffffff813ac9bc>] do_io_submit+0x94c/0xb10
      [  453.533011]  [<ffffffff813ac2ae>] ? do_io_submit+0x23e/0xb10
      [  453.533011]  [<ffffffff813ac070>] ? SyS_io_destroy+0x270/0x270
      [  453.533011]  [<ffffffff8111d7b3>] ? mark_held_locks+0x23/0xc0
      [  453.533011]  [<ffffffff8100201a>] ? trace_hardirqs_on_thunk+0x1a/0x1c
      [  453.533011]  [<ffffffff813acb90>] SyS_io_submit+0x10/0x20
      [  453.533011]  [<ffffffff824f96aa>] entry_SYSCALL_64_fastpath+0x18/0xad
      [  453.533011]  [<ffffffff81119190>] ? trace_hardirqs_off_caller+0xc0/0x110
      [  453.533011] ---[ end trace b2fbe664d1cc0082 ]---
      
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      a12f1ae6
  5. 13 Jan, 2017 3 commits
  6. 12 Jan, 2017 9 commits
  7. 11 Jan, 2017 3 commits
    • Jan Kara's avatar
      xfs: Timely free truncated dirty pages · 0a417b8d
      Jan Kara authored
      Commit 99579cce
      
       "xfs: skip dirty pages in ->releasepage()" started
      to skip dirty pages in xfs_vm_releasepage() which also has the effect
      that if a dirty page is truncated, it does not get freed by
      block_invalidatepage() and is lingering in LRU list waiting for reclaim.
      So a simple loop like:
      
      while true; do
      	dd if=/dev/zero of=file bs=1M count=100
      	rm file
      done
      
      will keep using more and more memory until we hit low watermarks and
      start pagecache reclaim which will eventually reclaim also the truncate
      pages. Keeping these truncated (and thus never usable) pages in memory
      is just a waste of memory, is unnecessarily stressing page cache
      reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
      prematurely.
      
      So instead of just skipping dirty pages in xfs_vm_releasepage(), return
      to old behavior of skipping them only if they have delalloc or unwritten
      buffers and fix the spurious warnings by warning only if the page is
      clean.
      
      CC: stable@vger.kernel.org
      CC: Brian Foster <bfoster@redhat.com>
      CC: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: default avatarPetr Tůma <petr.tuma@d3s.mff.cuni.cz>
      Fixes: 99579cce
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      0a417b8d
    • Eric Ren's avatar
      ocfs2: fix crash caused by stale lvb with fsdlm plugin · e7ee2c08
      Eric Ren authored
      The crash happens rather often when we reset some cluster nodes while
      nodes contend fiercely to do truncate and append.
      
      The crash backtrace is below:
      
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
         ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: Beginning quota recovery on device (253,18) for slot 2
         ocfs2: Finishing quota recovery on device (253,18) for slot 2
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
         ------------[ cut here ]------------
         kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
         invalid opcode: 0000 [#1] SMP
         Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod    iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport      joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix               drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd       usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
         Supported: No, Unsupported modules are loaded
         CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
         task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
         RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
         RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
         RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
         RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
         RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
         R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
         R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
         FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
         CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
         Call Trace:
           ocfs2_setattr+0x698/0xa90 [ocfs2]
           notify_change+0x1ae/0x380
           do_truncate+0x5e/0x90
           do_sys_ftruncate.constprop.11+0x108/0x160
           entry_SYSCALL_64_fastpath+0x12/0x6d
         Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
         RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
      
      It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
      not equal to the disk i_size.  We mistakenly trust the LVB because the
      underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
      DLM_SBF_VALNOTVALID properly for us.  But, why?
      
      The current code tries to downconvert lock without DLM_LKF_VALBLK flag
      to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
      if the lock resource type needs LVB.  This is not the right way for
      fsdlm.
      
      The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
      DLM_LKF_VALBLK to decide if we care about the LVB in the LKB.  If
      DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
      this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
      failure happens.
      
      The following diagram briefly illustrates how this crash happens:
      
      RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
      
      The 1st round:
      
                   Node1                                    Node2
      RSB1: PR
                                                        RSB1(master): NULL->EX
      ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
        ocfs2_dlm_lock(no DLM_LKF_VALBLK)
      
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      
      dlm_lock(no DLM_LKF_VALBLK)
        convert_lock(overwrite lkb->lkb_exflags
                     with no DLM_LKF_VALBLK)
      
      RSB1: NULL                                        RSB1: EX
                                                        reset Node2
      dlm_recover_rsbs()
        recover_lvb()
      
      /* The LVB is not trustable if the node with EX fails and
       * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
       */
      
       if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
                 return;                   * to invalid the LVB here.
                                           */
      
      The 2nd round:
      
               Node 1                                Node2
      RSB1(become master from recovery)
      
      ocfs2_setattr()
        ocfs2_inode_lock(NULL->EX)
          /* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
          ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
        ocfs2_truncate_file()
            mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */
      
      The fix is quite straightforward.  We keep to set DLM_LKF_VALBLK flag
      for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
      is uesed.
      
      Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
      
      Signed-off-by: default avatarEric Ren <zren@suse.com>
      Reviewed-by: default avatarJoseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e7ee2c08
    • Ross Zwisler's avatar
      dax: wrprotect pmd_t in dax_mapping_entry_mkclean · f729c8c9
      Ross Zwisler authored
      Currently dax_mapping_entry_mkclean() fails to clean and write protect
      the pmd_t of a DAX PMD entry during an *sync operation.  This can result
      in data loss in the following sequence:
      
      1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
         pmd_t dirty and writeable
      2) fsync, flushing out PMD data and cleaning the radix tree entry. We
         currently fail to mark the pmd_t as clean and write protected.
      3) more mmap writes to the PMD.  These don't cause any page faults since
         the pmd_t is dirty and writeable.  The radix tree entry remains clean.
      4) fsync, which fails to flush the dirty PMD data because the radix tree
         entry was clean.
      5) crash - dirty data that should have been fsync'd as part of 4) could
         still have been in the processor cache, and is lost.
      
      Fix this by marking the pmd_t clean and write protected in
      dax_mapping_entry_mkclean(), which is called as part of the fsync
      operation 2).  This will cause the writes in step 3) above to generate
      page faults where we'll re-dirty the PMD radix tree entry, resulting in
      flushes in the fsync that happens in step 4).
      
      Fixes: 4b4bb46d ("dax: clear dirty entry tags on cache flush")
      Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com
      
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f729c8c9