Merge drm-upstream/drm-next into drm-intel-next-queued

Need MST sideband message transaction to power up/down nodes.

Signed-off-by: Jani Nikula <jani.nikula@intel.com>

32f35b86 · Jani Nikula · ae7617f0 · 754270c7 · 32f35b86 · 32f35b86
Commit 32f35b86 authored 7 years ago by Jani Nikula
20 changed files
--- a/CREDITS
+++ b/CREDITS
@@ -2090,7 +2090,7 @@ S: Kuala Lumpur, Malaysia

 N: Mohit Kumar
 D: ST Microelectronics SPEAr13xx PCI host bridge driver
-D: Synopsys Designware PCI host bridge driver
+D: Synopsys DesignWare PCI host bridge driver

 N: Gabor Kuti
 E: seasons@falcon.sch.bme.hu
@@ -2606,11 +2606,9 @@ E: tmolina@cablespeed.com
 D: bug fixes, documentation, minor hackery

 N: Paul Moore
-E: paul.moore@hp.com
-D: NetLabel author
-S: Hewlett-Packard
-S: 110 Spit Brook Road
-S: Nashua, NH 03062
+E: paul@paul-moore.com
+W: http://www.paul-moore.com
+D: NetLabel, SELinux, audit

 N: James Morris
 E: jmorris@namei.org

--- a/Documentation/ABI/stable/sysfs-bus-nvmem
+++ b/Documentation/ABI/stable/sysfs-bus-nvmem
+What:		/sys/bus/nvmem/devices/.../nvmem
+Date:		July 2015
+KernelVersion:  4.2
+Contact:	Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
+Description:
+		This file allows the user to read/write the raw NVMEM contents.
+		Permission to write to this file depends on the nvmem
+		provider configuration.
+
+		ex:
+		hexdump /sys/bus/nvmem/devices/qfprom0/nvmem
+
+		0000000 0000 0000 0000 0000 0000 0000 0000 0000
+		*
+		00000a0 db10 2240 0000 e000 0c00 0c00 0000 0c00
+		0000000 0000 0000 0000 0000 0000 0000 0000 0000
+		...
+		*
+		0001000
--- a/Documentation/ABI/stable/sysfs-driver-dma-ioatdma
+++ b/Documentation/ABI/stable/sysfs-driver-dma-ioatdma
+What:           sys/devices/pciXXXX:XX/0000:XX:XX.X/dma/dma<n>chan<n>/quickdata/cap
+Date:           December 3, 2009
+KernelVersion:  2.6.32
+Contact:        dmaengine@vger.kernel.org
+Description:	Capabilities the DMA supports. Currently there are DMA_PQ, DMA_PQ_VAL,
+		DMA_XOR, DMA_XOR_VAL, DMA_INTERRUPT.
+
+What:           sys/devices/pciXXXX:XX/0000:XX:XX.X/dma/dma<n>chan<n>/quickdata/ring_active
+Date:           December 3, 2009
+KernelVersion:  2.6.32
+Contact:        dmaengine@vger.kernel.org
+Description:	The number of descriptors active in the ring.
+
+What:           sys/devices/pciXXXX:XX/0000:XX:XX.X/dma/dma<n>chan<n>/quickdata/ring_size
+Date:           December 3, 2009
+KernelVersion:  2.6.32
+Contact:        dmaengine@vger.kernel.org
+Description:	Descriptor ring size, total number of descriptors available.
+
+What:           sys/devices/pciXXXX:XX/0000:XX:XX.X/dma/dma<n>chan<n>/quickdata/version
+Date:           December 3, 2009
+KernelVersion:  2.6.32
+Contact:        dmaengine@vger.kernel.org
+Description:	Version of ioatdma device.
+
+What:           sys/devices/pciXXXX:XX/0000:XX:XX.X/dma/dma<n>chan<n>/quickdata/intr_coalesce
+Date:           August 8, 2017
+KernelVersion:  4.14
+Contact:        dmaengine@vger.kernel.org
+Description:	Tunable interrupt delay value, set on a per-channel basis.
--- a/Documentation/ABI/testing/configfs-usb-gadget-rndis
+++ b/Documentation/ABI/testing/configfs-usb-gadget-rndis
@@ -12,3 +12,6 @@ Description:
 				Ethernet over USB link
 		dev_addr	- MAC address of device's end of this
 				Ethernet over USB link
+		class		- USB interface class, default is 02 (hex)
+		subclass	- USB interface subclass, default is 06 (hex)
+		protocol	- USB interface protocol, default is 00 (hex)
--- a/Documentation/ABI/testing/ppc-memtrace
+++ b/Documentation/ABI/testing/ppc-memtrace
+What:		/sys/kernel/debug/powerpc/memtrace
+Date:		Aug 2017
+KernelVersion:	4.14
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	This folder contains the relevant debugfs files for the
+		hardware trace macro to use. CONFIG_PPC64_HARDWARE_TRACING
+		must be set.
+
+What:		/sys/kernel/debug/powerpc/memtrace/enable
+Date:		Aug 2017
+KernelVersion:	4.14
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	Write an integer containing the size in bytes of the memory
+		you want removed from each NUMA node to this file - it must be
+		aligned to the memblock size. This amount of RAM will be removed
+		from the kernel mappings and the following debugfs files will be
+		created. This can only be successfully done once per boot. Once
+		memory is successfully removed from each node, the following
+		files are created.
+
+What:		/sys/kernel/debug/powerpc/memtrace/<node-id>
+Date:		Aug 2017
+KernelVersion:	4.14
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	This directory contains information about the removed memory
+		from the specific NUMA node.
+
+What:		/sys/kernel/debug/powerpc/memtrace/<node-id>/size
+Date:		Aug 2017
+KernelVersion:	4.14
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	This contains the size of the memory removed from the node.
+
+What:		/sys/kernel/debug/powerpc/memtrace/<node-id>/start
+Date:		Aug 2017
+KernelVersion:	4.14
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	This contains the start address of the removed memory.
+
+What:		/sys/kernel/debug/powerpc/memtrace/<node-id>/trace
+Date:		Aug 2017
+KernelVersion:	4.14
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	This is where the hardware trace macro will output the trace
+		it generates.
--- a/Documentation/ABI/testing/procfs-smaps_rollup
+++ b/Documentation/ABI/testing/procfs-smaps_rollup
+What:		/proc/pid/smaps_rollup
+Date:		August 2017
+Contact:	Daniel Colascione <dancol@google.com>
+Description:
+		This file provides pre-summed memory information for a
+		process.  The format is identical to /proc/pid/smaps,
+		except instead of an entry for each VMA in a process,
+		smaps_rollup has a single entry (tagged "[rollup]")
+		for which each field is the sum of the corresponding
+		fields from all the maps in /proc/pid/smaps.
+		For more details, see the procfs man page.
+
+		Typical output looks like this:
+
+		00100000-ff709000 ---p 00000000 00:00 0		 [rollup]
+		Rss:		     884 kB
+		Pss:		     385 kB
+		Shared_Clean:	     696 kB
+		Shared_Dirty:	       0 kB
+		Private_Clean:	     120 kB
+		Private_Dirty:	      68 kB
+		Referenced:	     884 kB
+		Anonymous:	      68 kB
+		LazyFree:	       0 kB
+		AnonHugePages:	       0 kB
+		ShmemPmdMapped:	       0 kB
+		Shared_Hugetlb:	       0 kB
+		Private_Hugetlb:       0 kB
+		Swap:		       0 kB
+		SwapPss:	       0 kB
+		Locked:		     385 kB
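Reviewer note: since smaps_rollup uses the same "Field: value kB" layout as smaps, a reader can be sketched in a few lines of Python. The function name parse_smaps_rollup and the SAMPLE string are illustrative assumptions, not part of the patch; the sample values are copied from the typical output above.

```python
def parse_smaps_rollup(text):
    """Return {field_name: kilobytes} for smaps_rollup-style text."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Field lines look like "Rss: 884 kB"; the "[rollup]" header does not.
        if len(parts) == 3 and parts[0].endswith(":") and parts[2] == "kB":
            fields[parts[0][:-1]] = int(parts[1])
    return fields


# Sample copied from the documentation text above (subset of fields).
SAMPLE = """\
00100000-ff709000 ---p 00000000 00:00 0 [rollup]
Rss:                 884 kB
Pss:                 385 kB
Shared_Clean:        696 kB
Shared_Dirty:          0 kB
Private_Clean:       120 kB
Private_Dirty:        68 kB
"""

if __name__ == "__main__":
    info = parse_smaps_rollup(SAMPLE)
    print(info["Rss"], info["Pss"])
```

On a kernel that provides the file, the same function could be fed the contents of /proc/<pid>/smaps_rollup read as text.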
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -90,3 +90,11 @@ Description:
 		device's debugging info useful for kernel developers. Its
 		format is not documented intentionally and may change
 		anytime without any notice.
+
+What:		/sys/block/zram<id>/backing_dev
+Date:		June 2017
+Contact:	Minchan Kim <minchan@kernel.org>
+Description:
+		The backing_dev file is read-write and sets up the backing
+		device that zram writes incompressible pages to.
+		To use it, the user should enable CONFIG_ZRAM_WRITEBACK.
--- a/Documentation/ABI/testing/sysfs-bus-iio
+++ b/Documentation/ABI/testing/sysfs-bus-iio
@@ -119,6 +119,15 @@ Description:
 		unique to allow association with event codes. Units after
 		application of scale and offset are milliamps.

+What:		/sys/bus/iio/devices/iio:deviceX/in_powerY_raw
+KernelVersion:	4.5
+Contact:	linux-iio@vger.kernel.org
+Description:
+		Raw (unscaled no bias removal etc.) power measurement from
+		channel Y. The number must always be specified and
+		unique to allow association with event codes. Units after
+		application of scale and offset are milliwatts.
+
 What:		/sys/bus/iio/devices/iio:deviceX/in_capacitanceY_raw
 KernelVersion:	3.2
 Contact:	linux-iio@vger.kernel.org

--- a/Documentation/ABI/testing/sysfs-bus-iio-lptimer-stm32
+++ b/Documentation/ABI/testing/sysfs-bus-iio-lptimer-stm32
+What:		/sys/bus/iio/devices/iio:deviceX/in_count0_preset
+KernelVersion:	4.13
+Contact:	fabrice.gasnier@st.com
+Description:
+		Reading returns the current preset value. Writing sets the
+		preset value. Encoder counts continuously from 0 to preset
+		value, depending on direction (up/down).
+
+What:		/sys/bus/iio/devices/iio:deviceX/in_count_quadrature_mode_available
+KernelVersion:	4.13
+Contact:	fabrice.gasnier@st.com
+Description:
+		Reading returns the list of possible quadrature modes.
+
+What:		/sys/bus/iio/devices/iio:deviceX/in_count0_quadrature_mode
+KernelVersion:	4.13
+Contact:	fabrice.gasnier@st.com
+Description:
+		Configure the device counter quadrature modes:
+		- non-quadrature:
+			Encoder IN1 input serves as the count input (up
+			direction).
+		- quadrature:
+			Encoder IN1 and IN2 inputs are mixed to get direction
+			and count.
+
+What:		/sys/bus/iio/devices/iio:deviceX/in_count_polarity_available
+KernelVersion:	4.13
+Contact:	fabrice.gasnier@st.com
+Description:
+		Reading returns the list of possible active edges.
+
+What:		/sys/bus/iio/devices/iio:deviceX/in_count0_polarity
+KernelVersion:	4.13
+Contact:	fabrice.gasnier@st.com
+Description:
+		Configure the device encoder/counter active edge:
+		- rising-edge
+		- falling-edge
+		- both-edges
+
+		In non-quadrature mode, device counts up on active edge.
+		In quadrature mode, encoder counting scenarios are as follows:
+		----------------------------------------------------------------
+		| Active  | Level on |      IN1 signal    |     IN2 signal     |
+		| edge    | opposite |------------------------------------------
+		|         | signal   |  Rising  | Falling |  Rising  | Falling |
+		----------------------------------------------------------------
+		| Rising  | High ->  |   Down   |    -    |    Up    |    -    |
+		| edge    | Low  ->  |    Up    |    -    |   Down   |    -    |
+		----------------------------------------------------------------
+		| Falling | High ->  |    -     |    Up   |    -     |   Down  |
+		| edge    | Low  ->  |    -     |   Down  |    -     |    Up   |
+		----------------------------------------------------------------
+		| Both    | High ->  |   Down   |    Up   |    Up    |   Down  |
+		| edges   | Low  ->  |    Up    |   Down  |   Down   |    Up   |
+		----------------------------------------------------------------
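Reviewer note: the quadrature-mode table above can be read as a lookup from (signal, edge, level on the opposite signal) to a counting direction. The sketch below encodes that reading in Python; the names count_direction, _TRANSITIONS and _ACTIVE_EDGES are hypothetical and do not exist in the driver.

```python
# (signal, edge, level on the opposite signal) -> counting direction,
# transcribed from the table in the in_count0_polarity description.
_TRANSITIONS = {
    ("IN1", "rising", "high"): "down",
    ("IN1", "rising", "low"): "up",
    ("IN2", "rising", "high"): "up",
    ("IN2", "rising", "low"): "down",
    ("IN1", "falling", "high"): "up",
    ("IN1", "falling", "low"): "down",
    ("IN2", "falling", "high"): "down",
    ("IN2", "falling", "low"): "up",
}

# Which edges are counted for each polarity setting.
_ACTIVE_EDGES = {
    "rising-edge": {"rising"},
    "falling-edge": {"falling"},
    "both-edges": {"rising", "falling"},
}


def count_direction(polarity, signal, edge, opposite_level):
    """Return "up", "down", or None if the edge is not active."""
    if edge not in _ACTIVE_EDGES[polarity]:
        return None  # shown as "-" in the table
    return _TRANSITIONS[(signal, edge, opposite_level)]


if __name__ == "__main__":
    print(count_direction("rising-edge", "IN1", "rising", "high"))
```

Keeping the edge table separate from the polarity filter mirrors the table layout: the "Both edges" rows are simply the union of the rising and falling rows.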
--- a/Documentation/ABI/testing/sysfs-bus-thunderbolt
+++ b/Documentation/ABI/testing/sysfs-bus-thunderbolt
@@ -45,6 +45,8 @@ Contact:	thunderbolt-software@lists.01.org
 Description:	When a devices supports Thunderbolt secure connect it will
 		have this attribute. Writing 32 byte hex string changes
 		authorization to use the secure connection method instead.
+		Writing an empty string clears the key and regular connection
+		method can be used again.

 What:		/sys/bus/thunderbolt/devices/.../device
 Date:		Sep 2017

--- a/Documentation/ABI/testing/sysfs-bus-usb-lvstest
+++ b/Documentation/ABI/testing/sysfs-bus-usb-lvstest
@@ -45,3 +45,16 @@ Contact:	Pratyush Anand <pratyush.anand@gmail.com>
 Description:
 		Write to this node to issue "U3 exit" for Link Layer
 		Validation device. It is needed for TD.7.36.
+
+What:		/sys/bus/usb/devices/.../enable_compliance
+Date:		July 2017
+Description:
+		Write to this node to set the port to compliance mode to test
+		with Link Layer Validation device. It is needed for TD.7.34.
+
+What:		/sys/bus/usb/devices/.../warm_reset
+Date:		July 2017
+Description:
+		Write to this node to issue "Warm Reset" for Link Layer Validation
+		device. It may be needed to properly reset an xHCI 1.1 host port if
+		compliance mode needed to be explicitly enabled.
--- a/Documentation/ABI/testing/sysfs-driver-altera-cvp
+++ b/Documentation/ABI/testing/sysfs-driver-altera-cvp
+What:		/sys/bus/pci/drivers/altera-cvp/chkcfg
+Date:		May 2017
+KernelVersion:	4.13
+Contact:	Anatolij Gustschin <agust@denx.de>
+Description:
+		Contains either 1 or 0 and controls if configuration
+		error checking in altera-cvp driver is turned on or
+		off.
--- a/Documentation/ABI/testing/sysfs-firmware-opal-powercap
+++ b/Documentation/ABI/testing/sysfs-firmware-opal-powercap
+What:		/sys/firmware/opal/powercap
+Date:		August 2017
+Contact:	Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
+Description:	Powercap directory for Powernv (P8, P9) servers
+
+		Each folder in this directory contains a
+		power-cappable component.
+
+What:		/sys/firmware/opal/powercap/system-powercap
+		/sys/firmware/opal/powercap/system-powercap/powercap-min
+		/sys/firmware/opal/powercap/system-powercap/powercap-max
+		/sys/firmware/opal/powercap/system-powercap/powercap-current
+Date:		August 2017
+Contact:	Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
+Description:	System powercap directory and attributes applicable for
+		Powernv (P8, P9) servers
+
+		This directory provides powercap information. It
+		contains below sysfs attributes:
+
+		- powercap-min : This file provides the minimum
+		  possible powercap in Watt units
+
+		- powercap-max : This file provides the maximum
+		  possible powercap in Watt units
+
+		- powercap-current : This file provides the current
+		  powercap set on the system. Writing to this file
+		  creates a request for setting a new-powercap. The
+		  powercap requested must be between powercap-min
+		  and powercap-max.
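Reviewer note: the constraint on powercap-current can be checked in userspace before writing the request. A minimal sketch, assuming the bounds are inclusive (the text only says "between") and that the values have already been read as integers in Watts; check_powercap_request is a hypothetical helper name.

```python
def check_powercap_request(requested, powercap_min, powercap_max):
    """Return True if the requested cap (Watts) is within [min, max].

    Assumption: both bounds are treated as inclusive.
    """
    return powercap_min <= requested <= powercap_max


if __name__ == "__main__":
    # Example values are made up for illustration only.
    print(check_powercap_request(1500, 1000, 2000))
```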
--- a/Documentation/ABI/testing/sysfs-firmware-opal-psr
+++ b/Documentation/ABI/testing/sysfs-firmware-opal-psr
+What:		/sys/firmware/opal/psr
+Date:		August 2017
+Contact:	Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
+Description:	Power-Shift-Ratio directory for Powernv P9 servers
+
+		Power-Shift-Ratio allows providing hints to the firmware
+		to shift/throttle power between different entities in
+		the system. Each attribute in this directory indicates
+		a settable PSR.
+
+What:		/sys/firmware/opal/psr/cpu_to_gpu_X
+Date:		August 2017
+Contact:	Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
+Description:	PSR sysfs attributes for Powernv P9 servers
+
+		Power-Shift-Ratio between CPU and GPU for a given chip
+		with chip-id X. This file gives the ratio (0-100)
+		which is used by OCC for power-capping.
--- a/Documentation/ABI/testing/sysfs-fs-f2fs
+++ b/Documentation/ABI/testing/sysfs-fs-f2fs
@@ -57,6 +57,15 @@ Contact:	"Jaegeuk Kim" <jaegeuk.kim@samsung.com>
 Description:
 		 Controls the issue rate of small discard commands.

+What:          /sys/fs/f2fs/<disk>/discard_granularity
+Date:          July 2017
+Contact:       "Chao Yu" <yuchao0@huawei.com>
+Description:
+		Controls the discard granularity of the inner discard thread. The
+		inner thread will not issue discards smaller than the granularity.
+		The unit size is one block; only values in the range [1, 512]
+		are currently supported.
+
 What:		/sys/fs/f2fs/<disk>/max_victim_search
 Date:		January 2014
 Contact:	"Jaegeuk Kim" <jaegeuk.kim@samsung.com>
@@ -130,3 +139,15 @@ Date:		June 2017
 Contact:	"Chao Yu" <yuchao0@huawei.com>
 Description:
 		 Controls current reserved blocks in system.
+
+What:		/sys/fs/f2fs/<disk>/gc_urgent
+Date:		August 2017
+Contact:	"Jaegeuk Kim" <jaegeuk@kernel.org>
+Description:
+		 Do background GC aggressively
+
+What:		/sys/fs/f2fs/<disk>/gc_urgent_sleep_time
+Date:		August 2017
+Contact:	"Jaegeuk Kim" <jaegeuk@kernel.org>
+Description:
+		 Controls sleep time of GC urgent mode
--- a/Documentation/ABI/testing/sysfs-kernel-mm-swap
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-swap
+What:		/sys/kernel/mm/swap/
+Date:		August 2017
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for swapping
+
+What:		/sys/kernel/mm/swap/vma_ra_enabled
+Date:		August 2017
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Enable/disable VMA based swap readahead.
+
+		If set to true, the VMA based swap readahead algorithm
+		will be used for swappable anonymous pages mapped in a
+		VMA, and the global swap readahead algorithm will
+		still be used for other users such as tmpfs.  If set to
+		false, the global swap readahead algorithm will be
+		used for all swappable pages.
+
+What:		/sys/kernel/mm/swap/vma_ra_max_order
+Date:		August 2017
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	The max readahead size in order for VMA based swap readahead
+
+		VMA based swap readahead algorithm will readahead at
+		most 1 << max_order pages for each readahead.  The
+		real readahead size for each readahead will be scaled
+		according to the estimation algorithm.
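Reviewer note: the upper bound described for vma_ra_max_order is a plain power of two. The sketch below works it out in Python; the 4096-byte page size is an assumption for illustration (it varies by architecture and configuration), and both function names are hypothetical.

```python
def max_readahead_pages(max_order):
    """Upper bound on pages per readahead: 1 << max_order."""
    return 1 << max_order


def max_readahead_bytes(max_order, page_size=4096):
    """Same bound in bytes; page_size=4096 is an assumed default."""
    return max_readahead_pages(max_order) * page_size


if __name__ == "__main__":
    print(max_readahead_pages(3), max_readahead_bytes(3))
```

The real readahead size per fault may be smaller, since the text says it is scaled by the estimation algorithm.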
--- a/Documentation/ABI/testing/sysfs-power
+++ b/Documentation/ABI/testing/sysfs-power
@@ -273,3 +273,15 @@ Description:

 		This output is useful for system wakeup diagnostics of spurious
 		wakeup interrupts.
+
+What:		/sys/power/pm_debug_messages
+Date:		July 2017
+Contact:	Rafael J. Wysocki <rjw@rjwysocki.net>
+Description:
+		The /sys/power/pm_debug_messages file controls the printing
+		of debug messages from the system suspend/hibernation
+		infrastructure to the kernel log.
+
+		Writing a "1" to this file enables the debug messages and
+		writing a "0" (default) to it disables them.  Reads from
+		this file return the current value.
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -515,14 +515,15 @@ API at all.
 ::

 	void *
-	dma_alloc_noncoherent(struct device *dev, size_t size,
-			      dma_addr_t *dma_handle, gfp_t flag)
+	dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle,
+			gfp_t flag, unsigned long attrs)

-Identical to dma_alloc_coherent() except that the platform will
-choose to return either consistent or non-consistent memory as it sees
-fit.  By using this API, you are guaranteeing to the platform that you
-have all the correct and necessary sync points for this memory in the
-driver should it choose to return non-consistent memory.
+Identical to dma_alloc_coherent() except that when the
+DMA_ATTR_NON_CONSISTENT flag is passed in the attrs argument, the
+platform will choose to return either consistent or non-consistent memory
+as it sees fit.  By using this API, you are guaranteeing to the platform
+that you have all the correct and necessary sync points for this memory
+in the driver should it choose to return non-consistent memory.

 Note: where the platform can return consistent memory, it will
 guarantee that the sync points become nops.
@@ -535,12 +536,13 @@ that simply cannot make consistent memory.
 ::

 	void
-	dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr,
-			     dma_addr_t dma_handle)
+	dma_free_attrs(struct device *dev, size_t size, void *cpu_addr,
+		       dma_addr_t dma_handle, unsigned long attrs)

-Free memory allocated by the nonconsistent API.  All parameters must
-be identical to those passed in (and returned by
-dma_alloc_noncoherent()).
+Free memory allocated by dma_alloc_attrs().  All common
+parameters must be identical to those otherwise passed to dma_free_coherent(),
+and the attrs argument must be identical to the attrs passed to
+dma_alloc_attrs().

 ::

@@ -564,8 +566,8 @@ memory or doing partial flushes.
 	dma_cache_sync(struct device *dev, void *vaddr, size_t size,
 		       enum dma_data_direction direction)

-Do a partial sync of memory that was allocated by
-dma_alloc_noncoherent(), starting at virtual address vaddr and
+Do a partial sync of memory that was allocated by dma_alloc_attrs() with
+the DMA_ATTR_NON_CONSISTENT flag, starting at virtual address vaddr and
 continuing on for size.  Again, you *must* observe the cache line
 boundaries when doing this.

@@ -590,34 +592,11 @@ size is the size of the area (must be multiples of PAGE_SIZE).

 flags can be ORed together and are:

- DMA_MEMORY_MAP - request that the memory returned from
-  dma_alloc_coherent() be directly writable.
-
- DMA_MEMORY_IO - request that the memory returned from
-  dma_alloc_coherent() be addressable using read()/write()/memcpy_toio() etc.
-
-One or both of these flags must be present.
-
- DMA_MEMORY_INCLUDES_CHILDREN - make the declared memory be allocated by
-  dma_alloc_coherent of any child devices of this one (for memory residing
-  on a bridge).
-
 - DMA_MEMORY_EXCLUSIVE - only allocate memory from the declared regions.
  Do not allow dma_alloc_coherent() to fall back to system memory when
  it's out of memory in the declared region.

-The return value will be either DMA_MEMORY_MAP or DMA_MEMORY_IO and
-must correspond to a passed in flag (i.e. no returning DMA_MEMORY_IO
-if only DMA_MEMORY_MAP were passed in) for success or zero for
-failure.
-
-Note, for DMA_MEMORY_IO returns, all subsequent memory returned by
-dma_alloc_coherent() may no longer be accessed directly, but instead
-must be accessed using the correct bus functions.  If your driver
-isn't prepared to handle this contingency, it should not specify
-DMA_MEMORY_IO in the input flags.
-
-As a simplification for the platforms, only **one** such region of
+As a simplification for the platforms, only *one* such region of
 memory may be declared per device.

 For reasons of efficiency, most platforms choose to track the declared

--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -22,6 +22,8 @@ ifeq ($(HAVE_SPHINX),0)

 .DEFAULT:
 	$(warning The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed and in PATH, or set the SPHINXBUILD make variable to point to the full path of the '$(SPHINXBUILD)' executable.)
+	@echo
+	@./scripts/sphinx-pre-install
 	@echo "  SKIP    Sphinx $@ target."

 else # HAVE_SPHINX
@@ -95,16 +97,6 @@ endif # HAVE_SPHINX
 # The following targets are independent of HAVE_SPHINX, and the rules should
 # work or silently pass without Sphinx.

-# no-ops for the Sphinx toolchain
-sgmldocs:
-	@:
-psdocs:
-	@:
-mandocs:
-	@:
-installmandocs:
-	@:
-
 cleandocs:
 	$(Q)rm -rf $(BUILDDIR)
 	$(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/media clean

--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2080,6 +2080,8 @@ Some of the relevant points of interest are as follows:
 <li>	<a href="#Scheduler and RCU">Scheduler and RCU</a>.
 <li>	<a href="#Tracing and RCU">Tracing and RCU</a>.
 <li>	<a href="#Energy Efficiency">Energy Efficiency</a>.
+<li>	<a href="#Scheduling-Clock Interrupts and RCU">
+	Scheduling-Clock Interrupts and RCU</a>.
 <li>	<a href="#Memory Efficiency">Memory Efficiency</a>.
 <li>	<a href="#Performance, Scalability, Response Time, and Reliability">
 	Performance, Scalability, Response Time, and Reliability</a>.
@@ -2532,6 +2534,134 @@ I learned of many of these requirements via angry phone calls:
 Flaming me on the Linux-kernel mailing list was apparently not
 sufficient to fully vent their ire at RCU's energy-efficiency bugs!

+<h3><a name="Scheduling-Clock Interrupts and RCU">
+Scheduling-Clock Interrupts and RCU</a></h3>
+
+<p>
+The kernel transitions between in-kernel non-idle execution, userspace
+execution, and the idle loop.
+Depending on kernel configuration, RCU handles these states differently:
+
+<table border=3>
+<tr><th><tt>HZ</tt> Kconfig</th>
+	<th>In-Kernel</th>
+		<th>Usermode</th>
+			<th>Idle</th></tr>
+<tr><th align="left"><tt>HZ_PERIODIC</tt></th>
+	<td>Can rely on scheduling-clock interrupt.</td>
+		<td>Can rely on scheduling-clock interrupt and its
+		    detection of interrupt from usermode.</td>
+			<td>Can rely on RCU's dyntick-idle detection.</td></tr>
+<tr><th align="left"><tt>NO_HZ_IDLE</tt></th>
+	<td>Can rely on scheduling-clock interrupt.</td>
+		<td>Can rely on scheduling-clock interrupt and its
+		    detection of interrupt from usermode.</td>
+			<td>Can rely on RCU's dyntick-idle detection.</td></tr>
+<tr><th align="left"><tt>NO_HZ_FULL</tt></th>
+	<td>Can only sometimes rely on scheduling-clock interrupt.
+	    In other cases, it is necessary to bound kernel execution
+	    times and/or use IPIs.</td>
+		<td>Can rely on RCU's dyntick-idle detection.</td>
+			<td>Can rely on RCU's dyntick-idle detection.</td></tr>
+</table>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+	Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the
+	scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt>
+	and <tt>NO_HZ_IDLE</tt> do?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+	Because, as a performance optimization, <tt>NO_HZ_FULL</tt>
+	does not necessarily re-enable the scheduling-clock interrupt
+	on entry to each and every system call.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+However, RCU must be reliably informed as to whether any given
+CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>,
+also whether that CPU is executing in usermode, as discussed
+<a href="#Energy Efficiency">earlier</a>.
+It also requires that the scheduling-clock interrupt be enabled when
+RCU needs it to be:
+
+<ol>
+<li>	If a CPU is either idle or executing in usermode, and RCU believes
+	it is non-idle, the scheduling-clock tick had better be running.
+	Otherwise, you will get RCU CPU stall warnings.  Or at best,
+	very long (11-second) grace periods, with a pointless IPI waking
+	the CPU from time to time.
+<li>	If a CPU is in a portion of the kernel that executes RCU read-side
+	critical sections, and RCU believes this CPU to be idle, you will get
+	random memory corruption.  <b>DON'T DO THIS!!!</b>
+
+	<br>This is one reason to test with lockdep, which will complain
+	about this sort of thing.
+<li>	If a CPU is in a portion of the kernel that is absolutely
+	positively no-joking guaranteed to never execute any RCU read-side
+	critical sections, and RCU believes this CPU to be idle,
+	no problem.  This sort of thing is used by some architectures
+	for light-weight exception handlers, which can then avoid the
+	overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt>
+	at exception entry and exit, respectively.
+	Some go further and avoid the entireties of <tt>irq_enter()</tt>
+	and <tt>irq_exit()</tt>.
+
+	<br>Just make very sure you are running some of your tests with
+	<tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths
+	was in fact joking about not doing RCU read-side critical sections.
+<li>	If a CPU is executing in the kernel with the scheduling-clock
+	interrupt disabled and RCU believes this CPU to be non-idle,
+	and if the CPU goes idle (from an RCU perspective) every few
+	jiffies, no problem.  It is usually OK for there to be the
+	occasional gap between idle periods of up to a second or so.
+
+	<br>If the gap grows too long, you get RCU CPU stall warnings.
+<li>	If a CPU is either idle or executing in usermode, and RCU believes
+	it to be idle, of course no problem.
+<li>	If a CPU is executing in the kernel, the kernel code
+	path is passing through quiescent states at a reasonable
+	frequency (preferably about once per few jiffies, but the
+	occasional excursion to a second or so is usually OK) and the
+	scheduling-clock interrupt is enabled, of course no problem.
+
+	<br>If the gap between a successive pair of quiescent states grows
+	too long, you get RCU CPU stall warnings.
+</ol>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+	But what if my driver has a hardware interrupt handler
+	that can run for many seconds?
+	I cannot invoke <tt>schedule()</tt> from a hardware
+	interrupt handler, after all!
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+	One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt>
+	every so often.
+	But given that long-running interrupt handlers can cause
+	other problems, not least for response time, shouldn't you
+	work to keep your interrupt handler's runtime within reasonable
+	bounds?
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+But as long as RCU is properly informed of kernel state transitions between
+in-kernel execution, usermode execution, and idle, and as long as the
+scheduling-clock interrupt is enabled when RCU needs it to be, you
+can rest assured that the bugs you encounter will be in some other
+part of RCU or some other part of the kernel!
+
 <h3><a name="Memory Efficiency">Memory Efficiency</a></h3>

 <p>