linux / backbone-sources

To clone this repository:

git clone git@git.backbone.ws:linux/backbone-sources.git

To push to this repository:

# Add a new remote
git remote add origin git@git.backbone.ws:linux/backbone-sources.git

# Push the master branch to the newly added origin, and configure
# this remote and branch as the default:
git push -u origin master

# From now on you can push master to the "origin" remote with:
git push
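
Note that `git remote add origin ...` fails if a remote named `origin` already exists, which is the usual case inside a clone. A minimal sketch of a wrapper that handles both cases follows; `ensure_remote` is a hypothetical helper name, not something provided by git or by this site, and it assumes git 2.7 or later for `git remote get-url`.

```shell
#!/bin/sh
# Hypothetical helper (not part of git or this page): point a remote at a URL,
# adding the remote if it is missing and updating it if it already exists.
ensure_remote() {
    name=$1
    url=$2
    if git remote get-url "$name" >/dev/null 2>&1; then
        # The remote exists: only change its URL.
        git remote set-url "$name" "$url"
    else
        # The remote does not exist yet: create it.
        git remote add "$name" "$url"
    fi
}

# Usage with the URL from this page (run inside your working copy):
# ensure_remote origin git@git.backbone.ws:linux/backbone-sources.git
# git push -u origin master
```

After this, `git remote -v` should list the URL for both fetch and push.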

Diffs from 1381798 to 23c2e00

Commits

Jan Alexander Steffens (heftig) ZEN: Implement zen-tune v4.12 055344c 5 months ago
Jan Alexander Steffens (heftig) ZEN: Allow setting the number of available virtual TTYs 3c08840 5 months ago
Jan Alexander Steffens (heftig) ZEN: Add ZEN branding 81681f8 5 months ago
Jan Alexander Steffens (heftig) ZEN: Add a choice of boot logos 8e3e8de 5 months ago
Jan Alexander Steffens (heftig) ZEN: Add Thinkpad SMAPI driver 67273bb 5 months ago
Jan Alexander Steffens (heftig) ZEN: Add Ubuntu ureadahead support bdce98a 5 months ago
Jan Alexander Steffens (heftig) ZEN: Add VHBA driver d4a5269 5 months ago
Jan Alexander Steffens (heftig) ZEN: Add support for some old Dell clickpads d11341c 5 months ago
Jan Alexander Steffens (heftig) ZEN: Add exFAT support 12180aa 5 months ago
Jan Alexander Steffens (heftig) ZEN: adbhid: Support absolute mode in adb-base trackpads eae7e4a 5 months ago
Jan Alexander Steffens (heftig) ZEN: joydev: Add a parameter to remap axes to buttons 3fe823b 5 months ago
Jan Alexander Steffens (heftig) ZEN: Enable additional CPU Optimizations for GCC v4.9+ / Kernel v3.15+ 811ab89 5 months ago
Jan Alexander Steffens (heftig) ZEN: Add a CONFIG option that sets -O3 9a828e2 5 months ago
Jan Alexander Steffens (heftig) ZEN: Allow TCP YeAH as default congestion control a704dd8 5 months ago
Jan Alexander Steffens (heftig) Merge remote-tracking branch 'github/4.12/misc' into 4.12/master 977d82e 5 months ago
Jan Alexander Steffens (heftig) Merge remote-tracking branch 'github/4.12/zen-tune' into 4.12/master dd55d5b 5 months ago
Jan Alexander Steffens (heftig) fixup! ZEN: Add support for some old Dell clickpads 9d1ddcd 5 months ago
Jan Alexander Steffens (heftig) Merge branch '4.12/misc' into 4.12/master 88040e5 5 months ago
Kolan Sh EXTRAVERSION updated. 772aae5 5 months ago
Jan Alexander Steffens (heftig) bfq-iosched: fix NULL ioc check in bfq_get_rq_private 348db85 5 months ago
Jan Alexander Steffens (heftig) block, bfq: update wr_busy_queues if needed on a queue split daa073d 5 months ago
Jan Alexander Steffens (heftig) block, bfq: don't change ioprio class for a bfq_queue on a service tree b49d82c 5 months ago
Jan Alexander Steffens (heftig) Merge branch '4.12/misc' into 4.12/master 96e967f 5 months ago
Kolan Sh Merge remote-tracking branch 'zen-kernel/4.12/master' into 4.12 bdde6dd 5 months ago
Jan Alexander Steffens (heftig) zen: Ignore BIOS request to opt-out of x2apic support 415a32d 4 months ago
Jan Alexander Steffens (heftig) Merge branch '4.12/zen-tune' into 4.12/master d1fe5a2 4 months ago
Kolan Sh Linux 4.12 merged 75b3695 4 months ago
Kolan Sh Merge remote-tracking branch 'zen-kernel/4.12/master' into 4.12 e7b22df 4 months ago
Jan Alexander Steffens (heftig) Merge tag 'v4.12.1' into 4.12/master 0b6bd04 4 months ago
Jan Alexander Steffens (heftig) Revert "zen: Ignore BIOS request to opt-out of x2apic support" a7e2631 4 months ago
Jan Alexander Steffens (heftig) Merge branch '4.12/zen-tune' into 4.12/master 8f481cf 4 months ago
Jan Alexander Steffens (heftig) bfq: dispatch request to prevent queue stalling after the request completion d501136 4 months ago
Jan Alexander Steffens (heftig) Merge branch '4.12/misc' into 4.12/master dde60eb 4 months ago
Jan Alexander Steffens (heftig) Merge tag 'v4.12.2' into 4.12/master 9fea11a 4 months ago
Kolan Sh Linux 4.12 merged 07c1bee 4 months ago
Kolan Sh Zen merged into 4.12 102089e 4 months ago
Jan Alexander Steffens (heftig) Merge tag 'v4.12.3' into 4.12/master d669478 4 months ago
Kolan Sh Linux 4.12 merged 475af42 4 months ago
Jan Alexander Steffens (heftig) block: disable runtime-pm for blk-mq 85d895f 4 months ago
Jan Alexander Steffens (heftig) bfq-mq: consider also in_service_entity to state whether an entity is backlogged 0131e0e 4 months ago
Jan Alexander Steffens (heftig) bfq-mq: reset in_service_entity if the pointed entity f6e2f9f 4 months ago
Jan Alexander Steffens (heftig) Merge branch '4.12/misc' into 4.12/master a123ecd 4 months ago
Kolan Sh Linux 4.12 merged 3d94583 4 months ago
Kolan Sh Zen merged into 4.12 dde5f6e 4 months ago
Jan Alexander Steffens (heftig) Merge tag 'v4.12.4' into 4.12/master 8f864fa 4 months ago
Jan Alexander Steffens (heftig) bfq-mq: Revert ad-hoc fixes 416a5a8 4 months ago
Jan Alexander Steffens (heftig) block, bfq: reset in_service_entity if it becomes idle b4da4d5 4 months ago
Jan Alexander Steffens (heftig) block, bfq: consider also in_service_entity to state whether an entity is active 7fda307 4 months ago
Jan Alexander Steffens (heftig) Merge branch '4.12/misc' into 4.12/master 29bc26d 4 months ago
Kolan Sh Zen merged into 4.12 2fc1f08 4 months ago
Kolan Sh Linux 4.12 merged ed27cb2 3 months ago
Jan Alexander Steffens (heftig) Merge tag 'v4.12.5' into 4.12/master ed12d0f 3 months ago
Kolan Sh Linux 4.12 merged c24a25c 3 months ago
Jan Alexander Steffens (heftig) Merge tag 'v4.12.6' into 4.12/master 5ad3f9e 3 months ago
Jan Alexander Steffens (heftig) muqss: Merge MuQSS version 0.157 b0f3edd 3 months ago
Jan Alexander Steffens (heftig) muqss: Tweak configs fc16533 3 months ago
Jan Alexander Steffens (heftig) muqss: Add zen-tune tweaks 88ea217 3 months ago
Jan Alexander Steffens (heftig) Merge remote-tracking branch 'github/4.12/muqss' into 4.12/master 355cbdf 3 months ago
Kolan Sh Linux 4.12 merged 1d3250a 3 months ago
Kolan Sh Zen merged into 4.12 f3d98f9 3 months ago
Jan Alexander Steffens (heftig) Merge tag 'v4.12.7' into 4.12/master ef6519b 3 months ago
Jan Alexander Steffens (heftig) muqss: Merge MuQSS version 0.160 ac67409 3 months ago
Jan Alexander Steffens (heftig) Merge branch '4.12/muqss' into 4.12/master 988a41d 3 months ago
Jan Alexander Steffens (heftig) Merge tag 'v4.12.8' into 4.12/master ca8605a 3 months ago
Kolan Sh Linux 4.12 merged f7a6551 3 months ago
Kolan Sh Zen merged into 4.12 b063f17 3 months ago
Jan Alexander Steffens (heftig) bonding: ratelimit failed speed/duplex update warning 2ba9951 3 months ago
Jan Alexander Steffens (heftig) bonding: require speed/duplex only for 802.3ad, alb and tlb af3a867 3 months ago
Jan Alexander Steffens (heftig) Merge branch '4.12/misc' into 4.12/master 1dd72ee 3 months ago
Jan Alexander Steffens (heftig) mm: Revert x86_64 and arm64 ELF_ET_DYN_BASE base 88073e2 3 months ago
Steven Barrett Add BFQ-v8r12 ccfab11 3 months ago
Steven Barrett block, bfq: improve and refactor throughput-boosting logic 2d38ec0 3 months ago
Steven Barrett Add extra checks related to entity scheduling 6ee4872 3 months ago
Steven Barrett block, bfq: reset in_service_entity if it becomes idle 4efd3ea 3 months ago
Steven Barrett block, bfq: consider also in_service_entity to state whether an entity is active c440fa8 3 months ago
Steven Barrett FIRST BFQ-MQ COMMIT: Copy bfq-sq-iosched.c as bfq-mq-iosched.c 1ab9ee0 3 months ago
Steven Barrett Add config and build bits for bfq-mq-iosched 3f043d9 3 months ago
Steven Barrett Increase max policies for io controller 5650694 3 months ago
Steven Barrett Copy header file bfq.h as bfq-mq.h c591312 3 months ago
Steven Barrett Move thinktime from bic to bfqq 383a002 3 months ago
Steven Barrett Embed bfq-ioc.c and add locking on request queue 0d982a7 3 months ago
Steven Barrett Modify interface and operation to comply with blk-mq-sched d18e0a8 3 months ago
Steven Barrett Add checks and extra log messages - Part I 990c0bd 3 months ago
Steven Barrett Add lock check in bfq_allow_bio_merge c63a585 3 months ago
Steven Barrett bfq-mq: execute exit_icq operations immediately e234cf5 3 months ago
Steven Barrett Unnest request-queue and ioc locks from scheduler locks 224e19a 3 months ago
Steven Barrett Add checks and extra log messages - Part II 291a69b 3 months ago
Steven Barrett Fix unbalanced increment of rq_in_driver 56d2e50 3 months ago
Steven Barrett Add checks and extra log messages - Part III 783644a 3 months ago
Steven Barrett TESTING: Check wrong invocation of merge and put_rq_priv functions 445d51d 3 months ago
Steven Barrett Complete support for cgroups 4fec16d 3 months ago
Steven Barrett Remove all get and put of I/O contexts f5ca0bd 3 months ago
Steven Barrett BUGFIX: Remove unneeded and deadlock-causing lock in request_merged e82fb50 3 months ago
Steven Barrett Fix wrong unlikely 06f03bf 3 months ago
Steven Barrett Change cgroup params prefix to bfq-mq for bfq-mq 347712b 3 months ago
Steven Barrett Add tentative extra tests on groups, reqs and queues 6638af5 3 months ago
Steven Barrett block, bfq-mq: access and cache blkg data only when safe 2684436 3 months ago
Steven Barrett bfq-mq: fix macro name in conditional invocation of policy_unregister 7efdeea 3 months ago
Steven Barrett Port of "blk-mq-sched: unify request finished methods" f76492a 3 months ago
Steven Barrett Port of "bfq-iosched: fix NULL ioc check in bfq_get_rq_private" 828dce8 3 months ago
Steven Barrett Port of "blk-mq-sched: unify request prepare methods" f777b51 3 months ago
Steven Barrett Add list of bfq instances to documentation e53e0ce 3 months ago
Steven Barrett bfq-sq: fix prefix of names of cgroups parameters d13341c 3 months ago
Steven Barrett Add to documentation that bfq-mq and bfq-sq contain last fixes too fabbf15 3 months ago
Steven Barrett Improve most frequently used no-logging path 146c778 3 months ago
Steven Barrett bfq-sq: fix commit "Remove all get and put of I/O contexts" in branch bfq-mq 7bbe038 3 months ago
Steven Barrett bfq-sq-mq: make lookup_next_entity push up vtime on expirations 79afc04 3 months ago
Steven Barrett bfq-sq-mq: remove direct switch to an entity in higher class f5b1f51 3 months ago
Steven Barrett bfq-sq-mq: guarantee update_next_in_service always returns an eligible entity d872557 3 months ago
Steven Barrett Merge remote-tracking branch 'origin/4.12/bfq' into 4.12/master 5320bc7 3 months ago
Jan Alexander Steffens (heftig) Merge tag 'v4.12.9' into 4.12/master 047ddf1 3 months ago
Kolan Sh Linux 4.12 merged e8c4ae3 3 months ago
Kolan Sh Zen merged into 4.12 062f731 3 months ago
Kolan Sh Linux 4.12 merged 478a09c 3 months ago
Kolan Sh Linux 4.12 merged fa582c4 2 months ago
Greg Kroah-Hartman usb: quirks: add delay init quirk for Corsair Strafe RGB keyboard 572bcfc 2 months ago
Greg Kroah-Hartman USB: serial: option: add support for D-Link DWM-157 C1 c8ff3d1 2 months ago
Greg Kroah-Hartman usb: Add device quirk for Logitech HD Pro Webcam C920-C 0e8e379 2 months ago
Greg Kroah-Hartman usb:xhci:Fix regression when ATI chipsets detected c927f42 2 months ago
Greg Kroah-Hartman USB: musb: fix external abort on suspend 68596cc 2 months ago
Greg Kroah-Hartman ANDROID: binder: add padding to binder_fd_array_object. 74ffccf 2 months ago
Greg Kroah-Hartman ANDROID: binder: add hwbinder,vndbinder to BINDER_DEVICES. ffdb5b9 2 months ago
Greg Kroah-Hartman USB: core: Avoid race of async_completed() w/ usbdev_release() ed68c93 2 months ago
Greg Kroah-Hartman staging/rts5208: fix incorrect shift to extract upper nybble 70bfcf9 2 months ago
Greg Kroah-Hartman iio: adc: ti-ads1015: fix incorrect data rate setting update 1d7fadc 2 months ago
Greg Kroah-Hartman iio: adc: ti-ads1015: fix scale information for ADS1115 6c5595e 2 months ago
Greg Kroah-Hartman iio: adc: ti-ads1015: enable conversion when CONFIG_PM is not set 6c164a8 2 months ago
Greg Kroah-Hartman iio: adc: ti-ads1015: avoid getting stale result after runtime resume 303d31e 2 months ago
Greg Kroah-Hartman iio: adc: ti-ads1015: don't return invalid value from buffer setup callbacks 00202de 2 months ago
Greg Kroah-Hartman iio: adc: ti-ads1015: add adequate wait time to get correct conversion 6c6c3c6 2 months ago
Greg Kroah-Hartman driver core: bus: Fix a potential double free af61751 2 months ago
Greg Kroah-Hartman HID: wacom: Do not completely map WACOM_HID_WD_TOUCHRINGSTATUS usage bbe1a3b 2 months ago
Greg Kroah-Hartman binder: free memory on error 693ef09 2 months ago
Greg Kroah-Hartman crypto: caam/qi - fix compilation with CONFIG_DEBUG_FORCE_WEAK_PER_CPU=y aa57cf5 2 months ago
Greg Kroah-Hartman crypto: caam/qi - fix compilation with DEBUG enabled ba89dc8 2 months ago
Greg Kroah-Hartman fpga: altera-hps2fpga: fix multiple init of l3_remap_lock 055be59 2 months ago
Greg Kroah-Hartman intel_th: pci: Add Cannon Lake PCH-H support d219237 2 months ago
Greg Kroah-Hartman intel_th: pci: Add Cannon Lake PCH-LP support 270f0aa 2 months ago
Greg Kroah-Hartman ath10k: fix memory leak in rx ring buffer allocation d49ea1b 2 months ago
Greg Kroah-Hartman Input: trackpoint - assume 3 buttons when buttons detection fails a47814b 2 months ago
Greg Kroah-Hartman rtlwifi: rtl_pci_probe: Fix fail path of _rtl_pci_find_adapter 7d20c55 2 months ago
Greg Kroah-Hartman Bluetooth: Add support of 13d3:3494 RTL8723BE device cbe865a 2 months ago
Greg Kroah-Hartman iwlwifi: pci: add new PCI ID for 7265D 9856969 2 months ago
Greg Kroah-Hartman dlm: avoid double-free on error path in dlm_device_{register,unregister} 0bfb078 2 months ago
Greg Kroah-Hartman mwifiex: correct channel stat buffer overflows f7fb789 2 months ago
Greg Kroah-Hartman MCB: add support for SC31 to mcb-lpc d859d5a 2 months ago
Greg Kroah-Hartman s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs 2ce0e04 2 months ago
Greg Kroah-Hartman s390/mm: fix BUG_ON in crst_table_upgrade e3b9fb2 2 months ago
Greg Kroah-Hartman drm/nouveau/pci/msi: disable MSI on big-endian platforms by default 75bc569 2 months ago
Greg Kroah-Hartman drm/nouveau: Fix error handling in nv50_disp_atomic_commit daf316a 2 months ago
Greg Kroah-Hartman workqueue: Fix flag collision f21c4ee 2 months ago
Greg Kroah-Hartman ahci: don't use MSI for devices with the silly Intel NVMe remapping scheme e5298cd 2 months ago
Greg Kroah-Hartman cs5536: add support for IDE controller variant 1054309 2 months ago
Greg Kroah-Hartman scsi: sg: protect against races between mmap() and SG_SET_RESERVED_SIZE b0f24dc 2 months ago
Greg Kroah-Hartman scsi: sg: recheck MMAP_IO request length with lock held aee0b37 2 months ago
Greg Kroah-Hartman of/device: Prevent buffer overflow in of_device_modalias() 3ef5220 2 months ago
Greg Kroah-Hartman rtlwifi: Fix memory leak when firmware request fails 21da5e3 2 months ago
Greg Kroah-Hartman rtlwifi: Fix fallback firmware loading ce4ef93 2 months ago
Greg Kroah-Hartman Linux 4.12.12 6ff98e8 2 months ago
Kolan Sh Linux 4.12 merged 23c2e00 2 months ago
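
A list like the one above, plus a per-file change summary, can likely be reproduced in a local clone with standard git commands. The sketch below is an approximation under stated assumptions: `show_range` is a hypothetical helper name, and the output format only mimics this page's layout (author, subject, short hash, relative date); it is not the hosting site's own implementation.

```shell
#!/bin/sh
# Hypothetical helper: list the commits between two revisions, oldest first,
# then print a per-file summary of the changes between them.
# Run it inside a clone of the repository.
show_range() {
    from=$1
    to=$2
    # Commits reachable from $to but not from $from.
    # %an = author name, %s = subject, %h = short hash, %ar = relative date.
    git log --reverse --format='%an %s %h %ar' "$from..$to"
    # Changed files with line counts and +/- bars.
    git diff --stat "$from" "$to"
}

# Example for the range shown on this page (requires the clone):
# show_range 1381798 23c2e00
```

`git diff --numstat` prints the same per-file information as plain added/deleted counts, which is easier to process in scripts than the bar form.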

Summary

  • Documentation/block/bfq-iosched.txt (23) ------+++++++++++++++++
  • Documentation/scheduler/sched-BFS.txt (351) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • Documentation/scheduler/sched-MuQSS.txt (347) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • Documentation/sysctl/kernel.txt (37) +++++++++++++++++++++++++++++++++++++
  • Documentation/tp_smapi.txt (267) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • Makefile (8) --++++++
  • arch/powerpc/platforms/cell/spufs/sched.c (5) -----
  • arch/s390/include/asm/pgtable.h (2) -+
  • arch/s390/mm/gmap.c (39) -------++++++++++++++++++++++++++++++++
  • arch/s390/mm/mmap.c (6) --++++
  • arch/x86/Kconfig (18) -+++++++++++++++++
  • arch/x86/Kconfig.cpu (224) -------------------------------+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • arch/x86/Makefile (33) ---++++++++++++++++++++++++++++++
  • arch/x86/Makefile_32.cpu (23) --+++++++++++++++++++++
  • arch/x86/include/asm/module.h (38) ++++++++++++++++++++++++++++++++++++++
  • block/Kconfig.iosched (50) ++++++++++++++++++++++++++++++++++++++++++++++++++
  • block/Makefile (2) ++
  • block/bfq-cgroup-included.c
  • block/bfq-ioc.c (36) ++++++++++++++++++++++++++++++++++++
  • block/bfq-iosched.c (53) -----------------++++++++++++++++++++++++++++++++++++
  • block/bfq-iosched.h (25) ------+++++++++++++++++++
  • block/bfq-mq-iosched.c
  • block/bfq-mq.h
  • block/bfq-sched.c
  • block/bfq-sq-iosched.c (5393) ++++++++++++++++++++++++++++++++++++++++++++++++++…
  • block/bfq-wf2q.c (185) ----------------------------------------------------------------------+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • block/bfq.h
  • block/elevator.c (4) ++++
  • drivers/android/Kconfig (2) -+
  • drivers/android/binder.c (8) --++++++
  • drivers/ata/ahci.c (9) -++++++++
  • drivers/ata/pata_amd.c (1) +
  • drivers/ata/pata_cs5536.c (1) +
  • drivers/base/bus.c (2) -+
  • drivers/bluetooth/btusb.c (1) +
  • drivers/cpufreq/cpufreq_ondemand.c (12) --++++++++++
  • drivers/crypto/caam/caamalg.c (66) ---------------------------------------------------+++++++++++++++
  • drivers/crypto/caam/caamalg_qi.c (6) ---+++
  • drivers/crypto/caam/error.c (40) ++++++++++++++++++++++++++++++++++++++++
  • drivers/crypto/caam/error.h (4) ++++
  • drivers/crypto/caam/qi.c (2) -+
  • drivers/fpga/altera-hps2fpga.c (4) ---+
  • drivers/gpu/drm/nouveau/nv50_display.c (7) --+++++
  • drivers/gpu/drm/nouveau/nvkm/subdev/pci/base.c (4) ++++
  • drivers/hid/wacom_wac.c (8) -+++++++
  • drivers/hwtracing/intel_th/pci.c (10) ++++++++++
  • drivers/iio/adc/ti-ads1015.c (123) ---------------------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • drivers/input/joydev.c (37) ----+++++++++++++++++++++++++++++++++
  • drivers/input/mouse/synaptics.c (4) -+++
  • drivers/input/mouse/synaptics.h (1) +
  • drivers/input/mouse/trackpoint.c (4) --++
  • drivers/macintosh/Kconfig (7) +++++++
  • drivers/macintosh/adbhid.c (83) --------+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • drivers/mcb/mcb-lpc.c (15) +++++++++++++++
  • drivers/net/wireless/ath/ath10k/core.c (12) ------++++++
  • drivers/net/wireless/intel/iwlwifi/pcie/drv.c (1) +
  • drivers/net/wireless/marvell/mwifiex/cfg80211.c (2) -+
  • drivers/net/wireless/marvell/mwifiex/scan.c (6) ++++++
  • drivers/net/wireless/realtek/rtlwifi/pci.c (4) --++
  • drivers/net/wireless/realtek/rtlwifi/rtl8188ee/sw.c (2) ++
  • drivers/net/wireless/realtek/rtlwifi/rtl8192ce/sw.c (2) ++
  • drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c (4) ++++
  • drivers/net/wireless/realtek/rtlwifi/rtl8192de/sw.c (2) ++
  • drivers/net/wireless/realtek/rtlwifi/rtl8192ee/sw.c (2) ++
  • drivers/net/wireless/realtek/rtlwifi/rtl8192se/sw.c (2) ++
  • drivers/net/wireless/realtek/rtlwifi/rtl8723ae/sw.c (2) ++
  • drivers/net/wireless/realtek/rtlwifi/rtl8723be/sw.c (15) ----------+++++
  • drivers/net/wireless/realtek/rtlwifi/rtl8821ae/sw.c (19) ----------+++++++++
  • drivers/of/device.c (2) ++
  • drivers/platform/x86/Kconfig (19) +++++++++++++++++++
  • drivers/platform/x86/Makefile (2) ++
  • drivers/platform/x86/hdaps.c
  • drivers/platform/x86/thinkpad_ec.c
  • drivers/platform/x86/tp_smapi.c
  • drivers/scsi/Kconfig (2) ++
  • drivers/scsi/Makefile (1) +
  • drivers/scsi/sg.c (19) -----++++++++++++++
  • drivers/scsi/vhba/Kconfig (9) +++++++++
  • drivers/scsi/vhba/Makefile (4) ++++
  • drivers/scsi/vhba/vhba.c
  • drivers/staging/rts5208/rtsx_scsi.c (2) -+
  • drivers/tty/Kconfig (13) +++++++++++++
  • drivers/usb/core/devio.c (4) --++
  • drivers/usb/core/quirks.c (6) -+++++
  • drivers/usb/host/pci-quirks.c (35) -----------------++++++++++++++++++
  • drivers/usb/musb/musb_core.c (18) --------++++++++++
  • drivers/usb/serial/option.c (1) +
  • drivers/video/logo/Kconfig (95) --------------+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • drivers/video/logo/Makefile (12) ++++++++++++
  • drivers/video/logo/logo.c (190) -------------------------------------------------------------------+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • drivers/video/logo/logo_arch_clut224.ppm (43204) 
  • drivers/video/logo/logo_bsd_clut224.ppm
  • drivers/video/logo/logo_debian_clut224.ppm
  • drivers/video/logo/logo_exherbo_clut224.ppm
  • drivers/video/logo/logo_fbsd_clut224.ppm
  • drivers/video/logo/logo_fedoraglossy_clut224.ppm
  • drivers/video/logo/logo_fedorasimple_clut224.ppm
  • drivers/video/logo/logo_gentoo_clut224.ppm
  • drivers/video/logo/logo_oldzen_clut224.ppm
  • drivers/video/logo/logo_slackware_clut224.ppm
  • drivers/video/logo/logo_tits_clut224.ppm
  • drivers/video/logo/logo_zen_clut224.ppm
  • fs/Kconfig (1) +
  • fs/Makefile (1) +
  • fs/dlm/user.c (4) ++++
  • fs/exec.c (6) -+++++
  • fs/exfat/Kconfig (39) +++++++++++++++++++++++++++++++++++++++
  • fs/exfat/LICENSE (339) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/Makefile (54) ++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/README.md (98) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/dkms.conf (7) +++++++
  • fs/exfat/exfat-km.mk (11) +++++++++++
  • fs/exfat/exfat_api.c
  • fs/exfat/exfat_api.h (206) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_bitmap.c (63) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_bitmap.h (55) +++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_blkdev.c (197) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_blkdev.h (73) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_cache.c
  • fs/exfat/exfat_cache.h (85) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_config.h (69) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_core.c (5138) ++++++++++++++++++++++++++++++++++++++++++++++++++…
  • fs/exfat/exfat_core.h
  • fs/exfat/exfat_data.c (77) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_data.h (58) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_nls.c (448) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_nls.h (91) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_oal.c (196) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_oal.h (74) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_super.c
  • fs/exfat/exfat_super.h (171) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_upcase.c (405) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • fs/exfat/exfat_version.h (19) +++++++++++++++++++
  • fs/open.c (4) ++++
  • fs/proc/base.c (2) -+
  • include/linux/blkdev.h (10) -+++++++++
  • include/linux/init_task.h (78) ---+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • include/linux/ioprio.h (2) ++
  • include/linux/linux_logo.h (12) ++++++++++++
  • include/linux/pci_ids.h (1) +
  • include/linux/sched.h (60) -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • include/linux/sched/nohz.h (4) --++
  • include/linux/sched/prio.h (12) ++++++++++++
  • include/linux/sched/task.h (2) -+
  • include/linux/skip_list.h (33) +++++++++++++++++++++++++++++++++
  • include/linux/thinkpad_ec.h (47) +++++++++++++++++++++++++++++++++++++++++++++++
  • include/linux/workqueue.h (2) -+
  • include/trace/events/fs.h (53) +++++++++++++++++++++++++++++++++++++++++++++++++++++
  • include/uapi/linux/android/binder.h (2) ++
  • include/uapi/linux/sched.h (9) -++++++++
  • include/uapi/linux/vt.h (15) -++++++++++++++
  • init/Kconfig (61) ---++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • init/main.c (3) -++
  • kernel/Makefile (2) -+
  • kernel/configs/android-base.config (1) +
  • kernel/delayacct.c (2) -+
  • kernel/exit.c (4) --++
  • kernel/kthread.c (30) -+++++++++++++++++++++++++++++
  • kernel/livepatch/transition.c (8) -+++++++
  • kernel/sched/Makefile (13) ----+++++++++
  • kernel/sched/MuQSS.c
  • kernel/sched/MuQSS.h
  • kernel/sched/cpufreq_schedutil.c (12) ++++++++++++
  • kernel/sched/cputime.c (22) ---------------------+
  • kernel/sched/fair.c (25) +++++++++++++++++++++++++
  • kernel/sched/idle.c (13) ---++++++++++
  • kernel/sched/sched.h (30) -+++++++++++++++++++++++++++++
  • kernel/skip_list.c (148) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • kernel/sysctl.c (52) ---+++++++++++++++++++++++++++++++++++++++++++++++++
  • kernel/time/clockevents.c (5) +++++
  • kernel/time/posix-cpu-timers.c (8) ----++++
  • kernel/time/timer.c (7) --+++++
  • kernel/trace/trace_selftest.c (5) +++++
  • lib/Kconfig.debug (3) -++
  • mm/page-writeback.c (8) ++++++++
  • net/ipv4/Kconfig (4) ++++
  • scripts/mkcompile_h (4) --++
11 11 groups (switching back to time distribution when needed to keep
12 12 throughput high).
13 13
14 If bfq-mq patches have been applied, then the following three
15 instances of BFQ are available (otherwise only the first instance):
16 - bfq: mainline version of BFQ, for blk-mq
17 - bfq-mq: development version of BFQ for blk-mq; this version contains
18 also all latest features and fixes not yet landed in mainline, plus many
19 safety checks
20 - bfq-sq: BFQ for legacy blk; this version also contains the latest features
21 and fixes, as well as safety checks
22
14 23 In its default configuration, BFQ privileges latency over
15 24 throughput. So, when needed for achieving a lower latency, BFQ builds
16 25 schedules that may lead to a lower throughput. If your main or only
34 34 to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
35 35 multi-queue devices.
36 36
37 The table of contents follow. Impatients can just jump to Section 3.
37 The table of contents follows. Impatients can just jump to Section 3.
38 38
39 39 CONTENTS
40 40
519 519 BFQ must of course be the active scheduler for that device.
520 520
521 521 Within each group directory, the names of the files associated with
522 BFQ-specific cgroup parameters and stats begin with the "bfq."
523 prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
524 BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
525 parameter to set the weight of a group with BFQ is blkio.bfq.weight
522 BFQ-specific cgroup parameters and stats begin with the "bfq.",
523 "bfq-sq." or "bfq-mq." prefix, depending on which instance of bfq you
524 want to use. So, with cgroups-v1 or cgroups-v2, the full prefix for
525 BFQ-specific files is "blkio.bfqX." or "io.bfqX.", where X can be ""
526 (i.e., null string), "-sq" or "-mq". For example, the group parameter
527 to set the weight of a group with the mainline BFQ is blkio.bfq.weight
526 528 or io.bfq.weight.
527 529
528 530 Parameters to set
532 532
533 533 For each group, there is only the following parameter to set.
534 534
535 weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
535 weight (namely blkio.bfqX.weight or io.bfqX.weight): the weight of the
536 536 group inside its parent. Available values: 1..10000 (default 100). The
537 537 linear mapping between ioprio and weights, described at the beginning
538 538 of the tunable section, is still valid, but all weights higher than
1 BFS - The Brain Fuck Scheduler by Con Kolivas.
2
3 Goals.
4
5 The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to
6 completely do away with the complex designs of the past for the cpu process
7 scheduler and instead implement one that is very simple in basic design.
8 The main focus of BFS is to achieve excellent desktop interactivity and
9 responsiveness without heuristics and tuning knobs that are difficult to
10 understand, impossible to model and predict the effect of, and when tuned to
11 one workload cause massive detriment to another.
12
13
14 Design summary.
15
16 BFS is best described as a single runqueue, O(n) lookup, earliest effective
17 virtual deadline first design, loosely based on EEVDF (earliest eligible virtual
18 deadline first) and my previous Staircase Deadline scheduler. Each component
19 shall be described in order to understand the significance of, and reasoning for
20 it. The codebase when the first stable version was released was approximately
21 9000 lines less code than the existing mainline linux kernel scheduler (in
22 2.6.31). This does not even take into account the removal of documentation and
23 the cgroups code that is not used.
24
25 Design reasoning.
26
27 The single runqueue refers to the queued but not running processes for the
28 entire system, regardless of the number of CPUs. The reason for going back to
29 a single runqueue design is that once multiple runqueues are introduced,
30 per-CPU or otherwise, there will be complex interactions as each runqueue will
31 be responsible for the scheduling latency and fairness of the tasks only on its
32 own runqueue, and to achieve fairness and low latency across multiple CPUs, any
33 advantage in throughput of having CPU local tasks causes other disadvantages.
34 This is due to requiring a very complex balancing system to at best achieve some
35 semblance of fairness across CPUs and can only maintain relatively low latency
36 for tasks bound to the same CPUs, not across them. To increase said fairness
37 and latency across CPUs, the advantage of local runqueue locking, which makes
38 for better scalability, is lost due to having to grab multiple locks.
39
40 A significant feature of BFS is that all accounting is done purely based on CPU
41 used and nowhere is sleep time used in any way to determine entitlement or
42 interactivity. Interactivity "estimators" that use some kind of sleep/run
43 algorithm are doomed to fail to detect all interactive tasks, and to falsely tag
44 tasks that aren't interactive as being so. The reason for this is that it is
45 close to impossible to determine, when a task is sleeping, whether it is
46 doing so voluntarily, as in a userspace application waiting for input in the
47 form of a mouse click or otherwise, or involuntarily, because it is waiting for
48 another thread, process, I/O, kernel activity or whatever. Thus, such an
49 estimator will introduce corner cases, and more heuristics will be required to
50 cope with those corner cases, introducing more corner cases and failed
51 interactivity detection and so on. Interactivity in BFS is built into the design
52 by virtue of the fact that tasks that are waking up have not used up their quota
53 of CPU time, and have earlier effective deadlines, thereby making it very likely
54 they will preempt any CPU bound task of equivalent nice level. See below for
55 more information on the virtual deadline mechanism. Even if they do not preempt
56 a running task, because the rr interval is guaranteed to have a bounded upper
57 limit on how long a task will wait for, it will be scheduled within a timeframe
58 that will not cause visible interface jitter.
59
60
61 Design details.
62
63 Task insertion.
64
65 BFS inserts tasks into each relevant queue as an O(1) insertion into a double
66 linked list. On insertion, *every* running queue is checked to see if the newly
67 queued task can run on any idle queue, or preempt the lowest running task on the
68 system. This is how the cross-CPU scheduling of BFS achieves significantly lower
69 latency per extra CPU the system has. In this case the lookup is, in the worst
70 case scenario, O(n) where n is the number of CPUs on the system.
71
72 Data protection.
73
74 BFS has one single lock protecting the process local data of every task in the
75 global queue. Thus every insertion, removal and modification of task data in the
76 global runqueue needs to grab the global lock. However, once a task is taken by
77 a CPU, the CPU has its own local data copy of the running process' accounting
78 information which only that CPU accesses and modifies (such as during a
79 timer tick) thus allowing the accounting data to be updated lockless. Once a
80 CPU has taken a task to run, it removes it from the global queue. Thus the
81 global queue only ever has, at most,
82
83 (number of tasks requesting cpu time) - (number of logical CPUs) + 1
84
85 tasks in the global queue. This value is relevant for the time taken to look up
86 tasks during scheduling. It will increase if many tasks have CPU affinity set
87 in their policy to limit which CPUs they're allowed to run on and they outnumber
88 the CPUs. The +1 is because when rescheduling a task, the CPU's
89 currently running task is put back on the queue. Lookup will be described after
90 the virtual deadline mechanism is explained.
91
92 Virtual deadline.
93
94 The key to achieving low latency, scheduling fairness, and "nice level"
95 distribution in BFS is entirely in the virtual deadline mechanism. The one
96 tunable in BFS is the rr_interval, or "round robin interval". This is the
97 maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy)
98 tasks of the same nice level will be running for, or looking at it the other
99 way around, the longest duration two tasks of the same nice level will be
100 delayed for. When a task requests cpu time, it is given a quota (time_slice)
101 equal to the rr_interval and a virtual deadline. The virtual deadline is
102 offset from the current time in jiffies by this equation:
103
104 jiffies + (prio_ratio * rr_interval)
105
106 The prio_ratio is determined as a ratio compared to the baseline of nice -20
107 and increases by 10% per nice level. The deadline is a virtual one only in that
108 no guarantee is placed that a task will actually be scheduled by this time, but
109 it is used to compare which task should go next. There are three components to
110 how a task is next chosen. First is time_slice expiration. If a task runs out
111 of its time_slice, it is descheduled, the time_slice is refilled, and the
112 deadline reset to that formula above. Second is sleep, where a task no longer
113 is requesting CPU for whatever reason. The time_slice and deadline are _not_
114 adjusted in this case and are just carried over for when the task is next
115 scheduled. Third is preemption, and that is when a newly waking task is deemed
116 higher priority than a currently running task on any cpu by virtue of the fact
117 that it has an earlier virtual deadline than the currently running task. The
118 earlier deadline is the key to which task is next chosen for the first and
119 second cases. Once a task is descheduled, it is put back on the queue, and an
120 O(n) lookup of all queued-but-not-running tasks is done to determine which has
121 the earliest deadline and that task is chosen to receive CPU next.
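
The deadline formula above can be sketched in C as follows. This is only an illustration, not the scheduler's code: the fixed-point base of 128 for nice -20, the integer rounding, and the names init_prio_ratios and virtual_deadline are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_WIDTH 40   /* nice levels -20 .. +19 */
#define RATIO_BASE 128  /* assumed fixed-point value of the ratio at nice -20 */

/* Fill ratios[] so each nice level is 10% larger than the previous one. */
static void init_prio_ratios(uint32_t ratios[NICE_WIDTH])
{
    ratios[0] = RATIO_BASE;
    for (int i = 1; i < NICE_WIDTH; i++)
        ratios[i] = ratios[i - 1] * 11 / 10;
}

/* deadline = jiffies + prio_ratio * rr_interval, scaled back by RATIO_BASE. */
static uint64_t virtual_deadline(const uint32_t ratios[NICE_WIDTH],
                                 uint64_t now_jiffies, int nice,
                                 uint32_t rr_interval)
{
    assert(nice >= -20 && nice <= 19);
    return now_jiffies +
           (uint64_t)ratios[nice + 20] * rr_interval / RATIO_BASE;
}
```

With an rr_interval of 6, a nice -20 task gets a deadline 6 units ahead of the current time, while a nice 0 task gets a noticeably later one, so the nice -20 task is chosen first whenever both are queued.
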
122
123 The CPU proportion of different nice tasks works out to be approximately the
124
125 (prio_ratio difference)^2
126
127 The reason it is squared is that a task's deadline does not change while it is
128 running unless it runs out of time_slice. Thus, even if the time actually
129 passes the deadline of another task that is queued, it will not get CPU time
130 unless the current running task deschedules, and the time "base" (jiffies) is
131 constantly moving.
132
133 Task lookup.
134
135 BFS has 103 priority queues. 100 of these are dedicated to the static priority
136 of realtime tasks, and the remaining 3 are, in order of best to worst priority,
137 SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority
138 scheduling). When a task of these priorities is queued, a bitmap of running
139 priorities is set showing which of these priorities has tasks waiting for CPU
140 time. When a CPU is made to reschedule, the lookup for the next task to get
141 CPU time is performed in the following way:
142
143 First the bitmap is checked to see what static priority tasks are queued. If
144 any realtime priorities are found, the corresponding queue is checked and the
145 first task listed there is taken (provided CPU affinity is suitable) and lookup
146 is complete. If the priority corresponds to a SCHED_ISO task, they are also
147 taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds
148 to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this
149 stage, every task in the runlist that corresponds to that priority is checked
150 to see which has the earliest set deadline, and (provided it has suitable CPU
151 affinity) it is taken off the runqueue and given the CPU. If a task has an
152 expired deadline, it is taken and the rest of the lookup aborted (as they are
153 chosen in FIFO order).
154
155 Thus, the lookup is O(n) in the worst case only, where n is as described
156 earlier, as tasks may be chosen before the whole task list is looked over.
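
The O(n) part of the lookup for SCHED_NORMAL and SCHED_IDLEPRIO tasks might look roughly like the sketch below. The struct layout, the 64-bit affinity mask and the function name are hypothetical and chosen for the example; the priority bitmap and the realtime queues are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified task record for illustration only. */
struct task {
    uint64_t deadline;      /* virtual deadline in jiffies */
    uint64_t cpus_allowed;  /* bit i set => task may run on CPU i */
    struct task *next;      /* next queued task at the same priority */
};

/*
 * Walk one priority list and return the task with the earliest deadline
 * that is allowed to run on 'cpu'. A task whose deadline has already
 * expired is taken immediately, which ends the scan early.
 */
static struct task *earliest_deadline_task(struct task *head, int cpu,
                                           uint64_t now)
{
    assert(cpu >= 0 && cpu < 64);
    struct task *best = NULL;

    for (struct task *t = head; t != NULL; t = t->next) {
        if (!(t->cpus_allowed & (UINT64_C(1) << cpu)))
            continue;               /* unsuitable CPU affinity */
        if (t->deadline <= now)
            return t;               /* expired: first one found wins */
        if (best == NULL || t->deadline < best->deadline)
            best = t;
    }
    return best;
}
```

The early return is what makes O(n) a worst case rather than the typical cost: the scan stops at the first expired deadline it meets.
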
157
158
159 Scalability.
160
161 The major limitation of BFS will be that of scalability, as the separate
162 runqueue designs will have less lock contention as the number of CPUs rises.
163 However they do not scale linearly even with separate runqueues as multiple
164 runqueues will need to be locked concurrently on such designs to be able to
165 achieve fair CPU balancing, to try and achieve some sort of nice-level fairness
166 across CPUs, and to achieve low enough latency for tasks on a busy CPU when
167 other CPUs would be more suited. BFS has the advantage that it requires no
168 balancing algorithm whatsoever, as balancing occurs by proxy simply because
169 all CPUs draw off the global runqueue, in priority and deadline order. Despite
170 the fact that scalability is _not_ the prime concern of BFS, it both shows very
171 good scalability to smaller numbers of CPUs and is likely a more scalable design
172 at these numbers of CPUs.
173
174 It also has some very low overhead scalability features built into the design
175 when it has been deemed their overhead is so marginal that they're worth adding.
176 The first is the local copy of the running process' data to the CPU it's running
177 on to allow that data to be updated lockless where possible. Then there is
178 deference paid to the last CPU a task was running on, by trying that CPU first
179 when looking for an idle CPU to use the next time it's scheduled. Finally there
180 is the notion of cache locality beyond the last running CPU. The sched_domains
181 information is used to determine the relative virtual "cache distance" that
182 other CPUs have from the last CPU a task was running on. CPUs with shared
183 caches, such as SMT siblings, or multicore CPUs with shared caches, are treated
184 as cache local. CPUs without shared caches are treated as not cache local, and
185 CPUs on different NUMA nodes are treated as very distant. This "relative cache
186 distance" is used by modifying the virtual deadline value when doing lookups.
187 Effectively, the deadline is unaltered between "cache local" CPUs, doubled for
188 "cache distant" CPUs, and quadrupled for "very distant" CPUs. The reasoning
189 behind the doubling of deadlines is as follows. The real cost of migrating a
190 task from one CPU to another is entirely dependent on the cache footprint of
191 the task, how cache intensive the task is, how long it's been running on that
192 CPU to take up the bulk of its cache, how big the CPU cache is, how fast and
193 how layered the CPU cache is, how fast a context switch is... and so on. In
194 other words, it's close to random in the real world where we do more than just
195 one sole workload. The only thing we can be sure of is that it's not free. So
196 BFS uses the principle that an idle CPU is a wasted CPU and utilising idle CPUs
197 is more important than cache locality, and cache locality only plays a part
198 after that. Doubling the effective deadline is based on the premise that the
199 "cache local" CPUs will tend to work on the same tasks up to double the number
200 of cache local CPUs, and once the workload is beyond that amount, it is likely
201 that none of the tasks are cache warm anywhere anyway. The quadrupling for NUMA
202 is a value I pulled out of my arse.
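
A minimal sketch of the deadline scaling by cache distance follows. It assumes that the scaling applies to the time remaining until the deadline rather than to the absolute jiffies value, and that an already expired deadline is left unchanged; the text does not specify either point, and the enum and function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical locality classes; the value is the shift applied below. */
enum cache_distance {
    CACHE_LOCAL   = 0,  /* shared cache: deadline unaltered */
    CACHE_DISTANT = 1,  /* no shared cache: deadline doubled */
    NUMA_DISTANT  = 2   /* other NUMA node: deadline quadrupled */
};

/*
 * Deadline as seen from a CPU at the given distance from the CPU the
 * task last ran on. Assumption: only the offset from 'now' is scaled.
 */
static uint64_t effective_deadline(uint64_t deadline, uint64_t now,
                                   enum cache_distance dist)
{
    assert((int)dist >= CACHE_LOCAL && (int)dist <= NUMA_DISTANT);
    if (deadline <= now)
        return deadline;    /* assumed: expired deadlines are not scaled */
    return now + ((deadline - now) << dist);
}
```

A task looks less urgent to a distant CPU than to a cache local one, so it tends to be picked up nearby first, while a CPU with nothing else queued will still take it.
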
203
204 When choosing an idle CPU for a waking task, the cache locality is determined
205 according to where the task last ran and then idle CPUs are ranked from best
206 to worst to choose the most suitable idle CPU based on cache locality, NUMA
207 node locality and hyperthread sibling business. They are chosen in the
208 following preference (if idle):
209
210 * Same core, idle or busy cache, idle threads
211 * Other core, same cache, idle or busy cache, idle threads.
212 * Same node, other CPU, idle cache, idle threads.
213 * Same node, other CPU, busy cache, idle threads.
214 * Same core, busy threads.
215 * Other core, same cache, busy threads.
216 * Same node, other CPU, busy threads.
217 * Other node, other CPU, idle cache, idle threads.
218 * Other node, other CPU, busy cache, idle threads.
219 * Other node, other CPU, busy threads.
220
221 This shows the SMT or "hyperthread" awareness in the design as well which will
222 choose a real idle core first before a logical SMT sibling which already has
223 tasks on the physical CPU.
224
225 Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark.
226 However this benchmarking was performed on an earlier design that was far less
227 scalable than the current one so it's hard to know how scalable it is in terms
228 of both CPUs (due to the global runqueue) and heavily loaded machines (due to
229 O(n) lookup) at this stage. Note that in terms of scalability, the number of
230 _logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x)
231 quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer benchmark
232 results are very promising indeed, without needing to tweak any knobs, features
233 or options. Benchmark contributions are most welcome.
234
235
236 Features
237
238 As the initial prime target audience for BFS was the average desktop user, it
239 was designed to not need tweaking, tuning or have features set to obtain benefit
240 from it. Thus the number of knobs and features has been kept to an absolute
241 minimum and should not require extra user input for the vast majority of cases.
242 There are precisely 2 tunables, and 2 extra scheduling policies. The rr_interval
243 and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition
244 to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is
245 support for CGROUPS. The average user should neither need to know what these
246 are, nor should they need to be using them to have good desktop behaviour.
247
248 rr_interval
249
250 There is only one "scheduler" tunable, the round robin interval. This can be
251 accessed in
252
253 /proc/sys/kernel/rr_interval
254
255 The value is in milliseconds, and the default value is set to 6 on a
256 uniprocessor machine, and automatically set to a progressively higher value on
257 multiprocessor machines. The reasoning behind increasing the value on more CPUs
258 is that the effective latency is decreased by virtue of there being more CPUs on
259 BFS (for reasons explained above), and increasing the value allows for less
260 cache contention and more throughput. Valid values are from 1 to 1000.
261 Decreasing the value will decrease latencies at the cost of decreasing
262 throughput, while increasing it will improve throughput, but at the cost of
263 worsening latencies. The accuracy of the rr interval is limited by HZ resolution
264 of the kernel configuration. Thus, the worst case latencies are usually slightly
265 higher than this actual value. The default value of 6 is not an arbitrary one.
266 It is based on the fact that humans can detect jitter at approximately 7ms, so
267 aiming for much lower latencies is pointless under most circumstances. It is
268 worth noting this fact when comparing the latency performance of BFS to other
269 schedulers. Worst case latencies being higher than 7ms are far worse than
270 average latencies not being in the microsecond range.
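
For illustration, the tunable could be read from userspace along the following lines. The helper names are made up for this sketch, and the file only exists on a kernel built with this scheduler, so the read is allowed to fail.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one integer from a sysctl-style file.
 * Returns 0 on success and -1 if the file is missing or unparsable.
 */
static int read_tunable(const char *path, long *value)
{
    assert(path != NULL && value != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;  /* e.g. the running kernel does not provide it */
    int ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Check a candidate value against the documented range of 1 to 1000 ms. */
static int rr_interval_is_valid(long ms)
{
    return ms >= 1 && ms <= 1000;
}
```

Calling read_tunable("/proc/sys/kernel/rr_interval", &ms) and then rr_interval_is_valid(ms) is enough to report the current setting and to sanity check a new value before writing it back.
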
271
272 Isochronous scheduling.
273
274 Isochronous scheduling is a unique scheduling policy designed to provide
275 near-real-time performance to unprivileged (ie non-root) users without the
276 ability to starve the machine indefinitely. Isochronous tasks (which means
277 "same time") are set using, for example, the schedtool application like so:
278
279 schedtool -I -e amarok
280
281 This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works
282 is that it has a priority level between true realtime tasks and SCHED_NORMAL
283 which would allow them to preempt all normal tasks, in a SCHED_RR fashion (ie,
284 if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval
285 rate). However if ISO tasks run for more than a tunable finite amount of time,
286 they are then demoted back to SCHED_NORMAL scheduling. This finite amount of
287 time is the percentage of _total CPU_ available across the machine, configurable
288 as a percentage in the following "resource handling" tunable (as opposed to a
289 scheduler tunable):
290
291 /proc/sys/kernel/iso_cpu
292
293 and is set to 70% by default. It is calculated over a rolling 5 second average.
294 Because it is the total CPU available, it means that on a multi CPU machine, it
295 is possible to have an ISO task running as realtime scheduling indefinitely on
296 just one CPU, as the other CPUs will be available. Setting this to 100 is the
297 equivalent of giving all users SCHED_RR access and setting it to 0 removes the
298 ability to run any pseudo-realtime tasks.
299
300 A feature of BFS is that it detects when an application tries to obtain a
301 realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the
302 appropriate privileges to use those policies. When it detects this, it will
303 give the task SCHED_ISO policy instead. Thus it is transparent to the user.
304 Because some applications constantly set their policy as well as their nice
305 level, there is potential for them to undo the override specified by the user
306 on the command line of setting the policy to SCHED_ISO. To counter this, once
307 a task has been set to SCHED_ISO policy, it needs superuser privileges to set
308 it back to SCHED_NORMAL. This will ensure the task remains ISO and all child
309 processes and threads will also inherit the ISO policy.
310
311 Idleprio scheduling.
312
313 Idleprio scheduling is a scheduling policy designed to give out CPU to a task
314 _only_ when the CPU would be otherwise idle. The idea behind this is to allow
315 ultra low priority tasks to be run in the background that have virtually no
316 effect on the foreground tasks. This is ideally suited to distributed computing
317 clients (like setiathome, folding, mprime etc) but can also be used to start
318 a video encode or so on without any slowdown of other tasks. To prevent this
319 policy from grabbing shared resources and holding them indefinitely, if it
320 detects a state where the task is waiting on I/O, the machine is about to
321 suspend to ram and so on, it will transiently schedule them as SCHED_NORMAL. As
322 per the Isochronous task management, once a task has been scheduled as IDLEPRIO,
323 it cannot be put back to SCHED_NORMAL without superuser privileges. Tasks can
324 be set to start as SCHED_IDLEPRIO with the schedtool command like so:
325
326 schedtool -D -e ./mprime
327
328 Subtick accounting.
329
330 It is surprisingly difficult to get accurate CPU accounting, and in many cases,
331 the accounting is done by simply determining what is happening at the precise
332 moment a timer tick fires off. This becomes increasingly inaccurate as the
333 timer tick frequency (HZ) is lowered. It is possible to create an application
334 which uses almost 100% CPU, yet by being descheduled at the right time, records
335 zero CPU usage. While the main problem with this is that there are possible
336 security implications, it is also difficult to determine how much CPU a task
337 really does use. BFS tries to use the sub-tick accounting from the TSC clock,
338 where possible, to determine real CPU usage. This is not entirely reliable, but
339 is far more likely to produce accurate CPU usage data than the existing designs
340 and will not show tasks as consuming no CPU usage when they actually are. Thus,
341 the amount of CPU reported as being used by BFS will more accurately represent
342 how much CPU the task itself is using (as is shown for example by the 'time'
343 application), so the reported values may be quite different to other schedulers.
344 Values reported as the 'load' are more prone to problems with this design, but
345 per process values are closer to real usage. When comparing throughput of BFS
346 to other designs, it is important to compare the actual completed work in terms
347 of total wall clock time taken and total work done, rather than the reported
348 "cpu usage".
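
The weakness of tick-based accounting described above can be shown with a small simulation. This is a toy model rather than kernel code: the run intervals, the nanosecond units and both function names are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One stretch of time a task spent on the CPU: [start_ns, end_ns). */
struct run_interval {
    uint64_t start_ns;
    uint64_t end_ns;
};

/* Tick sampling: charge a whole tick if the task is running when it fires. */
static uint64_t tick_sampled_ns(const struct run_interval *runs, size_t n,
                                uint64_t tick_ns, uint64_t end_ns)
{
    assert(tick_ns > 0);
    uint64_t charged = 0;

    for (uint64_t t = tick_ns; t <= end_ns; t += tick_ns) {
        for (size_t i = 0; i < n; i++) {
            if (runs[i].start_ns <= t && t < runs[i].end_ns) {
                charged += tick_ns;
                break;
            }
        }
    }
    return charged;
}

/* Sub-tick accounting: sum the time actually spent running. */
static uint64_t exact_ns(const struct run_interval *runs, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += runs[i].end_ns - runs[i].start_ns;
    return total;
}
```

A task that runs for 900 ns out of every 1000 ns tick but is always off the CPU at the instant the tick fires is charged nothing by the sampling model, while the exact model reports 90% usage.
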
349
350
351 Con Kolivas <kernel@kolivas.org> Fri Aug 27 2010
1 MuQSS - The Multiple Queue Skiplist Scheduler by Con Kolivas.
2
3 MuQSS is a per-cpu runqueue variant of the original BFS scheduler with
4 one 8 level skiplist per runqueue, and fine grained locking for much more
5 scalability.
6
7
8 Goals.
9
10 The goal of the Multiple Queue Skiplist Scheduler, referred to as MuQSS from
11 here on (pronounced mux) is to completely do away with the complex designs of
12 the past for the cpu process scheduler and instead implement one that is very
13 simple in basic design. The main focus of MuQSS is to achieve excellent desktop
14 interactivity and responsiveness without heuristics and tuning knobs that are
15 difficult to understand, impossible to model and predict the effect of, and when
16 tuned to one workload cause massive detriment to another, while still being
17 scalable to many CPUs and processes.
18
19
20 Design summary.
21
22 MuQSS is best described as per-cpu multiple runqueue, O(log n) insertion, O(1)
23 lookup, earliest effective virtual deadline first tickless design, loosely based
24 on EEVDF (earliest eligible virtual deadline first) and my previous Staircase
25 Deadline scheduler, and evolved from the single runqueue O(n) BFS scheduler.
26 Each component shall be described in order to understand the significance of,
27 and reasoning for it.
28
29
30 Design reasoning.
31
32 In BFS, the use of a single runqueue across all CPUs meant that each CPU would
33 need to scan the entire runqueue looking for the process with the earliest
34 deadline and schedule that next, regardless of which CPU it originally came
35 from. This made BFS deterministic with respect to latency and provided
36 guaranteed latencies dependent on number of processes and CPUs. The single
37 runqueue, however, meant that all CPUs would compete for the single lock
38 protecting it, which would lead to increasing lock contention as the number of
39 CPUs rose and appeared to limit scalability of common workloads beyond 16
40 logical CPUs. Additionally, the O(n) lookup of the runqueue list obviously
41 increased overhead proportionate to the number of queued processes and led to
42 cache thrashing while iterating over the linked list.
43
44 MuQSS is an evolution of BFS, designed to maintain the same scheduling
45 decision mechanism and be virtually deterministic without relying on the
46 constrained design of the single runqueue by splitting out the single runqueue
47 to be per-CPU and use skiplists instead of linked lists.
48
49 The original reason for going back to a single runqueue design for BFS was that
50 once multiple runqueues are introduced, per-CPU or otherwise, there will be
51 complex interactions as each runqueue will be responsible for the scheduling
52 latency and fairness of the tasks only on its own runqueue, and to achieve
53 fairness and low latency across multiple CPUs, any advantage in throughput of
54 having CPU local tasks causes other disadvantages. This is due to requiring a
55 very complex balancing system to at best achieve some semblance of fairness
56 across CPUs and can only maintain relatively low latency for tasks bound to the
57 same CPUs, not across them. To increase said fairness and latency across CPUs,
58 the advantage of local runqueue locking, which makes for better scalability, is
59 lost due to having to grab multiple locks.
60
61 MuQSS works around the problems inherent in multiple runqueue designs by
62 making its skip lists priority ordered and through novel use of lockless
63 examination of each other runqueue it can decide if it should take the earliest
64 deadline task from another runqueue for latency reasons, or for CPU balancing
65 reasons. It still does not have a separate balancing system; instead,
66 balancing happens as a side effect of the next task scheduling decision and
67 the choice of CPU at task wakeup.
68
69
70 Design details.
71
72 Custom skip list implementation:
73
74 To avoid the overhead of building up and tearing down skip list structures,
75 the variant used by MuQSS has a number of optimisations making it specific for
76 its use case in the scheduler. It uses static arrays of 8 'levels' instead of
77 building up and tearing down structures dynamically. This makes each runqueue
78 only scale O(log N) up to 64k tasks. However as there is one runqueue per CPU
79 it means that it scales O(log N) up to 64k x number of logical CPUs which is
80 far beyond the realistic task limits each CPU could handle. By being 8 levels
81 it also makes the array exactly one cacheline in size. Additionally, each
82 skip list node is bidirectional making insertion and removal amortised O(1),
83 being O(k) where k is 1-8. Uniquely, we are only ever interested in the very
84 first entry in each list at all times with MuQSS, so there is never a need to
85 do a search and thus look up is always O(1). In interactive mode, the queues
86 will be searched beyond their first entry if the first task is not suitable
87 for affinity or SMT nice reasons.
88
89 Task insertion:
90
91 MuQSS inserts tasks into a per CPU runqueue as an O(log N) insertion into
92 a custom skip list as described above (based on the original design by William
93 Pugh). Insertion is ordered in such a way that there is never a need to do a
94 search by ordering tasks according to static priority primarily, and then
95 virtual deadline at the time of insertion.
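The ordering rule above can be illustrated with a minimal sketch. It assumes a plain sorted Python list as a stand-in for the skip list (so insertion cost differs from the real structure), and the `Task` class, `enqueue` helper, and sample values are hypothetical names invented for this illustration, not kernel identifiers.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    # Comparison key: static priority first, then virtual deadline.
    # Lower values sort first in both fields.
    static_prio: int
    deadline: int
    name: str = field(compare=False)  # label only, ignored when ordering


def enqueue(runqueue: list, task: Task) -> None:
    """Insert task so the runqueue stays sorted by (static_prio, deadline)."""
    bisect.insort(runqueue, task)


rq = []
enqueue(rq, Task(120, 9_000_000, "batch"))
enqueue(rq, Task(120, 4_000_000, "editor"))
enqueue(rq, Task(100, 8_000_000, "rt-ish"))

# Because insertion keeps the order, picking the next task is just rq[0].
next_task = rq[0]
```

Because all the ordering work happens at insertion, the consumer never searches: the head entry is always the next candidate.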
96
97 Niffies:
98
99 Niffies are a monotonic forward moving timer not unlike the "jiffies" but are
100 of nanosecond resolution. Niffies are calculated per-runqueue from the high
101 resolution TSC timers, and in order to maintain fairness are synchronised
102 between CPUs whenever both runqueues are locked concurrently.
103
104 Virtual deadline:
105
106 The key to achieving low latency, scheduling fairness, and "nice level"
107 distribution in MuQSS is entirely in the virtual deadline mechanism. The one
108 tunable in MuQSS is the rr_interval, or "round robin interval". This is the
109 maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy)
110 tasks of the same nice level will be running for, or looking at it the other
111 way around, the longest duration two tasks of the same nice level will be
112 delayed for. When a task requests cpu time, it is given a quota (time_slice)
113 equal to the rr_interval and a virtual deadline. The virtual deadline is
114 offset from the current time in niffies by this equation:
115
116 niffies + (prio_ratio * rr_interval)
117
118 The prio_ratio is determined as a ratio compared to the baseline of nice -20
119 and increases by 10% per nice level. The deadline is a virtual one only in that
120 no guarantee is placed that a task will actually be scheduled by this time, but
121 it is used to compare which task should go next. There are three components to
122 how a task is next chosen. First is time_slice expiration. If a task runs out
123 of its time_slice, it is descheduled, the time_slice is refilled, and the
124 deadline reset to that formula above. Second is sleep, where a task no longer
125 is requesting CPU for whatever reason. The time_slice and deadline are _not_
126 adjusted in this case and are just carried over for when the task is next
127 scheduled. Third is preemption, and that is when a newly waking task is deemed
128 higher priority than a currently running task on any cpu by virtue of the fact
129 that it has an earlier virtual deadline than the currently running task. The
130 earlier deadline is the key to which task is next chosen for the first and
131 second cases.
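As a rough sketch of the deadline formula, the following assumes that the 10% per nice level compounds from a baseline ratio of 1.0 at nice -20 and uses floating point for readability. The function names and the float arithmetic are assumptions made for this example and are not taken from the scheduler source.

```python
RR_INTERVAL_MS = 6        # default rr_interval stated in the text
NS_PER_MS = 1_000_000


def prio_ratio(nice: int) -> float:
    """Ratio relative to nice -20, assumed to compound by 10% per nice level."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in -20..19")
    return 1.1 ** (nice + 20)


def virtual_deadline(niffies_ns: int, nice: int,
                     rr_interval_ms: int = RR_INTERVAL_MS) -> int:
    """Compute niffies + (prio_ratio * rr_interval), in nanoseconds."""
    return niffies_ns + round(prio_ratio(nice) * rr_interval_ms * NS_PER_MS)
```

With these assumptions a nice -20 task queued at niffies 0 gets a deadline 6 ms away, and every higher nice level pushes the deadline further out.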
132
133 The CPU proportion of different nice tasks works out to be approximately the
134
135 (prio_ratio difference)^2
136
137 The reason it is squared is that a task's deadline does not change while it is
138 running unless it runs out of time_slice. Thus, even if the time actually
139 passes the deadline of another task that is queued, it will not get CPU time
140 unless the current running task deschedules, and the time "base" (niffies) is
141 constantly moving.
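To put numbers on the squared relationship, the sketch below reads "prio_ratio difference" as the ratio between the two tasks' prio_ratio values and assumes the 10% step compounds per nice level. Both readings are assumptions, and `cpu_share_ratio` is a hypothetical helper name.

```python
def cpu_share_ratio(nice_low: int, nice_high: int) -> float:
    """Approximate CPU time of the lower-nice task relative to the higher-nice one."""
    per_level = 1.10                       # 10% per nice level, from the text
    ratio = per_level ** (nice_high - nice_low)
    return ratio ** 2                      # squared, as the text describes


one_level = cpu_share_ratio(0, 1)          # roughly 1.21
full_range = cpu_share_ratio(0, 19)        # roughly 37
```

Under this reading, one nice level is worth about 21% more CPU, and a nice 0 task gets on the order of 37 times the CPU of a nice 19 task.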
142
143 Task lookup:
144
145 As tasks are already pre-ordered according to anticipated scheduling order in
146 the skip lists, lookup for the next suitable task per-runqueue is always a
147 matter of simply selecting the first task in the 0th level skip list entry.
148 In order to maintain optimal latency and fairness across CPUs, MuQSS does a
149 novel examination of every other runqueue in cache locality order, choosing the
150 best task across all runqueues. This provides near-determinism of how long any
151 task across the entire system may wait before receiving CPU time. The other
152 runqueues are first examined locklessly and then trylocked to minimise the
153 potential lock contention if they are likely to have a suitable better task.
154 Each other runqueue lock is only held for as long as it takes to examine the
155 entry for suitability. In "interactive" mode, the default setting, MuQSS will
156 look for the best deadline task across all CPUs, while in !interactive mode,
157 it will only select a better deadline task from another CPU if it is more
158 heavily laden than the current one.
159
160 Lookup is therefore O(k) where k is number of CPUs.
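The cross-runqueue lookup can be sketched as follows. This is a simplified model of "interactive" mode only: it ignores locking, affinity and SMT nice, represents each runqueue as a sorted list of `(static_prio, deadline, name)` tuples, and takes the cache locality order as a plain list. `pick_next` and its parameters are hypothetical names for this illustration.

```python
from typing import Optional


def pick_next(runqueues: list, this_cpu: int, locality_order: list) -> Optional[tuple]:
    """Return (cpu, task) for the best head entry, looking at one entry per CPU."""
    best = None
    visit = [this_cpu] + [c for c in locality_order if c != this_cpu]
    for cpu in visit:
        rq = runqueues[cpu]
        if not rq:
            continue                      # nothing queued on this CPU
        head = rq[0]                      # only the first entry is examined
        # Strict '<' means that on a tie the CPU visited earlier (closer) wins.
        if best is None or head[:2] < best[1][:2]:
            best = (cpu, head)
    return best


runqueues = [
    [(120, 50, "local")],
    [(120, 10, "remote-early")],
    [],
]
choice = pick_next(runqueues, this_cpu=0, locality_order=[1, 2])
```

The loop touches at most one entry per CPU, which is where the O(k) cost in the number of CPUs comes from.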
161
162
163 Latency.
164
165 Through the use of virtual deadlines to govern the scheduling order of normal
166 tasks, queue-to-activation latency per runqueue is guaranteed to be bound by
167 the rr_interval tunable which is set to 6ms by default. This means that the
168 longest a CPU bound task will wait for more CPU is proportional to the number
169 of running tasks and in the common case of 0-2 running tasks per CPU, will be
170 under the 7ms threshold for human perception of jitter. Additionally, as newly
171 woken tasks will have an early deadline from their previous runtime, the very
172 tasks that are usually latency sensitive will have the shortest interval for
173 activation, usually preempting any existing CPU bound tasks.
174
175 Tickless expiry:
176
177 A feature of MuQSS is that it is not tied to the resolution of the chosen tick
178 rate in Hz, instead depending entirely on the high resolution timers where
179 possible for sub-millisecond accuracy on timeouts regardless of the underlying
180 tick rate. This allows MuQSS to be run with the low overhead of low Hz rates
181 such as 100 by default, benefiting from the improved throughput and lower
182 power usage it provides. Another advantage of this approach is that in
183 combination with the Full No HZ option, which disables ticks on running task
184 CPUs instead of just idle CPUs, the tick can be disabled at all times
185 regardless of how many tasks are running instead of being limited to just one
186 running task. Note that this option is NOT recommended for regular desktop
187 users.
188
189
190 Scalability and balancing.
191
192 Unlike traditional approaches where balancing is a combination of CPU selection
193 at task wakeup and intermittent balancing based on a vast array of rules set
194 according to architecture, busyness calculations and special case management,
195 MuQSS indirectly balances on the fly at task wakeup and next task selection.
196 During initialisation, MuQSS creates a cache coherency ordered list of CPUs for
197 each logical CPU and uses this to aid task/CPU selection when CPUs are busy.
198 Additionally it selects any idle CPUs, if they are available, at any time over
199 busy CPUs according to the following preference:
200
201 * Same thread, idle or busy cache, idle or busy threads
202 * Other core, same cache, idle or busy cache, idle threads.
203 * Same node, other CPU, idle cache, idle threads.
204 * Same node, other CPU, busy cache, idle threads.
205 * Other core, same cache, busy threads.
206 * Same node, other CPU, busy threads.
207 * Other node, other CPU, idle cache, idle threads.
208 * Other node, other CPU, busy cache, idle threads.
209 * Other node, other CPU, busy threads.
210
211 Mux is therefore SMT, MC and Numa aware without the need for extra
212 intermittent balancing to maintain CPUs busy and make the most of cache
213 coherency.
214
215
216 Features
217
218 As the initial prime target audience for MuQSS was the average desktop user, it
219 was designed to not need tweaking, tuning or have features set to obtain benefit
220 from it. Thus the number of knobs and features has been kept to an absolute
221 minimum and should not require extra user input for the vast majority of cases.
222 There are 3 optional tunables, and 2 extra scheduling policies. The rr_interval,
223 interactive, and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO
224 policies. In addition to this, MuQSS also uses sub-tick accounting. What MuQSS
225 does _not_ now feature is support for CGROUPS. The average user should neither
226 need to know what these are, nor should they need to be using them to have good
227 desktop behaviour. However since some applications refuse to work without
228 cgroups, one can enable them with MuQSS as a stub and the filesystem will be
229 created which will allow the applications to work.
230
231 rr_interval:
232
233 /proc/sys/kernel/rr_interval
234
235 The value is in milliseconds, and the default value is set to 6. Valid values
236 are from 1 to 1000. Decreasing the value will decrease latencies at the cost of
237 decreasing throughput, while increasing it will improve throughput, but at the
238 cost of worsening latencies. It is based on the fact that humans can detect
239 jitter at approximately 7ms, so aiming for much lower latencies is pointless
240 under most circumstances. It is worth noting this fact when comparing the
241 latency performance of MuQSS to other schedulers. Worst case latencies being
242 higher than 7ms are far worse than average latencies not being in the
243 microsecond range.
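A small helper for reading and setting this tunable might look like the sketch below. It assumes a kernel built with MuQSS (otherwise the file does not exist) and sufficient privileges to write to it. The helper names are hypothetical; only the path and the 1 to 1000 range come from the text above.

```python
from pathlib import Path

RR_INTERVAL_PATH = Path("/proc/sys/kernel/rr_interval")  # path given in the text


def format_rr_interval(ms: int) -> str:
    """Validate against the documented 1..1000 range and format for writing."""
    if not 1 <= ms <= 1000:
        raise ValueError("rr_interval must be between 1 and 1000 milliseconds")
    return f"{ms}\n"


def set_rr_interval(ms: int, path: Path = RR_INTERVAL_PATH) -> None:
    """Write a new rr_interval; normally requires root."""
    path.write_text(format_rr_interval(ms))


def get_rr_interval(path: Path = RR_INTERVAL_PATH) -> int:
    """Read the current rr_interval in milliseconds."""
    return int(path.read_text().strip())
```

Validating before writing keeps an out-of-range value from ever reaching the kernel interface.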
244
245 interactive:
246
247 /proc/sys/kernel/interactive
248
249 The value is a simple boolean of 1 for on and 0 for off and is set to on by
250 default. Disabling this will disable the near-determinism of MuQSS when
251 selecting the next task by not examining all CPUs for the earliest deadline
252 task, or which CPU to wake to, instead prioritising CPU balancing for improved
253 throughput. Latency will still be bound by rr_interval, but on a per-CPU basis
254 instead of across the whole system.
255
256 Isochronous scheduling:
257
258 Isochronous scheduling is a unique scheduling policy designed to provide
259 near-real-time performance to unprivileged (ie non-root) users without the
260 ability to starve the machine indefinitely. Isochronous tasks (which means
261 "same time") are set using, for example, the schedtool application like so:
262
263 schedtool -I -e amarok
264
265 This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works
266 is that it has a priority level between true realtime tasks and SCHED_NORMAL
267 which would allow them to preempt all normal tasks, in a SCHED_RR fashion (ie,
268 if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval
269 rate). However if ISO tasks run for more than a tunable finite amount of time,
270 they are then demoted back to SCHED_NORMAL scheduling. This finite amount of
271 time is the percentage of CPU available per CPU, configurable as a percentage in
272 the following "resource handling" tunable (as opposed to a scheduler tunable):
273
274 iso_cpu:
275
276 /proc/sys/kernel/iso_cpu
277
278 and is set to 70% by default. It is calculated over a rolling 5 second average.
279 Because it is the total CPU available, it means that on a multi CPU machine, it
280 is possible to have an ISO task running as realtime scheduling indefinitely on
281 just one CPU, as the other CPUs will be available. Setting this to 100 is the
282 equivalent of giving all users SCHED_RR access and setting it to 0 removes the
283 ability to run any pseudo-realtime tasks.
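The effect of the limit can be modelled with a simple rolling window. The sketch below is an illustration only: it assumes a sliding window of explicit samples rather than whatever accounting the kernel actually uses, and `IsoBudget` and its methods are hypothetical names.

```python
from collections import deque


class IsoBudget:
    """Rolling-window model of the iso_cpu limit (simplified, not kernel code)."""

    def __init__(self, iso_cpu_percent: int = 70, window_s: float = 5.0,
                 n_cpus: int = 1):
        self.iso_cpu_percent = iso_cpu_percent
        self.window_s = window_s
        self.n_cpus = n_cpus
        self.samples = deque()  # (timestamp_s, CPU seconds used by ISO tasks)

    def account(self, now_s: float, iso_seconds: float) -> None:
        """Record ISO CPU time ending at now_s and drop samples outside the window."""
        self.samples.append((now_s, iso_seconds))
        while self.samples[0][0] <= now_s - self.window_s:
            self.samples.popleft()

    def over_budget(self) -> bool:
        """True when ISO tasks would be demoted to SCHED_NORMAL behaviour."""
        used = sum(seconds for _, seconds in self.samples)
        capacity = self.window_s * self.n_cpus  # total CPU seconds in the window
        return 100.0 * used / capacity > self.iso_cpu_percent
```

In this model, one ISO task saturating a single CPU of a two-CPU machine uses 50% of the total and stays under the default 70%, which matches the multi-CPU behaviour described above.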
284
285 A feature of MuQSS is that it detects when an application tries to obtain a
286 realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the
287 appropriate privileges to use those policies. When it detects this, it will
288 give the task SCHED_ISO policy instead. Thus it is transparent to the user.
289
290
291 Idleprio scheduling:
292
293 Idleprio scheduling is a scheduling policy designed to give out CPU to a task
294 _only_ when the CPU would be otherwise idle. The idea behind this is to allow
295 ultra low priority tasks to be run in the background that have virtually no
296 effect on the foreground tasks. This is ideally suited to distributed computing
297 clients (like setiathome, folding, mprime etc) but can also be used to start a
298 video encode or so on without any slowdown of other tasks. To avoid this policy
299 from grabbing shared resources and holding them indefinitely, if it detects a
300 state where the task is waiting on I/O, the machine is about to suspend to ram
301 and so on, it will transiently schedule them as SCHED_NORMAL. Once a task has
302 been scheduled as IDLEPRIO, it cannot be put back to SCHED_NORMAL without
303 superuser privileges since it is effectively a lower scheduling policy. Tasks
304 can be set to start as SCHED_IDLEPRIO with the schedtool command like so:
305
306 schedtool -D -e ./mprime
307
308 Subtick accounting:
309
310 It is surprisingly difficult to get accurate CPU accounting, and in many cases,
311 the accounting is done by simply determining what is happening at the precise
312 moment a timer tick fires off. This becomes increasingly inaccurate as the timer
313 tick frequency (HZ) is lowered. It is possible to create an application which
314 uses almost 100% CPU, yet by being descheduled at the right time, records zero
315 CPU usage. While the main problem with this is that there are possible security
316 implications, it is also difficult to determine how much CPU a task really does
317 use. Mux uses sub-tick accounting from the TSC clock to determine real CPU
318 usage. Thus, the amount of CPU reported as being used by MuQSS will more
319 accurately represent how much CPU the task itself is using (as is shown for
320 example by the 'time' application), so the reported values may be quite
321 different to other schedulers. When comparing throughput of MuQSS to other
322 designs, it is important to compare the actual completed work in terms of total
323 wall clock time taken and total work done, rather than the reported "cpu usage".
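The difference between tick-sampled and sub-tick accounting can be shown with a toy simulation. It assumes HZ=100 (a 10 ms tick), integer microsecond timestamps, and a hypothetical task that always sleeps across the tick boundary. The function names are invented for this sketch.

```python
def tick_sampled_usage(busy, tick_period, duration):
    """Fraction of timer ticks at which the task was found running."""
    ticks = range(tick_period, duration + 1, tick_period)
    hits = sum(1 for t in ticks
               if any(start <= t < end for start, end in busy))
    return hits / len(ticks)


def subtick_usage(busy, duration):
    """Fraction of wall time the task really ran, from exact timestamps."""
    return sum(end - start for start, end in busy) / duration


TICK_US = 10_000           # 10 ms tick, i.e. HZ=100
DURATION_US = 1_000_000    # one second of simulated time

# The task runs for 9 ms out of every 10 ms but is never on CPU at a tick.
sneaky = [(k * TICK_US + 500, k * TICK_US + 9_500) for k in range(100)]

sampled = tick_sampled_usage(sneaky, TICK_US, DURATION_US)
actual = subtick_usage(sneaky, DURATION_US)
```

Here the sampled figure is zero while the task actually consumed 90% of the CPU, which is the discrepancy sub-tick accounting is meant to remove.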
324
325 Symmetric MultiThreading (SMT) aware nice:
326
327 SMT, a.k.a. hyperthreading, is a very common feature on modern CPUs. While the
328 logical CPU count rises by adding thread units to each CPU core, allowing more
329 than one task to be run simultaneously on the same core, the disadvantage of it
330 is that the CPU power is shared between the tasks, not summating to the power
331 of two CPUs. The practical upshot of this is that two tasks running on
332 separate threads of the same core run significantly slower than if they had one
333 core each to run on. While smart CPU selection allows each task to have a core
334 to itself whenever available (as is done on MuQSS), it cannot offset the
335 slowdown that occurs when the cores are all loaded and only a thread is left.
336 Most of the time this is harmless as the CPU is effectively overloaded at this
337 point and the extra thread is of benefit. However when running a niced task in
338 the presence of an un-niced task (say nice 19 v nice 0), the nice task gets
339 precisely the same amount of CPU power as the unniced one. MuQSS has an
340 optional configuration feature known as SMT-NICE which selectively idles the
341 secondary niced thread for a period proportional to the nice difference,
342 allowing CPU distribution according to nice level to be maintained, at the
343 expense of a small amount of extra overhead. If this is configured in on a
344 machine without SMT threads, the overhead is minimal.
345
346
347 Con Kolivas <kernel@kolivas.org> Sat, 29th October 2016
39 39 - hung_task_timeout_secs
40 40 - hung_task_warnings
41 41 - kexec_load_disabled
42 - iso_cpu
42 43 - kptr_restrict
43 44 - l2cr [ PPC only ]
44 45 - modprobe ==> Documentation/debugging-modules.txt
73 73 - randomize_va_space
74 74 - real-root-dev ==> Documentation/admin-guide/initrd.rst
75 75 - reboot-cmd [ SPARC only ]
76 - rr_interval
76 77 - rtsig-max
77 78 - rtsig-nr
78 79 - sem
94 94 - unknown_nmi_panic
95 95 - watchdog
96 96 - watchdog_thresh
97 - yield_type
97 98 - version
98 99
99 100 ==============================================================
397 397
398 398 ==============================================================
399 399
400 iso_cpu: (MuQSS CPU scheduler only).
401
402 This sets the percentage cpu that the unprivileged SCHED_ISO tasks can
403 run effectively at realtime priority, averaged over a rolling five
404 seconds over the -whole- system, meaning all cpus.
405
406 Set to 70 (percent) by default.
407
408 ==============================================================
409
400 410 l2cr: (PPC only)
401 411
402 412 This flag controls the L2 cache of G3 processor boards. If
823 823
824 824 ==============================================================
825 825
826 rr_interval: (MuQSS CPU scheduler only)
827
828 This is the smallest duration that any cpu process scheduling unit
829 will run for. Increasing this value can increase throughput of cpu
830 bound tasks substantially but at the expense of increased latencies
831 overall. Conversely decreasing it will decrease average and maximum
832 latencies but at the expense of throughput. This value is in
833 milliseconds and the default value chosen depends on the number of
834 cpus available at scheduler initialisation with a minimum of 6.
835
836 Valid values are from 1-1000.
837
838 ==============================================================
839
826 840 rtsig-max & rtsig-nr:
827 841
828 842 The file rtsig-max can be used to tune the maximum number
1073 1073
1074 1074 The softlockup threshold is (2 * watchdog_thresh). Setting this
1075 1075 tunable to zero will disable lockup detection altogether.
1076
1077 ==============================================================
1078
1079 yield_type: (MuQSS CPU scheduler only)
1080
1081 This determines what type of yield calls to sched_yield will perform.
1082
1083 0: No yield.
1084 1: Yield only to better priority/deadline tasks. (default)
1085 2: Expire timeslice and recalculate deadline.
1076 1086
1077 1087 ==============================================================
1 tp_smapi version 0.40
2 IBM ThinkPad hardware functions driver
3
4 Author: Shem Multinymous <multinymous@gmail.com>
5 Project: http://sourceforge.net/projects/tpctl
6 Wiki: http://thinkwiki.org/wiki/tp_smapi
7 List: linux-thinkpad@linux-thinkpad.org
8 (http://mailman.linux-thinkpad.org/mailman/listinfo/linux-thinkpad)
9
10 Description
11 -----------
12
13 ThinkPad laptops include a proprietary interface called SMAPI BIOS
14 (System Management Application Program Interface) which provides some
15 hardware control functionality that is not accessible by other means.
16
17 This driver exposes some features of the SMAPI BIOS through a sysfs
18 interface. It is suitable for newer models, on which SMAPI is invoked
19 through IO port writes. Older models use a different SMAPI interface;
20 for those, try the "thinkpad" module from the "tpctl" package.
21
22 WARNING:
23 This driver uses undocumented features and direct hardware access.
24 It thus cannot be guaranteed to work, and may cause arbitrary damage
25 (especially on models it wasn't tested on).
26
27
28 Module parameters
29 -----------------
30
31 thinkpad_ec module:
32 force_io=1 lets thinkpad_ec load on some recent ThinkPad models
33 (e.g., T400 and T500) whose BIOS's ACPI DSDT reserves the ports we need.
34 tp_smapi module:
35 debug=1 enables verbose dmesg output.
36
37
38 Usage
39 -----
40
41 Control of battery charging thresholds (in percents of current full charge
42 capacity):
43
44 # echo 40 > /sys/devices/platform/smapi/BAT0/start_charge_thresh
45 # echo 70 > /sys/devices/platform/smapi/BAT0/stop_charge_thresh
46 # cat /sys/devices/platform/smapi/BAT0/*_charge_thresh
47
48 (This is useful since Li-Ion batteries wear out much faster at very
49 high or low charge levels. The driver will also keep the thresholds
50 across suspend-to-disk with AC disconnected; this isn't done
51 automatically by the hardware.)
52
53 Inhibiting battery charging for 17 minutes (overrides thresholds):
54
55 # echo 17 > /sys/devices/platform/smapi/BAT0/inhibit_charge_minutes
56 # echo 0 > /sys/devices/platform/smapi/BAT0/inhibit_charge_minutes # stop
57 # cat /sys/devices/platform/smapi/BAT0/inhibit_charge_minutes
58
59 (This can be used to control which battery is charged when using an
60 Ultrabay battery.)
61
62 Forcing battery discharging even if AC power available:
63
64 # echo 1 > /sys/devices/platform/smapi/BAT0/force_discharge # start discharge
65 # echo 0 > /sys/devices/platform/smapi/BAT0/force_discharge # stop discharge
66 # cat /sys/devices/platform/smapi/BAT0/force_discharge
67
68 (When AC is connected, forced discharging will automatically stop
69 when battery is fully depleted -- this is useful for calibration.
70 Also, this attribute can be used to control which battery is discharged
71 when both a system battery and an Ultrabay battery are connected.)
72
73 Misc read-only battery status attributes (see note about HDAPS below):
74
75 /sys/devices/platform/smapi/BAT0/installed # 0 or 1
76 /sys/devices/platform/smapi/BAT0/state # idle/charging/discharging
77 /sys/devices/platform/smapi/BAT0/cycle_count # integer counter
78 /sys/devices/platform/smapi/BAT0/current_now # instantaneous current
79 /sys/devices/platform/smapi/BAT0/current_avg # last minute average
80 /sys/devices/platform/smapi/BAT0/power_now # instantaneous power
81 /sys/devices/platform/smapi/BAT0/power_avg # last minute average
82 /sys/devices/platform/smapi/BAT0/last_full_capacity # in mWh
83 /sys/devices/platform/smapi/BAT0/remaining_percent # remaining percent of energy (set by calibration)
84 /sys/devices/platform/smapi/BAT0/remaining_percent_error # error range of remaining_percent (not reset by calibration)
85 /sys/devices/platform/smapi/BAT0/remaining_running_time # in minutes, by last minute average power
86 /sys/devices/platform/smapi/BAT0/remaining_running_time_now # in minutes, by instantaneous power
87 /sys/devices/platform/smapi/BAT0/remaining_charging_time # in minutes
88 /sys/devices/platform/smapi/BAT0/remaining_capacity # in mWh
89 /sys/devices/platform/smapi/BAT0/design_capacity # in mWh
90 /sys/devices/platform/smapi/BAT0/voltage # in mV
91 /sys/devices/platform/smapi/BAT0/design_voltage # in mV
92 /sys/devices/platform/smapi/BAT0/charging_max_current # max charging current
93 /sys/devices/platform/smapi/BAT0/charging_max_voltage # max charging voltage
94 /sys/devices/platform/smapi/BAT0/group{0,1,2,3}_voltage # see below
95 /sys/devices/platform/smapi/BAT0/manufacturer # string
96 /sys/devices/platform/smapi/BAT0/model # string
97 /sys/devices/platform/smapi/BAT0/barcoding # string
98 /sys/devices/platform/smapi/BAT0/chemistry # string
99 /sys/devices/platform/smapi/BAT0/serial # integer
100 /sys/devices/platform/smapi/BAT0/manufacture_date # YYYY-MM-DD
101 /sys/devices/platform/smapi/BAT0/first_use_date # YYYY-MM-DD
102 /sys/devices/platform/smapi/BAT0/temperature # in milli-Celsius
103 /sys/devices/platform/smapi/BAT0/dump # see below
104 /sys/devices/platform/smapi/ac_connected # 0 or 1
105
106 The BAT0/group{0,1,2,3}_voltage attribute refers to the separate cell groups
107 in each battery. For example, on the ThinkPad 600, X3x, T4x and R5x models,
108 the battery contains 3 cell groups in series, with each group consisting of 2
109 or 3 cells connected in parallel. The voltage of each group is given by these
110 attributes, and their sum (roughly) equals the "voltage" attribute.
111 (The effective performance of the battery is determined by the weakest group,
112 i.e., the one whose voltage changes most rapidly during dis/charging.)
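A minimal sketch of how the group readings might be interpreted is given below. It assumes the group voltages are reported in mV like the "voltage" attribute, and it works on values already read from sysfs. The helper names and the sample readings are hypothetical.

```python
def pack_voltage_mv(group_mv):
    """Sum of the series-connected group voltages; roughly the 'voltage' attribute."""
    return sum(group_mv)


def weakest_group(before_mv, after_mv):
    """Index of the group whose voltage changed most between two readings."""
    deltas = [abs(after - before) for before, after in zip(before_mv, after_mv)]
    return deltas.index(max(deltas))


# Made-up readings taken some minutes apart while discharging.
earlier = [4100, 4090, 4105]
later = [4050, 3980, 4060]

total = pack_voltage_mv(earlier)
suspect = weakest_group(earlier, later)
```

Comparing two readings taken under the same load gives a quick hint as to which group limits the battery.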
113
114 The "BAT0/dump" attribute gives a hex dump of the raw status data, which
115 contains additional data not in the above (if you can figure it out). Some
116 unused values are autodetected and replaced by "--":
117
118 In all of the above, replace BAT0 with BAT1 to address the 2nd battery (e.g.
119 in the UltraBay).
120
121
122 Raw SMAPI calls:
123
124 /sys/devices/platform/smapi/smapi_request
125 This performs raw SMAPI calls. It uses a bad interface that cannot handle
126 multiple simultaneous access. Don't touch it, it's for development only.
127 If you did touch it, you would do something like
128 # echo '211a 100 0 0' > /sys/devices/platform/smapi/smapi_request
129 # cat /sys/devices/platform/smapi/smapi_request
130 and notice that in the output "211a 34b b2 0 0 0 'OK'", the "4b" in the 2nd
131 value, converted to decimal is 75: the current charge stop threshold.
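The decoding step in this example can be written out as a short sketch. It assumes the reply is a whitespace-separated list of hexadecimal fields and that the threshold is the low byte of the second field, as the example reply suggests. `charge_stop_threshold` is a hypothetical helper, not part of the driver.

```python
def charge_stop_threshold(reply: str) -> int:
    """Extract the low byte of the second hex field of a smapi_request reply."""
    fields = reply.split()
    return int(fields[1], 16) & 0xFF   # 0x34b & 0xff == 0x4b


threshold = charge_stop_threshold("211a 34b b2 0 0 0 'OK'")
```

For the reply quoted above this yields 75, matching the stated charge stop threshold.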
132
133
134 Model-specific status
135 ---------------------
136
137 Works (at least partially) on the following ThinkPad models:
138 * A30
139 * G41
140 * R40, R50p, R51, R52
141 * T23, T40, T40p, T41, T41p, T42, T42p, T43, T43p, T60
142 * X24, X31, X32, X40, X41, X60
143 * Z60t, Z61m
144
145 Not all functions are available on all models; for detailed status, see:
146 http://thinkwiki.org/wiki/tp_smapi
147
148 Please report success/failure by e-mail or on the Wiki.
149 If you get a "not implemented" or "not supported" message, your laptop
150 probably just can't do that (at least not via the SMAPI BIOS).
151 For negative reports, follow the bug reporting guidelines below.
152 If you send me the necessary technical data (i.e., SMAPI function
153 interfaces), I will support additional models.
154
155
156 Additional HDAPS features
157 -------------------------
158
159 The modified hdaps driver has several improvements on the one in mainline
160 (beyond resolving the conflict with thinkpad_ec and tp_smapi):
161
162 - Fixes reliability and improves support for recent ThinkPad models
163 (especially *60 and newer). Unlike the mainline driver, the modified hdaps
164 correctly follows the Embedded Controller communication protocol.
165
166 - Extends the "invert" parameter to cover all possible axis orientations.
167 The possible values are as follows.
168 Let X,Y denote the hardware readouts.
169 Let R denote the laptop's roll (tilt left/right).
170 Let P denote the laptop's pitch (tilt forward/backward).
171 invert=0: R= X P= Y (same as mainline)
172 invert=1: R=-X P=-Y (same as mainline)
173 invert=2: R=-X P= Y (new)
174 invert=3: R= X P=-Y (new)
175 invert=4: R= Y P= X (new)
176 invert=5: R=-Y P=-X (new)
177 invert=6: R=-Y P= X (new)
178 invert=7: R= Y P=-X (new)
179 It's probably easiest to just try all 8 possibilities and see which yields
180 correct results (e.g., in the hdaps-gl visualisation).
181
182 - Adds a whitelist which automatically sets the correct axis orientation for
183 some models. If the value for your model is wrong or missing, you can override
184 it using the "invert" parameter. Please also update the tables at
185 http://www.thinkwiki.org/wiki/tp_smapi and
186 http://www.thinkwiki.org/wiki/List_of_DMI_IDs
187 and submit a patch for the whitelist in hdaps.c.
188
189 - Provides new attributes:
190 /sys/devices/platform/hdaps/sampling_rate:
191 This determines the frequency at which the host queries the embedded
192 controller for accelerometer data (and informs the hdaps input devices).
193 Default=50.
194 /sys/devices/platform/hdaps/oversampling_ratio:
195 When set to X, the embedded controller is told to do physical accelerometer
196 measurements at a rate that is X times higher than the rate at which
197 the driver reads those measurements (i.e., X*sampling_rate). This
198 makes the readouts from the embedded controller more fresh, and is also
199 useful for the running average filter (see next). Default=5
200 /sys/devices/platform/hdaps/running_avg_filter_order:
201 When set to X, reported readouts will be the average of the last X physical
202 accelerometer measurements. Current firmware allows 1<=X<=8. Setting to a
203 high value decreases readout fluctuations. The averaging is handled by the
204 embedded controller, so no CPU resources are used. Higher values make the
205 readouts smoother, since it averages out both sensor noise (good) and abrupt
206 changes (bad). Default=2.
207
208 - Provides a second input device, which publishes the raw accelerometer
209 measurements (without the fuzzing needed for joystick emulation). This input
210 device can be matched by a udev rule such as the following (all on one line):
211 KERNEL=="event[0-9]*", ATTRS{phys}=="hdaps/input1",
212 ATTRS{modalias}=="input:b0019v1014p5054e4801-*",
213 SYMLINK+="input/hdaps/accelerometer-event"
214
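The "invert" table in the list above can be expressed as a lookup, which may help when checking which setting matches a given laptop. This is only an illustrative sketch: `INVERT_TABLE` and `roll_pitch` are hypothetical names, and the entries are copied from the table rather than from hdaps.c.

```python
# invert value -> ((roll sign, roll source axis), (pitch sign, pitch source axis))
INVERT_TABLE = {
    0: ((+1, "X"), (+1, "Y")),
    1: ((-1, "X"), (-1, "Y")),
    2: ((-1, "X"), (+1, "Y")),
    3: ((+1, "X"), (-1, "Y")),
    4: ((+1, "Y"), (+1, "X")),
    5: ((-1, "Y"), (-1, "X")),
    6: ((-1, "Y"), (+1, "X")),
    7: ((+1, "Y"), (-1, "X")),
}


def roll_pitch(x: int, y: int, invert: int) -> tuple:
    """Map raw hardware readouts (x, y) to (roll, pitch) for an invert setting."""
    raw = {"X": x, "Y": y}
    (roll_sign, roll_axis), (pitch_sign, pitch_axis) = INVERT_TABLE[invert]
    return roll_sign * raw[roll_axis], pitch_sign * raw[pitch_axis]


# All eight settings applied to the same sample readout.
all_orientations = [roll_pitch(3, 5, i) for i in range(8)]
```

Each of the eight settings produces a distinct orientation for a readout with different non-zero X and Y values, so trying them in turn is enough to find the right one.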
215 A new version of the hdapsd userspace daemon, which uses the input device
216 interface instead of polling sysfs, is available separately. Using this reduces
217 the total interrupts per second generated by hdaps+hdapsd (on tickless kernels)
218 to 50, down from a value that fluctuates between 50 and 100. Set the
219 sampling_rate sysfs attribute to a lower value to further reduce interrupts,
220 at the expense of response latency.
221
222 Licensing note: all my changes to the HDAPS driver are licensed under the
223 GPL version 2 or, at your option and to the extent allowed by derivation from
224 prior works, any later version. My version of hdaps is derived work from the
225 mainline version, which at the time of writing is available only under
226 GPL version 2.
227
228 Bug reporting
229 -------------
230
231 Mail <multinymous@gmail.com>. Please include:
232 * Details about your model,
233 * Relevant "dmesg" output. Make sure thinkpad_ec and tp_smapi are loaded with
234 the "debug=1" parameter (e.g., use "make load HDAPS=1 DEBUG=1").
235 * Output of "dmidecode | grep -C5 Product"
236 * Does the failed functionality work under Windows?
237
238
239 More about SMAPI
240 ----------------
241
242 For hints about what may be possible via the SMAPI BIOS and how, see:
243
244 * IBM Technical Reference Manual for the ThinkPad 770
245 (http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=PFAN-3TUQQD)
246 * Exported symbols in PWRMGRIF.DLL or TPPWRW32.DLL (e.g., use "objdump -x").
247 * drivers/char/mwave/smapi.c in the Linux kernel tree.*
248 * The "thinkpad" SMAPI module (http://tpctl.sourceforge.net).
249 * The SMAPI_* constants in tp_smapi.c.
250
251 Note that in the above Technical Reference and in the "thinkpad" module,
252 SMAPI is invoked through a function call to some physical address. However,
253 the interface used by tp_smapi and the above mwave driver, and apparently
254 required by newer ThinkPads, is different: you set the parameters up in the
255 CPU's registers and write to ports 0xB2 (the APM control port) and 0x4F; this
256 triggers an SMI (System Management Interrupt), causing the CPU to enter
257 SMM (System Management Mode) and run the BIOS firmware; the results are
258 returned in the CPU's registers. It is not clear what the relation is between
259 the two variants of SMAPI, though the assignment of error codes seems to be
260 similar.
261
262 In addition, the embedded controller on ThinkPad laptops has a non-standard
263 interface at IO ports 0x1600-0x161F (mapped to LPC channel 3 of the H8S chip).
264 The interface provides various system management services (currently known:
265 battery information and accelerometer readouts). For more information see the
266 thinkpad_ec module and the H8S hardware documentation:
267 http://documentation.renesas.com/eng/products/mpumcu/rej09b0300_2140bhm.pdf
1 1 VERSION = 4
2 2 PATCHLEVEL = 12
3 SUBLEVEL = 11
4 EXTRAVERSION =
3 SUBLEVEL = 12
4 EXTRAVERSION = -backbone
5 5 NAME = Fearless Coyote
6 6
7 7 # *DOCUMENTATION*
639 639 KBUILD_CFLAGS += $(call cc-option,-Oz,-Os)
640 640 KBUILD_CFLAGS += $(call cc-disable-warning,maybe-uninitialized,)
641 641 else
642 ifdef CONFIG_CC_OPTIMIZE_HARDER
643 KBUILD_CFLAGS += -O3 $(call cc-disable-warning,maybe-uninitialized,)
644 else
642 645 ifdef CONFIG_PROFILE_ALL_BRANCHES
643 646 KBUILD_CFLAGS += -O2 $(call cc-disable-warning,maybe-uninitialized,)
644 647 else
645 648 KBUILD_CFLAGS += -O2
649 endif
646 650 endif
647 651 endif
648 652
65 65 static struct timer_list spuloadavg_timer;
66 66
67 67 /*
68 * Priority of a normal, non-rt, non-niced'd process (aka nice level 0).
69 */
70 #define NORMAL_PRIO 120
71
72 /*
73 68 * Frequency of the spu scheduler tick. By default we do one SPU scheduler
74 69 * tick for every 10 CPU scheduler ticks.
75 70 */
502 502 * In the case that a guest uses storage keys
503 503 * faults should no longer be backed by zero pages
504 504 */
505 #define mm_forbids_zeropage mm_use_skey
505 #define mm_forbids_zeropage mm_has_pgste
506 506 static inline int mm_use_skey(struct mm_struct *mm)
507 507 {
508 508 #ifdef CONFIG_PGSTE
2118 2118 }
2119 2119
2120 2120 /*
2121 * Remove all empty zero pages from the mapping for lazy refaulting
2122 * - This must be called after mm->context.has_pgste is set, to avoid
2123 * future creation of zero pages
2124 * - This must be called after THP was enabled
2125 */
2126 static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
2127 unsigned long end, struct mm_walk *walk)
2128 {
2129 unsigned long addr;
2130
2131 for (addr = start; addr != end; addr += PAGE_SIZE) {
2132 pte_t *ptep;
2133 spinlock_t *ptl;
2134
2135 ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
2136 if (is_zero_pfn(pte_pfn(*ptep)))
2137 ptep_xchg_direct(walk->mm, addr, ptep, __pte(_PAGE_INVALID));
2138 pte_unmap_unlock(ptep, ptl);
2139 }
2140 return 0;
2141 }
2142
2143 static inline void zap_zero_pages(struct mm_struct *mm)
2144 {
2145 struct mm_walk walk = { .pmd_entry = __zap_zero_pages };
2146
2147 walk.mm = mm;
2148 walk_page_range(0, TASK_SIZE, &walk);
2149 }
2150
2151 /*
2121 2152 * switch on pgstes for its userspace process (for kvm)
2122 2153 */
2123 2154 int s390_enable_sie(void)
2165 2165 mm->context.has_pgste = 1;
2166 2166 /* split thp mappings and disable thp for future mappings */
2167 2167 thp_split_mm(mm);
2168 zap_zero_pages(mm);
2168 2169 up_write(&mm->mmap_sem);
2169 2170 return 0;
2170 2171 }
2178 2178 static int __s390_enable_skey(pte_t *pte, unsigned long addr,
2179 2179 unsigned long next, struct mm_walk *walk)
2180 2180 {
2181 /*
2182 * Remove all zero page mappings,
2183 * after establishing a policy to forbid zero page mappings
2184 * following faults for that page will get fresh anonymous pages
2185 */
2186 if (is_zero_pfn(pte_pfn(*pte)))
2187 ptep_xchg_direct(walk->mm, addr, pte, __pte(_PAGE_INVALID));
2188 2181 /* Clear storage key */
2189 2182 ptep_zap_key(walk->mm, addr, pte);
2190 2183 return 0;
119 119 return addr;
120 120
121 121 check_asce_limit:
122 if (addr + len > current->mm->context.asce_limit) {
122 if (addr + len > current->mm->context.asce_limit &&
123 addr + len <= TASK_SIZE) {
123 124 rc = crst_table_upgrade(mm);
124 125 if (rc)
125 126 return (unsigned long) rc;
184 184 }
185 185
186 186 check_asce_limit:
187 if (addr + len > current->mm->context.asce_limit) {
187 if (addr + len > current->mm->context.asce_limit &&
188 addr + len <= TASK_SIZE) {
188 189 rc = crst_table_upgrade(mm);
189 190 if (rc)
190 191 return (unsigned long) rc;
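
The tightened check_asce_limit condition in the two hunks above can be read as a range test on the end of the requested mapping. The sketch below restates it outside the kernel. The function name needs_table_upgrade and its parameters are hypothetical stand-ins for current->mm->context.asce_limit and TASK_SIZE, and it ignores wrap-around of addr + len, just as the condition in the diff does.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative predicate only: returns true when a mapping ending at
 * addr + len lies beyond the current address-space limit but still
 * fits within the maximum task size, which is the case in which the
 * patched code goes on to call crst_table_upgrade().
 */
static bool needs_table_upgrade(uint64_t addr, uint64_t len,
				uint64_t asce_limit, uint64_t task_size)
{
	uint64_t end = addr + len;

	return end > asce_limit && end <= task_size;
}
```

With the old single-comparison form, a request ending above task_size would also have triggered the upgrade. With the added bound, such a request skips it.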
937 937 depends on SMP
938 938 ---help---
939 939 SMT scheduler support improves the CPU scheduler's decision making
940 when dealing with Intel Pentium 4 chips with HyperThreading at a
940 when dealing with Intel P4/Core 2 chips with HyperThreading at a
941 941 cost of slightly increased overhead in some places. If unsure say
942 942 N here.
943
944 config SMT_NICE
945 bool "SMT (Hyperthreading) aware nice priority and policy support"
946 depends on SCHED_MUQSS && SCHED_SMT
947 default y
948 ---help---
949 Enabling Hyperthreading on Intel CPUs decreases the effectiveness
950 of the use of 'nice' levels and different scheduling policies
951 (e.g. realtime) due to sharing of CPU power between hyperthreads.
952 SMT nice support makes each logical CPU aware of what is running on
953 its hyperthread siblings, maintaining appropriate distribution of
954 CPU according to nice levels and scheduling policies at the expense
955 of slightly increased overhead.
956
957 If unsure say Y here.
958
943 959
944 960 config SCHED_MC
945 961 def_bool y
115 115 config MPENTIUM4
116 116 bool "Pentium-4/Celeron(P4-based)/Pentium-4 M/older Xeon"
117 117 depends on X86_32
118 select X86_P6_NOP
118 119 ---help---
119 120 Select this for Intel Pentium 4 chips. This includes the
120 121 Pentium 4, Pentium D, P4-based Celeron and Xeon, and
148 148 -Paxville
149 149 -Dempsey
150 150
151
152 151 config MK6
153 bool "K6/K6-II/K6-III"
152 bool "AMD K6/K6-II/K6-III"
154 153 depends on X86_32
155 154 ---help---
156 155 Select this for an AMD K6-family processor. Enables use of
157 157 flags to GCC.
158 158
159 159 config MK7
160 bool "Athlon/Duron/K7"
160 bool "AMD Athlon/Duron/K7"
161 161 depends on X86_32
162 162 ---help---
163 163 Select this for an AMD Athlon K7-family processor. Enables use of
165 165 flags to GCC.
166 166
167 167 config MK8
168 bool "Opteron/Athlon64/Hammer/K8"
168 bool "AMD Opteron/Athlon64/Hammer/K8"
169 169 ---help---
170 170 Select this for an AMD Opteron or Athlon64 Hammer-family processor.
171 171 Enables use of some extended instructions, and passes appropriate
172 172 optimization flags to GCC.
173 173
174 config MK8SSE3
175 bool "AMD Opteron/Athlon64/Hammer/K8 with SSE3"
176 ---help---
177 Select this for improved AMD Opteron or Athlon64 Hammer-family processors.
178 Enables use of some extended instructions, and passes appropriate
179 optimization flags to GCC.
180
181 config MK10
182 bool "AMD 61xx/7x50/PhenomX3/X4/II/K10"
183 ---help---
184 Select this for an AMD 61xx Eight-Core Magny-Cours, Athlon X2 7x50,
185 Phenom X3/X4/II, Athlon II X2/X3/X4, or Turion II-family processor.
186 Enables use of some extended instructions, and passes appropriate
187 optimization flags to GCC.
188
189 config MBARCELONA
190 bool "AMD Barcelona"
191 ---help---
192 Select this for AMD Family 10h Barcelona processors.
193
194 Enables -march=barcelona
195
196 config MBOBCAT
197 bool "AMD Bobcat"
198 ---help---
199 Select this for AMD Family 14h Bobcat processors.
200
201 Enables -march=btver1
202
203 config MJAGUAR
204 bool "AMD Jaguar"
205 ---help---
206 Select this for AMD Family 16h Jaguar processors.
207
208 Enables -march=btver2
209
210 config MBULLDOZER
211 bool "AMD Bulldozer"
212 ---help---
213 Select this for AMD Family 15h Bulldozer processors.
214
215 Enables -march=bdver1
216
217 config MPILEDRIVER
218 bool "AMD Piledriver"
219 ---help---
220 Select this for AMD Family 15h Piledriver processors.
221
222 Enables -march=bdver2
223
224 config MSTEAMROLLER
225 bool "AMD Steamroller"
226 ---help---
227 Select this for AMD Family 15h Steamroller processors.
228
229 Enables -march=bdver3
230
231 config MEXCAVATOR
232 bool "AMD Excavator"
233 ---help---
234 Select this for AMD Family 15h Excavator processors.
235
236 Enables -march=bdver4
237
238 config MZEN
239 bool "AMD Zen"
240 ---help---
241 Select this for AMD Family 17h Zen processors.
242
243 Enables -march=znver1
244
174 245 config MCRUSOE
175 246 bool "Crusoe"
176 247 depends on X86_32
323 323
324 324 config MPSC
325 325 bool "Intel P4 / older Netburst based Xeon"
326 select X86_P6_NOP
326 327 depends on X86_64
327 328 ---help---
328 329 Optimize for Intel Pentium 4, Pentium D and older Nocona/Dempsey
333 333 using the cpu family field
334 334 in /proc/cpuinfo. Family 15 is an older Xeon, Family 6 a newer one.
335 335
336 config MATOM
337 bool "Intel Atom"
338 select X86_P6_NOP
339 ---help---
340
341 Select this for the Intel Atom platform. Intel Atom CPUs have an
342 in-order pipelining architecture and thus can benefit from
343 accordingly optimized code. Use a recent GCC with specific Atom
344 support in order to fully benefit from selecting this option.
345
336 346 config MCORE2
337 bool "Core 2/newer Xeon"
347 bool "Intel Core 2"
348 select X86_P6_NOP
338 349 ---help---
339 350
340 351 Select this for Intel Core 2 and newer Core 2 Xeons (Xeon 51xx and
353 353 family in /proc/cpuinfo. Newer ones have 6 and older ones 15
354 354 (not a typo)
355 355
356 config MATOM
357 bool "Intel Atom"
356 Enables -march=core2
357
358 config MNEHALEM
359 bool "Intel Nehalem"
360 select X86_P6_NOP
358 361 ---help---
359 362
360 Select this for the Intel Atom platform. Intel Atom CPUs have an
361 in-order pipelining architecture and thus can benefit from
362 accordingly optimized code. Use a recent GCC with specific Atom
363 support in order to fully benefit from selecting this option.
363 Select this for 1st Gen Core processors in the Nehalem family.
364 364
365 Enables -march=nehalem
366
367 config MWESTMERE
368 bool "Intel Westmere"
369 select X86_P6_NOP
370 ---help---
371
372 Select this for the Intel Westmere formerly Nehalem-C family.
373
374 Enables -march=westmere
375
376 config MSILVERMONT
377 bool "Intel Silvermont"
378 select X86_P6_NOP
379 ---help---
380
381 Select this for the Intel Silvermont platform.
382
383 Enables -march=silvermont
384
385 config MSANDYBRIDGE
386 bool "Intel Sandy Bridge"
387 select X86_P6_NOP
388 ---help---
389
390 Select this for 2nd Gen Core processors in the Sandy Bridge family.
391
392 Enables -march=sandybridge
393
394 config MIVYBRIDGE
395 bool "Intel Ivy Bridge"
396 select X86_P6_NOP
397 ---help---
398
399 Select this for 3rd Gen Core processors in the Ivy Bridge family.
400
401 Enables -march=ivybridge
402
403 config MHASWELL
404 bool "Intel Haswell"
405 select X86_P6_NOP
406 ---help---
407
408 Select this for 4th Gen Core processors in the Haswell family.
409
410 Enables -march=haswell
411
412 config MBROADWELL
413 bool "Intel Broadwell"
414 select X86_P6_NOP
415 ---help---
416
417 Select this for 5th Gen Core processors in the Broadwell family.
418
419 Enables -march=broadwell
420
421 config MSKYLAKE
422 bool "Intel Skylake"
423 select X86_P6_NOP
424 ---help---
425
426 Select this for 6th Gen Core processors in the Skylake family.
427
428 Enables -march=skylake
429
365 430 config GENERIC_CPU
366 431 bool "Generic-x86-64"
367 432 depends on X86_64
434 434 Generic x86-64 CPU.
435 435 Run equally well on all x86-64 CPUs.
436 436
437 config MNATIVE
438 bool "Native optimizations autodetected by GCC"
439 ---help---
440
441 GCC 4.2 and above support -march=native, which automatically detects
442 the optimum settings to use based on your processor. -march=native
443 also detects and applies additional settings beyond -march specific
444 to your CPU, (eg. -msse4). Unless you have a specific reason not to
445 (e.g. distcc cross-compiling), you should probably be using
446 -march=native rather than anything listed below.
447
448 Enables -march=native
449
437 450 endchoice
438 451
439 452 config X86_GENERIC
471 471 config X86_L1_CACHE_SHIFT
472 472 int
473 473 default "7" if MPENTIUM4 || MPSC
474 default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
474 default "6" if MK7 || MK8 || MK8SSE3 || MK10 || MBARCELONA || MBOBCAT || MBULLDOZER || MPILEDRIVER || MSTEAMROLLER || MEXCAVATOR || MZEN || MJAGUAR || MPENTIUMM || MCORE2 || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MNATIVE || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
475 475 default "4" if MELAN || M486 || MGEODEGX1
476 476 default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
477 477
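
For orientation, the X86_L1_CACHE_SHIFT defaults above are base-2 exponents: the cache line size in bytes is 1 shifted left by that value. A small sketch of the arithmetic follows; the helper name l1_cache_bytes is hypothetical and not a kernel symbol.

```c
#include <stddef.h>

/* Cache line size in bytes implied by a cache shift value: 2^shift. */
static size_t l1_cache_bytes(unsigned int shift)
{
	return (size_t)1 << shift;
}
```

So the default of "7" for MPENTIUM4 and MPSC corresponds to 128-byte lines, while every CPU option added to the "6" list, including MNATIVE, is assumed to use 64-byte lines.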
502 502
503 503 config X86_INTEL_USERCOPY
504 504 def_bool y
505 depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || X86_GENERIC || MK8 || MK7 || MEFFICEON || MCORE2
505 depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || X86_GENERIC || MK8 || MK8SSE3 || MK7 || MEFFICEON || MCORE2 || MK10 || MBARCELONA || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MNATIVE
506 506
507 507 config X86_USE_PPRO_CHECKSUM
508 508 def_bool y
509 depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MCORE2 || MATOM
509 depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MK10 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MK8SSE3 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MCORE2 || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MATOM || MNATIVE
510 510
511 511 config X86_USE_3DNOW
512 512 def_bool y
513 513 depends on (MCYRIXIII || MK7 || MGEODE_LX) && !UML
514 514
515 #
516 # P6_NOPs are a relatively minor optimization that require a family >=
517 # 6 processor, except that it is broken on certain VIA chips.
518 # Furthermore, AMD chips prefer a totally different sequence of NOPs
519 # (which work on all CPUs). In addition, it looks like Virtual PC
520 # does not understand them.
521 #
522 # As a result, disallow these if we're not compiling for X86_64 (these
523 # NOPs do work on all x86-64 capable chips); the list of processors in
524 # the right-hand clause are the cores that benefit from this optimization.
525 #
526 515 config X86_P6_NOP
527 def_bool y
528 depends on X86_64
529 depends on (MCORE2 || MPENTIUM4 || MPSC)
516 default n
517 bool "Support for P6_NOPs on Intel chips"
518 depends on (MCORE2 || MPENTIUM4 || MPSC || MATOM || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MNATIVE)
519 ---help---
520 P6_NOPs are a relatively minor optimization that require a family >=
521 6 processor, except that it is broken on certain VIA chips.
522 Furthermore, AMD chips prefer a totally different sequence of NOPs
523 (which work on all CPUs). In addition, it looks like Virtual PC
524 does not understand them.
530 525
526 As a result, disallow these if we're not compiling for X86_64 (these
527 NOPs do work on all x86-64 capable chips); the list of processors in
528 the right-hand clause are the cores that benefit from this optimization.
529
530 Say Y if you have Intel CPU newer than Pentium Pro, N otherwise.
531
531 532 config X86_TSC
532 533 def_bool y
533 depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2 || MATOM) || X86_64
534 depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MK8SSE3 || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2 || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MNATIVE || MATOM) || X86_64
534 535
535 536 config X86_CMPXCHG64
536 537 def_bool y
537 depends on X86_PAE || X86_64 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MATOM
538 depends on X86_PAE || X86_64 || MCORE2 || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MATOM || MNATIVE
538 539
539 540 # this should be set for all -march=.. options where the compiler
540 541 # generates cmov.
541 542 config X86_CMOV
542 543 def_bool y
543 depends on (MK8 || MK7 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || X86_64 || MATOM || MGEODE_LX)
544 depends on (MK8 || MK8SSE3 || MK10 || MBARCELONA || MBOBCAT || MBULLDOZER || MPILEDRIVER || MSTEAMROLLER || MEXCAVATOR || MZEN || MJAGUAR || MK7 || MCORE2 || MNEHALEM || MWESTMERE || MSILVERMONT || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || X86_64 || MNATIVE || MATOM || MGEODE_LX)
544 545
545 546 config X86_MINIMUM_CPU_FAMILY
546 547 int
104 104 KBUILD_CFLAGS += $(call cc-option,-mskip-rax-setup)
105 105
106 106 # FIXME - should be integrated in Makefile.cpu (Makefile_32.cpu)
107 cflags-$(CONFIG_MNATIVE) += $(call cc-option,-march=native)
107 108 cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8)
109 cflags-$(CONFIG_MK8SSE3) += $(call cc-option,-march=k8-sse3,-mtune=k8)
110 cflags-$(CONFIG_MK10) += $(call cc-option,-march=amdfam10)
111 cflags-$(CONFIG_MBARCELONA) += $(call cc-option,-march=barcelona)
112 cflags-$(CONFIG_MBOBCAT) += $(call cc-option,-march=btver1)
113 cflags-$(CONFIG_MJAGUAR) += $(call cc-option,-march=btver2)
114 cflags-$(CONFIG_MBULLDOZER) += $(call cc-option,-march=bdver1)
115 cflags-$(CONFIG_MPILEDRIVER) += $(call cc-option,-march=bdver2)
116 cflags-$(CONFIG_MSTEAMROLLER) += $(call cc-option,-march=bdver3)
117 cflags-$(CONFIG_MEXCAVATOR) += $(call cc-option,-march=bdver4)
118 cflags-$(CONFIG_MZEN) += $(call cc-option,-march=znver1)
108 119 cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona)
109 120
110 121 cflags-$(CONFIG_MCORE2) += \
111 $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
112 cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom) \
113 $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
122 $(call cc-option,-march=core2,$(call cc-option,-mtune=core2))
123 cflags-$(CONFIG_MNEHALEM) += \
124 $(call cc-option,-march=nehalem,$(call cc-option,-mtune=nehalem))
125 cflags-$(CONFIG_MWESTMERE) += \
126 $(call cc-option,-march=westmere,$(call cc-option,-mtune=westmere))
127 cflags-$(CONFIG_MSILVERMONT) += \
128 $(call cc-option,-march=silvermont,$(call cc-option,-mtune=silvermont))
129 cflags-$(CONFIG_MSANDYBRIDGE) += \
130 $(call cc-option,-march=sandybridge,$(call cc-option,-mtune=sandybridge))
131 cflags-$(CONFIG_MIVYBRIDGE) += \
132 $(call cc-option,-march=ivybridge,$(call cc-option,-mtune=ivybridge))
133 cflags-$(CONFIG_MHASWELL) += \
134 $(call cc-option,-march=haswell,$(call cc-option,-mtune=haswell))
135 cflags-$(CONFIG_MBROADWELL) += \
136 $(call cc-option,-march=broadwell,$(call cc-option,-mtune=broadwell))
137 cflags-$(CONFIG_MSKYLAKE) += \
138 $(call cc-option,-march=skylake,$(call cc-option,-mtune=skylake))
139 cflags-$(CONFIG_MATOM) += $(call cc-option,-march=bonnell) \
140 $(call cc-option,-mtune=bonnell,$(call cc-option,-mtune=generic))
114 141 cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic)
115 142 KBUILD_CFLAGS += $(cflags-y)
116 143
23 23 # Please note, that patches that add -march=athlon-xp and friends are pointless.
24 24 # They make zero difference whatsosever to performance at this time.
25 25 cflags-$(CONFIG_MK7) += -march=athlon
26 cflags-$(CONFIG_MNATIVE) += $(call cc-option,-march=native)
26 27 cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8,-march=athlon)
28 cflags-$(CONFIG_MK8SSE3) += $(call cc-option,-march=k8-sse3,-march=athlon)
29 cflags-$(CONFIG_MK10) += $(call cc-option,-march=amdfam10,-march=athlon)
30 cflags-$(CONFIG_MBARCELONA) += $(call cc-option,-march=barcelona,-march=athlon)
31 cflags-$(CONFIG_MBOBCAT) += $(call cc-option,-march=btver1,-march=athlon)
32 cflags-$(CONFIG_MJAGUAR) += $(call cc-option,-march=btver2,-march=athlon)
33 cflags-$(CONFIG_MBULLDOZER) += $(call cc-option,-march=bdver1,-march=athlon)
34 cflags-$(CONFIG_MPILEDRIVER) += $(call cc-option,-march=bdver2,-march=athlon)
35 cflags-$(CONFIG_MSTEAMROLLER) += $(call cc-option,-march=bdver3,-march=athlon)
36 cflags-$(CONFIG_MEXCAVATOR) += $(call cc-option,-march=bdver4,-march=athlon)
37 cflags-$(CONFIG_MZEN) += $(call cc-option,-march=znver1,-march=athlon)
27 38 cflags-$(CONFIG_MCRUSOE) += -march=i686 $(align)-functions=0 $(align)-jumps=0 $(align)-loops=0
28 39 cflags-$(CONFIG_MEFFICEON) += -march=i686 $(call tune,pentium3) $(align)-functions=0 $(align)-jumps=0 $(align)-loops=0
29 40 cflags-$(CONFIG_MWINCHIPC6) += $(call cc-option,-march=winchip-c6,-march=i586)
43 43 cflags-$(CONFIG_MVIAC3_2) += $(call cc-option,-march=c3-2,-march=i686)
44 44 cflags-$(CONFIG_MVIAC7) += -march=i686
45 45 cflags-$(CONFIG_MCORE2) += -march=i686 $(call tune,core2)
46 cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom,$(call cc-option,-march=core2,-march=i686)) \
47 $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
46 cflags-$(CONFIG_MNEHALEM) += -march=i686 $(call tune,nehalem)
47 cflags-$(CONFIG_MWESTMERE) += -march=i686 $(call tune,westmere)
48 cflags-$(CONFIG_MSILVERMONT) += -march=i686 $(call tune,silvermont)
49 cflags-$(CONFIG_MSANDYBRIDGE) += -march=i686 $(call tune,sandybridge)
50 cflags-$(CONFIG_MIVYBRIDGE) += -march=i686 $(call tune,ivybridge)
51 cflags-$(CONFIG_MHASWELL) += -march=i686 $(call tune,haswell)
52 cflags-$(CONFIG_MBROADWELL) += -march=i686 $(call tune,broadwell)
53 cflags-$(CONFIG_MSKYLAKE) += -march=i686 $(call tune,skylake)
54 cflags-$(CONFIG_MATOM) += $(call cc-option,-march=bonnell,$(call cc-option,-march=core2,-march=i686)) \
55 $(call cc-option,-mtune=bonnell,$(call cc-option,-mtune=generic))
48 56
49 57 # AMD Elan support
50 58 cflags-$(CONFIG_MELAN) += -march=i486
15 15 #define MODULE_PROC_FAMILY "586MMX "
16 16 #elif defined CONFIG_MCORE2
17 17 #define MODULE_PROC_FAMILY "CORE2 "
18 #elif defined CONFIG_MNATIVE
19 #define MODULE_PROC_FAMILY "NATIVE "
20 #elif defined CONFIG_MNEHALEM
21 #define MODULE_PROC_FAMILY "NEHALEM "
22 #elif defined CONFIG_MWESTMERE
23 #define MODULE_PROC_FAMILY "WESTMERE "
24 #elif defined CONFIG_MSILVERMONT
25 #define MODULE_PROC_FAMILY "SILVERMONT "
26 #elif defined CONFIG_MSANDYBRIDGE
27 #define MODULE_PROC_FAMILY "SANDYBRIDGE "
28 #elif defined CONFIG_MIVYBRIDGE
29 #define MODULE_PROC_FAMILY "IVYBRIDGE "
30 #elif defined CONFIG_MHASWELL
31 #define MODULE_PROC_FAMILY "HASWELL "
32 #elif defined CONFIG_MBROADWELL
33 #define MODULE_PROC_FAMILY "BROADWELL "
34 #elif defined CONFIG_MSKYLAKE
35 #define MODULE_PROC_FAMILY "SKYLAKE "
18 36 #elif defined CONFIG_MATOM
19 37 #define MODULE_PROC_FAMILY "ATOM "
20 38 #elif defined CONFIG_M686
51 51 #define MODULE_PROC_FAMILY "K7 "
52 52 #elif defined CONFIG_MK8
53 53 #define MODULE_PROC_FAMILY "K8 "
54 #elif defined CONFIG_MK8SSE3
55 #define MODULE_PROC_FAMILY "K8SSE3 "
56 #elif defined CONFIG_MK10
57 #define MODULE_PROC_FAMILY "K10 "
58 #elif defined CONFIG_MBARCELONA
59 #define MODULE_PROC_FAMILY "BARCELONA "
60 #elif defined CONFIG_MBOBCAT
61 #define MODULE_PROC_FAMILY "BOBCAT "
62 #elif defined CONFIG_MBULLDOZER
63 #define MODULE_PROC_FAMILY "BULLDOZER "
64 #elif defined CONFIG_MPILEDRIVER
65 #define MODULE_PROC_FAMILY "PILEDRIVER "
66 #elif defined CONFIG_MSTEAMROLLER
67 #define MODULE_PROC_FAMILY "STEAMROLLER "
68 #elif defined CONFIG_MJAGUAR
69 #define MODULE_PROC_FAMILY "JAGUAR "
70 #elif defined CONFIG_MEXCAVATOR
71 #define MODULE_PROC_FAMILY "EXCAVATOR "
72 #elif defined CONFIG_MZEN
73 #define MODULE_PROC_FAMILY "ZEN "
54 74 #elif defined CONFIG_MELAN
55 75 #define MODULE_PROC_FAMILY "ELAN "
56 76 #elif defined CONFIG_MCRUSOE
39 39 ---help---
40 40 Enable group IO scheduling in CFQ.
41 41
42 config IOSCHED_BFQ_SQ
43 tristate "BFQ-SQ I/O scheduler"
44 default n
45 ---help---
46 The BFQ-SQ I/O scheduler (for legacy blk: SQ stands for
47 SingleQueue) distributes bandwidth among all processes
48 according to their weights, regardless of the device
49 parameters and with any workload. It also guarantees a low
50 latency to interactive and soft real-time applications.
51 Details in Documentation/block/bfq-iosched.txt
52
53 config BFQ_SQ_GROUP_IOSCHED
54 bool "BFQ-SQ hierarchical scheduling support"
55 depends on IOSCHED_BFQ_SQ && BLK_CGROUP
56 default n
57 ---help---
58
59 Enable hierarchical scheduling in BFQ-SQ, using the blkio
60 (cgroups-v1) or io (cgroups-v2) controller.
61
42 62 choice
43 63
44 64 prompt "Default I/O scheduler"
73 73 config DEFAULT_CFQ
74 74 bool "CFQ" if IOSCHED_CFQ=y
75 75
76 config DEFAULT_BFQ_SQ
77 bool "BFQ-SQ" if IOSCHED_BFQ_SQ=y
78 help
79 Selects BFQ-SQ as the default I/O scheduler which will be
80 used by default for all block devices.
81 The BFQ-SQ I/O scheduler aims at distributing the bandwidth
82 as desired, independently of the disk parameters and with
83 any workload. It also tries to guarantee low latency to
84 interactive and soft real-time applications.
85
76 86 config DEFAULT_NOOP
77 87 bool "No-op"
78 88
92 92 string
93 93 default "deadline" if DEFAULT_DEADLINE
94 94 default "cfq" if DEFAULT_CFQ
95 default "bfq-sq" if DEFAULT_BFQ_SQ
95 96 default "noop" if DEFAULT_NOOP
97
98 config MQ_IOSCHED_BFQ
99 tristate "BFQ-MQ I/O Scheduler"
100 default y
101 ---help---
102 BFQ I/O scheduler for BLK-MQ. BFQ-MQ distributes bandwidth
103 among all processes according to their weights, regardless of
104 the device parameters and with any workload. It also
105 guarantees a low latency to interactive and soft real-time
106 applications. Details in Documentation/block/bfq-iosched.txt
107
108 config MQ_BFQ_GROUP_IOSCHED
109 bool "BFQ-MQ hierarchical scheduling support"
110 depends on MQ_IOSCHED_BFQ && BLK_CGROUP
111 default n
112 ---help---
113
114 Enable hierarchical scheduling in BFQ-MQ, using the blkio
115 (cgroups-v1) or io (cgroups-v2) controller.
96 116
97 117 config MQ_IOSCHED_DEADLINE
98 118 tristate "MQ deadline I/O scheduler"
23 23 obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
24 24 bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
25 25 obj-$(CONFIG_IOSCHED_BFQ) += bfq.o
26 obj-$(CONFIG_IOSCHED_BFQ_SQ) += bfq-sq-iosched.o
27 obj-$(CONFIG_MQ_IOSCHED_BFQ) += bfq-mq-iosched.o
26 28
27 29 obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
28 30 obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o
1 /*
2 * BFQ: CGROUPS support.
3 *
4 * Based on ideas and code from CFQ:
5 * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
6 *
7 * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
8 * Paolo Valente <paolo.valente@unimore.it>
9 *
10 * Copyright (C) 2015 Paolo Valente <paolo.valente@unimore.it>
11 *
12 * Copyright (C) 2016 Paolo Valente <paolo.valente@linaro.org>
13 *
14 * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
15 * file.
16 */
17
18 #ifdef BFQ_GROUP_IOSCHED_ENABLED
19
20 /* bfqg stats flags */
21 enum bfqg_stats_flags {
22 BFQG_stats_waiting = 0,
23 BFQG_stats_idling,
24 BFQG_stats_empty,
25 };
26
27 #define BFQG_FLAG_FNS(name) \
28 static void bfqg_stats_mark_##name(struct bfqg_stats *stats) \
29 { \
30 stats->flags |= (1 << BFQG_stats_##name); \
31 } \
32 static void bfqg_stats_clear_##name(struct bfqg_stats *stats) \
33 { \
34 stats->flags &= ~(1 << BFQG_stats_##name); \
35 } \
36 static int bfqg_stats_##name(struct bfqg_stats *stats) \
37 { \
38 return (stats->flags & (1 << BFQG_stats_##name)) != 0; \
39 } \
40
41 BFQG_FLAG_FNS(waiting)
42 BFQG_FLAG_FNS(idling)
43 BFQG_FLAG_FNS(empty)
44 #undef BFQG_FLAG_FNS
45
46 #ifdef BFQ_MQ
47 /* This should be called with the scheduler lock held. */
48 #else
49 /* This should be called with the queue_lock held. */
50 #endif
51 static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
52 {
53 unsigned long long now;
54
55 if (!bfqg_stats_waiting(stats))
56 return;
57
58 now = sched_clock();
59 if (time_after64(now, stats->start_group_wait_time))
60 blkg_stat_add(&stats->group_wait_time,
61 now - stats->start_group_wait_time);
62 bfqg_stats_clear_waiting(stats);
63 }
64
65 #ifdef BFQ_MQ
66 /* This should be called with the scheduler lock held. */
67 #else
68 /* This should be called with the queue_lock held. */
69 #endif
70 static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
71 struct bfq_group *curr_bfqg)
72 {
73 struct bfqg_stats *stats = &bfqg->stats;
74
75 if (bfqg_stats_waiting(stats))
76 return;
77 if (bfqg == curr_bfqg)
78 return;
79 stats->start_group_wait_time = sched_clock();
80 bfqg_stats_mark_waiting(stats);
81 }
82
83 #ifdef BFQ_MQ
84 /* This should be called with the scheduler lock held. */
85 #else
86 /* This should be called with the queue_lock held. */
87 #endif
88 static void bfqg_stats_end_empty_time(struct bfqg_stats *stats)
89 {
90 unsigned long long now;
91
92 if (!bfqg_stats_empty(stats))
93 return;
94
95 now = sched_clock();
96 if (time_after64(now, stats->start_empty_time))
97 blkg_stat_add(&stats->empty_time,
98 now - stats->start_empty_time);
99 bfqg_stats_clear_empty(stats);
100 }
101
102 static void bfqg_stats_update_dequeue(struct bfq_group *bfqg)
103 {
104 blkg_stat_add(&bfqg->stats.dequeue, 1);
105 }
106
107 static void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg)
108 {
109 struct bfqg_stats *stats = &bfqg->stats;
110
111 if (blkg_rwstat_total(&stats->queued))
112 return;
113
114 /*
115 * group is already marked empty. This can happen if bfqq got new
116 * request in parent group and moved to this group while being added
117 * to service tree. Just ignore the event and move on.
118 */
119 if (bfqg_stats_empty(stats))
120 return;
121
122 stats->start_empty_time = sched_clock();
123 bfqg_stats_mark_empty(stats);
124 }
125
126 static void bfqg_stats_update_idle_time(struct bfq_group *bfqg)
127 {
128 struct bfqg_stats *stats = &bfqg->stats;
129
130 if (bfqg_stats_idling(stats)) {
131 unsigned long long now = sched_clock();
132
133 if (time_after64(now, stats->start_idle_time))
134 blkg_stat_add(&stats->idle_time,
135 now - stats->start_idle_time);
136 bfqg_stats_clear_idling(stats);
137 }
138 }
139
140 static void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg)
141 {
142 struct bfqg_stats *stats = &bfqg->stats;
143
144 stats->start_idle_time = sched_clock();
145 bfqg_stats_mark_idling(stats);
146 }
147
148 static void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg)
149 {
150 struct bfqg_stats *stats = &bfqg->stats;
151
152 blkg_stat_add(&stats->avg_queue_size_sum,
153 blkg_rwstat_total(&stats->queued));
154 blkg_stat_add(&stats->avg_queue_size_samples, 1);
155 bfqg_stats_update_group_wait_time(stats);
156 }
157
158 static struct blkcg_policy blkcg_policy_bfq;
159
160 /*
161 * blk-cgroup policy-related handlers
162 * The following functions help in converting between blk-cgroup
163 * internal structures and BFQ-specific structures.
164 */
165
166 static struct bfq_group *pd_to_bfqg(struct blkg_policy_data *pd)
167 {
168 return pd ? container_of(pd, struct bfq_group, pd) : NULL;
169 }
170
171 static struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg)
172 {
173 return pd_to_blkg(&bfqg->pd);
174 }
175
176 static struct bfq_group *blkg_to_bfqg(struct blkcg_gq *blkg)
177 {
178 struct blkg_policy_data *pd = blkg_to_pd(blkg, &blkcg_policy_bfq);
179
180 return pd_to_bfqg(pd);
181 }
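
The converters above depend on the policy data being embedded as a member of the BFQ structure, so that container_of can step back from the member to its enclosing object. The following is a minimal userspace sketch of that pattern, not the kernel code: `struct policy_data`, `struct group`, `pd_to_group` and `demo_roundtrip` are hypothetical stand-ins, and the macro is a simplified container_of without the kernel's type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of (assumption: no type checking, unlike the kernel's). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for blkg_policy_data and bfq_group. */
struct policy_data {
	int plid;
};

struct group {
	int weight;
	struct policy_data pd;	/* embedded member, not a pointer */
};

/* NULL-safe conversion with the same shape as pd_to_bfqg(). */
static struct group *pd_to_group(struct policy_data *pd)
{
	return pd ? container_of(pd, struct group, pd) : NULL;
}

/* Round trip: from the embedded member back to the enclosing object. */
static int demo_roundtrip(void)
{
	struct group g = { .weight = 100, .pd = { .plid = 1 } };

	assert(pd_to_group(&g.pd) == &g);
	return pd_to_group(&g.pd)->weight;
}
```

The NULL check matters because the lookup that produces the member pointer may itself return NULL, and subtracting an offset from NULL would produce a bogus non-NULL pointer.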
182
183 /*
184 * bfq_group handlers
185 * The following functions help in navigating the bfq_group hierarchy
186 * by finding the parent of a bfq_group or the bfq_group
187 * associated with a bfq_queue.
188 */
189
190 static struct bfq_group *bfqg_parent(struct bfq_group *bfqg)
191 {
192 struct blkcg_gq *pblkg = bfqg_to_blkg(bfqg)->parent;
193
194 return pblkg ? blkg_to_bfqg(pblkg) : NULL;
195 }
196
197 static struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
198 {
199 struct bfq_entity *group_entity = bfqq->entity.parent;
200
201 return group_entity ? container_of(group_entity, struct bfq_group,
202 entity) :
203 bfqq->bfqd->root_group;
204 }
205
206 /*
207 * The following two functions handle get and put of a bfq_group by
208 * wrapping the related blk-cgroup hooks.
209 */
210
211 static void bfqg_get(struct bfq_group *bfqg)
212 {
213 #ifdef BFQ_MQ
214 bfqg->ref++;
215 #else
216 blkg_get(bfqg_to_blkg(bfqg));
217 #endif
218 }
219
220 static void bfqg_put(struct bfq_group *bfqg)
221 {
222 #ifdef BFQ_MQ
223 bfqg->ref--;
224
225 BUG_ON(bfqg->ref < 0);
226 if (bfqg->ref == 0)
227 kfree(bfqg);
228 #else
229 blkg_put(bfqg_to_blkg(bfqg));
230 #endif
231 }
232
233 #ifdef BFQ_MQ
234 static void bfqg_and_blkg_get(struct bfq_group *bfqg)
235 {
236 /* see comments in bfq_bic_update_cgroup for why refcounting bfqg */
237 bfqg_get(bfqg);
238
239 blkg_get(bfqg_to_blkg(bfqg));
240 }
241
242 static void bfqg_and_blkg_put(struct bfq_group *bfqg)
243 {
244 bfqg_put(bfqg);
245
246 blkg_put(bfqg_to_blkg(bfqg));
247 }
248 #endif
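
Under BFQ_MQ the group carries its own plain integer reference count, and the object is freed only when the last reference is dropped. The sketch below illustrates that get/put discipline in userspace. It assumes, as the listing does, that an external lock serializes all callers, so the counter is not atomic. `struct group`, `group_alloc`, `group_get`, `group_put` and the `freed` flag are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object; callers are assumed to hold a lock. */
struct group {
	int ref;
	int *freed;	/* illustration only: set to 1 when the object is released */
};

/* Allocation takes the first reference, as bfq_pd_alloc() does under BFQ_MQ. */
static struct group *group_alloc(int *freed)
{
	struct group *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;
	g->freed = freed;
	g->ref = 1;
	return g;
}

static void group_get(struct group *g)
{
	g->ref++;
}

/* Drop one reference; free the object when none remain. */
static void group_put(struct group *g)
{
	assert(g->ref > 0);
	if (--g->ref == 0) {
		*g->freed = 1;
		free(g);
	}
}
```

With this scheme the allocator's own put (compare bfq_pd_free) no longer frees the object while queues still hold references to it.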
249
250 static void bfqg_stats_update_io_add(struct bfq_group *bfqg,
251 struct bfq_queue *bfqq,
252 unsigned int op)
253 {
254 blkg_rwstat_add(&bfqg->stats.queued, op, 1);
255 bfqg_stats_end_empty_time(&bfqg->stats);
256 if (!(bfqq == ((struct bfq_data *)bfqg->bfqd)->in_service_queue))
257 bfqg_stats_set_start_group_wait_time(bfqg, bfqq_group(bfqq));
258 }
259
260 static void bfqg_stats_update_io_remove(struct bfq_group *bfqg, unsigned int op)
261 {
262 blkg_rwstat_add(&bfqg->stats.queued, op, -1);
263 }
264
265 static void bfqg_stats_update_io_merged(struct bfq_group *bfqg, unsigned int op)
266 {
267 blkg_rwstat_add(&bfqg->stats.merged, op, 1);
268 }
269
270 static void bfqg_stats_update_completion(struct bfq_group *bfqg,
271 uint64_t start_time, uint64_t io_start_time,
272 unsigned int op)
273 {
274 struct bfqg_stats *stats = &bfqg->stats;
275 unsigned long long now = sched_clock();
276
277 if (time_after64(now, io_start_time))
278 blkg_rwstat_add(&stats->service_time, op,
279 now - io_start_time);
280 if (time_after64(io_start_time, start_time))
281 blkg_rwstat_add(&stats->wait_time, op,
282 io_start_time - start_time);
283 }
284
285 /* @stats = 0 */
286 static void bfqg_stats_reset(struct bfqg_stats *stats)
287 {
288 /* queued stats shouldn't be cleared */
289 blkg_rwstat_reset(&stats->merged);
290 blkg_rwstat_reset(&stats->service_time);
291 blkg_rwstat_reset(&stats->wait_time);
292 blkg_stat_reset(&stats->time);
293 blkg_stat_reset(&stats->avg_queue_size_sum);
294 blkg_stat_reset(&stats->avg_queue_size_samples);
295 blkg_stat_reset(&stats->dequeue);
296 blkg_stat_reset(&stats->group_wait_time);
297 blkg_stat_reset(&stats->idle_time);
298 blkg_stat_reset(&stats->empty_time);
299 }
300
301 /* @to += @from */
302 static void bfqg_stats_add_aux(struct bfqg_stats *to, struct bfqg_stats *from)
303 {
304 if (!to || !from)
305 return;
306
307 /* queued stats are not transferred */
308 blkg_rwstat_add_aux(&to->merged, &from->merged);
309 blkg_rwstat_add_aux(&to->service_time, &from->service_time);
310 blkg_rwstat_add_aux(&to->wait_time, &from->wait_time);
311 blkg_stat_add_aux(&to->time, &from->time);
312 blkg_stat_add_aux(&to->avg_queue_size_sum, &from->avg_queue_size_sum);
313 blkg_stat_add_aux(&to->avg_queue_size_samples,
314 &from->avg_queue_size_samples);
315 blkg_stat_add_aux(&to->dequeue, &from->dequeue);
316 blkg_stat_add_aux(&to->group_wait_time, &from->group_wait_time);
317 blkg_stat_add_aux(&to->idle_time, &from->idle_time);
318 blkg_stat_add_aux(&to->empty_time, &from->empty_time);
319 }
320
321 /*
322 * Transfer @bfqg's stats to its parent's dead_stats so that the ancestors'
323 * recursive stats can still account for the amount used by this bfqg after
324 * it's gone.
325 */
326 static void bfqg_stats_xfer_dead(struct bfq_group *bfqg)
327 {
328 struct bfq_group *parent;
329
330 if (!bfqg) /* root_group */
331 return;
332
333 parent = bfqg_parent(bfqg);
334
335 lockdep_assert_held(bfqg_to_blkg(bfqg)->q->queue_lock);
336
337 if (unlikely(!parent))
338 return;
339
340 bfqg_stats_add_aux(&parent->stats, &bfqg->stats);
341 bfqg_stats_reset(&bfqg->stats);
342 }
343
344 static void bfq_init_entity(struct bfq_entity *entity,
345 struct bfq_group *bfqg)
346 {
347 struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
348
349 entity->weight = entity->new_weight;
350 entity->orig_weight = entity->new_weight;
351 if (bfqq) {
352 bfqq->ioprio = bfqq->new_ioprio;
353 bfqq->ioprio_class = bfqq->new_ioprio_class;
354 #ifdef BFQ_MQ
355 /*
356 * Make sure that bfqg and its associated blkg do not
357 * disappear before entity.
358 */
359 bfqg_and_blkg_get(bfqg);
360 #else
361 bfqg_get(bfqg);
362 #endif
363 }
364 entity->parent = bfqg->my_entity; /* NULL for root group */
365 entity->sched_data = &bfqg->sched_data;
366 }
367
368 static void bfqg_stats_exit(struct bfqg_stats *stats)
369 {
370 blkg_rwstat_exit(&stats->merged);
371 blkg_rwstat_exit(&stats->service_time);
372 blkg_rwstat_exit(&stats->wait_time);
373 blkg_rwstat_exit(&stats->queued);
374 blkg_stat_exit(&stats->time);
375 blkg_stat_exit(&stats->avg_queue_size_sum);
376 blkg_stat_exit(&stats->avg_queue_size_samples);
377 blkg_stat_exit(&stats->dequeue);
378 blkg_stat_exit(&stats->group_wait_time);
379 blkg_stat_exit(&stats->idle_time);
380 blkg_stat_exit(&stats->empty_time);
381 }
382
383 static int bfqg_stats_init(struct bfqg_stats *stats, gfp_t gfp)
384 {
385 if (blkg_rwstat_init(&stats->merged, gfp) ||
386 blkg_rwstat_init(&stats->service_time, gfp) ||
387 blkg_rwstat_init(&stats->wait_time, gfp) ||
388 blkg_rwstat_init(&stats->queued, gfp) ||
389 blkg_stat_init(&stats->time, gfp) ||
390 blkg_stat_init(&stats->avg_queue_size_sum, gfp) ||
391 blkg_stat_init(&stats->avg_queue_size_samples, gfp) ||
392 blkg_stat_init(&stats->dequeue, gfp) ||
393 blkg_stat_init(&stats->group_wait_time, gfp) ||
394 blkg_stat_init(&stats->idle_time, gfp) ||
395 blkg_stat_init(&stats->empty_time, gfp)) {
396 bfqg_stats_exit(stats);
397 return -ENOMEM;
398 }
399
400 return 0;
401 }
402
403 static struct bfq_group_data *cpd_to_bfqgd(struct blkcg_policy_data *cpd)
404 {
405 return cpd ? container_of(cpd, struct bfq_group_data, pd) : NULL;
406 }
407
408 static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg)
409 {
410 return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq));
411 }
412
413 static struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
414 {
415 struct bfq_group_data *bgd;
416
417 bgd = kzalloc(sizeof(*bgd), gfp);
418 if (!bgd)
419 return NULL;
420 return &bgd->pd;
421 }
422
423 static void bfq_cpd_init(struct blkcg_policy_data *cpd)
424 {
425 struct bfq_group_data *d = cpd_to_bfqgd(cpd);
426
427 d->weight = cgroup_subsys_on_dfl(io_cgrp_subsys) ?
428 CGROUP_WEIGHT_DFL : BFQ_WEIGHT_LEGACY_DFL;
429 }
430
431 static void bfq_cpd_free(struct blkcg_policy_data *cpd)
432 {
433 kfree(cpd_to_bfqgd(cpd));
434 }
435
436 static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
437 {
438 struct bfq_group *bfqg;
439
440 bfqg = kzalloc_node(sizeof(*bfqg), gfp, node);
441 if (!bfqg)
442 return NULL;
443
444 if (bfqg_stats_init(&bfqg->stats, gfp)) {
445 kfree(bfqg);
446 return NULL;
447 }
448
449 #ifdef BFQ_MQ
450 /* see comments in bfq_bic_update_cgroup for why refcounting */
451 bfqg_get(bfqg);
452 #endif
453 return &bfqg->pd;
454 }
455
456 static void bfq_pd_init(struct blkg_policy_data *pd)
457 {
458 struct blkcg_gq *blkg;
459 struct bfq_group *bfqg;
460 struct bfq_data *bfqd;
461 struct bfq_entity *entity;
462 struct bfq_group_data *d;
463
464 blkg = pd_to_blkg(pd);
465 BUG_ON(!blkg);
466 bfqg = blkg_to_bfqg(blkg);
467 bfqd = blkg->q->elevator->elevator_data;
468 BUG_ON(bfqg == bfqd->root_group);
469 entity = &bfqg->entity;
470 d = blkcg_to_bfqgd(blkg->blkcg);
471
472 entity->orig_weight = entity->weight = entity->new_weight = d->weight;
473 entity->my_sched_data = &bfqg->sched_data;
474 bfqg->my_entity = entity; /*
475 * the root_group's will be set to NULL
476 * in bfq_init_queue()
477 */
478 bfqg->bfqd = bfqd;
479 bfqg->active_entities = 0;
480 bfqg->rq_pos_tree = RB_ROOT;
481 }
482
483 static void bfq_pd_free(struct blkg_policy_data *pd)
484 {
485 struct bfq_group *bfqg = pd_to_bfqg(pd);
486
487 bfqg_stats_exit(&bfqg->stats);
488 #ifdef BFQ_MQ
489 bfqg_put(bfqg);
490 #else
491 kfree(bfqg);
492 #endif
493 }
494
495 static void bfq_pd_reset_stats(struct blkg_policy_data *pd)
496 {
497 struct bfq_group *bfqg = pd_to_bfqg(pd);
498
499 bfqg_stats_reset(&bfqg->stats);
500 }
501
502 static void bfq_group_set_parent(struct bfq_group *bfqg,
503 struct bfq_group *parent)
504 {
505 struct bfq_entity *entity;
506
507 BUG_ON(!parent);
508 BUG_ON(!bfqg);
509 BUG_ON(bfqg == parent);
510
511 entity = &bfqg->entity;
512 entity->parent = parent->my_entity;
513 entity->sched_data = &parent->sched_data;
514 }
515
516 static struct bfq_group *bfq_lookup_bfqg(struct bfq_data *bfqd,
517 struct blkcg *blkcg)
518 {
519 struct blkcg_gq *blkg;
520
521 blkg = blkg_lookup(blkcg, bfqd->queue);
522 if (likely(blkg))
523 return blkg_to_bfqg(blkg);
524 return NULL;
525 }
526
527 static struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
528 struct blkcg *blkcg)
529 {
530 struct bfq_group *bfqg, *parent;
531 struct bfq_entity *entity;
532
533 bfqg = bfq_lookup_bfqg(bfqd, blkcg);
534
535 if (unlikely(!bfqg))
536 return NULL;
537
538 /*
539 * Update chain of bfq_groups as we might be handling a leaf group
540 * which, along with some of its relatives, has not been hooked yet
541 * to the private hierarchy of BFQ.
542 */
543 entity = &bfqg->entity;
544 for_each_entity(entity) {
545 bfqg = container_of(entity, struct bfq_group, entity);
546 BUG_ON(!bfqg);
547 if (bfqg != bfqd->root_group) {
548 parent = bfqg_parent(bfqg);
549 if (!parent)
550 parent = bfqd->root_group;
551 BUG_ON(!parent);
552 bfq_group_set_parent(bfqg, parent);
553 }
554 }
555
556 return bfqg;
557 }
558
559 static void bfq_pos_tree_add_move(struct bfq_data *bfqd,
560 struct bfq_queue *bfqq);
561
562 static void bfq_bfqq_expire(struct bfq_data *bfqd,
563 struct bfq_queue *bfqq,
564 bool compensate,
565 enum bfqq_expiration reason);
566
567 /**
568 * bfq_bfqq_move - migrate @bfqq to @bfqg.
569 * @bfqd: queue descriptor.
570 * @bfqq: the queue to move.
571 * @bfqg: the group to move to.
572 *
573 * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
574 * it on the new one. Avoid putting the entity on the old group's idle tree.
575 *
576 #ifdef BFQ_MQ
577 * Must be called under the scheduler lock, to make sure that the blkg
578 * owning @bfqg does not disappear (see comments in
579 * bfq_bic_update_cgroup on guaranteeing the consistency of blkg
580 * objects).
581 #else
582 * Must be called under the queue lock; the cgroup owning @bfqg must
583 * not disappear (by now this just means that we are called under
584 * rcu_read_lock()).
585 #endif
586 */
587 static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
588 struct bfq_group *bfqg)
589 {
590 struct bfq_entity *entity = &bfqq->entity;
591
592 BUG_ON(!bfq_bfqq_busy(bfqq) && !RB_EMPTY_ROOT(&bfqq->sort_list));
593 BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list) && !entity->on_st);
594 BUG_ON(bfq_bfqq_busy(bfqq) && RB_EMPTY_ROOT(&bfqq->sort_list)
595 && entity->on_st &&
596 bfqq != bfqd->in_service_queue);
597 BUG_ON(!bfq_bfqq_busy(bfqq) && bfqq == bfqd->in_service_queue);
598
599 /* If bfqq is empty, then bfq_bfqq_expire also invokes
600 * bfq_del_bfqq_busy, thereby removing bfqq and its entity
601 * from data structures related to current group. Otherwise we
602 * need to remove bfqq explicitly with bfq_deactivate_bfqq, as
603 * we do below.
604 */
605 if (bfqq == bfqd->in_service_queue)
606 bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
607 false, BFQ_BFQQ_PREEMPTED);
608
609 BUG_ON(entity->on_st && !bfq_bfqq_busy(bfqq)
610 && &bfq_entity_service_tree(entity)->idle !=
611 entity->tree);
612
613 BUG_ON(RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_busy(bfqq));
614
615 if (bfq_bfqq_busy(bfqq))
616 bfq_deactivate_bfqq(bfqd, bfqq, false, false);
617 else if (entity->on_st) {
618 BUG_ON(&bfq_entity_service_tree(entity)->idle !=
619 entity->tree);
620 bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
621 }
622 #ifdef BFQ_MQ
623 bfqg_and_blkg_put(bfqq_group(bfqq));
624 #else
625 bfqg_put(bfqq_group(bfqq));
626 #endif
627
628 entity->parent = bfqg->my_entity;
629 entity->sched_data = &bfqg->sched_data;
630 #ifdef BFQ_MQ
631 /* pin down bfqg and its associated blkg */
632 bfqg_and_blkg_get(bfqg);
633 #else
634 bfqg_get(bfqg);
635 #endif
636
637 BUG_ON(RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_busy(bfqq));
638 if (bfq_bfqq_busy(bfqq)) {
639 bfq_pos_tree_add_move(bfqd, bfqq);
640 bfq_activate_bfqq(bfqd, bfqq);
641 }
642
643 if (!bfqd->in_service_queue && !bfqd->rq_in_driver)
644 bfq_schedule_dispatch(bfqd);
645 BUG_ON(entity->on_st && !bfq_bfqq_busy(bfqq)
646 && &bfq_entity_service_tree(entity)->idle !=
647 entity->tree);
648 }
649
650 /**
651 * __bfq_bic_change_cgroup - move @bic to @blkcg.
652 * @bfqd: the queue descriptor.
653 * @bic: the bic to move.
654 * @blkcg: the blk-cgroup to move to.
655 *
656 #ifdef BFQ_MQ
657 * Move bic to blkcg, assuming that bfqd->lock is held, which makes
658 * sure that the reference to cgroup is valid across the call (see
659 * comments in bfq_bic_update_cgroup on this issue).
660 #else
661 * Move bic to blkcg, assuming that bfqd->queue is locked; the caller
662 * has to make sure that the reference to cgroup is valid across the call.
663 #endif
664 *
665 * NOTE: an alternative approach might have been to store the current
666 * cgroup in bfqq and get a reference to it, reducing the lookup
667 * time here, at the price of slightly more complex code.
668 */
669 static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
670 struct bfq_io_cq *bic,
671 struct blkcg *blkcg)
672 {
673 struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
674 struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
675 struct bfq_group *bfqg;
676 struct bfq_entity *entity;
677
678 bfqg = bfq_find_set_group(bfqd, blkcg);
679
680 if (unlikely(!bfqg))
681 bfqg = bfqd->root_group;
682
683 if (async_bfqq) {
684 entity = &async_bfqq->entity;
685
686 if (entity->sched_data != &bfqg->sched_data) {
687 bic_set_bfqq(bic, NULL, 0);
688 bfq_log_bfqq(bfqd, async_bfqq,
689 "bic_change_group: %p %d",
690 async_bfqq,
691 async_bfqq->ref);
692 bfq_put_queue(async_bfqq);
693 }
694 }
695
696 if (sync_bfqq) {
697 entity = &sync_bfqq->entity;
698 if (entity->sched_data != &bfqg->sched_data)
699 bfq_bfqq_move(bfqd, sync_bfqq, bfqg);
700 }
701
702 return bfqg;
703 }
704
705 static void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
706 {
707 struct bfq_data *bfqd = bic_to_bfqd(bic);
708 struct bfq_group *bfqg = NULL;
709 uint64_t serial_nr;
710
711 rcu_read_lock();
712 serial_nr = bio_blkcg(bio)->css.serial_nr;
713
714 /*
715 * Check whether blkcg has changed. The condition may trigger
716 * spuriously on a newly created bic but there's no harm.
717 */
718 if (unlikely(!bfqd) || likely(bic->blkcg_serial_nr == serial_nr))
719 goto out;
720
721 bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio));
722 #ifdef BFQ_MQ
723 /*
724 * Update blkg_path for bfq_log_* functions. We cache this
725 * path, and update it here, for the following
726 * reasons. Operations on blkg objects in blk-cgroup are
727 * protected with the request_queue lock, and not with the
728 * lock that protects the instances of this scheduler
729 * (bfqd->lock). This exposes BFQ to the following sort of
730 * race.
731 *
732 * The blkg_lookup performed in bfq_get_queue, protected
733 * through rcu, may happen to return the address of a copy of
734 * the original blkg. If this is the case, then the
735 * bfqg_and_blkg_get performed in bfq_get_queue, to pin down
736 * the blkg, is useless: it does not prevent blk-cgroup code
737 * from destroying both the original blkg and all objects
738 * directly or indirectly referred to by the copy of the
739 * blkg.
740 *
741 * On the bright side, destroy operations on a blkg invoke, as
742 * a first step, hooks of the scheduler associated with the
743 * blkg. And these hooks are executed with bfqd->lock held for
744 * BFQ. As a consequence, for any blkg associated with the
745 * request queue this instance of the scheduler is attached
746 * to, we are guaranteed that such a blkg is not destroyed, and
747 * that all the pointers it contains are consistent, while we
748 * are holding bfqd->lock. A blkg_lookup performed with
749 * bfqd->lock held then returns a fully consistent blkg, which
750 * remains consistent as long as this lock is held.
751 *
752 * Thanks to the last fact, and to the fact that: (1) bfqg has
753 * been obtained through a blkg_lookup in the above
754 * assignment, and (2) bfqd->lock is being held, here we can
755 * safely use the policy data for the involved blkg (i.e., the
756 * field bfqg->pd) to get to the blkg associated with bfqg,
757 * and then we can safely use any field of blkg. After we
758 * release bfqd->lock, even just getting blkg through this
759 * bfqg may cause dangling references to be traversed, as
760 * bfqg->pd may not exist any more.
761 *
762 * In view of the above facts, here we cache, in the bfqg, any
763 * blkg data we may need for this bic, and for its associated
764 * bfq_queue. As of now, we need to cache only the path of the
765 * blkg, which is used in the bfq_log_* functions.
766 *
767 * Finally, note that bfqg itself needs to be protected from
768 * destruction on the blkg_free of the original blkg (which
769 * invokes bfq_pd_free). We use an additional private
770 * refcounter for bfqg, to let it disappear only after no
771 * bfq_queue refers to it any longer.
772 */
773 blkg_path(bfqg_to_blkg(bfqg), bfqg->blkg_path, sizeof(bfqg->blkg_path));
774 #endif
775 bic->blkcg_serial_nr = serial_nr;
776 out:
777 rcu_read_unlock();
778 }
779
780 /**
781 * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
782 * @st: the service tree being flushed.
783 */
784 static void bfq_flush_idle_tree(struct bfq_service_tree *st)
785 {
786 struct bfq_entity *entity = st->first_idle;
787
788 for (; entity ; entity = st->first_idle)
789 __bfq_deactivate_entity(entity, false);
790 }
791
792 /**
793 * bfq_reparent_leaf_entity - move leaf entity to the root_group.
794 * @bfqd: the device data structure with the root group.
795 * @entity: the entity to move.
796 */
797 static void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
798 struct bfq_entity *entity)
799 {
800 struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
801
802 BUG_ON(!bfqq);
803 bfq_bfqq_move(bfqd, bfqq, bfqd->root_group);
804 }
805
806 /**
807 * bfq_reparent_active_entities - move to the root group all active
808 * entities.
809 * @bfqd: the device data structure with the root group.
810 * @bfqg: the group to move from.
811 * @st: the service tree with the entities.
812 */
813 static void bfq_reparent_active_entities(struct bfq_data *bfqd,
814 struct bfq_group *bfqg,
815 struct bfq_service_tree *st)
816 {
817 struct rb_root *active = &st->active;
818 struct bfq_entity *entity = NULL;
819
820 if (!RB_EMPTY_ROOT(&st->active))
821 entity = bfq_entity_of(rb_first(active));
822
823 for (; entity ; entity = bfq_entity_of(rb_first(active)))
824 bfq_reparent_leaf_entity(bfqd, entity);
825
826 if (bfqg->sched_data.in_service_entity)
827 bfq_reparent_leaf_entity(bfqd,
828 bfqg->sched_data.in_service_entity);
829 }
830
831 /**
832 * bfq_pd_offline - deactivate the entity associated with @pd,
833 * and reparent its children entities.
834 * @pd: descriptor of the policy going offline.
835 *
836 * blkio already grabs the queue_lock for us, so no need to use
837 * RCU-based magic
838 */
839 static void bfq_pd_offline(struct blkg_policy_data *pd)
840 {
841 struct bfq_service_tree *st;
842 struct bfq_group *bfqg;
843 struct bfq_data *bfqd;
844 struct bfq_entity *entity;
845 #ifdef BFQ_MQ
846 unsigned long flags;
847 #endif
848 int i;
849
850 BUG_ON(!pd);
851 bfqg = pd_to_bfqg(pd);
852 BUG_ON(!bfqg);
853 bfqd = bfqg->bfqd;
854 BUG_ON(bfqd && !bfqd->root_group);
855
856 entity = bfqg->my_entity;
857
858 if (!entity) /* root group */
859 return;
860
861 #ifdef BFQ_MQ
862 spin_lock_irqsave(&bfqd->lock, flags);
863 #endif
864
865 /*
866 * Empty all service_trees belonging to this group before
867 * deactivating the group itself.
868 */
869 for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
870 BUG_ON(!bfqg->sched_data.service_tree);
871 st = bfqg->sched_data.service_tree + i;
872 /*
873 * The idle tree may still contain bfq_queues belonging
874 * to exited tasks because they never migrated to a different
875 * cgroup from the one being destroyed now.
876 */
877 bfq_flush_idle_tree(st);
878
879 /*
880 * It may happen that some queues are still active
881 * (busy) upon group destruction (if the corresponding
882 * processes have been forced to terminate). We move
883 * all the leaf entities corresponding to these queues
884 * to the root_group.
885 * Also, it may happen that the group has an entity
886 * in service, which is disconnected from the active
887 * tree: it must be moved, too.
888 * There is no need to put the sync queues, as the
889 * scheduler has taken no reference.
890 */
891 bfq_reparent_active_entities(bfqd, bfqg, st);
892 BUG_ON(!RB_EMPTY_ROOT(&st->active));
893 BUG_ON(!RB_EMPTY_ROOT(&st->idle));
894 }
895 BUG_ON(bfqg->sched_data.next_in_service);
896 BUG_ON(bfqg->sched_data.in_service_entity);
897
898 __bfq_deactivate_entity(entity, false);
899 bfq_put_async_queues(bfqd, bfqg);
900
901 #ifdef BFQ_MQ
902 spin_unlock_irqrestore(&bfqd->lock, flags);
903 #endif
904 /*
905 * @blkg is going offline and will be ignored by
906 * blkg_[rw]stat_recursive_sum(). Transfer stats to the parent so
907 * that they don't get lost. If IOs complete after this point, the
908 * stats for them will be lost. Oh well...
909 */
910 bfqg_stats_xfer_dead(bfqg);
911 }
912
913 static void bfq_end_wr_async(struct bfq_data *bfqd)
914 {
915 struct blkcg_gq *blkg;
916
917 list_for_each_entry(blkg, &bfqd->queue->blkg_list, q_node) {
918 struct bfq_group *bfqg = blkg_to_bfqg(blkg);
919 BUG_ON(!bfqg);
920
921 bfq_end_wr_async_queues(bfqd, bfqg);
922 }
923 bfq_end_wr_async_queues(bfqd, bfqd->root_group);
924 }
925
926 static int bfq_io_show_weight(struct seq_file *sf, void *v)
927 {
928 struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
929 struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
930 unsigned int val = 0;
931
932 if (bfqgd)
933 val = bfqgd->weight;
934
935 seq_printf(sf, "%u\n", val);
936
937 return 0;
938 }
939
940 static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css,
941 struct cftype *cftype,
942 u64 val)
943 {
944 struct blkcg *blkcg = css_to_blkcg(css);
945 struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
946 struct blkcg_gq *blkg;
947 int ret = -ERANGE;
948
949 if (val < BFQ_MIN_WEIGHT || val > BFQ_MAX_WEIGHT)
950 return ret;
951
952 ret = 0;
953 spin_lock_irq(&blkcg->lock);
954 bfqgd->weight = (unsigned short)val;
955 hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
956 struct bfq_group *bfqg = blkg_to_bfqg(blkg);
957
958 if (!bfqg)
959 continue;
960 /*
961 * Setting the prio_changed flag of the entity
962 * to 1 with new_weight == weight would re-set
963 * the value of the weight to its ioprio mapping.
964 * Set the flag only if necessary.
965 */
966 if ((unsigned short)val != bfqg->entity.new_weight) {
967 bfqg->entity.new_weight = (unsigned short)val;
968 /*
969 * Make sure that the above new value has been
970 * stored in bfqg->entity.new_weight before
971 * setting the prio_changed flag. In fact,
972 * this flag may be read asynchronously (in
973 * critical sections protected by a different
974 * lock than that held here), and finding this
975 * flag set may cause the execution of the code
976 * for updating parameters whose value may
977 * depend also on bfqg->entity.new_weight (in
978 * __bfq_entity_update_weight_prio).
979 * This barrier makes sure that the new value
980 * of bfqg->entity.new_weight is correctly
981 * seen in that code.
982 */
983 smp_wmb();
984 bfqg->entity.prio_changed = 1;
985 }
986 }
987 spin_unlock_irq(&blkcg->lock);
988
989 return ret;
990 }
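
The weight update above stores the new value first and sets the prio_changed flag second, with a write barrier in between, so that a reader that sees the flag also sees the new weight. A rough userspace analogue using C11 atomics is sketched below. It is not the kernel mechanism: smp_wmb() is replaced by a release store, and the reader side (`consume_weight`) is a hypothetical counterpart that the listing does not show. `struct entity` and `publish_weight` are likewise illustrative names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical entity with a value and a "value changed" flag. */
struct entity {
	atomic_ushort new_weight;
	atomic_int prio_changed;
};

/* Writer: store the value, then publish the flag with release ordering. */
static void publish_weight(struct entity *e, unsigned short w)
{
	if (atomic_load_explicit(&e->new_weight, memory_order_relaxed) == w)
		return;	/* set the flag only if the weight really changed */

	atomic_store_explicit(&e->new_weight, w, memory_order_relaxed);
	atomic_store_explicit(&e->prio_changed, 1, memory_order_release);
}

/* Reader: an acquire load of the flag guarantees the new value is visible. */
static int consume_weight(struct entity *e, unsigned short *out)
{
	assert(out != NULL);
	if (!atomic_load_explicit(&e->prio_changed, memory_order_acquire))
		return 0;

	*out = atomic_load_explicit(&e->new_weight, memory_order_relaxed);
	atomic_store_explicit(&e->prio_changed, 0, memory_order_relaxed);
	return 1;
}
```

Publishing the flag before the value would let a reader act on the flag while still observing the old weight, which is the reordering the barrier in the listing rules out.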
991
992 static ssize_t bfq_io_set_weight(struct kernfs_open_file *of,
993 char *buf, size_t nbytes,
994 loff_t off)
995 {
996 u64 weight;
997 /* Parse the whole trimmed buffer as an unsigned 64-bit integer */
998 int ret = kstrtoull(strim(buf), 0, &weight);
999
1000 if (ret)
1001 return ret;
1002
1003 return bfq_io_set_weight_legacy(of_css(of), NULL, weight);
1004 }
1005
1006 static int bfqg_print_stat(struct seq_file *sf, void *v)
1007 {
1008 blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_stat,
1009 &blkcg_policy_bfq, seq_cft(sf)->private, false);
1010 return 0;
1011 }
1012
1013 static int bfqg_print_rwstat(struct seq_file *sf, void *v)
1014 {
1015 blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_rwstat,
1016 &blkcg_policy_bfq, seq_cft(sf)->private, true);
1017 return 0;
1018 }
1019
1020 static u64 bfqg_prfill_stat_recursive(struct seq_file *sf,
1021 struct blkg_policy_data *pd, int off)
1022 {
1023 u64 sum = blkg_stat_recursive_sum(pd_to_blkg(pd),
1024 &blkcg_policy_bfq, off);
1025 return __blkg_prfill_u64(sf, pd, sum);
1026 }
1027
1028 static u64 bfqg_prfill_rwstat_recursive(struct seq_file *sf,
1029 struct blkg_policy_data *pd, int off)
1030 {
1031 struct blkg_rwstat sum = blkg_rwstat_recursive_sum(pd_to_blkg(pd),
1032 &blkcg_policy_bfq,
1033 off);
1034 return __blkg_prfill_rwstat(sf, pd, &sum);
1035 }
1036
1037 static int bfqg_print_stat_recursive(struct seq_file *sf, void *v)
1038 {
1039 blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
1040 bfqg_prfill_stat_recursive, &blkcg_policy_bfq,
1041 seq_cft(sf)->private, false);
1042 return 0;
1043 }
1044
1045 static int bfqg_print_rwstat_recursive(struct seq_file *sf, void *v)
1046 {
1047 blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
1048 bfqg_prfill_rwstat_recursive, &blkcg_policy_bfq,
1049 seq_cft(sf)->private, true);
1050 return 0;
1051 }
1052
1053 static u64 bfqg_prfill_sectors(struct seq_file *sf, struct blkg_policy_data *pd,
1054 int off)
1055 {
1056 u64 sum = blkg_rwstat_total(&pd->blkg->stat_bytes);
1057
1058 return __blkg_prfill_u64(sf, pd, sum >> 9);
1059 }
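
The sectors statistic is derived from the byte counter by shifting right by 9, that is, dividing by the 512-byte sector size and rounding down. A small sketch of that conversion, with the hypothetical names `SECTOR_SHIFT_SKETCH` and `bytes_to_sectors`:

```c
#include <assert.h>
#include <stdint.h>

/* 2^9 = 512 bytes per sector, matching the ">> 9" in the listing. */
#define SECTOR_SHIFT_SKETCH 9

/* Convert a byte count to whole 512-byte sectors (rounds down). */
static uint64_t bytes_to_sectors(uint64_t bytes)
{
	return bytes >> SECTOR_SHIFT_SKETCH;
}

/* Example: a 4 KiB transfer covers 8 sectors. */
static uint64_t demo_4k(void)
{
	assert(bytes_to_sectors(4096) == 8);
	return bytes_to_sectors(4096);
}
```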
1060
1061 static int bfqg_print_stat_sectors(struct seq_file *sf, void *v)
1062 {
1063 blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
1064 bfqg_prfill_sectors, &blkcg_policy_bfq, 0, false);
1065 return 0;
1066 }
1067
1068 static u64 bfqg_prfill_sectors_recursive(struct seq_file *sf,
1069 struct blkg_policy_data *pd, int off)
1070 {
1071 struct blkg_rwstat tmp = blkg_rwstat_recursive_sum(pd->blkg, NULL,
1072 offsetof(struct blkcg_gq, stat_bytes));
1073 u64 sum = atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_READ]) +
1074 atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_WRITE]);
1075
1076 return __blkg_prfill_u64(sf, pd, sum >> 9);
1077 }
1078
1079 static int bfqg_print_stat_sectors_recursive(struct seq_file *sf, void *v)
1080 {
1081 blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
1082 bfqg_prfill_sectors_recursive, &blkcg_policy_bfq, 0,
1083 false);
1084 return 0;
1085 }
1086
1087
1088 static u64 bfqg_prfill_avg_queue_size(struct seq_file *sf,
1089 struct blkg_policy_data *pd, int off)
1090 {
1091 struct bfq_group *bfqg = pd_to_bfqg(pd);
1092 u64 samples = blkg_stat_read(&bfqg->stats.avg_queue_size_samples);
1093 u64 v = 0;
1094
1095 if (samples) {
1096 v = blkg_stat_read(&bfqg->stats.avg_queue_size_sum);
1097 v = div64_u64(v, samples);
1098 }
1099 __blkg_prfill_u64(sf, pd, v);
1100 return 0;
1101 }
1102
1103 /* print avg_queue_size */
1104 static int bfqg_print_avg_queue_size(struct seq_file *sf, void *v)
1105 {
1106 blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
1107 bfqg_prfill_avg_queue_size, &blkcg_policy_bfq,
1108 0, false);
1109 return 0;
1110 }
1111
1112 static struct bfq_group *
1113 bfq_create_group_hierarchy(struct bfq_data *bfqd, int node)
1114 {
1115 int ret;
1116
1117 ret = blkcg_activate_policy(bfqd->queue, &blkcg_policy_bfq);
1118 if (ret)
1119 return NULL;
1120
1121 return blkg_to_bfqg(bfqd->queue->root_blkg);
1122 }
1123
1124 #ifdef BFQ_MQ
1125 #define BFQ_CGROUP_FNAME(param) "bfq-mq."#param
1126 #else
1127 #define BFQ_CGROUP_FNAME(param) "bfq-sq."#param
1128 #endif
1129
1130 static struct cftype bfq_blkcg_legacy_files[] = {
1131 {
1132 .name = BFQ_CGROUP_FNAME(weight),
1133 .flags = CFTYPE_NOT_ON_ROOT,
1134 .seq_show = bfq_io_show_weight,
1135 .write_u64 = bfq_io_set_weight_legacy,
1136 },
1137
1138 /* statistics, covers only the tasks in the bfqg */
1139 {
1140 .name = BFQ_CGROUP_FNAME(time),
1141 .private = offsetof(struct bfq_group, stats.time),
1142 .seq_show = bfqg_print_stat,
1143 },
1144 {
1145 .name = BFQ_CGROUP_FNAME(sectors),
1146 .seq_show = bfqg_print_stat_sectors,
1147 },
1148 {
1149 .name = BFQ_CGROUP_FNAME(io_service_bytes),
1150 .private = (unsigned long)&blkcg_policy_bfq,
1151 .seq_show = blkg_print_stat_bytes,
1152 },
1153 {
1154 .name = BFQ_CGROUP_FNAME(io_serviced),
1155 .private = (unsigned long)&blkcg_policy_bfq,
1156 .seq_show = blkg_print_stat_ios,
1157 },
1158 {
1159 .name = BFQ_CGROUP_FNAME(io_service_time),
1160 .private = offsetof(struct bfq_group, stats.service_time),
1161 .seq_show = bfqg_print_rwstat,
1162 },
1163 {
1164 .name = BFQ_CGROUP_FNAME(io_wait_time),
1165 .private = offsetof(struct bfq_group, stats.wait_time),
1166 .seq_show = bfqg_print_rwstat,
1167 },
1168 {
1169 .name = BFQ_CGROUP_FNAME(io_merged),
1170 .private = offsetof(struct bfq_group, stats.merged),
1171 .seq_show = bfqg_print_rwstat,
1172 },
1173 {
1174 .name = BFQ_CGROUP_FNAME(io_queued),
1175 .private = offsetof(struct bfq_group, stats.queued),
1176 .seq_show = bfqg_print_rwstat,
1177 },
1178
1179 /* the same statistics, covering the bfqg and its descendants */
1180 {
1181 .name = BFQ_CGROUP_FNAME(time_recursive),
1182 .private = offsetof(struct bfq_group, stats.time),
1183 .seq_show = bfqg_print_stat_recursive,
1184 },
1185 {
1186 .name = BFQ_CGROUP_FNAME(sectors_recursive),
1187 .seq_show = bfqg_print_stat_sectors_recursive,
1188 },
1189 {
1190 .name = BFQ_CGROUP_FNAME(io_service_bytes_recursive),
1191 .private = (unsigned long)&blkcg_policy_bfq,
1192 .seq_show = blkg_print_stat_bytes_recursive,
1193 },
1194 {
1195 .name = BFQ_CGROUP_FNAME(io_serviced_recursive),
1196 .private = (unsigned long)&blkcg_policy_bfq,
1197 .seq_show = blkg_print_stat_ios_recursive,
1198 },
1199 {
1200 .name = BFQ_CGROUP_FNAME(io_service_time_recursive),
1201 .private = offsetof(struct bfq_group, stats.service_time),
1202 .seq_show = bfqg_print_rwstat_recursive,
1203 },
1204 {
1205 .name = BFQ_CGROUP_FNAME(io_wait_time_recursive),
1206 .private = offsetof(struct bfq_group, stats.wait_time),
1207 .seq_show = bfqg_print_rwstat_recursive,
1208 },
1209 {
1210 .name = BFQ_CGROUP_FNAME(io_merged_recursive),
1211 .private = offsetof(struct bfq_group, stats.merged),
1212 .seq_show = bfqg_print_rwstat_recursive,
1213 },
1214 {
1215 .name = BFQ_CGROUP_FNAME(io_queued_recursive),
1216 .private = offsetof(struct bfq_group, stats.queued),
1217 .seq_show = bfqg_print_rwstat_recursive,
1218 },
1219 {
1220 .name = BFQ_CGROUP_FNAME(avg_queue_size),
1221 .seq_show = bfqg_print_avg_queue_size,
1222 },
1223 {
1224 .name = BFQ_CGROUP_FNAME(group_wait_time),
1225 .private = offsetof(struct bfq_group, stats.group_wait_time),
1226 .seq_show = bfqg_print_stat,
1227 },
1228 {
1229 .name = BFQ_CGROUP_FNAME(idle_time),
1230 .private = offsetof(struct bfq_group, stats.idle_time),
1231 .seq_show = bfqg_print_stat,
1232 },
1233 {
1234 .name = BFQ_CGROUP_FNAME(empty_time),
1235 .private = offsetof(struct bfq_group, stats.empty_time),
1236 .seq_show = bfqg_print_stat,
1237 },
1238 {
1239 .name = BFQ_CGROUP_FNAME(dequeue),
1240 .private = offsetof(struct bfq_group, stats.dequeue),
1241 .seq_show = bfqg_print_stat,
1242 },
1243 { } /* terminate */
1244 };
1245
1246 static struct cftype bfq_blkg_files[] = {
1247 {
1248 .name = BFQ_CGROUP_FNAME(weight),
1249 .flags = CFTYPE_NOT_ON_ROOT,
1250 .seq_show = bfq_io_show_weight,
1251 .write = bfq_io_set_weight,
1252 },
1253 {} /* terminate */
1254 };
1255
1256 #undef BFQ_CGROUP_FNAME
1257
1258 #else /* BFQ_GROUP_IOSCHED_ENABLED */
1259
1260 static inline void bfqg_stats_update_io_add(struct bfq_group *bfqg,
1261 struct bfq_queue *bfqq, unsigned int op) { }
1262 static inline void
1263 bfqg_stats_update_io_remove(struct bfq_group *bfqg, unsigned int op) { }
1264 static inline void
1265 bfqg_stats_update_io_merged(struct bfq_group *bfqg, unsigned int op) { }
1266 static inline void bfqg_stats_update_completion(struct bfq_group *bfqg,
1267 uint64_t start_time, uint64_t io_start_time,
1268 unsigned int op) { }
1269 static inline void
1270 bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
1271 struct bfq_group *curr_bfqg) { }
1272 static inline void bfqg_stats_end_empty_time(struct bfqg_stats *stats) { }
1273 static inline void bfqg_stats_update_dequeue(struct bfq_group *bfqg) { }
1274 static inline void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg) { }
1275 static inline void bfqg_stats_update_idle_time(struct bfq_group *bfqg) { }
1276 static inline void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg) { }
1277 static inline void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg) { }
1278
1279 static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
1280 struct bfq_group *bfqg) {}
1281
1282 static void bfq_init_entity(struct bfq_entity *entity,
1283 struct bfq_group *bfqg)
1284 {
1285 struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
1286
1287 entity->weight = entity->new_weight;
1288 entity->orig_weight = entity->new_weight;
1289 if (bfqq) {
1290 bfqq->ioprio = bfqq->new_ioprio;
1291 bfqq->ioprio_class = bfqq->new_ioprio_class;
1292 }
1293 entity->sched_data = &bfqg->sched_data;
1294 }
1295
1296 static void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) {}
1297
1298 static void bfq_end_wr_async(struct bfq_data *bfqd)
1299 {
1300 bfq_end_wr_async_queues(bfqd, bfqd->root_group);
1301 }
1302
1303 static struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
1304 struct blkcg *blkcg)
1305 {
1306 return bfqd->root_group;
1307 }
1308
1309 static struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
1310 {
1311 return bfqq->bfqd->root_group;
1312 }
1313
1314 static struct bfq_group *
1315 bfq_create_group_hierarchy(struct bfq_data *bfqd, int node)
1316 {
1317 struct bfq_group *bfqg;
1318 int i;
1319
1320 bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
1321 if (!bfqg)
1322 return NULL;
1323
1324 for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
1325 bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
1326
1327 return bfqg;
1328 }
1329 #endif
1 /*
2 * BFQ: I/O context handling.
3 *
4 * Based on ideas and code from CFQ:
5 * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
6 *
7 * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
8 * Paolo Valente <paolo.valente@unimore.it>
9 *
10 * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
11 */
12
13 /**
14 * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
15 * @icq: the iocontext queue.
16 */
17 static struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
18 {
19 /* bic->icq is the first member, %NULL will convert to %NULL */
20 return container_of(icq, struct bfq_io_cq, icq);
21 }
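The NULL-preserving property claimed in the comment of icq_to_bic() can be checked in user space. What follows is a minimal sketch, not kernel code: my_container_of, struct fake_icq and struct fake_bic are hypothetical stand-ins for the kernel's container_of, struct io_cq and struct bfq_io_cq, and the pointer adjustment goes through uintptr_t on the assumption that integer zero converts back to a null pointer, as it does on mainstream platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct io_cq and struct bfq_io_cq. */
struct fake_icq {
	int dummy;
};

struct fake_bic {
	struct fake_icq icq;	/* must stay the first member */
	int saved_idle_window;
};

/* User-space approximation of the kernel's container_of(). */
#define my_container_of(ptr, type, member) \
	((type *)((uintptr_t)(ptr) - offsetof(type, member)))

static struct fake_bic *fake_icq_to_bic(struct fake_icq *icq)
{
	/* offsetof(struct fake_bic, icq) is 0, so NULL stays NULL */
	return my_container_of(icq, struct fake_bic, icq);
}
```

If icq were moved away from the first position, the same call would turn NULL into a non-NULL garbage pointer, which is why the comment pins the member order.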
22
23 /**
24 * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
25 * @bfqd: the lookup key.
26 * @ioc: the io_context of the process doing I/O.
27 *
28 * Queue lock must be held.
29 */
30 static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
31 struct io_context *ioc)
32 {
33 if (ioc)
34 return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue));
35 return NULL;
36 }
725 725 }
726 726
727 727 static void
728 bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
728 bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
729 struct bfq_io_cq *bic, bool bfq_already_existing)
729 730 {
731 unsigned int old_wr_coeff = bfqq->wr_coeff;
732 bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
733
730 734 if (bic->saved_idle_window)
731 735 bfq_mark_bfqq_idle_window(bfqq);
732 736 else
758 758
759 759 /* make sure weight will be updated, however we got here */
760 760 bfqq->entity.prio_changed = 1;
761
762 if (likely(!busy))
763 return;
764
765 if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
766 bfqd->wr_busy_queues++;
767 else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
768 bfqd->wr_busy_queues--;
761 769 }
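The counter bookkeeping added at the end of bfq_bfqq_resume_state() can be isolated as a pure function. This is a simplified sketch that assumes only the busy flag and the two weight-raising coefficients matter; adjust_wr_busy_queues is a hypothetical helper and is not part of the patch.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: return the new value of the wr_busy_queues
 * counter after a queue's wr_coeff went from old_wr_coeff to
 * new_wr_coeff. A coefficient of 1 means "not weight-raised".
 */
static int adjust_wr_busy_queues(int wr_busy_queues, bool busy,
				 unsigned int old_wr_coeff,
				 unsigned int new_wr_coeff)
{
	if (!busy)
		return wr_busy_queues;	/* non-busy queues are not counted */

	if (old_wr_coeff == 1 && new_wr_coeff > 1)
		wr_busy_queues++;	/* queue became weight-raised */
	else if (old_wr_coeff > 1 && new_wr_coeff == 1)
		wr_busy_queues--;	/* weight raising ended */

	return wr_busy_queues;
}
```

A change between two coefficients that are both greater than 1 leaves the counter untouched, because the queue was already counted as weight-raised.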
762 770
763 771 static int bfqq_process_refs(struct bfq_queue *bfqq)
3483 3483 }
3484 3484 }
3485 3485 }
3486 /* Update weight both if it must be raised and if it must be lowered */
3486 /*
3487 * To improve latency (for this or other queues), immediately
3488 * update weight both if it must be raised and if it must be
3489 * lowered. Since the entity may be on some active tree here, and
3490 * might have a pending change of its ioprio class, invoke the
3491 * next function with the last parameter unset (see the
3492 * comments on the function).
3493 */
3487 3494 if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
3488 __bfq_entity_update_weight_prio(
3489 bfq_entity_service_tree(entity),
3490 entity);
3495 __bfq_entity_update_weight_prio(bfq_entity_service_tree(entity),
3496 entity, false);
3491 3497 }
3492 3498
3493 3499 /*
4299 4299 bfq_bfqq_expire(bfqd, bfqq, false,
4300 4300 BFQQE_NO_MORE_REQUESTS);
4301 4301 }
4302
4303 if (!bfqd->rq_in_driver)
4304 bfq_schedule_dispatch(bfqd);
4302 4305 }
4303 4306
4304 4307 static void bfq_put_rq_priv_body(struct bfq_queue *bfqq)
4419 4419 struct bio *bio)
4420 4420 {
4421 4421 struct bfq_data *bfqd = q->elevator->elevator_data;
4422 struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
4422 struct bfq_io_cq *bic;
4423 4423 const int is_sync = rq_is_sync(rq);
4424 4424 struct bfq_queue *bfqq;
4425 4425 bool new_queue = false;
4426 bool split = false;
4426 bool bfqq_already_existing = false, split = false;
4427 4427
4428 if (!rq->elv.icq)
4429 return 1;
4430 bic = icq_to_bic(rq->elv.icq);
4431
4428 4432 spin_lock_irq(&bfqd->lock);
4429 4433
4430 if (!bic)
4431 goto queue_fail;
4432
4433 4434 bfq_check_ioprio_change(bic, bio);
4434 4435
4435 4436 bfq_bic_update_cgroup(bic, bio);
4454 4454 bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
4455 4455 true, is_sync,
4456 4456 NULL);
4457 else
4458 bfqq_already_existing = true;
4457 4459 }
4458 4460 }
4459 4461
4481 4481 * queue: restore the idle window and the
4482 4482 * possible weight raising period.
4483 4483 */
4484 bfq_bfqq_resume_state(bfqq, bic);
4484 bfq_bfqq_resume_state(bfqq, bfqd, bic,
4485 bfqq_already_existing);
4485 4486 }
4486 4487 }
4487 4488
4490 4490 bfq_handle_burst(bfqd, bfqq);
4491 4491
4492 4492 spin_unlock_irq(&bfqd->lock);
4493
4494 4493 return 0;
4495
4496 queue_fail:
4497 spin_unlock_irq(&bfqd->lock);
4498
4499 return 1;
4500 4494 }
4501 4495
4502 4496 static void bfq_idle_slice_timer_body(struct bfq_queue *bfqq)
71 71 *
72 72 * bfq_sched_data is the basic scheduler queue. It supports three
73 73 * ioprio_classes, and can be used either as a toplevel queue or as an
74 * intermediate queue on a hierarchical setup. @next_in_service
75 * points to the active entity of the sched_data service trees that
76 * will be scheduled next. It is used to reduce the number of steps
77 * needed for each hierarchical-schedule update.
74 * intermediate queue in a hierarchical setup.
78 75 *
79 76 * The supported ioprio_classes are the same as in CFQ, in descending
80 77 * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
81 78 * Requests from higher priority queues are served before all the
82 79 * requests from lower priority queues; among requests of the same
83 80 * queue requests are served according to B-WF2Q+.
84 * All the fields are protected by the queue lock of the containing bfqd.
81 *
82 * The schedule is implemented by the service trees, plus the field
83 * @next_in_service, which points to the entity on the active trees
84 * that will be served next, if 1) no change in the schedule occurs
85 * before the current in-service entity is expired, 2) the in-service
86 * queue becomes idle when it expires, and 3) if the entity pointed to by
87 * in_service_entity is not a queue, then the in-service child entity
88 * of the entity pointed to by in_service_entity becomes idle on
89 * expiration. This peculiar definition allows for the following
90 * optimization, not yet exploited: while a given entity is still in
91 * service, we already know which is the best candidate for next
92 * service among the other active entities in the same parent
93 * entity. We can then quickly compare the timestamps of the
94 * in-service entity with those of such best candidate.
95 *
96 * All fields are protected by the lock of the containing bfqd.
85 97 */
86 98 struct bfq_sched_data {
87 99 /* entity in service */
904 904 struct bfq_entity *entity);
905 905 struct bfq_service_tree *
906 906 __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
907 struct bfq_entity *entity);
907 struct bfq_entity *entity,
908 bool update_class_too);
908 909 void bfq_bfqq_served(struct bfq_queue *bfqq, int served);
909 910 void bfq_bfqq_charge_time(struct bfq_data *bfqd, struct bfq_queue *bfqq,
910 911 unsigned long time_ms);
1 /*
2 * Budget Fair Queueing (BFQ) I/O scheduler.
3 *
4 * Based on ideas and code from CFQ:
5 * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
6 *
7 * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
8 * Paolo Valente <paolo.valente@unimore.it>
9 *
10 * Copyright (C) 2015 Paolo Valente <paolo.valente@unimore.it>
11 *
12 * Copyright (C) 2017 Paolo Valente <paolo.valente@linaro.org>
13 *
14 * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
15 * file.
16 *
17 * BFQ is a proportional-share I/O scheduler, with some extra
18 * low-latency capabilities. BFQ also supports full hierarchical
19 * scheduling through cgroups. The next paragraphs provide an
20 * introduction to BFQ's inner workings. Details on BFQ's benefits and
21 * usage can be found in Documentation/block/bfq-iosched.txt.
22 *
23 * BFQ is a proportional-share storage-I/O scheduling algorithm based
24 * on the slice-by-slice service scheme of CFQ. But BFQ assigns
25 * budgets, measured in number of sectors, to processes instead of
26 * time slices. The device is not granted to the in-service process
27 * for a given time slice, but until it has exhausted its assigned
28 * budget. This change from the time to the service domain enables BFQ
29 * to distribute the device throughput among processes as desired,
30 * without any distortion due to throughput fluctuations, or to device
31 * internal queueing. BFQ uses an ad hoc internal scheduler, called
32 * B-WF2Q+, to schedule processes according to their budgets. More
33 * precisely, BFQ schedules queues associated with processes. Thanks to
34 * the accurate policy of B-WF2Q+, BFQ can afford to assign high
35 * budgets to I/O-bound processes issuing sequential requests (to
36 * boost the throughput), and yet guarantee a low latency to
37 * interactive and soft real-time applications.
38 *
39 * NOTE: if the main or only goal, with a given device, is to achieve
40 * the maximum-possible throughput at all times, then do switch off
41 * all low-latency heuristics for that device, by setting low_latency
42 * to 0.
43 *
44 * BFQ is described in [1], which also references the initial, more
45 * theoretical paper on BFQ. The interested reader can find
46 * in the latter paper full details on the main algorithm, as well as
47 * formulas of the guarantees and formal proofs of all the properties.
48 * With respect to the version of BFQ presented in these papers, this
49 * implementation adds a few more heuristics, such as the one that
50 * guarantees a low latency to soft real-time applications, and a
51 * hierarchical extension based on H-WF2Q+.
52 *
53 * B-WF2Q+ is based on WF2Q+, which is described in [2], together with
54 * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
55 * complexity derives from the one introduced with EEVDF in [3].
56 *
57 * [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
58 * Scheduler", Proceedings of the First Workshop on Mobile System
59 * Technologies (MST-2015), May 2015.
60 * http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
61 *
62 * http://algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
63 *
64 * [2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
65 * Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
66 * Oct 1997.
67 *
68 * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
69 *
70 * [3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
71 * First: A Flexible and Accurate Mechanism for Proportional Share
72 * Resource Allocation,'' technical report.
73 *
74 * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
75 */
76 #include <linux/module.h>
77 #include <linux/slab.h>
78 #include <linux/blkdev.h>
79 #include <linux/cgroup.h>
80 #include <linux/elevator.h>
81 #include <linux/jiffies.h>
82 #include <linux/rbtree.h>
83 #include <linux/ioprio.h>
84 #include <linux/sbitmap.h>
85 #include <linux/delay.h>
86
87 #include "blk.h"
88 #include "blk-mq.h"
89 #include "blk-mq-tag.h"
90 #include "blk-mq-sched.h"
91 #include "bfq-mq.h"
92
93 /* Expiration time of sync (0) and async (1) requests, in ns. */
94 static const u64 bfq_fifo_expire[2] = { NSEC_PER_SEC / 4, NSEC_PER_SEC / 8 };
95
96 /* Maximum backwards seek, in KiB. */
97 static const int bfq_back_max = (16 * 1024);
98
99 /* Penalty of a backwards seek, in number of sectors. */
100 static const int bfq_back_penalty = 2;
101
102 /* Idling period duration, in ns. */
103 static u32 bfq_slice_idle = (NSEC_PER_SEC / 125);
104
105 /* Minimum number of assigned budgets for which stats are safe to compute. */
106 static const int bfq_stats_min_budgets = 194;
107
108 /* Default maximum budget values, in sectors and number of requests. */
109 static const int bfq_default_max_budget = (16 * 1024);
110
111 /*
112 * Async to sync throughput distribution is controlled as follows:
113 * when an async request is served, the entity is charged the number
114 * of sectors of the request, multiplied by the factor below.
115 */
116 static const int bfq_async_charge_factor = 10;
117
118 /* Default timeout values, in jiffies, approximating CFQ defaults. */
119 static const int bfq_timeout = (HZ / 8);
120
121 static struct kmem_cache *bfq_pool;
122
123 /* Below this threshold (in ns), we consider thinktime immediate. */
124 #define BFQ_MIN_TT (2 * NSEC_PER_MSEC)
125
126 /* hw_tag detection: parallel requests threshold and min samples needed. */
127 #define BFQ_HW_QUEUE_THRESHOLD 4
128 #define BFQ_HW_QUEUE_SAMPLES 32
129
130 #define BFQQ_SEEK_THR (sector_t)(8 * 100)
131 #define BFQQ_SECT_THR_NONROT (sector_t)(2 * 32)
132 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
133 #define BFQQ_SEEKY(bfqq) (hweight32(bfqq->seek_history) > 32/8)
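BFQQ_SEEKY() classifies a queue as seeky when more than 32/8 = 4 bits are set in its 32-bit seek_history. The test can be reproduced outside the kernel as sketched below. popcount32 is a portable, hypothetical replacement for the kernel's hweight32(), and the interpretation of each bit as "one past request counted as a seek" is an assumption, since the code that fills seek_history is not part of this excerpt.

```c
#include <stdint.h>

/* Portable stand-in for hweight32(): number of set bits in w. */
static unsigned int popcount32(uint32_t w)
{
	unsigned int n = 0;

	while (w) {
		w &= w - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Mirrors BFQQ_SEEKY(): more than 32/8 = 4 bits must be set. */
static int is_seeky(uint32_t seek_history)
{
	return popcount32(seek_history) > 32 / 8;
}
```

Because the comparison is strict, a history with exactly four bits set is still treated as non-seeky.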
134
135 /* Min number of samples required to perform peak-rate update */
136 #define BFQ_RATE_MIN_SAMPLES 32
137 /* Min observation time interval required to perform a peak-rate update (ns) */
138 #define BFQ_RATE_MIN_INTERVAL (300*NSEC_PER_MSEC)
139 /* Target observation time interval for a peak-rate update (ns) */
140 #define BFQ_RATE_REF_INTERVAL NSEC_PER_SEC
141
142 /* Shift used for peak rate fixed precision calculations. */
143 #define BFQ_RATE_SHIFT 16
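Rates expressed in sectors/usec are far below 1 for many devices, which is why a fixed-point representation shifted by BFQ_RATE_SHIFT is used. The sketch below illustrates the encoding in user space; rate_encode and rate_to_sectors_per_sec are hypothetical helpers and RATE_SHIFT simply mirrors the value 16 defined above.

```c
#include <stdint.h>

#define RATE_SHIFT 16	/* mirrors BFQ_RATE_SHIFT */

/*
 * Encode "sectors transferred in usecs microseconds" as a fixed-point
 * rate in sectors/usec. The caller must pass usecs != 0.
 */
static uint64_t rate_encode(uint64_t sectors, uint64_t usecs)
{
	return (sectors << RATE_SHIFT) / usecs;
}

/* Decode a fixed-point rate into whole sectors per second. */
static uint64_t rate_to_sectors_per_sec(uint64_t rate)
{
	return (rate * 1000000ULL) >> RATE_SHIFT;
}
```

For instance, a fixed-point value of 1000 decodes to 15258 sectors per second, and a device moving 1000 sectors per second encodes to 65.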
144
145 /*
146 * By default, BFQ computes the duration of the weight raising for
147 * interactive applications automatically, using the following formula:
148 * duration = (R / r) * T, where r is the peak rate of the device, and
149 * R and T are two reference parameters.
150 * In particular, R is the peak rate of the reference device (see below),
151 * and T is a reference time: given the systems that are likely to be
152 * installed on the reference device according to its speed class, T is
153 * about the maximum time needed, under BFQ and while reading two files in
154 * parallel, to load typical large applications on these systems.
155 * In practice, the slower/faster the device at hand is, the more/less it
156 * takes to load applications with respect to the reference device.
157 * Accordingly, the longer/shorter BFQ grants weight raising to interactive
158 * applications.
159 *
160 * BFQ uses four different reference pairs (R, T), depending on:
161 * . whether the device is rotational or non-rotational;
162 * . whether the device is slow, such as old or portable HDDs, as well as
163 * SD cards, or fast, such as newer HDDs and SSDs.
164 *
165 * The device's speed class is dynamically (re)detected in
166 * bfq_update_peak_rate() every time the estimated peak rate is updated.
167 *
168 * In the following definitions, R_slow[0]/R_fast[0] and
169 * T_slow[0]/T_fast[0] are the reference values for a slow/fast
170 * rotational device, whereas R_slow[1]/R_fast[1] and
171 * T_slow[1]/T_fast[1] are the reference values for a slow/fast
172 * non-rotational device. Finally, device_speed_thresh are the
173 * thresholds used to switch between speed classes. The reference
174 * rates are not the actual peak rates of the devices used as a
175 * reference, but slightly lower values. The reason for using these
176 * slightly lower values is that the peak-rate estimator tends to
177 * yield slightly lower values than the actual peak rate (it can yield
178 * the actual peak rate only if there is only one process doing I/O,
179 * and the process does sequential I/O).
180 *
181 * Both the reference peak rates and the thresholds are measured in
182 * sectors/usec, left-shifted by BFQ_RATE_SHIFT.
183 */
184 static int R_slow[2] = {1000, 10700};
185 static int R_fast[2] = {14000, 33000};
186 /*
187 * To improve readability, a conversion function is used to initialize the
188 * following arrays, which entails that they can be initialized only in a
189 * function.
190 */
191 static int T_slow[2];
192 static int T_fast[2];
193 static int device_speed_thresh[2];
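The formula duration = (R / r) * T from the comment above can be evaluated with integer arithmetic by multiplying before dividing. The sketch below is illustrative only: wr_duration is a hypothetical name, the fallback for a zero peak rate is an assumption, and the value of T used in the note after the code is made up, since T_slow and T_fast are initialized elsewhere.

```c
#include <stdint.h>

/*
 * Hypothetical helper evaluating duration = (R / r) * T.
 * ref_rate (R) and peak_rate (r) must use the same unit, e.g.
 * sectors/usec left-shifted by BFQ_RATE_SHIFT; the result has the
 * unit of ref_time (T). R * T is computed first so that the integer
 * division does not truncate R / r to zero on fast devices.
 */
static uint64_t wr_duration(uint64_t ref_rate, uint64_t ref_time,
			    uint64_t peak_rate)
{
	if (peak_rate == 0)
		return ref_time;	/* no estimate yet: assumed fallback */

	return ref_rate * ref_time / peak_rate;
}
```

With R = 14000 and an assumed T of 3000 ms, a device whose estimated peak rate is 28000 gets 1500 ms of weight raising, while one whose peak rate is 7000 gets 6000 ms. This matches the comment: the slower the device, the longer the weight-raising period.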
194
195 #define BFQ_SERVICE_TREE_INIT ((struct bfq_service_tree) \
196 { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
197
198 #define RQ_BIC(rq) ((struct bfq_io_cq *) (rq)->elv.priv[0])
199 #define RQ_BFQQ(rq) ((rq)->elv.priv[1])
200
201 /**
202 * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
203 * @icq: the iocontext queue.
204 */
205 static struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
206 {
207 /* bic->icq is the first member, %NULL will convert to %NULL */
208 return container_of(icq, struct bfq_io_cq, icq);
209 }
210
211 /**
212 * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
213 * @bfqd: the lookup key.
214 * @ioc: the io_context of the process doing I/O.
215 * @q: the request queue.
216 */
217 static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
218 struct io_context *ioc,
219 struct request_queue *q)
220 {
221 if (ioc) {
222 unsigned long flags;
223 struct bfq_io_cq *icq;
224
225 spin_lock_irqsave(q->queue_lock, flags);
226 icq = icq_to_bic(ioc_lookup_icq(ioc, q));
227 spin_unlock_irqrestore(q->queue_lock, flags);
228
229 return icq;
230 }
231
232 return NULL;
233 }
234
235 /*
236 * Schedule a run of the queue, if there are requests pending and none
237 * in the driver that will restart queueing.
238 */
239 static void bfq_schedule_dispatch(struct bfq_data *bfqd)
240 {
241 if (bfqd->queued != 0) {
242 bfq_log(bfqd, "schedule dispatch");
243 blk_mq_run_hw_queues(bfqd->queue, true);
244 }
245 }
246
247 #define BFQ_MQ
248 #include "bfq-sched.c"
249 #include "bfq-cgroup-included.c"
250
251 #define bfq_class_idle(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
252 #define bfq_class_rt(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_RT)
253
254 #define bfq_sample_valid(samples) ((samples) > 80)
255
256 /*
257 * Lifted from AS - choose which of rq1 and rq2 is best served now.
258 * We choose the request that is closest to the head right now. Distance
259 * behind the head is penalized and only allowed to a certain extent.
260 */
261 static struct request *bfq_choose_req(struct bfq_data *bfqd,
262 struct request *rq1,
263 struct request *rq2,
264 sector_t last)
265 {
266 sector_t s1, s2, d1 = 0, d2 = 0;
267 unsigned long back_max;
268 #define BFQ_RQ1_WRAP 0x01 /* request 1 wraps */
269 #define BFQ_RQ2_WRAP 0x02 /* request 2 wraps */
270 unsigned int wrap = 0; /* bit mask: requests behind the disk head? */
271
272 if (!rq1 || rq1 == rq2)
273 return rq2;
274 if (!rq2)
275 return rq1;
276
277 if (rq_is_sync(rq1) && !rq_is_sync(rq2))
278 return rq1;
279 else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
280 return rq2;
281 if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
282 return rq1;
283 else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
284 return rq2;
285
286 s1 = blk_rq_pos(rq1);
287 s2 = blk_rq_pos(rq2);
288
289 /*
290 * By definition, 1KiB is 2 sectors.
291 */
292 back_max = bfqd->bfq_back_max * 2;
293
294 /*
295 * Strict one way elevator _except_ in the case where we allow
296 * short backward seeks which are biased as twice the cost of a
297 * similar forward seek.
298 */
299 if (s1 >= last)
300 d1 = s1 - last;
301 else if (s1 + back_max >= last)
302 d1 = (last - s1) * bfqd->bfq_back_penalty;
303 else
304 wrap |= BFQ_RQ1_WRAP;
305
306 if (s2 >= last)
307 d2 = s2 - last;
308 else if (s2 + back_max >= last)
309 d2 = (last - s2) * bfqd->bfq_back_penalty;
310 else
311 wrap |= BFQ_RQ2_WRAP;
312
313 /* Found required data */
314
315 /*
316 * By doing switch() on the bit mask "wrap" we avoid having to
317 * check two variables for all permutations: --> faster!
318 */
319 switch (wrap) {
320 case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
321 if (d1 < d2)
322 return rq1;
323 else if (d2 < d1)
324 return rq2;
325
326 if (s1 >= s2)
327 return rq1;
328 else
329 return rq2;
330
331 case BFQ_RQ2_WRAP:
332 return rq1;
333 case BFQ_RQ1_WRAP:
334 return rq2;
335 case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
336 default:
337 /*
338 * Since both rqs are wrapped,
339 * start with the one that's further behind head
340 * (--> only *one* back seek required),
341 * since back seek takes more time than forward.
342 */
343 if (s1 <= s2)
344 return rq1;
345 else
346 return rq2;
347 }
348 }
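The position-based part of bfq_choose_req() can be exercised in user space with plain sector numbers. The sketch below deliberately leaves out the sync and REQ_META checks and works on sectors instead of struct request; seek_cost and choose_sector are hypothetical names, and back_max is assumed to be already converted to sectors.

```c
#include <stdint.h>

typedef uint64_t sector_t;	/* assumption: 64-bit sector numbers */

/*
 * Cost of reaching sector s from head position last. Returns 0 and
 * stores the cost in *dist if s is reachable, or returns 1 if s lies
 * too far behind the head (the "wrap" case).
 */
static int seek_cost(sector_t s, sector_t last, sector_t back_max,
		     unsigned int back_penalty, sector_t *dist)
{
	if (s >= last) {
		*dist = s - last;			/* forward seek */
		return 0;
	}
	if (s + back_max >= last) {
		*dist = (last - s) * back_penalty;	/* short backward seek */
		return 0;
	}
	return 1;
}

/* Return 1 or 2 depending on which sector is better served next. */
static int choose_sector(sector_t s1, sector_t s2, sector_t last,
			 sector_t back_max, unsigned int back_penalty)
{
	sector_t d1 = 0, d2 = 0;
	int wrap1 = seek_cost(s1, last, back_max, back_penalty, &d1);
	int wrap2 = seek_cost(s2, last, back_max, back_penalty, &d2);

	if (!wrap1 && !wrap2) {
		if (d1 != d2)
			return d1 < d2 ? 1 : 2;
		return s1 >= s2 ? 1 : 2;	/* tie: prefer the higher sector */
	}
	if (wrap1 != wrap2)
		return wrap2 ? 1 : 2;	/* prefer the one that does not wrap */

	/* both wrapped: start with the one further behind the head */
	return s1 <= s2 ? 1 : 2;
}
```

With the head at sector 1000, back_max of 100 sectors and a penalty of 2, sector 990 (cost 20) is preferred over sector 1050 (cost 50), whereas sector 1010 (cost 10) wins against sector 990.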
349
350 static struct bfq_queue *
351 bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
352 sector_t sector, struct rb_node **ret_parent,
353 struct rb_node ***rb_link)
354 {
355 struct rb_node **p, *parent;
356 struct bfq_queue *bfqq = NULL;
357
358 parent = NULL;
359 p = &root->rb_node;
360 while (*p) {
361 struct rb_node **n;
362
363 parent = *p;
364 bfqq = rb_entry(parent, struct bfq_queue, pos_node);
365
366 /*
367 * Sort strictly based on sector. Smallest to the left,
368 * largest to the right.
369 */
370 if (sector > blk_rq_pos(bfqq->next_rq))
371 n = &(*p)->rb_right;
372 else if (sector < blk_rq_pos(bfqq->next_rq))
373 n = &(*p)->rb_left;
374 else
375 break;
376 p = n;
377 bfqq = NULL;
378 }
379