-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmTWBMYACgkQONu9yGCS
aT5fcw//f8IqgXhnM1RmdENWcj8Yttld1jY0L8+z2fRvkzmuqFJAnOuTEP/BV9Zk
iMNH6Hg5iZh/ajGyW4OxsWHvaDNyZtpPOgNtQkhHPPDq5tqAgg+8ZgPlZkmbvnGd
askxaSJE7OuJOfG193o/Uf0CR/boSIN1ioIu0vumqhrP2NUbe44/PLeSB239ZdGy
nIaBo1JXffOH8P7kSS4E9NSrfoA9MQEuJgcYPkc1c08W2FWO8MftM/hdQtXGwbNC
LCy4yGc3PN40MT7tOsXE0w3P+ZUXfP6g8NgHooRKuLimSiAYodLgCwnvELZ/Nsg+
w1TPDxbLD99te5J16GzlzhN4+9BUtf2qq9ZgiJQ8lmKaGc+hAMRKF2h2E5Qhla8R
TJubYFjD5yilANlRumVHMzNJZntROw0hG0ZIX6An/1QM5JAy7B736jI6jt+RZFSx
r08xhBXcO+m3s2Vc2OojJFKLot9i0ugiKkTuQBZsBFDfcOtSrUUarB6Vz6wZvCY8
sojQOS0eoYb+2GlKJ0UzTPLEHrCpusRkEnv3QMAPfTkw6vqvkrYACfOEbBujfT8e
TtC7wuS3beULYPKpObe9HrpCooOXX8YQFXyld5e5iBINXwt/UT4daDL85BbMsPEu
MPaSKrTMGXUsRoOWHiuPumT/MDE5LBSCqhyi41k90R9qRW6M+Wk=
=KuAb
-----END PGP SIGNATURE-----
Merge 4.19.291 into android-4.19-stable
Changes in 4.19.291
gfs2: Don't deref jdesc in evict
x86/smp: Use dedicated cache-line for mwait_play_dead()
video: imsttfb: check for ioremap() failures
fbdev: imsttfb: Fix use after free bug in imsttfb_probe
drm/edid: Fix uninitialized variable in drm_cvt_modes()
scripts/tags.sh: Resolve gtags empty index generation
drm/amdgpu: Validate VM ioctl flags.
treewide: Remove uninitialized_var() usage
md/raid10: check slab-out-of-bounds in md_bitmap_get_counter
md/raid10: fix overflow of md/safe_mode_delay
md/raid10: fix wrong setting of max_corr_read_errors
md/raid10: fix io loss while replacement replace rdev
irqchip/jcore-aic: Kill use of irq_create_strict_mappings()
irqchip/jcore-aic: Fix missing allocation of IRQ descriptors
clocksource/drivers: Unify the names to timer-* format
clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
clocksource/drivers/cadence-ttc: Fix memory leak in ttc_timer_probe
PM: domains: fix integer overflow issues in genpd_parse_state()
ARM: 9303/1: kprobes: avoid missing-declaration warnings
evm: Complete description of evm_inode_setattr()
wifi: ath9k: fix AR9003 mac hardware hang check register offset calculation
wifi: ath9k: avoid referencing uninit memory in ath9k_wmi_ctrl_rx
samples/bpf: Fix buffer overflow in tcp_basertt
wifi: mwifiex: Fix the size of a memory allocation in mwifiex_ret_802_11_scan()
nfc: constify several pointers to u8, char and sk_buff
nfc: llcp: fix possible use of uninitialized variable in nfc_llcp_send_connect()
wifi: orinoco: Fix an error handling path in spectrum_cs_probe()
wifi: orinoco: Fix an error handling path in orinoco_cs_probe()
wifi: atmel: Fix an error handling path in atmel_probe()
wl3501_cs: Fix a bunch of formatting issues related to function docs
wl3501_cs: Remove unnecessary NULL check
wl3501_cs: Fix misspelling and provide missing documentation
net: create netdev->dev_addr assignment helpers
wl3501_cs: use eth_hw_addr_set()
wifi: wl3501_cs: Fix an error handling path in wl3501_probe()
wifi: ray_cs: Utilize strnlen() in parse_addr()
wifi: ray_cs: Drop useless status variable in parse_addr()
wifi: ray_cs: Fix an error handling path in ray_probe()
wifi: ath9k: don't allow to overwrite ENDPOINT0 attributes
wifi: rsi: Do not set MMC_PM_KEEP_POWER in shutdown
watchdog/perf: define dummy watchdog_update_hrtimer_threshold() on correct config
watchdog/perf: more properly prevent false positives with turbo modes
kexec: fix a memory leak in crash_shrink_memory()
memstick r592: make memstick_debug_get_tpc_name() static
wifi: ath9k: Fix possible stall on ath9k_txq_list_has_key()
wifi: ath9k: convert msecs to jiffies where needed
netlink: fix potential deadlock in netlink_set_err()
netlink: do not hard code device address lenth in fdb dumps
gtp: Fix use-after-free in __gtp_encap_destroy().
lib/ts_bm: reset initial match offset for every block of text
netfilter: nf_conntrack_sip: fix the ct_sip_parse_numerical_param() return value.
ipvlan: Fix return value of ipvlan_queue_xmit()
netlink: Add __sock_i_ino() for __netlink_diag_dump().
radeon: avoid double free in ci_dpm_init()
Input: drv260x - sleep between polling GO bit
ARM: dts: BCM5301X: Drop "clock-names" from the SPI node
Input: adxl34x - do not hardcode interrupt trigger type
drm/panel: simple: fix active size for Ampire AM-480272H3TMQW-T01H
ARM: ep93xx: fix missing-prototype warnings
ASoC: es8316: Increment max value for ALC Capture Target Volume control
soc/fsl/qe: fix usb.c build errors
IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors
arm64: dts: renesas: ulcb-kf: Remove flow control for SCIF1
fbdev: omapfb: lcd_mipid: Fix an error handling path in mipid_spi_probe()
drm/radeon: fix possible division-by-zero errors
ALSA: ac97: Fix possible NULL dereference in snd_ac97_mixer
scsi: 3w-xxxx: Add error handling for initialization failure in tw_probe()
PCI: Add pci_clear_master() stub for non-CONFIG_PCI
pinctrl: cherryview: Return correct value if pin in push-pull mode
perf dwarf-aux: Fix off-by-one in die_get_varname()
pinctrl: at91-pio4: check return value of devm_kasprintf()
hwrng: virtio - add an internal buffer
hwrng: virtio - don't wait on cleanup
hwrng: virtio - don't waste entropy
hwrng: virtio - always add a pending request
hwrng: virtio - Fix race on data_avail and actual data
crypto: nx - fix build warnings when DEBUG_FS is not enabled
modpost: fix section mismatch message for R_ARM_ABS32
modpost: fix section mismatch message for R_ARM_{PC24,CALL,JUMP24}
ARCv2: entry: comments about hardware auto-save on taken interrupts
ARCv2: entry: push out the Z flag unclobber from common EXCEPTION_PROLOGUE
ARCv2: entry: avoid a branch
ARCv2: entry: rewrite to enable use of double load/stores LDD/STD
ARC: define ASM_NL and __ALIGN(_STR) outside #ifdef __ASSEMBLY__ guard
USB: serial: option: add LARA-R6 01B PIDs
block: change all __u32 annotations to __be32 in affs_hardblocks.h
w1: fix loop in w1_fini()
sh: j2: Use ioremap() to translate device tree address into kernel memory
media: usb: Check az6007_read() return value
media: videodev2.h: Fix struct v4l2_input tuner index comment
media: usb: siano: Fix warning due to null work_func_t function pointer
extcon: Fix kernel doc of property fields to avoid warnings
extcon: Fix kernel doc of property capability fields to avoid warnings
usb: phy: phy-tahvo: fix memory leak in tahvo_usb_probe()
mfd: rt5033: Drop rt5033-battery sub-device
KVM: s390: fix KVM_S390_GET_CMMA_BITS for GFNs in memslot holes
mfd: intel-lpss: Add missing check for platform_get_resource
mfd: stmpe: Only disable the regulators if they are enabled
rtc: st-lpc: Release some resources in st_rtc_probe() in case of error
sctp: fix potential deadlock on &net->sctp.addr_wq_lock
Add MODULE_FIRMWARE() for FIRMWARE_TG357766.
spi: bcm-qspi: return error if neither hif_mspi nor mspi is available
mailbox: ti-msgmgr: Fill non-message tx data fields with 0x0
f2fs: fix error path handling in truncate_dnode()
powerpc: allow PPC_EARLY_DEBUG_CPM only when SERIAL_CPM=y
net: bridge: keep ports without IFF_UNICAST_FLT in BR_PROMISC mode
tcp: annotate data races in __tcp_oow_rate_limited()
net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX
sh: dma: Fix DMA channel offset calculation
i2c: xiic: Defer xiic_wakeup() and __xiic_start_xfer() in xiic_process()
i2c: xiic: Don't try to handle more interrupt events after error
ALSA: jack: Fix mutex call in snd_jack_report()
NFSD: add encoding of op_recall flag for write delegation
mmc: core: disable TRIM on Kingston EMMC04G-M627
mmc: core: disable TRIM on Micron MTFC4GACAJCN-1M
bcache: Remove unnecessary NULL point check in node allocations
integrity: Fix possible multiple allocation in integrity_inode_get()
jffs2: reduce stack usage in jffs2_build_xattr_subsystem()
btrfs: fix race when deleting quota root from the dirty cow roots list
ARM: orion5x: fix d2net gpio initialization
spi: spi-fsl-spi: remove always-true conditional in fsl_spi_do_one_msg
spi: spi-fsl-spi: relax message sanity checking a little
spi: spi-fsl-spi: allow changing bits_per_word while CS is still active
netfilter: nf_tables: fix nat hook table deletion
netfilter: nf_tables: add rescheduling points during loop detection walks
netfilter: nftables: add helper function to set the base sequence number
netfilter: add helper function to set up the nfnetlink header and use it
netfilter: nf_tables: use net_generic infra for transaction data
netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE
netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain
netfilter: nf_tables: reject unbound anonymous set before commit phase
netfilter: nf_tables: unbind non-anonymous set if rule construction fails
netfilter: nf_tables: fix scheduling-while-atomic splat
netfilter: conntrack: Avoid nf_ct_helper_hash uses after free
netfilter: nf_tables: prevent OOB access in nft_byteorder_eval
net: lan743x: Don't sleep in atomic context
workqueue: clean up WORK_* constant types, clarify masking
net: mvneta: fix txq_map in case of txq_number==1
vrf: Increment Icmp6InMsgs on the original netdev
icmp6: Fix null-ptr-deref of ip6_null_entry->rt6i_idev in icmp6_dev().
udp6: fix udp6_ehashfn() typo
ntb: idt: Fix error handling in idt_pci_driver_init()
NTB: amd: Fix error handling in amd_ntb_pci_driver_init()
ntb: intel: Fix error handling in intel_ntb_pci_driver_init()
NTB: ntb_transport: fix possible memory leak while device_register() fails
NTB: ntb_tool: Add check for devm_kcalloc
ipv6/addrconf: fix a potential refcount underflow for idev
wifi: airo: avoid uninitialized warning in airo_get_rate()
net/sched: make psched_mtu() RTNL-less safe
pinctrl: amd: Fix mistake in handling clearing pins at startup
pinctrl: amd: Detect internal GPIO0 debounce handling
pinctrl: amd: Only use special debounce behavior for GPIO 0
tpm: tpm_vtpm_proxy: fix a race condition in /dev/vtpmx creation
net: bcmgenet: Ensure MDIO unregistration has clocks enabled
SUNRPC: Fix UAF in svc_tcp_listen_data_ready()
perf intel-pt: Fix CYC timestamps after standalone CBR
ext4: fix wrong unit use in ext4_mb_clear_bb
ext4: only update i_reserved_data_blocks on successful block allocation
jfs: jfs_dmap: Validate db_l2nbperpage while mounting
PCI/PM: Avoid putting EloPOS E2/S2/H2 PCIe Ports in D3cold
PCI: Add function 1 DMA alias quirk for Marvell 88SE9235
PCI: qcom: Disable write access to read only registers for IP v2.3.3
PCI: rockchip: Assert PCI Configuration Enable bit after probe
PCI: rockchip: Write PCI Device ID to correct register
PCI: rockchip: Add poll and timeout to wait for PHY PLLs to be locked
PCI: rockchip: Fix legacy IRQ generation for RK3399 PCIe endpoint core
PCI: rockchip: Use u32 variable to access 32-bit registers
misc: pci_endpoint_test: Free IRQs before removing the device
misc: pci_endpoint_test: Re-init completion for every test
md/raid0: add discard support for the 'original' layout
fs: dlm: return positive pid value for F_GETLK
serial: atmel: don't enable IRQs prematurely
hwrng: imx-rngc - fix the timeout for init and self check
ceph: don't let check_caps skip sending responses for revoke msgs
meson saradc: fix clock divider mask length
Revert "8250: add support for ASIX devices with a FIFO bug"
tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() in case of error
tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() when iterating clk
ring-buffer: Fix deadloop issue on reading trace_pipe
xtensa: ISS: fix call to split_if_spec
scsi: qla2xxx: Wait for io return on terminate rport
scsi: qla2xxx: Fix potential NULL pointer dereference
scsi: qla2xxx: Check valid rport returned by fc_bsg_to_rport()
scsi: qla2xxx: Pointer may be dereferenced
drm/atomic: Fix potential use-after-free in nonblocking commits
tracing/histograms: Add histograms to hist_vars if they have referenced variables
perf probe: Add test for regression introduced by switch to die_get_decl_file()
fuse: revalidate: don't invalidate if interrupted
can: bcm: Fix UAF in bcm_proc_show()
ext4: correct inline offset when handling xattrs in inode body
debugobjects: Recheck debug_objects_enabled before reporting
nbd: Add the maximum limit of allocated index in nbd_dev_add
md: fix data corruption for raid456 when reshape restart while grow up
md/raid10: prevent soft lockup while flush writes
posix-timers: Ensure timer ID search-loop limit is valid
sched/fair: Don't balance task to its current running CPU
bpf: Address KCSAN report on bpf_lru_list
wifi: wext-core: Fix -Wstringop-overflow warning in ioctl_standard_iw_point()
wifi: iwlwifi: mvm: avoid baid size integer overflow
igb: Fix igb_down hung on surprise removal
spi: bcm63xx: fix max prepend length
fbdev: imxfb: warn about invalid left/right margin
pinctrl: amd: Use amd_pinconf_set() for all config options
net: ethernet: ti: cpsw_ale: Fix cpsw_ale_get_field()/cpsw_ale_set_field()
net:ipv6: check return value of pskb_trim()
Revert "tcp: avoid the lookup process failing to get sk in ehash table"
fbdev: au1200fb: Fix missing IRQ check in au1200fb_drv_probe
llc: Don't drop packet from non-root netns.
netfilter: nf_tables: fix spurious set element insertion failure
netfilter: nf_tables: can't schedule in nft_chain_validate
net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX
tcp: annotate data-races around tp->linger2
tcp: annotate data-races around rskq_defer_accept
tcp: annotate data-races around tp->notsent_lowat
tcp: annotate data-races around fastopenq.max_qlen
tracing/histograms: Return an error if we fail to add histogram to hist_vars list
gpio: tps68470: Make tps68470_gpio_output() always set the initial value
bcache: use MAX_CACHES_PER_SET instead of magic number 8 in __bch_bucket_alloc_set
bcache: remove 'int n' from parameter list of bch_bucket_alloc_set()
bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent
btrfs: fix extent buffer leak after tree mod log failure at split_node()
ext4: rename journal_dev to s_journal_dev inside ext4_sb_info
ext4: Fix reusing stale buffer heads from last failed mounting
PCI: Rework pcie_retrain_link() wait loop
PCI/ASPM: Return 0 or -ETIMEDOUT from pcie_retrain_link()
PCI/ASPM: Factor out pcie_wait_for_retrain()
PCI/ASPM: Avoid link retraining race
dlm: cleanup plock_op vs plock_xop
dlm: rearrange async condition return
fs: dlm: interrupt posix locks only when process is killed
ftrace: Add information on number of page groups allocated
ftrace: Check if pages were allocated before calling free_pages()
ftrace: Store the order of pages allocated in ftrace_page
ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
scsi: qla2xxx: Fix inconsistent format argument type in qla_os.c
scsi: qla2xxx: Array index may go out of bound
ext4: fix to check return value of freeze_bdev() in ext4_shutdown()
i40e: Fix an NULL vs IS_ERR() bug for debugfs_create_dir()
phy: hisilicon: Fix an out of bounds check in hisi_inno_phy_probe()
ethernet: atheros: fix return value check in atl1e_tso_csum()
ipv6 addrconf: fix bug where deleting a mngtmpaddr can create a new temporary address
tcp: Reduce chance of collisions in inet6_hashfn().
bonding: reset bond's flags when down link is P2P device
team: reset team's flags when down link is P2P device
platform/x86: msi-laptop: Fix rfkill out-of-sync on MSI Wind U100
net/sched: mqprio: refactor nlattr parsing to a separate function
net/sched: mqprio: add extack to mqprio_parse_nlattr()
net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64
benet: fix return value check in be_lancer_xmit_workarounds()
RDMA/mlx4: Make check for invalid flags stricter
drm/msm: Fix IS_ERR_OR_NULL() vs NULL check in a5xx_submit_in_rb()
ASoC: fsl_spdif: Silence output on stop
block: Fix a source code comment in include/uapi/linux/blkzoned.h
dm raid: fix missing reconfig_mutex unlock in raid_ctr() error paths
ata: pata_ns87415: mark ns87560_tf_read static
ring-buffer: Fix wrong stat of cpu_buffer->read
tracing: Fix warning in trace_buffered_event_disable()
USB: serial: option: support Quectel EM060K_128
USB: serial: option: add Quectel EC200A module support
USB: serial: simple: add Kaufmann RKS+CAN VCP
USB: serial: simple: sort driver entries
can: gs_usb: gs_can_close(): add missing set of CAN state to CAN_STATE_STOPPED
Revert "usb: dwc3: core: Enable AutoRetry feature in the controller"
usb: dwc3: pci: skip BYT GPIO lookup table for hardwired phy
usb: dwc3: don't reset device side if dwc3 was configured as host-only
usb: ohci-at91: Fix the unhandle interrupt when resume
USB: quirks: add quirk for Focusrite Scarlett
usb: xhci-mtk: set the dma max_seg_size
Documentation: security-bugs.rst: update preferences when dealing with the linux-distros group
Documentation: security-bugs.rst: clarify CVE handling
staging: ks7010: potential buffer overflow in ks_wlan_set_encode_ext()
hwmon: (nct7802) Fix for temp6 (PECI1) processed even if PECI1 disabled
btrfs: check for commit error at btrfs_attach_transaction_barrier()
tpm_tis: Explicitly check for error code
irq-bcm6345-l1: Do not assume a fixed block to cpu mapping
serial: 8250_dw: split Synopsys DesignWare 8250 common functions
serial: 8250_dw: Preserve original value of DLF register
virtio-net: fix race between set queues and probe
s390/dasd: fix hanging device after quiesce/resume
ASoC: wm8904: Fill the cache for WM8904_ADC_TEST_0 register
dm cache policy smq: ensure IO doesn't prevent cleaner policy progress
drm/client: Fix memory leak in drm_client_target_cloned
net/sched: cls_fw: Fix improper refcount update leads to use-after-free
net/sched: sch_qfq: account for stab overhead in qfq_enqueue
ASoC: cs42l51: fix driver to properly autoload with automatic module loading
net/sched: cls_u32: Fix reference counter leak leading to overflow
perf: Fix function pointer case
loop: Select I/O scheduler 'none' from inside add_disk()
word-at-a-time: use the same return type for has_zero regardless of endianness
KVM: s390: fix sthyi error handling
net/mlx5e: fix return value check in mlx5e_ipsec_remove_trailer()
perf test uprobe_from_different_cu: Skip if there is no gcc
net: sched: cls_u32: Fix match key mis-addressing
net: add missing data-race annotations around sk->sk_peek_off
net: add missing data-race annotation for sk_ll_usec
net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free
net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free
ip6mr: Fix skb_under_panic in ip6mr_cache_report()
tcp_metrics: fix addr_same() helper
tcp_metrics: annotate data-races around tm->tcpm_stamp
tcp_metrics: annotate data-races around tm->tcpm_lock
tcp_metrics: annotate data-races around tm->tcpm_vals[]
tcp_metrics: annotate data-races around tm->tcpm_net
tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
scsi: zfcp: Defer fc_rport blocking until after ADISC response
libceph: fix potential hang in ceph_osdc_notify()
USB: zaurus: Add ID for A-300/B-500/C-700
fs/sysv: Null check to prevent null-ptr-deref bug
Bluetooth: L2CAP: Fix use-after-free in l2cap_sock_ready_cb
net: usbnet: Fix WARNING in usbnet_start_xmit/usb_submit_urb
ext2: Drop fragment support
test_firmware: fix a memory leak with reqs buffer
test_firmware: return ENOMEM instead of ENOSPC on failed memory allocation
mtd: rawnand: omap_elm: Fix incorrect type in assignment
powerpc/mm/altmap: Fix altmap boundary check
PM / wakeirq: support enabling wake-up irq after runtime_suspend called
PM: sleep: wakeirq: fix wake irq arming
ARM: dts: imx6sll: Make ssi node name same as other platforms
ARM: dts: imx: add usb alias
ARM: dts: imx6sll: fixup of operating points
ARM: dts: nxp/imx6sll: fix wrong property name in usbphy node
drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions
arm64: dts: stratix10: fix incorrect I2C property for SCL signal
drm/edid: fix objtool warning in drm_cvt_modes()
Linux 4.19.291
Change-Id: I4f78e25efd18415989ecf5e227a17e05b0d6386c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 3f649ab728cda8038259d8f14492fe400fbab911 upstream.
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b3b33d3c43bbe0177d70653f4e889c78cc37f097 upstream.
Variable populated, which is a member of struct pcpu_chunk, is used as a
unit of size of unsigned long.
However, size of populated is miscounted. So, I fix this minor part.
Fixes: 8ab16c43ea ("percpu: change the number of pages marked in the first_chunk pop bitmap")
Cc: <stable@vger.kernel.org> # 4.14+
Signed-off-by: Sunghyun Jin <mcsmonk@gmail.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
These symbols are used by these modules. Export them so that these
modules can be loaded.
Bug: 149258398
Test: compile
Change-Id: Ie29c2811c07944b5a01dbc47b7a60dd56fe2aa5b
Signed-off-by: Saravana Kannan <saravanak@google.com>
(cherry picked from commit 9d5b6d29f017e1ae5df2dac53694382b24393f5f)
Signed-off-by: Will McVicker <willmcvicker@google.com>
[ Upstream commit 8c43004af01635cc9fbb11031d070e5e0d327ef2 ]
pcpu_find_block_fit() guarantees that a fit is found within
PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as
it compares against the block's contig_hint. This can lead to
incorrectly scanning past the end of the bitmap. The behavior was okay
given the check after for bit_off >= end and the correctness of the
hints from pcpu_find_block_fit().
This patch fixes this by bounding the end offset by the number of bits
in a chunk.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 198790d9a3aeaef5792d33a560020861126edc22 ]
In free_percpu() we sometimes call pcpu_schedule_balance_work() to
queue a work item (which does a wakeup) while holding pcpu_lock.
This creates an unnecessary lock dependency between pcpu_lock and
the scheduler's pi_lock. There are other places where we call
pcpu_schedule_balance_work() without hold pcpu_lock, and this case
doesn't need to be different.
Moving the call outside the lock prevents the following lockdep splat
when running tools/testing/selftests/bpf/{test_maps,test_progs} in
sequence with lockdep enabled:
======================================================
WARNING: possible circular locking dependency detected
5.1.0-dbg-DEV #1 Not tainted
------------------------------------------------------
kworker/23:255/18872 is trying to acquire lock:
000000000bc79290 (&(&pool->lock)->rlock){-.-.}, at: __queue_work+0xb2/0x520
but task is already holding lock:
00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (pcpu_lock){..-.}:
lock_acquire+0x9e/0x180
_raw_spin_lock_irqsave+0x3a/0x50
pcpu_alloc+0xfa/0x780
__alloc_percpu_gfp+0x12/0x20
alloc_htab_elem+0x184/0x2b0
__htab_percpu_map_update_elem+0x252/0x290
bpf_percpu_hash_update+0x7c/0x130
__do_sys_bpf+0x1912/0x1be0
__x64_sys_bpf+0x1a/0x20
do_syscall_64+0x59/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #3 (&htab->buckets[i].lock){....}:
lock_acquire+0x9e/0x180
_raw_spin_lock_irqsave+0x3a/0x50
htab_map_update_elem+0x1af/0x3a0
-> #2 (&rq->lock){-.-.}:
lock_acquire+0x9e/0x180
_raw_spin_lock+0x2f/0x40
task_fork_fair+0x37/0x160
sched_fork+0x211/0x310
copy_process.part.43+0x7b1/0x2160
_do_fork+0xda/0x6b0
kernel_thread+0x29/0x30
rest_init+0x22/0x260
arch_call_rest_init+0xe/0x10
start_kernel+0x4fd/0x520
x86_64_start_reservations+0x24/0x26
x86_64_start_kernel+0x6f/0x72
secondary_startup_64+0xa4/0xb0
-> #1 (&p->pi_lock){-.-.}:
lock_acquire+0x9e/0x180
_raw_spin_lock_irqsave+0x3a/0x50
try_to_wake_up+0x41/0x600
wake_up_process+0x15/0x20
create_worker+0x16b/0x1e0
workqueue_init+0x279/0x2ee
kernel_init_freeable+0xf7/0x288
kernel_init+0xf/0x180
ret_from_fork+0x24/0x30
-> #0 (&(&pool->lock)->rlock){-.-.}:
__lock_acquire+0x101f/0x12a0
lock_acquire+0x9e/0x180
_raw_spin_lock+0x2f/0x40
__queue_work+0xb2/0x520
queue_work_on+0x38/0x80
free_percpu+0x221/0x260
pcpu_freelist_destroy+0x11/0x20
stack_map_free+0x2a/0x40
bpf_map_free_deferred+0x3c/0x50
process_one_work+0x1f7/0x580
worker_thread+0x54/0x410
kthread+0x10f/0x150
ret_from_fork+0x24/0x30
other info that might help us debug this:
Chain exists of:
&(&pool->lock)->rlock --> &htab->buckets[i].lock --> pcpu_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(pcpu_lock);
lock(&htab->buckets[i].lock);
lock(pcpu_lock);
lock(&(&pool->lock)->rlock);
*** DEADLOCK ***
3 locks held by kworker/23:255/18872:
#0: 00000000b36a6e16 ((wq_completion)events){+.+.},
at: process_one_work+0x17a/0x580
#1: 00000000dfd966f0 ((work_completion)(&map->work)){+.+.},
at: process_one_work+0x17a/0x580
#2: 00000000e3e7a6aa (pcpu_lock){..-.},
at: free_percpu+0x36/0x260
stack backtrace:
CPU: 23 PID: 18872 Comm: kworker/23:255 Not tainted 5.1.0-dbg-DEV #1
Hardware name: ...
Workqueue: events bpf_map_free_deferred
Call Trace:
dump_stack+0x67/0x95
print_circular_bug.isra.38+0x1c6/0x220
check_prev_add.constprop.50+0x9f6/0xd20
__lock_acquire+0x101f/0x12a0
lock_acquire+0x9e/0x180
_raw_spin_lock+0x2f/0x40
__queue_work+0xb2/0x520
queue_work_on+0x38/0x80
free_percpu+0x221/0x260
pcpu_freelist_destroy+0x11/0x20
stack_map_free+0x2a/0x40
bpf_map_free_deferred+0x3c/0x50
process_one_work+0x1f7/0x580
worker_thread+0x54/0x410
kthread+0x10f/0x150
ret_from_fork+0x24/0x30
Signed-off-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 00206a69ee32f03e6f40837684dcbe475ea02266 upstream.
Since commit ad67b74d24 ("printk: hash addresses printed with %p"),
at boot "____ptrval____" is printed instead of actual addresses:
percpu: Embedded 38 pages/cpu @(____ptrval____) s124376 r0 d31272 u524288
Instead of changing the print to "%px", and leaking kernel addresses,
just remove the print completely, cfr. e.g. commit 071929dbdd
("arm64: Stop printing the virtual memory layout").
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The commit ca460b3c96 ("percpu: introduce bitmap metadata blocks")
introduced bitmap metadata blocks. These metadata blocks are allocated
whenever a new chunk is created, but they are never freed. Fix it.
Fixes: ca460b3c96 ("percpu: introduce bitmap metadata blocks")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Currently, percpu memory only exposes allocation and utilization
information via debugfs. This more or less is only really useful for
understanding the fragmentation and allocation information at a per-chunk
level with a few global counters. This is also gated behind a config.
BPF and cgroup, for example, have seen an increase in use causing
increased use of percpu memory. Let's make it easier for someone to
identify how much memory is being used.
This patch adds the "Percpu" stat to meminfo to more easily look up how
much percpu memory is in use. This number includes the cost for all
allocated backing pages and not just insight at the per a unit, per chunk
level. Metadata is excluded. I think excluding metadata is fair because
the backing memory scales with the numbere of cpus and can quickly
outweigh the metadata. It also makes this calculation light.
Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the entire architecture code for blackfin, cris, frv, m32r,
metag, mn10300, score, and tile, including the associated device drivers.
I have been working with the (former) maintainers for each one to ensure
that my interpretation was right and the code is definitely unused in
mainline kernels. Many had fond memories of working on the respective
ports to start with and getting them included in upstream, but also saw
no point in keeping the port alive without any users.
In the end, it seems that while the eight architectures are extremely
different, they all suffered the same fate: There was one company
in charge of an SoC line, a CPU microarchitecture and a software
ecosystem, which was more costly than licensing newer off-the-shelf
CPU cores from a third party (typically ARM, MIPS, or RISC-V). It seems
that all the SoC product lines are still around, but have not used the
custom CPU architectures for several years at this point. In contrast,
CPU instruction sets that remain popular and have actively maintained
kernel ports tend to all be used across multiple licensees.
The removal came out of a discussion that is now documented at
https://lwn.net/Articles/748074/. Unlike the original plans, I'm not
marking any ports as deprecated but remove them all at once after I made
sure that they are all unused. Some architectures (notably tile, mn10300,
and blackfin) are still being shipped in products with old kernels,
but those products will never be updated to newer kernel releases.
After this series, we still have a few architectures without mainline
gcc support:
- unicore32 and hexagon both have very outdated gcc releases, but the
maintainers promised to work on providing something newer. At least
in case of hexagon, this will only be llvm, not gcc.
- openrisc, risc-v and nds32 are still in the process of finishing their
support or getting it added to mainline gcc in the first place.
They all have patched gcc-7.3 ports that work to some degree, but
complete upstream support won't happen before gcc-8.1. Csky posted
their first kernel patch set last week, their situation will be similar.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJawdL2AAoJEGCrR//JCVInuH0P/RJAZh1nTD+TR34ZhJq2TBoo
PgygwDU7Z2+tQVU+EZ453Gywz9/NMRFk1RWAZqrLix4ZtyIMvC6A1qfT2yH1Y7Fb
Qh6tccQeLe4ezq5u4S/46R/fQXu3Txr92yVwzJJUuPyU0arF9rv5MmI8e6p7L1en
yb74kSEaCe+/eMlsEj1Cc1dgthDNXGKIURHkRsILoweysCpesjiTg4qDcL+yTibV
FP2wjVbniKESMKS6qL71tiT5sexvLsLwMNcGiHPj94qCIQuI7DLhLdBVsL5Su6gI
sbtgv0dsq4auRYAbQdMaH1hFvu6WptsuttIbOMnz2Yegi2z28H8uVXkbk2WVLbqG
ZESUwutGh8MzOL2RJ4jyyQq5sfo++CRGlfKjr6ImZRv03dv0pe/W85062cK5cKNs
cgDDJjGRorOXW7dyU6jG2gRqODOQBObIv3w5efdq5OgzOWlbI4EC+Y5u1Z0JF/76
pSwtGXA6YhwC+9LLAlnVTHG+yOwuLmAICgoKcTbzTVDKA2YQZG/cYuQfI5S1wD8e
X6urPx3Md2GCwLXQ9mzKBzKZUpu/Tuhx0NvwF4qVxy6x1PELjn68zuP7abDHr46r
57/09ooVN+iXXnEGMtQVS/OPvYHSa2NgTSZz6Y86lCRbZmUOOlK31RDNlMvYNA+s
3iIVHovno/JuJnTOE8LY
=fQ8z
-----END PGP SIGNATURE-----
Merge tag 'arch-removal' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pul removal of obsolete architecture ports from Arnd Bergmann:
"This removes the entire architecture code for blackfin, cris, frv,
m32r, metag, mn10300, score, and tile, including the associated device
drivers.
I have been working with the (former) maintainers for each one to
ensure that my interpretation was right and the code is definitely
unused in mainline kernels. Many had fond memories of working on the
respective ports to start with and getting them included in upstream,
but also saw no point in keeping the port alive without any users.
In the end, it seems that while the eight architectures are extremely
different, they all suffered the same fate: There was one company in
charge of an SoC line, a CPU microarchitecture and a software
ecosystem, which was more costly than licensing newer off-the-shelf
CPU cores from a third party (typically ARM, MIPS, or RISC-V). It
seems that all the SoC product lines are still around, but have not
used the custom CPU architectures for several years at this point. In
contrast, CPU instruction sets that remain popular and have actively
maintained kernel ports tend to all be used across multiple licensees.
[ See the new nds32 port merged in the previous commit for the next
generation of "one company in charge of an SoC line, a CPU
microarchitecture and a software ecosystem" - Linus ]
The removal came out of a discussion that is now documented at
https://lwn.net/Articles/748074/. Unlike the original plans, I'm not
marking any ports as deprecated but remove them all at once after I
made sure that they are all unused. Some architectures (notably tile,
mn10300, and blackfin) are still being shipped in products with old
kernels, but those products will never be updated to newer kernel
releases.
After this series, we still have a few architectures without mainline
gcc support:
- unicore32 and hexagon both have very outdated gcc releases, but the
maintainers promised to work on providing something newer. At least
in case of hexagon, this will only be llvm, not gcc.
- openrisc, risc-v and nds32 are still in the process of finishing
their support or getting it added to mainline gcc in the first
place. They all have patched gcc-7.3 ports that work to some
degree, but complete upstream support won't happen before gcc-8.1.
Csky posted their first kernel patch set last week, their situation
will be similar
[ Palmer Dabbelt points out that RISC-V support is in mainline gcc
since gcc-7, although gcc-7.3.0 is the recommended minimum - Linus ]"
This really says it all:
2498 files changed, 95 insertions(+), 467668 deletions(-)
* tag 'arch-removal' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (74 commits)
MAINTAINERS: UNICORE32: Change email account
staging: iio: remove iio-trig-bfin-timer driver
tty: hvc: remove tile driver
tty: remove bfin_jtag_comm and hvc_bfin_jtag drivers
serial: remove tile uart driver
serial: remove m32r_sio driver
serial: remove blackfin drivers
serial: remove cris/etrax uart drivers
usb: Remove Blackfin references in USB support
usb: isp1362: remove blackfin arch glue
usb: musb: remove blackfin port
usb: host: remove tilegx platform glue
pwm: remove pwm-bfin driver
i2c: remove bfin-twi driver
spi: remove blackfin related host drivers
watchdog: remove bfin_wdt driver
can: remove bfin_can driver
mmc: remove bfin_sdh driver
input: misc: remove blackfin rotary driver
input: keyboard: remove bf54x driver
...
A lot of Kconfig symbols have architecture specific dependencies.
In those cases that depend on architectures we have already removed,
they can be omitted.
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
In case of memory deficit and low percpu memory pages,
pcpu_balance_workfn() takes pcpu_alloc_mutex for a long
time (as it makes memory allocations itself and waits
for memory reclaim). If tasks doing pcpu_alloc() are
choosen by OOM killer, they can't exit, because they
are waiting for the mutex.
The patch makes pcpu_alloc() to care about killing signal
and use mutex_lock_killable(), when it's allowed by GFP
flags. This guarantees, a task does not miss SIGKILL
from OOM killer.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
microblaze build broke due to missing declaration of the
cond_resched() invocation added recently. Let's include linux/sched.h
explicitly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
When a large BPF percpu map is destroyed, I have seen
pcpu_balance_workfn() holding cpu for hundreds of milliseconds.
On KASAN config and 112 hyperthreads, average time to destroy a chunk
is ~4 ms.
[ 2489.841376] destroy chunk 1 in 4148689 ns
...
[ 2490.093428] destroy chunk 32 in 4072718 ns
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The prior patch added support for passing gfp flags through to the
underlying allocators. This patch allows users to pass along gfp flags
(currently only __GFP_NORETRY and __GFP_NOWARN) to the underlying
allocators. This should allow users to decide if they are ok with
failing allocations recovering in a more graceful way.
Additionally, gfp passing was done as additional flags in the previous
patch. Instead, change this to caller passed semantics. GFP_KERNEL is
also removed as the default flag. It continues to be used for internally
caused underlying percpu allocations.
V2:
Removed gfp_percpu_mask in favor of doing it inline.
Removed GFP_KERNEL as a default flag for __alloc_percpu_gfp.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Percpu memory using the vmalloc area based chunk allocator lazily
populates chunks by first requesting the full virtual address space
required for the chunk and subsequently adding pages as allocations come
through. To ensure atomic allocations can succeed, a workqueue item is
used to maintain a minimum number of empty pages. In certain scenarios,
such as reported in [1], it is possible that physical memory becomes
quite scarce which can result in either a rather long time spent trying
to find free pages or worse, a kernel panic.
This patch adds support for __GFP_NORETRY and __GFP_NOWARN passing them
through to the underlying allocators. This should prevent any
unnecessary panics potentially caused by the workqueue item. The passing
of gfp around is as additional flags rather than a full set of flags.
The next patch will change these to caller passed semantics.
V2:
Added const modifier to gfp flags in the balance path.
Removed an extra whitespace.
[1] https://lkml.org/lkml/2018/2/12/551
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
At some point the function declaration parameters got out of sync with
the function definitions in percpu-vm.c and percpu-km.c. This patch
makes them match again.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 438a506180 ("percpu: don't forget to free the temporary struct
pcpu_alloc_info") uncovered a problem on the CRIS architecture where
the bootmem allocator is initialized with virtual addresses. Given it
has:
#define __va(x) ((void *)((unsigned long)(x) | 0x80000000))
then things just work out because the end result is the same whether you
give this a physical or a virtual address.
Untill you call memblock_free_early(__pa(address)) that is, because
values from __pa() don't match with the virtual addresses stuffed in the
bootmem allocator anymore.
Avoid freeing the temporary pcpu_alloc_info memory on that architecture
until they fix things up to let the kernel boot like it did before.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 438a506180 ("percpu: don't forget to free the temporary struct pcpu_alloc_info")
Pull percpu update from Tejun Heo:
"Another minor pull request. It only contains one commit which can
reclaim a bit of memory wasted during boot on UP"
* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: don't forget to free the temporary struct pcpu_alloc_info
Add an option for pcpu_alloc() to support __GFP_NOWARN flag.
Currently, we always throw a warning when size or alignment
is unsupported (and also dump stack on failed allocation
requests). The warning itself is harmless since we return
NULL anyway for any failed request, which callers are
required to handle anyway. However, it becomes harmful when
panic_on_warn is set.
The rationale for the WARN() in pcpu_alloc() is that it can
be tracked when larger than supported allocation requests are
made such that allocations limits can be tweaked if warranted.
This makes sense for in-kernel users, however, there are users
of pcpu allocator where allocation size is derived from user
space requests, e.g. when creating BPF maps. In these cases,
the requests should fail gracefully without throwing a splat.
The current work-around was to check allocation size against
the upper limit of PCPU_MIN_UNIT_SIZE from call-sites for
bailing out prior to a call to pcpu_alloc() in order to
avoid throwing the WARN(). This is bad in multiple ways since
PCPU_MIN_UNIT_SIZE is an implementation detail, and having
the checks on call-sites only complicates the code for no
good reason. Thus, lets fix it generically by supporting the
__GFP_NOWARN flag that users can then use with calling the
__alloc_percpu_gfp() helper instead.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike the SMP case, the !SMP case does not free the memory for struct
pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a
chance of being reused by the page allocator later, align it to a page
boundary just like its size.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The iterator functions pcpu_next_md_free_region and
pcpu_next_fit_region use the block offset to determine if they have
checked the area in the prior iteration. However, this causes an issue
when the block offset is greater than subsequent block contig hints. If
within the iterator it moves to check subsequent blocks, it may fail in
the second predicate due to the block offset not being cleared. Thus,
this causes the allocator to skip over blocks leading to false failures
when allocating from the reserved chunk. While this happens in the
general case as well, it will only fail if it cannot allocate a new
chunk.
This patch resets the block offset to 0 to pass the second predicate
when checking subseqent blocks within the iterator function.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reported-and-tested-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The other patches contain a lot of information, so adding this
information in a separate patch. It adds my copyright and a brief
explanation of how the bitmap allocator works. There is a minor typo as
well in the prior explanation so that is fixed.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The simple, and expensive, way to find a free area is to iterate over
the entire bitmap until an area is found that fits the allocation size
and alignment. This patch makes use of an iterate that find an area to
check by using the block level contig hints. It will only return an area
that can fit the size and alignment request. If the request can fit
inside a block, it returns the first_free bit to start checking from to
see if it can be fulfilled prior to the contig hint. The pcpu_alloc_area
check has a bound of a block size added in case it is wrong.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The largest free region will either be a block level contig hint or an
aggregate over the left_free and right_free areas of blocks. This is a
much smaller set of free areas that need to be checked than a full
traverse.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The bitmap allocator must keep metadata consistent. The easiest way is
to scan after every allocation for each affected block and the entire
chunk. This is rather expensive.
The free path can take advantage of current contig hints to prevent
scanning within the start and end block. If a scan is needed, it can
be done by scanning backwards from the start and forwards from the end
to identify the entire free area this can be combined with. The blocks
can then be updated by some basic checks rather than complete block
scans.
A chunk scan happens when the freed area makes a page free, a block
free, or spans across blocks. This is necessary as the contig hint at
this point could span across blocks. The check uses the minimum of page
size and the block size to allow for variable sized blocks. There is a
tradeoff here with not updating after every free. It is possible a
contig hint in one block can be merged with the contig hint in the next
block. This means the contig hint can be off by up to a page. However,
if the chunk's contig hint is contained in one block, the contig hint
will be accurate.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Metadata is kept per block to keep track of where the contig hints are.
Scanning can be avoided when the contig hints are not broken. In that
case, left and right contigs have to be managed manually.
This patch changes the allocation path hint updating to only scan when
contig hints are broken.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch makes the contig hint starting offset optimization from the
previous patch as honest as it can be. For both chunk and block starting
offsets, make sure it keeps the starting offset with the best alignment.
The block skip optimization is added in a later patch when the
pcpu_find_block_fit iterator is swapped in.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch adds chunk->contig_bits_start to keep track of the contig
hint's offset and the check to skip the chunk if it does not fit. If
the chunk's contig hint starting offset cannot satisfy an allocation,
the allocator assumes there is enough memory pressure in this chunk to
either use a different chunk or create a new one. This accepts a less
tight packing for a smoother latency curve.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch adds first_bit to keep track of the first free bit in the
bitmap. This hint helps prevent scanning of fully allocated blocks.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch introduces the bitmap metadata blocks and adds the skeleton
of the code that will be used to maintain these blocks. Each chunk's
bitmap is made up of full metadata blocks. These blocks maintain basic
metadata to help prevent scanning unnecssarily to update hints. Full
scanning methods are used for the skeleton and will be replaced in the
coming patches. A number of helper functions are added as well to do
conversion of pages to blocks and manage offsets. Comments will be
updated as the final version of each function is added.
There exists a relationship between PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE,
the region size, and unit_size. Every chunk's region (including offsets)
is page aligned at the beginning to preserve alignment. The end is
aligned to LCM(PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE) to ensure that the end
can fit with the populated page map which is by page and every metadata
block is fully accounted for. The unit_size is already page aligned, but
must also be aligned with PCPU_BITMAP_BLOCK_SIZE to ensure full metadata
blocks.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The area map allocator only used a bitmap for the backing page state.
The new bitmap allocator will use bitmaps to manage the allocation
region in addition to this.
This patch generalizes the bitmap iterators so they can be reused with
the bitmap allocator.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch increases the minimum allocation size of percpu memory to
4-bytes. This change will help minimize the metadata overhead
associated with the bitmap allocator. The assumption is that most
allocations will be of objects or structs greater than 2 bytes with
integers or longs being used rather than shorts.
The first chunk regions are now aligned with the minimum allocation
size. The reserved region is expected to be set as a multiple of the
minimum allocation size. The static region is aligned up and the delta
is removed from the dynamic size. This works because the dynamic size is
increased to be page aligned. If the static size is not minimum
allocation size aligned, then there must be a gap that is added to the
dynamic size. The dynamic size will never be smaller than the set value.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
pcpu_nr_empty_pop_pages is used to ensure there are a handful of free
pages around to serve atomic allocations. A new field, nr_empty_pop_pages,
is added to the pcpu_chunk struct to keep track of the number of empty
pages. This field is needed as the number of empty populated pages is
globally tracked and deltas are used to update in the bitmap allocator.
Pages that contain a hidden area are not considered to be empty. This
new field is exposed in percpu_stats.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The populated bitmap represents the state of the pages the chunk serves.
Prior, the bitmap was marked completely used as the first chunk was
allocated and immutable. This is misleading because the first chunk may
not be completely filled. Additionally, with moving the base_addr up in
the previous patch, the population check no longer corresponds to what
was being checked.
This patch modifies the population map to be only the number of pages
the region serves and to make what it was checking correspond correctly
again. The change is to remove any misunderstanding between the size of
the populated bitmap and the actual size of it. The work function page
iterators now use nr_pages for the check rather than pcpu_unit_pages
because nr_populated is now chunk specific. Without this, the work
function would try to populate the remainder of these chunks despite it
not serving any more than nr_pages when nr_pages is set less than
pcpu_unit_pages.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The percpu address checks for the reserved and dynamic region chunks are
now specific to each region. The address checking logic can be combined
taking advantage of the global references to the dynamic and static
region chunks.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Originally, the first chunk was served by one or two chunks, each
given a region they are responsible for. Despite this, the arithmetic
was based off of the true base_addr of the chunk making it be overly
inclusive.
This patch moves the base_addr of chunks that are responsible for the
first chunk. The base_addr must remain page aligned to keep the
address alignment correct, so it is the beginning of the region served
page aligned down. start_offset holds where the region served begins
from this new base_addr.
The corresponding percpu address checks are modified to be more specific
as a result. The first chunk considers only the dynamic region and both
first chunk and reserved chunk checks ignore the static region. The
static region addresses should never be passed into the allocator. There
is no impact here besides distinguishing the first chunk and making the
checks specific.
The percpu pointer to physical address is left intact as addresses are
not given out in the non-allocated portion of percpu memory.
nr_pages is added to pcpu_chunk to keep track of the size of the entire
region served containing both start_offset and end_offset. This variable
will be used to manage the bitmap allocator.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There is no need to have the static chunk and dynamic chunk be named
separately as the allocations are sequential. This preemptively solves
the misnomer problem with the base_addrs being moved up in the following
patch. It also removes a ternary operation deciding the first chunk.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The area map allocator manages the first chunk area by hiding all but
the region it is responsible for serving in the area map. To align this
with the populated page bitmap, end_offset is introduced to keep track
of the delta to end page aligned. The area map is appended with the
page aligned end when necessary to be in line with how the bitmap
allocator requires the ending to be aligned with the LCM of PAGE_SIZE
and the size of each bitmap block. percpu_stats is updated to ignore
this region when present.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Create a common allocator for first chunk initialization,
pcpu_alloc_first_chunk. Comments for this function will be added in a
later patch once the bitmap allocator is added.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There is logic for setting variables in the static chunk init code that
could be consolidated with the dynamic chunk init code. This combines
this logic to setup for combining the allocation paths. reserved_size is
used as the conditional as a dynamic region will always exist.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Prior this variable was used to manage statistics when the first chunk
had a reserved region. The previous patch introduced start_offset to
keep track of the offset by value rather than boolean. Therefore,
has_reserved can be removed.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The reserved chunk arithmetic uses a global variable
pcpu_reserved_chunk_limit that is set in the first chunk init code to
hide a portion of the area map. The bitmap allocator to come will
eventually move the base_addr up and require both the reserved chunk
and static chunk to maintain this offset. pcpu_reserved_chunk_limit is
removed and start_offset is added.
The first chunk that is circulated and is pcpu_first_chunk serves the
dynamic region, the region following the reserved region. The reserved
chunk address check will temporarily use the first chunk to identify its
address range. A following patch will increase the base_addr and remove
this. If there is no reserved chunk, this will check the static region
and return false because those values should never be passed into the
allocator.
Lastly, when linking in the first chunk, make sure to count the right
free region for the number of empty populated pages.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The first chunk is handled as a special case as it is composed of the
static, reserved, and dynamic regions. The code handles each case
individually. The next several patches will merge these code paths and
lay the foundation for the bitmap allocator.
This patch modifies logic to enforce that a dynamic region exists and
changes the area map to account for that. This brings the logic closer
to the dynamic chunk's init logic.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The header comment for percpu memory is a little hard to parse and is
not super clear about how the first chunk is managed. This adds a
little more clarity to the situation.
There is also quite a bit of tricky logic in the pcpu_build_alloc_info.
This adds a restructure of a comment to add a little more information.
Unfortunately, you will still have to piece together a handful of other
comments too, but should help direct you to the meaningful comments.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Percpu memory holds a minimum threshold of pages that are populated
in order to serve atomic percpu memory requests. This change makes it
easier to verify that there are a minimum number of populated pages
lying around.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
From 4a42ecc735cff0015cc73c3d87edede631f4b885 Mon Sep 17 00:00:00 2001
From: Dennis Zhou <dennisz@fb.com>
Date: Wed, 21 Jun 2017 08:07:15 -0700
Add error message to out of space failure for atomic allocations in
percpu allocation path to fix -Wmaybe-uninitialized.
Signed-off-by: Dennis Zhou <dennisz@fb.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add support for tracepoints to the following events: chunk allocation,
chunk free, area allocation, area free, and area allocation failure.
This should let us replay percpu memory requests and evaluate
corresponding decisions.
Signed-off-by: Dennis Zhou <dennisz@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>