Symptoms

1. After upgrading PVC from 4.0/4.6 to 4.7, load average (LA) is much higher than before the upgrade.
2. PVC 4.7 shows an enormous LA (1k+) while the disk subsystem is almost idle.
Additional symptoms:

1. Changing the elevator of a block device to deadline helps to reduce LA values:
~# echo deadline > /sys/block/sda/queue/scheduler
2. With CFQ, values in queued_avg are much greater than those in in_driver_avg for long periods (files in the /sys/block/*/queue/iosched/ directories).
3. Most of the container and node processes in D-state have a wchan (the fifth column in the output of the command below) of sync_buffer, sync_page, get_request_wait, or log_wait_commit:
~# vzps axww -eL -o veid,ppid,pid,tid,wchan:20,rsz,vsz,state,cmd | awk '$8~/[DR]/'
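The CFQ statistics from symptom 2 can be compared per device with a short helper. This is a sketch: the function name print_cfq_idle_stats and the sysfs-root parameter are introduced here for illustration (on a real node, pass /sys); whether the queued_avg/in_driver_avg files are present depends on the PVC kernel build and on CFQ being the active elevator.

```shell
#!/bin/sh
# Sketch: print queued_avg vs in_driver_avg for each device running CFQ.
# Takes the sysfs root as an argument so it can be exercised against a
# scratch directory; on a real node call it as: print_cfq_idle_stats /sys
print_cfq_idle_stats() {
    sysfs_root=$1
    for d in "$sysfs_root"/block/*/queue/iosched; do
        # skip devices without CFQ stats files
        [ -r "$d/queued_avg" ] || continue
        # derive the device name (…/block/<dev>/queue/iosched)
        dev=$(basename "$(dirname "$(dirname "$d")")")
        echo "$dev queued_avg=$(cat "$d/queued_avg") in_driver_avg=$(cat "$d/in_driver_avg")"
    done
    return 0
}
```

A consistently large queued_avg with a small in_driver_avg is the pattern described in symptom 2.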
Resolution

Add a command that sets slice_idle to 0 for every block device to a start script:
# for f in /sys/block/*/queue/iosched/slice_idle; do echo 0 > $f; done
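To apply the setting from a start script (for example, by calling it from /etc/rc.local), the loop above can be wrapped in a small function. This is a sketch: the function name set_slice_idle_zero and the sysfs-root parameter are introduced here for illustration and testability; on a real node, pass /sys.

```shell
#!/bin/sh
# Sketch: disable CFQ per-task queue idling for every block device.
# Takes the sysfs root as an argument; on a real node: set_slice_idle_zero /sys
set_slice_idle_zero() {
    sysfs_root=$1
    for f in "$sysfs_root"/block/*/queue/iosched/slice_idle; do
        # only write to files that exist and are writable (the iosched
        # directory is present only while CFQ is the active elevator)
        [ -w "$f" ] && echo 0 > "$f"
    done
    return 0
}
```

Note that the setting is not persistent by itself: sysfs values reset on reboot, which is why the article suggests a start script.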
Cause

With the switch to kernel 2.6.32, PVC uses the kernel's native functionality to manage disk access fairness for separate processes and groups. Limits and balancing for containers are implemented with the group-wise scheduling of the CFQ I/O scheduler.
Compared with the 2.6.18 kernels of PVC 4.0/4.6, the 2.6.32 kernel in PVC 4.7 has two-layer scheduling: in addition to per-process scheduling, there are control groups and the related group-based disk access management.
For hardware/software RAID storage arrays assembled from 6 or more hard drives (and for iSCSI LUNs backed by multiple disks), it may be worth disabling per-task queue idling in CFQ.
The corresponding part of the kernel documentation describes this in detail:
This specifies how long CFQ should idle for next request on certain cfq queues (for sequential workloads) and service trees (for random workloads) before queue is expired and CFQ selects next queue to dispatch from.
By default slice_idle is a non-zero value. That means by default we idle on queues/service trees. This can be very helpful on highly seeky media like single spindle SATA/SAS disks where we can cut down on overall number of seeks and see improved throughput.
Setting slice_idle to 0 will remove all the idling on queues/service tree level and one should see an overall improved throughput on faster storage devices like multiple SATA/SAS disks in hardware RAID configuration. The down side is that isolation provided from WRITES also goes down and notion of IO priority becomes weaker.
So depending on storage and workload, it might be useful to set slice_idle=0. In general I think for SATA/SAS disks and software RAID of SATA/SAS disks keeping slice_idle enabled should be useful. For any configurations where there are multiple spindles behind single LUN (Host based hardware RAID controller or for storage arrays), setting slice_idle=0 might end up in better throughput and acceptable latencies.
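Before deciding, the relevant per-device settings can be summarized with a small helper. This is a sketch: the function name show_disk_tuning and the sysfs-root parameter are introduced here for illustration (on a real node, pass /sys); /sys/block/*/queue/rotational is available on 2.6.32 kernels, and the iosched directory exists only while CFQ is the active elevator.

```shell
#!/bin/sh
# Sketch: show active scheduler, rotational flag and current slice_idle
# per block device. Takes the sysfs root as an argument; on a real node:
# show_disk_tuning /sys
show_disk_tuning() {
    sysfs_root=$1
    for q in "$sysfs_root"/block/*/queue; do
        dev=$(basename "$(dirname "$q")")
        sched=$(cat "$q/scheduler" 2>/dev/null)
        rot=$(cat "$q/rotational" 2>/dev/null)
        idle=$(cat "$q/iosched/slice_idle" 2>/dev/null)
        echo "$dev scheduler=$sched rotational=$rot slice_idle=$idle"
    done
    return 0
}
```

Per the kernel documentation quoted above, slice_idle=0 is a candidate for devices where one LUN hides multiple spindles or for non-rotational media.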
FAQ

Q: Does this option affect the I/O priorities and disk throughput limitations of Virtuozzo?
A: No, priorities and limitations continue to work, since the scheduler is group-aware and containers are configured through groups. Group idling keeps working with task idling disabled.
Q: Is it applicable to Virtuozzo 4.0/4.6?
A: No; in the 2.6.18 kernels of PVC 4.0/4.6, disabling idling effectively turns CFQ into the deadline elevator.
Q: Why isn't this option set to 0 by default?
A: This option is not useful when the data is located on a rotational device with a single spindle, i.e. on a single hard drive; there, setting slice_idle=0 decreases performance and increases the load on the drive mechanics. It can be suggested, and may bring a performance improvement, only when read operations can be reordered or executed by different units in parallel (as in RAID5/6 and RAID10 installations with 4-8 or more drives, or on non-rotational media like SSDs).