University of Kaiserslautern | Computer Science | Distributed Systems | Research | Squirrel
Next: Related Work | Up: Flexible Scheduling for Linux | Previous: syncopation

Results

Scheduler Overhead | Proportional-Share Scheduling | Bandwidth Reservation | Transfer of Allocated Capacities | Isolation from Best-Effort Load | Multimedia Usability | Conclusion


Having presented the design of syncopation in the previous chapter, the actual performance of the modified kernel is examined in this chapter. The presentation begins with an attempt to describe what penalty has to be expected when replacing the original scheduler to enable syncopation's features. Afterwards, the scheduler will be measured with respect to the distribution of processing time to process with allocated shares, in a fixed configuration as well as with dynamic modifications produced by a ticket transfer in a client/server scenario. Finally, the usability of the reservation mechanism will be tested with a common video player to see if and what modifications need to be done to such a player to successfully employ the mechanism.

Scheduler Overhead

When serving best-effort processes in syncopation, an increased overhead for the scheduler in comparison with the original kernel must be accepted. Two aspects of the design cause this additional overhead: Firstly, before entering the best-effort scheduler, the hierarchical stride scheduler is run. Secondly, syncopation calls the scheduling code at least every jiffy, while original LINUX may allow tasks to run for several jiffies. To get a notion of the costs of the extended features, two scenarios have been designed.

In the first scenario, a single process counts up a long integer in between two measurements of system time. The resulting difference should give an impression of the absolute penalty produced within the scheduler. Table 4.1 displays the results of ten measurements with both kernels. In this simple looping test, the overhead produced by syncopation is measured by an amount of less than two thousandth parts.


Table 4.1: Scheduling Overhead - Simple Loop
  LINUX syncopation
mean value 50.208944s 50.386934s
std.deviation 0.077974s 0.028844s
maximum 50.336736s 50.435984s
minimum 50.107979s 50.342680s


A practical test involved a kernel build measured by the tool `time'. We have performed ten consecutive measurement with each kernel. Prior to the measurement two additional build commands had run to make sure that results have equal preconditions with regard to file caching effects. Table 4.2 shows the results.

For each configuration, the test was performed ten times. The results show that for jobs such as kernel builds, the scheduler overhead is less important, as job run-time strongly varies because of other reasons. However, it is not part of this thesis to examine these reasons. Instead, this scenario shows that the measured overhead does not provide significant latencies.


Table 4.2: Scheduling Overhead - Kernel Build
  time LINUX syncopation
mean value real 5m39.615s 5m41.015s
  user 5s15.133s 5m14.909s
  sys 0m20.298s 0m21.178s
std.deviation real 1.871s 0.770s
  user 0.316s 0.306s
  sys 0.346s 0.308s
maximum real 5m42.926s 5m42.258s
  user 5m15.750s 5m15.390s
  sys 0m20.940s 0m21.860s
minimum real 5m36.097s 5m40.007s
  user 5m14.420s 5m14.280s
  sys 0m19.660s 0m20.590s


Proportional-Share Scheduling

The hierarchical stride-scheduler implemented in syncopation has been implemented to receive guaranteed minimum bandwidth when all tickets are allocated, but implements proportional-share CPU scheduling whenever tickets are left for further allocation. Figure 4.1 displays the progress of three tasks. The x-axis shows the progress of real-time. The accumulated processing time of a process is given by the y-axis. The three processes have allocated 40%, 20% and 10% of available tickets. 25% of tickets are allocated by the best-effort share. 5% of tickets are still available and proportionally distributed to the stride-scheduled processes and the best-effort share. As dynamic participation of the best-effort share in the hierarchical tree structure would result in noise in the diagram, a simple loop process has been kept running as a best-effort process to hinder the best-effort fraction from becoming idle.

Figure 4.1: Proportional-Share Scheduling
\includegraphics {messungen/proportional_share.eps}

During the first second, all three processes accumulate CPU time slightly over their reservations, which is due to the unreserved 5% bandwidth which is proportionally distributed to the three processes in the figure and to the best-effort share. The graphs of the processes display smooth progress throughout their whole execution. After the first processes has finished execution, its bandwidth is distributed to the two remaining processes and the best-effort share. As a consequence, the graphs of the two remaining processes get steeper. The progress of process two is less steep than that of the first. This results from the fact, that the best-effort share receives a majority of the freed bandwidth. The second process' share is $20/(25+20+10) \approx 36\%$. Analogously, the graph of the third process increases to a progress rate of $10/(25+10) \approx 28.6\%$ after the second process has finished.


Bandwidth Reservation

While the previous section has already shown that reservations by processes are met when a free share of bandwidth is left, the next scenario will reserve all available bandwidth and run clients that are able to consume all of it. Four processes A, B, C and D are observed, that have reservations of 40%, 20%, 10% and 5% of CPU bandwidth. All processes are simple looping processes that do not get blocked. Again, a similar process without reservation is running in the best-effort section to avoid temporal availability of the 25% of bandwidth from the best-effort share. In addition to this, KDE - a graphical user interface - as well as system processes have been running during the experiment. As a consequence, scheduling has not generally been only after timer interrupts, but may also have occurred after other processes have blocked within their time-slice.

Figure 4.2: Process Progress
\includegraphics {messungen/a2_5_10_20_40.eps} \includegraphics {messungen/a_5_10_20_40.eps}

The mechanism of hierarchical stride-scheduling as presented in Chapter 2 results in deterministic intervals between allocations of processes. A perfect scheduling for process D would allocate it every twentieth scheduling, process C would be perfectly served with allocations at every tenth scheduling. While single processes could perfectly be served, the combination of them produces conflicting goals, as different intervals between allocations may provide a period after which both processes would like to allocate. Consequently, processes within a group of many can not always be allocated at fixed intervals. The graphs of Figure 4.2 visualize these effects. Both types of graph present the accumulated processing time of a process resulting from passed system time. In the graph at the top, each line denotes the actual progress of computation time. For some periods of time, the lines are parallel to the x-axis. This denotes that a task is not allocated to the CPU. Whenever the curve advances in the y-axis, the process consumes processor cycles. In the graph below, each point denotes the accumulated processing time at the moment the process is dispatched from the CPU. Both graphs show that the lower the reservation of a process is, the lower is the variation in the intervals between allocations. Process A, which features the highest reservation, is supplied with the largest latency when periods of processes B, C and D produce scheduling points close to each other. At these points of relative system time, hierarchical stride scheduling does not produce successive allocations for process A, as can be seen in the graph at the top. Apart from the intervals between allocations, the graphs clearly show that all processes receive the bandwidth they have reserved.

Figure 4.3: CPU percentage
\includegraphics {messungen/b_5_10_20_40.eps}

The same scenario which has been employed for Figure 4.2 is also used for Figure 4.3. While the x-axis still presents the progress of system time, the y-axis now presents the share of computation time between two measured points in time. Like in the previous measurement, points display the moment of time when a process has been dispatched from the CPU. To reduce variation of percentages, not every single moment of dispatch time is employed, but a process needed to accumulate at least 30 ms of computation time. Only then, a measurement is taken. The results of this experiment show that the percentages of all processes are close to their reserved fraction of bandwidth.

Transfer of Allocated Capacities

The credit mechanism provided by syncopation allows the transfer of fractions of reserved CPU bandwidth to cooperating processes. To display the effectiveness of this transfer, a scenario to simulate a simple client/server relation between two processes has been designed. Process A serves as a client and starts with a reservation of 40% of CPU bandwidth, while process A represents a server with an initial reservation of 10%. After some time, process A transfers 30% of total reservation, so that utilized bandwidths between A and B are exchanged. After waiting for some more jiffies process A recollects its credit from process B so that the initial distribution is restored. In this scenario, both processes utilize more bandwidth than guaranteed. This is due to the fact that only 25% of bandwidth are allocated by the best-effort share. Another 25% are unreserved and thus allocated according to the proportional reservations of the best-effort share and processes A and B.

Figure 4.4 displays the progress of both processes as previously presented in the other scenarios. Tickets are transferred at an approximated relative system time of 0.35 seconds. After that, both processes have exchanged progress rates until at 0.73 seconds, tickets are recollected.

Figure 4.4: Transfer - Process Progress
\includegraphics {messungen/a_transfer.eps}

The percentages of CPU consumption for this scenario are displayed in Figure 4.5. The method employed for this measurement is the same as described in Section 4.3. The percentages show that the sequences of allocations of both processes automatically synchronize after some allocations to find almost identical patterns of allocation.

Figure 4.5: Transfer - Process Shares
\includegraphics {messungen/b_transfer.eps}

Isolation from Best-Effort Load

To complete the proof of concept that reservations are guaranteed by the kernel, an additional experiment proves that reservations are violated by an increase of best-effort processes. To do this, we have employed a reservation of 75% of tickets for one process and before that had started different amounts of non-blocking CPU-bound processes in the best-effort section. Figure 4.6 presents two curves, one of which displays the progress of the process with reserved shared when competing with one best-effort process. The second shows the line when competing with one-hundred processes. Although allocation sequences slightly differ, both processes match their reservations.

Figure 4.6: Isolation from Best-Effort Load
\includegraphics {messungen/be_load.eps}

Multimedia Usability

To test the usability of syncopation for multimedia applications, the final scenario examined whether a common MPEG-player could employ the modified kernel without modifying its code by nothing but calling the ticket reservation. The experiment uses mpeg_play - an open-source player - together with an MPEG-coded video excerpt from the famous movie `Wallace & Gromit - A Grand Day Out'. The video provided no audio information and was encoded at 30 frames per second with a total of 359 frames.

Within all of the following tests, two general simulation techniques were employed: In the best-effort system, concurrent load has been produced by simple-loop processes as those described before to simulate CPU-intensive processes. In the syncopation environment, one single looping process has always been running to avoid an idle best-effort fraction. At the same time, a similar process was employed which reserved the remaining share of CPU bandwidth.

Performed Frame-Rates

To get an overall impression of performance, measurements started with determining how many frames per second mpeg_play has been able to display in environments. Table 4.3 presents the results from the best-effort experiments. The third row describes the relation of displayed frames per second to the specified number of frames per second according to the format of the stream.


Table 4.3: Best-Effort Display Rates
other load frames/second performed/planned
0 27.9 93%
1 26.7 89%
2 19.1 63%
3 14.8 49%
4 11.8 39%


The first line shows that the player cannot perform the video-defined rate of 30% although it is the only process except for system processes. This seems to be caused by LINUX' timer resolution because of the following reasons: When entering the display function, the player compares the targeted time for the next frame to the current system time. If targeted time is at least one jiffy away, the player sleeps via select(). A comparison of the time stamp before the select call to the targeted display time of the frame showed that the player always had been in time before the select call. After the select call, the process sleeps until the next timer interrupt and thus up to a worst-case 10 ms.

When a single other process is executed, the video player does not lose much of its performance. Actually, a measurement with the `time'-command returned that the player consumes 50% of processor time when replaying the video. Consequently, the LINUX scheduler seems to perfectly distribute time to both competitors. When competing with more than one process, the frame rates drop significantly with a worst-case loss of 7.6 frames per second when the number of competing processes is increased to two processes.

In the end, the table shows that player performance sinks quickly with rising load. When four other processes are running the player is not only slowed down but jerks unacceptably.

A much smoother degradation of performance is given by Table 4.4, which shows the performance of the video player in the syncopation environment. As all shares are reserved by processes and concurrent processes do not block, the player receives only the share of CPU cycles it had reserved before.


Table 4.4: Reservation-Based Display Rates
reservation frames/second performed/planned
50% 27.8 93%
40% 26.8 89%
30% 21.4 71%
20% 15.6 52%


By specifying a CPU reservation of 50%, which is the CPU share observed by `time', almost exactly the same performance as observed in the `idle' best-effort system is measured. Even with only 30% of processing time, the number of frames displayed decreases to only 71%. In comparison, the best-effort environment already showed a remaining share of 63% of planned frames, when two other processes were running.

Latency Tests

As has been mentioned before, the player resigns from the CPU whenever at least one jiffy is left until the designated display time of a frame. If less than one jiffy is left, the player immediately displays the frame. It also continues, if display is behind the targeted time. Targeted display times are always computed from the display time of the last frame.

Figure 4.7: Best-Effort Latencies
\includegraphics {messungen/mpeg_latency/best_effort.eps}

Figure 4.8: Reservation Latencies
\includegraphics {messungen/mpeg_latency/reservations.eps}

Figures 4.7 and 4.8 show the difference between the target time to display a frame and the actual display time. Figure 4.7 presents the best-effort results. As expected, the player processes may be waiting for several jiffies until it is assigned to the CPU whenever it has blocked. To keep the graph understandable, graphs for a single concurrent process and for three concurrent processes are not included, as graphs largely overlap. The worst case latency for a single concurrent process is bound by 16 ms and is except for one case frame identical to the timing results in the idle system. As is shown in the figure, two other processes already produce a maximum latency of 66 ms for frame number 35. Four other processes already produce a recurrent delay of 116 ms.

In contrast, the reservation scenario in Figure 4.8 presents close bounds for every degree of reservation. Even with a reservation of only 20% the player reaches a worst-case delay of 60 ms. With a reservation of 30% the worst-case is already bounded to 26 ms, a reservation of 40% already displays a worst-case bound of 7 ms. The same bound is given to a 50% reservation.

Progress in Time

After examining the total display rates and the individual frame-display delays, the following figures are meant to compare the two environments according to the player's progress in time.

Figure 4.9 tries to compare the two scenarios. As a process in a fair environment should be expected to be allocated as much resources as any other process of the same priority, we have divided 100% available CPU bandwidth by the number of running processes in total. For example, if the player in the best-effort scenario competes with three others for the processor, we assume that it should receive a quarter of totally available bandwidth. As we further have to take into account that all system processes are also running in a best-effort environment, we compare it to a process with a reservation of only 20%. Although a reservation of 20% could still be expected to display slower progress than a process with three other concurrent processes in the best-effort environment, the reservation-based player performs better. The same motivation has been applied for the comparison of the best-effort scenario with two other competing processes to the process with a reserved 30% of bandwidth. The experiment again presents an advantage for the reservation-based approach.

Figure 4.9: Best-Effort vs. Reservations
\includegraphics {messungen/mpeg_progress/reservation_vs_besteffort.eps}

What is more, Figure 4.9 displays, that reservations for the video player are successful even to the smoothness of display intervals. The curves show the player in the `idle' best-effort environment and with a reservation of 50% in the syncopation environment. Both curves are close to one another.

Conclusion

The experiments presented in this chapter prove that the mechanisms for CPU bandwidth reservation, the proportional-share scheduling and the transfer of capacities have been successfully implemented. At the same time, the features of syncopation produce a low overhead which can hardly be measured at all when running applications. Furthermore, the multimedia scenario has shown that the code of an existing program can easily be altered to utilize the features of syncopation.


Next: Related Work | Up: Flexible Scheduling for Linux | Previous: syncopation
University of Kaiserslautern | Computer Science | Distributed Systems | Research | Squirrel
Rainer Koster, 2000-03-17