|
|
|
|
|
|
|
We have decided to implement syncopation in LINUX. A major reason for this has been the fact, that LINUX sources are freely available and current documentation on its concepts and internals [1,15] can be purchased. Furthermore, it is one of the most popular members of the UNIX family.
The two service classes SCHED_FIFO and SCHED_RR both provide fixed-priority real-time services. Processes of class SCHED_FIFO are scheduled in a first-in-first-out manner and thus are not associated with time slices. Execution of a process of this class is stopped only if either a process with a higher priority gets runnable or the process gets blocked. Processes of service class SCHED_RR run for only one jiffy and afterwards call the scheduler again, which enables a round-robin service for equally prioritized processes. Both policies are mapped into the same range of priorities, so that a process of SCHED_FIFO is inferior to a process of SCHED_RR associated with a higher priority and vice versa.
syncopation integrates processor capacity reservations for processes into a general-purpose operating system. Processor capacity reservation supplies a process with guarantees on the relative share of CPU cycles in a relatively fine temporal granularity. To achieve this, we have selected to employ hierarchical stride scheduling as presented in the previous chapter. The strong deterministic guarantees on the distribution of processing cycles over a period of time can provide a process with an even performance of a virtual processor with a scaled performance of its own. To support an application domain with additional services, the mechanisms and policies for these should be implemented in the user-defined scheduler. For example, an EDF policy realized within a user-defined scheduler can provide real-time guarantees to its threads.
The kernel-level scheduler is a hierarchical arrangement of a hierarchical stride scheduler together with a modification of LINUX best-effort scheduler as illustrated in Figure 3.1. The hierarchical stride scheduler partitions processor bandwidth by serving as a proportional-share scheduler to all of its clients. The set of clients of this scheduler consists of processes which have reserved shares of CPU bandwidth as well as of the best-effort scheduler. The best-effort scheduler serves all processes that have not reserved processor shares and is supplied with a user-defined minimum share of bandwidth to prevent starvation of best-effort processes.
Corresponding to the original suggestion of hierarchical stride scheduling, the scheduler is called at every timer interrupt. As has been described in the previous section, the original CPU scheduler allows processes to run for multiple jiffies until they are preempted. However, we have limited all time-slices for best-effort processes to only one jiffy which results in increased context-switches, if multiple processes are running. We plan to implement a propagation mechanism in a future release. Such propagation can be performed at scheduling time by successively determining possible consecutive allocations at every level of the tree and afterwards computing the minimum from the results of all levels.
To organize all running clients for the stride scheduler, we have employed a simple height balanced tree. All clients are positioned in the leaves, while inner nodes store the corresponding information on the subtree. Tree balance is striven for at insertion time. Every inner node keeps information on the balance of clients in the left and the right subtree. The insertion recursively chooses to insert into that subtree which has less nodes stored in it. At the time a leaf is reached, a new connecting node will be created in the former position of the leaf, that becomes the father of both the inserted and the existing leaf. When a client is deleted from the tree, its father which is an inner node will be removed and replaced by its brother. This tree organization produces a worst case of a depth of (k-1) steps down its branches with k nodes resident in the tree. As client dynamic in competition for CPU allocation is usually high and the number of running processes low, the tree can be expected to be fairly balanced most of the time.
syncopation administers a pool of 100000 tickets which represents the
total allocatable processor shares. Because of the proportional-share
mechanism realized by hierarchical stride scheduling, a process that allocates
an amount
of tickets may expect to allocate at least for one quantum every
jiffies. Violation of this bound may happen in the
order of
jiffies as presented in Section
2.2. The
best-effort share is protected against starvation by a minimum funding of
25000 tickets which are part of the pool of overall available
tickets. Consequently, the best-effort share as a whole is guaranteed a
minimal 25% of
CPU bandwidth. As utilization of the system may
require either to prefer best-effort or reservation-based processes, the
minimum funding for the best-effort share can be specified by the user at
compile time.
Processes cannot automatically inherit reservations, but need to explicitly allocate tickets. We have chosen explicit allocations instead of allocation inheritance, as we don't find absolute processor requirements of one process to be related to that of its father. The system-call stride(long tickets) enables a task to allocate or modify processing time by specifying the amount of tickets corresponding to the fraction of CPU time it requests. We defined a fixed number of 100000 tickets in total. This allows easy reservation of percentages with a finest grain of one part in 100000.
A task can specify negative ticket amounts in the system call stride() to free reserved capacities it does not consume. Measurement of actual capacity consumption can be performed by calling the monitoring mechanisms which are described later in this section. By specifying the negative value of its total amount of allocated tickets, a task can release its share of capacity and switch back into best-effort policy.
Whenever a remaining amount of tickets is not allocated by processes, the scheduler behaves like a common proportional-share scheduler: The unreserved bandwidth is distributed to all processes competing for the CPU corresponding to their proportional share of reservation. This also includes the group of best-effort processes.
Whenever a process leaves competition within the stride scheduler, the unused processor resource is proportionally distributed to all other competitors. Analogously, whenever the best-effort fraction of the system gets idle, its node is also removed from the stride tree. In both situations, the stride scheduler serves as a proportional-share scheduler for the clients remaining in competition for CPU allocation. As a consequence, the best-effort fraction as well as the tasks that allocated tickets possibly receive more processing time than guaranteed.
A major weakness of stride scheduling is based upon the fact, that any period
that a process is not runnable is not taken into account when recomputing its
pass after return: A process that leaves competition is associated with a
remain value, which indicates how much of its stride it has left to wait
for. However, the duration of the following pause for the process, which can
be computed from its pass value, is later removed from its properties, as the
task's pass is simply replaced by the global pass plus the process' remain
value. As a consequence, a task that reserves a fraction
of CPU cycles
and is blocked for a fraction
of its lifetime, will receive only a total
fraction of
of CPU cycles.
Obviously the client cannot rejoin competition with the pass value it had reached when leaving competition. A process that had been blocked for a relatively long period would otherwise block all other processes for many allocations. Still, a major improvement can be made, if the blocking period is taken into account without accumulating successive allocation rights. A client which rejoins competition thus may keep its pass if it is not exceeded by its blocking period:
if (client->pass < global_pass) /* task has lost part of its reservation, prefer it to others */ pass = global_pass -1; else /* keep pass, to take into account any blocking periods */ ;
Now a process that has already been out of competition for more than its stride demanded, will be immediately processed. On the other hand, a process that has already been waiting for part of its share will only need to wait for the remainder of its stride. The example in Figure 3.2 illustrates the modification. The allocation sequences in the illustration display two processes which have allocated 33% (grey) and 66% (white) of CPU bandwidth.
Sequence a) presents the allocation series according to original stride scheduling. After unblocking, the grey process is forced to additionally wait for the corresponding of its stride. Consequently, the process loses almost 66% of its processor share, although the process unblocks before the end of its stride.
Sequence b) displays the same situation with our modified algorithm. As the process has already been waiting for longer than its stride had demanded, the scheduler invocation caused by the change of state by the process lets it immediately allocate the CPU. Although the grey process still loses processor shares, the loss is limited the fraction that its blocking time exceeds its stride.
Applications often depend on the cooperation of different processes with each other, as for example in a client/server relation. The credit abstraction implemented in syncopation is an extension of the ticket transfer abstraction as described in Section 2.3. Like ticket transfers, credits enable the transfer of allocated capacities with the delegation of a job across the allocation boundaries of different processes. While a process loses its possession rights to transfered tickets in the original abstraction, credits are aware of their owner and may be recollected by the owner at any point of time after the transfer. The process which originally possesses the tickets represented by a credit is called its creditor. The process which uses tickets lent from another process is called a debtor. A debtor may at any time return a credit owned by another process.
To keep track of the credits and debts of individual processes, each process is associated with a linked list for each role it may serve in a creditor/debtor relationship. A `credit list' contains all credits the process has lent to other processes, a `debt list' keeps all credits it has been given by other processes.
Credits can be identified by a unique numerical property, which is set at creation time by a successive counter kept in the kernel. This identifier enables cooperating processes to differentiate between credits. Furthermore, processes that want to either recollect or return a single credit, can specify an individual credit by its id. Return or recollection of credits can also be performed as set operations, specifying either the full set of credits or debts contained in the corresponding list or restricting the set by providing the id of the process which is related to the credit.
Credits may recursively fund new credits:: A credit supplied by a creditor process can base upon a credit to which the process serves as a debtor. This extension of the credit abstraction supports multi-tier architectures, in which a server may have a client role in relation to further servers. We call a credit which serves as a foundation for a new credit its `father', the new credit is called a `son' of the first. As a credit may fund multiple further credits, credits may have multiple brothers.
This hierarchy is especially important in the case of recollection or return of credits. If a creditor recollects one of its credits, they cannot any longer serve as a foundation for `son'-credits transferred by the debtor. Consequently, all sons of the credit which is to be recollected need to be recollected too. To cope with recursive credits, we have implemented cascaded recollection and return of credits: Before a credit is returned or recollected, all of its sons are recursively recollected. To navigate within the hierarchy of tickets, each ticket is associated with a link to its father, a list of sons and the next brother.
While the percentage of missed deadlines signals a process that its progress is behind its demands, a process cannot easily determine whether it occupies more CPU bandwidth without a corresponding mechanism. While the stride scheduler only records accounting information during the time a process has been competing for processing time and does not consider those periods of time the process had been blocked, the accounting mechanism is designed to measure the actually consumed share of CPU cycles over a period of time.
To allow for easy comparison with the reservation a process has made, consumption is returned in parts per 100000 and thus is displayed in the same scale as allocations are specified. The combination of measurement of missed deadlines with detailed information on consumed processor shares enables a process to adjust its reservation of tickets on a given system at run time.
The initial value for this property is set to the allocated amount of tickets. Accounting would result in distorted information, if compensation for any modifications to process' allotted amount of tickets were missed. Thus, whenever the amount of employed tickets for a task is modified, either by transfer of tickets or modifications to allocation, the effectively computed share of tickets is reset to the amount of reservation.
As scheduling intervals of processes strongly differ as a result of their
reservation, the frequency of updates performed by the accounting mechanism
also depends on the allotted share. Accounting is computed with
accuracy. The measurement of a task's
CPU burst finishes just before
the selection of the next task begins. The point of time when measurement of
the previous processes has finished is taken as the starting point of every
burst.
Allocation of tickets in syncopation basically follows a first-come-first-served policy and thus may result in a situation where a client asks for an amount of tickets which is larger then the leftover amount of allocatable ones. While some tasks may adapt to a shortage of capacities by reducing their quality of service, others might not be able to provide useful results at all when not given an appropriate amount of tickets.
Consequently, an additional mechanism is needed to enable a process to question the actual reservations of other processes whenever its own demand cannot be served. As the final decision whether a process is capable of freeing a fraction of its tickets needs to be made by the process itself, a kernel mechanism for renegotiation can only signal its need for additional tickets to processes. Furthermore, a number of processes will have reserved exactly the processor bandwidth they actually need for execution and thus will not be willing to receive to renegotiation request. As a consequence a task should be able to signal its willingness for renegotiation requests.
Our concept for renegotiation consists of two components. Firstly, there need to be tasks that enable the kernel to signal a demand for renegotiation to them. Clients may explicity enable renegotiation via a system call in combination with the specification of a maximum deviation from the allocated share. If renegotiation is enabled, the accounting mechanism automatically monitors the actual CPU consumption of the process. Whenever a renegotiation is requested by another task, the kernel checks whether the maximum deviation from the reserved share has been violated by the actual consumption. In the case that a violation has occurred, the task is asked for renegotiation by the kernel.
When a task starts a renegotiation, information on the caller, its demanded amount of tickets, and the maximum temporal duration are stored in a status record for renegotiation. The maximum duration is computed as the point in time, at which the kernel must drop its efforts to free tickets and signal failure to the caller. This point of time is stored as an absolute jiffy value. Afterwards the kernel tests every task which has enabled renegotiation, whether the task-defined threshold between reservation and consumption has been exceeded. Every task that had been tested positively gets inserted into a `to do' list of tasks the kernel will signal to free tickets.
The order in which processes are asked for renegotiation is the order of allocation by the scheduler. Although a greedy approach to start renegotiation with those processes which have violated their maximum deviation should be expected to be the most successful strategy, this would result in violation of the temporal guarantees given to all processes which have reserved certain shares of tickets.
The interaction of processes with regard to returns or recollections of tickets as well as renegotiation is supported by an extended signal mechanism which offers further information to affected processes. Information may be demanded in the following situations:
A system call exists to explicitly enable extended signal information for a client. As additional information is written to the process' address space, a task needs to specify the location for kernel access to its address space. To notify a task of events such as tickets transferred or credits recollected by another task we have specified a protocol to exchange information between applications and the kernel and between different applications. The formerly unused Signal #31 now needs to be interpreted in conjunction with additional information written into the user space of the signal's target task. The struct stridesig_info has been defined as a new type to standardize information. Within its structure, information on the following is given:
As synchronization of signals allows only sequential processing by the client, signals are queued until the target is selected by the scheduler to allocate the CPU. At allocation time the first queued signal of a task is launched by LINUX' common send_sig(). To allow tasks to receive more than one signal per allocation, a task can ask for the number of queued stride signals and dequeue the next signal and write its information into the user space. Again, the protocol demands the unread flag to be reset first. Queuing is realized by a doubly-linked list, which is referred to from the task.
The advantages of this mechanism are twofold. Firstly, processes don't need to employ additional communication mechanisms to inform each other on the particular events. Secondly, the process is informed of events immediately when it is scheduled next. However, as processes may require to exchange additional information and thus may well include the information provided with our mechanism, this functionality does not need to be activated by processes.
The new scheduling policy SCHED_STRIDE which is set for all processes with reservations, already implies certain effects on the scheduler. As the original LINUX scheduling mechanism already has implemented POSIX-conform real-time policies, the kernel already calls the scheduler whenever a process with another policy than the best-effort service wakes up. Consequently, at the time a process wakes up and after its rejoin provides a minimum pass within the set of scheduled clients of the stride scheduler, it may immediately allocate the CPU.
The main modification in the scheduler-code has been done to the selection of processes. Prior to the selection mechanism implemented by the original LINUX scheduler, the stride scheduler performs its own scheduling. The best-effort scheduler is represented as a single client to the hierarchical stride scheduler and thus is served as any other process. If the representing node is selected at allocation, LINUX' scheduling code is executed. Otherwise, the process associated with the selected node allocates.
As has been presented earlier, the original LINUX scheduler selects the process to be run next by computing the maximum goodness of all processes contained in the run-queue. To prevent processes of the new real-time class to steal time from the best-effort share, the goodness function has been modified so that they are supplied with the minimum goodness value. This minimum goodness value is lower than the goodness of the idle-task, so that the scheduler will not select a stride-driven process even if no other task is ready to run. Of course, this does not make the system to run the idle task when stride-based processes are runnable. As mentioned in the beginning of this section, the representant for the best-effort share is removed from the tree of competing clients and its reserved shares are proportionally distributed to all other clients.
A new credit needs to be paid by a client from either the pool of tickets the client owns or by a credit the client itself owes to another one. Because of this, the properties which describe the tickets associated with a task are refined: Additional fields have been introduced to describe the number of tickets owned by a client and the number of owned tickets lend to other tasks.
The system call ttransfer() needs to be be supplied with parameters for the target of the transfer, the amount of tickets to be transfered and the type of payment. The target of a transfer is specified by the process id pid of the task that shall receive the tickets. By now, these targets must be tasks that already employ SCHED_STRIDE as their policy. In a future release, temporary policy switching for tasks may be introduced for the period a task of another policy obtains a credit.
The type of payment depends on whether a credit_id is specified or not. With
tickets will be taken from the amount of tickets owned by the
caller. Setting
determines that the credit will be paid by
a credit to which the caller is a debtor. In both cases, there needs to be a
sufficient amount of tickets left either in the pool of the task's tickets or
the credit to generate the new credit. If the tickets shall be taken from
those owned by the caller, testing is simple. If the credit should be paid
from another, the credits need to be authenticated as an existing debt of the
process. Afterwards, an additional test will see how many tickets of this
credit have already been spent to pay other credits (sons).
Credits can be cleared by either the debtor returning them to the creditor, or the creditor itself collecting the credit. System calls treturn() and ccollect() have been introduced to do this. Both calls have three different modes of operation:
Accounting information needs to be periodically updated and provide intervals
in between consecutive measurements which are large enough to provide
reasonable average numbers between consecutive allocations. As a process with
an allocation
would allocate at least every
clock ticks in an the best case, we
have taken a multiple of this interval to determine the intervals between
accounting updates.
The new system call stride_acc(bool enable) can be used to start or stop accounting by the kernel. Accounting is automatically enabled when a process has given permission for renegotiation. To get information on its effective share, a process may call geteffshare(). This function returns the effective share of the process in parts per 100000. If a process is not able to consume the share it had reserved, this function corresponds to a reservation which would be sufficient to meet its requirements.
As best-effort processes may need to call for renegotiation if their first allocation request cannot be served, a renegotiation can be started by any task in the system. In the current version, renegotiations are given exclusive access to the pool of free tickets. As a consequence no second renegotiation may be started whenever another is running and no task is able to allocate more tickets meanwhile. This restriction implements a first-come-first-served policy for allocations. Tickets can be freed at any point of time. To start a renegotiation, a task calls renegotiate(long tickets_demanded, long timeout). The parameters specify the number of tickets the task would like to allocate and a time out which denotes the maximum period of time that the renegotiation should be attempted.
While the renegotiation is running, its status information will be checked with every call to the scheduler. To do this, the scheduler calls the function step_renegotiation(), which compares the number of free tickets to the demanded amount to see whether the request can be fulfilled at that particular moment. If the task's demands cannot be served, the function will then check whether the specified timeout has passed and renegotiation has failed. In each of the two cases, renegotiation would have been finished. Otherwise, the task will compare the tasks that has been selected by the scheduler to the `to do' list mentioned above. If the task is listed in the `to do', it will be asked to free tickets and then be removed from the list.
|
|
|
|
| Rainer Koster, 2000-03-17 |