O(1) and Completely Fair Scheduler (CFS)
In a previous post we discussed scheduling and various approaches to it. Today we will see how scheduling is implemented in the Linux kernel.
Earlier versions of the Linux kernel used a round-robin approach implemented with a circular queue. It was quite simple to implement and was fast.
In Linux kernel 2.2, scheduling classes were introduced. Processes were divided into three classes: real-time, non-preemptible, and normal (non-real-time) processes.
In Linux kernel 2.4, the scheduler had O(N) complexity, N being the number of tasks currently in a runnable state. It maintained a single queue; whenever scheduling needed to be done, it went through all processes in the queue and selected the best one (based on priority) to be scheduled next.
There was a function called goodness(), which computed a score for every runnable process, and the process with the highest goodness was scheduled first.
In Linux kernel 2.6, the complexity of the scheduler was reduced to O(1), hence the name O(1) scheduler. This scheduler did not need to scan all runnable tasks as the O(N) scheduler in kernel 2.4 did. The idea behind it was very simple: keep two run queues for each priority level. As we will see later, Linux has 140 priority levels (0 to 139), so there were 280 such queues.
Two per priority level because one holds active processes while the other holds expired processes.
A process is moved to the expired queue when its time slice is exhausted. When all processes in the active queue have exhausted their time slices, the expired queue is simply made the active one.
To identify the task to be scheduled next, the scheduler just needs to dequeue a task from the highest-priority non-empty queue.
It also introduced interactivity metrics. The figure shows the arrangement.
Completely Fair Scheduler
Due to the unwieldy nature of the O(1) scheduler, another scheduler came along, called the Completely Fair Scheduler (CFS).
As the name suggests, this scheduler tries to be fair to all processes by giving each a fair chance to execute on the processor. The idea is again simple: keep track of the time a process has executed on the processor; processes which have received a small amount of time are boosted to get the processor, while those which have received a larger amount are held back.
The biggest change that came with this scheduler was the data structure used to maintain the list of runnable processes. All earlier implementations used linked-list-based queues. This implementation uses a red-black tree with the following property: all nodes that have received less time on the processor are stored to the left of the root, while those that have received more time are stored to the right. Whenever the kernel wants to select the next process to schedule, it takes the leftmost node of the RB tree and grants it the processor.
While an insert operation is being performed on the RB tree, the leftmost node is cached (in the kernel this cache lives in the cfs_rq run-queue structure), so looking it up is fast.
A red-black tree has the very interesting property of being self-balancing: no path from the root to a leaf is more than twice as long as any other such path, and all operations on the tree are done in O(log n) time.
We know that a process in the kernel is represented as a task_struct. However, the scheduler does not use this structure directly; it uses another structure, sched_entity, which is embedded in task_struct.
sched_entity contains the node of the red-black tree, which in turn contains pointers to the left and right children and to the parent, with the node's colour packed alongside the parent pointer.
As I mentioned earlier, every process has an associated run time, denoting the time it has received on the processor. This time is stored in sched_entity as vruntime. vruntime is the key that moves a process's node from one side of the RB tree to the other. The figure shows the relationships between the various data structures used in scheduling. Next, let us see how priorities are calculated and used.
Every process has two priorities associated with it. The first is the nice value of the process. It ranges from -20 to 19, the default being 0. The lower the nice value, the higher the priority. If there are two processes with nice values 5 (process 1) and 10 (process 2), process 1 has the higher priority.
We can check processes' nice values using the ps -el command in a shell (the NI column).
In kernel space the nice value is translated as MAX_RT_PRIO + 20 + nice value, and the result is stored in the static_prio field of task_struct.
The second priority is the real-time priority. It has the opposite notion: the higher the value, the higher the priority of the process. It ranges from 0 to 99.
There is one more field in task_struct, called prio, which stores the effective priority of the process as considered by the scheduler; this is the value that is boosted or lowered, for example to avoid cases like priority inversion.
In earlier schedulers, priority values were used to decide both the next process and the time slice it gets on the processor. There were many shortcomings in that approach, very well described in the book "Linux Kernel Development" by Robert Love.
In CFS, instead of using priorities to decide the absolute time a process gets on the processor, the portion of processor time a process ought to get is calculated, based on the overall load of the system.
If a process is getting less than its share, it will eventually move towards the left side of the RB tree and get processor time.
There are three scheduling classes, and every process is associated with one of them.
1. SCHED_FIFO : This class holds real-time processes. There is no time slice; a process of equal priority will not preempt the process already running, which keeps the processor until it blocks or yields.
2. SCHED_RR : Same as SCHED_FIFO, but there is a time slice associated with the process; the process leaves the processor when its time slice finishes.
3. SCHED_NORMAL/SCHED_OTHER : All processes which are not real-time are put under this class.
These classes are considered in the order SCHED_FIFO, SCHED_RR, SCHED_NORMAL; the scheduler goes through them in turn, and the first class that has a runnable process gets to decide which process gets the processor next.