Linked list in linux kernel implementation

Linked list in linux kernel implementation

We learned a lot about linked and solve around 30 odd problems : Linked List Problems. However, actual implementation of linked list in linux kernel is very different that what we learned. So, let’s understand how linked list is implemented in linux kernel and how it is used in kernel code.

In simple linked list, we have nodes which contain data and these nodes point to next node of them. In other words, its the list which contains nodes which are linked. Typical example of structure of node of this kind of list is:

struct node {
  int data;
  struct node * next;
} Node;

However, in linked lists in kernel, it’s the other way around, that is linked list is contained inside the node. This means, that there is no next pointer inside node and every node is effectively a head node, nature of linked list circular and it is doubly linked list. Lot of things in one sentence!!

Linked list implementation in Linux

Let’s understand things in detail. As said, linked list is contained inside the node, structure of node is something like this:

struct node {
 int data;
 list_head list; // list is inside the node

Here list_head is what defined as :

struct list_head{
  struct list_head *next, *prev;

See it has two pointers, essentially, making any node which contains this structure, part of a doubly linked list. Most interesting part of this kind of definition of node is that same node can be part of multiple list without being reallocated for each list. For example, in traditional linked lists, if I need two linked one as odd numbers and other as prime numbers, i would have to define two linked lists , one containing odd numbers and other for primer number. With implementation provided in linux kernel, we can attache the same node to two lists as shown below.

struct numbers {
 int number;
 list_head odd_numbers; // contains all odd numbers
 list_head primer_numbers; // Contains all prime numbers

How to access a node in linked list in Linux Kernel

Now that we have changed the node structure, how can we access a give node of linked list. It was simple to do in normal linked list as base address of node was easily accessible. In linked list in linux kernel, we have pointer to list_head structure in next node and not pointer to next node itself, as shown below.

linked list in linux

There is a beautiful trick in C, which is used here to access base address of node whose list_head pointer is given. Once base address of node is found, accessing becomes similar to normal linked list. Trick is that given a pointer to list_head in structure, and to find the base pointer of structure in which it is present, we need to find the offset at which list_head is stored in list. Once, we know the offset, (how many bytes, it is far from base address), then just subtract that offset from the absolute address of the pointer (which is given) and we get the base address. Figure explains

linked list in linux kernel example

Let’s take an example, we will use structure numbers as given above. To get offset of element number in that, code is:

(unsigned long)(&((struct numbers *)0)->number)

Now, that we have offset of number and absolute address of number, we can get the base address of struct numbers as :

((struct numbers *)((char *)(pos) - (unsigned long)(&((numbers *)0)->number)))

ANSI C defines the offsetof() macro in , which lets you compute the offset of field f in struct s as offsetof(struct s, f). If for some reason you have to code this sort of thing yourself, one possibility is

#define offsetof(type, f) ((size_t) \
 ((char *)&((type *)0)->f - (char *)(type *)0))

Above code is not portable and some compilers may have problems with it.

There are some MACROS which are defined in linux kernel which are useful in dealing with linked lists. Below are some examples:

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))

#define LIST_HEAD(name) \
	struct list_head name = LIST_HEAD_INIT(name)

#define INIT_LIST_HEAD(ptr) do { \
	(ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)

Please refer to this file to understand various macros which are used in linux kernel. In next post, using these constructs, we shall create a linked list , access it and delete a list.

Please share if there is something wrong or suggestion for improvements. Also, if you like the content, please share it.

Generation & Analysing Cores Files or Core Dump

How to generate and analyse core dump in GDB

Core file or Core dump generation is a standard unix feature. A core file is simply a snap-shot of a running process’s memory at the time of generating the core. This is collected by the kernel when it decides to snap the process or task. You can consider this as memory photograph of a process in question. This is generated which process crashes or can be generated on user demand for various purposes example: debugging.
The core contains the memory contents of the process at the point of the seg-fault including its code segment, data segment, stack segment and heap segment. So, we can analyze why the particular crash happened.
This can be generated using following commands:
Kill –ABORT <pid>
Under GDB:
gcore <filename>, Or
generate-core-file <filename>
Parameters that control core file generation
In linux, we have
  • resource limits of the process
Every process has various limits associated with it. Read man setrlimit for these. Bash offers a built-in ulimit through which we can edit these. These control the limits of future process spawned from this bash-shell. Core-file max-size (ulimit -c/RLIMIT_CORE) controls the maximum core file size that can be generated when the process snaps.
  • kernel core file pattern.
Linux offers a proc file hook – /proc/sys/kernel/core_pattern, which contains the format for the name of the core-file. It allows using some format specifiers
 %p: pid
 %: '%' is dropped
 %%: output one '%'
 %u: uid
 %g: gid
 %s: signal number
 %t: UNIX time of dump
 %h: hostname
 %e: executable filename
 %: both are dropped
In addition using ‘/’ as part of the names helps to move the core-file location to a directory other than pwd (relative or absolute depending on if start is a ‘/’). In addition, if the patter starts with a pipe (‘|’), linux will pipe the core file into the stdin of the program that follows next! One can permanently set this in a linux machine by adding kernel.core_pattern in /etc/sysctl.conf.
# ulimit -c
If the output is 0, this suggest that core dump is disabled by default. Else, you will get resource limits using ‘ulimit –a’

Checking goodness of the core generated fine

ubuntu@:~$ ulimit -c unlimited
ubuntu@:~$ ulimit -a
ubuntu@:~$ vim abc.c
ubuntu@:~$ gcc abc.c -o abc ubuntu@:~$ ./abc Segmentation fault (core dumped) ubuntu@:~$ ls abc  abc.c  core
At times due to various reasons core file corrupts and and GDB is not happy with it, to really validate this fact is is important to check the consistence of the core file:
Good Core file

<div>A corrupted core-file will give a different o/p with objdump and will give warnings when tested with readelf. The readelf check is usually more reliable than objdump.

ubuntu@:~$ objdump -a core
core:     file format elf64-x86-64
ubuntu@:~$ readelf -a core 2>&1 | grep -i warn
If you have printable stings in your code like Version Number, Software Release, etc. You can print them to validate the correct version of core file you are debugging is like this:
ubuntu@:~$ strings abc | grep SW-VERSION
Make sure GDB doesn’t complain about the else it may give warning like “warning: exec file is newer than core file.” or something related to mismatch of core Vs executable.
Code debugging improves system understanding and its behavior, hence do it often even if you do not get crashes.

Cache and cache mapping techniques

In general terms, cache is a store which are fast accessible and where you keep things which are frequently accessed. It can be a website, browser, or processor cache. In this post, we are focusing on memory caches.

Cache Definition

A small and fast memory which stores some block of data being used by process. The block of cached data can be something which is recently used or something which is close to what was recently accessed. There are different techniques how data is cached and moved out like least recently used, round robin, least frequently used etc. With advent of changes in processor technologies and memory chips, level of caches has gone up from 1 to 2 to 3. Multiple levels can be there between processor and memory

cache levels

Why do we need to cache?

There is a huge difference between processing speed of processors and memories which are usually DRAM. So, if it takes more time to fetch instruction or data from memory, overall throughput of machine goes down. That’s why we need faster SRAM memories which have considerably less latency in fetching instruction or data. These SRAM sit between processor and main memory to speed things up. SRAM chips are faster since they do not have to refresh their contents.

Then, why not have main memory equally fast? It is because SRAM are very expensive and power consuming, so their size beyond a point is not affordable.

Programming with caches

Caches are transparent to programmers, it means that programmer does not have to worry about caching strategy while coding however, sometimes, keeping this in mind while programming may give tremendous performance benefits. Programmers are kept oblivious to cache to shield them from the micro instruction of CPU which usually change. Also, cache implementations vary even in same processor, so it will be difficult to write general purpose code. Size of cache limits the scalability of program.

Size and latency of various storages

Operations on data

Whenever a process asks for a data, it is first checked in cache. If data is already cached and present, it is returned. This is called as a Hit. If data is not present, main memory is accessed and data is brought returned to process and put into cache also. This is called as a Miss.
cache hit or miss

What’s the data cached?

By caching data, operating systems want to minimize delay to fetch next data or instruction. Cache mechanisms use principle of locality to bring in data which may be accessed next, based on currently accessed data, for faster access. There are two locality principles:
Temporal locality:  data which is used recently may be used again in near future.
Spatial locality:  data near to current accessed data may be accessed in near future.

Data consistency with caches

With cache in between memory and process, there are two copies of data. How to maintain consistency of data in this case? What happens, when process accesses data, gets it from cache and want to modify it? There are three ways to do so.

  1. Data is written on to cache and main memory with write through approach. Both are in consistent state.
  2. Data is accessed, modified in main memory but not updated in cache. This is write around approach.
  3. With copy back method, when data is just written back to cache but main memory is not updated.

Cache mapping techniques

As mentioned earlier, size of cache is much smaller than main memory and during writes by process, data has to be put back to main memory. So we need some kind of mapping between two.

Before understand that a line is nothing but a block of data which is accessed together and mapped to a block of main memory. Each line has a tag (usually high order bits in cache line address) which identifies which memory address it is mapped to.There are three mapping techniques : Direct mapping, fully associative and set associative mapping

Direct map

Consider a cache of 128 blocks and 16 words each. In this scheme, each address j in memory is mapped to j modulo 128 mapped cacheHow can we map memory of 64 KB divided into 4096 bytes blocks?  Block number 0,128,256….3096 will map to same block number in cache,
Number of bits required for addressing 64 KB will be 16 bits. To represent 16 words in cache line, we need 4 bits. To address 128 cache lines, 7 bits are used. Rest 5 bits are used to mark as tag.
How to identify that memory block is present in cache?  Let’s take an example: accessed address is 43029 ( 10101 0000001 0101). Here, tag is 5, block is 1 and word 5. If block 1 has tag 5, then it’s a hit else miss.
Drawback is that more than one memory address may be mapped to same cache line. For example, in example if address with block 0 and 128 are accessed together, there will be miss every time and hence decreasing the performance.

Associative map

A block of memory can be anywhere in cache as opposed to fixed positions based memory address bits in direct mapping. Each line has tag. When processor accesses memory location, it checks all tags and determines if it’s cache hit. This slows down memory access (checking all tags for each memory access), however, since all tags are checked in parallel using special circuitry, it’s not an expensive operation.The last log(size of line) bits defines word number to be accessed from cache.
To map 64KB memory, 16 words each, in 128 block memory of 16 words each, 16 bits address are required, last 4 bits (log(16)) give word’s address while most significant 12 bits give tag.

Set associative map

To avoid checking all tags, there is another mapping technique called as set associative mapping. Set associative mapping is mixture of direct map and associative map. Set contains multiple lines, with number ranging from 2 to 16 lines/set. Specific bits in address identify which set stores given memory address, .  When processor accesses memory location, it first looks at set bits, goes to that set and checks all tags in it.

Translation look-aside buffers

Translation of virtual or logical address to physical address takes a significant processing time in virtual memory implementation. To understand what is virtual memory and it’s implementation, please refer to post on virtual memory and paging.

Each time process accesses a page, it refers page table which leads to delay. Solution for this is having a cache for page table too. It is known as translation look-aside buffer, commonly known as TLB.
Process goes to TLB first when it accesses page. If physical address of page is available in TLB, no need to access page table and do all translations, page is accessed using physical address. Whenever, a page is moves out of main memory, then its reference from TLB is removed.

Please share if something is wrong or missing. If you want to contribute on algorithms and me, please refer publishing on algorithmandme.

Virtual memory using paging

Virtual memory using paging

In last post in this section, basics of virtual memory were discussed. In this post, we will understand how actually it is implemented in operating system and hardware using paging. As in said earlier there are two methods with which virtual memory is implemented: Demand paging and segmentation. Focus of this post is to discuss demand paging.
What is demand paging?
It is concept where the entire process in not brought into the memory but is divided into number of pages, and only relevant pages which are required for execution of instructions at that time are brought in main memory. Rest all pages reside at disk and as and when they are required by process, are swapped in from disk. Since pages are brought in on demand, it’s called demand paging.
As we know logical address space of a process can be bigger than the real physical memory, there might be scenarios when the required address page is not in the main memory and hence swap in and out are required. Let’s understand the paging and demand paging using a simple example.

A process which has logical address space of 128 bytes. (ridiculously small, but good for an example) and page size of process is 16 bytes, so there are total of 8 pages as shown in figure below.

Now, the main memory has only 64 bytes of memory and page size is same as 16 byte. At a time only four pages can reside in main memory. Mapping of four pages of the process is shown below. Rest four pages are put on to disk. Initial arrangement is as shown below

paging virtual memory

How many bytes are required to represent logical address space of process of 128 bytes? 7 bits right.
If there are 8 pages  then 3 bits are required to identify the page and rest 4 bits can identify the offset in that page raging from 0 to 16 bytes.3 bits give us virtual page number called as VPN and last 4 bits are offset.

logical address space page address

How do system translates VPN into physical page also called as Physical Frame Number?

There is a special per process data structure which stores this translation information which is called as page table. Page table is keyed on VPN and stores the corresponding PFN on to main memory.
One the PFN is found from this table offset is added to that PFN and the required address is accessed from the memory. For the above initial state, page table entries will be like this

Page table layout

Translation happens like this. In real world there are multiple hierarchies of page tables like page directory and then that points to page table and then the offset. For simplicity, those details are skipped here.

page address translation

Let’s take an example : Process wants to access address “34”. Binary representation of “34” is 
“0100010″ . In this VPN bits are 010 which is 2. Process looks in page table corresponding to VPN , PFN is 2  and get PFN as 3 in binary “011” which is added to the offset which is “0010”, So the final address accessed on physical memory will be “0110010” which is address number “50”.

There is a catch here. As mentioned earlier, the PFN given by page table may not be in the memory. In this case page fault is generated and operating system brings in the required page from disk to memory and instruction is restarted.

What if the main memory is already full when page fault occurs? Operating system needs to swap out one of the page to disk to make room for new page. How does operating system decides which page to move out? There are various page replacement algorithms like first in first out, round robin, least recently used, these paging algorithms will be discussed in detail in next post.
The entire process is depicted by figure below.

Page access and fault handling

There are some special bits which are stored in page table along with the PFN for a VPN.
These bits are useful to operating system while deciding on action to be taken for particular page. bits are as follows:
  1. Invalid bit /Present bit : This bit means that page is not valid and moved out the memory.
  2. Dirty bit: If this bit is set that means page has been written by some other process and this is not the latest copy of the page.
  3. Read/Write bit : It indicated the access right of process on this page.
  4. Accessed bit : This indicate that page is accessed since it is in the memory. This helps in page replacement algorithms like least recently used page replacement algorithm.
It is not that every time a process requires a page from memory, it has to read it from memory. There is a concept called as caching, In caching, operating system store some of the pages into faster accessible memory called as cache. Caching works on principle of locality, that means if the process has requested an address, it is more likely that it will access a near by address in next few instruction. So why not cache that page and made it fast accessible, Process first checks the cache in order to fetch the page, then if page is not found in cache then it goes to main memory. This is called as cache miss. Again size of cache is much smaller than main memory and page replacement algorithms need to applied here too.
There are various kind of caches which can be employed for this purpose, details of this later.
This is how virtual memory is implemented using paging, please write in comment if I missed something or something is wrong.

Virtual memory in operating systems

Virtual memory

What is virtual memory?

Memory is integral part of any execution of process. It is the place where process reads and writes data to process on. An executable instruction needs to in memory for before processor can execute it. What happens when available memory is less than that is required by process?
That’s where virtual memory comes into picture. It is a technique which is implemented by operating system to meet process memory requirements even when there is not enough physical memory available. This memory which is not real physical memory is called as virtual memory. So it is nothing but extension of real memory with the help of disk space.
Virtual memory brings in another concept called logical address space. The address which are directly mapped to physical memory are called as physical addresses. Other type of address which is actually visible to process and used by it to execute its instructions, is called as logical address space. Size of this space vary from architecture to architecture. This is not the actual physical memory allocated to the process. So, the address space which a process sees is not actual physical address space available. Separation of logical space from physical address space removes the dependency of program size on the available physical memory. Program size can be much bigger than memory available because it is not necessary to bring in entire process in memory at the same time. There are various sections in process like error handling/ exception which are rarely executed. No point in bringing all those in memory. Operating system can bring those in as and when they are required. How do these two spaces map?  To map these two, operating system implements memory map function which does the dirty work of mapping process’s logical address to physical address on to memory. 
One question still remains: What happens to extra memory which is required by process and is not available now? For example if memory required by process is 150 MB and only 80 MB is available in memory, there is 70 MB short. How do we manage that? Operating systems rely on the fact that not all instructions of process are required to be in memory at the same time in order to execute it. So, only instruction which are needed are brought into memory and rest 70 MB are allocated on disk and managed by a process called as Virtual Memory Manager (VMM). Instructions are swapped in and out when there is a need of an instruction which is on disk and not in main memory.

There are two methods in which virtual memory concept can be implemented :
1. Demand paging (showed below)
2. Segmentation (will be discussed later)

All in all, below picture explain overall virtual memory concept.
Virtual memory
Virtual memory overview
Virtual memory allows process to use much more memory than what is really available. It makes process which are very big in size (more than memory size) to be able to execute in memory. Other advantage it brings is ability to run multiple process at a time in main memory which is called as multiprocessing.
To go in details of concept, let’s understand how physical memory is organized. Physical memory is divided into chunks of a predefined size usually 4 KB. These chunks are called as pages. When a logical address is accessed by process, it physical address is found by memory map and the page in which that memory address falls is accessed. What if that address is not available in memory? In that case, operating system needs to bring that page into main memory from the disk. Another question? What if there is no space available in main memory to accommodate that new page? Then operating system which decide which page needs to be taken out from memory to make place for new page. Page replacement algorithms is another study all together. I have discussed one Least Recently Used page replacement algorithm in this post: Least Recently Used cache. Point when a page is not found in memory and needs to be brought in from disk is said to be page fault. Page fault has it’s own overhead and hit on performance. We will talk of principle of locality in later posts. 
Below figure explain the basics of paging

Paging overview
Paging overview

Once desired page is loaded into main memory, process resumes it execution. 

This swapping of pages between hard disk and main memory presents another problem called as thrashing. Thrashing is an occurrence when every instruction or most of them lead to swapping of pages from main memory to disk and vice-versa. This has tremendous impact on performance.

From the above explanation it is clear that in order to implement virtual memory, we need hardware support for paging or segmentation and operating system capable of moving pages in and out of main memory. Effective time to access is defined as
ETA = (1-p) * memory access time +  p * {swap in time + swap out time + restart overhead } 

This article explain basic concept of virtual memory. In next post we will discuss how page address translation works in order to realize it.  Please leave comment if you think there is something I missed.

Operating systems : Thread models using user and kernel space threads

Threads are very important concept in operating systems. We covered the basics and difference between a process and thread in Process Vs Threads post. Let’s understand it in more detail.

Thread models

We briefly touched upon various thread models which are usually used for implementation of threads. These models are :
1. 1: 1 model
2. M:1 model
3. M:N model
So before understanding what these models are, we should understand what is difference between user space threads and kernel space threads.
User space threads are those threads which are completely implemented in user space using some library which is implemented above kernel. These threads are easy to manage and schedule as whole logic is in application itself. Also, for obvious reasons, application using user level threads is portable to any kernel. No kernel mode privileges are required for switching between threads.
Problem with these threads is that kernel is totally unaware of multithreading at application and any blocking system call will block the whole process.So, these kind of threads are useful in application when thread are in runnable for most of their lifetime. Also, thread needs to give up control voluntarily, else no other thread in process will be able to run.
When thread is implemented in user space, application has to maintain its thread table to keep track of thread status.
Kernel space threads are threads which are implemented in kernel and there is no logic required at application level. Thread creating, management and scheduling is done in kernel space. Advantage is that kernel is aware of threading model and scheduling can be done accordingly.
Drawback of this mode is kernel has to keep track of threads, synchronizations and scheduling, which will be overhead.
To get best of both worlds, we have something called as hybrid models and that’s where above mentioned threading models come into picture, Let’s see them one by one.
1 : 1 model
In this model, for each thread in user space, there is corresponding thread in kernel space, so in true sense kernel can leverage the multiprocessor by scheduling different threads on different processors. Window server usually follow this model.
M:1 model
In this model, for M user space threads there is only on thread in kernel space, It is as good as implementing user space thread because once there is system call to kernel, process will be blocked as only once thread is there in kernel. This can be overcome by replacing blocking system calls to kernel with non blocking equivalents, however many systems cannot be implemented as non blocking version.
M:N model
This model in true sense leverage the threading model, where for M threads in user space, kernel has N threads, there is no predefined one to one mapping between user space threads and kernel space thread and it is done dynamically.
Problem is that scheduling needs to be implemented at both user space as well as kernel space.

Interrupts and bottom halves Part 2

Interrupts and bottom halves

In the last post in interrupts we discussed some of the design issued to be considered while writing an interrupt service routine. Two main things which came up were: Interrupt Service Routine should be as quick as possible and perform some reasonable amount of work. These two requirement are mutually contradicting. How do we achieve these two goals?

Bottom Halves

To do so, we divide the ISR into two parts called as halves. First one is top half which does the minimum required functions of for handling given interrupt and schedules the other half which is called as bottom half which does the bulk of the task later.
Bottom halves are required because as we know when ISR executes, it disables all other interrupts on running processor and same interrupt on all processors. To increase the response time and throughput of the system, we need to finish ISR as soon as possible.
Let’s take an example : A packet arrives at network card. Card has to generate an interrupt to kernel saying it has a packet. Now above dichotomy comes into picture. If kernel starts to process the whole packet in the ISR, network card will not be able to receive any packet till ISR if finish.
So in top of half of the ISR, kernel just acknowledges the reception of packet, copies it in main memory and network is again ready for the new packet.Rest of the processing is done by the bottom half at later point of time.
There are some considerations we should keep in mind while delegating work to bottom half:
If work is time sensitive, hardware specific or needs all other interrupts to be disable while being done, do it in interrupts top half, rest all can be delegated to bottom half.
Bottom halves are implemented using: Tasklets, softirqs or works queues.
Now let us look into some of the very basic questions asked about interrupts.

Difference between exception, trap and interrupt


Trap is usually called as software interrupts also. It is specifically added by the programmer using instruction to transfer control to a given handler. Best example is INT instruction. As it is added explicitly by the programmer, we always know when trap will occur. Trap and simple function call are same except that trap stores registers on stack and we need to do IRET in order to return from it.


Exceptions occur when program as encounter a conditions which is not as per the expectation. Exceptions are usually called as automatically generated traps, however there is no specific instruction associated with exceptions. Common causes of exceptions are divide by zero, executing illegal op-code or illegal memory access, break point exceptions, overflow exception etc. As soon as any condition above is encountered, current execution is suspended and control passes to exception handler. Exception handler decides what to do like aborting the process or restarting it etc.


Interrupts do not have anything to do with currently executing instruction. These are external to the process currently being executed. These are generated by like receiving a packet on network card, pressing a key on keyboard, serial ports, parallel ports, timers etc. CPU after executing the current instruction, passed control to Interrupt handler and once interrupt handler finishes, control returns back to the next instruction of the program which was interrupted.

Interrupts and bottom halves

In any operating system, interrupts are very important because these provide mechanism to handle asynchronous events from various sources like keyboard, network cards, timers etc.These devices do not need CPU at all the time but when they need it, they should be able to request it. Interrupt provide that mechanism. Else we need to rely on the polling by CPU.
Let’s understand what exactly interrupts are. Going by the literal meaning of word interrupt, we can say that anything which is external to currently executing task and alters the execution paths of the task is an interrupt. New task which is executed is Interrupt Service Routine (ISR). Once ISR finishes, control goes back to the next instructions of previously executing task.
Typical handling of an interrupt

Classification of interrupts

There are two source of interrupts :Hardware interrupts and software interrupts. As names suggest one is raised by hardware and other by the software.
Examples of hardware interrupts: Interrupts generated by the clocks external to CPU, network card interrupts etc. Software interrupts are divide by zero, illegal memory access, page faults etc.
Hardware interrupts are indicated on dedicated line to CPU and then CPU indicates the reception of such interrupt on Interrupt Acknowledgement line.

Interrupt Masking

If a task or operating system wants to ignore a particular interrupt, it can mask it. Masking of an interrupt will make sure that ISR for that interrupt is not executed even if interrupt has occurred.
Non-Maskable Interrupts (NMI) are interrupts which by nature can not be masked by task or operating system. Usually non maskable interrupts are used to sense state of sub systems external to CPU, if CPU sees that there is no activity on a given line for a particular sub system, it can take action. It avoids deadlocks in overall system.
In Pentium architecture, there is a pin each for raising non maskable and maskable interrupts.

Identifying an interrupt handler

Every interrupt has an interrupt number associated with it. This number helps CPU to identify which ISR needs to be invoked when a particular interrupt occurs. operating systems sets up an Interrupt Descriptor Table (IDT) at boot time. So handler pointer is stored at IDT[intrrupt_number].

Interrupt handling

Since we have suspended normal execution of a task, it is important that we resume that task as soon as possible. So two things come up here : We need to return where we came from, so save the return address, and we need to exit ISR as soon as possible, hence no big processing here.
What else? What if other interrupt comes while are servicing an interrupt? We need to take call, what all interrupts can be serviced from this ISR, rest all are masked, so that they are ignored. Even if the same interrupt comes while in ISR, we may need to ignore it.
Since we are going to resume task we suspended, we need its register states. So save all registers states and restore it once ISR is executed.
Apart from these three basic actions, do the interrupt specific processing and return as soon as possible.
One more design issue which we need to take care is what if two interrupts occur at same time? We need to associate priority to each interrupt and based on the priority of interrupts, execution order is decided. 
In next post we will discuss what are bottom halves, how interrupts is actually written in operating systems and various APIs available in Linux kernel.

Scheduling : O(1) and Completely Fair Scheduler (CFS)

O(1) and Completely Fair Scheduler (CFS)

In this post we discussed about scheduling, various approaches for scheduling. Today we wold see how scheduling is implemented in Linux kernel.
Earlier versions of Linux kernel used round robin approach which was implemented using circular queue. It was quite simple to implement and was fast.
In Linux kernel 2.2, scheduling classes were introduced. Process were divided in three classes namely : Real-time, non-preemptible and normal or non-real time processes.
In Linux Kernel 2.4, scheduler came with O(N) complexity, N being number of tasks currently in system in runnable state and. It maintained a queue,whenever there was scheduling needs to be done, it goes through all the processes in queue and selects the best process (based on priority) to be scheduled next.
There was a function called as goodness(), which compared every process’s goodness, and the process having highest goodness was scheduled first.
In Linux Kernel 2.6, complexity of scheduler was reduced to O(1) and hence the name of the schedule became O(1) scheduler. This scheduler was not required to scan all tasks in runnable state as in O(N) scheduler in Kernel 2.4. Idea behind it was very simple. There were two run queues for each priority level. As we will see later, there are 0 to 140 priority levels in Linux, there will be 280 such queues.
Two for each priority because one is active processes while the second is for expired process.
Process is moved to expired queue when its time slice is completed. When all processes in active queue exhaust their time -slice, simply expired queue is made active. 
To identify the task to be scheduled next, scheduler just need to dequeue task from per priority queue.
It also introduced interactivity metrics.Figure shows the arrangement.
O(1) Scheduler

Completely Fair Schedule

Due to unwieldy nature of O(1) scheduler, there came another scheduler called as Completely Fair Scheduler. 
As name suggested, this scheduler tries to be fair with all processes by giving them fair chance to execute on processor. Idea is simple again. Keep track of time a process has executed on processor, processes which have got small amount of time are boosted to get the processor, those who got bigger amount of time are thwarted.
Biggest change which came with this scheduler was data structure used to maintain runnable processes list. Till now all implementations used linked list based queues. This implementation used Red-Black Tree to maintain runnable processes which this property : All nodes which have got small time on processor are stored on left side of root, while those who have got larger time on processor are stored on right hand side. Whenever kernel wants to select next process for scheduling, it takes the left most node of RB Tree and grants processor to it. 
While insert operation being performed in RB tree, the left most child value is cached in sched_entity, so it is fast lookup there.

Red Black Tree has this very interesting property that is is self balancing tree and no path is more than twice as long as any other path in tree and all operations in RB tree are done in O(logn) complexity.

Internal implementation

We know that process in kernel is represented as task_struct. However,the scheduler does not use this structure, it uses another structure sched_entity which is then included in task_struct.

sched_entity contains pointer to node in red black tree which in turn contains pointers to left and right children and color of parent node.
As I mentioned earlier, every process has associated run time, it denotes time it has got at processor. This time is stored in sched_entity as vruntime. This vruntime is key for moving process node from one side of RB tree to another. Figure shows relationships between various data structures used in scheduling. In next post we would see how priorities are calculated and used.



Every process has two priorities associated with it. One is called as nice value of process. It ranges from -20 to 19, default being 0. Lower the nice value, higher the priority. If there are two processes having nice value as 5 (process 1)and 10 (process 2), process 1 has high priority.
We can check nice values of process’s nice values using ps -el command on shell.
In Kernel space it is translated as MAX_RT_PRIORITY  + 20 + nice value. It is stored in static_prio field of the task_struct.
Second priority is real-time priority. These have opposite notion, means higher the value, higher the priority of the process. It ranges from 0 to 100. 
There is one more field which is present in task_struct called as prio, which in turn stores the effective priority of process which is being considered by scheduler to boost or thwart priority in order to avoid cases like priority inversion. 
In earlier schedulers priority values were used in decided the next process and the time slice it gets on the processor. There were many short-comings in that approach , very well described in book “Linux Kernel Development” by Robet Love.
In CFS, instead of using priorities to decide absolute time a process gets on processor, portion of processor time a process ought to get on processor was calculated, based on the overall load of the system. 
If a process is getting less than what it should get, it will eventually move towards the left side of RB tree and get the processor time.

Scheduling classes

There are three scheduling classes. Every process is associated with one of the class.
1.SCHED_FIFO : This is class which classifies all real-time processes. There is no preemption, process with equal priority will not preempt process already running. 
2.SCHED_RR : It is same as SHCED_FIFO but there is time slice associate with process, process leaves processor when time slice finishes.
3.SCHED_NORMAL/SHCED_OTHER : All processes which are not real time as put under this class.
These classes are stored in queue in order SCHED_FIFO, SCHED_RR, SCHED_NORMAL, scheduler goes from on class goes through them and first class which has a runnable process gets to decide which process gets the processor next.

Process Scheduling

Process scheduling algorithms

Linux is multitasking operating system. It gives feel to user that there are many tasks which are running simultaneously in system. However, there is only one processor, which can execute only one task at a time, how is this possible? This is where scheduling of process come into picture.

Kernel, wisely switches processes from processor to give feeling of simultaneous execution. The switching rate should be decided cautiously, as if it is too low, task would not look simultaneous, if it too high, kernel will be too busy switching processes. Following are the criteria which are usually looked for when deciding on scheduling approach:
1. Throughput : Number of processes which complete in per unit of time
2. Response time : It is difference between the time when process is taken for execution and the time it was submitted.
3. Turnaround time : It is difference between time when process finishes execution and time when it was submitted.
4. Fair chance : It should not happen that some process starve in need of process for execution.

Following are the approaches which are usually adopted by operating systems for scheduling :

1. First Come First Serve

This approach works on FIFO principle i.e the process which comes first is scheduled first. It gives very good response time but throughput and turnaround time may not be very good. Process moves to wait queue when it request some I/O. Other process is selected from head of runnable queue and granted processor.

For example if arrival rate of process is as follows, execution sequence is given below

CPU burst

2. Round Robin

Kernel gives a particular quantum of time to every process for which it can execute on processor. If that quantum finishes or process requests I/O, processes is taken at back of queue till the time again gets the time to execute.It’s non preemptive approach.

Round Robin approach
For example if arrival rate of process is as follows, execution sequence is given  below
CPU Burst

3. Shortest Job First

In this approach, kernels selects a process which has shortest time for execution to finish. It reduces the average response time.  There is another preemptive variant of it called as Shortest Remaining Time First. It takes into consideration the remaining time of task and task can be preempted in between unlike Shortest Job First where process runs to completion.

For example if arrival rate of process is as follows, execution sequence is given  below:

CPU Burst

There are two terms we should be knowing while learning scheduling.
One, is CPU bound and I/O bound processes. If a process uses CPU most of the time during execution, it is called as cpu bound process. If a process does I/O operation most of the time and uses CPU sporadically, process is called as I/O bound process.

Second is preemptive and non-preemptive : If other process can preempt or interrupt execution of current process and move it from the process, we call it as preemptive system. If other process does not have that capability, it is called non-preemptive system.
Above explanations of scheduling algorithms are too simplistic. Let’s spice them up 🙂 As in Round Robin approach, every process gets equal amount of time on process, that means kernel treats every process with equal priority. Is that the case in real word? No.

So let’s take priority into consideration now during scheduling a process. This is called as priority scheduling. Of all the process which are runnable, the process which has highest process will get the processor.

There are two kinds of priorities associated with a given process

1. Static priority : It is the priority which remains same throughout the lifetime of process.
2. Dynamic priority : This is adjusted during run time by the process using some parameters like remaining time, base priority etc.  Scheduler uses this priority of boost chances of execution of process which are not scheduled for long time or decrease chances of process of execution which are executed for long time.