Thursday, April 26, 2012

OS: It all depends on how you say it

This is the best one-line definition of an Operating System, I have come across so far.
Source: http://tldp.org/LDP/khg/HyperNews/get/devices/whatis.html

"An operating system is essentially a privileged, general, shareable library of low-level hardware and memory and process control functions and routines."

It kind of, takes away the halo that surrounds the word 'kernel' and 'operating system'. Isn't it?

Sunday, April 22, 2012

Linux: Why 'to recurse' is NOT divine in the kernel

In CS folklore, it's often stated - "To iterate is human, to recurse divine". However, someone involved with kernel programming is likely to reject this belief outright - Reason being the limited size of the per process allocated 'kernel-stack'.

Although, this detail is often carefully pushed under the wraps, the fact is that whenever a process (task) executing in user-space makes a system-call, the kernel code starts utilizing the kernel stack to support function calls (being made while executing the kernel code). Now, this per-process kernel stack happens to be very small - generally 2 pages, which roughly translates to 8 KB or 4 KB, depending on page-size. And this size is fixed: the kernel stack can't dynamically expand like the user-space stack, and therefore, we don't have the same kind of liberty to define random local variables or make recursive calls in the kernel-land as in user-land. Also, for the x86 architecture, the data structure that defines each process (task): task_struct, is stored at the end (end as in at a lower memory address) of kernel stack (for stacks that grow down towards lower memory addresses), thus effectively reducing the size of usable kernel stack. So, if the stack pointer keeps on increasing in lieu of say, repetitive function calls, the kernel stack may ultimately encroach upon the 'task_struct' object for that process, corrupting it, and thus leading to a kernel crash. However, in practice, the fixed kernel-stack size of 8 KB (or 4 KB) has been found to be good enough.

The rationale of keeping task_struct at the bottom of kernel stack is that, in the x86 architecture, we have few processor registers, and this approach allows us to extract the task_struct of a process through the stack-pointer itself (stored in %esp). If the page-size is 8 KB, then masking off the lower 13 bits of the stack pointer, gives us the address of task_struct object. If the page-size is 4 KB, just mask off the lower 12 bits off the stack-pointer, and we have the address of task_struct.

* task_struct was used prior to 2.6 kernel release; in later releases, thread_info struct is used, which has a pointer to the task_struct. However, thread_info also resides at the end of stack in x86 architectures.

Reference:
Linux Kernel Development, Robert Love (3rd Edition)

Saturday, April 14, 2012

Address Alignment of a struct

Consider the following struct on a 64-bit Linux machine:

struct example{

unsigned int x;
unsigned int * ptr1;
unsigned int * ptr2;

}

Now the general rule that is followed in alignment of fundamental data types (int, short etc.) is that they are "self-aligned" - which means that a variable gets aligned at a z-byte aligned address where 'z' is the size of the variable. So, a unsigned int variable on a 64-bit machine would get aligned at a 4-byte aligned address (unsigned int is of size 4 bytes on a 64-bit machine that follows LP64 standard - which Linux does) and any pointer at a 8-byte aligned address.

However, it turns out that this pretty intuitive "self-aligned" rule is applicable only for the fundamental data types. For user-defined structs, as in the example above, the rule is a bit different. Instead of the struct getting aligned according to the alignment requirements of its first data member (this is what I expected initially), it gets aligned according to that data-member which has the largest size, and consequently strictest data-alignment requirements. So, the 'struct example' above would start at an 8-byte aligned address (because of the 8-byte pointers), even though its first data member has a looser requirement of a 4-byte aligned address.

Things like these become extremely important when you are unlucky enough to be reading a raw dump of kernel memory with just a struct definition in hand. Even the slightest of mistakes like this as outlined here, can throw all your analysis out of the window.

References:

2. Also, see this for a nice table showing the sizes of fundamental data types on 32 bit and 64 bit machines following various standards (including LP64 - the one followed by Linux).




Monday, April 2, 2012

SLEEP: There is more to it than meets the eye

Today, I learnt a new thing. We all remember from our OS class that a Process (referred to as 'task' in Linux) can be in 'Running', 'Ready', 'Blocked'/ 'Sleeping' state etc. etc. However, in Linux, the 'Blocked' state is really represented by two states: TASK_INTERRUPTIBLE & TASK_UNINTERRUPTIBLE.

Following are the five states defined for a process in linux kernel 0.01:
Source File: include/linux/sched.h

#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define TASK_ZOMBIE 3
#define TASK_STOPPED 4

TASK_INTERRUPTIBLE: As the name suggests, a task (process) in this state can be interrupted by a signal. Thus, a process in this state has two ways to be woken up from its slumber:
1- Event notification for the event for which this particular process went to sleep in the first place.
2- A signal interrupt came (say, SIGKILL, SIGTERM etc.), and the process returns from sleep after properly handling the signal through a signal handler.

TASK_UNINTERRUPTIBLE: As the name suggests again, a task in this state can't be interrupted; not even by the old faithful `kill -9 ` command. It so happens, that sometimes tasks are at such a point in their execution history that they are not expected to be ever interrupted by any signal and they wake up from sleep only on getting a notification for the event, for which they went to sleep. One reason to do this that I know of, is to avoid corruption of kernel data structures.
A typical example of a task in this state, is of a task that is waiting for IO to complete on an underlying hardware device. That is why, you can't kill a task waiting on completion of IO even by `kill -9`. Another example is of a task trying to access NFS mounted files. And if by chance, your NFS server is down, then you can pretty much expect the task to sleep infinitely.

References: