Computer architects have long used instruction prefetching to improve the performance of operating system (OS) intensive workloads. Sophisticated instruction prefetchers are implemented mostly in hardware. They record a program's execution history in dedicated structures and use this information to prefetch when a known execution pattern repeats. However, the storage overheads of these structures are prohibitively high (64–200 KB per core). We show that in OS-intensive applications, i-cache misses are mostly clustered in small execution blocks that follow OS events such as interrupts, system calls, and context switches. We propose a novel technique that identifies and prefetches these execution blocks using a combination of hardware and software modifications. Our technique requires only 4 additional registers per core, yet it still improves performance by up to 14% (mean: 7%) over state-of-the-art instruction prefetchers on a suite of 8 OS-intensive applications.