Page fault from IRQL (0xD1)

This blog post is about an interesting dump file I came across on our forum.
So far an IRP was sent from the Logitech Webcam down the driver stack of the USB parent driver, and then the USB 3.0 host controller. The IRP was changing the power state of this device and incurred a page fault at Dispatch IRQL, a big no-no.



Doubly linked lists

In this article I will talk about linked lists, specifically doubly linked lists.

A linked list is a sequential data structure that allows memory locations to be ‘linked’ together in a sequence, anywhere across an entire memory address space.

Windows was designed as a fast, reliable operating system, which meant support for linked sequences of memory allocations, rather than just simple array data structures, was necessary. Bear in mind that in the low-level operations of Kernel execution, everything must be completed in a timely manner. Adding and removing values from arrays would prove very costly, because every entry in front of or behind the target location would have to be shifted.

A linked list acts like a table, with data and links. The data stores the necessary value, and the links provide the forward (or backward, in doubly linked lists) pointer to the neighbouring entry.
Every linked list has a list head entry, which contains two fields, flink and blink, pointing to the first and last entries in the list. In the following image, you can see the parsed data structure of linked lists in Windbg.


You can also see the ActiveProcessLinks entry in the EPROCESS data structure. This links together the EPROCESS data structure instances of all the processes running on the system; each process has its own instance.


Finally, driver developers can use the Windows API to aid in driver development. Two API functions, InsertHeadList and InsertTailList, add an entry at the start or end of the list. However, you might be wondering: what if I want to insert an entry into the middle of the list? Well, you can. All that happens is that a temporary variable holds the flink (or blink) of a neighbouring entry while the flink and blink pointers of the entries on either side are updated to point at the new entry, inserting it into the middle.

You can pass the link values of entry two into the new entry, three, thus adding a new entry into the middle.

So now we’ve looked at how entries are added. Removal is a very similar process: it simply involves changing the flink and blink values of the entries before and after the target so that they point to each other, which prevents pointers dereferencing null or freed memory addresses.
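The insertion and removal steps described above can be sketched in a few lines of Python. This is only an illustrative model, not the Windows implementation: the real InsertHeadList and InsertTailList are C routines that work on LIST_ENTRY structures, and the names `ListEntry`, `insert_after`, `remove_entry` and `walk` are hypothetical helpers made up for this sketch.

```python
class ListEntry:
    """Hypothetical stand-in for a LIST_ENTRY: one forward and one backward link."""

    def __init__(self, name):
        self.name = name
        self.flink = self  # an empty list head points back at itself
        self.blink = self


def insert_after(prev, entry):
    """Link entry between prev and whatever used to follow prev."""
    nxt = prev.flink          # temporary variable holding the old forward link
    entry.flink = nxt
    entry.blink = prev
    prev.flink = entry
    nxt.blink = entry


def insert_head(head, entry):
    """Insert at the start of the list (right after the head)."""
    insert_after(head, entry)


def insert_tail(head, entry):
    """Insert at the end of the list (right after the current last entry)."""
    insert_after(head.blink, entry)


def remove_entry(entry):
    """Make the neighbours point at each other, skipping entry."""
    prev, nxt = entry.blink, entry.flink
    prev.flink = nxt
    nxt.blink = prev
    entry.flink = entry.blink = entry  # leave no stale links behind


def walk(head):
    """Follow flink from the head until we come back round to it."""
    names = []
    cur = head.flink
    while cur is not head:
        names.append(cur.name)
        cur = cur.flink
    return names


head = ListEntry("head")
one, two, four = ListEntry("one"), ListEntry("two"), ListEntry("four")
for e in (one, two, four):
    insert_tail(head, e)

three = ListEntry("three")
insert_after(two, three)   # insert into the middle, after entry two
remove_entry(one)          # remove the first entry
print(walk(head))
```

Note that nothing is shifted at any point; an insert or a removal only ever touches the links of the entries on either side, which is the advantage over arrays mentioned earlier.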

Let’s look at a real example using Windbg.


You might be thinking, well this is simply a block full of addresses. Well you wouldn’t be wrong. But I’ll state what they are.
The first column is the address of that entry. The second column is the flink pointer, and the third column is the blink pointer.
Notice anything strange?
The second row, and fourth row aren’t in use, and thus are invalid. Referencing them would result in a bugcheck.
They’re in the process of removal, as the address is still part of the list, but they aren’t being referenced. Notice how the flink and blinks of the entries before and after skip the address?
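One way to picture an entry that its neighbours already skip is a simple consistency check: for a properly linked entry, following flink and then blink (or blink and then flink) should bring you back to the same entry. The sketch below is a hypothetical Python model of that idea, with made-up names (`Entry`, `link_ring`, `is_linked`); it is not taken from the dump above.

```python
class Entry:
    """Hypothetical list entry with forward and backward links."""

    def __init__(self, name):
        self.name = name
        self.flink = self
        self.blink = self


def link_ring(entries):
    """Link the given entries into a circular doubly linked list, in order."""
    n = len(entries)
    for i, e in enumerate(entries):
        e.flink = entries[(i + 1) % n]
        e.blink = entries[(i - 1) % n]


def is_linked(entry):
    """True if both neighbours still point back at this entry."""
    return entry.flink.blink is entry and entry.blink.flink is entry


head, a, b, c = Entry("head"), Entry("a"), Entry("b"), Entry("c")
link_ring([head, a, b, c])

# Half-finished removal of b: the neighbours skip it,
# but b itself still holds its old flink and blink values.
a.flink = c
c.blink = a

for e in (a, b, c):
    print(e.name, is_linked(e))
```

Here `a` and `c` pass the check while `b` fails it, which is the same pattern as neighbours whose flink and blink skip an address.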

Finally, I thought it would be a good idea to show an example of how useful linked lists can be.


This command shows various fields of information for timers which have made changes to the timer resolution on the system.
All of which is very useful in keeping track of which process has altered the timer resolution. It allows the system to identify which processes still have to restore the timer resolution, acting as a safety mechanism to prevent issues with thread execution later on.
Discussing timer processing is beyond the scope of this blog post, so I will not go into detail as to how Windows performs these operations. For more information on this, see Windows Internals Part 1, Chapter 3, System Mechanisms.


Cyclic Redundancy Checks

First of all, I want to apologise for not posting for a long time. I’ve been very busy with work at college, I’ll try to post more regularly.

In this post I want to talk about Cyclic Redundancy Checks, and what they are. In the most basic form, CRCs are an error checking method for digital transmission of data. In short, they use polynomial division to check for inconsistencies within data. As you know, data is in the form of binary, which has base two, so each bit is either 0 or 1. The message is a series of bits, and alongside it there is an agreed key word, which is the divisor. The transmitter and receiver both have the value of the divisor. The transmitter divides the message (the dividend) by the divisor and sends the remainder along with the message as the check value. When the message arrives, the receiver performs the same division, and if the remainder is the same, then the data should be intact.

The key word (divisor) is usually presented in the form of a generator polynomial, the coefficients of which are the binary bits of K. This is probably difficult to picture, so I’ll explain more. Suppose we want to use the number 36 as K. This number is written as 100100 in binary, and as x^5 + x^2 as a polynomial expression. The remainder is always at least one bit shorter than K, so if K has k bits the CRC has k-1 bits; in this case K has 6 bits, so we are left with a 5 bit CRC. The longer the divisor, the slimmer the chances of a corrupted message going undetected.

Alright, so suppose we want to send message M, which has the value of 110101101000101100010111. We then perform long division on the bits by our generator polynomial K = 100100. All we need to know is the XOR method, which essentially means if two bits are the same, they equal zero, and if they are different, they equal one.

0 XOR 0 = 0    1 XOR 0 = 1    0 XOR 1 = 1    1 XOR 1 = 0

binary crc

So here in this picture, I’ve used binary long division to work out the remainder, and this remainder is the CRC. So when a message M is sent, the CRC is sent along with it, so that dividing by K at the other end should leave a specific remainder. When the data doesn’t match up, the remainder will be different to what it should be, which means the integrity of the data has been compromised.
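The long division can also be written as a short Python sketch. It assumes the common convention of appending k-1 zero bits to the message before dividing, which may not be exactly how the worked picture was laid out, and the function names (`xor_divide`, `crc`, `check`) are made up for illustration. The small 1011 divisor is only there because it is easy to follow by hand.

```python
def xor_divide(bits, divisor):
    """Return the remainder of XOR (modulo-2) long division of bits by divisor."""
    work = [int(b) for b in bits]
    div = [int(b) for b in divisor]
    # Slide the divisor along; wherever the leading bit is 1, XOR it in.
    for i in range(len(work) - len(div) + 1):
        if work[i] == 1:
            for j, d in enumerate(div):
                work[i + j] ^= d
    # What is left in the last k-1 positions is the remainder.
    return "".join(str(b) for b in work[-(len(div) - 1):])


def crc(message, divisor):
    """Pad the message with k-1 zeros (an assumed convention) and divide."""
    padded = message + "0" * (len(divisor) - 1)
    return xor_divide(padded, divisor)


def check(message, crc_bits, divisor):
    """Receiver side: message followed by its CRC should divide with no remainder."""
    return int(xor_divide(message + crc_bits, divisor), 2) == 0


# Small example that can be checked by hand.
small = crc("11010011101100", "1011")
print(small)
print(check("11010011101100", small, "1011"))   # intact message
print(check("11010011101101", small, "1011"))   # last bit flipped in transit

# The message and divisor used in this post.
M = "110101101000101100010111"
K = "100100"
print(crc(M, K))
```

With this convention the receiver does not need to compare two remainders; it simply divides the message plus CRC by K and expects zero.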

So what happens if something like this occurs in Windows? Well in the following scenario, there’s a bugcheck that keeps occurring, but the cause isn’t a driver, it’s a hardware component.

3: kd> .bugcheck
Bugcheck code 0000007A
Arguments fffff6fc`00e500a0 ffffffff`c000003f 00000003`182a3860 fffff801`ca014d84

//Kernel inpage error, kernel data couldn’t be brought in from disk

3: kd> !error ffffffffc000003f
Error code: (NTSTATUS) 0xc000003f (3221225535) – {Bad CRC}  A cyclic redundancy check (CRC) checksum error occurred.

//Here’s our redundancy check, the data is corrupt, it doesn’t match what was originally written.

3: kd> k
Child-SP          RetAddr           Call Site
ffffd000`3cd0d5c8 fffff803`881ade2f nt!KeBugCheckEx //Data is corrupt, bugcheck
ffffd000`3cd0d5d0 fffff803`8806a0ac nt!MiWaitForInPageComplete+0x3177f //Wait for the data to be paged in
ffffd000`3cd0d6c0 fffff803`88081c44 nt!MiIssueHardFault+0x184 //Data isn’t present in memory, hard fault, page in from disk
ffffd000`3cd0d780 fffff803`8817642f nt!MmAccessFault+0x524 //Invoke the memory manager page fault handler
ffffd000`3cd0d930 fffff801`ca014d84 nt!KiPageFault+0x12f //Hit a page fault
ffffd000`3cd0dac8 fffff801`c9ff4b92 dxgkrnl!DxgkMiracastQueryMiracastSupport
ffffd000`3cd0dad0 fffff803`881779b3 dxgkrnl!DxgkNetDispQueryMiracastDisplayDeviceSupport+0x1a //Internal DirectX functions
ffffd000`3cd0db00 00007ff8`bc8b15ea nt!KiSystemServiceCopyEnd+0x13 //Transition to kernel mode
000000e3`3521e508 00000000`00000000 0x00007ff8`bc8b15ea //User mode

fffff801ca0148d4-fffff801ca01491a  71 bytes – dxgkrnl!DxgkHandleMiracastEscape+380 (+0x06)
[ 85 c9 74 2e ff c9 74 15:00 c7 84 24 08 02 00 00 ]
fffff801ca01491c-fffff801ca014951  54 bytes – dxgkrnl!DxgkHandleMiracastEscape+3c8 (+0x48)
[ 00 41 bd 40 00 00 00 45:41 89 cc 45 2b e3 41 89 ]
WARNING: !chkimg output was truncated to 50 lines. Invoke !chkimg without ‘-lo [num_lines]’ to view  entire output.
3988 errors : !dxgkrnl (fffff801ca014000-fffff801ca014fff)

So we can see all these errors, 3988 to be precise. Now was it the RAM or the disk? Well, it’s difficult to say with one dump file, so I analysed a few more, although the error was different. It was a failing disk that simply disappeared, thus crashing the system. The error in this case was almost certainly due to part of the disk which had stopped functioning correctly, and therefore corrupted kernel data.

Device/Driver Objects and Stacks

Today I thought I’d write a bit about device stacks and driver stacks and how they implement IRPs.
I’m not going into detail on how drivers function and the types of drivers as I would be here all day so I’ll save that for another time.

What is a device object and a driver object?

A device object is an opaque structure that represents a device or function. It is an instance of the DEVICE_OBJECT data structure, which the operating system uses to represent a device.
Device objects don’t always represent a physical device; they can also represent a logical device.

A driver object represents the loaded image of a Kernel mode driver, and includes pointers to the driver’s routines.
When a driver initialises, it creates device objects to represent physical or logical devices.

Device stacks and Device nodes

The Kernel organises devices into a tree structure called the Plug and Play device tree, which contains device nodes that represent devices. Do note that some nodes represent software components which don’t have any physical devices attached to them.

A device stack contains a PDO (Physical Device Object) which represents the physical device connected to a physical bus on the motherboard. In this case I’ll talk about the PCI bus as an example.
The PCI bus driver enumerates the child devices which are connected to the PCI bus on the motherboard and creates a PDO for each device, which is then represented by a device node in the PnP device tree.
Do note that the type of driver pci.sys is depends on your perspective: if you’re looking at the PCI bus device node then it’s the function driver, but if you’re looking at one of the child devices of the PCI bus node then it’s the bus driver.

After the device node has been associated with the new PDO the PnP manager then searches the registry for the driver(s) which needs to be part of the device stack, these drivers are called Function Drivers.

Here’s a small point about the drivers usually found in device stacks:

  • Bus drivers detect the devices on their bus and inform the PnP manager about them, as well as controlling the power to the bus. Only one bus driver is allowed per bus, and Microsoft normally supplies them.
  • Function drivers, on the other hand, are the main drivers that represent the device and perform the basic operations for reading and writing. The function driver is the driver that knows the most about its device.
  • Filter drivers modify the device behaviour when needed and are located above or below the function driver. A filter normally fixes errors that are detected before a request reaches the function driver on the stack.

0: kd> !devstack fffffa8004615680
!DevObj           !DrvObj            !DevExt           ObjectName
fffffa8005f8c150  \DRIVER\VERIFIER_FILTERfffffa8005f8c2a0
fffffa8005f8c390 *** ERROR: Module load completed but symbols could not be loaded for GEARAspiWDM.sys
fffffa8005f202e0  \DRIVER\VERIFIER_FILTERfffffa8005f20430
fffffa8005eec060  \Driver\cdrom      fffffa8005f77b80  CdRom0
fffffa80057379b0  \Driver\ACPI       fffffa80047f6a00
> fffffa8004615680  \Driver\atapi      fffffa80046157d0  IdeDeviceP0T1L0-5
!DevNode fffffa8005742900 :
DeviceInst is "IDE\CdRomATAPI_iHAS124___B_______________________AL0R____\5&f437ab5&0&0.1.0"
ServiceName is "cdrom"

This is the device stack for the CD drive in the computer, which shows the associated device objects and driver objects within it.

  • Atapi provides the interface to enable support for CD drives.
  • ACPI is the bus filter driver that enables Power Management for the operating system, so when a device is not in use (in this case the CD drive) it can be powered off.
  • cdrom is the function driver for the CD drive that allows discs to be read and written to.
  • GEARAspiWDM.sys is a 3rd party filter driver for cdrom.
  • VERIFIER_FILTER entries are filter drivers used by Driver Verifier, which is enabled to monitor driver routines and operations to make sure everything is working correctly.

For more information on Driver Verifier see here: Driver Verifier (Windows Drivers)

A driver stack is the set of drivers that take part in processing an IRP as it is passed down a device stack, or in some cases multiple device stacks.
A driver object can be associated with multiple different device objects and therefore lots of device stacks, which means an IRP can be passed down lots of device stacks while only being serviced by a few drivers.
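To make the difference between device objects and driver objects concrete, here is a rough Python model. It is only a sketch: the classes and helpers (`DriverObject`, `DeviceObject`, `build_stack`, `send_irp`) are hypothetical and look nothing like the real Kernel structures. The first stack mirrors the CD drive stack above without the filters, and the second "disk" stack is invented purely to show one driver appearing in two stacks.

```python
class DriverObject:
    """Hypothetical model of a driver: one object per loaded driver."""

    def __init__(self, name):
        self.name = name
        self.devices = []  # every device object this driver has created


class DeviceObject:
    """Hypothetical model of a device object: one per device the driver handles."""

    def __init__(self, driver):
        self.driver = driver
        self.lower = None  # next device object down the same device stack
        driver.devices.append(self)


def build_stack(*devices):
    """Attach the device objects top to bottom and return the top one."""
    for upper, lower in zip(devices, devices[1:]):
        upper.lower = lower
    return devices[0]


def send_irp(top):
    """Return the names of the drivers that see an IRP passed all the way down."""
    path = []
    dev = top
    while dev is not None:
        path.append(dev.driver.name)
        dev = dev.lower
    return path


cdrom, disk = DriverObject("cdrom"), DriverObject("disk")
acpi, atapi = DriverObject("ACPI"), DriverObject("atapi")

# Two separate device stacks that share the ACPI and atapi drivers.
cd_stack = build_stack(DeviceObject(cdrom), DeviceObject(acpi), DeviceObject(atapi))
disk_stack = build_stack(DeviceObject(disk), DeviceObject(acpi), DeviceObject(atapi))

print(send_irp(cd_stack))
print(send_irp(disk_stack))
print(len(acpi.devices))
```

Each stack has its own device objects, but the ACPI driver object owns one device object in each of them, which is why its device object list can grow so long.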

0: kd> !drvobj \Driver\ACPI
Driver object (fffffa80039a6af0) is for:
Driver Extension List: (id , addr)

Device Object list:
fffffa80057379b0  fffffa800573a9b0  fffffa80057399b0  fffffa800572fc20
fffffa800572fe40  fffffa800572ea00  fffffa800572ec20  fffffa800572ee40
fffffa800572da00  fffffa800572dc20  fffffa800572de40  fffffa8005819e40
fffffa8005814e40  fffffa800580fe40  fffffa800572a9b0  fffffa80057289b0
fffffa8005720e40  fffffa800571c920  fffffa800571cb20  fffffa800571bc40
fffffa800571be40  fffffa8005713bc0  fffffa8005700e40  fffffa80056ffa40
fffffa80056ffc40  fffffa80056ffe40  fffffa80056fea40  fffffa80056fec40
fffffa80056fee40  fffffa80056fda40  fffffa80056fdc40  fffffa80056fde40
fffffa80056fca40  fffffa80056fcc40  fffffa80056fce40  fffffa8004616330
fffffa8004616040  fffffa8004616c20  fffffa8004616e40  fffffa80047f9770
fffffa80039eadb0  fffffa8004be1060  fffffa80047fe170  fffffa80047fe390
fffffa80047fe5b0  fffffa80047fe7d0  fffffa80047fe9f0  fffffa80047fec10

So as shown here, the ACPI.sys driver is associated with a lot of device objects. It can’t just represent one device, otherwise only one hardware component would use ACPI and everything else would be powered on all the time; think about how many USB devices would be turned on.
So our CD drive is just one component that uses ACPI.

Finally we can see information about the IRP being sent by looking at the IRP data structure.

0: kd> dt nt!_IRP fffff9801c458dc0
+0x000 Type             : 0n6
+0x002 Size             : 0x238
+0x008 MdlAddress       : (null)
+0x010 Flags            : 0x40000000
+0x018 AssociatedIrp    :
+0x020 ThreadListEntry  : _LIST_ENTRY [ 0xfffff980`1c458de0 - 0xfffff980`1c458de0 ]
+0x030 IoStatus         : _IO_STATUS_BLOCK
+0x040 RequestorMode    : 0 ''
+0x041 PendingReturned  : 0 ''
+0x042 StackCount       : 5 ''
+0x043 CurrentLocation  : 1 ''
+0x044 Cancel           : 0 ''
+0x045 CancelIrql       : 0 ''
+0x046 ApcEnvironment   : 0 ''
+0x047 AllocationFlags  : 0x80 ''
+0x048 UserIosb         : (null)
+0x050 UserEvent        : (null)
+0x058 Overlay          :
+0x068 CancelRoutine    : (null)
+0x070 UserBuffer       : (null)
+0x078 Tail             :

Some of the entries are pretty obvious from the name and some aren’t documented, the ones that are can be found here:

IRP (Windows Drivers)

Power IRPs

I found an old 0x9F Kernel dump file which was caused by a power IRP not synchronising with the PnP manager.
Power IRPs are used to change the power state for a device and therefore they must reach the bottom of the device stack which is the physical device object.

A driver has failed to complete a power IRP within a specific time.
Arg1: 0000000000000004, The power transition timed out waiting to synchronize with the Pnp

Arg2: 0000000000000258, Timeout in seconds.
Arg3: fffffa8007005660, The thread currently holding on to the Pnp lock.
Arg4: fffff800053e83d0, nt!TRIAGE_9F_PNP on Win7 and higher

So we can see our 0x9F bugcheck with a power IRP failing to synchronise with the PnP manager because the IRP hasn’t reached the bottom of the stack.
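As a quick sanity check on the numbers, Arg2 is given in hexadecimal, so it is worth converting. The snippet below is just arithmetic on values quoted in this post: 0x258 from the bugcheck arguments, and the "Ticks: 38463 (0:00:10:00.026)" wait time from the !thread output further down. The variable names are my own, and the tick length is derived from those two figures rather than taken from any documentation.

```python
# Arg2 of the 0x9F bugcheck: the timeout in seconds, printed in hex.
timeout_seconds = int("258", 16)
minutes, seconds = divmod(timeout_seconds, 60)

# Wait time of the thread holding the PnP lock, as shown by !thread.
wait_ticks = 38463
wait_seconds = 10 * 60 + 0.026
ms_per_tick = wait_seconds / wait_ticks * 1000

print(f"timeout: {timeout_seconds} s = {minutes} min {seconds} s")
print(f"thread has waited {wait_seconds} s, roughly {ms_per_tick:.2f} ms per tick")
```

So the timeout is ten minutes, and the thread holding the lock has been waiting for just over ten minutes, which lines up with the bugcheck firing when it did.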

0: kd> !locks
KD: Scanning for held locks..

Resource @ nt!IopDeviceTreeLock (0xfffff80003492ce0)    Shared 1 owning threads
Contention Count = 1
Threads: fffffa8007005660-01
KD: Scanning for held locks.

Resource @ nt!PiEngineLock (0xfffff80003492be0)    Exclusively owned
Contention Count = 21
NumberOfExclusiveWaiters = 1
Threads: fffffa8007005660-01
Threads Waiting On Exclusive Access:

KD: Scanning for held locks……..
18855 total locks, 2 locks currently held

We can see two locks are held. Both are executive resources rather than spinlocks: IopDeviceTreeLock synchronises access to the device tree, and PiEngineLock is a PnP and power management lock. The PiEngineLock is exclusively owned by thread fffffa8007005660, the same thread named in Arg3, which is running in the ZTEusbnet driver.

0: kd> !thread fffffa80`07005660
THREAD fffffa8007005660  Cid 0004.0048  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
fffffa800d035ee8  NotificationEvent
IRP List:
fffffa8008f5cc10: (0006,03e8) Flags: 00000000  Mdl: 00000000
Not impersonating
DeviceMap                 fffff8a000008c10
Owning Process            fffffa8006f8d890       Image:         System
Attached Process          N/A            Image:         N/A
Wait Start TickCount      396427         Ticks: 38463 (0:00:10:00.026)
Context Switch Count      44059          IdealProcessor: 2  NoStackSwap
UserTime                  00:00:00.000
KernelTime                00:00:00.343
Win32 Start Address nt!ExpWorkerThread (0xfffff80003298150)
Stack Init fffff88003bd2c70 Current fffff88003bd2280
Base fffff88003bd3000 Limit fffff88003bcd000 Call 0
Priority 15 BasePriority 12 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`03bd22c0 fffff800`032845f2 : fffffa80`07005660 fffffa80`07005660 00000000`00000000 00000000`00000000 : nt!KiSwapContext+0x7a
fffff880`03bd2400 fffff800`0329599f : fffffa80`0d0df208 fffff880`0ae9e10b fffffa80`00000000 00000000`00000000 : nt!KiCommitThreadWait+0x1d2
fffff880`03bd2490 fffff880`0ae915dd : fffffa80`0d035000 00000000`00000000 fffffa80`0dd8ca00 00000000`00000000 : nt!KeWaitForSingleObject+0x19f
fffff880`03bd2530 fffff880`0ae92627 : fffffa80`0d035000 00000000`00000000 fffffa80`0c0891a0 fffff880`03bd2670 : ZTEusbnet+0x35dd
fffff880`03bd2580 fffff880`0215d809 : fffffa80`0c0891a0 fffff880`020f0ecd fffff880`03bd2670 fffffa80`091c5550 : ZTEusbnet+0x4627
fffff880`03bd25b0 fffff880`0215d7d0 : fffffa80`091c54a0 fffffa80`0c0891a0 fffff880`03bd2670 fffffa80`08fc2ac0 : ndis!NdisFDevicePnPEventNotify+0x89
fffff880`03bd25e0 fffff880`0215d7d0 : fffffa80`08fc2a10 fffffa80`0c0891a0 fffffa80`091f9010 fffffa80`091f90c0 : ndis!NdisFDevicePnPEventNotify+0x50
fffff880`03bd2610 fffff880`0219070c : fffffa80`0c0891a0 00000000`00000000 00000000`00000000 fffffa80`0c0891a0 : ndis!NdisFDevicePnPEventNotify+0x50
fffff880`03bd2640 fffff880`021a1da2 : 00000000`00000000 fffffa80`08f5cc10 00000000`00000000 fffffa80`0c0891a0 : ndis! ?? ::LNCPHCLB::`string’+0xddf
fffff880`03bd26f0 fffff800`034fb121 : fffffa80`091c7060 fffffa80`0c089050 fffff880`03bd2848 fffffa80`070bfa00 : ndis!ndisPnPDispatch+0x843
fffff880`03bd2790 fffff800`0367b3a1 : fffffa80`070bfa00 00000000`00000000 fffffa80`0dc19990 fffff880`03bd2828 : nt!IopSynchronousCall+0xe1
fffff880`03bd2800 fffff800`03675d78 : fffffa80`09196e00 fffffa80`070bfa00 00000000`0000030a 00000000`00000308 : nt!IopRemoveDevice+0x101
fffff880`03bd28c0 fffff800`0367aee7 : fffffa80`0dc19990 00000000`00000000 00000000`00000003 00000000`00000136 : nt!PnpSurpriseRemoveLockedDeviceNode+0x128
fffff880`03bd2900 fffff800`0367b000 : 00000000`00000000 fffff8a0`11d1c000 fffff8a0`049330d0 fffff880`03bd2a58 : nt!PnpDeleteLockedDeviceNode+0x37
fffff880`03bd2930 fffff800`0370b97f : 00000000`00000002 00000000`00000000 fffffa80`09122010 00000000`00000000 : nt!PnpDeleteLockedDeviceNodes+0xa0
fffff880`03bd29a0 fffff800`0370c53c : fffff880`03bd2b78 fffffa80`114ab700 fffffa80`07005600 fffffa80`00000000 : nt!PnpProcessQueryRemoveAndEject+0x6cf
fffff880`03bd2ae0 fffff800`035f573e : 00000000`00000000 fffffa80`114ab7d0 fffff8a0`123a25b0 00000000`00000000 : nt!PnpProcessTargetDeviceEvent+0x4c
fffff880`03bd2b10 fffff800`03298261 : fffff800`034f9f88 fffff8a0`11d1c010 fffff800`034342d8 fffff800`034342d8 : nt! ?? ::NNGAKEGL::`string’+0x54d9b
fffff880`03bd2b70 fffff800`0352b2ea : 00000000`00000000 fffffa80`07005660 00000000`00000080 fffffa80`06f8d890 : nt!ExpWorkerThread+0x111
fffff880`03bd2c00 fffff800`0327f8e6 : fffff880`03965180 fffffa80`07005660 fffff880`0396ffc0 00000000`00000000 : nt!PspSystemThreadStartup+0x5a
fffff880`03bd2c40 00000000`00000000 : fffff880`03bd3000 fffff880`03bcd000 fffff880`03bd2410 00000000`00000000 : nt!KxStartSystemThread+0x16

0: kd> !irp fffffa8008f5cc10
Irp is active with 10 stacks 10 is current (= 0xfffffa8008f5cf68)
No Mdl: No System Buffer: Thread fffffa8007005660:  Irp stack trace.
cmd  flg cl Device   File     Completion-Context
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
>[ 1b,17]   0  0 fffffa800c089050 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000

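I’m not sure why, but the ZTEusbnet driver isn’t completing the IRP; it’s just waiting on it while its thread holds the PnP lock, and that’s what caused the system to crash. Note that the major code 1b in the current stack location is IRP_MJ_PNP, which matches the surprise removal calls on the thread’s stack.
It’d be nice to know exactly why it didn’t pass the IRP on.
I’m not surprised though, given the date of the driver.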

0: kd> !devstack fffffa800c089050
!DevObj           !DrvObj            !DevExt           ObjectName
> fffffa800c089050  \Driver\ZTEusbnet  fffffa800c0891a0  NDMP14
fffffa80070bfa00  \Driver\usbccgp    fffffa80070bfb50  000000a8
!DevNode fffffa800dc19990 :
DeviceInst is "USB\VID_19D2&PID_0063&MI_04\6&200b5242&0&0004"
ServiceName is "ZTEusbnet"

We can see that it was meant to pass the IRP down to the USB common class generic parent driver (usbccgp) which, to put it simply, exposes each function of a USB composite device as a separate device. Passing it down to the USB bus driver should change the power state.

0: kd> lmvm ZTEusbnet
start             end                 module name
fffff880`0ae8e000 fffff880`0aebc000   ZTEusbnet   (no symbols)
Loaded symbol image file: ZTEusbnet.sys
Image path: \SystemRoot\system32\DRIVERS\ZTEusbnet.sys
Image name: ZTEusbnet.sys
Timestamp:        Mon Oct 13 06:50:10 2008 (48F2E192)
CheckSum:         000329ED
ImageSize:        0002E000
Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

This is a dump file from quite a while ago but if memory serves me correctly I think an update solved the issue.

Any other questions feel free to ask, I believe I’ve covered most things without going into detail about drivers.

Sources: Device nodes and device stacks (Windows Drivers)
Driver stacks (Windows Drivers)


I’ve not posted in a while but I found an interesting case on a forum and managed to acquire a Kernel memory dump.
I’m not going into detail about DPCs or interrupts as I have made blog posts on these in the past.

The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arg1: 0000000000000000, A single DPC or ISR exceeded its time allotment. The offending
    component can usually be identified with a stack trace.
Arg2: 0000000000000501, The DPC time count (in ticks).
Arg3: 0000000000000500, The DPC time allotment (in ticks).
Arg4: 0000000000000000

So here it states that we encountered a DPC which exceeded the allocated time for it to finish executing. The problem is that it went over this time, and as stated before DPCs can hold up the system when taking too long to execute which can result in lagging, a slow system or even sound cutting out.

So let’s look at our stack trace.

ffffd001`50c93c98 fffff800`9238bcc2 : 00000000`00000133 00000000`00000000 00000000`00000501 00000000`00000500 : nt!KeBugCheckEx
ffffd001`50c93ca0 fffff800`92271115 : 00000000`00000000 00000000`00000000 00000000`00000000 fffff801`dceabf17 : nt! ?? ::FNODOBFM::`string’+0x18b12
ffffd001`50c93d30 fffff800`929a07b5 : ffffe001`00400a02 fffff800`922fcae6 fffff801`daed3cf8 ffffe001`00008201 : nt!KeClockInterruptNotify+0x95
ffffd001`50c93f40 fffff800`922e80e3 : ffffd001`50c93f60 00000000`00000008 ffff5377`5487cf7d 00000000`0000000c : hal!HalpTimerClockIpiRoutine+0x15
ffffd001`50c93f70 fffff800`9236412a : ffffe001`9c600500 ffffe001`9e8de1a0 00000000`00000000 00000000`00000000 : nt!KiCallInterruptServiceRoutine+0xa3
ffffd001`50c93fb0 fffff800`92364a9b : 44454c49`4146203a 696c6564`206f7420 6e657665`20726576 20212121`20352074 : nt!KiInterruptSubDispatchNoLockNoEtw+0xea
ffffd001`50c853a0 fffff800`922e8383 : ffffe001`9e92d030 ffffe001`9e968030 00000000`02290a8d 00000000`00000018 : nt!KiInterruptDispatchNoLockNoEtw+0xfb
ffffd001`50c85530 fffff801`dcfa5751 : ffffe001`9e66e7a0 ffffe001`00000000 ffffe001`9e92fbe0 00000000`fffff850 : nt!KeAcquireSpinLockRaiseToDpc+0x13
ffffd001`50c85560 fffff801`dcfa531d : ffffe001`9e96b840 fffff801`dcf2c48f ffffe001`9eb68490 fffff801`dcf2c550 : athwbx+0x161751
ffffd001`50c855f0 fffff801`dcf60c42 : ffffe001`9e96b840 ffffd001`50c85650 ffffd001`50c85654 00000000`00000000 : athwbx+0x16131d
ffffd001`50c85630 fffff801`dcf33472 : ffffe001`9e9bf030 fffff801`00000000 ffffd001`50c856d0 fffff801`dd074319 : athwbx+0x11cc42
ffffd001`50c85680 fffff801`dd0c129f : ffffe001`9e9bf030 ffffffff`ffffffff ffffe001`9e6d97e8 fffff801`dd011189 : athwbx+0xef472
ffffd001`50c856f0 fffff801`dd08679e : ffffe001`9e968030 00000000`00000000 00000000`00000000 00000000`00000000 : athwbx+0x27d29f
ffffd001`50c85720 fffff801`dae9e81e : ffffe001`9e961030 00000000`00000000 ffffd001`50c85790 00000000`00000000 : athwbx+0x24279e
ffffd001`50c85760 fffff800`92252130 : ffffd001`50c85b00 00000000`00000000 00000000`00000200 fffff800`92274ae0 : ndis!ndisInterruptDpc+0x269ce
ffffd001`50c85860 fffff800`9225134b : ffffd001`50c5c180 ffffe001`9e8f4010 ffffe001`9c46b900 ffffe001`a12f3080 : nt!KiExecuteAllDpcs+0x1b0
ffffd001`50c859b0 fffff800`923667ea : ffffd001`50c5c180 ffffd001`50c5c180 ffffd001`50c682c0 ffffe001`9dbbb540 : nt!KiRetireDpcList+0xdb
ffffd001`50c85c60 00000000`00000000 : ffffd001`50c86000 ffffd001`50c80000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x5a

So in this callstack we see our processor in an idle loop; when idle it tends to execute any DPCs waiting in the DPC queue.
It begins to execute all the DPCs in the queue (also known as draining) and gets to an ndis interrupt DPC. This begins to call network driver functions, which then try to acquire a spinlock and raise to DPC/Dispatch IRQL if not already there (this is the standard routine that is used, I can’t remember if it is required). We then receive more interrupts, followed by a clock interrupt and a bugcheck.

Okay so we know that we bugchecked because a DPC was taking too long to finish executing and risked holding up the system, especially where spinlocks are concerned.

The main thing that interests me is why is there a clock interrupt?

3: kd> !dpcs
CPU Type      KDPC       Function
 3: Normal  : 0xffffe0019e66e880 0xfffff801dae78eb0 ndis!ndisMTimerObjectDpc
 3: Normal  : 0xffffd00150c61668 0xfffff80092327b28 nt!PpmPerfAction
 3: Normal  : 0xffffd0015589a280 0xfffff80092258854 nt!PopExecuteProcessorCallback
 3: Threaded: 0xffffd00150c617c0 0xfffff8009231a0a0 nt!KiDpcWatchdog

I believe the ndis dpc interrupt is related to this timer object but I may be wrong, if it is related then the clock interrupt makes sense as the system requires intervals for clock interrupts to take place in order to keep track of system time and logical run time for threads and timers. Processes can modify the clock interrupt interval for their needs to process timers much quicker, I’ll not go into detail as I will talk about timers another time.

The only problem is that I ran into a dead end, I couldn’t find anything related to the network driver in terms of modifying the clock interrupt timer.

3: kd> !list "-e -x \"dt nt!_EPROCESS @$extret-@@(#FIELD_OFFSET(nt!_EPROCESS, TimerResolutionLink)) ImageFileName SmallestTimerResolution RequestedTimerResolution\" nt!ExpTimerResolutionListHead"
dt nt!_EPROCESS @$extret-@@(#FIELD_OFFSET(nt!_EPROCESS, TimerResolutionLink)) ImageFileName SmallestTimerResolution RequestedTimerResolution
   +0x438 ImageFileName            : [15]  "???"
   +0x638 RequestedTimerResolution : 0x9c3d2000
   +0x63c SmallestTimerResolution  : 0xffffe001

dt nt!_EPROCESS @$extret-@@(#FIELD_OFFSET(nt!_EPROCESS, TimerResolutionLink)) ImageFileName SmallestTimerResolution RequestedTimerResolution
   +0x438 ImageFileName            : [15]  "svchost.exe"
   +0x638 RequestedTimerResolution : 0
   +0x63c SmallestTimerResolution  : 0x2710
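To put the svchost.exe value into more familiar units, here is a small conversion. It assumes the field is expressed in 100-nanosecond units, which is how Windows timer resolutions are normally expressed; I have not confirmed that for this particular field, so treat it as an assumption. The constant name is my own.

```python
# Assumption: the resolution is stored in 100-nanosecond units.
HUNDRED_NS_PER_MS = 10_000

smallest_resolution = int("2710", 16)   # SmallestTimerResolution for svchost.exe
resolution_ms = smallest_resolution / HUNDRED_NS_PER_MS

print(smallest_resolution, "x 100 ns =", resolution_ms, "ms")
```

Under that assumption, svchost.exe has asked for a 1 ms timer resolution. The first entry, with an image name of "???", has values that look more like fragments of pointers than resolutions, so I would not read much into it.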

So I thought I’d look into this a bit more.

3: kd> u ndis!ndisInterruptDpc+0x269ce
fffff801`dae9e81e 488b75a7        mov     rsi,qword ptr [rbp-59h]
fffff801`dae9e822 e9e297fdff      jmp     ndis!ndisInterruptDpc+0x1b9 (fffff801`dae78009)
fffff801`dae9e827 33d2            xor     edx,edx
fffff801`dae9e829 488d4dd7        lea     rcx,[rbp-29h]
fffff801`dae9e82d 448d420d        lea     r8d,[rdx+0Dh]
fffff801`dae9e831 e8160c0000      call    ndis!ndisPcwEndCycleCounter (fffff801`dae9f44c)
fffff801`dae9e836 90              nop
fffff801`dae9e837 e9d797fdff      jmp     ndis!ndisInterruptDpc+0x1c3 (fffff801`dae78013)

It appears the interrupt routine is looping for some reason.

I can’t find anything on the cycle counter function as it is undocumented, but I’ll take a guess and say that it’s keeping track of the time the interrupt has been executing. AFAIK this is done by using a counter on the currently executing thread to see how long it’s running.

3: kd> u nt!KeAcquireSpinLockRaiseToDpc+0x13
fffff800`922e8383 f605fcac270021  test    byte ptr [nt!PerfGlobalGroupMask+0x6 (fffff800`92563086)],21h
fffff800`922e838a 751f            jne     nt!KeAcquireSpinLockRaiseToDpc+0x3b (fffff800`922e83ab)
fffff800`922e838c f0480fba2900    lock bts qword ptr [rcx],0
fffff800`922e8392 7209            jb      nt!KeAcquireSpinLockRaiseToDpc+0x2d (fffff800`922e839d)
fffff800`922e8394 0fb6c3          movzx   eax,bl
fffff800`922e8397 4883c420        add     rsp,20h
fffff800`922e839b 5b              pop     rbx
fffff800`922e839c c3              ret

Here we can see the same DPC interrupt routine trying to acquire a spinlock yet it’s not managing to do it and therefore looping all whilst it is still running at DPC level and therefore preventing normal thread execution.

Eventually it seems it managed to acquire the spinlock and then call a clock interrupt in order to perform some operation, I suspect updating the system time in order to service the network driver with higher response times.
The system realised that it was taking too long to complete and therefore bugchecked.

3: kd> lmvm athwbx
start             end                 module name
fffff801`dce44000 fffff801`dd1ff000   athwbx     (no symbols)          
    Loaded symbol image file: athwbx.sys
    Image path: \SystemRoot\system32\DRIVERS\athwbx.sys
    Image name: athwbx.sys
    Timestamp:        Thu Oct 17 10:46:01 2013 (525FB1D9)
    CheckSum:         003BC161
    ImageSize:        003BB000
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

So the network driver was quite outdated. The user updated it and the blue screens stopped, so it looks like it was an easy fix.

Thanks for reading.

Debugging and my story

I thought I’d write a blog post about myself and how I started debugging as I haven’t posted in a long time. The reason is I’m actually on holiday in Greece, so I can’t write much about debugging as I don’t have access to my computer.

So here’s my story on how I got to where I am today…
Before time began… Wait, that’s not right.
Okay, I first got my own computer in March 2013 from a local computer shop, I saved up from Xmas and birthday money, my parents didn’t agree getting my own computer was a good idea but I insisted so I went to this shop and poured out a sum of £590 to pay for a gaming computer.
At the time I knew nothing, literally… I mean nothing about computers, I just wanted to play better games than what was on my Xbox at the time.
So here I carried it upstairs (my Uncle fetched the computer in his car for me at the time) into my bedroom, here comes the phone call of my mum ringing up and to her disbelief I told her I bought a computer.
She came home and to no surprise wasn’t happy at all, we had nowhere to put it at all, eventually she decided to move everything around to put a desk in.
Here’s where the irony comes in, she said it was a bad idea to buy the computer and especially from this local shop, so here I was setting it up, buying a few games on Steam, started playing the games and… (You guessed it) blue screen!
Oh no, what am I going to do?
Panic, panic…
I eventually ring the shop and say this has happened, so I took it down to the shop, he looked at it and told me to come back in a few days, I came back and it ran fine until it happened again.
I remember him saying it was a driver issue and I’m installing something that’s causing problems, funny how he never told me what it was.
This is the point where I decided to go online and see if I can find any solutions, all gibberish as I knew nothing of computing.
That’s when I stumbled upon
I asked for help and I got all these solutions that wouldn’t work, from somebody called Arc, I then contacted somebody else called X Blue Robot to help me which happens to be a good friend now over at
Anyway, the problems still persisted so joined another forum which was and made a thread, Vir Gnarus and two others helped me and it was appearing to be a hardware failure, especially given it was a 0x124, (discussed in another post).
I still couldn’t fix the problem, I got numerous errors to the point where I contacted the supplier where the local shop got the computers from. They told me to send it to them free of charge, they contacted me days after and said they couldn’t replicate the issue but replaced the PSU and GPU. The GPU had a loose bearing and wasn’t actually supplied with the computer; the local shop bought that separately, so the 450W PSU couldn’t really handle it and was replaced with an el cheapo 750W Ace switching PSU which can be bought from around £10, that’s good…
I gave up and changed what software I could as the supplier still couldn’t find any problems.
I started then posting BSOD instructions over at to help some BSOD analysts.
I got a few thanks, I then installed the Windows Debugger to take a look at some files, still gibberish, no luck in finding useful commands on my own. Thanks to some of x BlueRobot’s posts (Harry Miller) I managed to use some commands to solve simple BSOD cases.
I then started learning basics and reading blog posts by Harry on learning debugging. Afterwards I bought Windows Internals and read a bit of that.
Without causing more wars I got into a large disagreement over at and got banned, after already having an account on Sysnative about freezing I decided to take my knowledge over there.
Shortly after I made friends with two fellow BSOD analysts, Patrick Barker and John Griffith.
I’ve been there ever since at my new online home, I then joined and helped people out there (which I still do).
And now here I am (in Greece) helping people out with BSODs and hopefully starting a Computer Science degree at Sheffield Hallam University this time next year.
Oh and over at Vir Gnarus who is now a good friend but he recently switched to IT Infrastructure as opposed to debugging but I’m hoping he’ll return soon.
I’m undecided in what to do as a career at the moment but an Escalation Engineer at Microsoft looks like a very interesting job.
So that’s my story so far, debugging is very interesting and it’s so hard to believe that if I hadn’t bought that specific computer I wouldn’t be here today.

Memory Management – Stacks

In this blog I’ll talk about stacks, what they are and how they are used in Windows.
We’ve come across the term before but we don’t know that much about them unless you really look into them.

So a stack is an abstract data type that is implemented as a LIFO structure which means Last In First Out.

So from good old Wikipedia here’s a very good simple picture of a LIFO mechanism. We can see it uses Push to add data onto the stack and Pop to remove it, so now you know how the simple stack works let’s go a bit more advanced.

A stack has a fixed origin within memory called (you know it) the stack origin, and the first push initialises the stack. It also has a stack pointer, which points to the address of the last item added to the stack. The pointer moves further away from the origin as more data is added, although this doesn’t necessarily mean it’s moving up, it can move down.
Now the pointer cannot cross the origin at any time. If this happens a stack underrun occurs, which is normally caused by using Pop more times than Push.
A stack overflow occurs when Push is used more times than the allocated space allows, so the pointer moves past the boundary of the stack. In other words it spills data outside of the allocated region and can run into another stack.
This is a very big problem on Kernel stacks as there are no separate process address spaces to protect the memory; in Kernel mode everything is run from a single system address space that has access to the entire Kernel. When this overflow happens it can corrupt data on another stack that belongs to a completely different thread. The culprit on the current stack essentially flees, the thread whose stack was corrupted gets the blame, and a bugcheck is called once the corruption is detected. If this is the case, Driver Verifier should be enabled.
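To make the origin, pointer, overflow and underrun ideas concrete, here is a minimal Python sketch. The `BoundedStack` class, its addresses and the dictionary standing in for memory are all made up for illustration; it is not how Windows implements thread stacks, just the same rules in miniature with a stack that grows downwards.

```python
class BoundedStack:
    """Toy fixed-size stack that grows downwards from its origin."""

    def __init__(self, origin, limit):
        self.origin = origin      # hypothetical highest address, where the stack starts
        self.limit = limit        # hypothetical lowest address the stack may use
        self.pointer = origin     # an empty stack points at its origin
        self.memory = {}          # address -> value, standing in for real memory

    def push(self, value):
        if self.pointer - 1 < self.limit:
            raise OverflowError("stack overflow: pointer would leave the allocated region")
        self.pointer -= 1         # move further away from the origin
        self.memory[self.pointer] = value

    def pop(self):
        if self.pointer >= self.origin:
            raise IndexError("stack underrun: pointer would cross the origin")
        value = self.memory.pop(self.pointer)
        self.pointer += 1         # move back towards the origin
        return value


stack = BoundedStack(origin=0x1000, limit=0x0FFE)   # room for two slots only
stack.push("a")
stack.push("b")
last_out = stack.pop()            # "b" comes back first: last in, first out
```

A third push on the full stack raises the overflow error, and popping an empty stack raises the underrun error, instead of silently scribbling over a neighbour as a real overflow would.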

A good picture to show how this works is as follows.

I rambled a bit here but I just tried to briefly explain how stack overflows cause corruption that can bring the system to a halt so device drivers and other kernel objects should be written carefully and correctly to prevent these situations from happening.

Stacks can also be implemented with arrays, where the first element at offset zero is the stack origin and the stack builds from there.
Stacks can be implemented with linked lists too. This is still a LIFO mechanism: a push adds a new node at the head of the list and a pop removes the head node, so the last node added is always the first one removed.
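Here is a small sketch of a stack built on a singly linked list, where push and pop both work on the head node so the structure still behaves as LIFO. The `Node` and `LinkedStack` names are made up for this example.

```python
class Node:
    """One entry in the list: a value plus a link to the node underneath it."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class LinkedStack:
    def __init__(self):
        self.head = None                          # top of the stack; None means empty

    def push(self, value):
        self.head = Node(value, self.head)        # new node links to the old top

    def pop(self):
        if self.head is None:
            raise IndexError("pop from an empty stack")
        node = self.head
        self.head = node.next                     # the node underneath becomes the top
        return node.value


linked = LinkedStack()
for item in (1, 2, 3):
    linked.push(item)
popped = [linked.pop() for _ in range(3)]         # comes back in reverse order
```

Unlike the array version there is no shifting or resizing; every push and pop only touches the head pointer.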

There are generally three major types of stacks: User Stacks, Kernel Stacks and DPC Stacks.

With User Stacks, when a thread is created the memory manager reserves 1MB of memory by default, which can be altered when calling the CreateThread function. Once the thread is created only the first page and a guard page are committed. The stack can then be used until the guard page is hit, at which point an exception occurs that allows the stack to grow with demand, but it will never shrink back.
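The grow-on-demand behaviour can be pictured with a rough simulation. This is only a sketch under simplifying assumptions: the `UserStack` class is hypothetical, pages are assumed to be 4KB, and the guard page exception is modelled as a plain loop rather than a real exception handler.

```python
PAGE_SIZE = 4 * 1024          # assumed page size
RESERVED = 1024 * 1024        # the 1MB default reservation described above


class UserStack:
    def __init__(self, reserved=RESERVED):
        self.reserved_pages = reserved // PAGE_SIZE
        self.committed_pages = 1            # only the first page is committed at creation

    def touch(self, depth_bytes):
        """Simulate the thread using depth_bytes of stack; returns pages committed."""
        needed = -(-depth_bytes // PAGE_SIZE)        # round up to whole pages
        while needed > self.committed_pages:
            # Hitting the guard page: commit it and place a new guard page beyond it,
            # unless that would run past the end of the reservation.
            if self.committed_pages + 1 >= self.reserved_pages:
                raise MemoryError("stack overflow: reservation exhausted")
            self.committed_pages += 1
        return self.committed_pages                  # never shrinks back


user_stack = UserStack()
small = user_stack.touch(100)                 # fits in the first page
grown = user_stack.touch(3 * PAGE_SIZE)       # guard page hit twice
after = user_stack.touch(100)                 # usage drops, committed pages stay
```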

Kernel stacks are a lot smaller than user stacks. They are typically 12KB on x86 and 16KB on x64, which excludes a guard page table entry that consumes an additional 4KB.
Kernel code tends to have less recursion than user mode code and is written to use the stack more efficiently, which keeps stack sizes smaller. As stated before, Kernel code has a much larger impact on the system as it runs in a single system address space.

However, because interactions between the graphics system in win32k.sys and its subsequent calls back into user mode are recursive, the Kernel implements a way for additional stacks to be added when nearing the guard page. Each of these stacks adds a further 16KB, and as the calls return the memory manager frees the stacks again.

The DPC stack is a per-processor stack (one for each processor) which is available for use every time DPCs are executed. DPCs stay on their own stack as they are generally unrelated to the current kernel stack’s operations, since they run in an arbitrary thread context.

I believe I’ve covered pretty much everything on stacks, I hope that’s helped your understanding.

Windows Internals

Interrupt dispatching and handling

In this post I’ll talk about interrupt dispatching and the type of interrupts. Interrupts have always been interesting yet slightly confusing at the same time so I’ll try and explain what they are and the different types they come in.

So what is an interrupt?

It’s kind of in the name, it’s an asynchronous event that diverts the processor’s flow of control.
They generally come in two forms, hardware interrupts and software interrupts.
Interrupts can come from I/O devices, timers or processor clocks.

Hardware Interrupts

These interrupts are external I/O events that come in on lines of the interrupt controller. When an IRQ (Interrupt Request) is received it enters through a line on the interrupt controller, which converts the IRQ into a number that is used as an index into the IDT (Interrupt Dispatch Table). The trap handler then saves the context of the currently executing thread and the ISR (Interrupt Service Routine) is invoked. Once the interrupt is completed the context is restored, so the thread continues execution like nothing has ever happened.
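The flow from IRQ to ISR can be sketched in a few lines of Python. Everything here is a stand-in: the `idt` dictionary plays the part of the Interrupt Dispatch Table, `VECTOR_BASE` is a made-up value, and the "context" is just a dictionary of register names.

```python
# Stand-in for the Interrupt Dispatch Table: vector number -> service routine.
idt = {}

VECTOR_BASE = 0x30   # hypothetical base; real vector assignments vary by system


def register_isr(irq, isr):
    """Install an ISR in the table slot that the given IRQ line maps to."""
    idt[VECTOR_BASE + irq] = isr


def dispatch(irq, thread_context):
    """Model the controller turning an IRQ into a number and the ISR being run."""
    vector = VECTOR_BASE + irq               # the IRQ is converted into an IDT index
    saved = dict(thread_context)             # trap handler saves the thread's context
    try:
        return idt[vector](thread_context)   # the ISR runs and may clobber registers
    finally:
        thread_context.clear()
        thread_context.update(saved)         # context restored, thread carries on


def keyboard_isr(context):
    context["rax"] = 0xDEAD                  # scribble on a register on purpose
    return "keyboard serviced"


register_isr(1, keyboard_isr)
context = {"rip": 0x401000, "rax": 1}
result = dispatch(1, context)
```

After `dispatch` returns, `context` holds exactly what it held before the interrupt, which is the "like nothing has ever happened" part.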

Interrupt controllers

Hardware interrupts use interrupt controllers which, generally speaking, come in two forms: the PIC (Programmable Interrupt Controller) and the APIC (Advanced Programmable Interrupt Controller). The PIC is a uniprocessor controller that is generally used on x86 systems and has 8 lines. However, a second PIC called a slave can be added, which adds another 7 usable lines for a total of 15, as one line on the first PIC is taken up by the connection to the slave.
The APIC is a multiprocessor interrupt controller, generally used on x64 systems, that supports up to 256 lines. With this in play the PIC is quickly being phased out.

Here is an example of the IDT which contains lots of different entries for specific interrupts, trap handlers for exceptions also use the IDT for events such as page faults.
I will discuss later on how page faults come into play with bugchecks and IRQs but for a more indepth explanation on how page faults are handled take a look at my friend Patrick’s post over at

Software Interrupts

Although interrupt controllers implement their own prioritisation mechanisms, Windows uses its own technique for doing so: IRQLs.

These are IRQLs for x64, IA64 and x86 systems.
IRQLs are a way for interrupts to be prioritised appropriately. Interrupts aren’t serviced on a first come, first served basis; rather, the higher the IRQL the higher the priority, so an interrupt at IRQL 15 would get serviced before one at IRQL 2.

To put this into perspective, an interrupt at IRQL 2 would have to wait for any interrupts at 3 or above to be serviced first. Only once the IRQL has dropped back down can it be serviced, as the processor does not lower its level to take a lower priority interrupt.
For example, if an interrupt is being serviced and another interrupt needs servicing two things can happen.

One is the current IRQ is put on a waiting list and the new one is serviced.
Two is the current IRQ finishes being serviced, then the next highest interrupt on the waiting list is serviced.

It depends on the IRQL of the interrupts.
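The two outcomes can be shown with a toy model of a single processor. This is only a sketch: the `Processor` class, the interrupt names and the IRQL numbers are made up, and real interrupt masking is far more involved.

```python
import heapq


class Processor:
    def __init__(self):
        self.current = None     # (irql, name) of the interrupt being serviced
        self.pending = []       # waiting list, highest IRQL first
        self.arrivals = 0       # tie-breaker so equal IRQLs keep arrival order
        self.log = []           # names in the order they finished

    def _pend(self, irql, name):
        heapq.heappush(self.pending, (-irql, self.arrivals, name))
        self.arrivals += 1

    def raise_interrupt(self, name, irql):
        if self.current is None:
            self.current = (irql, name)
        elif irql > self.current[0]:
            self._pend(*self.current)       # case one: current is put on the waiting list
            self.current = (irql, name)
        else:
            self._pend(irql, name)          # case two: newcomer waits its turn

    def complete(self):
        self.log.append(self.current[1])
        if self.pending:
            neg_irql, _, name = heapq.heappop(self.pending)
            self.current = (-neg_irql, name)
        else:
            self.current = None


cpu = Processor()
cpu.raise_interrupt("disk", 5)
cpu.raise_interrupt("clock", 13)   # higher IRQL, so it takes over from disk
cpu.raise_interrupt("dpc", 2)      # lower IRQL, so it has to wait
for _ in range(3):
    cpu.complete()
order = cpu.log
```

The finishing order is clock, then disk, then dpc, regardless of the order they arrived in.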

Back to page faults.
A page fault occurs when a request is made to memory that is not present. When this happens, the page fault handler requests that the memory being referenced is brought into memory, but in order to do that the IRQL must be at 1 or below, as this is when pageable memory can be accessed.
Now when the IRQL is higher than this, servicing an interrupt, and a page fault occurs, this is when we bugcheck with either 0xA (IRQL_NOT_LESS_OR_EQUAL) or 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL).

So why can’t we just lower the IRQL to service the page fault or wait for the current interrupt to finish?

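Well, IRQLs cannot be lowered while an interrupt at that level is being serviced as that has priority, and a page fault cannot wait as it must be serviced immediately. Bringing the page in means waiting on the disk, and the scheduler that would let the thread wait can’t run at DPC/dispatch level or above.
You see the problem?
It’s an endless cycle, so the system bugchecks as it has no safe way to carry on.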
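The rule boils down to a simple decision, sketched below. The function name and the `caller_is_driver` flag are hypothetical, and using that flag to pick between 0xA and 0xD1 is a simplification for illustration rather than the exact logic Windows uses.

```python
PASSIVE_LEVEL, APC_LEVEL, DISPATCH_LEVEL = 0, 1, 2


def touch_memory(irql, page_resident, caller_is_driver=False):
    """Return what happens when code at the given IRQL references a page."""
    if page_resident:
        return "ok"                       # no fault, nothing to service
    if irql <= APC_LEVEL:
        return "page fault serviced"      # low enough to wait for the page
    # Page is not present and the IRQL is too high to bring it in.
    return "bugcheck 0xD1" if caller_is_driver else "bugcheck 0xA"


outcomes = [
    touch_memory(PASSIVE_LEVEL, page_resident=False),
    touch_memory(DISPATCH_LEVEL, page_resident=True),
    touch_memory(DISPATCH_LEVEL, page_resident=False, caller_is_driver=True),
]
```

Note that running at a high IRQL is fine as long as the memory is resident, which is why code at DPC/dispatch level has to stick to nonpaged memory.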

I hope I’ve covered pretty much everything and I hope you’ve learned something.

I forgot to add, hardware interrupts (IRQs) are serviced at device IRQLs above DPC/dispatch level. DPC/dispatch level and the levels below it are used for software interrupts such as DPCs and APCs.

Instruction pointer misalignments

This time I’ll talk about instruction pointer misalignment.
So what is an instruction pointer misalignment?

Well, when an object references memory it uses a pointer to (you guessed it) point to a certain memory address, once it references the data inside that address it grabs the data from inside the address which is known as dereferencing.

When a pointer is misaligned it grabs data from the wrong address which causes a lot of problems by causing severe memory corruption depending on the contents of the address being referenced, if allowed to write it can completely corrupt the address, the culprit can escape and some innocent pointer comes along, tries to use the address and gets blamed by the computer police.
This is why bugchecks are called, to prevent such memory corruption. Now, the way data structures are arranged and accessed, the CPU will read/write in chunks of 4 bytes (sometimes larger), so the memory offset will be a multiple of the word size. The reason this is done is to maximise performance by utilising the way the CPU handles memory.

When the address being referenced isn’t a multiple of 4, that’s when things go wrong. On hardware that enforces alignment it generally results in an alignment fault, which is also known as a bus error. A good example is this.

This instruction taken from a crash dump can explain this a little bit.

So nt is the module, and the Mm prefix tells us it’s a Memory Management Windows function.
CleanProcessAddressSpace is the actual function; in this case it’s cleaning up a process’s address space.
The +0xe6 is the offset, which is like the address on a street; it’s how far into the function the instruction is located.
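Splitting a symbol like this into its three parts is easy to do by hand, but here is a small sketch that does it in code. The `parse_symbol` helper is made up for this post and only handles the simple `module!function+offset` form.

```python
def parse_symbol(text):
    """Split 'module!Function+0xNN' into (module, function, offset)."""
    module, rest = text.split("!", 1)
    function, _, offset = rest.partition("+")
    # No '+0x..' part means the address is the very start of the function.
    return module, function, int(offset, 16) if offset else 0


module, function, offset = parse_symbol("nt!MmCleanProcessAddressSpace+0xe6")
```

Here the offset comes out as 230 in decimal, so the instruction sits 230 bytes past the start of the function.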

I was actually looking at the differences between a segmentation fault and a bus error, as they both involve the CPU not physically being capable of addressing the memory being referenced.

  • The segmentation error (or access violation exception) occurs when memory outside of the allowed location is referenced (not to be confused with buffer overruns which involve writing outside allocated memory into another buffer).
  • The bus error occurs when an address which is not aligned is referenced, which as you know is when it isn’t a multiple of 4.
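Checking whether an address is a multiple of 4 is a one-liner. The sketch below uses a made-up `is_aligned` helper and some made-up addresses; it relies on the alignment being a power of two, so the low bits of an aligned address are all zero.

```python
def is_aligned(address, alignment=4):
    """True when address is a whole multiple of alignment (a power of two)."""
    return (address & (alignment - 1)) == 0


addresses = [0x1000, 0x1002, 0x1004, 0x1007]
aligned = [hex(a) for a in addresses if is_aligned(a)]
misaligned = [hex(a) for a in addresses if not is_aligned(a)]
```

Only 0x1000 and 0x1004 pass the check; the other two fall in the middle of a 4 byte chunk.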

Another thing to note: in dump files, where it says misaligned pointer it mentions it’s probably caused by hardware. On x86 and x64, instructions vary in length, so a misaligned instruction pointer there really means the pointer has landed in the middle of an instruction rather than on an address that isn’t a multiple of 4. Either way the CPU ends up reading something that was never meant to be read that way, which is why hardware gets the blame.
Misaligned IPs don’t always result in a bus error. They can also be caused by drivers writing more data into a buffer on a stack than it can hold, which overruns the buffer and corrupts the return address; this also results in a bugcheck to prevent critical memory corruption.

I hope this has helped you understand the differences and more about instruction pointers.

Hexadecimal and Binary

This blog will be a little different to my usual debugging blogs.
I will be talking about hexadecimal and binary. It can be difficult to fully understand but we should be able to get through it.

Now, at school, I was never really good at Maths. I struggled with a lot of things but I’ve picked up a few things with debugging, as Windows Internals uses these figures to perform operations that would be awkward in decimal.

Binary can be difficult to get your head round but computers use it to make things a lot simpler.
Remember at school when you had to use a T chart and count in tens?
So “10” would be 1 ten and 0 ones; in binary “10” means 1 two and 0 ones.
“100” in binary would be 4 (2×2), “1000” would be eight (4×2) etc.
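The counting rule above can be written as a tiny Python function. The `binary_to_decimal` name is just for this example; Python’s built-in `int(text, 2)` does the same job.

```python
def binary_to_decimal(bits):
    """Work through the digits left to right, doubling the total each step."""
    total = 0
    for digit in bits:
        total = total * 2 + int(digit)
    return total


values = [binary_to_decimal(b) for b in ("10", "100", "1000")]
```

This gives 2, 4 and 8, matching the twos, fours and eights columns.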
Generally, binary suits computers because every digit has only two states, which map straight onto on and off.
There are also far fewer rules compared to decimal, which actually simplifies things (for the computer), but for us we would need a compiler to convert the code for us to make sense of it.

Hexadecimal is used because it makes large numbers shorter to write, and it converts into binary easily as each hexadecimal digit stands for exactly four binary digits. Instead of powers of 10, hexadecimal uses powers of 16, so “10” in hexadecimal would be 16 as it’s 1 sixteen and 0 ones.
“25” in hexadecimal would be 2 sixteens and 5 ones so it would be 37.
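The same place value idea works for hexadecimal, just with sixteens instead of twos. Here is a small sketch; `hex_to_decimal` is a made-up helper, and Python’s built-in `int(text, 16)` gives the same answers.

```python
HEX_DIGITS = "0123456789ABCDEF"   # a digit's position in this string is its value


def hex_to_decimal(text):
    """Work through the digits left to right, multiplying the total by 16 each step."""
    total = 0
    for digit in text.upper():
        total = total * 16 + HEX_DIGITS.index(digit)
    return total


sixteen = hex_to_decimal("10")        # 1 sixteen and 0 ones
thirty_seven = hex_to_decimal("25")   # 2 sixteens and 5 ones
```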

But how does that work as there aren’t symbols for 10 to 15 in hexadecimal?
This is why we have the letters A to F for 10 to 15, here’s a good conversion chart to help you understand.

Here we can see how they all convert into each other. Obviously the higher the figures, the more difficult they become to understand.
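A chart like this can also be generated with a few lines of Python, which is handy for checking your own conversions. This is just a sketch using the built-in `format` function.

```python
# One row per value: decimal, hexadecimal digit, four-digit binary.
rows = [(n, format(n, "X"), format(n, "04b")) for n in range(16)]

print("Decimal  Hex  Binary")
for decimal, hexadecimal, binary in rows:
    print(f"{decimal:>7}  {hexadecimal:>3}  {binary}")
```

Each hexadecimal digit lines up with exactly four binary digits, which is why the two convert into each other so easily.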

Hopefully this has helped a lot of you understand if you didn’t already.