Page fault from IRQL (0xD1)

This blog post is about an interesting dump file I came across on our forum.
So far, an IRP was sent from the Logitech webcam driver down the device stack, through the USB parent driver and then the USB 3.0 host controller driver. The IRP was changing the power state of the device and incurred a page fault at Dispatch IRQL, which is a big no-no.

Cyclic Redundancy Checks

First of all, I want to apologise for not posting for a long time. I’ve been very busy with work at college, but I’ll try to post more regularly.

In this post I want to talk about Cyclic Redundancy Checks and what they are. In their most basic form, CRCs are an error-checking method for the digital transmission of data. In short, they use polynomial division to check for inconsistencies within the data. As you know, data is stored in binary, which is base two, so each bit is either 0 or 1. The message, which is a series of bits, is treated as the dividend and is divided by an agreed value, the divisor. The remainder of that division is the check value (the CRC), which is sent along with the message. The transmitter and receiver both know the divisor. When the message arrives, the receiver divides the received message by the divisor and compares the remainder with the check value it was sent; if they are the same, the data should be intact.

The divisor is usually presented in the form of a generator polynomial, K, whose coefficients are the binary bits of the divisor. This is probably difficult to picture, so I’ll explain more. Suppose we want to use the number 36 as K. This number is written as 100100 in binary, and as x^5 + x^2 as a polynomial expression (each 1 bit contributes one power of x). The remainder is always at least one bit shorter than K, so a k-bit divisor gives a CRC of k-1 bits; in this case the 6-bit divisor leaves us with a 5-bit CRC. The longer the divisor, the slimmer the chance of a corrupted message going undetected.

Alright, so suppose we want to send message M, which has the value 110101101000101100010111. We then perform long division on the bits by our generator polynomial K = 100100. All we need to know is the XOR operation, which essentially means that if two bits are the same the result is zero, and if they are different the result is one.

0 XOR 0 = 0    1 XOR 0 = 1    0 XOR 1 = 1    1 XOR 1 = 0

[Image: binary CRC long division]

So here in this picture, I’ve used binary long division to work out the remainder, and this remainder is the CRC. When the message M is sent, the CRC is sent along with it, so the receiver can repeat the division by K and knows what remainder to expect. When the data doesn’t match up, the remainder will be different to what it should be, which means the integrity of the data has been compromised.
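The long division described above can be sketched in a few lines of Python. This is only a minimal illustration using the M and K values from this post; the function names are made up for the example, and it follows the usual convention of appending k-1 zero bits to the message before dividing, which is an assumption about how the picture was worked out.

```python
def xor_divide(bits: str, divisor: str) -> str:
    """Binary long division where each subtraction step is an XOR.

    Returns the remainder, which is len(divisor) - 1 bits long.
    Assumes the divisor starts with a 1 bit and bits is at least as long.
    """
    work = [int(b) for b in bits]
    div = [int(b) for b in divisor]
    # Slide the divisor along the dividend from left to right.
    for i in range(len(work) - len(div) + 1):
        if work[i] == 1:  # only "subtract" when the leading bit is set
            for j, d in enumerate(div):
                work[i + j] ^= d
    remainder_bits = work[len(work) - (len(div) - 1):]
    return "".join(str(b) for b in remainder_bits)


def crc_remainder(message: str, divisor: str) -> str:
    """Append len(divisor) - 1 zero bits, divide, and return the remainder."""
    return xor_divide(message + "0" * (len(divisor) - 1), divisor)


def crc_ok(message: str, crc: str, divisor: str) -> bool:
    """Receiver side: message followed by its CRC must divide with no remainder."""
    return "1" not in xor_divide(message + crc, divisor)


M = "110101101000101100010111"   # message from the post
K = "100100"                     # divisor from the post
crc = crc_remainder(M, K)
print("CRC:", crc, "intact:", crc_ok(M, crc, K))
```

Flipping any single bit of M before calling crc_ok makes the check fail, which is the "remainder will be different" case described above.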

So what happens if something like this occurs in Windows? Well, in the following scenario there’s a bugcheck that keeps occurring, but the cause isn’t a driver, it’s a hardware component.

3: kd> .bugcheck
Bugcheck code 0000007A
Arguments fffff6fc`00e500a0 ffffffff`c000003f 00000003`182a3860 fffff801`ca014d84

//Kernel inpage error, kernel data couldn’t be brought in from disk

3: kd> !error ffffffffc000003f
Error code: (NTSTATUS) 0xc000003f (3221225535) – {Bad CRC}  A cyclic redundancy check (CRC) checksum error occurred.

//Here’s our redundancy check, the data is corrupt, it doesn’t match what was originally written.

3: kd> k
Child-SP          RetAddr           Call Site
ffffd000`3cd0d5c8 fffff803`881ade2f nt!KeBugCheckEx //Data is corrupt, bugcheck
ffffd000`3cd0d5d0 fffff803`8806a0ac nt!MiWaitForInPageComplete+0x3177f //Wait for the data to be paged in
ffffd000`3cd0d6c0 fffff803`88081c44 nt!MiIssueHardFault+0x184 //Data isn’t present in memory, hard fault, page in from disk
ffffd000`3cd0d780 fffff803`8817642f nt!MmAccessFault+0x524 //Invoke the memory manager page fault handler
ffffd000`3cd0d930 fffff801`ca014d84 nt!KiPageFault+0x12f //Hit a page fault
ffffd000`3cd0dac8 fffff801`c9ff4b92 dxgkrnl!DxgkMiracastQueryMiracastSupport
ffffd000`3cd0dad0 fffff803`881779b3 dxgkrnl!DxgkNetDispQueryMiracastDisplayDeviceSupport+0x1a //Internal DirectX functions
ffffd000`3cd0db00 00007ff8`bc8b15ea nt!KiSystemServiceCopyEnd+0x13 //Transition to kernel mode
000000e3`3521e508 00000000`00000000 0x00007ff8`bc8b15ea //User mode

fffff801ca0148d4-fffff801ca01491a  71 bytes – dxgkrnl!DxgkHandleMiracastEscape+380 (+0x06)
[ 85 c9 74 2e ff c9 74 15:00 c7 84 24 08 02 00 00 ]
fffff801ca01491c-fffff801ca014951  54 bytes – dxgkrnl!DxgkHandleMiracastEscape+3c8 (+0x48)
[ 00 41 bd 40 00 00 00 45:41 89 cc 45 2b e3 41 89 ]
WARNING: !chkimg output was truncated to 50 lines. Invoke !chkimg without ‘-lo [num_lines]’ to view  entire output.
3988 errors : !dxgkrnl (fffff801ca014000-fffff801ca014fff)

So we can see all these errors, 3988 to be precise. Now, was it the RAM or the disk? It’s difficult to say from one dump file, so I analysed a few more, although the error in those was different: it was a failing disk that simply disappeared, thus crashing the system. The error in this case was almost certainly due to part of the disk having stopped functioning correctly and therefore corrupting kernel data.

Device/Driver Objects and Stacks

Today I thought I’d write a bit about device stacks and driver stacks and how they implement IRPs.
I’m not going into detail on how drivers function and the types of drivers as I would be here all day so I’ll save that for another time.

What is a device object and a driver object?

A device object is an opaque structure that represents a device or function. It is an instance of the DEVICE_OBJECT data structure, which the operating system uses to represent a device.
Device objects don’t always represent a physical device; they can also represent a logical device.

A driver object is an instance of the DRIVER_OBJECT data structure that represents the loaded image of a Kernel mode driver, and it includes pointers to the driver’s routines.
When a driver initialises it creates device objects to represent its physical or logical devices.

Device stacks and Device nodes

The Kernel organises devices into a tree structure called the Plug and Play device tree, which contains device nodes that represent devices. Do note that some nodes represent software components which don’t have any physical devices attached to them.

A device stack contains a PDO (Physical Device Object), which represents the physical device connected to a physical bus on the motherboard. In this case I’ll use the PCI bus as an example.
The PCI bus driver enumerates the child devices which are connected to the PCI bus on the motherboard and creates a PDO for each device, which is then represented by a device node in the PnP device tree.
Do note that what type of driver pci.sys is depends on your perspective: if you’re looking at the PCI bus device node then it’s the function driver, but if you’re looking at one of the child devices of the PCI bus node then it’s the bus driver.

After the device node has been associated with the new PDO, the PnP manager searches the registry for the driver(s) which need to be part of the device stack: the Function Driver and any Filter Drivers.

Here’s a small point about the drivers usually found in device stacks:

  • Bus drivers detect the devices on their bus and inform the PnP manager about them, as well as controlling the power to the bus. Only one bus driver is allowed per bus, and Microsoft normally supplies them.
  • Function drivers, on the other hand, are the main driver that represents the device and performs the basic operations for reading and writing. It’s the driver that knows the most about its device.
  • Filter drivers modify the device behaviour when needed and are located above or below the function driver. A filter above the function driver sees requests before they reach it on the way down the stack, so it can adjust or fix them first.

0: kd> !devstack fffffa8004615680
!DevObj           !DrvObj            !DevExt           ObjectName
fffffa8005f8c150  \DRIVER\VERIFIER_FILTERfffffa8005f8c2a0
fffffa8005f8c390 *** ERROR: Module load completed but symbols could not be loaded for GEARAspiWDM.sys
\Driver\GEARAspiWDM
fffffa8005f8c4e0
fffffa8005f202e0  \DRIVER\VERIFIER_FILTERfffffa8005f20430
fffffa8005eec060  \Driver\cdrom      fffffa8005f77b80  CdRom0
fffffa80057379b0  \Driver\ACPI       fffffa80047f6a00
> fffffa8004615680  \Driver\atapi      fffffa80046157d0  IdeDeviceP0T1L0-5
!DevNode fffffa8005742900 :
DeviceInst is “IDE\CdRomATAPI_iHAS124___B_______________________AL0R____\5&f437ab5&0&0.1.0”
ServiceName is “cdrom”

This is the device stack for the CD drive in the computer, which shows the associated device objects and driver objects within it.

  • atapi is the driver for the IDE/ATAPI channel the drive is attached to. Its device object sits at the bottom of the stack and provides the interface to the CD drive.
  • ACPI is the bus filter driver that enables Power Management for the operating system, so when a device is not in use (in this case the CD drive) it can be powered down.
  • cdrom is the function driver for the CD drive, which allows discs to be read and written to.
  • GEARAspiWDM.sys is a 3rd party filter driver for cdrom.
  • VERIFIER_FILTER entries are filter drivers used by Driver Verifier, which is enabled to monitor driver routines and operations to make sure everything is working correctly.

For more information on Driver Verifier see here: Driver Verifier (Windows Drivers)

A driver stack is the sequence of drivers that take part in processing an IRP as it is passed down a device stack or, in some cases, multiple device stacks.
A driver object can be associated with many different device objects and therefore with lots of device stacks, so an IRP can be passed down several device stacks while only being serviced by a few drivers.

0: kd> !drvobj \Driver\ACPI
Driver object (fffffa80039a6af0) is for:
\Driver\ACPI
Driver Extension List: (id , addr)

Device Object list:
fffffa80057379b0  fffffa800573a9b0  fffffa80057399b0  fffffa800572fc20
fffffa800572fe40  fffffa800572ea00  fffffa800572ec20  fffffa800572ee40
fffffa800572da00  fffffa800572dc20  fffffa800572de40  fffffa8005819e40
fffffa8005814e40  fffffa800580fe40  fffffa800572a9b0  fffffa80057289b0
fffffa8005720e40  fffffa800571c920  fffffa800571cb20  fffffa800571bc40
fffffa800571be40  fffffa8005713bc0  fffffa8005700e40  fffffa80056ffa40
fffffa80056ffc40  fffffa80056ffe40  fffffa80056fea40  fffffa80056fec40
fffffa80056fee40  fffffa80056fda40  fffffa80056fdc40  fffffa80056fde40
fffffa80056fca40  fffffa80056fcc40  fffffa80056fce40  fffffa8004616330
fffffa8004616040  fffffa8004616c20  fffffa8004616e40  fffffa80047f9770
fffffa80039eadb0  fffffa8004be1060  fffffa80047fe170  fffffa80047fe390
fffffa80047fe5b0  fffffa80047fe7d0  fffffa80047fe9f0  fffffa80047fec10
fffffa80039a7040

As shown here, the ACPI.sys driver is associated with a lot of device objects. It can’t just represent one device, otherwise only one hardware component would use ACPI and everything else would be powered on all the time; think about how many USB devices would be turned on.
So our CD drive is just one component that uses ACPI.

Finally we can see information about the IRP being sent by looking at the IRP data structure.

0: kd> dt nt!_IRP fffff9801c458dc0
+0x000 Type             : 0n6
+0x002 Size             : 0x238
+0x008 MdlAddress       : (null)
+0x010 Flags            : 0x40000000
+0x018 AssociatedIrp    :
+0x020 ThreadListEntry  : _LIST_ENTRY [ 0xfffff980`1c458de0 – 0xfffff980`1c458de0 ]
+0x030 IoStatus         : _IO_STATUS_BLOCK
+0x040 RequestorMode    : 0 ''
+0x041 PendingReturned  : 0 ''
+0x042 StackCount       : 5 ''
+0x043 CurrentLocation  : 1 ''
+0x044 Cancel           : 0 ''
+0x045 CancelIrql       : 0 ''
+0x046 ApcEnvironment   : 0 ''
+0x047 AllocationFlags  : 0x80 ''
+0x048 UserIosb         : (null)
+0x050 UserEvent        : (null)
+0x058 Overlay          :
+0x068 CancelRoutine    : (null)
+0x070 UserBuffer       : (null)
+0x078 Tail             :

Some of the entries are pretty obvious from the name and some aren’t documented. The ones that are documented can be found here:

IRP (Windows Drivers)

Power IRPs

I found an old 0x9F Kernel dump file which was caused by a power transition not synchronising with the PnP manager.
Power IRPs are used to change the power state of a device, and therefore they must reach the bottom of the device stack, which is the physical device object.
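To picture why an IRP has to travel all the way down, here is a small model in Python. It is purely illustrative: the Irp and Driver classes and the driver names are hypothetical stand-ins, not the real WDM structures or APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Irp:
    major: str                      # e.g. "IRP_MJ_POWER"
    visited: List[str] = field(default_factory=list)
    reached_pdo: bool = False


@dataclass
class Driver:
    name: str
    forwards: bool = True           # False models a driver that sits on the IRP


def send_down(irp: Irp, device_stack: List[Driver]) -> Irp:
    """Pass the IRP from the top of the device stack towards the PDO."""
    for driver in device_stack:     # index 0 is the top of the stack
        irp.visited.append(driver.name)
        if not driver.forwards:
            return irp              # IRP is stuck here and never goes lower
    irp.reached_pdo = True
    return irp


healthy = [Driver("filter"), Driver("function"), Driver("bus")]
stalled = [Driver("filter"), Driver("function", forwards=False), Driver("bus")]

print(send_down(Irp("IRP_MJ_POWER"), healthy))
print(send_down(Irp("IRP_MJ_POWER"), stalled))
```

In the stalled case the IRP never reaches the bottom driver, so the power state change it was carrying never happens.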

DRIVER_POWER_STATE_FAILURE (9f)
A driver has failed to complete a power IRP within a specific time.
Arguments:
Arg1: 0000000000000004, The power transition timed out waiting to synchronize with the Pnp
subsystem.

Arg2: 0000000000000258, Timeout in seconds.
Arg3: fffffa8007005660, The thread currently holding on to the Pnp lock.
Arg4: fffff800053e83d0, nt!TRIAGE_9F_PNP on Win7 and higher

So we can see our 0x9F bugcheck with a power IRP failing to synchronise with the PnP manager because the IRP hasn’t reached the bottom of the stack.

0: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks..

Resource @ nt!IopDeviceTreeLock (0xfffff80003492ce0)    Shared 1 owning threads
Contention Count = 1
Threads: fffffa8007005660-01
KD: Scanning for held locks.

Resource @ nt!PiEngineLock (0xfffff80003492be0)    Exclusively owned
Contention Count = 21
NumberOfExclusiveWaiters = 1
Threads: fffffa8007005660-01
Threads Waiting On Exclusive Access:
fffffa800f308b50

KD: Scanning for held locks……..
18855 total locks, 2 locks currently held

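We can see two locks are being held: IopDeviceTreeLock, which synchronises access to the device tree, and PiEngineLock, which is a PnP and power management lock. Both are executive resources (which is what !locks lists) rather than spinlocks. The PiEngineLock is owned by thread fffffa8007005660, the same thread named in Arg3, and as the stack below shows it is currently running in the ZTEusbnet driver.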

0: kd> !thread fffffa80`07005660
THREAD fffffa8007005660  Cid 0004.0048  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
fffffa800d035ee8  NotificationEvent
IRP List:
fffffa8008f5cc10: (0006,03e8) Flags: 00000000  Mdl: 00000000
Not impersonating
DeviceMap                 fffff8a000008c10
Owning Process            fffffa8006f8d890       Image:         System
Attached Process          N/A            Image:         N/A
Wait Start TickCount      396427         Ticks: 38463 (0:00:10:00.026)
Context Switch Count      44059          IdealProcessor: 2  NoStackSwap
UserTime                  00:00:00.000
KernelTime                00:00:00.343
Win32 Start Address nt!ExpWorkerThread (0xfffff80003298150)
Stack Init fffff88003bd2c70 Current fffff88003bd2280
Base fffff88003bd3000 Limit fffff88003bcd000 Call 0
Priority 15 BasePriority 12 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`03bd22c0 fffff800`032845f2 : fffffa80`07005660 fffffa80`07005660 00000000`00000000 00000000`00000000 : nt!KiSwapContext+0x7a
fffff880`03bd2400 fffff800`0329599f : fffffa80`0d0df208 fffff880`0ae9e10b fffffa80`00000000 00000000`00000000 : nt!KiCommitThreadWait+0x1d2
fffff880`03bd2490 fffff880`0ae915dd : fffffa80`0d035000 00000000`00000000 fffffa80`0dd8ca00 00000000`00000000 : nt!KeWaitForSingleObject+0x19f
fffff880`03bd2530 fffff880`0ae92627 : fffffa80`0d035000 00000000`00000000 fffffa80`0c0891a0 fffff880`03bd2670 : ZTEusbnet+0x35dd
fffff880`03bd2580 fffff880`0215d809 : fffffa80`0c0891a0 fffff880`020f0ecd fffff880`03bd2670 fffffa80`091c5550 : ZTEusbnet+0x4627
fffff880`03bd25b0 fffff880`0215d7d0 : fffffa80`091c54a0 fffffa80`0c0891a0 fffff880`03bd2670 fffffa80`08fc2ac0 : ndis!NdisFDevicePnPEventNotify+0x89
fffff880`03bd25e0 fffff880`0215d7d0 : fffffa80`08fc2a10 fffffa80`0c0891a0 fffffa80`091f9010 fffffa80`091f90c0 : ndis!NdisFDevicePnPEventNotify+0x50
fffff880`03bd2610 fffff880`0219070c : fffffa80`0c0891a0 00000000`00000000 00000000`00000000 fffffa80`0c0891a0 : ndis!NdisFDevicePnPEventNotify+0x50
fffff880`03bd2640 fffff880`021a1da2 : 00000000`00000000 fffffa80`08f5cc10 00000000`00000000 fffffa80`0c0891a0 : ndis! ?? ::LNCPHCLB::`string’+0xddf
fffff880`03bd26f0 fffff800`034fb121 : fffffa80`091c7060 fffffa80`0c089050 fffff880`03bd2848 fffffa80`070bfa00 : ndis!ndisPnPDispatch+0x843
fffff880`03bd2790 fffff800`0367b3a1 : fffffa80`070bfa00 00000000`00000000 fffffa80`0dc19990 fffff880`03bd2828 : nt!IopSynchronousCall+0xe1
fffff880`03bd2800 fffff800`03675d78 : fffffa80`09196e00 fffffa80`070bfa00 00000000`0000030a 00000000`00000308 : nt!IopRemoveDevice+0x101
fffff880`03bd28c0 fffff800`0367aee7 : fffffa80`0dc19990 00000000`00000000 00000000`00000003 00000000`00000136 : nt!PnpSurpriseRemoveLockedDeviceNode+0x128
fffff880`03bd2900 fffff800`0367b000 : 00000000`00000000 fffff8a0`11d1c000 fffff8a0`049330d0 fffff880`03bd2a58 : nt!PnpDeleteLockedDeviceNode+0x37
fffff880`03bd2930 fffff800`0370b97f : 00000000`00000002 00000000`00000000 fffffa80`09122010 00000000`00000000 : nt!PnpDeleteLockedDeviceNodes+0xa0
fffff880`03bd29a0 fffff800`0370c53c : fffff880`03bd2b78 fffffa80`114ab700 fffffa80`07005600 fffffa80`00000000 : nt!PnpProcessQueryRemoveAndEject+0x6cf
fffff880`03bd2ae0 fffff800`035f573e : 00000000`00000000 fffffa80`114ab7d0 fffff8a0`123a25b0 00000000`00000000 : nt!PnpProcessTargetDeviceEvent+0x4c
fffff880`03bd2b10 fffff800`03298261 : fffff800`034f9f88 fffff8a0`11d1c010 fffff800`034342d8 fffff800`034342d8 : nt! ?? ::NNGAKEGL::`string’+0x54d9b
fffff880`03bd2b70 fffff800`0352b2ea : 00000000`00000000 fffffa80`07005660 00000000`00000080 fffffa80`06f8d890 : nt!ExpWorkerThread+0x111
fffff880`03bd2c00 fffff800`0327f8e6 : fffff880`03965180 fffffa80`07005660 fffff880`0396ffc0 00000000`00000000 : nt!PspSystemThreadStartup+0x5a
fffff880`03bd2c40 00000000`00000000 : fffff880`03bd3000 fffff880`03bcd000 fffff880`03bd2410 00000000`00000000 : nt!KxStartSystemThread+0x16
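As a quick sanity check on the numbers, the timeout in Arg2 and the wait time reported by !thread can be compared with a few lines of Python. The values are copied from the output above; the tick length is only an estimate derived from those two figures.

```python
timeout_seconds = 0x258                 # Arg2 from the bugcheck
minutes, seconds = divmod(timeout_seconds, 60)

wait_ticks = 38463                      # "Ticks:" from !thread
wait_seconds = 10 * 60 + 0.026          # 0:00:10:00.026 from !thread
tick_ms = wait_seconds / wait_ticks * 1000

print(f"timeout: {minutes} min {seconds} s")
print(f"approx. tick length: {tick_ms:.2f} ms")
```

The thread had been waiting for just over the timeout when the bugcheck fired, which fits the thread named in Arg3 being the one that held everything up.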

0: kd> !irp fffffa8008f5cc10
Irp is active with 10 stacks 10 is current (= 0xfffffa8008f5cf68)
No Mdl: No System Buffer: Thread fffffa8007005660:  Irp stack trace.
cmd  flg cl Device   File     Completion-Context
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
>[ 1b,17]   0  0 fffffa800c089050 00000000 00000000-00000000
\Driver\ZTEusbnet
Args: 00000000 00000000 00000000 00000000

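The current stack location belongs to ZTEusbnet and shows [1b,17], which is major code IRP_MJ_PNP with minor code IRP_MN_SURPRISE_REMOVAL, so the IRP this thread is stuck on is actually a PnP removal IRP rather than a power IRP. That matches the PnpSurpriseRemoveLockedDeviceNode call in the stack above. I’m not sure why, but the ZTEusbnet driver isn’t finishing with it: it’s just waiting on an event while the thread holds the PnP lock, so the power transition times out and that’s what caused the system to crash.
It’d be nice to know exactly what it was waiting for.
I’m not surprised though, given the date of the driver.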

0: kd> !devstack fffffa800c089050
!DevObj           !DrvObj            !DevExt           ObjectName
> fffffa800c089050  \Driver\ZTEusbnet  fffffa800c0891a0  NDMP14
fffffa80070bfa00  \Driver\usbccgp    fffffa80070bfb50  000000a8
!DevNode fffffa800dc19990 :
DeviceInst is “USB\VID_19D2&PID_0063&MI_04\6&200b5242&0&0004”
ServiceName is “ZTEusbnet”

We can see that the next driver down the stack is usbccgp, the USB common class generic parent driver, which to put it simply exposes each interface of a USB composite device as a separate device. An IRP that ZTEusbnet passes down goes to usbccgp and on towards the USB bus driver, which is where a power IRP would actually change the power state.

0: kd> lmvm ZTEusbnet
start             end                 module name
fffff880`0ae8e000 fffff880`0aebc000   ZTEusbnet   (no symbols)
Loaded symbol image file: ZTEusbnet.sys
Image path: \SystemRoot\system32\DRIVERS\ZTEusbnet.sys
Image name: ZTEusbnet.sys
Timestamp:        Mon Oct 13 06:50:10 2008 (48F2E192)
CheckSum:         000329ED
ImageSize:        0002E000
Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

This is a dump file from quite a while ago but if memory serves me correctly I think an update solved the issue.

Any other questions feel free to ask, I believe I’ve covered most things without going into detail about drivers.

Sources: Device nodes and device stacks (Windows Drivers)
Driver stacks (Windows Drivers)

0x133 DPC_WATCHDOG_VIOLATION

I’ve not posted in a while but I found an interesting case on a forum and managed to acquire a Kernel memory dump.
I’m not going into detail about DPCs or interrupts as I have made blog posts on these in the past.

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000000, A single DPC or ISR exceeded its time allotment. The offending
    component can usually be identified with a stack trace.
Arg2: 0000000000000501, The DPC time count (in ticks).
Arg3: 0000000000000500, The DPC time allotment (in ticks).
Arg4: 0000000000000000

So here it states that we encountered a DPC which exceeded the time allocated for it to finish executing. As stated before, DPCs can hold up the system when they take too long to execute, which can result in lagging, a slow system or even sound cutting out.
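The relationship between Arg2 and Arg3 can be modelled very simply. The sketch below is not how the kernel implements the watchdog (that logic is undocumented); it only assumes a counter that goes up once per clock tick while a DPC runs and is compared against the allotment. All names are made up for the example.

```python
class DpcWatchdogViolation(Exception):
    """Stand-in for bugcheck 0x133 in this model."""


def run_dpc(ticks_needed: int, allotment: int = 0x500) -> int:
    """Count clock ticks while a DPC runs; fail once the allotment is exceeded."""
    count = 0
    for _ in range(ticks_needed):
        count += 1                  # one clock tick passes at DISPATCH_LEVEL
        if count > allotment:
            raise DpcWatchdogViolation(
                f"DPC time count {count:#x} exceeded allotment {allotment:#x}")
    return count


print(run_dpc(0x500))               # finishes exactly on the allotment
try:
    run_dpc(0x600)                  # would run well past it
except DpcWatchdogViolation as err:
    print(err)
```

With an allotment of 0x500 the model trips on tick 0x501, which is the same pair of values shown in Arg2 and Arg3.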

So let’s look at our stack trace.

ffffd001`50c93c98 fffff800`9238bcc2 : 00000000`00000133 00000000`00000000 00000000`00000501 00000000`00000500 : nt!KeBugCheckEx
ffffd001`50c93ca0 fffff800`92271115 : 00000000`00000000 00000000`00000000 00000000`00000000 fffff801`dceabf17 : nt! ?? ::FNODOBFM::`string’+0x18b12
ffffd001`50c93d30 fffff800`929a07b5 : ffffe001`00400a02 fffff800`922fcae6 fffff801`daed3cf8 ffffe001`00008201 : nt!KeClockInterruptNotify+0x95
ffffd001`50c93f40 fffff800`922e80e3 : ffffd001`50c93f60 00000000`00000008 ffff5377`5487cf7d 00000000`0000000c : hal!HalpTimerClockIpiRoutine+0x15
ffffd001`50c93f70 fffff800`9236412a : ffffe001`9c600500 ffffe001`9e8de1a0 00000000`00000000 00000000`00000000 : nt!KiCallInterruptServiceRoutine+0xa3
ffffd001`50c93fb0 fffff800`92364a9b : 44454c49`4146203a 696c6564`206f7420 6e657665`20726576 20212121`20352074 : nt!KiInterruptSubDispatchNoLockNoEtw+0xea
ffffd001`50c853a0 fffff800`922e8383 : ffffe001`9e92d030 ffffe001`9e968030 00000000`02290a8d 00000000`00000018 : nt!KiInterruptDispatchNoLockNoEtw+0xfb
ffffd001`50c85530 fffff801`dcfa5751 : ffffe001`9e66e7a0 ffffe001`00000000 ffffe001`9e92fbe0 00000000`fffff850 : nt!KeAcquireSpinLockRaiseToDpc+0x13
ffffd001`50c85560 fffff801`dcfa531d : ffffe001`9e96b840 fffff801`dcf2c48f ffffe001`9eb68490 fffff801`dcf2c550 : athwbx+0x161751
ffffd001`50c855f0 fffff801`dcf60c42 : ffffe001`9e96b840 ffffd001`50c85650 ffffd001`50c85654 00000000`00000000 : athwbx+0x16131d
ffffd001`50c85630 fffff801`dcf33472 : ffffe001`9e9bf030 fffff801`00000000 ffffd001`50c856d0 fffff801`dd074319 : athwbx+0x11cc42
ffffd001`50c85680 fffff801`dd0c129f : ffffe001`9e9bf030 ffffffff`ffffffff ffffe001`9e6d97e8 fffff801`dd011189 : athwbx+0xef472
ffffd001`50c856f0 fffff801`dd08679e : ffffe001`9e968030 00000000`00000000 00000000`00000000 00000000`00000000 : athwbx+0x27d29f
ffffd001`50c85720 fffff801`dae9e81e : ffffe001`9e961030 00000000`00000000 ffffd001`50c85790 00000000`00000000 : athwbx+0x24279e
ffffd001`50c85760 fffff800`92252130 : ffffd001`50c85b00 00000000`00000000 00000000`00000200 fffff800`92274ae0 : ndis!ndisInterruptDpc+0x269ce
ffffd001`50c85860 fffff800`9225134b : ffffd001`50c5c180 ffffe001`9e8f4010 ffffe001`9c46b900 ffffe001`a12f3080 : nt!KiExecuteAllDpcs+0x1b0
ffffd001`50c859b0 fffff800`923667ea : ffffd001`50c5c180 ffffd001`50c5c180 ffffd001`50c682c0 ffffe001`9dbbb540 : nt!KiRetireDpcList+0xdb
ffffd001`50c85c60 00000000`00000000 : ffffd001`50c86000 ffffd001`50c80000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x5a

So in this call stack (reading from the bottom up) we see our processor in an idle loop; when idle, it executes any DPCs that are waiting in the DPC queue.
It begins to execute all the DPCs in the queue (also known as draining) and gets to the ndis interrupt DPC. This calls into the network driver, which then tries to acquire a spinlock and raise to DPC/Dispatch IRQL level if it isn’t there already (this is the standard routine that is used, I can’t remember if it is required). We then receive an interrupt, which turns out to be a clock interrupt, followed by a bugcheck.

Okay so we know that we bugchecked because a DPC was taking too long to finish executing and risked holding up the system, especially where spinlocks are concerned.

The main thing that interests me is why is there a clock interrupt?

3: kd> !dpcs
CPU Type      KDPC       Function
 3: Normal  : 0xffffe0019e66e880 0xfffff801dae78eb0 ndis!ndisMTimerObjectDpc
 3: Normal  : 0xffffd00150c61668 0xfffff80092327b28 nt!PpmPerfAction
 3: Normal  : 0xffffd0015589a280 0xfffff80092258854 nt!PopExecuteProcessorCallback
 3: Threaded: 0xffffd00150c617c0 0xfffff8009231a0a0 nt!KiDpcWatchdog

I believe the ndis interrupt DPC is related to this timer object but I may be wrong. If it is related then the clock interrupt makes sense, as the system requires clock interrupts at regular intervals in order to keep track of system time, thread run time and timers. Processes can modify the clock interrupt interval so that their timers are serviced more frequently; I’ll not go into detail as I will talk about timers another time.

The only problem is that I ran into a dead end: I couldn’t find anything related to the network driver in terms of modifying the clock interrupt interval.

3: kd> !list "-e -x \"dt nt!_EPROCESS @$extret-@@(#FIELD_OFFSET(nt!_EPROCESS, TimerResolutionLink)) ImageFileName SmallestTimerResolution RequestedTimerResolution\" nt!ExpTimerResolutionListHead"
dt nt!_EPROCESS @$extret-@@(#FIELD_OFFSET(nt!_EPROCESS, TimerResolutionLink)) ImageFileName SmallestTimerResolution RequestedTimerResolution
   +0x438 ImageFileName            : [15]  “???”
   +0x638 RequestedTimerResolution : 0x9c3d2000
   +0x63c SmallestTimerResolution  : 0xffffe001

dt nt!_EPROCESS @$extret-@@(#FIELD_OFFSET(nt!_EPROCESS, TimerResolutionLink)) ImageFileName SmallestTimerResolution RequestedTimerResolution
   +0x438 ImageFileName            : [15]  “svchost.exe”
   +0x638 RequestedTimerResolution : 0
   +0x63c SmallestTimerResolution  : 0x2710

So I thought I’d look into this a bit more.

3: kd> u ndis!ndisInterruptDpc+0x269ce
ndis!ndisInterruptDpc+0x269ce:
fffff801`dae9e81e 488b75a7        mov     rsi,qword ptr [rbp-59h]
fffff801`dae9e822 e9e297fdff      jmp     ndis!ndisInterruptDpc+0x1b9 (fffff801`dae78009)
fffff801`dae9e827 33d2            xor     edx,edx
fffff801`dae9e829 488d4dd7        lea     rcx,[rbp-29h]
fffff801`dae9e82d 448d420d        lea     r8d,[rdx+0Dh]
fffff801`dae9e831 e8160c0000      call    ndis!ndisPcwEndCycleCounter (fffff801`dae9f44c)
fffff801`dae9e836 90              nop
fffff801`dae9e837 e9d797fdff      jmp     ndis!ndisInterruptDpc+0x1c3 (fffff801`dae78013)

It appears the interrupt DPC routine is looping for some reason.

I can’t find anything on the cycle counter function as it is undocumented, but I’ll take a guess and say that it’s keeping track of the time the DPC has been executing. AFAIK this is done by using a counter on the currently executing thread to see how long it’s been running.

3: kd> u nt!KeAcquireSpinLockRaiseToDpc+0x13
nt!KeAcquireSpinLockRaiseToDpc+0x13:
fffff800`922e8383 f605fcac270021  test    byte ptr [nt!PerfGlobalGroupMask+0x6 (fffff800`92563086)],21h
fffff800`922e838a 751f            jne     nt!KeAcquireSpinLockRaiseToDpc+0x3b (fffff800`922e83ab)
fffff800`922e838c f0480fba2900    lock bts qword ptr [rcx],0
fffff800`922e8392 7209            jb      nt!KeAcquireSpinLockRaiseToDpc+0x2d (fffff800`922e839d)
fffff800`922e8394 0fb6c3          movzx   eax,bl
fffff800`922e8397 4883c420        add     rsp,20h
fffff800`922e839b 5b              pop     rbx
fffff800`922e839c c3              ret

Here we can see the same DPC routine trying to acquire a spinlock (the lock bts instruction), yet it’s not managing to do so and is therefore looping, all whilst still running at DPC level and therefore preventing normal thread execution on that processor.
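The spin itself is easy to model. The sketch below is a single-threaded Python illustration of a test-and-set loop like the lock bts / jb pair in the disassembly; it is not atomic and not the kernel’s code, and the max_spins limit is an addition so the example terminates (a real spinlock just keeps spinning).

```python
from typing import Optional


class SpinLock:
    def __init__(self) -> None:
        self.bit = 0                # 0 = free, 1 = held

    def test_and_set(self) -> int:
        """Model of `lock bts [rcx], 0`: set bit 0 and return its old value."""
        old = self.bit
        self.bit = 1
        return old

    def release(self) -> None:
        self.bit = 0


def acquire(lock: SpinLock, max_spins: int) -> Optional[int]:
    """Spin until the lock is taken; return the spin count, or None on give-up."""
    spins = 0
    while lock.test_and_set():      # old value 1 means someone else holds it
        spins += 1
        if spins >= max_spins:
            return None             # gave up; a real spinlock has no such limit
    return spins


lock = SpinLock()
print(acquire(lock, 1000))          # lock was free, taken straight away
print(acquire(lock, 1000))          # lock already held, so this just spins
```

If the owner never releases the lock, the second caller burns time in the loop at DPC level, which is the situation the watchdog exists to catch.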

The clock interrupt itself isn’t called by the driver; it is a periodic hardware interrupt, and going by the return address in KeAcquireSpinLockRaiseToDpc it arrived while the DPC was still inside the spinlock routine. The clock interrupt is where the system accounts for how long a DPC has been running.
The system realised that it was taking too long to complete and therefore bugchecked.

3: kd> lmvm athwbx
start             end                 module name
fffff801`dce44000 fffff801`dd1ff000   athwbx     (no symbols)          
    Loaded symbol image file: athwbx.sys
    Image path: \SystemRoot\system32\DRIVERS\athwbx.sys
    Image name: athwbx.sys
    Timestamp:        Thu Oct 17 10:46:01 2013 (525FB1D9)
    CheckSum:         003BC161
    ImageSize:        003BB000
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

So the network driver was quite outdated. The user updated it and the blue screens stopped, so it looks like it was an easy fix.

Thanks for reading.

Memory Management – Stacks

In this blog I’ll talk about stacks, what they are and how they are used in Windows.
We’ve come across the term before, but you don’t know that much about them unless you really look into them.

So a stack is an abstract data type that is implemented as a LIFO structure which means Last In First Out.

So from good old Wikipedia here’s a very good simple picture of a LIFO mechanism. We can see it uses Push to add data onto the stack and Pop to remove it. Now you know how the simple stack works, let’s go a bit more advanced.

A stack has a fixed origin within memory called (you know it) the stack origin, and items are added to it with the push instruction. It also has a stack pointer, which points to the address of the last item added to the stack. The pointer moves further away from the origin as more data is added, although this doesn’t necessarily mean it’s moving up through memory; it can move down.
The pointer cannot cross the origin at any time. If this happens a stack underflow occurs, which is normally caused by using pop more times than push.
A stack overflow occurs when push is used more times than the stack has room for, so the pointer moves past the end of the allocated region; in other words, data spills outside of the stack and into whatever lies next to it, which may be another stack.
This is a very big problem on Kernel stacks as there are no separate process address spaces to protect the memory: in Kernel mode everything is run from a single system address space that has access to the entire Kernel. When an overflow happens it can corrupt data on another stack that belongs to a completely different thread. The culprit on the current stack essentially flees, the thread whose stack was corrupted gets the blame, and a bugcheck is called once the corruption is detected. If this is the case, Driver Verifier should be enabled.
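The push, pop, overflow and underflow behaviour can be shown with a small fixed-size stack. This is a toy model in Python with made-up class names; unlike a real kernel stack it checks its bounds on every operation and raises an error instead of silently corrupting its neighbour.

```python
class StackOverflow(Exception):
    pass


class StackUnderflow(Exception):
    pass


class FixedStack:
    """Stack stored in a fixed array; slot 0 is the stack origin."""

    def __init__(self, capacity: int) -> None:
        self._slots = [None] * capacity
        self._sp = 0                # index of the next free slot

    def push(self, value) -> None:
        if self._sp == len(self._slots):
            raise StackOverflow("push past the end of the allocated region")
        self._slots[self._sp] = value
        self._sp += 1               # pointer moves away from the origin

    def pop(self):
        if self._sp == 0:
            raise StackUnderflow("pop would cross the stack origin")
        self._sp -= 1               # pointer moves back towards the origin
        return self._slots[self._sp]


s = FixedStack(2)
s.push("a")
s.push("b")
print(s.pop(), s.pop())             # last in, first out
```

A third push on this two-slot stack raises StackOverflow, and a pop on the empty stack raises StackUnderflow.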

A good picture to show how this works is as follows.

I rambled a bit here, but I just tried to briefly explain how stack overflows cause corruption that can bring the system to a halt, so device drivers and other kernel components should be written carefully and correctly to prevent these situations from happening.

Stacks can also be implemented within arrays, where the first element at offset zero is the stack origin and the stack builds from there.
Stacks can be implemented with linked lists as well. This is still a LIFO mechanism: pushing adds a new node at the head of the list and popping removes that same node, so the head of the list is always the top of the stack.
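For comparison, here is a minimal linked-list stack in Python. A stack built this way is normally still LIFO, with push and pop both working on the head node; the class names are just for illustration.

```python
from typing import Any, Optional


class Node:
    def __init__(self, value: Any, below: Optional["Node"]) -> None:
        self.value = value
        self.below = below          # link to the node pushed before this one


class LinkedStack:
    def __init__(self) -> None:
        self._top: Optional[Node] = None

    def push(self, value: Any) -> None:
        # The new node becomes the head and points at the old head.
        self._top = Node(value, self._top)

    def pop(self) -> Any:
        if self._top is None:
            raise IndexError("pop from empty stack")
        node = self._top
        self._top = node.below      # the previous node is the top again
        return node.value


stack = LinkedStack()
for item in (1, 2, 3):
    stack.push(item)
print(stack.pop(), stack.pop(), stack.pop())
```

Unlike the array version there is no fixed capacity, so the only error case is popping from an empty stack.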

There are generally three major types of stacks: User Stacks, Kernel Stacks and DPC Stacks.

For User Stacks, when a thread is created the memory manager reserves 1MB of memory by default, which can be altered when calling the CreateThread function. Once the thread is created only the first page and a guard page are committed. Data can then be added to the stack until the guard page is hit, at which point an exception occurs; this allows the stack to grow with demand, but it will never shrink back.

Kernel stacks are a lot smaller than user stacks: they typically range from 12KB (x86) to 16KB (x64), which excludes a guard page table entry that consumes an additional 4KB.
Kernel code tends to use less recursion than user-mode code and is generally written to be more efficient with stack space, which keeps stack sizes smaller. As stated before, Kernel code has a much larger impact on the system as it runs in a single system address space.

However, because interactions between the graphics system calls in win32k.sys and their subsequent callbacks into user mode are recursive, the Kernel implements a way to add stacks when a thread nears the guard page. Each additional stack provides a further 16KB, and the memory manager frees these stacks as the calls return.

The DPC stack is a per-processor stack (one for each processor) which is available for use every time DPCs are executed. DPCs get their own stack because they are generally unrelated to the current kernel stack’s operations, as they run in an arbitrary thread context.

I believe I’ve covered pretty much everything on stacks, and I hope that’s helped your understanding.

References:
http://en.wikipedia.org/wiki/Stack_%28abstract_data_type%29
Windows Internals

Interrupt dispatching and handling

In this post I’ll talk about interrupt dispatching and the types of interrupts. Interrupts have always been interesting yet slightly confusing at the same time, so I’ll try to explain what they are and the different types they come in.

So what is an interrupt?

It’s kind of in the name: it’s an asynchronous event that diverts the processor’s flow of control.
Interrupts generally come in two forms, hardware interrupts and software interrupts.
They can come from I/O devices, timers or processor clocks.

Hardware Interrupts

These interrupts come from external I/O devices through lines on the interrupt controller. When an IRQ (Interrupt Request) is received, it enters through a line on the interrupt controller, which converts the IRQ into a number that is used as an index into the IDT (Interrupt Dispatch Table). The trap handler then saves the context of the currently executing thread and the ISR (Interrupt Service Routine) is invoked. Once the interrupt is completed the context is restored, so the thread continues execution as if nothing had ever happened.

Interrupt controllers

Hardware interrupts use interrupt controllers which, generally speaking, come in two forms: PIC (Programmable Interrupt Controller) and APIC (Advanced Programmable Interrupt Controller). The PIC is a uniprocessor controller that is generally used on x86 systems and has 8 lines. However, a second PIC called a slave can be added, which contributes an additional 7 lines for a total of 15.
The APIC is a multiprocessor interrupt controller which is generally used on x64 systems and supports 256 lines; with this in play the PIC is quickly being phased out.

Here is an example of the IDT, which contains lots of different entries for specific interrupts. Trap handlers for exceptions, such as page faults, also use the IDT.
I will discuss later on how page faults come into play with bugchecks and IRQs, but for a more in-depth explanation of how page faults are handled take a look at my friend Patrick’s post over at Sysnative.com:

http://www.sysnative.com/forums/bsod-kernel-dump-analysis-debugging-information/10551-page-faults-explained.html

Software Interrupts

Although interrupt controllers implement their own prioritisation mechanisms, Windows uses its own technique for doing so: IRQLs.

These are IRQLs for x64, IA64 and x86 systems.
IRQLs are a way for interrupts to be prioritised appropriately. They aren’t handled on a first come, first served basis; rather, the higher the IRQL the higher the priority, so an interrupt at IRQL 15 would get serviced before one at IRQL 2.

To put this into perspective, an interrupt at IRQL 2 has to wait for any interrupts at IRQL 3 or above to be serviced first, because the IRQL cannot be lowered to its level until the higher-priority work has finished.
For example, if an interrupt is being serviced and another interrupt needs servicing, one of two things can happen.

One is that the current interrupt is put on hold and the new one is serviced.
Two is that the current interrupt finishes being serviced, and then the pending interrupt with the next highest IRQL is serviced.

Which one happens depends on the IRQLs of the two interrupts.
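
The two outcomes can be sketched in a few lines of Python. This is only a toy model of the rule described here, not how the Windows dispatcher is actually written; the function names on_interrupt and drain_pending are made up for the example.

```python
import heapq


def on_interrupt(current_irql, new_irql):
    """Decide what happens when an interrupt arrives while the CPU runs at current_irql."""
    # Only a strictly higher IRQL may interrupt the work in progress.
    return "preempt" if new_irql > current_irql else "pend"


def drain_pending(pending_irqls):
    """Service pending interrupts highest IRQL first, as the IRQL is lowered."""
    heap = [-irql for irql in pending_irqls]  # negate because heapq is a min-heap
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(-heapq.heappop(heap))
    return order


decision_high = on_interrupt(current_irql=2, new_irql=15)  # new one is serviced first
decision_low = on_interrupt(current_irql=15, new_irql=2)   # new one has to wait
order = drain_pending([2, 15, 3])
```

Here the IRQL 15 interrupt preempts work at IRQL 2, the IRQL 2 interrupt is left pending while the CPU is at 15, and the pending list drains as 15, 3, 2.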

Back to page faults.
A page fault occurs when code references memory that is not currently present in RAM. The page fault handler then has to bring the referenced memory in from disk, but in order to do that the IRQL must be at 1 or below, as this is the only time pageable memory can be accessed.
When a page fault occurs while the IRQL is higher than this, for example while servicing an interrupt, we bugcheck with either 0xA or 0xD1 (DRIVER_)IRQL_NOT_LESS_OR_EQUAL.

So why can’t we just lower the IRQL to service the page fault or wait for the current interrupt to finish?

Well, the IRQL cannot be lowered while an interrupt at that level is being serviced, as that has priority, and a page fault cannot wait as it must be serviced immediately.
You see the problem?
It’s an endless cycle, so the system bugchecks because it cannot make any further progress.
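
As a rough illustration, the decision boils down to a single comparison. The sketch below is a simplification under my own assumptions: the function name handle_page_fault and the faulting_code_is_driver flag are hypothetical, and the real choice between 0xA and 0xD1 involves more than one flag.

```python
APC_LEVEL = 1  # highest IRQL at which pageable memory may be touched


def handle_page_fault(irql, faulting_code_is_driver):
    """Return what happens to a page fault taken at the given IRQL (toy model)."""
    if irql <= APC_LEVEL:
        return "page in from disk and resume"
    # Above APC_LEVEL the fault can neither wait nor be serviced, so the system stops.
    if faulting_code_is_driver:
        return "bugcheck 0xD1 DRIVER_IRQL_NOT_LESS_OR_EQUAL"
    return "bugcheck 0xA IRQL_NOT_LESS_OR_EQUAL"


at_passive = handle_page_fault(irql=0, faulting_code_is_driver=True)
at_dispatch = handle_page_fault(irql=2, faulting_code_is_driver=True)
```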

I hope I’ve covered pretty much everything and I hope you’ve learned something.

I forgot to add: hardware interrupts (IRQs) are serviced at IRQLs above DPC/dispatch level, so the levels at DPC/dispatch and below are only used for software interrupts.

Instruction pointer misalignments

This time I’ll talk about instruction pointer misalignment.
So what is an instruction pointer misalignment?

Well, when code references memory it uses a pointer to (you guessed it) point to a certain memory address. Reading or writing the data stored at that address is known as dereferencing.

When a pointer is misaligned it grabs data from the wrong address, which can cause severe memory corruption depending on the contents of the address being referenced. If it is allowed to write, it can completely corrupt the data there; the culprit can escape, and then some innocent pointer comes along, tries to use the address and gets blamed by the computer police.
This is why bugchecks are called: to prevent such memory corruption. Data structures are arranged so that they are read and written in chunks of 4 bytes (sometimes larger), so memory offsets will be a multiple of the word size. This is done to maximise performance by matching the way the CPU handles memory.

When the address being referenced isn’t a multiple of 4, that’s when things go wrong. It generally results in an alignment fault, which is also known as a bus error.

This instruction taken from a crash dump can explain this a little bit.
    nt!MmCleanProcessAddressSpace+0xe6

So nt is the module (the Windows kernel), and the Mm prefix tells us it’s a Memory Manager function.
CleanProcessAddressSpace is the actual function name; as the name suggests, it cleans up a process address space.
The +0xe6 is the offset, which is like the house number on a street: it’s how many bytes into the function the instruction is located.
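
If it helps to see the three parts pulled apart, here is a small Python sketch that splits a symbol of this shape. The helper name parse_symbol is my own, and it only handles the simple module!Function+0xNN form, not every symbol the debugger can print.

```python
import re


def parse_symbol(text):
    """Split 'module!Function+0xNN' into (module, function, offset in bytes)."""
    match = re.fullmatch(r"(\w+)!(\w+)(?:\+0x([0-9a-fA-F]+))?", text)
    if match is None:
        raise ValueError(f"not a module!function+offset symbol: {text!r}")
    module, function, offset = match.groups()
    # A symbol with no '+0x..' part points at the very start of the function.
    return module, function, (int(offset, 16) if offset else 0)


module, function, offset = parse_symbol("nt!MmCleanProcessAddressSpace+0xe6")
```

For the symbol above this gives the module nt, the function MmCleanProcessAddressSpace and an offset of 230 bytes (0xe6 in decimal).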

I was actually looking at the differences between a segmentation fault and a bus error, as they both involve the CPU not being able to access the memory being referenced.

  • The segmentation fault (or access violation exception) occurs when memory outside of the allowed region is referenced (not to be confused with buffer overruns, which involve writing outside allocated memory into another buffer).
  • The bus error occurs when an address which is not aligned is referenced, which, as mentioned above, means an address that isn’t a multiple of 4.
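
The alignment test itself is tiny. Here is a hedged Python sketch using the multiple-of-4 rule from this post; the function name is_aligned and the sample addresses are made up for illustration.

```python
def is_aligned(address, alignment=4):
    """True when address is a multiple of alignment (alignment must be a power of two)."""
    # For a power of two, the low bits of a multiple are all zero.
    return (address & (alignment - 1)) == 0


aligned = is_aligned(0x1000)      # multiple of 4
misaligned = is_aligned(0x1002)   # two bytes past a 4-byte boundary
```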

Another thing to note: in dump files where the analysis reports a misaligned pointer, it mentions that it’s probably caused by hardware. As I’ve mentioned, that’s probably because the CPU cannot address memory that isn’t aligned to a multiple of 4, so it looks as though the CPU isn’t able to read it at all.
Misaligned IPs don’t always result in a bus error. They can also be caused by drivers writing more data into a buffer on a stack than it can hold, which overflows the stack; this also results in a bugcheck to prevent critical memory corruption.

I hope this has helped you understand the differences and more about instruction pointers.

    Hexadecimal and Binary

    This post will be a little different to my usual debugging posts.
    I will be talking about hexadecimal and binary. They can be difficult to fully understand, but we should be able to get through it.

    Now, at school, I was never really good at Maths and I struggled with a lot of things, but I’ve picked up a few things through debugging, as Windows Internals uses these number systems for operations that would be awkward in decimal.

    Binary can be difficult to get your head round, but computers use it because it makes things a lot simpler for them.
    Remember at school when you had to use a T chart and count in tens?
    So “10” would be 1 ten and 0 ones; in binary “10” means 1 two and 0 ones.
    “100” in binary would be 4 (2×2), “1000” would be eight (4×2) etc.
    Generally, binary suits computers because each digit has only two values, which map directly onto on and off.
    There are also far fewer rules compared to decimal, which actually simplifies things (for the computer), but we need tools to convert the values into something we can make sense of.

    Hexadecimal is used because it writes large numbers with fewer digits, and it converts to and from binary easily because each hexadecimal digit corresponds to exactly four bits. Instead of powers of 10, hexadecimal uses powers of 16, so “10” in hexadecimal would be 16 as it’s 1 sixteen and 0 ones.
    “25” in hexadecimal would be 2 sixteens and 5 ones so it would be 37.
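
The place-value idea works the same way in any base, so it can be written once. This is a small Python sketch of the T chart method; the function name to_decimal is my own, and Python's built-in int(text, base) does the same job.

```python
DIGITS = "0123456789ABCDEF"


def to_decimal(text, base):
    """Convert a string of digits in the given base (2 to 16) to an integer."""
    total = 0
    for char in text.upper():
        digit = DIGITS.index(char)   # 'A' is 10, 'F' is 15
        if digit >= base:
            raise ValueError(f"digit {char!r} is not valid in base {base}")
        # Shift everything so far one column to the left, then add the new digit.
        total = total * base + digit
    return total


binary_10 = to_decimal("10", 2)      # 1 two and 0 ones
binary_1000 = to_decimal("1000", 2)  # 1 eight
hex_10 = to_decimal("10", 16)        # 1 sixteen and 0 ones
hex_25 = to_decimal("25", 16)        # 2 sixteens and 5 ones
```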

    But how does that work, as there aren’t single digits for 10 to 15?
    This is why we have the letters A to F for 10 to 15; here’s a good conversion chart to help you understand.

    Here we can see how they all convert into each other. Obviously, the higher the figures, the more difficult they become to convert in your head.
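
A chart of this kind for the values 0 to 15 can be regenerated with a few lines of Python. This is just a sketch using the built-in format function; the name conversion_chart is made up.

```python
def conversion_chart():
    """Return (decimal, hexadecimal, binary) rows for the values 0 to 15."""
    rows = []
    for value in range(16):
        # "X" gives one uppercase hex digit, "04b" gives four binary digits.
        rows.append((value, format(value, "X"), format(value, "04b")))
    return rows


for decimal, hexadecimal, binary in conversion_chart():
    print(f"{decimal:>7}  {hexadecimal:>3}  {binary}")
```

Each hexadecimal digit lines up with exactly four binary digits, which is why converting between the two is so convenient.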

    Hopefully this has helped a lot of you understand if you didn’t already.

    0x7F (memory leak)

    In this post, we will be looking at a memory leak caused by a program called NotMyFault, which is supplied by Sysinternals. They have some excellent tools you should check out if interested.
    To download NotMyFault, here’s the link.

    http://live.sysinternals.com/Files/NotMyFault.zip

    Let’s take a look.

    BugCheck 7F, {8, 80050033, 406f8, fffff80002e69f2c}

    This bugcheck indicates the Kernel encountered a trap which it’s not allowed to catch, meaning it cannot be resolved and the system must bugcheck. In this case the first parameter (8) tells us the trap was a double fault.
    A double fault occurs when an exception takes place during the processing of another exception; if a further exception occurs while processing the double fault, a triple fault occurs.

    Looking at the callstack, this is what we see. Do note this is only a small snippet, as the callstack is massive with repeats of Nvidia driver functions at the same address.

    fffff880`02fddce8 fffff800`02ec7169 : 00000000`0000007f 00000000`00000008 00000000`80050033 00000000`000406f8 : nt!KeBugCheckEx
    fffff880`02fddcf0 fffff800`02ec5632 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x69
    fffff880`02fdde30 fffff800`02e69f2c : fffffa80`035d4000 00000000`00000000 00000000`00000000 fffff800`02ff947c : nt!KiDoubleFaultAbort+0xb2
    fffff880`009ab000 fffff800`02ff947c : 00000000`00000000 fffff880`009ab080 00000000`00000000 00000000`00000000 : nt!MiExpandNonPagedPool+0x14
    fffff880`009ab020 fffff800`02ffbf26 : fffff800`030586c0 00000000`00000003 00000000`00000000 fffff880`049f9c05 : nt!MiAllocatePoolPages+0xdfd
    fffff880`009ab160 fffff880`04a1ea55 : 00000000`00000000 00000000`00000001 fffff880`009ab2b8 fffff880`00000000 : nt!ExAllocatePoolWithTag+0x316
    fffff880`009ab250 fffff880`04a1b6e8 : fffffa80`05b75000 00000000`00000002 00000000`00000002 fffffa80`036a7000 : nvlddmkm+0x1bfa55
    fffff880`009ab280 fffff880`04ae392a : fffff880`009ab318 fffffa80`00000018 fffffa80`036a7000 fffffa80`05b75000 : nvlddmkm+0x1bc6e8
    fffff880`009ab2e0 fffff880`04b9f804 : 00000000`00100005 00000000`00000000 00000000`00100006 fffffa80`05b75000 : nvlddmkm+0x28492a
    fffff880`009ab310 fffff880`04b9f827 : 00000000`00100004 00000000`00100006 fffffa80`05b75000 fffffa80`05b75000 : nvlddmkm+0x340804
    fffff880`009ab350 fffff880`04b9f827 : 00000000`00100004 00000000`00100006 fffffa80`05b75000 fffffa80`05b75000 : nvlddmkm+0x340827
    fffff880`009ab390 fffff880`04b9f827 : 00000000`00100004 00000000`00100006 fffffa80`05b75000 fffffa80`05b75000 : nvlddmkm+0x340827
    fffff880`009ab3d0 fffff880`04b9f827 : 00000000`00100004 00000000`00100006 fffffa80`05b75000 fffffa80`05b75000 : nvlddmkm+0x340827
    fffff880`009ab410 fffff880`04b9f827 : 00000000`00100004 00000000`00100006 fffffa80`05b75000 fffffa80`05b75000 : nvlddmkm+0x340827
    fffff880`009ab450 fffff880`04b9f827 : 00000000`00100004 00000000`00100006 fffffa80`05b75000 fffffa80`05b75000 : nvlddmkm+0x340827
    fffff880`009ab490 fffff880`04b9f827 : 00000000`00100004 00000000`00100006 fffffa80`05b75000 fffffa80`05b75000 : nvlddmkm+0x340827
    fffff880`009ab4d0 fffff880`04b9f827 : 00000000`00100004 00000000`00100006 fffffa80`05b75000 fffffa80`05b75000 : nvlddmkm+0x340827

    So what is happening is the Nvidia driver is being blamed (probably due to it being on the stack when the last context was saved, which was the exception). It is calling lots of functions which appear to be allocating more pages until a double fault is initiated. I suspect the double fault occurred because memory could not be allocated, which caused an exception, and then another exception occurred while handling it.
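
One thing the repeated frames do tell us is how much stack each call uses. The Child-SP column is the stack pointer for each frame, so subtracting neighbouring values gives the frame size. The sketch below uses five Child-SP values copied from the nvlddmkm+0x340827 frames in the callstack above; everything else is plain arithmetic.

```python
# Child-SP values of consecutive nvlddmkm+0x340827 frames from the callstack.
child_sp = [
    0xfffff880009ab350,
    0xfffff880009ab390,
    0xfffff880009ab3d0,
    0xfffff880009ab410,
    0xfffff880009ab450,
]

# The stack grows downwards, so the caller's Child-SP is the larger value.
frame_sizes = [upper - lower for lower, upper in zip(child_sp, child_sp[1:])]
```

Every difference comes out as 0x40, so each repeat of that frame takes 64 bytes of kernel stack, which is what you would expect to see from the same function being called over and over.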

    So looking at the virtual memory usage we can see the following.

    3: kd> !vm

    *** Virtual Memory Usage ***
        Physical Memory:     1036418 (   4145672 Kb)
        Page File: \??\C:\pagefile.sys
          Current:   4145672 Kb  Free Space:   3702732 Kb
          Minimum:   4145672 Kb  Maximum:     12437016 Kb
        Available Pages:      100902 (    403608 Kb)
        ResAvail Pages:       209219 (    836876 Kb)
        Locked IO Pages:           0 (         0 Kb)
        Free System PTEs:   33504448 ( 134017792 Kb)
        Modified Pages:         4479 (     17916 Kb)
        Modified PF Pages:      4364 (     17456 Kb)
        NonPagedPool Usage:   764909 (   3059636 Kb)
        NonPagedPool Max:     764972 (   3059888 Kb)
        ********** Excessive NonPaged Pool Usage *****

    We can see that the non paged pool memory has been completely depleted, which caused the system to crash.
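
The figures in the !vm output are page counts with the size in Kb in brackets, so they can be checked by hand. Here is a quick Python sketch, assuming 4 Kb pages (which is what the numbers in the output work out to); the helper name pages_to_kb is my own.

```python
PAGE_SIZE_KB = 4  # assumption: 4 Kb pages, consistent with the !vm figures


def pages_to_kb(pages):
    """Convert a page count from !vm into kilobytes."""
    return pages * PAGE_SIZE_KB


physical_kb = pages_to_kb(1036418)         # Physical Memory line

nonpaged_usage = 764909                    # NonPagedPool Usage (pages)
nonpaged_max = 764972                      # NonPagedPool Max (pages)
pages_left = nonpaged_max - nonpaged_usage
percent_used = nonpaged_usage / nonpaged_max * 100
```

That leaves only 63 pages (252 Kb) of non paged pool out of roughly 3 GB, so the pool is more than 99.99% used.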
    Now you might be asking, can’t it just put the memory onto disk to stop it crashing?
    Well, moving memory from RAM onto disk is known as paging, which is used to free up RAM when memory usage is high. However, Kernel pool memory is divided into two main categories:

    -Paged Pool
    -Non Paged Pool

    Paged pool is for allocations that can be moved to disk when not in use to free up RAM. Non paged pool, on the other hand, can’t be moved to disk under any circumstances, as device drivers and other critical operating system components need these allocations to be available immediately in order to function correctly.

    So why can’t they just page the memory back from disk when needed?

    Well it’s not that simple. Paging can be very expensive in that it takes time and puts a lot of pressure on the drive, which is much slower than RAM.
    Not only that, but the IRQL must be at 1 or below in order to page memory in; when the IRQL is higher than 1, paging is not allowed. Say, for example, an interrupt needs servicing quickly at an IRQL of 7. That may need the device driver to perform certain tasks, but it can’t if the memory it needs is paged out, and we can’t page it in because the IRQL needs to be at 1 or below.
    We can’t just lower the IRQL either, because the higher the IRQL the higher the priority, so the result is a bugcheck of 0xA or 0xD1.

    So why is the memory being leaked and what is it?

    A memory leak occurs when an object acquires memory but doesn’t free it after it has finished using it, which prevents those pages from being allocated to anything else even though they’re no longer in use.
    If the object keeps calling ExAllocatePool then it keeps allocating memory without using it; just because the pages aren’t in use doesn’t mean they can be used by anything else.
    So when the last of the non paged pool has been used up, critical components cannot allocate the memory they need to function, and the system crashes.
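
The pattern is easy to reproduce in miniature. The Python below is a toy model only: the NonPagedPool class, its 100-page limit and the 10-page allocations are all invented for the example and have nothing to do with how the real pool allocator is implemented.

```python
class NonPagedPool:
    """Toy pool with a hard page limit; nothing in it can be paged out."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self.used_pages = 0

    def allocate(self, pages):
        if self.used_pages + pages > self.max_pages:
            return False            # allocation fails, nothing left to hand out
        self.used_pages += pages
        return True

    def free(self, pages):
        self.used_pages -= pages


pool = NonPagedPool(max_pages=100)

leaky_allocations = 0
while pool.allocate(10):            # the leak: allocate is never paired with free
    leaky_allocations += 1

critical_ok = pool.allocate(1)      # a critical component now asks for one page
```

After ten leaked allocations the pool is full, and the one-page request from something important fails even though none of the leaked pages are being used.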

    We can look at the assembly instructions to see what is happening.

    3: kd> .trap fffff880`02fdde30
    NOTE: The trap frame does not contain all registers.
    Some register values may be zeroed or incorrect.
    rax=00000000000bac2c rbx=0000000000000000 rcx=0000000000000001
    rdx=fffff880009ab0b8 rsi=0000000000000000 rdi=0000000000000000
    rip=fffff80002e69f2c rsp=fffff880009ab000 rbp=fffff880009ab080
     r8=ffffffffffffffff  r9=fffffa80035eb5b8 r10=00000000ffffffff
    r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
    r14=0000000000000000 r15=0000000000000000
    iopl=0         nv up ei pl zr na po nc
    nt!MiExpandNonPagedPool+0x14:
    fffff800`02e69f2c 4156            push    r14

    So it’s calling a function which I believe tries to expand non paged pool so that more of it can be allocated, as the pool might currently be too small.

     3: kd> u nt!MiExpandNonPagedPool+0x14
    nt!MiExpandNonPagedPool+0x14:
    fffff800`02e69f2c 4156            push    r14
    fffff800`02e69f2e 4157            push    r15
    fffff800`02e69f30 4881ecd0000000  sub     rsp,0D0h
    fffff800`02e69f37 488db9ff010000  lea     rdi,[rcx+1FFh]
    fffff800`02e69f3e 4881e700feffff  and     rdi,0FFFFFFFFFFFFFE00h
    fffff800`02e69f45 483bf9          cmp     rdi,rcx
    fffff800`02e69f48 0f82cfbffeff    jb      nt! ?? ::FNODOBFM::`string’+0x1e009 (fffff800`02e55f1d)
    fffff800`02e69f4e 488b0513762100  mov     rax,qword ptr [nt!MiSystemVaTypeCount+0x28 (fffff800`03081568)]

    So here we can see push instructions, which add data onto the stack, but because there is no more memory left it cannot add the data and crashes.
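
As a side note, the lea, and and cmp/jb instructions in that listing are a common round-up pattern: add 0x1FF, clear the low nine bits, then check the result did not wrap around. The Python sketch below mirrors just that arithmetic. What the value in rcx actually represents isn't stated in the dump, so the function name round_up_to_512 only describes the maths, not the purpose.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF  # registers are 64 bits wide, Python ints are not


def round_up_to_512(rcx):
    """Mirror lea/and/cmp: round rcx up to a multiple of 0x200 and flag wrap-around."""
    rdi = (rcx + 0x1FF) & MASK_64      # lea rdi,[rcx+1FFh]
    rdi &= 0xFFFFFFFFFFFFFE00          # and rdi,0FFFFFFFFFFFFFE00h
    wrapped = rdi < rcx                # cmp rdi,rcx / jb (taken when rdi is below rcx)
    return rdi, wrapped


from_trap_frame = round_up_to_512(0x1)  # rcx=1 in the trap frame above
```

With rcx=1 from the trap frame the rounded value is 0x200 and the jb branch would not be taken.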

    0x9F DRIVER_POWER_STATE_FAILURE

    First off I’d like to say I’m sorry I’ve not been posting in a while but I’ll try to post a bit more.

    BugCheck 9F, {4, 258, fffffa8007005660, fffff800053e83d0}

    There are two common types of 0x9F bugcheck, indicated by the first parameter. The first is indicated with a 3, which means a device object has been holding onto an IRP for too long, so the system bugchecked as it holds everything else up.
    The second one, indicated with a 4, is what we’re going to look at. It means a thread is holding onto a power IRP for too long, which causes it to time out and bugcheck.

    In more detail, this bugcheck indicates a power IRP failed to synchronise with the PnP manager. A power IRP is an I/O Request Packet that sends power transitions down a device stack.
    All power IRPs must reach the PDO (Physical Device Object) at the bottom of the stack to ensure power transitions are done correctly.
    When one doesn’t reach the bottom for any reason the system bugchecks; in this case a thread was blocking the IRP so it couldn’t be completed within the allocated time interval.

    So let’s look at the locks on the system which are blocking the IRP.
    To understand what this means we need to know what a lock is (ERESOURCE structure). Locks are synchronisation mechanisms that allow drivers to share access to resources safely.
    A lock can be acquired in two main ways, exclusive and shared: an exclusive lock has a single owning thread, while a shared lock can be held by multiple threads at once.
    Together they form a read/write mechanism where only one thread can write but multiple threads can read simultaneously.
    Acquiring a lock exclusively requires that no other threads currently hold it; a thread that cannot acquire the lock straight away is put into a wait state until it becomes available.
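
The shared and exclusive rules can be sketched as a tiny state machine. This Python is a heavily simplified model and not the ERESOURCE implementation: the ResourceLock class and its method names are invented, and instead of putting a thread into a wait state it simply reports that the caller would have to wait.

```python
class ResourceLock:
    """Toy reader/writer lock: many shared owners or one exclusive owner."""

    def __init__(self):
        self.exclusive_owner = None
        self.shared_owners = set()

    def try_acquire_shared(self, thread):
        if self.exclusive_owner is not None:
            return False                     # a writer holds it, reader must wait
        self.shared_owners.add(thread)
        return True

    def try_acquire_exclusive(self, thread):
        if self.exclusive_owner is not None or self.shared_owners:
            return False                     # anyone holding it blocks a writer
        self.exclusive_owner = thread
        return True

    def release(self, thread):
        if self.exclusive_owner == thread:
            self.exclusive_owner = None
        self.shared_owners.discard(thread)


lock = ResourceLock()
reader_a = lock.try_acquire_shared("thread A")
reader_b = lock.try_acquire_shared("thread B")      # readers can share
writer_blocked = lock.try_acquire_exclusive("thread C")
lock.release("thread A")
lock.release("thread B")
writer_ok = lock.try_acquire_exclusive("thread C")  # free now, so the writer gets it
```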

    This was only a brief explanation, for more information check out this article:

    http://msdn.microsoft.com/en-us/library/ff548046.aspx

    Back to the topic, looking at the locks.

    0: kd> !locks
    **** DUMP OF ALL RESOURCE OBJECTS ****
    KD: Scanning for held locks..

    Resource @ nt!IopDeviceTreeLock (0xfffff80003492ce0)    Shared 1 owning threads
        Contention Count = 1
         Threads: fffffa8007005660-01
    KD: Scanning for held locks.

    Resource @ nt!PiEngineLock (0xfffff80003492be0)    Exclusively owned
        Contention Count = 21
        NumberOfExclusiveWaiters = 1
         Threads: fffffa8007005660-01
         Threads Waiting On Exclusive Access:
                  fffffa800f308b50      

    KD: Scanning for held locks…………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
    18855 total locks, 2 locks currently held

    Let’s look at the thread that owns the lock exclusively.

    0: kd> !thread fffffa8007005660
    THREAD fffffa8007005660  Cid 0004.0048  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
        fffffa800d035ee8  NotificationEvent
    IRP List:
        fffffa8008f5cc10: (0006,03e8) Flags: 00000000  Mdl: 00000000
    Not impersonating
    DeviceMap                 fffff8a000008c10
    Owning Process            fffffa8006f8d890       Image:         System
    Attached Process          N/A            Image:         N/A
    Wait Start TickCount      396427         Ticks: 38463 (0:00:10:00.026)
    Context Switch Count      44059          IdealProcessor: 2  NoStackSwap
    UserTime                  00:00:00.000
    KernelTime                00:00:00.343
    Win32 Start Address nt!ExpWorkerThread (0xfffff80003298150)
    Stack Init fffff88003bd2c70 Current fffff88003bd2280
    Base fffff88003bd3000 Limit fffff88003bcd000 Call 0
    Priority 15 BasePriority 12 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
    Child-SP          RetAddr           : Args to Child                                                           : Call Site
    fffff880`03bd22c0 fffff800`032845f2 : fffffa80`07005660 fffffa80`07005660 00000000`00000000 00000000`00000000 : nt!KiSwapContext+0x7a
    fffff880`03bd2400 fffff800`0329599f : fffffa80`0d0df208 fffff880`0ae9e10b fffffa80`00000000 00000000`00000000 : nt!KiCommitThreadWait+0x1d2
    fffff880`03bd2490 fffff880`0ae915dd : fffffa80`0d035000 00000000`00000000 fffffa80`0dd8ca00 00000000`00000000 : nt!KeWaitForSingleObject+0x19f
    fffff880`03bd2530 fffff880`0ae92627 : fffffa80`0d035000 00000000`00000000 fffffa80`0c0891a0 fffff880`03bd2670 : ZTEusbnet+0x35dd
    fffff880`03bd2580 fffff880`0215d809 : fffffa80`0c0891a0 fffff880`020f0ecd fffff880`03bd2670 fffffa80`091c5550 : ZTEusbnet+0x4627
    fffff880`03bd25b0 fffff880`0215d7d0 : fffffa80`091c54a0 fffffa80`0c0891a0 fffff880`03bd2670 fffffa80`08fc2ac0 : ndis!NdisFDevicePnPEventNotify+0x89
    fffff880`03bd25e0 fffff880`0215d7d0 : fffffa80`08fc2a10 fffffa80`0c0891a0 fffffa80`091f9010 fffffa80`091f90c0 : ndis!NdisFDevicePnPEventNotify+0x50
    fffff880`03bd2610 fffff880`0219070c : fffffa80`0c0891a0 00000000`00000000 00000000`00000000 fffffa80`0c0891a0 : ndis!NdisFDevicePnPEventNotify+0x50
    fffff880`03bd2640 fffff880`021a1da2 : 00000000`00000000 fffffa80`08f5cc10 00000000`00000000 fffffa80`0c0891a0 : ndis! ?? ::LNCPHCLB::`string’+0xddf
    fffff880`03bd26f0 fffff800`034fb121 : fffffa80`091c7060 fffffa80`0c089050 fffff880`03bd2848 fffffa80`070bfa00 : ndis!ndisPnPDispatch+0x843
    fffff880`03bd2790 fffff800`0367b3a1 : fffffa80`070bfa00 00000000`00000000 fffffa80`0dc19990 fffff880`03bd2828 : nt!IopSynchronousCall+0xe1
    fffff880`03bd2800 fffff800`03675d78 : fffffa80`09196e00 fffffa80`070bfa00 00000000`0000030a 00000000`00000308 : nt!IopRemoveDevice+0x101
    fffff880`03bd28c0 fffff800`0367aee7 : fffffa80`0dc19990 00000000`00000000 00000000`00000003 00000000`00000136 : nt!PnpSurpriseRemoveLockedDeviceNode+0x128
    fffff880`03bd2900 fffff800`0367b000 : 00000000`00000000 fffff8a0`11d1c000 fffff8a0`049330d0 fffff880`03bd2a58 : nt!PnpDeleteLockedDeviceNode+0x37
    fffff880`03bd2930 fffff800`0370b97f : 00000000`00000002 00000000`00000000 fffffa80`09122010 00000000`00000000 : nt!PnpDeleteLockedDeviceNodes+0xa0
    fffff880`03bd29a0 fffff800`0370c53c : fffff880`03bd2b78 fffffa80`114ab700 fffffa80`07005600 fffffa80`00000000 : nt!PnpProcessQueryRemoveAndEject+0x6cf
    fffff880`03bd2ae0 fffff800`035f573e : 00000000`00000000 fffffa80`114ab7d0 fffff8a0`123a25b0 00000000`00000000 : nt!PnpProcessTargetDeviceEvent+0x4c
    fffff880`03bd2b10 fffff800`03298261 : fffff800`034f9f88 fffff8a0`11d1c010 fffff800`034342d8 fffff800`034342d8 : nt! ?? ::NNGAKEGL::`string’+0x54d9b
    fffff880`03bd2b70 fffff800`0352b2ea : 00000000`00000000 fffffa80`07005660 00000000`00000080 fffffa80`06f8d890 : nt!ExpWorkerThread+0x111
    fffff880`03bd2c00 fffff800`0327f8e6 : fffff880`03965180 fffffa80`07005660 fffff880`0396ffc0 00000000`00000000 : nt!PspSystemThreadStartup+0x5a
    fffff880`03bd2c40 00000000`00000000 : fffff880`03bd3000 fffff880`03bcd000 fffff880`03bd2410 00000000`00000000 : nt!KxStartSystemThread+0x16

    As a brief explanation, looking at the callstack we can see ndis functions calling into the ZTEusbnet network driver about a PnP event. This looks like it’s due to the power IRP being sent down the stack but being blocked by the network driver, so it cannot get to the bottom of the stack. I believe the bottom in this case is pci.sys, but I’m not too sure given it’s a USB network card and not a PCI card.
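
The wait time in the !thread output also ties back to the bugcheck parameters. For a 0x9F with a first parameter of 4, the second parameter (0x258 here) is the timeout in seconds, and the thread shows 38463 ticks spent waiting. The Python sketch below converts both; the tick length of 15.6001 ms is my assumption of the usual default clock interval, since the dump output shown here doesn't state it.

```python
TICK_100NS = 156001  # assumption: 15.6001 ms per clock tick, in 100 ns units


def ticks_to_seconds(ticks):
    """Convert a tick count from !thread into seconds."""
    return ticks * TICK_100NS / 10_000_000


timeout_seconds = 0x258                    # second bugcheck parameter
waited_seconds = ticks_to_seconds(38463)   # "Ticks: 38463" from !thread
```

0x258 is 600 seconds, and 38463 ticks works out at just over 600 seconds too, which matches the 0:00:10:00.026 shown by the debugger. So this thread had been sitting in its wait for the full ten minute timeout.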

    So let’s look at the IRP.

    0: kd> !irp fffffa8008f5cc10 7
    Irp is active with 10 stacks 10 is current (= 0xfffffa8008f5cf68)
     No Mdl: No System Buffer: Thread fffffa8007005660:  Irp stack trace. 
    Flags = 00000000
    ThreadListEntry.Flink = fffffa8007005a50
    ThreadListEntry.Blink = fffffa8007005a50
    IoStatus.Status = c00000bb
    IoStatus.Information = 00000000
    RequestorMode = 00000000
    Cancel = 00
    CancelIrql = 0
    ApcEnvironment = 00
    UserIosb = fffff88003bd27c0
    UserEvent = fffff88003bd27d0
    Overlay.AsynchronousParameters.UserApcRoutine = 00000000
    Overlay.AsynchronousParameters.UserApcContext = 00000000
    Overlay.AllocationSize = 00000000 – 00000000
    CancelRoutine = 00000000  
    UserBuffer = 00000000
    &Tail.Overlay.DeviceQueueEntry = fffffa8008f5cc88
    Tail.Overlay.Thread = fffffa8007005660
    Tail.Overlay.AuxiliaryBuffer = 00000000
    Tail.Overlay.ListEntry.Flink = 00000000
    Tail.Overlay.ListEntry.Blink = 00000000
    Tail.Overlay.CurrentStackLocation = fffffa8008f5cf68
    Tail.Overlay.OriginalFileObject = 00000000
    Tail.Apc = 00000000
    Tail.CompletionKey = 00000000
         cmd  flg cl Device   File     Completion-Context
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                Args: 00000000 00000000 00000000 00000000
    >[ 1b,17]   0  0 fffffa800c089050 00000000 00000000-00000000   
               \Driver\ZTEusbnet

                Args: 00000000 00000000 00000000 00000000
    IO verifier information:
    No information available – the verifier is probably disabled

    So here we can see the IRP reached ZTEusbnet but stopped there, so I think this driver is to blame.
    One last thing, let’s look at the device object.

    0: kd> !devobj fffffa800c089050
    Device object (fffffa800c089050) is for:
     NDMP14 \Driver\ZTEusbnet DriverObject fffffa800deeae70
    Current Irp 00000000 RefCount 0 Type 00000017 Flags 00002050
    Dacl fffff9a10009b881 DevExt fffffa800c0891a0 DevObjExt fffffa800c08a8c0
    ExtensionFlags (0x00000800)  DOE_DEFAULT_SD_PRESENT
    Characteristics (0x00000100)  FILE_DEVICE_SECURE_OPEN
    AttachedTo (Lower) fffffa80070bfa00 \Driver\usbccgp
    Device queue is not busy.

    So we can see the network driver is the upper layer, and the lower layer is usbccgp, which is the USB generic parent driver.
    I believe the fix would be to update the driver, although I’ve had no reply from the OP since.

    I checked the timestamp for the network driver and it’s very outdated which is probably why it was causing such issues.

    0: kd> lm vm ZTEusbnet
    start             end                 module name
    fffff880`0ae8e000 fffff880`0aebc000   ZTEusbnet   (no symbols)          
        Loaded symbol image file: ZTEusbnet.sys
        Image path: \SystemRoot\system32\DRIVERS\ZTEusbnet.sys
        Image name: ZTEusbnet.sys
        Timestamp:        Mon Oct 13 06:50:10 2008 (48F2E192)
        CheckSum:         000329ED
        ImageSize:        0002E000
        Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4
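
The value in brackets after the timestamp (48F2E192) is the raw build time stored in the driver image, counted in seconds since 1 January 1970 UTC. Here is a short Python sketch to decode it; the helper name decode_timestamp is my own.

```python
from datetime import datetime, timezone


def decode_timestamp(hex_value):
    """Decode a driver image timestamp (hex seconds since 1970) into a UTC datetime."""
    return datetime.fromtimestamp(int(hex_value, 16), tz=timezone.utc)


built = decode_timestamp("48F2E192")
```

This gives 13 October 2008 at 05:50:10 UTC. The debugger printed 06:50:10, one hour ahead, which is consistent with it displaying the time in a UTC+1 local time zone.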

    Hope you enjoyed reading.