What is the difference between
cached I/O, user non-cached I/O, and paging I/O?
In a file
system or file system filter driver, read and write operations fall into
several different categories. For the purpose of discussing them, we normally consider the following types:
- Cached I/O. This includes normal user I/O, both via the Fast I/O path as well as via the IRP_MJ_READ and IRP_MJ_WRITE path. It also includes the MDL operations (where the caller requests the FSD return an MDL pointing to the data in the cache).
- Non-cached user I/O. This includes all non-cached I/O operations that originate outside the virtual memory system.
- Paging I/O. These are I/O operations initiated by the virtual memory system in order to satisfy the needs of the demand paging system.
Cached I/O is any I/O that can be satisfied by the file system data cache. In such a case, the operation is normally to copy the data from the virtual cache buffer into the user buffer. If the virtual cache buffer contents are resident in memory, the copy is fast and the results returned to the application quickly. If the virtual cache buffer contents are not all resident in memory, then the copy process will trigger a page fault, which generates a second re-entrant I/O operation via the paging mechanism.
Non-cached user I/O is I/O that must bypass the cache - even if the data is present in the cache. For read operations, the FSD can retrieve the data directly from the storage device without making any changes to the cache. For write operations, however, an FSD must ensure that the cached data is properly invalidated (if this is even possible, which it will not be if the file is also memory mapped).
Paging I/O is I/O that must be satisfied from the storage device (whether local to the system or located on some "other" computer system) and it is being requested by the virtual memory system as part of the paging mechanism (and hence has special rules that apply to its behavior as well as its serialization).
I see I/O requests with the
IRP_MN_MDL minor function code? What does this mean?
How should I handle it in my file system or filter driver?
callers of the read and write interface (IRP_MJ_READ and IRP_MJ_WRITE) can
utilize an interface that allows retrieval of a pointer to the data as it is
located in the file system data cache. This allows the kernel mode caller to
retrieve the data for the file without an additional data copy.
For example, the AFD file system driver has an API function it exports that takes a socket handle and a file handle. The file contents are "copied" directly to the corresponding communications socket. The AFD driver accomplishes this task by sending an IRP_MJ_READ with the IRP_MN_MDL minor operation. The FSD then retrieves an MDL describing the cached data (at Irp->MdlAddress) and completes the request. When AFD has completed processing the operation it must return the MDL to the FSD by sending an IRP_MJ_READ with the IRP_MN_MDL_COMPLETE minor operation specified.
For a file system filter driver, the write operation may be a bit more confusing. When a caller specifies IRP_MJ_WRITE/IRP_MN_MDL the returned MDL may point to uninitialized data regions within the cache. That is because the cache manager will refrain from reading the current data in from disk (unless necessary) in anticipation of the caller replacing the data. When the caller has updated the data, it releases the buffer by calling IRP_MJ_WRITE/IRP_MN_MDL_COMPLETE. At that point the data has been written back to the cache.
An FSD that is integrated with the cache manager can implement these minor functions by calling CcMdlRead and CcPrepareMdlWrite. The corresponding functions for completing these are CcMdlReadComplete and CcMdlWriteComplete. An FSD that is not integrated with the cache manager can either indicate these operations are not supported (in which case the caller must send a buffer and support the "standard" read/write mechanism) or it can implement them in some other manner that is appropriate.
FILE_COMPLETE_IF_OPLOCKED in a filter driver.
driver may be called by an application that has
indicated the FILE_COMPLETE_IF_OPLOCKED. If the filter in turn calls ZwCreateFile it may cause the
thread to deadlock.
A common problem for file systems is that of reentrancy. The Windows operating system supports numerous reentrant operations. For example, an asynchronous procedure call (APC) can be invoked within a given thread context as needed. However, suppose an APC is delivered to a thread while it is processing a file system request. Imagine that this APC in turn issues another call into the file system. Recall that file systems utilize resource locks internally to ensure correct operation. The file systems must also ensure the correct order of lock acquisition in order to eliminate the possibility of deadlocks arising. However it is not possible for the file system to define a locking order in the face of arbitrary reentrancy!
To resolve this problem, the file systems disable certain types of reentrancy that would not be safe. They do this by calling FsRtlEnterFileSystem when they enter a region of code that is not reentrant. When they leave that region of code, they call FsRtlExitFileSystem to enable reentrancy.
This is important to an understanding of this problem because the CIFS file server uses oplocks as part of its cache consistency mechanism between remote clients and local clients. This is done using a "callback" mechanism, which is implemented using APCs.
Normally, the FSD will block waiting for the completion of the APC that breaks an oplock. Under certain circumstances, however, the CIFS server thread that issued the operation requiring an oplock break is also the thread that must process the APC. Since the file system has blocked APC delivery, and now the thread is blocked awaiting completion of the APC, this approach leads to deadlock. Because of this, the Windows file system developers introduced an additional option that advises the file system that if an oplock break is required to process the IRP_MJ_CREATE operation, it should not block, but instead should return a special status code STATUS_OPLOCK_BREAK_IN_PROGRESS. This return value then tells the caller that the file is not completely opened. Instead, a subsequent call to the file system, using the FSCTL_OPLOCK_BREAK_NOTIFY, must be made to ensure that the oplock break has been completed.
Of course, this works because by returning this status code the APC can be delivered, once the thread exits the file system driver.
Note that FSCTL_OPLOCK_BREAK_NOTIFY, and the other calls for the oplock protocol, are documented in the Windows Platform SDK.
What are the rules for my file
system/filter driver for handling paging I/O? What about paging file I/O?
The rules for
handling page faults are quite strict because incorrect handling can lead to
catastrophic system failure. For this reason, there are specific rules used to
ensure correct cooperation between file systems (and file system filter
drivers) and the virtual memory system. This is necessary because page faults are trapped by the VM system, but are then ultimately
satisfied by the file system (and associated storage stack). Thus, the file
system must not generate additional page faults because this may lead to an
"infinite" recursion. Normally, hardware platforms have a finite
limit to the number of page faults they will handle in a "stacked"
Thus, the most reliable of the paging paths is that for the paging file. Any access to the paging file must not generate any additional page faults. Further, to avoid serialization problems, the file system is not responsible for serializing this access - the paging file belongs to the VM system, and the VM system is responsible for serializing access to this file (this eliminates some of the re-entrant locking problems that occur with general file access). To achieve this, the drivers involved must ensure they will not generate a page fault. They must not call any routines that could generate a page fault (e.g., code modules that can be paged). The file system may only be called at APC_LEVEL but it should only call those routines that are safe to call at DISPATCH_LEVEL, since such routines are guaranteed not to cause page faults. None of the data being used through this path should be pagable.
For all other files, paging I/O has less stringent rules. For a file system driver the code paths cannot be paged - otherwise, the page fault might be to fetch the very piece of code needed to satisfy the page fault! For any file system, data may be paged, since such a page fault can always be resolved eventually by retrieving the correct contents off disk. Paging activity occurs at APC_LEVEL and thus limits arbitrary re-entrancy in order to prevent calling code paths that could generate yet more page faults.
IRP_MJ_CLEANUP vs IRP_MJ_CLOSE
of IRP_MJ_CLEANUP is to indicate that the last handle reference against the
given file object has been released. The purpose of IRP_MJ_CLOSE is to indicate
that the last system reference against the given file object has been released.
This is because the operating system uses two distinct reference counts for any
object, including the file object. These values are stored within the object
header, with the HandleCount representing the number
of open handles against the object and the PointerCount representing the number of references against the object. Since the HandleCount always implies a reference (from a handle
table) to the object, the HandleCount is less than or
equal to the PointerCount.
Any kernel mode component may maintain a private reference to the object. Routines such as ObReferenceObject, ObReferenceObjectByHandle, and IoGetDeviceObjectPointer all bump the reference count on a specific object. A kernel mode driver releases that reference by using ObDereferenceObject to decrement the PointerCount on the given object.
A file system, or file system filter driver, will often see a long delay between the IRP_MJ_CLEANUP and IRP_MJ_CLOSE because a component within the operating system is maintaining a reference against the file object. Frequently, this is because the memory manager maintains a reference against a file object that is backing a section object. So long as the section object remains "in use" the file object will be referenced. Section objects, in turn, remain referenced for extended periods of time because they are used by the memory manager in tracking the usage of memory for file-backed shared memory regions (e.g., executables, DLLs, memory mapped files). For example, the cache manager uses the section object as part of its mappings of file system data within the cache. Thus, the period of time between the IRP_MJ_CLEANUP and the IRP_MJ_CLOSE can be arbitrarily long.
The other complication here is that the memory manager uses only a single file object to back the section object. Any subsequent file object created to access that file will not be used to back the section and thus for these new file objects the IRP_MJ_CLEANUP is typically followed by an IRP_MJ_CLOSE. Thus, the first file object may be used for an extended period of time, while subsequent file objects have a relatively short lifespan.
What are the rules for managing MDLs
and User Buffers? How do I substitute my own buffer in an IRP?
fairness, there are no "rules" for managing MDLs and user buffers.
There are suggestions that we can offer based upon observed behavior of the
file systems. First, we note that there are two basic sources of I/O operations
for a file system - the applications layer, and other operating system
For applications programs, most IRPs are still buffered. The operations for which this is not necessarily the case are those for which larger amounts of data are transferred. These are IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL,? IRP_MJ_QUERY_EA, IRP_MJ_SET_EA, IRP_MJ_QUERY_QUOTA, IRP_MJ_SET_QUOTA,?and the per-control code options of IRP_MJ_DEVICE_CONTROL, IRP_MJ_INTERNAL_DEVICE_CONTROL, and IRP_MJ_FILE_SYSTEM_CONTROL. If the Flags field in the device object specifies DO_DIRECT_IO then the buffer for IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL,? IRP_MJ_QUERY_EA,?IRP_MJ_SET_EA, IRP_MJ_QUERY_QUOTA, and IRP_MJ_SET_QUOTA, ?is specified as a memory descriptor list (MDL) pointed to by the MdlAddress field of the IRP. If the Flags field in the device object specifies DO_BUFFERED_IO then the buffer for IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL,? IRP_MJ_QUERY_EA, IRP_MJ_SET_EA, IRP_MJ_QUERY_QUOTA, and IRP_MJ_SET_QUOTA,?is a non-paged pool buffer pointed to by the AssociatedIrp.SystemBuffer field of the IRP. The most common case for a file system driver is that neither of these two flags is specified, in which case the buffer for IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL, IRP_MJ_QUERY_SECURITY, IRP_MJ_SET_SECURITY, and IRP_MJ_QUERY_EA, IRP_MJ_SET_EA is a direct pointer to the caller-supplied buffer via the UserBuffer field of the IRP.
for IRP_MJ_QUERY_SECURITY and IRP_MJ_SET_SECURITY the buffer is always passed
as METHOD_NEITHER.? Thus, the user buffer is pointed
to by Irp->UserBuffer.? The file system is responsible for validating and
managing that buffer directly.
control operations (IRP_MJ_DEVICE_CONTROL, IRP_MJ_INTERNAL_DEVICE_CONTROL, and IRP_MJ_FILE_SYSTEM_CONTROL) the buffer descriptions are a
function of the specified control code, of which there are four:
METHOD_BUFFERED, METHOD_IN_DIRECT, METHOD_OUT_DIRECT and METHOD_NEITHER.
? METHOD_BUFFERED - in this case the input data is in the buffer pointed to by AssociatedIrp.SystemBuffer. Upon completion of the operation the output data is in the same buffer. Transferring data between user and kernel mode is handled by the I/O Manager.
? METHOD_IN_DIRECT - in this case the input data is in the buffer pointed to by AssociatedIrp.SystemBuffer. The secondary buffer is described by the MDL in MdlAddress. When the I/O Manager probed and locked the memory corresponding to the memory for the original buffer, it probed it for read access (hence the IN part of this transfer description). Typically, this is confused because it is referred to as the output buffer (although, probing it for input would suggest it is being used as a secondary input buffer).
? METHOD_OUT_DIRECT - in this case the input data is in the buffer pointed to by AssociatedIrp.SystemBuffer. The secondary buffer is described by the MDL in MdlAddress. When the I/O Manager probed and locked the memory corresponding to the memory for the original buffer, it probed it for write access (hence the OUT part of this transfer description). ? METHOD_NEITHER - in this case the input data is described by a pointer in the I/O stack location (Parameters.DeviceIoControl.Type3InputBuffer). The output data is described by a pointer in the IRP (UserBuffer). In both cases, these pointers are direct virtual address references to the original buffer. The memory may, or may not, be valid.
Regardless of the type of transfer, any pointers stored within these buffers will always be direct virtual memory references. For example, a driver that accepts a further buffer pointer will need to treat them as ordinary direct access to application address space.
For any driver, attempting to access a user buffer directly can lead to an invalid memory reference. Such references cause the memory manager to throw exceptions (such as is done using ExRaiseStatus for instance). If a driver has not protected against such exceptions, the default kernel exception handler will be invoked. This handler will call KeBugCheckEx indicating KMODE_EXCEPTION_NOT_HANDLED. The exception will be indicated as STATUS_ACCESS_VIOLATION (0xC0000005). Protecting against such exceptions requires the use of structured exception handling, which is described elsewhere (See Question Number 1.38 for more information).
A normal driver will associate an MDL it creates to describe the user buffer with the IRP. This is useful because it ensures that when the IRP is completed the I/O Manager will clean up these MDLs, which eliminates the need for the driver to provide special handling for such clean-up. As a result, it is normal that if there is both an MdlAddress and UserBuffer address, the MDL describes the corresponding user address (you can confirm this by comparing the value returned by MmGetMdlVirtualAddress with the value stored in the UserBuffer field). Of course, it is possible that a driver might associate multiple MDLs with a single IRP (using the Next field of the MDL itself). This could be done explicitly (by setting the field directly) or implicitly (by using IoAllocateMdl and indicating TRUE for the SecondaryBuffer parameter). This could be problematic for file system filter drivers, should a file system be implemented to exploit this behavior.
The other source of I/O operations is from other OS components. It is acceptable for such OS components to use the same access mechanism used by user-mode components, in which case the earlier analysis still applies. In addition, kernel mode components may utilize direct I/O - regardless of the value in the Flags field of the DEVICE_OBJECT for the given file system. For example, paging I/O operations are always submitted to the file system by utilizing MDLs that describe the new physical memory that will be used to store the data. File systems should not reference the memory pointed to by Irp->UserBuffer although this address will appear to be valid (and will even be the same as is returned by MmGetMdlVirtualAddress). The address are not, in fact valid, but may be used by the file system when constructing multiple sub-operations. Memory Manager MDLs cannot be used for direct reference to that memory, as those buffers have not been mapped into memory since they do not yet contain the correct data.
For a file system filter driver that wishes to modify the data in some way, it is important to keep in mind the use of that memory. For example, a traditional mistake for an encryption filter is to trap the IRP_MJ_WRITE where the IRP_NOCACHE bit is set (which catches both user non-cached I/O as well as paging I/O) and, using the provided MDL or user buffer, encrypt the data in-place. The risk here is that some other thread will gain access to that memory in its encrypted state. For example, if the file is memory mapped, the application will observe the modified data, rather than the original, cleartext data. Thus, there are a few rules that need to be observed by file system filter drivers that choose to modify the data buffers associated with a given IRP:
? The IRP_PAGING_IO bit changes the completion behavior of the I/O Manager. MDLs in such IRPs are not discarded or cleaned up by the I/O Manager, because they belong to the Memory Manager (see IoPageRead as an example). Thus, filter drivers should be careful when setting this bit (e.g., if they create a new IRP and send it down with the resulting data).
? The Irp->UserBuffer must have the same value as is returned by MmGetMdlVirtualAddress. If it does not and the underlying file system must break up the I/O operation into a series of sub-operations, it will do so incorrectly (see how this is handled in deviosup.c within the FAT file system example in the IFS Kit, where it builds a partial MDL using IoBuildPartialMdl. It uses Irp->UserBuffer as an index reference for Irp->MdlAddress). For example, if substituting a new piece of memory (such as for the encryption driver), make sure that this parameter is set correctly.
? Never modify the buffer provided by the caller unless you are willing to make those changes visible to the caller immediately. Keep in mind that in a multi-threaded shared memory operating system the change is - literally - available and visible to other threads/processes/processors as you make them. Changes should always be made to a separate buffer component. That buffer can then be used in lieu of the original buffer, either within the original IRP, or by using a new IRP for that task.
? Use the correct routine for the type of buffer (e.g., MmBuildMdlForNonPagedPool if the memory is allocated from non-paged pool).
? Any reference to a pointer within the user's address space must be protected using __try and __except in order to prevent invalid user addresses from causing the system to crash.
What are the issues with respect to
IRQL APC_LEVEL? What does it do? Why should I use (or not use) FsRtlEnterFileSystem?
Windows is designed to be a fully re-entrant operating system. Thus,
in general, kernel components may make calls back into the OS without worrying
about deadlocks or other potential re-entrancy problems.
Windows also provides out-of-band execution of operations, such as asynchronous procedure calls (APC). And APC is a mechanism that allows the operating system to invoke a given routine within a specific thread context. This is, in turn, done by using a queue of pending APC objects that is stored in the control structure used by the OS for tracking a thread's state. Periodically, the kernel checks to see if there are pending APC objects that need to be processed by the given thread (where "periodically" is an arbitrary decision of the operating system). From a pragmatic programming standpoint, an APC can be "delivered" (that is, the routine can be called) by a thread between any two instructions. The delivery of APCs can be blocked by kernel code using one of two mechanisms:
? Kernel APC delivery may be disabled by using KeEnterCriticalRegion and re-enabled by using KeLeaveCriticalRegion. Note that Special Kernel APCs are not disabled using this mechanism.
? Special Kernel APC delivery may be disabled by raising the IRQL of the thread to APC_LEVEL (or higher).
There are numerous uses for this, but some of the reasons for this are because of the nature of threads and APCs on Windows. First, we note that a given thread is restricted to running on only a single processor at any given time. Thus, the operating system can eliminate multi-processor serialization issues by requiring that an operation be done in one specific thread context. For example, each thread maintains a list of outstanding I/O operations it has initiated (this is ETHREAD->IrpList). This list is only modified in the given thread's context. By exploiting this, and by raising to APC_LEVEL the list can be safely modified without resorting to more expensive locking mechanisms (such as spin locks).
The primary disadvantage to APC_LEVEL is that it disables I/O completion APC routines from running. This in turn means that the driver must be careful to handle completion correctly (that is, it cannot use the Zw routines, and it must use its own signaling mechanism from its completion routine when sending an IRP so that it may signal completion of the I/O).