A standard technique in developing operating systems has been to cache data in order to improve access times. This approach works well because data that has been recently accessed is likely to be accessed again soon. For a system where there is only ever one program accessing data, caching works fine. However, once a second program attempts to access the same data, the issue of cache consistency becomes critical.
Because there are now potentially many copies of the data, each one in a different cache, there must be some mechanism to keep these copies in sync with one another. If there is not, different users of the data will have different views; there would then be many copies of the data, each one different from the others.
When accessing files across a network,
the simplest scheme is to always store all data back
on the file server. Thus, whenever an application program reads data, that read
request is satisfied from the file server, across the network. This ensures that there is a consistent view of the data, since there is only a single copy of the data in existence: on the file server.
Unfortunately, the performance characteristics of such systems are not ideal. While file servers can be built to be very fast, there are numerous bottlenecks between the client accessing the data and the file server storing the data, not to mention the added latency necessary to fetch the data from the file server each time.
Studying this problem at length reveals that the majority of data retrieved from the file server is never modified; it is only read. Data that is being modified is almost always being modified by a single program, such as a word processing document. Data that is being modified by multiple programs represents a tiny fraction of the total data traffic. In spite of this, users expect their file systems to ensure that any data they access is correct, not most of the time, but all of the time.
The typical access characteristics for
such data make network file system clients ideal candidates to cache data, and many of them do so,
using a variety of different techniques to ensure cache consistency. For
example, the venerable NFS file system protocol utilizes a scheme of checking
file system time stamps on the remote file server to detect when data in its
cache may have become stale. This solution is not perfect, since there is a window during which the NFS client may cache stale data, but it has worked well for many years.
For Windows NT, network
file system caching is implemented by the LanManager redirector (the "client") and file server (the "server").
In order to ensure correctness of the cached data, LanManager implements a basic cache consistency scheme which covers the entire file contents. Where files are being simultaneously accessed
across the network by multiple users for both read and write access, caching is disabled: clients must fetch
data from the file server each time it is read, and must store it back
immediately each time it is written. However, in the vast majority of cases,
the client will cache data locally. This minimizes network traffic and vastly
improves performance for most file access on Windows NT.
This is
implemented by Windows NT using a cache consistency scheme known as
opportunistic locking. An
opportunistic lock is known as an "oplock" in the parlance of Windows NT file systems.
Further, the implementation of oplocks by Microsoft impacts both their network and local file systems. Because
the details of the local implementation are tightly coupled to how oplocks are used by network file systems, we
describe the network implementation initially and then return to discussing
issues associated with their local implementation for NT file systems.
In
Figure 1 below, we provide our basic reference diagram for this discussion of oplocks. Oplocks are granted by
SRV to instances of RDR running on systems across the network, possibly even including the same system on
which SRV is running.
When
a client opens a file across the network, it is typically the only user
accessing that file. In this very common case, the network client need not
store data back to the server immediately, nor need it fetch data repeatedly
from the server. Allowing this optimization minimizes unnecessary network traffic, which in turn provides better perceived
performance for both the network client and all other clients using the
network.
Figure 1
"Cache
consistency" requires that any two clients on the network must see the
same information in the file at the same point in time. Thus, if one client is not writing data back to the file server on a regular
basis, a second client reading data from the server would receive stale data.
This would violate the requirement that two clients on the network see the same
information in the file at a given point in time.
Allowing client-side caching without suffering from cache consistency problems requires a cache consistency protocol: a mechanism whereby a client that keeps data locally, rather than writing it back to the server or refetching it from the server each time it is needed, can be informed when it must write the data back or reread it from the file server.
On
Windows NT this is done via the "opportunistic locking" protocol, or oplock. In the balance of this section we describe the various types of oplocks, their uses,
and how an FSD should deal with them.
There
are three types of oplocks: level 1, batch, and level
2. Both the level 1 and batch oplocks are
"exclusive access" opens. They are used slightly differently, however, and hence have somewhat different semantics. A
level 2 oplock is a "shared access" grant
on the file.
Level
1 is used by a remote client that wishes to modify the data.
Once granted a Level 1 oplock, the remote client may
cache the data, modify the data in its cache and need not write it back to the
server immediately.
Batch oplocks are used by remote clients for accessing script files, where the file is opened, read or written, and then closed, repeatedly. Thus, a batch oplock corresponds not to a particular application opening the file, but rather to a remote client's network file system caching the file because it knows something about the semantics of the given file access. The name "batch" comes from the fact that this behavior was observed by Microsoft with "batch files" being processed by command line utilities. Log files especially exhibit this behavior: when a script is being processed, each command is executed in turn. If the output of the script is redirected to a log file, the file fits the pattern described earlier, namely open/write/close. With many lines in a file, this pattern can be repeated hundreds of times.
Level
2 is used by a remote client that merely wishes to read the data.
Once granted a Level 2 oplock, the remote client may
cache the data and need not worry that the data on the remote file server will
change without it being advised of that change.
An oplock must be broken whenever the cache consistency guarantee provided by the oplock can no longer be maintained. Thus, whenever a second network client attempts to access data in the same file across the network, the file server is responsible for "breaking" the oplock and only then allowing the second client to access the file. This ensures that the data remains consistent, and hence that the consistency guarantees essential to proper operation are preserved.
An oplock break occurs whenever SRV detects that some condition necessary to maintain the oplock no longer holds. In that case, SRV begins breaking the oplock.
Depending upon the type of oplock being broken, SRV
may have to engage in a multi-message protocol to complete the oplock break.
The
simplest oplock break is for a level 2 oplock. In this case, SRV merely advises the remote client
that it must invalidate any cached data it has and reread it from the file
server.
Figure 2
Breaking
a level 1 oplock, however, is a bit more complicated.
In that case the client may have in memory data that
must be written back to the file server before the oplock break should be considered complete. A graphical description of the control
flow between SRV and RDR is shown in Figure 2. It
demonstrates the call from SRV indicating that an oplock break is in progress. In that case, the remote client initiates a series of
write operations back to the server. The write back process can consist of many
operations between the server and client. Once all data has
been written back to the server, the client then acknowledges the oplock break. Microsoft's protocol allows the server to grant a Level 2 oplock to the client if the client so desires. This would allow the client to retain the data in its cache (as it is still valid), minimizing unnecessary network traffic.
Breaking
a batch oplock is initiated by the file server (SRV)
which indicates to the client that an oplock break is
in progress. The client (RDR) then writes any dirty cached data
back to the file server. When that is completed, the client then closes the
file. This causes the file to be reread from the file
server on a subsequent access.
In
fact, closing a file always releases an oplock on the
given file. A client is no longer interested in cache consistency once the file
has been closed: no data may
be cached by the client if the file is not open.
The oplock protocol itself is sufficient to ensure cache
consistency between clients anywhere on the network. There is one case,
however, that is not covered by this mechanism: the case of local file system access, perhaps from a local
application program. In this case, the application will call directly into the FSD without using either the server (SRV) or client (RDR)
components.
This case, of course, bears directly on our fundamental requirement for cache consistency. It is the requirement that NT maintain cache consistency for local client access as well that requires that oplocks be implemented in the FSD. Thus, an inherently network-oriented activity (remote caching of data) has an important impact on local file systems.
Now
we turn our attention to describing the mundane details of how to take
advantage of oplocks in the local file system.
A level 1 oplock is an exclusive oplock on the file. That is, it gives the holder of the
lock the right to cache the data and to modify the data in its cache.
Essentially, no other process (on any system in the network) may be accessing
the file.
An
FSD will grant such an oplock when the
file is only opened by a single process. Thus, if the
file is already opened by two or more clients when a request for a level
1 lock is made, the request will be denied.
Similarly,
if a level 1 lock is already held by the remote client and a second client opens the file, the level 1 lock previously granted must be
revoked. This will trigger a write-back of any dirty data stored by the first
client before the oplock break is completed.
An interesting requirement of the oplock protocol is that the request interface must be handled synchronously. The oplock is granted when STATUS_PENDING is returned for the IRP containing the oplock request. Thus, an FSD must process the oplock request synchronously in its dispatch path (rather than posting it for later processing), because returning STATUS_PENDING indicates to the caller that the oplock grant was successful.
Once
an oplock has been granted,
the IRP representing that oplock is queued and held.
The oplock break processing is
implemented by completing the original IRP that requested the oplock. The IRP must be completed with the Information field of its IoStatus block set to either FILE_OPLOCK_BROKEN_TO_LEVEL_2 or FILE_OPLOCK_BROKEN_TO_NONE.
However,
the oplock break at this stage has not completed.
Instead, the owner of the oplock must do any internal
processing required. Once that processing has completed, the oplock owner must acknowledge the oplock break. If FILE_OPLOCK_BROKEN_TO_LEVEL_2 was returned,
the owner of the oplock may indicate FSCTL_OPLOCK_BREAK_ACKNOWLEDGE, in which case the acknowledgment IRP is treated as a request for a level 2 oplock (c.f., Section 0). Alternatively, the oplock owner may acknowledge the break but decline the offer of a level 2 oplock (c.f., Section 0) by indicating FSCTL_OPLOCK_BREAK_ACK_NO_2.
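In practice, an FSD rarely implements this state machine by hand: Microsoft supplies the FsRtl oplock package (FsRtlInitializeOplock, FsRtlOplockFsctrl, FsRtlCheckOplock) for exactly this purpose. The sketch below is not taken from any particular file system, and the FCB layout and open count are assumptions; it simply shows how the oplock FSCTLs can be forwarded to that package, which queues the IRP and returns STATUS_PENDING when it grants an oplock, matching the convention described above.

    #include <ntifs.h>

    /* Hypothetical per-file control block: the OPLOCK state lives in the FCB
     * and is initialized once with FsRtlInitializeOplock(). */
    typedef struct _MY_FCB {
        OPLOCK Oplock;
        ULONG  OpenCount;   /* number of open handles; tracked by this FSD */
    } MY_FCB, *PMY_FCB;

    /* Called from the IRP_MJ_FILE_SYSTEM_CONTROL dispatch routine for the
     * oplock-related FSCTL codes (FSCTL_REQUEST_OPLOCK_LEVEL_1/_LEVEL_2,
     * FSCTL_REQUEST_BATCH_OPLOCK, FSCTL_OPLOCK_BREAK_ACKNOWLEDGE,
     * FSCTL_OPLOCK_BREAK_ACK_NO_2, FSCTL_OPBATCH_ACK_CLOSE_PENDING and
     * FSCTL_OPLOCK_BREAK_NOTIFY). */
    NTSTATUS
    MyFsdOplockFsControl(
        _In_ PMY_FCB Fcb,
        _In_ PIRP    Irp
        )
    {
        /* The package grants or denies the request synchronously: if the
         * oplock is granted, the IRP is queued inside the OPLOCK and
         * STATUS_PENDING is returned, which the dispatch routine simply
         * propagates to the caller. */
        return FsRtlOplockFsctrl(&Fcb->Oplock, Irp, Fcb->OpenCount);
    }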
The principal reason a level 1 oplock is broken is that another caller opens the file. Normally, a caller who
wishes to open the file will block until the oplock break is completed. However, SRV (the LanManager file
server) requires, for internal deadlock prevention reasons, that a create be completed before the oplock break is completed. This is done by setting (in the create request) the
FILE_COMPLETE_IF_OPLOCKED bit in the option flags.
However,
before SRV can use the file thus created, it must later verify that the oplock break has really completed. It does this by making a
subsequent call to the FSD to wait until the oplock break on the given file is completed (c.f., Section 0).
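As an illustration (this sketch is not from the article; the function name, access rights, and surrounding flags are assumptions), a kernel-mode caller can open a file the way SRV does, asking not to be blocked behind an oplock break:

    #include <ntifs.h>

    /* Open a file without blocking behind an oplock break.  If a break was
     * started on our behalf, the create still succeeds but returns the
     * informational status STATUS_OPLOCK_BREAK_IN_PROGRESS; the caller must
     * then wait for the break (see FSCTL_OPLOCK_BREAK_NOTIFY below) before
     * relying on the handle. */
    NTSTATUS OpenWithoutWaitingForOplockBreak(PUNICODE_STRING Path, PHANDLE Handle)
    {
        OBJECT_ATTRIBUTES oa;
        IO_STATUS_BLOCK   iosb;

        InitializeObjectAttributes(&oa, Path,
                                   OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE,
                                   NULL, NULL);

        return ZwCreateFile(Handle,
                            GENERIC_READ | SYNCHRONIZE,
                            &oa,
                            &iosb,
                            NULL,                          /* AllocationSize */
                            FILE_ATTRIBUTE_NORMAL,
                            FILE_SHARE_READ | FILE_SHARE_WRITE,
                            FILE_OPEN,
                            FILE_SYNCHRONOUS_IO_NONALERT |
                            FILE_COMPLETE_IF_OPLOCKED,     /* do not wait for the break */
                            NULL, 0);
    }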
A level 2 oplock is a shared oplock on the
contents of the file. It allows a network client (RDR) to cache the data in
memory without fear that the data will change.
As
with a level 1 oplock, the oplock is requested via an IRP, and the oplock is granted when the FSD returns STATUS_PENDING. Unlike a level 1 oplock, however, a level 2 oplock may be granted even when the file has already been opened elsewhere. Further, a level 2 oplock may be granted even when other opens of the file allow write access. This point is important: as it turns out, many applications will open a file for write access even if they never intend to modify the contents of the file.
Thus, when a write is performed on a file, an FSD must check whether any level 2 oplocks have been granted against the file and, if so, break them. This ensures that if a remote client did cache data, that cached data will be properly invalidated.
The oplock is broken by completing the
pending IRP. In the case of a level 2 oplock, nothing is set in the Information field; the IRP is simply completed with STATUS_SUCCESS. This notifies the oplock holder that its cached data is now stale and must be refreshed prior to subsequent use.
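A minimal sketch of how a local FSD can honor this rule when it uses the system FsRtl oplock package (the PMY_FCB type is the hypothetical one declared in the earlier sketch): the write path asks the package to break any incompatible oplocks before the data is modified.

    #include <ntifs.h>

    /* Called from the write dispatch path before any data is touched.  For a
     * write IRP the oplock package breaks any level 2 oplocks (completing
     * them with STATUS_SUCCESS, as described above); with a NULL completion
     * routine the call blocks until any exclusive oplock break has been
     * acknowledged. */
    NTSTATUS MyFsdCheckOplockBeforeWrite(PMY_FCB Fcb, PIRP Irp)  /* PMY_FCB: see earlier sketch */
    {
        return FsRtlCheckOplock(&Fcb->Oplock,
                                Irp,
                                NULL,     /* Context for callbacks (unused)  */
                                NULL,     /* CompletionRoutine: NULL = block */
                                NULL);    /* PostIrpRoutine                  */
    }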
A batch oplock is an exclusive oplock against
a file's contents and against changes in the attributes of the file (notably, but not exclusively, its name). It allows a network client to keep a file
"oplocked" even though the application on
the remote client is opening and closing the file repeatedly (as is the case for a batch file, hence the name of the oplock).
A
batch oplock can only be granted under the same circumstances as a level 1 oplock (c.f., Section 0). The oplock itself is requested via an IRP; returning STATUS_PENDING for that IRP indicates the oplock has been granted.
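For completeness, here is roughly what a batch oplock request can look like from user mode (this example is not from the article; the path is a placeholder and error handling is minimal). The FSCTL is issued as overlapped I/O so that the request can remain pending while the oplock is held:

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
        OVERLAPPED ov = {0};
        DWORD bytes = 0;
        HANDLE h;

        h = CreateFileW(L"C:\\temp\\test.bat",     /* placeholder path */
                        GENERIC_READ,
                        FILE_SHARE_READ,
                        NULL, OPEN_EXISTING,
                        FILE_FLAG_OVERLAPPED,       /* required: the FSCTL pends */
                        NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);

        /* Returns FALSE/ERROR_IO_PENDING when the oplock is granted and held. */
        if (!DeviceIoControl(h, FSCTL_REQUEST_BATCH_OPLOCK,
                             NULL, 0, NULL, 0, &bytes, &ov) &&
            GetLastError() == ERROR_IO_PENDING)
        {
            printf("batch oplock granted; waiting for a break...\n");
            WaitForSingleObject(ov.hEvent, INFINITE);   /* completes on break */
            printf("oplock broken; flush and close (or acknowledge).\n");
        }

        CloseHandle(ov.hEvent);
        CloseHandle(h);
        return 0;
    }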
Breaking a batch oplock is different from breaking a level 1 oplock. The emphasis with a batch oplock is protecting the file attributes, while with a level 1 oplock the emphasis is protecting the data within the file. Batch oplocks can be held across open instances of the file (that is, you can open the file, acquire the batch oplock, close the file, re-open the file, and still hold the batch oplock), which you cannot do with a level 1 oplock. Thus, in addition to breaking a batch oplock whenever the data itself has changed, a batch oplock must also be broken whenever the name of the file changes. This is because a batch oplock covers the file even though the client may be opening and closing the file repeatedly. If a rename occurred in that situation, the client would need to be advised that the file handle it is using no longer represents the file it used to represent.
One interesting side-effect of using batch oplocks is that certain CREATE operations may fail with the Information field set to FILE_OPBATCH_BREAK_UNDERWAY. This occurs when the caller indicated it was unwilling to wait for the oplock break to complete by setting the FILE_COMPLETE_IF_OPLOCKED option flag, as is
typically the case for SRV, the LanManager file
server. In this case the create operation will fail (typically with
STATUS_SHARING_VIOLATION) to indicate to the caller that the problem is with a
batch oplock presently held on the file and that a
blocking call to CREATE would not necessarily fail.
FSCTL_OPLOCK_BREAK_ACKNOWLEDGE
Once an
exclusive (level 1 or batch) oplock has been broken,
other file system requests cannot continue until the oplock break is acknowledged. This can be done in one of two ways: either by a subsequent call
to the FSD indicating a control code of FSCTL_OPLOCK_BREAK_ACKNOWLEDGE or by
closing the file handle.
A
batch oplock break is normally acknowledged by the
file object being closed. A level 1 oplock break is normally acknowledged by way of this call.
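A hedged user-mode sketch of that acknowledgment path (the handle and OVERLAPPED structures are assumed to come from an earlier, still-pending level 1 or batch oplock request like the one shown above):

    #include <windows.h>
    #include <winioctl.h>

    /* After the pending oplock request completes (the break notification),
     * flush any dirty data and acknowledge the break.  If the
     * acknowledgment is accepted as a level 2 request it pends again, this
     * time representing the level 2 oplock. */
    void AcknowledgeOplockBreak(HANDLE h, OVERLAPPED *ackOv)
    {
        DWORD bytes = 0;

        FlushFileBuffers(h);   /* write back cached data before acknowledging */

        if (!DeviceIoControl(h, FSCTL_OPLOCK_BREAK_ACKNOWLEDGE,
                             NULL, 0, NULL, 0, &bytes, ackOv) &&
            GetLastError() == ERROR_IO_PENDING)
        {
            /* Pending again: the oplock was broken to level 2 and this
             * request now represents the level 2 oplock. */
        }
    }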
FSCTL_OPLOCK_BREAK_NOTIFY
When SRV opens a file, indicating that it does not wish to wait for the oplock break to complete (c.f., Section 0), it must subsequently make a call to the
underlying FSD to ensure that the oplock break has
successfully completed.
This is accomplished by indicating FSCTL_OPLOCK_BREAK_NOTIFY as
the control code in the IRP. This IRP will then block waiting for any oplock break activity to complete on the file. Once this
call returns (STATUS_SUCCESS), the caller (SRV) may use the file object safely.
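Continuing the earlier create sketch (the synchronous handle and the function name are assumptions, not code from the article), the wait itself can be expressed as a single FSCTL sent down to the FSD:

    #include <ntifs.h>

    /* After an open made with FILE_COMPLETE_IF_OPLOCKED returned
     * STATUS_OPLOCK_BREAK_IN_PROGRESS, wait for the break to finish before
     * using the handle.  A synchronous handle is assumed, so ZwFsControlFile
     * blocks until the FSD completes the notify request. */
    NTSTATUS WaitForOplockBreakToComplete(HANDLE FileHandle)
    {
        IO_STATUS_BLOCK iosb;

        return ZwFsControlFile(FileHandle,
                               NULL, NULL, NULL,          /* event, APC, context */
                               &iosb,
                               FSCTL_OPLOCK_BREAK_NOTIFY, /* completes when no break is in progress */
                               NULL, 0, NULL, 0);
    }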
For
SRV, proper implementation of these semantics by the FSD is essential to
correct behavior. If a normally asynchronous CREATE operation by SRV is forced to be synchronous (perhaps by a filter driver) SRV
will experience internal deadlock conditions.
FSCTL_OPBATCH_ACK_CLOSE_PENDING
Earlier we
mentioned that an oplock break could
be acknowledged by closing the file. This control code
is used by the oplock owner to indicate the oplock break has been acknowledged and a close of the file
is imminent.
In
this case, a level 2 oplock is not necessary. No
further use should be made of this file object except to close the file.
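A short sketch of that sequence from user mode (handle setup is assumed and the function name is hypothetical): acknowledge the break with FSCTL_OPBATCH_ACK_CLOSE_PENDING, then close the handle.

    #include <windows.h>
    #include <winioctl.h>

    /* The oplock owner has decided simply to close the file: tell the FSD
     * that no level 2 oplock is wanted and that a close is imminent, then
     * close the handle (which releases any remaining oplock state). */
    void AcknowledgeBreakAndClose(HANDLE h)
    {
        OVERLAPPED ov = {0};
        DWORD bytes = 0;

        ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);

        /* This acknowledgment does not pend: no level 2 oplock is wanted. */
        if (!DeviceIoControl(h, FSCTL_OPBATCH_ACK_CLOSE_PENDING,
                             NULL, 0, NULL, 0, &bytes, &ov) &&
            GetLastError() == ERROR_IO_PENDING)
        {
            GetOverlappedResult(h, &ov, &bytes, TRUE);
        }

        CloseHandle(ov.hEvent);
        CloseHandle(h);   /* the close releases any remaining oplock state */
    }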
FSCTL_OPLOCK_BREAK_ACK_NO_2
This control code is a variation on the general acknowledgment operation. In this instance,
the owner of the oplock is declining the offer (by
the FSD) of a level 2 oplock. This is typically
because the owner of the oplock does not use or
support level 2 oplocks.