Advisory File Locking – My take on POSIX and BSD locks

I had a chance to do several Advisory File-Locking experiments on NFS and local file system.  In this article, I would like to give some insight into the concept of advisory locking and share some useful results from my tests.

Some Background

File locking is generally useful in an environment where multiple applications or processes may want to access (read/write) the same file. In such cases, it becomes important to protect the integrity of the file data and provide a consistent view of the data to all the processes. There are two types of locks:

  1. Read or Shared locks – Multiple reader processes can hold read locks on the same file at the same time.
  2. Write or Exclusive locks – Only one writer process at a time can hold a write lock on the file. No other process should be able to acquire a read or write lock on a file already write-locked by a different process.

Advisory Locking – requires cooperation from all the involved processes. Once a process (say P1) has taken a read/shared lock on a file “F”, the operating system does nothing to prevent another process (say P2) from going ahead and issuing a write system call. What? So how does the locking work? As mentioned, this method requires the participating processes to follow a common locking protocol/API and hence cooperate with each other. As long as these processes first use the provided locking protocol, and respect its result before issuing read or write system calls, the advisory locking scheme works. An example in the form of pseudocode will explain it better.

For the example, let’s just assume there is a libc function called “lock” that under the hood uses some system call to set file locks. The lock function takes two arguments — an open file descriptor and the type of lock requested (shared/exclusive). It returns 0 if the lock was successfully set and -1 if the lock was not granted because some other process holds a conflicting lock on the same file.

A conflicting lock is an incompatible lock. Two situations illustrate lock conflicts:

  1. A process P1 tries to set a read or a write lock on some file “F”, but a process P2 already holds a write lock on the same file. P1’s request conflicts with the lock currently placed on the file by P2.
  2. A process P1 tries to set a write lock on some file “F”, but a process P2 already holds a read lock on the same file. Again, P1’s request conflicts with the lock currently placed on the file by P2.

Now back to the example:

Example 1 – Participating processes do not cooperate. Even if a process owns a lock, it turns out to be useless.

Process P1

int fd = open(path, O_RDONLY);
// Use the lock API to get a read lock.
int ret = lock(fd, shared);
if (ret == 0)
{
   // Lock acquired. Now read the file.
   char buffer[4096];
   int num_bytes;
   while ((num_bytes = read(fd, buffer, sizeof(buffer))) > 0)
   {
      // Process the data read...
   }
}
close(fd);

Process P2

int fd = open(path, O_WRONLY);
// Does not use the lock API (non-cooperative) and issues the write system call directly.
int num_bytes = write(fd, buffer, sizeof(buffer));
close(fd);

Let’s assume the OS schedules these two processes and context-switches between them in a pre-determined manner to highlight my point. Process P1 takes a read lock and then issues multiple read system calls. Say a context switch happens and process P2 gets scheduled. Process P2 turns out to be non-cooperative, as it does not use the lock protocol and writes to the file. When P1 gets scheduled later and does a read, it might see some unexpected or garbage data written by P2. So even though P1 acquired a read lock, the lock actually turned out to be useless.

Now let’s see the correct usage of advisory locking with the same scenario.

Example 2 – Participating processes cooperate. Advisory locking works fine.

Process P2

int fd = open(path, O_WRONLY);
// Use the lock API to try and get a write lock before writing to the file (cooperative).
int ret = lock(fd, exclusive);
if (ret == 0)
{
  // Only if the write lock is granted does it write to the file.
  int num_bytes = write(fd, buffer, sizeof(buffer));
}
else
{
  printf("Lock conflict. Failed to get the lock.\n");
  // Do something or exit.
}
close(fd);

Here we see that process P2 writes to the file only if it gets a write lock on it. The “lock” API will make sure to fail P2’s request, as P1 already holds a read lock on the same file. So neither process disregards the locking protocol, and reads/writes happen only after taking a lock. This is what is meant by “cooperating processes”.

On most Unixes, advisory locking is the default scheme provided to applications.

Locks can be placed over file in two different forms:

  • Full-File locks – are placed over the entire file: the whole region, from the first byte offset to the largest byte offset of the file, is locked by the process.
  • Byte-Range locks – are placed over specific byte-offset ranges in the file. For example, a process P1 can take a read lock on the offset range [0-400] and, at the same time, a process P2 can hold a write lock on the offset range [501-599].

fcntl system call – is the most popular way of using advisory locks on Unix systems. By default, locks set by fcntl are advisory in nature. Locks set by the fcntl system call are usually referred to as POSIX locks, as they adhere to the POSIX standards for file locking and are available on most compliant operating systems.

fcntl supports both advisory and mandatory locks, and advisory locks is the default behavior. Both full-file and byte-range locks are allowed.

Locks acquired through fcntl are not associated with the file descriptor or open-file-table entries. Instead, they are bound to the process itself. For example, say a process has multiple open file descriptors for a particular file and gets a read/write lock using any one of them. Closing any of these file descriptors will release the lock the process holds on the file. The descriptor that was used to acquire the lock in the first place might still be open, but the process will lose its lock. So releasing the lock does not require an explicit unlock or a close on the specific descriptor that was used in the fcntl call. An unlock or close on any of the open file descriptors will release the lock the process owns on that file.

This brings out a terrible flaw in the design of POSIX locks. Imagine a process that has locked a file happens to invoke some library function that does an open-close on the same file for some random reason. Whatever the reason is, the process will accidentally lose its lock on the file, and this is one behavior that programmers will rarely be happy with. It is fairly common for applications to call library routines that do an <open, do-something, close> on a file also opened by the calling process. But if the process happens to hold a lock on that file, it should be ready to lose it.

Lock conversion or upgrade happens when a process already holding a lock on a file (full-file or a specific region) makes another lock request for a different lock type on the same file (full-file or the same locked region). So a process submitting a write-lock request on a file it has already read-locked will get its existing read lock converted to a write lock. Note that the request may come on a different open file descriptor, but that really doesn’t matter, for the reason mentioned above. This design is incomprehensible to me. Take the same library example mentioned before. Say a process opens a file and write-locks it. Now it invokes some library function which opens the same file and attempts to get a read lock on it. Result? The existing write lock owned by the process silently gets converted to a read lock, and who knows when the process would find out. This can easily lead to chaos on systems running multiple applications: a process had a write lock which accidentally got converted to a read lock, thus allowing other applications (if any) waiting for a read lock to go ahead and get one, when they were really not supposed to.

Due to the association of POSIX locks with the process and not with file descriptors, they are not safe to use in a multi-threaded environment. Even if the child threads deal with their own separate file descriptors for a file “F”, they will still be able to release or convert the lock owned by the process on “F”.

For example, a process opened a file “F”, acquired a write lock on it and then created multiple child threads. Now one of these child threads opens the file “F” for some reason, and then closes it. Result? The parent process (and of course all the child threads) loses its lock on the file. As a different example, a child thread might make a read-lock request on the descriptor it just opened. The consequence would be that the parent process’s existing write lock gets converted to a read lock without the parent process even knowing about it. These examples bring forth a highly undesirable behavior of POSIX locks, and render them useless for multi-threaded applications. They are not at all safe for synchronizing between the threads of the same process.

Note that when we talk about synchronizing between the threads of the same process, we assume that the child threads are using their own separate file descriptors. Why? Because if the threads use the same file descriptor as opened by the process (which they can, as they share the file descriptor table with the process), any action (close/unlock/lock-conversion etc.) performed on that file descriptor will have its effect on the lock. So there is really nothing to be done about that case.

What we could have hoped for is that as long as child threads are using separate open file descriptors, they are not going to mess with the lock already set by the parent process on the same file. Unfortunately, POSIX locking design does not satisfy this requirement because it associates the locks with processes and not with individual open file descriptors.

As the locks are not bound to descriptors but are possessed by the process, a forked child process will not inherit the locks owned by the parent process. We know that the file descriptor(s) opened by the parent process get duplicated for the child process — the entries in the file-descriptor tables of parent and child point to the same entry in the global file table. But the descriptor or the file-table entry as such does not carry any information about the lock. So any attempt (unlock, lock conversion) on the duplicated descriptor by the child process will have no effect on the lock owned by the parent process. Any lock request made on the duplicated descriptor by the child process will result in a “Lock Conflict”, just as it would for any other process. Also, closing the file descriptor in the child process will not make the parent process lose its lock: the reference in the global file table still remains open for the parent process, and it still holds a valid lock on the file. I know I digressed a bit by pointing out how descriptor sharing works between parent and child, but I would just like to be clear about my points.

What about lock operations performed by the process on its duplicated file descriptors — that is, descriptors acquired by invoking the “dup” system call on the original descriptor? Just as all the open file descriptors for a file are treated equally, duplicated descriptors bring no different behavior. Any unlock, lock conversion or close performed on a duplicated descriptor for a file “F” will affect the lock owned by the process on “F”.

NOTE: The code examples are more of the pseudo-code type. I would like to refrain from turning this into a tutorial on how to use file locking.

struct flock lock = { .l_type = F_RDLCK, .l_whence = SEEK_SET };   // whole file
struct flock unlock = { .l_type = F_UNLCK, .l_whence = SEEK_SET };

int fd1 = open(path, O_RDONLY);
// Get a lock on the file using fd1.
int ret = fcntl(fd1, F_SETLK, &lock);
// Use open and dup to get multiple descriptors.
int fd2 = open(path, O_RDONLY);
int fd3 = open(path, O_RDONLY);
int dup1 = dup(fd1);
int dup2 = dup(fd1);
// Any ONE of the following statements will release the lock:
ret = fcntl(fd2, F_SETLK, &unlock);
ret = fcntl(fd3, F_SETLK, &unlock);
close(fd2);
close(fd3);
ret = fcntl(dup1, F_SETLK, &unlock);
ret = fcntl(dup2, F_SETLK, &unlock);
close(dup1);
close(dup2);
// Lock released. Note that descriptor fd1 is still open and no unlock has been performed on it.

To summarize, programmers really need to be aware of the design of POSIX locks in order to ensure the correctness of their applications. We need to be careful in handling multiple open or duplicated file descriptors. An unlock or conversion might happen silently, and it may impact correctness. On a similar note, I would suggest refraining from using POSIX file locking in a multi-threaded environment. Nowadays we rarely see single-threaded programs, so there are not many use cases left for POSIX locking.

On the good side, POSIX locks provide byte-range locking and work consistently across NFS mounts. They also detect deadlocks: if putting a process to sleep on a blocking lock request would deadlock the involved processes, the blocking request fails (with EDEADLK) and returns instead of blocking indefinitely.

flock system call – another system call, originally implemented in the 4.2BSD operating system. Locks set by the flock system call are usually referred to as BSD locks.

flock supports only advisory locking. Also, it does not provide byte-range locks, instead all locks are full-file locks.

Unlike POSIX locks, BSD locks are associated with individual file descriptors and not to processes. So, all the different file descriptors acquired by the process for a particular file are treated independently. For example, a process opens a file “F” and takes a write lock on it. It now opens the same file again, and gets another descriptor. Any lock attempt by the process on this file descriptor will result in a “Lock-Conflict” error. Note that a similar operation in case of POSIX locks would have resulted in lock conversion. Even if the second descriptor is closed, it will not result in unlock unlike what happens in POSIX locking.

This behavior makes BSD locks more suitable for use in multi-threaded environment. The child threads will not be able to release or convert locks acquired by the parent process. So, applications can synchronize between threads of the same process. For synchronization, my previous assumption “child threads use their own separate open file descriptors” still holds for the same reason stated above.

There are some more things to be noted in the treatment of different descriptors.

1. Any lock operation(unlock, conversion etc) on a descriptor (acquired through “open” system call) will not affect the lock that process had obtained using a different open file descriptor.

int fd1 = open(path, O_RDONLY);
// get lock on file using fd1.
int ret = flock(fd1, LOCK_SH);
// open multiple descriptors
int fd2 = open(path, O_RDONLY);
int fd3 = open(path, O_RDONLY);
// doing unlock or closing these additional descriptors will not release the lock
ret = flock(fd2, LOCK_UN);
ret = flock(fd3, LOCK_UN);
close(fd2);
close(fd3);
// either do an unlock on fd1 or close it to release the lock.
ret = flock(fd1, LOCK_UN);

2. An unlock on a duplicate descriptor (acquired through the “dup” system call) will release the lock (if any) the process holds on the corresponding file. Similarly, the lock can be altered or converted using any of the duplicate descriptors.

int fd1 = open(path, O_RDONLY);
// get lock on file using fd1.
int ret = flock(fd1, LOCK_SH);
// duplicate the descriptor.
int dup1 = dup(fd1);
int dup2 = dup(fd1);
// using any of the below three statements will release the lock.
ret = flock(dup1, LOCK_UN);
ret = flock(dup2, LOCK_UN);
ret = flock(fd1, LOCK_UN);

3. Closing the duplicate descriptor will not release the lock. All such duplicate descriptors(including the original that was duplicated) should be closed in order to release the lock.

This implies that, in the case of duplicated descriptors, BSD locks will be held by a process as long as a reference remains in the global file table, or until we do an unlock on any of the duplicated descriptors. Reason? All the duplicated descriptors, including the original one, refer to the same entry in the global file table. So they basically refer to the same lock. Hence, an unlock on any one of them will actually release the lock. On the other hand, if the process is not doing an explicit “unlock” operation and is relying on the “close” system call, then all such descriptors need to be closed.

int fd1 = open(path, O_RDONLY);
// get lock on file using fd1.
int ret = flock(fd1, LOCK_SH);
// duplicate the descriptor.
int dup1 = dup(fd1);
int dup2 = dup(fd1);
// lock will be released only after all such descriptors have been closed.
close(dup1); 
close(dup2); 
close(fd1);
// lock released. 

As the locks are associated with file descriptors, a forked process inherits the locks (if any) from the parent process. Let’s take an example:

Example 1  – A process P1 opens a file “F”, and gets a read lock on it. It now forks a child process P2. As a result, P2’s file-descriptor table will get an entry for “F”. Both the entries in file-descriptor tables of P1 (say fd1) and P2 (say fd2) respectively will point to the same entry in global file table. So, both P1 and P2 own a read lock on file “F”.

What if one of the processes closes its descriptor? It will release the lock owned by that process on the file “F”.

What if one of the processes does an unlock on its descriptor? It will release the lock for both processes. Surprising!

What if one of the processes tries to convert the lock using its descriptor? It will convert the lock for both processes. So if P2 makes a write-lock request on its file descriptor fd2, it will result in both P1 and P2 owning a write lock on file “F”. Even more surprising! I fail to understand why any programmer would be okay with an exclusive lock owned by two different processes at the same time.

The last two observations above are unacceptable, and I find it hard to imagine a use case for them. Even though BSD locks are bound to file descriptors and not processes, the above example brings out the incompleteness of this implementation with respect to duplicate file descriptors. It kind of degenerates into POSIX-like behavior once the process has duplicated the descriptor: the lock can be released or altered through an operation on any of the duplicate descriptors. We had the same problem with POSIX locks, though there it was worse and universal.

I would say BSD locks only partially solve the problems associated with POSIX locks. They associate the locks with file descriptors, but the lock data needs to be moved out of the global file table into the process-specific file-descriptor table. That would be a probable way to fix the issue with duplicate descriptors.

BSD locks, i.e. locks acquired through the flock system call, do not work over NFS. In fact, some file-locking implementations on NFS are so horrible that a BSD lock request made on the client side is transparently converted to a POSIX lock on the server side. The client application might have used the flock system call, but the lock granted by the NFS server may be a POSIX lock. BSD locks do not support byte-range locks, which renders them almost useless for database software. They are also not as portable as POSIX locks, and they do not detect deadlocks.

Summary:

fcntl- and flock-style locks are completely orthogonal to each other. Any system providing both (Linux does) treats locks obtained through each of them independently. For example, a process uses fcntl and gets a read lock on file “F”. Another process using flock will still be able to get a write lock on the same file. So refrain from using both of them together.

Neither API maintains a count of the lock requests issued. For example, a process that has read-locked a file “F” might again request a read lock on the same whole file or the same region “n” more times. Doing so does not make the kernel increment any lock count. All the subsequent requests are treated as redundant, and a single “unlock” is enough to release the lock on the file or the region. I do not see any particular problem with this, but I feel that either these APIs were not tested completely or they were just designed this way. If the subsequent lock requests are treated as redundant no-ops, then the corresponding operations should really return an error or some message that lets the application know about it. Also, even though a single unlock is enough to release the lock, subsequent unlock calls still do not return errors. Like the redundant lock calls, they also return 0. Why would you allow a process to do an unlock when it no longer holds the lock? The operation is meaningless, and it is always good to let client applications know about it. I don’t see any real impact from this, but it’s just not the ideal design, is what I feel.

Both POSIX and BSD locks are automatically released when the process exits or is aborted. So in the event of failures like memory corruption or exceptions that lead to abnormal termination of the process, the kernel takes care of closing all the open file descriptors of the process. Hence, the kernel does the required lock cleanup in such conditions.

Both POSIX and BSD locks are preserved across execve calls, except when the process has set the FD_CLOEXEC flag, forcing the file descriptor to be closed and not inherited by the new program.

Some of you may want to read this for more information on differences between these two APIs with respect to NFS. It is a nice read and brings out interesting facts about potential security risks, and the usage of POSIX and BSD locks on NFS mounts.

Also, Jeffrey in his article highlights similar problems and even mentions an approach to solve some of the issues.

Another API:

lockf API – lockf is a library function, and a wrapper over fcntl on Linux systems. As mentioned here, POSIX.1-2001 does not specify any relation between lockf and fcntl. My experiments were mostly conducted on Linux, so for this discussion we can assume the behavior of locks placed through fcntl and lockf is the same.

POSIX locks have a major disadvantage in that they are bound to processes, and this messes up various things. Otherwise, they are more usable than BSD locks, as they work over NFS, provide byte-range locks along with both advisory and mandatory locking, and also detect deadlocks. If they were to be redesigned, I would suggest binding the locks to file descriptors in such a way that a given lock SHOULD BE released, altered or converted ONLY when the corresponding operation is performed on the file descriptor that was used to obtain the lock. BSD locks are bound to descriptors, but the situation with duplicate descriptors is again undesirable. I am not sure of the feasibility of this approach, but it is what struck me first after making all these observations.

POSIX locks already have a variety of features, and if we can do this enhancement, then they will find far more use cases than BSD locks.
