Lines Matching refs:the

2 	      Overview of the Linux Virtual File System
11 This file is released under the GPLv2.
17 The Virtual File System (also known as the Virtual Filesystem Switch)
18 is the software layer in the kernel that provides the filesystem
20 within the kernel which allows different filesystem implementations to
25 in the document Documentation/filesystems/Locking.
31 The VFS implements the open(2), stat(2), chmod(2), and similar system
32 calls. The pathname argument that is passed to them is used by the VFS
33 to search through the directory entry cache (also known as the dentry
39 most computers cannot fit all dentries in the RAM at the same time,
40 some bits of the cache are missing. In order to resolve your pathname
41 into a dentry, the VFS may have to resort to creating dentries along
42 the way, and then loading the inode. This is done by looking up the
51 beasts. They live either on the disc (for block device filesystems)
52 or in the memory (for pseudo filesystems). Inodes that live on the
53 disc are copied into the memory when required and changes to the inode
57 To look up an inode requires that the VFS calls the lookup() method of
58 the parent directory inode. This method is installed by the specific
59 filesystem implementation that the inode lives in. Once the VFS has
60 the required dentry (and hence the inode), we can do all those boring
61 things like open(2) the file, or stat(2) it to peek at the inode
62 data. The stat(2) operation is fairly simple: once the VFS has the
63 dentry, it peeks at the inode data and passes some of it back to
71 structure (this is the kernel-side implementation of file
73 a pointer to the dentry and a set of file operation member functions.
74 These are taken from the inode data. The open() file method is then
75 called so the specific filesystem implementation can do its work. You
76 can see that this is another switch performed by the VFS. The file
77 structure is placed into the file descriptor table for the process.
80 is done by using the userspace file descriptor to grab the appropriate
81 file structure, and then calling the required file structure method to
82 do whatever is required. For as long as the file is open, it keeps the
83 dentry in use, which in turn means that the VFS inode is still in use.
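The "switch" described in the entries above can be pictured with a small user-space model. This is only a sketch: every toy_* name is hypothetical and the structures are heavily simplified stand-ins, not the kernel's definitions. It shows the file structure taking its operations from the inode at open time, and later calls being dispatched through the descriptor table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct toy_file;

struct toy_file_operations {
    int  (*open)(struct toy_file *f);
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_inode {
    const struct toy_file_operations *i_fop; /* installed by the filesystem */
    const char *data;                        /* toy file contents */
};

struct toy_file {
    struct toy_inode *f_inode;
    const struct toy_file_operations *f_op;  /* taken from the inode at open */
    size_t f_pos;
};

#define TOY_MAX_FDS 8
static struct toy_file  toy_files[TOY_MAX_FDS];
static struct toy_file *toy_fd_table[TOY_MAX_FDS]; /* per-process fd table */

/* Methods of an imaginary in-memory filesystem. */
static int toy_ramfs_open(struct toy_file *f)
{
    f->f_pos = 0;
    return 0;
}

static long toy_ramfs_read(struct toy_file *f, char *buf, size_t len)
{
    size_t size = strlen(f->f_inode->data);

    assert(f->f_pos <= size);
    if (len > size - f->f_pos)
        len = size - f->f_pos;
    memcpy(buf, f->f_inode->data + f->f_pos, len);
    f->f_pos += len;
    return (long)len;
}

static const struct toy_file_operations toy_ramfs_fops = {
    .open = toy_ramfs_open,
    .read = toy_ramfs_read,
};

/* Build a file structure, call ->open(), install it in the fd table. */
int toy_open(struct toy_inode *inode)
{
    for (int fd = 0; fd < TOY_MAX_FDS; fd++) {
        if (toy_fd_table[fd] == NULL) {
            struct toy_file *f = &toy_files[fd];

            f->f_inode = inode;
            f->f_op = inode->i_fop;
            if (f->f_op->open && f->f_op->open(f) != 0)
                return -1;
            toy_fd_table[fd] = f;
            return fd;
        }
    }
    return -1; /* table full */
}

/* Use the descriptor to find the file structure, then call its method. */
long toy_read(int fd, char *buf, size_t len)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || toy_fd_table[fd] == NULL)
        return -1;
    return toy_fd_table[fd]->f_op->read(toy_fd_table[fd], buf, len);
}
```

A caller would create a toy_inode whose i_fop points at toy_ramfs_fops, pass it to toy_open(), and then read through the returned descriptor with toy_read().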
89 To register and unregister a filesystem, use the following API
99 the VFS will call the appropriate mount() method for the specific
100 filesystem. A new vfsmount referring to the tree returned by ->mount()
101 will be attached to the mountpoint, so that when pathname resolution
102 reaches the mountpoint it will jump into the root of that vfsmount.
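Registration followed by a mount request can be sketched in user space as a list of filesystem types searched by name. This is an illustration under assumptions, not the kernel API: the toy_* names, the -1 error value, and the string returned in place of a root dentry are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct file_system_type. */
struct toy_fs_type {
    const char *name;
    /* Returns a label standing in for the root dentry of the mounted tree. */
    const char *(*mount)(const char *dev_name);
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_fs_list; /* registered filesystem types */

int toy_register_filesystem(struct toy_fs_type *fs)
{
    assert(fs != NULL && fs->name != NULL);
    for (struct toy_fs_type *p = toy_fs_list; p; p = p->next)
        if (strcmp(p->name, fs->name) == 0)
            return -1; /* a type with this name is already registered */
    fs->next = toy_fs_list;
    toy_fs_list = fs;
    return 0;
}

int toy_unregister_filesystem(struct toy_fs_type *fs)
{
    for (struct toy_fs_type **pp = &toy_fs_list; *pp; pp = &(*pp)->next) {
        if (*pp == fs) {
            *pp = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -1; /* was not registered */
}

/* Find the type by name and call its mount() method. */
const char *toy_mount(const char *type, const char *dev_name)
{
    for (struct toy_fs_type *p = toy_fs_list; p; p = p->next)
        if (strcmp(p->name, type) == 0)
            return p->mount(dev_name);
    return NULL; /* unknown filesystem type */
}

/* An imaginary filesystem that ignores the device name. */
static const char *toy_ramfs_mount(const char *dev_name)
{
    (void)dev_name;
    return "toy-ramfs-root";
}

struct toy_fs_type toy_ramfs_type = { "toy_ramfs", toy_ramfs_mount, NULL };
```

In this model, toy_mount("toy_ramfs", "none") returns NULL until toy_ramfs_type has been registered, and again after it has been unregistered.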
104 You can see all filesystems that are registered to the kernel in the
111 This describes the filesystem. As of kernel 2.6.39, the following
127 name: the name of the filesystem type, such as "ext2", "iso9660",
132 mount: the method to call when a new instance of this
135 kill_sb: the method to call when an instance of this filesystem
145 The mount() method has the following arguments:
147 struct file_system_type *fs_type: describes the filesystem, partly initialized
148 by the specific filesystem code
152 const char *dev_name: the device name we are mounting.
157 The mount() method must return the root dentry of the tree requested by
158 the caller. An active reference to its superblock must be grabbed and the
164 contains a suitable filesystem image the method creates and initializes
168 doesn't have to create a new one. The main result from the caller's
169 point of view is a reference to the dentry at the root of the (sub)tree to
172 The most interesting member of the superblock structure that the
173 mount() method fills in is the "s_op" field. This is a pointer to
174 a "struct super_operations" which describes the next level of the
177 Usually, a filesystem uses one of the generic mount() implementations
184 mount_single: mount a filesystem which shares the instance between
187 A fill_super() callback implementation has the following arguments:
189 struct super_block *sb: the superblock structure. The callback
207 This describes how the VFS can manipulate the superblock of your
208 filesystem. As of kernel 2.6.22, the following members are defined:
251 dirty_inode: this method is called by the VFS to mark an inode dirty.
253 write_inode: this method is called when the VFS needs to write an
254 inode to disc. The second parameter indicates whether the write
257 drop_inode: called when the last access to the inode is dropped,
258 with the inode->i_lock spinlock held.
263 called regardless of the value of i_nlink)
265 The "generic_delete_inode()" behavior is equivalent to the
266 old practice of using "force_delete" in the put_inode() case,
267 but does not have the races that the "force_delete()" approach
270 delete_inode: called when the VFS wants to delete an inode
272 put_super: called when the VFS wishes to free the superblock
273 (i.e. unmount). This is called with the superblock lock held
276 a superblock. The second parameter indicates whether the method
277 should wait until the write out has been completed. Optional.
281 used by the Logical Volume Manager (LVM).
286 statfs: called when the VFS needs to get filesystem statistics.
288 remount_fs: called when the filesystem is remounted. This is called
289 with the kernel lock held
291 clear_inode: called when the VFS clears the inode. Optional
293 umount_begin: called when the VFS is unmounting a filesystem.
295 show_options: called by the VFS to show mount options for
298 quota_read: called by the VFS to read from filesystem quota file.
300 quota_write: called by the VFS to write to filesystem quota file.
302 nr_cached_objects: called by the sb cache shrinking function for the
303 filesystem to return the number of freeable cached objects it contains.
306 free_cached_objects: called by the sb cache shrinking function for the
307 filesystem to scan the number of objects indicated to try to free them.
311 We can't do anything with any errors that the filesystem might
312 have encountered, hence the void return type. This will never be called if
313 the VM is trying to reclaim under GFP_NOFS conditions, hence this
317 scanning loop that is done. This allows the VFS to determine
322 Whoever sets up the inode is responsible for filling in the "i_op" field. This
323 is a pointer to a "struct inode_operations" which describes the methods that
330 An inode object represents an object within the filesystem.
336 This describes how the VFS can manipulate an inode in your
337 filesystem. As of kernel 2.6.22, the following members are defined:
373 create: called by the open(2) and creat(2) system calls. Only
376 dentry). Here you will probably call d_instantiate() with the
377 dentry and the newly created inode
379 lookup: called when the VFS needs to look up an inode in a parent
380 directory. The name to look for is found in the dentry. This
381 method must call d_add() to insert the found inode into the
382 dentry. The "i_count" field in the inode structure should be
383 incremented. If the named inode does not exist a NULL inode
384 should be inserted into the dentry (this is called a negative
388 If you wish to overload the dentry methods then you should
389 initialise the "d_op" field in the dentry; this is a pointer
391 This method is called with the directory inode semaphore held
393 link: called by the link(2) system call. Only required if you want
395 d_instantiate() just as you would in the create() method
397 unlink: called by the unlink(2) system call. Only required if you
400 symlink: called by the symlink(2) system call. Only required if you
402 d_instantiate() just as you would in the create() method
404 mkdir: called by the mkdir(2) system call. Only required if you want
406 call d_instantiate() just as you would in the create() method
408 rmdir: called by the rmdir(2) system call. Only required if you want
411 mknod: called by the mknod(2) system call to create a device (char,
415 in the create() method
417 rename: called by the rename(2) system call to rename the object to
418 have the parent and name given by the second inode and dentry.
421 If no flags are supported by the filesystem then this method
422 need not be implemented. If some flags are supported then the
424 flags. Currently the following flags are implemented:
425 (1) RENAME_NOREPLACE: this flag indicates that if the target
426 of the rename exists the rename should fail with -EEXIST
427 instead of replacing the target. The VFS already checks for
428 existence, so for local filesystems the RENAME_NOREPLACE
431 exist; this is checked by the VFS. Unlike plain rename,
434 readlink: called by the readlink(2) system call. Only required if
437 follow_link: called by the VFS to follow a symbolic link to the
442 put_link: called by the VFS to release resources allocated by
444 to this method as the last parameter. It is used by
446 (i.e. page that was installed when the symbolic link walk
447 started might not be in the page cache at the end of the
450 permission: called by the VFS to check for access rights on a POSIX-like
454 mode, the filesystem must check the permission without blocking or
455 storing to the inode.
460 setattr: called by the VFS to set attributes for a file. This method
463 getattr: called by the VFS to get attributes of a file. This method
466 setxattr: called by the VFS to set an extended attribute for a file.
470 getxattr: called by the VFS to retrieve the value of an extended
474 listxattr: called by the VFS to list all extended attributes for a
477 removexattr: called by the VFS to remove an extended attribute from
480 update_time: called by the VFS to update a specific time or the i_version of
481 an inode. If this is not defined the VFS will update the inode itself
484 atomic_open: called on the last component of an open. Using this optional
485 method the filesystem can look up, possibly create and open the file in
486 one atomic operation. If it cannot perform this (e.g. the file type
488 usual 0 or -ve. This method is only called if the last component is
490 f_op->open(). If the file was created, the FILE_CREATED flag should be
491 set in "opened". In case of O_EXCL the method must only succeed if the
494 tmpfile: called at the end of O_TMPFILE open(). Optional, equivalent to
500 The address space object is used to group and manage pages in the page
501 cache. It can be used to keep track of the pages in a file (or
502 anything else) and also track the mapping of sections of the file into
510 The first can be used independently of the others. The VM can try to
512 pages in order to reuse them. To do this it can call the ->writepage
515 references will be released without notice being given to the
519 lru_cache_add and mark_page_active need to be called whenever the
523 maintains information about the PG_Dirty and PG_Writeback status of
527 The Dirty tag is primarily used by mpage_writepages - the default
528 ->writepages method. It uses the tag to find dirty pages to call
529 ->writepage on. If mpage_writepages is not used (i.e. the address
530 provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is
533 writing out the whole address_space.
541 typically using the 'private' field in the 'struct page'. If such
542 information is attached, the PG_Private flag should be set. This will
543 cause various VM routines to make extra calls into the address_space
547 application. Data is read into the address space a whole page at a
548 time, and provided to the application either by copying of the page,
549 or by memory-mapping the page.
550 Data is written into the address space by the application, and then
551 written back to storage, typically in whole pages; however, the
556 set_page_dirty to write data into the address_space, and writepage,
559 Adding and removing pages to/from an address_space is protected by the
562 When data is written to a page, the PG_Dirty flag should be set. It
573 This describes how the VFS can manipulate mapping of a file to page cache in
594 /* migrate the contents of a page to the specified target */
605 writepage: called by the VM to write a dirty page to backing store.
611 and should make sure the page is unlocked, either synchronously
612 or asynchronously when the write operation completes.
616 other pages from the mapping if that is easier (e.g. due to
618 should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
621 See the file "Locking" for more details.
623 readpage: called by the VM to read a page from backing store.
625 unlocked and marked uptodate once the read completes.
626 If ->readpage discovers that it needs to unlock the page for
628 In this case, the page will be relocated, relocked and if
631 writepages: called by the VM to write out pages associated with the
633 the writeback_control will specify a range of pages that must be
637 instead. This will choose pages from the address space that are
640 set_page_dirty: called by the VM to set a page dirty.
645 If defined, it should set the PageDirty flag, and the
646 PAGECACHE_TAG_DIRTY tag in the radix tree.
648 readpages: called by the VM to read pages associated with the address_space
656 Called by the generic buffered write code to ask the filesystem to
657 prepare to write len bytes at the given offset in the file. The
658 address_space should check that the write will be able to complete,
660 housekeeping. If the write will update parts of any basic-blocks on
662 read already) so that the updated blocks can be written out properly.
664 The filesystem must return the locked pagecache page for the specified
665 offset, in *pagep, for the caller to write into.
667 It must be able to cope with short writes (where the length passed to
668 write_begin is greater than the number of bytes copied into the page).
676 Returns 0 on success; < 0 on failure (which is the error code), in
680 be called. len is the original len passed to write_begin, and copied
681 is the amount that was able to be copied (copied == len is always true
682 if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag).
684 The filesystem must take care of unlocking the page and releasing it
687 Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
690 bmap: called by the VFS to map a logical block offset within an object to
691 physical block number. This method is used by the FIBMAP
693 a file, the file must have a stable mapping to a block
694 device. The swap system does not go through the filesystem
695 but instead uses bmap to find out where the blocks in the file
699 alternative to f_op->open(), the difference is that this method may open
700 a file not necessarily originating from the same filesystem as the one
706 will be called when part or all of the page is to be removed
707 from the address space. This generally corresponds to either a
708 truncation, punch hole or a complete invalidation of the address
709 space (in the latter case 'offset' will always be 0 and 'length'
710 will be PAGE_CACHE_SIZE). Any private data associated with the page
712 length is PAGE_CACHE_SIZE, then the private data should be released,
713 because the page must be able to be completely discarded. This may
714 be done by calling the ->releasepage function, but in this case the
718 that the page should be freed if possible. ->releasepage
719 should remove any private data from the page and clear the
723 first is when the VM finds a clean page with no active users and
724 wants to make it a free page. If ->releasepage succeeds, the
725 page will be removed from the address_space and become free.
729 through the fadvise(POSIX_FADV_DONTNEED) system call or by the
731 they believe the cache may be out of date with storage) by
733 If the filesystem makes such a call, and needs to be certain
735 need to ensure this. Possibly it can clear the PageUptodate
738 freepage: freepage is called once the page is no longer visible in
739 the page cache in order to allow the cleanup of any private
740 data. Since it may be called by the memory reclaimer, it
741 should not assume that the original address_space mapping still
744 direct_IO: called by the generic read/write routines to perform
745 direct_IO - that is IO requests which bypass the page cache
746 and transfer data directly between the storage and the
749 migrate_page: This is used to compact the physical memory usage.
750 If the VM wants to relocate a page (maybe off a memory card
754 that it has to the page.
756 launder_page: Called before freeing a page - it writes back the dirty page. To
757 prevent redirtying the page, it is kept locked during the whole
760 is_partially_uptodate: Called by the VM when reading a file through the
761 pagecache when the underlying blocksize != pagesize. If the required
762 block is up to date then the read can complete without needing the IO
763 to bring the whole page up to date.
765 is_dirty_writeback: Called by the VM when attempting to reclaim a page.
771 allows a filesystem to indicate to the VM if a page should be
772 treated as dirty or writeback for the purposes of stalling.
780 space if necessary and pin the block lookup information in
799 This describes how the VFS can manipulate an open file. As of kernel
800 3.12, the following members are defined:
835 llseek: called when the VFS needs to move the file position index
845 iterate: called when the VFS needs to read the directory contents
847 poll: called by the VFS when a process wants to check if there is
849 is activity. Called by the select(2) and poll(2) system calls
851 unlocked_ioctl: called by the ioctl(2) system call.
853 compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
856 mmap: called by the mmap(2) system call
858 open: called by the VFS when an inode should be opened. When the VFS
859 opens a file, it creates a new "struct file". It then calls the
860 open method for the newly allocated file structure. You might
861 think that the open method really belongs in
863 done the way it is because it makes filesystems simpler to
864 implement. The open() method is a good place to initialize the
865 "private_data" member in the file structure if you want to point
868 flush: called by the close(2) system call to flush a file
870 release: called when the last reference to an open file is closed
872 fsync: called by the fsync(2) system call
874 fasync: called by the fcntl(2) system call when asynchronous
877 lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
880 get_unmapped_area: called by the mmap(2) system call
882 check_flags: called by the fcntl(2) system call for F_SETFL command
884 flock: called by the flock(2) system call
886 splice_write: called by the VFS to splice data from a pipe to a file. This
887 method is used by the splice(2) system call
889 splice_read: called by the VFS to splice data from file to a pipe. This
890 method is used by the splice(2) system call
892 setlease: called by the VFS to set or release a file lock lease. setlease
894 the lease in the inode after setting it.
896 fallocate: called by the VFS to preallocate blocks or punch a hole.
898 Note that the file operations are implemented by the specific
899 filesystem in which the inode resides. When opening a device node
901 support routines in the VFS which will locate the required device
902 driver information. These support routines replace the filesystem file
903 operations with those for the device driver, and then proceed to call
904 the new open() method for the file. This is how opening a device file
905 in the filesystem eventually ends up calling the device driver open()
916 This describes how a filesystem can overload the standard dentry
917 operations. Dentries and the dcache are the domain of the VFS and the
920 the VFS uses a default. As of kernel 2.6.22, the following members are
937 d_revalidate: called when the VFS needs to revalidate a dentry. This
938 is called whenever a name look-up finds a dentry in the
940 dentries in the dcache are valid. Network filesystems are different
941 since things can change on the server without the client necessarily
944 This function should return a positive value if the dentry is still
948 If in rcu-walk mode, the filesystem must revalidate the dentry without
949 blocking or storing to the dentry; d_parent and d_inode should not be
956 d_weak_revalidate: called when the VFS needs to revalidate a "jumped" dentry.
958 doing a lookup in the parent directory. This includes "/", "." and "..",
961 In this case, we are less concerned with whether the dentry is still
962 fully correct, but rather that the inode is still valid. As with
966 This function has the same return code semantics as d_revalidate.
970 d_hash: called when the VFS adds a dentry to the hash table. The first
971 dentry passed to d_hash is the parent directory that the name is
978 dentry is the parent of the dentry to be compared, the second is
979 the child dentry. len and name string are properties of the dentry
980 to be compared. qstr is the name to compare it with.
983 possible, and should not store into the dentry.
984 Should not dereference pointers outside the dentry without
987 However, our vfsmount is pinned, and RCU held, so the dentries and
994 d_delete: called when the last reference to a dentry is dropped and the
996 immediately, or 0 to cache the dentry. Default is NULL which means to
1003 being deallocated). The default when this is NULL is that the
1007 d_dname: called when the pathname of a dentry should be generated.
1010 it's done only when the path is needed.). Real filesystems probably
1013 held, d_dname() should not try to modify the dentry itself, unless
1016 at the end of the buffer, and returns a pointer to the first char.
1020 This should create a new VFS mount record and return the record to the
1021 caller. The caller is supplied with a path parameter giving the
1022 automount directory to describe the automount target and the parent
1024 be returned if someone else managed to make the automount first. If
1025 the vfsmount creation failed, then an error code should be returned.
1026 If -EISDIR is returned, then the directory will be treated as an
1029 If a vfsmount is returned, the caller will attempt to mount it on the
1030 mountpoint and will remove the vfsmount from its expiration list in
1031 the case of failure. The vfsmount should be returned with 2 refs on
1032 it to prevent automatic expiration - the caller will clean up the
1035 This function is only used if DCACHE_NEED_AUTOMOUNT is set on the
1036 dentry. This is set by __d_instantiate() if S_AUTOMOUNT is set on the
1039 d_manage: called to allow the filesystem to manage the transition from a
1041 waiting to explore behind a 'mountpoint' whilst letting the daemon go
1042 past and construct the subtree there. 0 should be returned to let the
1045 mounted on it and not to check the automount flag. Any other error
1048 If the 'rcu_walk' parameter is true, then the caller is doing a
1050 and the caller can be asked to leave it and call again by returning
1054 This function is only used if DCACHE_MANAGE_TRANSIT is set on the
1077 the usage count)
1079 dput: close a handle for a dentry (decrements the usage count). If
1080 the usage count drops to 0, and the dentry is still in its
1081 parent's hash, the "d_delete" method is called to check whether
1082 it should be cached. If it should not be cached, or if the dentry
1087 subsequent call to dput() will deallocate the dentry if its
1091 the dentry then the dentry is turned into a negative dentry
1092 (the d_iput() method is called). If there are other
1098 d_instantiate: add a dentry to the alias hash list for the inode and
1099 update the "d_inode" member. The "i_count" member in the
1100 inode structure should be set/incremented. If the inode
1101 pointer is NULL, the dentry is called a "negative
1106 It looks up the child of that given name from the dcache
1107 hash table. If it is found, the reference count is incremented
1108 and the dentry is returned. The caller must use dput()
1109 to free the dentry when it finishes using it.
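The reference-counting contract between a lookup and dput() can be illustrated with a toy cache. This is a user-space sketch under assumptions: the toy_* names are hypothetical, a linked list stands in for the dcache hash table, and nothing is actually deallocated when the count reaches zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified dentry: a name, a parent and a usage count. */
struct toy_dentry {
    const char *name;
    struct toy_dentry *parent;
    int d_count;
    struct toy_dentry *next; /* chain standing in for the hash table */
};

static struct toy_dentry *toy_dcache;

/* Insert a dentry into the toy cache. */
void toy_d_add(struct toy_dentry *d)
{
    d->next = toy_dcache;
    toy_dcache = d;
}

/* Open a new handle: increment the usage count. */
struct toy_dentry *toy_dget(struct toy_dentry *d)
{
    if (d)
        d->d_count++;
    return d;
}

/* Close a handle: decrement the usage count. */
void toy_dput(struct toy_dentry *d)
{
    if (!d)
        return;
    assert(d->d_count > 0); /* every put must balance an earlier get */
    d->d_count--;
}

/* Look up a child by name; on a hit the caller owns one reference. */
struct toy_dentry *toy_d_lookup(struct toy_dentry *parent, const char *name)
{
    for (struct toy_dentry *p = toy_dcache; p; p = p->next)
        if (p->parent == parent && strcmp(p->name, name) == 0)
            return toy_dget(p);
    return NULL;
}
```

Each successful toy_d_lookup() must be paired with one toy_dput(); a miss returns NULL and takes no reference.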
1117 On mount and remount the filesystem is passed a string containing a
1132 to show all the currently active options. The rules are:
1135 from the default
1140 Options used only internally between a mount helper and the kernel
1141 (such as file descriptors), or which only have an effect during the
1142 mounting (such as ones controlling the creation of a journal) are exempt
1143 from the above rules.
1145 The underlying reason for the above rules is to make sure that a
1147 based on the information found in /proc/mounts.
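The rule that options differing from the default must be shown can be sketched as a parse/show pair. This is a user-space illustration under assumptions: the toy_* names and the two options "uid" and "quota" are invented for the example, and the leading comma in the output is assumed here to mimic how options are appended to a mount line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-mount options with their defaults. */
struct toy_opts {
    int uid;
    int quota;
};

static const struct toy_opts toy_defaults = { 0, 0 };

/* Parse a comma-separated option string such as "uid=1000,quota".
 * The string is modified in place by strtok(). */
int toy_parse_options(char *data, struct toy_opts *o)
{
    *o = toy_defaults;
    for (char *tok = strtok(data, ","); tok; tok = strtok(NULL, ",")) {
        if (strcmp(tok, "quota") == 0)
            o->quota = 1;
        else if (sscanf(tok, "uid=%d", &o->uid) != 1)
            return -1; /* unknown or malformed option */
    }
    return 0;
}

/* Emit only the options whose values differ from the defaults. */
size_t toy_show_options(const struct toy_opts *o, char *buf, size_t size)
{
    size_t n = 0;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';
    if (o->uid != toy_defaults.uid && n < size)
        n += (size_t)snprintf(buf + n, size - n, ",uid=%d", o->uid);
    if (o->quota != toy_defaults.quota && n < size)
        n += (size_t)snprintf(buf + n, size - n, ",quota");
    return n;
}
```

Feeding the output of toy_show_options() (minus the leading comma) back into toy_parse_options() reproduces the same toy_opts, which is the property the rules are meant to guarantee.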
1150 them is provided with the save_mount_options() and
1158 (Note some of these resources are not up-to-date with the latest kernel
1167 A tour of the Linux VFS by Michael K. Johnson. 1996
1170 A small trail through the Linux kernel by Andries Brouwer. 2001