vfs.txt - OpenGrok cross reference for /linux-4.1.27/Documentation/filesystems/vfs.txt

Lines Matching refs:the
2 	      Overview of the Linux Virtual File System
11   This file is released under the GPLv2.
17 The Virtual File System (also known as the Virtual Filesystem Switch)
18 is the software layer in the kernel that provides the filesystem
20 within the kernel which allows different filesystem implementations to
25 in the document Documentation/filesystems/Locking.
31 The VFS implements the open(2), stat(2), chmod(2), and similar system
32 calls. The pathname argument that is passed to them is used by the VFS
33 to search through the directory entry cache (also known as the dentry
39 most computers cannot fit all dentries in the RAM at the same time,
40 some bits of the cache are missing. In order to resolve your pathname
41 into a dentry, the VFS may have to resort to creating dentries along
42 the way, and then loading the inode. This is done by looking up the
51 beasts.  They live either on the disc (for block device filesystems)
52 or in the memory (for pseudo filesystems). Inodes that live on the
53 disc are copied into the memory when required and changes to the inode
57 To look up an inode requires that the VFS calls the lookup() method of
58 the parent directory inode. This method is installed by the specific
59 filesystem implementation that the inode lives in. Once the VFS has
60 the required dentry (and hence the inode), we can do all those boring
61 things like open(2) the file, or stat(2) it to peek at the inode
62 data. The stat(2) operation is fairly simple: once the VFS has the
63 dentry, it peeks at the inode data and passes some of it back to
71 structure (this is the kernel-side implementation of file
73 a pointer to the dentry and a set of file operation member functions.
74 These are taken from the inode data. The open() file method is then
75 called so the specific filesystem implementation can do its work. You
76 can see that this is another switch performed by the VFS. The file
77 structure is placed into the file descriptor table for the process.
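
The "switch" described above can be pictured with a small userspace toy
model. This is only a sketch under stated assumptions: every name in it
(Inode, File, Process, sys_open, sys_read, RAMFS_FILE_OPS) is invented for
illustration and none of it is kernel code. It shows just two things: the
file structure borrows its method table from the inode, and later calls are
dispatched through that table.

```python
# Toy model of the VFS "switch"; every name here is hypothetical.


class Inode:
    def __init__(self, data, f_op):
        self.data = data      # file contents, standing in for on-disc blocks
        self.f_op = f_op      # method table installed by the "filesystem"


class File:
    def __init__(self, inode):
        self.inode = inode
        self.f_op = inode.f_op   # file operations are taken from the inode
        self.pos = 0


def ramfs_open(file):
    # Filesystem-specific open hook; this toy filesystem has nothing to do.
    return 0


def ramfs_read(file, count):
    chunk = file.inode.data[file.pos:file.pos + count]
    file.pos += len(chunk)
    return chunk


RAMFS_FILE_OPS = {"open": ramfs_open, "read": ramfs_read}


class Process:
    def __init__(self):
        self.fd_table = {}    # file descriptor -> File
        self.next_fd = 3      # 0, 1 and 2 are conventionally stdio

    def sys_open(self, inode):
        file = File(inode)
        err = file.f_op["open"](file)   # the switch: call the fs's open()
        if err:
            return err
        fd = self.next_fd
        self.next_fd += 1
        self.fd_table[fd] = file        # place the file in the fd table
        return fd

    def sys_read(self, fd, count):
        file = self.fd_table[fd]                # fd -> file structure
        return file.f_op["read"](file, count)   # dispatch through the table
```

A different "filesystem" would only need to supply a different table; the
Process code would stay the same, which is the point of the switch.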
80 is done by using the userspace file descriptor to grab the appropriate
81 file structure, and then calling the required file structure method to
82 do whatever is required. For as long as the file is open, it keeps the
83 dentry in use, which in turn means that the VFS inode is still in use.
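
The effect of an open file keeping its inode in use can be observed from
userspace. The following is a minimal sketch, assuming a POSIX-like local
filesystem on which a file can be unlinked while it is still open; the
helper name read_after_unlink is made up for this example.

```python
import os
import tempfile


def read_after_unlink(payload):
    """Write payload to a temp file, unlink it, then read it back via the fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                    # the name leaves the directory...
        gone = not os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, len(payload))   # ...but the open file still works
        return gone, data
    finally:
        os.close(fd)                       # drop the last reference to the file
```

The name disappears immediately, while the data stays reachable through the
descriptor until it is closed.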
89 To register and unregister a filesystem, use the following API
99 the VFS will call the appropriate mount() method for the specific
100 filesystem.  A new vfsmount referring to the tree returned by ->mount()
101 will be attached to the mountpoint, so that when pathname resolution
102 reaches the mountpoint it will jump into the root of that vfsmount.
104 You can see all filesystems that are registered to the kernel in the
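
On Linux the list of registered filesystems is exposed as the text file
/proc/filesystems. A rough sketch of reading it follows; the function names
are hypothetical, and the format assumed is one filesystem per line, with an
optional "nodev" flag before a tab for types that do not need a block device.

```python
def parse_filesystems(text):
    """Parse /proc/filesystems-style text into (name, needs_device) pairs."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Lines look like "nodev<TAB>sysfs" or "<TAB>ext4".
        flag, _, name = line.rpartition("\t")
        entries.append((name.strip(), flag.strip() != "nodev"))
    return entries


def registered_filesystems(path="/proc/filesystems"):
    """Read and parse the list from a running system (Linux only)."""
    with open(path) as f:
        return parse_filesystems(f.read())
```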
111 This describes the filesystem. As of kernel 2.6.39, the following
127   name: the name of the filesystem type, such as "ext2", "iso9660",
132   mount: the method to call when a new instance of this
135   kill_sb: the method to call when an instance of this filesystem
145 The mount() method has the following arguments:
147   struct file_system_type *fs_type: describes the filesystem, partly initialized
148   	by the specific filesystem code
152   const char *dev_name: the device name we are mounting.
157 The mount() method must return the root dentry of the tree requested by
158 the caller.  An active reference to its superblock must be grabbed and the
164 contains a suitable filesystem image the method creates and initializes
168 doesn't have to create a new one.  The main result from the caller's
169 point of view is a reference to the dentry at the root of the (sub)tree to
172 The most interesting member of the superblock structure that the
173 mount() method fills in is the "s_op" field. This is a pointer to
174 a "struct super_operations" which describes the next level of the
177 Usually, a filesystem uses one of the generic mount() implementations
184   mount_single: mount a filesystem which shares the instance between
187 A fill_super() callback implementation has the following arguments:
189   struct super_block *sb: the superblock structure. The callback
207 This describes how the VFS can manipulate the superblock of your
208 filesystem. As of kernel 2.6.22, the following members are defined:
251   dirty_inode: this method is called by the VFS to mark an inode dirty.
253   write_inode: this method is called when the VFS needs to write an
254 	inode to disc.  The second parameter indicates whether the write
257   drop_inode: called when the last access to the inode is dropped,
258 	with the inode->i_lock spinlock held.
263 	called regardless of the value of i_nlink)
265 	The "generic_delete_inode()" behavior is equivalent to the
266 	old practice of using "force_delete" in the put_inode() case,
267 	but does not have the races that the "force_delete()" approach
270   delete_inode: called when the VFS wants to delete an inode
272   put_super: called when the VFS wishes to free the superblock
273 	(i.e. unmount). This is called with the superblock lock held
276   	a superblock. The second parameter indicates whether the method
277 	should wait until the write out has been completed. Optional.
281   	used by the Logical Volume Manager (LVM).
286   statfs: called when the VFS needs to get filesystem statistics.
288   remount_fs: called when the filesystem is remounted. This is called
289 	with the kernel lock held
291   clear_inode: called when the VFS clears the inode. Optional
293   umount_begin: called when the VFS is unmounting a filesystem.
295   show_options: called by the VFS to show mount options for
298   quota_read: called by the VFS to read from filesystem quota file.
300   quota_write: called by the VFS to write to filesystem quota file.
302   nr_cached_objects: called by the sb cache shrinking function for the
303 	filesystem to return the number of freeable cached objects it contains.
306   free_cache_objects: called by the sb cache shrinking function for the
307 	filesystem to scan the number of objects indicated to try to free them.
311 	We can't do anything with any errors that the filesystem might
312 	have encountered, hence the void return type. This will never be called if
313 	the VM is trying to reclaim under GFP_NOFS conditions, hence this
317 	scanning loop that is done. This allows the VFS to determine
322 Whoever sets up the inode is responsible for filling in the "i_op" field. This
323 is a pointer to a "struct inode_operations" which describes the methods that
330 An inode object represents an object within the filesystem.
336 This describes how the VFS can manipulate an inode in your
337 filesystem. As of kernel 2.6.22, the following members are defined:
373   create: called by the open(2) and creat(2) system calls. Only
376 	dentry). Here you will probably call d_instantiate() with the
377 	dentry and the newly created inode
379   lookup: called when the VFS needs to look up an inode in a parent
380 	directory. The name to look for is found in the dentry. This
381 	method must call d_add() to insert the found inode into the
382 	dentry. The "i_count" field in the inode structure should be
383 	incremented. If the named inode does not exist a NULL inode
384 	should be inserted into the dentry (this is called a negative
388 	If you wish to overload the dentry methods then you should
389 	initialise the "d_dop" field in the dentry; this is a pointer
391 	This method is called with the directory inode semaphore held
393   link: called by the link(2) system call. Only required if you want
395 	d_instantiate() just as you would in the create() method
397   unlink: called by the unlink(2) system call. Only required if you
400   symlink: called by the symlink(2) system call. Only required if you
402 	d_instantiate() just as you would in the create() method
404   mkdir: called by the mkdir(2) system call. Only required if you want
406 	call d_instantiate() just as you would in the create() method
408   rmdir: called by the rmdir(2) system call. Only required if you want
411   mknod: called by the mknod(2) system call to create a device (char,
415 	in the create() method
417   rename: called by the rename(2) system call to rename the object to
418 	have the parent and name given by the second inode and dentry.
421 	If no flags are supported by the filesystem then this method
422 	need not be implemented.  If some flags are supported then the
424 	flags.  Currently the following flags are implemented:
425 	(1) RENAME_NOREPLACE: this flag indicates that if the target
426 	of the rename exists the rename should fail with -EEXIST
427 	instead of replacing the target.  The VFS already checks for
428 	existence, so for local filesystems the RENAME_NOREPLACE
431 	exist; this is checked by the VFS.  Unlike plain rename,
434   readlink: called by the readlink(2) system call. Only required if
437   follow_link: called by the VFS to follow a symbolic link to the
442   put_link: called by the VFS to release resources allocated by
444   	to this method as the last parameter.  It is used by
446   	(i.e. page that was installed when the symbolic link walk
447   	started might not be in the page cache at the end of the
450   permission: called by the VFS to check for access rights on a POSIX-like
454         mode, the filesystem must check the permission without blocking or
455 	storing to the inode.
460   setattr: called by the VFS to set attributes for a file. This method
463   getattr: called by the VFS to get attributes of a file. This method
466   setxattr: called by the VFS to set an extended attribute for a file.
470   getxattr: called by the VFS to retrieve the value of an extended
474   listxattr: called by the VFS to list all extended attributes for a
477   removexattr: called by the VFS to remove an extended attribute from
480   update_time: called by the VFS to update a specific time or the i_version of
481   	an inode.  If this is not defined the VFS will update the inode itself
484   atomic_open: called on the last component of an open.  Using this optional
485   	method the filesystem can look up, possibly create and open the file in
486   	one atomic operation.  If it cannot perform this (e.g. the file type
488 	usual 0 or -ve.  This method is only called if the last component is
490 	f_op->open().  If the file was created, the FILE_CREATED flag should be
491 	set in "opened".  In case of O_EXCL the method must only succeed if the
494   tmpfile: called at the end of O_TMPFILE open().  Optional, equivalent to
500 The address space object is used to group and manage pages in the page
501 cache.  It can be used to keep track of the pages in a file (or
502 anything else) and also track the mapping of sections of the file into
510 The first can be used independently of the others.  The VM can try to
512 pages in order to reuse them.  To do this it can call the ->writepage
515 references will be released without notice being given to the
519 lru_cache_add and mark_page_active need to be called whenever the
523 maintains information about the PG_Dirty and PG_Writeback status of
527 The Dirty tag is primarily used by mpage_writepages - the default
528 ->writepages method.  It uses the tag to find dirty pages to call
529 ->writepage on.  If mpage_writepages is not used (i.e. the address
530 provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is
533 writing out the whole address_space.
541 typically using the 'private' field in the 'struct page'.  If such
542 information is attached, the PG_Private flag should be set.  This will
543 cause various VM routines to make extra calls into the address_space
547 application.  Data is read into the address space a whole page at a
548 time, and provided to the application either by copying of the page,
549 or by memory-mapping the page.
550 Data is written into the address space by the application, and then
551 written-back to storage typically in whole pages, however the
556 set_page_dirty to write data into the address_space, and writepage,
559 Adding and removing pages to/from an address_space is protected by the
562 When data is written to a page, the PG_Dirty flag should be set.  It
573 This describes how the VFS can manipulate mapping of a file to page cache in
594 	/* migrate the contents of a page to the specified target */
605   writepage: called by the VM to write a dirty page to backing store.
611       and should make sure the page is unlocked, either synchronously
612       or asynchronously when the write operation completes.
616       other pages from the mapping if that is easier (e.g. due to
618       should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
621       See the file "Locking" for more details.
623   readpage: called by the VM to read a page from backing store.
625        unlocked and marked uptodate once the read completes.
626        If ->readpage discovers that it needs to unlock the page for
628        In this case, the page will be relocated, relocked and if
631   writepages: called by the VM to write out pages associated with the
633   	the writeback_control will specify a range of pages that must be
637   	instead.  This will choose pages from the address space that are
640   set_page_dirty: called by the VM to set a page dirty.
645 	If defined, it should set the PageDirty flag, and the
646         PAGECACHE_TAG_DIRTY tag in the radix tree.
648   readpages: called by the VM to read pages associated with the address_space
656 	Called by the generic buffered write code to ask the filesystem to
657 	prepare to write len bytes at the given offset in the file. The
658 	address_space should check that the write will be able to complete,
660 	housekeeping.  If the write will update parts of any basic-blocks on
662 	read already) so that the updated blocks can be written out properly.
664         The filesystem must return the locked pagecache page for the specified
665 	offset, in *pagep, for the caller to write into.
667 	It must be able to cope with short writes (where the length passed to
668 	write_begin is greater than the number of bytes copied into the page).
676         Returns 0 on success; < 0 on failure (which is the error code), in
680         be called. len is the original len passed to write_begin, and copied
681         is the amount that was able to be copied (copied == len is always true
682 	if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag).
684         The filesystem must take care of unlocking the page and releasing it
687         Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
690   bmap: called by the VFS to map a logical block offset within an object to
691   	a physical block number. This method is used by the FIBMAP
693   	a file, the file must have a stable mapping to a block
694   	device.  The swap system does not go through the filesystem
695   	but instead uses bmap to find out where the blocks in the file
699 	alternative to f_op->open(), the difference is that this method may open
700 	a file not necessarily originating from the same filesystem as the one
706         will be called when part or all of the page is to be removed
707 	from the address space.  This generally corresponds to either a
708 	truncation, punch hole or a complete invalidation of the address
709 	space (in the latter case 'offset' will always be 0 and 'length'
710 	will be PAGE_CACHE_SIZE). Any private data associated with the page
712 	length is PAGE_CACHE_SIZE, then the private data should be released,
713 	because the page must be able to be completely discarded.  This may
714 	be done by calling the ->releasepage function, but in this case the
718         that the page should be freed if possible.  ->releasepage
719         should remove any private data from the page and clear the
723 	first is when the VM finds a clean page with no active users and
724         wants to make it a free page.  If ->releasepage succeeds, the
725         page will be removed from the address_space and become free.
729         through the fadvise(POSIX_FADV_DONTNEED) system call or by the
731         they believe the cache may be out of date with storage) by
733 	If the filesystem makes such a call, and needs to be certain
735         need to ensure this.  Possibly it can clear the PageUptodate
738   freepage: freepage is called once the page is no longer visible in
739         the page cache in order to allow the cleanup of any private
740 	data. Since it may be called by the memory reclaimer, it
741 	should not assume that the original address_space mapping still
744   direct_IO: called by the generic read/write routines to perform
745         direct_IO - that is IO requests which bypass the page cache
746         and transfer data directly between the storage and the
749   migrate_page:  This is used to compact the physical memory usage.
750         If the VM wants to relocate a page (maybe off a memory card
754         that it has to the page.
756   launder_page: Called before freeing a page - it writes back the dirty page. To
757   	prevent redirtying the page, it is kept locked during the whole
760   is_partially_uptodate: Called by the VM when reading a file through the
761 	pagecache when the underlying blocksize != pagesize. If the required
762 	block is up to date then the read can complete without needing the IO
763 	to bring the whole page up to date.
765   is_dirty_writeback: Called by the VM when attempting to reclaim a page.
771 	allows a filesystem to indicate to the VM if a page should be
772 	treated as dirty or writeback for the purposes of stalling.
780 	space if necessary and pin the block lookup information in
799 This describes how the VFS can manipulate an open file. As of kernel
800 3.12, the following members are defined:
835   llseek: called when the VFS needs to move the file position index
845   iterate: called when the VFS needs to read the directory contents
847   poll: called by the VFS when a process wants to check if there is
849 	is activity. Called by the select(2) and poll(2) system calls
851   unlocked_ioctl: called by the ioctl(2) system call.
853   compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
856   mmap: called by the mmap(2) system call
858   open: called by the VFS when an inode should be opened. When the VFS
859 	opens a file, it creates a new "struct file". It then calls the
860 	open method for the newly allocated file structure. You might
861 	think that the open method really belongs in
863 	done the way it is because it makes filesystems simpler to
864 	implement. The open() method is a good place to initialize the
865 	"private_data" member in the file structure if you want to point
868   flush: called by the close(2) system call to flush a file
870   release: called when the last reference to an open file is closed
872   fsync: called by the fsync(2) system call
874   fasync: called by the fcntl(2) system call when asynchronous
877   lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
880   get_unmapped_area: called by the mmap(2) system call
882   check_flags: called by the fcntl(2) system call for F_SETFL command
884   flock: called by the flock(2) system call
886   splice_write: called by the VFS to splice data from a pipe to a file. This
887 		method is used by the splice(2) system call
889   splice_read: called by the VFS to splice data from file to a pipe. This
890 	       method is used by the splice(2) system call
892   setlease: called by the VFS to set or release a file lock lease. setlease
894 	    the lease in the inode after setting it.
896   fallocate: called by the VFS to preallocate blocks or punch a hole.
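
Several of the methods listed above can be reached from userspace through
the system calls named in their descriptions. The sketch below is only an
illustration: the helper name exercise_file_methods is hypothetical, and the
comments mapping each call to a method follow the descriptions in this list
rather than a trace of any particular filesystem. It assumes a local Linux
filesystem for the temporary file.

```python
import fcntl
import os
import tempfile


def exercise_file_methods():
    """Drive a few file methods from userspace; returns what was observed."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.fsync(fd)                          # fsync(2) -> fsync method
        pos = os.lseek(fd, 4, os.SEEK_SET)    # lseek(2) -> llseek method
        data = os.read(fd, 3)

        fcntl.flock(fd, fcntl.LOCK_EX)        # flock(2) -> flock method
        other = os.open(path, os.O_RDONLY)    # a second, independent open file
        try:
            fcntl.flock(other, fcntl.LOCK_EX | fcntl.LOCK_NB)
            contended = False
        except BlockingIOError:
            contended = True                  # the first open file holds the lock
        finally:
            os.close(other)
        return pos, data, contended
    finally:
        os.close(fd)                          # last close -> release method
        os.unlink(path)
```

Whether the second flock() attempt is refused depends on how the underlying
filesystem implements the lock, so the third return value is reported rather
than assumed.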
898 Note that the file operations are implemented by the specific
899 filesystem in which the inode resides. When opening a device node
901 support routines in the VFS which will locate the required device
902 driver information. These support routines replace the filesystem file
903 operations with those for the device driver, and then proceed to call
904 the new open() method for the file. This is how opening a device file
905 in the filesystem eventually ends up calling the device driver open()
916 This describes how a filesystem can overload the standard dentry
917 operations. Dentries and the dcache are the domain of the VFS and the
920 the VFS uses a default. As of kernel 2.6.22, the following members are
937   d_revalidate: called when the VFS needs to revalidate a dentry. This
938 	is called whenever a name look-up finds a dentry in the
940 	dentries in the dcache are valid. Network filesystems are different
941 	since things can change on the server without the client necessarily
944 	This function should return a positive value if the dentry is still
948 	If in rcu-walk mode, the filesystem must revalidate the dentry without
949 	blocking or storing to the dentry; d_parent and d_inode should not be
956  d_weak_revalidate: called when the VFS needs to revalidate a "jumped" dentry.
958 	doing a lookup in the parent directory. This includes "/", "." and "..",
961 	In this case, we are less concerned with whether the dentry is still
962 	fully correct, but rather that the inode is still valid. As with
966 	This function has the same return code semantics as d_revalidate.
970   d_hash: called when the VFS adds a dentry to the hash table. The first
971 	dentry passed to d_hash is the parent directory that the name is
978 	dentry is the parent of the dentry to be compared, the second is
979 	the child dentry. len and name string are properties of the dentry
980 	to be compared. qstr is the name to compare it with.
983 	possible, and should not store into the dentry.
984 	Should not dereference pointers outside the dentry without
987 	However, our vfsmount is pinned, and RCU held, so the dentries and
994   d_delete: called when the last reference to a dentry is dropped and the
996 	immediately, or 0 to cache the dentry. Default is NULL which means to
1003 	being deallocated). The default when this is NULL is that the
1007   d_dname: called when the pathname of a dentry should be generated.
1010 	it's done only when the path is needed.). Real filesystems probably
1013 	held, d_dname() should not try to modify the dentry itself, unless
1016 	at the end of the buffer, and returns a pointer to the first char.
1020 	This should create a new VFS mount record and return the record to the
1021 	caller.  The caller is supplied with a path parameter giving the
1022 	automount directory to describe the automount target and the parent
1024 	be returned if someone else managed to make the automount first.  If
1025 	the vfsmount creation failed, then an error code should be returned.
1026 	If -EISDIR is returned, then the directory will be treated as an
1029 	If a vfsmount is returned, the caller will attempt to mount it on the
1030 	mountpoint and will remove the vfsmount from its expiration list in
1031 	the case of failure.  The vfsmount should be returned with 2 refs on
1032 	it to prevent automatic expiration - the caller will clean up the
1035 	This function is only used if DCACHE_NEED_AUTOMOUNT is set on the
1036 	dentry.  This is set by __d_instantiate() if S_AUTOMOUNT is set on the
1039   d_manage: called to allow the filesystem to manage the transition from a
1041 	waiting to explore behind a 'mountpoint' whilst letting the daemon go
1042 	past and construct the subtree there.  0 should be returned to let the
1045 	mounted on it and not to check the automount flag.  Any other error
1048 	If the 'rcu_walk' parameter is true, then the caller is doing a
1050 	and the caller can be asked to leave it and call again by returning
1054 	This function is only used if DCACHE_MANAGE_TRANSIT is set on the
1077 	the usage count)
1079   dput: close a handle for a dentry (decrements the usage count). If
1080 	the usage count drops to 0, and the dentry is still in its
1081 	parent's hash, the "d_delete" method is called to check whether
1082 	it should be cached. If it should not be cached, or if the dentry
1087 	subsequent call to dput() will deallocate the dentry if its
1091 	the dentry then the dentry is turned into a negative dentry
1092 	(the d_iput() method is called). If there are other
1098   d_instantiate: add a dentry to the alias hash list for the inode and
1099 	update the "d_inode" member. The "i_count" member in the
1100 	inode structure should be set/incremented. If the inode
1101 	pointer is NULL, the dentry is called a "negative
1106 	It looks up the child of that given name from the dcache
1107 	hash table. If it is found, the reference count is incremented
1108 	and the dentry is returned. The caller must use dput()
1109 	to free the dentry when it finishes using it.
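
The usage-count rules for dget() and dput() described above can be sketched
as a toy userspace model. Everything here is hypothetical: the Dentry class,
its fields and the "state" strings are invented for illustration, and the
real dcache also deals with locking, LRU lists and RCU, none of which is
modelled.

```python
class Dentry:
    def __init__(self, name, d_delete=None):
        self.name = name
        self.count = 1            # the creator holds the first reference
        self.hashed = True        # still present in the parent's hash
        self.d_delete = d_delete  # optional per-filesystem policy hook
        self.state = "in use"


def dget(dentry):
    """Open a new handle: increment the usage count."""
    dentry.count += 1
    return dentry


def dput(dentry):
    """Close a handle: decrement the usage count and decide the dentry's fate."""
    dentry.count -= 1
    if dentry.count > 0:
        return
    # Only a still-hashed dentry is offered to the d_delete hook.
    wants_delete = (
        dentry.hashed
        and dentry.d_delete is not None
        and bool(dentry.d_delete(dentry))
    )
    if not dentry.hashed or wants_delete:
        dentry.state = "freed"
    else:
        dentry.state = "cached"   # kept for later lookups
```

With no d_delete hook the model always caches a hashed dentry, which mirrors
the default described for d_delete earlier in this document.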
1117 On mount and remount the filesystem is passed a string containing a
1132 to show all the currently active options.  The rules are:
1135     from the default
1140 Options used only internally between a mount helper and the kernel
1141 (such as file descriptors), or which only have an effect during the
1142 mounting (such as ones controlling the creation of a journal) are exempt
1143 from the above rules.
1145 The underlying reason for the above rules is to make sure that a
1147 based on the information found in /proc/mounts.
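
To make the link between shown options and /proc/mounts concrete, here is a
rough sketch of how a userspace tool might split one /proc/mounts line back
into its fields. The function names are hypothetical. The format assumed is
the fstab-like one (device, mount point, type, comma-separated options, two
trailing numbers), with whitespace in fields written as octal escapes such
as \040.

```python
import re

_OCTAL = re.compile(r"\\([0-7]{3})")


def _unescape(field):
    # Turn escapes such as \040 back into the characters they stand for.
    return _OCTAL.sub(lambda m: chr(int(m.group(1), 8)), field)


def parse_mounts_line(line):
    """Split one /proc/mounts line into device, mount point, type, options."""
    device, mountpoint, fstype, opts = line.split()[:4]
    options = {}
    for opt in opts.split(","):
        key, sep, value = opt.partition("=")
        # Flag-style options ("rw") map to True, "key=value" ones to the value.
        options[_unescape(key)] = _unescape(value) if sep else True
    return _unescape(device), _unescape(mountpoint), fstype, options
```

A tool that wants to re-create a mount would feed the recovered options back
to mount(8) or mount(2), which only works if the filesystem showed every
option that differs from its defaults.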
1150 them is provided with the save_mount_options() and
1158 (Note some of these resources are not up-to-date with the latest kernel
1167 A tour of the Linux VFS by Michael K. Johnson. 1996
1170 A small trail through the Linux kernel by Andries Brouwer. 2001