Improving Samba write performance on Linux
Samba performance is good in most circumstances, but modern Linux distributions have improved file systems since Samba was first developed. In particular, they have a feature that Samba does not take advantage of by default. In recent work I found that making some simple changes to Samba could significantly improve the Samba write performance on Linux from a modern Windows client (Windows 7). This document will explain how to take advantage of this to increase the speed of writes on your file server, and how to measure the changes to ensure this has the desired effect.
Linux and Windows File systems
Linux file systems, like the UNIX file systems before them, are designed around the notion of "sparse" files. On such file systems, if you create a file and then write one byte at position 500MB from the start of the file, the underlying file system will allocate only a single block to store the one byte that was written. Even though the size of the file will be reported as 500MB+1 bytes, the actual space used on the disk will be a single block. The block size of a file system is fixed when the file system is first created. For modern disk sizes (1TB or more), the larger the block size the better. For the two file systems I'm discussing here, ext4 and XFS, the standard block size is 4KB for both. Both XFS and ext4 on Linux do support larger block sizes if the page size of the running kernel is larger than 4KB, but for most distributions the page size and maximum block size is 4KB.
Such a file is called a "sparse" file, and has the great advantage that disk space can be over committed. The sparse ranges of the file are simply replaced with zero bytes when read, and only committed onto disk when an application actually does a write into that range.
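The effect is easy to observe from user space. The following Python sketch is only an illustration: the helper name make_sparse_file and the temporary file are my own choices, and the numbers printed depend on the file system actually supporting sparse files. It writes one byte at a large offset and compares the reported file size with the space really allocated.

```python
import os
import tempfile

def make_sparse_file(path, offset):
    """Write a single byte at `offset`; return (apparent_size, allocated_bytes)."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    try:
        os.pwrite(fd, b"x", offset)   # one byte, far past the start of the file
        os.fsync(fd)                  # let the file system settle its allocation
        st = os.fstat(fd)
    finally:
        os.close(fd)
    # st_blocks is counted in 512-byte units on Linux
    return st.st_size, st.st_blocks * 512

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        size, allocated = make_sparse_file(os.path.join(d, "sparse.bin"),
                                           500 * 1024 * 1024)
        print("apparent size:", size, "bytes; allocated on disk:", allocated, "bytes")
```

On a file system with sparse file support the allocated figure should be a tiny fraction of the apparent size.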
The Windows NTFS file system, although able to support "sparse" files, is a more traditional file system in that writing one byte at position 500MB will force the file system to immediately allocate the intermediate blocks. This can take some time, and so SMB and SMB2 network traffic uses the strategy described below to avoid request timeouts.
SMB/SMB2 write activities
When a Windows client application sends a request to write one byte at position 500MB on a newly opened (empty) file, the SMB/SMB2 client redirector has to ensure that 500MB+1 bytes are really allocated on the target system. However, sending a simple SMBwriteX/SMB2_WRITE request with an offset of 500MB could easily cause a client timeout. An assumption built into the SMB/SMB2 protocol is that the target file system behaves like a Windows server, so the NTFS driver on the server would have to allocate 500MB worth of file system blocks in order to complete this request, which may take longer than the 30 seconds usually allowed for an SMB request.
What the Windows client redirector does in this case is to send a sequence of 1 byte requests, to cover the extension needed on the open file. In the reply to an NTCreateX/SMB2_CREATE call, the SMB server returns a value called the "allocation size", which is equivalent to the file system block size on a UNIX/Linux style file system. The "allocation size" is more flexible than the underlying file system block size, as (at least for Samba) it can be specified on a per-share basis.
This allocation size is used by the client redirector to specify the space between each 1 byte write call used to pre-allocate the empty space when a file is being extended. For example, if the allocation size is set to 1MB (the default in Samba) then when extending a file by 500MB the client redirector will issue 500 intermediate 1-byte SMBwriteX/SMB2_WRITE requests before issuing the real write request (at 500MB+1 byte) to complete the application write request.
Each of these one byte writes is unlikely to time out, thus allowing SMB/SMB2 to deal with writes that extend a file to an arbitrary size (within the limits of the file system and the protocol) without having to worry about network time outs.
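The arithmetic behind this can be sketched in a few lines of Python. This is a simplified model, not the redirector's actual algorithm: the function name is hypothetical, and placing each one byte write on the last byte of an allocation unit is my assumption, as the exact offsets are not stated above. Only the count of writes is taken from the description.

```python
def extension_write_offsets(current_size, write_offset, allocation_size):
    """Offsets of the 1-byte "extension" writes sent before a write at write_offset.

    Simplified model: one byte is written at the end of every allocation
    unit that lies between the current end of file and the target offset.
    """
    if allocation_size <= 0:
        raise ValueError("allocation_size must be positive")
    offsets = []
    # First allocation-unit boundary beyond the current end of file.
    boundary = (current_size // allocation_size + 1) * allocation_size
    while boundary <= write_offset:
        offsets.append(boundary - 1)   # last byte of that allocation unit
        boundary += allocation_size
    return offsets

MB = 1024 * 1024
offsets = extension_write_offsets(0, 500 * MB, 1 * MB)
print(len(offsets))   # 500 extension writes before the real write
```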
Making writes efficient on Linux
By default, when Samba receives these 1 byte "extension" write requests, it simply does a normal one-byte "sparse" write at the required position in the file. This is very fast, but only causes one file system block (the block "dirtied" by the one byte write) to be allocated. When the real data is finally written into the file, the blocks then have to be allocated for real on the file system. Because these blocks are not then allocated "in order" on the file system, as it were, these actual writes can be quite slow.
The most efficient way to allocate file system blocks when data is to be written into all of the file (for example, a streaming video write) is to allocate what is called an "extent" on the file system. The requested blocks are then laid out by the underlying file system (ext4 or XFS) in a very efficient way which causes the actual writes to be much faster than having to allocate them dynamically.
How does a Linux application (Samba) get access to this new extent-based allocation call ? Simple, it's built into glibc on modern Linux distributions via the posix_fallocate() call. In the new patch that has gone into Samba 3.5.7 (and also all future versions of Samba), when the smb.conf parameter:
"strict allocate = yes"
is set on a share, whenever a file is extended by an SMBwriteX/SMB2_WRITE call, Samba calls posix_fallocate() to ensure the file extends to at least the size given as the offset in the SMBwriteX/SMB2_WRITE request, then does the actual write.
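The sequence of calls can be illustrated from Python, which exposes the same glibc function as os.posix_fallocate() on Linux. This is a sketch of the idea only, not the Samba code (which is written in C); the function name extending_write is hypothetical.

```python
import os
import tempfile

def extending_write(fd, data, offset):
    """Write `data` at `offset`, preallocating first if the write extends the file."""
    current_size = os.fstat(fd).st_size
    if offset > current_size:
        # Reserve real blocks for the gap between the old end of file and
        # the write offset, instead of leaving a sparse hole.
        os.posix_fallocate(fd, current_size, offset - current_size)
    return os.pwrite(fd, data, offset)

if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        extending_write(f.fileno(), b"x", 16 * 1024 * 1024)
        st = os.fstat(f.fileno())
        print("size:", st.st_size, "allocated:", st.st_blocks * 512)
```

Unlike a plain sparse write, the allocated figure printed here should be close to the file size.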
In tests done on an ext4 file system, changing to "strict allocate = yes" and using the posix_fallocate() call in this way increased the write performance of Samba by 2/3 on a NETGEAR ReadyNAS box as tested by the Intel NASPT test tool, available here:
The specific tests used to measure the performance increase were the "File Copy to NAS" and the "HD Video Record" tests.
How do I get the patch ?
This patch has been added by default into 3.5.7 and all versions of Samba subsequent to this. Back ports of this fix are available here:
For Samba 3.5.0 - 3.5.6:
For Samba 3.4.x:
For Samba 3.3.x:
For Samba 3.2.x:
Note that this must be run on a file system that supports extents, with a kernel modern enough to support the posix_fallocate() call working directly on the underlying file system (2.6.23 or later), and a glibc that supports it (glibc 2.7 should be modern enough). Also note that "strict allocate = yes" *must* be set on the exported share.
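A quick way to check the kernel and glibc versions is sketched below in Python. The minimum versions are the ones quoted above; the helper names are my own, and the check says nothing about whether the file system itself supports extents.

```python
import platform
import re

MIN_KERNEL = (2, 6, 23)   # minimum kernel version quoted in the text
MIN_GLIBC = (2, 7)        # minimum glibc version quoted in the text

def parse_version(text):
    """Turn '5.15.0-91-generic' or '2.35' into a tuple of leading integers."""
    head = re.split(r"[^0-9.]", text, maxsplit=1)[0]
    return tuple(int(part) for part in head.split(".") if part)

def meets_requirements(kernel_release, glibc_version):
    return (parse_version(kernel_release) >= MIN_KERNEL
            and parse_version(glibc_version) >= MIN_GLIBC)

if __name__ == "__main__":
    libc_name, libc_version = platform.libc_ver()
    print("kernel:", platform.release(), "libc:", libc_name, libc_version)
    if libc_name == "glibc":
        print("meets requirements:",
              meets_requirements(platform.release(), libc_version))
```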
As Linux is our primary deployment platform and most Linux distributions are now using ext4, I'm proposing to change the default of the "strict allocate" smb.conf parameter from "no" to "yes" for the 3.6.0 release.
What if I'm using a file system that doesn't support extent allocation ?
Never fear. You can safely leave "strict allocate = yes" set. Samba will call posix_fallocate(), and glibc has fall back code within it which emulates the extent-based allocation using a technique very similar to the one the Windows redirector uses. glibc first calls the kernel fallocate system call, and if that fails with ENOSYS (not supported) glibc will call statvfs() to find out the block size of the underlying file system, and then write one byte per f_bsize bytes. This is as efficient as can be done without an extent-based allocation system call.
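The emulation technique can be sketched in Python as follows. This is an illustration of the one-byte-per-block idea only, not a transcription of the glibc code, which is written in C and differs in detail; the function name is hypothetical, and the sketch is only safe for a range beyond the current end of the file because it writes zero bytes.

```python
import os

def emulate_fallocate(fd, offset, length):
    """Touch one byte in every file system block of [offset, offset + length).

    Only safe for a range past the current end of file: it writes zero
    bytes and would overwrite existing data.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    block_size = os.fstatvfs(fd).f_bsize   # block size of the underlying file system
    end = offset + length
    pos = offset
    while pos < end:
        os.pwrite(fd, b"\0", pos)          # dirties, and so allocates, one block
        pos += block_size
    # Make sure the file reaches the full requested size.
    os.pwrite(fd, b"\0", end - 1)
```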
Even if the glibc on the system is old enough to not have this call, Samba will detect this and has fall back code built in to manually allocate space in writes of 32K bytes to extend a file to the required size. This is the slowest fall back option, but it is equivalent to the pre-3.5.7 code in Samba, so it's still pretty fast.
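The order of the fall backs can be modelled like this. Again this is a Python sketch under my own naming (fill_sparse and the returned labels are hypothetical), not the Samba implementation: it tries posix_fallocate() first and, if the call is missing or fails, fills the gap with 32K writes of zeros.

```python
import os

CHUNK = 32 * 1024   # size of each manual write, as described in the text

def fill_sparse(fd, new_size):
    """Extend the file behind `fd` to `new_size` bytes backed by real blocks."""
    current_size = os.fstat(fd).st_size
    if new_size <= current_size:
        return "nothing to do"
    try:
        os.posix_fallocate(fd, current_size, new_size - current_size)
        return "posix_fallocate"
    except (AttributeError, OSError):
        pass   # call unavailable or failed: fall back to slow manual allocation
    zeros = bytes(CHUNK)
    pos = current_size
    while pos < new_size:
        n = min(CHUNK, new_size - pos)
        pos += os.pwrite(fd, zeros[:n], pos)
    return "manual 32K writes"
```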
How do I know this is working ?
Samba will run faster :-). A *lot* faster. Note this will be seen mostly on loads heavily dependent on write performance, such as file copies or streaming writes.
For users, set debug level 10 on smbd and look for messages of the form:
vfs_fill_sparse: sys_posix_fallocate failed with error XXXX. Falling back to slow manual allocation
(where XXXX will most commonly be 38, which corresponds to ENOSYS on Linux). This will not be printed in the case where glibc emulation via statvfs() is being used to do the allocation. As this is being done inside glibc there is no way for Samba to know whether the fast (system call) fallocate or the slower statvfs() code is being used: the posix_fallocate() call succeeds from Samba's point of view in both cases. If you suspect statvfs() emulation is being used, you'll need to investigate via the "strace" method for developers described below.
For developers the way to confirm this is to examine an strace output when applied to an smbd process. Use:
strace -p <smb pid> >&/tmp/log
If the "fast" fallocate method is being used, you will see fallocate() system calls being listed in the log. If glibc statvfs emulation is being used, you will see a statvfs() call followed by a series of one byte pwrite() system calls.
Finally, if the Samba fall back code is being used you will see a sequence of pwrite() system calls each writing 32K bytes, with no statvfs() call before them.
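Scanning a long strace log by eye is tedious, so the three cases can be told apart with a small script. The Python sketch below is a rough heuristic of my own, not a Samba tool: the function name and the returned labels are hypothetical, and it assumes that every logged line ends with "= <return value>" and that the system call names printed by strace may be variants of the libc names (for example a pwrite or statfs variant), which is why it matches on substrings. Ordinary client writes of one byte or 32K would confuse it.

```python
import re

def classify_allocation(strace_text):
    """Guess which allocation path an strace log of smbd shows."""
    saw_fallocate = saw_statfs = False
    one_byte_writes = chunk_writes = 0
    for line in strace_text.splitlines():
        call = re.search(r"(\w+)\(", line)
        if not call:
            continue
        name = call.group(1)
        # The system call return value follows the last "=" on the line.
        tail = line.rsplit("=", 1)[-1].split()
        ret = int(tail[0]) if tail and re.fullmatch(r"-?\d+", tail[0]) else None
        if name == "fallocate" and ret == 0:
            saw_fallocate = True
        elif "statfs" in name or "statvfs" in name:
            saw_statfs = True
        elif name.startswith("pwrite"):
            if ret == 1:
                one_byte_writes += 1
            elif ret == 32 * 1024:
                chunk_writes += 1
    if saw_fallocate:
        return "fast fallocate() system call"
    if saw_statfs and one_byte_writes:
        return "glibc emulation (one byte writes)"
    if chunk_writes:
        return "Samba fall back (32K writes)"
    return "undetermined"
```

To use it, read the log written by the strace command above and pass its contents to classify_allocation().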
But I want more..
As the Windows client issues one byte writes to extend a file every "allocation size" bytes, we can cheat by changing the allocation size we return on a per-share basis. For example, if you're mostly writing large video files onto a share, you can change the allocation size reported to the Windows client to something like 100MB by changing the smb.conf parameter to, for example:
"allocation roundup size = 104857600"
from the default 1MB size. This can gain a few percent extra performance but may cause applications that use the allocation size to behave oddly, or even fail, as Windows never uses a size this large. As always, be careful and test your workload.
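The trade-off is simple arithmetic, sketched here in Python. The helper names are hypothetical, and the write count is only the approximation used in this document (one extension write per allocation size).

```python
MB = 1024 * 1024

def roundup_setting(megabytes):
    """Value in bytes to give the "allocation roundup size" parameter."""
    return megabytes * MB

def extension_writes(extend_by, allocation_size):
    """Approximate number of 1-byte extension writes for a file extension."""
    return extend_by // allocation_size

print(roundup_setting(100))                  # 104857600
print(extension_writes(500 * MB, 1 * MB))    # 500
print(extension_writes(500 * MB, 100 * MB))  # 5
```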
However the underlying change to "strict allocate" is safe, and with the patch (or new Samba version) will safely improve the write speed of your servers.
Jeremy Allison, Samba Team. email@example.com 6th Dec. 2010.
Thanks to Justin Maggard at NETGEAR for helping with this work, and for proof-reading this paper.