I frequently need to copy very large files (20GB+). The most commonly used Python function for copying files is shutil.copyfile, which has a default buffer size of 16384 bytes (16KB). This buffer size is a good setting for many small files. However, when dealing with much larger files, a larger buffer makes a big difference.
Increasing the buffer size had previously given me a 2x improvement in large file copy performance under Java. My initial copy performance under Python using the standard library was much too slow. By increasing the buffer size, I doubled large file copy performance compared with the default. This is a striking improvement for such a simple code change.
I found a 10MB buffer (10485760 bytes) to offer the best performance. Larger buffers gave no further improvement.
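If you want to find the sweet spot on your own hardware, you can time the same copy with several buffer sizes. The following is only a minimal sketch, not the benchmark behind the numbers above. It assumes Python 3 (for time.perf_counter), and time_copy and compare_buffer_sizes are hypothetical helper names. Results will vary with the disk, the filesystem, and OS caching.

```python
import os
import shutil
import tempfile
import time


def time_copy(src, dst, buffer_size):
    """Copy src to dst with the given buffer size; return elapsed seconds."""
    start = time.perf_counter()
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
        shutil.copyfileobj(fsrc, fdst, buffer_size)
    return time.perf_counter() - start


def compare_buffer_sizes(src, sizes):
    """Return a dict mapping each buffer size (bytes) to its copy time."""
    results = {}
    for size in sizes:
        # Copy into a fresh temporary file and remove it afterwards.
        fd, dst = tempfile.mkstemp()
        os.close(fd)
        try:
            results[size] = time_copy(src, dst, size)
        finally:
            os.remove(dst)
    return results
```

For example, compare_buffer_sizes(path, [16 * 1024, 1024 * 1024, 10 * 1024 * 1024]) returns one timing per size. Repeated runs over the same source file may be served from the OS cache, so treat a single run as a rough indication only.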
Here is my copyFile function:
import os
import shutil
import stat

def copyFile(src, dst, buffer_size=10485760, perserveFileDate=True):
    '''
    Copies a file to a new location. Much faster performance than Apache Commons due to use of larger buffer
    @param src: Source File
    @param dst: Destination File (full file path, not a directory)
    @param buffer_size: Buffer size to use during copy
    @param perserveFileDate: Preserve the original file date
    '''
    # Check to make sure destination directory exists. If it doesn't create the directory
    dstParent, dstFileName = os.path.split(dst)
    # dstParent is '' when dst has no directory part; os.makedirs('') would raise
    if dstParent and not os.path.exists(dstParent):
        os.makedirs(dstParent)
    # Optimize the buffer for small files
    buffer_size = min(buffer_size, os.path.getsize(src))
    if buffer_size == 0:
        buffer_size = 1024
    # Note: shutil._samefile is a private helper and may change between Python versions
    if shutil._samefile(src, dst):
        raise shutil.Error("`%s` and `%s` are the same file" % (src, dst))
    for fn in [src, dst]:
        try:
            st = os.stat(fn)
        except OSError:
            # File most likely does not exist
            pass
        else:
            # XXX What about other special files? (sockets, devices...)
            if stat.S_ISFIFO(st.st_mode):
                raise shutil.SpecialFileError("`%s` is a named pipe" % fn)
    # Copy the contents in chunks of buffer_size bytes
    with open(src, 'rb') as fsrc:
        with open(dst, 'wb') as fdst:
            shutil.copyfileobj(fsrc, fdst, buffer_size)
    # Copy permission bits and timestamps from the source file
    if perserveFileDate:
        shutil.copystat(src, dst)
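The buffer size matters because shutil.copyfileobj boils down to a read/write loop in which each pass moves at most buffer_size bytes. With a 16KB buffer, a 20GB file takes over 1.3 million passes; with a 10MB buffer it takes about 2,000. Here is a rough sketch of that loop. The name copy_in_chunks is hypothetical and used only for illustration; this is not the standard library's actual source.

```python
def copy_in_chunks(fsrc, fdst, buffer_size=10 * 1024 * 1024):
    """Copy between two binary file objects; return total bytes written."""
    total = 0
    while True:
        # Read at most buffer_size bytes; an empty result means end of file.
        chunk = fsrc.read(buffer_size)
        if not chunk:
            break
        fdst.write(chunk)
        total += len(chunk)
    return total
```

It works with any pair of binary file objects, for example files opened with open(src, 'rb') and open(dst, 'wb').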
Just to say thanks for that older piece of code -- cuts down processing time by 50% for me and works without modifications! Awesome!
Posted by: Tobi | 04/28/2013 at 06:48 PM
Thank you for the comment. Enjoy!!!
Posted by: Michael | 04/28/2013 at 09:04 PM