I frequently need to walk a directory tree in my Python code. The standard library function os.walk is very capable of doing the walk and provides a feature-rich, Pythonic result set.
However, it isn't very fast, especially for large directory trees. Where I work we have a tree with over 1 million objects. Using os.walk takes over 4 hours to traverse the tree.
Using the native Windows dir command (a cmd.exe built-in) takes about 1 hour, which is a substantial improvement in speed.
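For comparison, here is a minimal sketch of the portable os.walk baseline that the dir approach is measured against. The names walk_file_set and timed are made up for illustration and are not part of the code below. Timings will vary widely with the size of the tree and the storage behind it.

```python
import os
import time


def walk_file_set(root):
    """Return a set of full paths for every file under root, using os.walk."""
    paths = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.add(os.path.join(dirpath, name))
    return paths


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.time()
    result = func(*args)
    return result, time.time() - start


if __name__ == "__main__":
    # Walk the current directory and report how long it took
    files, seconds = timed(walk_file_set, os.getcwd())
    print("%d files in %.2f s" % (len(files), seconds))
```

Running the same timing helper around the dir-based function gives a like-for-like comparison on your own tree.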
Here is my very simple code to get the file list. The public function returns a set, which is better than a list here since all the elements are unique. A set can be searched in O(1) time on average, a list in O(n).
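A small sketch of the set-versus-list point, using made-up paths (the C:\data names are purely hypothetical). Both containers give the same answer to a membership test. The difference is that the list is scanned element by element, while the set does a hash lookup.

```python
# Hypothetical paths, generated only for this illustration
paths_list = ["C:\\data\\file%d.txt" % i for i in range(100000)]
paths_set = set(paths_list)

# Worst case for the list: the target is the last element
target = "C:\\data\\file99999.txt"

in_list = target in paths_list   # linear scan, O(n)
in_set = target in paths_set     # hash lookup, O(1) on average

print(in_list, in_set)
```

With a million entries and many lookups, the linear scans add up quickly, which is why the function hands back a set.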
Obviously this code is not portable and will only work on Windows. PyWin32 is required for non-recursive searches that need to exclude hidden files.
import os
import subprocess


def __getFileListViaDir(searchPattern, customCmdArgs="", excludeDirNames=True, excludeHidden=True):
    '''
    Get a list of files using the dir command. The list is built recursively from all subdirectories.
    The recursive search (/s) is required in order to get the full path in the output.
    If you are looking for a non-recursive search, use getFileList instead.
    Only works on Windows.
    @param searchPattern: Directory or file pattern passed to dir
    @param customCmdArgs: Extra dir switches appended to the command
    @param excludeDirNames: Flag whether or not to leave directory names out of the results
    @param excludeHidden: Flag whether or not to leave hidden files out of the results
    @return: Returns a list of full paths; empty if dir reports an error
    '''
    fileList = []
    cmdArgs = ' /b/s'
    if excludeDirNames:
        cmdArgs += '/A-D'
    if excludeDirNames and excludeHidden:
        cmdArgs = ' /b/s/A-D-H'

    # A plain "dir /b/s" skips hidden files, so to include them first get a
    # listing of just the hidden files, then append the non-hidden entries to it
    if not excludeDirNames and not excludeHidden:
        fileList = __getFileListViaDir(searchPattern, "/AH", excludeDirNames=True, excludeHidden=False)

    # Quote the pattern so paths containing spaces survive the command line
    command = 'cmd.exe /c dir "' + searchPattern + '" ' + cmdArgs + customCmdArgs
    # universal_newlines=True makes stdout yield text rather than bytes
    p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True)

    # If dir produces no output, this loop simply does not run
    for line in p.stdout.readlines():
        line = line.strip()
        fileList.append(line)

    # Only keep the results if dir exited with return code 0
    if p.wait() == 0:
        return fileList
    # On failure return an empty list so callers can always iterate the result
    return []

def getFileList(directory, recursive=False, includeDirNames=False, excludeHidden=True):
    '''
    Get a list of files from a directory. The list includes the full file path.
    Only works on Windows.
    @param directory: Directory to list
    @param recursive: Flag whether or not to list subdirectories
    @param includeDirNames: Flag whether or not to include directory names or just file names
    @param excludeHidden: Flag whether or not to leave hidden files out of the results
    @return: Returns a set including the full path for all files. A set is better than a list here since all entries
             are guaranteed to be unique. Membership tests take O(1) time on average instead of O(n) for a list.
    '''
    fileList = []
    if recursive:
        # "or []" guards against the helper returning nothing when dir fails
        fileList = __getFileListViaDir(directory, excludeDirNames=not includeDirNames, excludeHidden=excludeHidden) or []
    else:
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            # Skip directories unless the caller asked for them
            if not includeDirNames and os.path.isdir(path):
                continue
            fileList.append(path)

        if excludeHidden:
            # PyWin32 is only needed for this non-recursive hidden-file check
            import win32api
            import win32con

            # Iterate over a copy because entries are removed from fileList
            for path in fileList[:]:
                if not os.path.exists(path):
                    fileList.remove(path)
                else:
                    fattrs = win32api.GetFileAttributes(path)
                    if fattrs & win32con.FILE_ATTRIBUTE_HIDDEN:
                        fileList.remove(path)
    return set(fileList)
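
The flag handling in the recursive helper is easier to follow in isolation. The sketch below copies only the switch selection into a small function, named dir_switches here purely for illustration, and prints the result for each flag combination. It does not run dir, so it works on any platform.

```python
def dir_switches(excludeDirNames=True, excludeHidden=True):
    """Return the dir switches the recursive helper selects for a flag pair."""
    switches = '/b/s'              # /b: bare paths only, /s: recurse into subdirectories
    if excludeDirNames:
        switches += '/A-D'         # /A-D: leave out directory entries
    if excludeDirNames and excludeHidden:
        switches = '/b/s/A-D-H'    # also leave out hidden files
    return switches


# Show the switches chosen for every combination of the two flags
for exclude_dirs in (True, False):
    for exclude_hidden in (True, False):
        print(exclude_dirs, exclude_hidden, dir_switches(exclude_dirs, exclude_hidden))
```

Note that getFileList passes its directory argument straight through as searchPattern, so the directory path is what ends up after "dir" on the command line.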
Do I put the directory name in the "searchPattern" field?
Posted by: Roland | 05/21/2013 at 07:15 AM