Code by Kevin
   


About
Code by Kevin: Programming, code, business, and other pursuits

Your Host
Kevin Walzer, software developer.

Home

Subscribe to RSS Feed
Get a syndicated feed of my weblog.

Archives
2011
2010
2009
2008
2007
2006

Categories
Business
Software
General

        home
Sat, 20 Mar 2010

Which file search alogrithm?

I'm working on an update to NameFind, and I'm revisiting one of its central design features: the algorithm it uses to search the file system for files with specific names.

From its inception, NameFind has been designed to use the Unix "find" command-line tool. The find tool is very simple in its approach: it recursively scans or crawls the file system by directory, looking for file names that match the search parameters passed to it. It is very accurate, it's real-time (meaning there's no out-of-date information returned by the search), and it does exactly what it's designed to do: find files by name. The drawback to find is that while it's very fast for small searches (of a single directory or directory tree), it is quite slow for full file-system searches. It's not unusual for a full file-system search for a file name to take 15-20 minutes, and even to cause NameFind to appear to lock up while it's waiting for the search to complete.

User complaints about NameFind have tended to focus on its speed and responsiveness, and so I've tried to find ways to speed up the search process, or at least make it feel more responsive. This is leading me to look at other search algorithms as well. Unfortunately, though, these other algorithms have drawbacks of their own.

By far the fastest tool for searching the file system is the "locate" command. Locate works similarly to find, searching for file names that match a particular search term. It does so by scanning an index rather than recursively trawling the file system: as a result, it is much, much faster. However, the database that locate uses is updated, by default, only once a week; the obvious danger here is that its output may be outdated. It may list things that have been deleted and miss things that have been added since the last time the database was updated. As a result, while its speed is a huge plus, its accuracy is not. And while it is possible to update the locate database manually, it is a lengthy process, and doing so would detract from the user experience I'm trying to offer with NameFind.

A third search option is "mdfind," which the Mac's command-line interface to the Spotlight search system. I originally wrote NameFind as an alternative to Spotlight because I did not find Spotlight's features especially useful for searching for specific files in specific directories. mdfind does have command-line switches for searching in specific directories for specific file names, and its output is faster than "find" because its search is based on a database search, not a real-time directory crawl; unlike locate, the Spotlight database is continually updated in real time. However, mdfind does have problems with accuracy, or at least relevancy. Spotlight, and therefore mdfind, indexes based on file content as well as file name and other data; it may return hits based on the search term being contained within the file's content, instead of its name. Spotlight also does not index certain kinds of data by default (system/hidden files, application bundles, etc.). Finally, while the built-in version of Spotlight translates its raw search output into user-friendly titles and icons, the command-line mdfind tool returns raw search data (instead of an e-mail message, it returns something like "/Users/kevin/Library/Mail/POP-kevin@11.1.1.1111/Deleted Messages.mbox/Messages/718453.emlx"). Looking more closely at mdfind, I'm reminded of why I didn't use it from the start.

Looking at all of these alternatives, none of them is perfect. But, despite its speed drawbacks, the find tool is best suited for my design goal: to find files by name. NameFind is especially useful and fast if I'm trying to search a single directory or two with a lot of files in it; for shallow directory searches, it's better suited than any of the alternatives. It's not as good for a full-system search; while I can probably tune up its responsiveness a bit so it doesn't feel so sluggish, I can't speed up the actual search process itself. Nonetheless, I am going to retain "find" as the engine that powers NameFind.

[/general] permanent link