dirdat - scans a directory tree into a relational database

First of all, this doesn't load into any specific database; it's generic. It makes a pair of tab-delimited files that you have to load yourself. I've used it with PostgreSQL and Microsoft Access. With PostgreSQL use the COPY command (its default text format is already tab-delimited); with Access use the text import wizard, and make yourself a template if you plan to do it often.

I wrote this sometime around 1998. There's also a Borland Delphi version for use under Windows; I think the C/Unix version came first, but I'm not sure anymore. It was originally written under FreeBSD, then I adapted it to OpenBSD, where it's been since about 2000.

Why would you want to do this? I originally wrote it to keep track of files burned to CDs, back in the days of my first CD burner. I've used the Windows version in a production environment as a network administrator for a variety of things. Once you get the file data into a database you can write queries for almost anything, like pointing fingers at who's got the most MP3 files in their personal directory, or checking whether everything copied over to the new RAID stack OK. What percentage of space on the server is being used for AVI or MP3 files? You can query that, and feed it into Crystal Reports if you really want to get fancy with a transparency for some meeting. Pie charts of space used by file type: endless possibilities. I've also written major batch files for rearranging files, working from these databases.

This can make CRC32 signatures of files, interchangeable with the ones Pkzip makes. More recently we have MD5, then SHA, SHA-256, and so on, but CRC32 serves the purpose, and each one is only 8 hex characters (4 bytes). They won't uniquely identify files, but the combination of CRC32 and file size is pretty good. It runs out of steam when the file size gets to about 2 gigs, but for everyday use on non-video files it works pretty well. The goal is to know whether a file has been changed (due to user or hardware errors); this isn't some cryptographic signature. Zip (and thus jar and epub) files still use them.
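For anyone who wants to check one of these signatures independently, here is a minimal sketch in Python. It is not dirdat's own code: it leans on the standard zlib module, whose crc32 is the same CRC-32 that zip files use, and the function name file_crc32 and the chunk_size parameter are made up for illustration.

```python
import zlib


def file_crc32(path, chunk_size=1 << 16):
    """Return the CRC32 of a file as 8 uppercase hex digits."""
    crc = 0
    with open(path, "rb") as f:
        # Read in chunks so a large file never has to fit in memory.
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Passing the running value back in continues the same checksum.
            crc = zlib.crc32(chunk, crc)
    # Mask to 32 bits and format as 8 hex digits, e.g. 6EC4D52F.
    return format(crc & 0xFFFFFFFF, "08X")
```

Because the file is read in chunks and the checksum is carried forward, this sketch has no particular file size limit; it is still every bit as slow as reading the whole file.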

Why a relational database? A relational database has "joins" between fields in one table and related fields in another table. Here every directory is numbered as it's found, and the number and the name are stored. Data on each file is stored in another table, along with the number of the directory where the file lives.

So the table of directories might look like:

48      /usr/ppc_650/dvd1/990/2013/2013-10-21
49      /usr/ppc_650/dvd1/990/2013/2013-10-22
50      /usr/ppc_650/dvd1/990/2013/2013-10-23
51      /usr/ppc_650/dvd1/990/2013/2013-10-24

And the table of files might look like:

49      482318  dscn2633.jpg    10/22/2013 15:17:16     6EC4D52F
49      478744  dscn2634.jpg    10/22/2013 15:17:22     D6C13622
49      2895    info.txt        10/22/2013 15:17:22     386036BA
49      63788   thumbs.jpg      12/10/2013 16:06:48     D3B9D4FE
50      659750  dscn2635.jpg    10/23/2013 15:51:22     7CFF26B5
50      639990  dscn2636.jpg    10/23/2013 15:51:34     4B0B9BBC

The database would join the directory number (the first field in each table) between the two tables, if not permanently then whenever it needed to in a query. It's mostly a space-saving measure, but it also separates the file information from the directory information. You don't need the full path stored with every file when there are hundreds of files in the same place. Multiply this by thousands of directories and you begin to get the picture.
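To make the join concrete, here is a rough sketch that loads the two tab-delimited files into an in-memory SQLite database and joins them on the directory number. SQLite is only a stand-in for whatever database you actually use. The table and column names (dirs, files, dir_id, size, name, stamp, crc32) are invented for this example, and it assumes the five-column file layout shown in the sample listing, with the CRC32 column present.

```python
import csv
import sqlite3


def read_tab(path):
    """Return the non-blank rows of a tab-delimited file as lists of strings."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]


def load_catalog(dirs_path, files_path):
    """Load both files into an in-memory SQLite database (hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE dirs (dir_id INTEGER PRIMARY KEY, path TEXT)")
    con.execute(
        "CREATE TABLE files (dir_id INTEGER, size INTEGER, name TEXT,"
        " stamp TEXT, crc32 TEXT)"
    )
    con.executemany(
        "INSERT INTO dirs VALUES (?, ?)",
        [(int(n), p) for n, p in read_tab(dirs_path)],
    )
    # Assumes exactly five columns per row, i.e. CRC32 was requested.
    con.executemany(
        "INSERT INTO files VALUES (?, ?, ?, ?, ?)",
        [(int(d), int(s), name, stamp, crc)
         for d, s, name, stamp, crc in read_tab(files_path)],
    )
    return con


def full_paths(con):
    """Rebuild each file's full path by joining on the directory number."""
    sql = """SELECT d.path || '/' || f.name, f.size
             FROM files AS f JOIN dirs AS d ON d.dir_id = f.dir_id
             ORDER BY d.dir_id, f.name"""
    return con.execute(sql).fetchall()


def bytes_per_dir(con):
    """Total bytes in each directory that holds at least one file."""
    sql = """SELECT d.path, SUM(f.size)
             FROM files AS f JOIN dirs AS d ON d.dir_id = f.dir_id
             GROUP BY d.dir_id, d.path
             ORDER BY d.dir_id"""
    return con.execute(sql).fetchall()
```

With the sample rows above, bytes_per_dir reports 1027745 bytes for the 2013-10-22 directory and 1299740 for 2013-10-23. Directories 48 and 51 don't show up because no files in the sample point at them.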

A program run looks like this (strictly command line, no GUI):

d530# ddat
Dir file name? dvd1_dirs.tab
File file name? dvd1_files.tab
Starting directory (. is OK): /usr/ppc_650/dvd1
Include CRC32 (y/n)?y
currently in /usr/ppc_650/dvd1
currently in /usr/ppc_650/dvd1/990
currently in /usr/ppc_650/dvd1/990/2010
currently in /usr/ppc_650/dvd1/990/2010/2010-12-25
currently in /usr/ppc_650/dvd1/990/2011
currently in /usr/ppc_650/dvd1/990/2011/2011-04-10
currently in /usr/ppc_650/dvd1/990/2011/2011-04-16
currently in /usr/ppc_650/dvd1/990/2011/2011-06-09
currently in /usr/ppc_650/dvd1/990/2011/2011-09-12
currently in /usr/ppc_650/dvd1/990/2011/2011-09-13
currently in /usr/ppc_650/dvd1/990/2011/2011-09-19
currently in /usr/ppc_650/dvd1/990/2011/2011-10-28
currently in /usr/ppc_650/dvd1/990/2011/2011-10-31
currently in /usr/ppc_650/dvd1/990/2012
currently in /usr/ppc_650/dvd1/990/2012/2012-04-05
currently in /usr/ppc_650/dvd1/990/2012/2012-08-11
currently in /usr/ppc_650/dvd1/990/2012/2012-08-25
currently in /usr/ppc_650/dvd1/990/2012/2012-09-16
currently in /usr/ppc_650/dvd1/990/2012/2012-10-18
currently in /usr/ppc_650/dvd1/990/2012/2012-11-11

If you need speed more than you need the CRC32s, enter n at that prompt. Making CRCs requires reading every byte of every file and can be slow, especially on CD/DVD drives. The output files end up in the current directory with the names you specified. There's a subtle difference between reading file information on a CD/DVD and reading it on a hard drive, but this should work with either. You can make catalog files before you burn to CD/DVD or after. You can also catalog files on commercially made CD/DVDs, which is useful for finding something like dll or font files later.

If you try to load these files into Access, you might have a problem with the fact that they have unix line ends (assuming you ran it under unix). The most sure-fire method I've found for fixing that is to FTP them from unix to Windows as ASCII (text) files, not binary. There are dedicated conversion programs, mostly free, that run on both platforms, but you probably already have an FTP client and you'll need one anyway. And don't be afraid to look at the files in Notetab, Notepad, or similar: they're just text. You can even open them, select all and copy, then paste into Excel. The tabs will make the data drop right into columns in Excel, and then you can save it as an Excel file. It wouldn't be that hard to have it make csv files; I just prefer tabbed.
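If you'd rather not go through FTP, both conversions are a few lines of Python. This is only a sketch, not part of dirdat; the function names unix_to_dos and tab_to_csv are invented here, and the second one assumes the file names are in your system's default text encoding.

```python
import csv


def unix_to_dos(src_path, dst_path):
    """Copy a text file, ending every line with CRLF instead of a bare LF."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            # Strip whatever line end is there so existing CRLFs aren't doubled.
            dst.write(line.rstrip(b"\r\n") + b"\r\n")


def tab_to_csv(src_path, dst_path):
    """Rewrite a tab-delimited file as comma-separated values."""
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        # The default csv dialect quotes any field containing a comma and
        # ends lines with CRLF, which suits Access and Excel.
        csv.writer(dst).writerows(reader)
```

The quoting is the reason csv is fussier than tabs: file names often contain commas but almost never contain tabs.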

This will bomb on circular links/symlinks like adir -> . because it goes into adir, finds another adir inside, and keeps recursing until the stack overflows and it dumps core. On the other hand, I have used it on a 500 gig drive with 1,875,152 files in 140,121 directories.
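One common way around the circular link problem, sketched here in Python rather than taken from dirdat itself, is to remember the device and inode number of every directory already entered and skip any directory seen before. The sketch also uses an explicit stack instead of recursion, so deep trees can't overflow the call stack. The name walk_numbered is made up, numbering from 1 is an assumption, and it relies on a Unix-like system where device and inode numbers are meaningful.

```python
import os


def is_dir_safe(entry):
    """True if the entry is a directory (following symlinks), False on error."""
    try:
        return entry.is_dir()
    except OSError:
        return False


def walk_numbered(start):
    """Yield (number, path) for each directory under start, as it is found."""
    seen = set()        # (device, inode) of every directory already entered
    counter = 0
    stack = [start]
    while stack:
        path = stack.pop()
        try:
            st = os.stat(path)      # follows symlinks, so a link matches its target
        except OSError:
            continue                # dangling link or permission problem
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue                # been here already: adir -> . stops here
        seen.add(key)
        counter += 1
        yield counter, path
        try:
            entries = sorted(os.scandir(path), key=lambda e: e.name)
        except OSError:
            continue                # unreadable directory
        # Push in reverse so directories come off the stack in name order.
        for entry in reversed(entries):
            if is_dir_safe(entry):
                stack.append(entry.path)
```

One side effect of this approach: if a symlink to a directory happens to be reached before the directory itself, the directory gets recorded under the symlink's path and the real path is skipped.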

The tarball

Alan Corey, ab1jx 4/4/2014
