George V. Reilly

Implementing the Tree command in Rust, part 1: Walking Directories

tree tree core/src/num for Rust

I’ve been learning Rust lately. I started by reading several books, including Rust in Action, Code Like a Pro in Rust, and most of Pro­gram­ming Rust. Now, I’m starting to actually write code.

I read the Command-Line Rust book last month, which challenged readers to write our own im­ple­men­ta­tions of the tree command.

I decided to accept the challenge.

At its simplest, tree simply prints a directory tree, using some of the Unicode Box Drawing characters to show the hi­er­ar­chi­cal re­la­tion­ship, as in the image at right.

I’ve split the code into two phases, which will be covered in two blog posts.

  1. Walking the directory tree on disk to build an in-memory tree.
  2. Pretty-printing the in-memory tree.

While it’s certainly possible to print a subtree as it’s being read, separating the two phases yields code that is cleaner, simpler, and more testable.

In future, I will insert a third phase, processing, between the reading and writing phases, by a weak analogy with Extract-Transform-Load (ETL).

Walking the Directory Tree

There are three kinds of file tree node that I care about: File, Directory, and Symlink. These are the variants exposed by Rust’s FileType.

Here, name refers to the last component of a path; e.g., the gamma in alpha/beta/gamma. The file system metadata is not currently used, but will be in future.

#[derive(Debug)]
pub struct File {
    pub name: String,
    pub metadata: fs::Metadata,
}

#[derive(Debug)]
pub struct Symlink {
    pub name: String,
    pub target: String,
    pub metadata: fs::Metadata,
}

#[derive(Debug)]
pub struct Directory {
    pub name: String,
    pub entries: Vec<FileTree>,
}

File and directory paths are not guaranteed to be UTF-8. Indeed, Unix file paths are an arbitrary sequence of bytes, while Windows file paths are an opaque sequence of 16-bit integers. You might think that I should be using OsString here, since it holds a platform-native string. String has to be valid UTF-8; OsString doesn’t. Un­for­tu­nate­ly, it’s not easy to look at the actual data in an OsString, unless you convert it (possibly lossily) to a String. See File paths and OS strings for more.

The obvious way to represent a file tree node in Rust is as an enum with three tuple-like variants.

#[derive(Debug)]
pub enum FileTree {
    DirNode(Directory),
    FileNode(File),
    LinkNode(Symlink),
}

Here, each variant in the enum holds a struct of similar name. We will be able to take advantage of Rust’s pattern matching to handle each variant.

We’ll use fs::read_dir to read each directory in the hierarchy. The read_dir function returns an iterator that yields instances of io::Result<DirEntry>. If a DirEntry is a directory, we can re­cur­sive­ly invoke our dir_walk function to read the child directory and add its contents to our in-memory tree.

The walkdir crate also walks through a directory tree, but it hides the recursion from you. It’s an excellent choice otherwise.

Skipping and Sorting

In each directory that we read, we need to consider two factors.

  1. Which entries to skip, such as hidden files.
  2. How to sort the entries.

We almost always want to skip hidden files and di­rec­to­ries—on Unix, those entries whose names start with the . character. Every directory includes entries for . (itself) and .. (parent directory), and may include other hidden files or di­rec­to­ries, such as .vimrc or .git.

On Windows, hidden files are controlled by an attribute, not by their name.

For more com­pli­cat­ed usage, we might want to skip ignored files, as specified in .gitignore.

The simplest useful filter for entry names is one that rejects hidden files and di­rec­to­ries.

pub fn is_not_hidden(name: &str) -> bool {
    return !name.starts_with('.');
}

Disk I/O is costly and slow, compared to memory access. It’s far more efficient to not read a directory at all than it is to eliminate a subtree at a later stage. Even if the OS has cached the relevant directory contents, there’s still a cost to the syscall to retrieve that data from the kernel.

There is no specific order to entries in a directory or to the results returned by low-level APIs like fs::read_dir. By default, ls sorts entries al­pha­bet­i­cal­ly, but it can also sort by creation time, mod­i­fi­ca­tion time, or size, in ascending or descending order.

Unix filesys­tems are case-sensitive, but Mac filesys­tems (APFS and HFS+) are case-in­sen­si­tive by default, although they preserve the case of the original filename. Windows’ filesys­tems (NTFS, exFAT, and FAT32) are likewise case-preserving and case-in­sen­si­tive.

Here is a case-sensitive comparator for use with sort_by:

pub fn sort_by_name(a: &fs::DirEntry, b: &fs::DirEntry) -> Ordering {
    let a_name: String =
        a.path().file_name().unwrap().to_str().unwrap().into();     
    let b_name: String =
        b.path().file_name().unwrap().to_str().unwrap().into();
    a_name.cmp(&b_name)                                             
}
  1. This messy expression is necessary to get the name as a String.
  2. cmp returns Less, Equal, or Greater from the Ordering enum.

More on Ordering here.

The dir_walk function

Finally, the recursive dir_walk function that creates the tree of FileTree nodes.

pub fn dir_walk(
    root: &PathBuf,
    filter: fn(name: &str) -> bool,                 
    compare: fn(a: &fs::DirEntry, b: &fs::DirEntry) -> Ordering,
) -> io::Result<Directory> {
    let mut entries: Vec<fs::DirEntry> = fs::read_dir(root)?
        .filter_map(|result| result.ok())
        .collect();                                 
    entries.sort_by(compare);
    let mut directory: Vec<FileTree> =
        Vec::with_capacity(entries.len());          
    for e in entries {
        let path = e.path();
        let name: String = path.file_name().unwrap().to_str().unwrap().into();
        if !filter(&name) {                         
            continue;
        };
        let metadata = fs::metadata(&path).unwrap();
        let node = match path {                     
            path if path.is_dir() => {
                FileTree::DirNode(                  
                    dir_walk(&root.join(name), filter, compare)?)
            }
            path if path.is_symlink() => FileTree::LinkNode(Symlink {
                name: name.into(),
                target: fs::read_link(path).unwrap().to_string_lossy().to_string(),
                metadata: metadata,
            }),
            path if path.is_file() => FileTree::FileNode(File {
                name: name.into(),
                metadata: metadata,
            }),
            _ => unreachable!(),
        };
        directory.push(node);
    }
    let name = root
        .file_name()
        .unwrap_or(OsStr::new("."))                 
        .to_str()
        .unwrap()
        .into();
    Ok(Directory {                                  
        name: name,
        entries: directory,
    })
}
  1. Currently, the filter and compare parameters are fns. They could probably be FnMut traits.
  2. Read directory. Discard any Error results. Collect into a Vec.
  3. We’ll need at most this many entries.
  4. Use filter to discard names that won’t be visited.
  5. Match the path as a DirNode, LinkNode, or FileNode, by using match guards.
  6. Visit the sub­di­rec­to­ry re­cur­sive­ly.
  7. If root was ".", the file_name() will be None.
  8. Return a Directory for this directory, wrapped in an io::Result.

In Part 2, we’ll print the directory tree.

blog comments powered by Disqus
fsymbols for Unicode weirdness » « Implementing the Tree command in Rust, part 2: Printing Trees