FileSaver - v1.1.1

Fast free and open-source disk usage scanner

Pedro Tacla Yamada
8 Jun 20205 min read


FileSaver Logo
FileSaver logo

Since I got my latest MacBook, I can only go for a couple of months before it starts running out of free disk space. It’s a pretty bad situation where to save a bit on SSD size, I have to spend my time cleaning-up large files like music samples, VMs, node_modules and so on.

So then I decided to save a lot of time 😂 and instead of buying one of the available “disk cleaning” solutions, I spent some time trying some open-source software and eventually decided to write FileSaver, a fast multi-threaded disk usage scanner.

FileSaver screenshot

On my MacBook, scanning 5.3 million files takes 100 seconds, with around 51.000 files/second of throughput. The compromise is high memory usage of around 2GB.

What I used to do with coreutils

I used to do the following command combination to find sizes:

# Find file sizes
du -h ./directory | \
    # Sort by size
    # (just sort on linux, requires gnu coreutils on OSX)
    gsort -h      | \
    # Put on a pager
    less

This works fine except that:

  • It’s slow
  • It’s not interactive

duc

I’ve also tried a great alternative to the above, called duc.

duc is a great open-source utility written in C that performs this task a bit faster and with interactive options. It can also generate an index that can be re-used, so we don’t need to scan all files when looking for places to clean.

Even so, what I found was that on my filled-up disk, it was really slow to generate the index for the whole thing, including system directories like /Library and /usr/local. As a developer, a lot of the trash was around those.

Parallel size scanning

I profiled duc and found that it’s single threaded nature was a bit bottleneck. Writes to the index blocked scanning file sizes and the calculations to aggregate sizes also probably compromised throughput.

I ran over this issue:

Some thoughts about indexing, filtering, parallelization, etc https://github.com/zevv/duc/issues/205

The idea that I could write a multi-threaded program and get a massive performance boost stuck with me.

I considered contributing to duc, but it had two major flaws for me:

  • It’s GPL licensed
  • It’s written in C

The 2nd point is just because, knowing a bit of C++ I didn’t see the point of going to C.

I then designed and iterated on a parallel scanning architecture / aggregation algorithm that achieves pretty good results in terms of throughput.

Problem domain and aggregation overview

The problem we have to solve is that of finding directory sizes based on all their children’s sizes.

Given a certain path to scan, we want to lstat it. This will give us its size and file type. If this path is a directory, we then want to scan all its children.

This path will have “completed” once all its children have completed or if lstat says it isn’t a directory.

So we have to continuously scan children and as they come in, update the parent path’s completed children count (and do this recursively up).

To support better UX and performance, whenever a path is completed, its delta update in size is carried up each of its parents.

For example, imagine we have the following directory structure:

content
└── blog
    └── filesaver-freeing-up-space
        ├── index.md
        └── logo.png
  1. We’ll start the scan at /content

    • Since it’s a directory we read its children
    • Its size and 1 child count is stored
  2. We do the same at /content/blog and then /content/blog/filesaver-freeing-up-space
  3. At this point our state looks like:

    • /content - 1 pending child path
    • /content/blog - 1 pending child path
    • /content/blog/filesaver-freeing-up-space - 2 pending child paths
  4. We scan index.md

    • /content/blog/filesaver-freeing-up-space/index.md - is completed + 100KB
    • /content/blog/filesaver-freeing-up-space - 1 pending child paths (+ 100KB)
    • Other paths’ pending counts are kept, but their total sizes are updated
    • /content/blog - 1 pending child paths (+ 100KB)
    • /content - 1 pending child paths (+ 100KB)
  5. We scan logo.png

    • /content/blog/filesaver-freeing-up-space/logo.png - is completed + 300KB
    • /content/blog/filesaver-freeing-up-space - is completed (+ 300KB)
    • /content/blog - is completed (+ 300KB)
    • /content - is completed (+ 300KB)

Architecture

FileSaver is a simple program, but it’s structured. The architecture is as follows.

FileSaver architecture diagram

Components

  1. Lock based queue
  2. Worker Threads
  3. Reader Thread
  4. UI Thread

Queue

We pass data between threads using an uninteresting lock based queue, plus one extra lock on the size/pending state, since the UI thread will read it while its being processed.

For this we have filesaver::data::WorkQueue

Worker Threads

The workers work with two queues:

  • workQueue provides incoming paths to scan
  • resultQueue has entries pushed onto it, with their sizes, types and child paths attached

The worker will enqueue any children it finds.

Reader thread

The reader thread performs the aggregation as described in the aggregation overview. The state gets stored in a giant hash-map (std::unordered_map) mapping paths to their sizes and number of pending children.

UI

The UI uses AppKit and is written in Objective-C++. It polls the state for the directories that are visible on the screen.

Distribution

For now, FileSaver is available from its GitHub page. The project has structure that should make it convenient to include it as a library onto a different codebase. Though there could still be improvements for this use-case like organizing public headers and the build system, it goes most of the way and should have good DoxyGen docs.

The code is distributed under the MIT license.

Download FileSaver

Copyright (c) 2020 Beijaflor Software