The Problem on the Desk
When storage usage spikes, quick insight becomes the currency of action. The challenge is to find the five largest top-level directories under /work/projects while ignoring hidden directories, tolerating spaces in names, and surviving unreadable paths. The goal is a single, reliable one-liner that prints one "dir: sizeK" line per directory, using only standard UNIX tools.
The One-Liner Unfolds
You can achieve this with a robust pipeline that accounts for spaces and errors, then formats the result cleanly. The following one-liner enumerates top-level dirs, computes their disk usage, sorts the results, and prints them in the required format. It tolerates spaces in directory names and suppresses unreadable entries.
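Here is a sketch of such a pipeline, assuming GNU find/du and a bash shell (read -d '' is a bashism; flags may differ on BSD/macOS):

```bash
# List non-hidden top-level directories, NUL-delimited so names with
# spaces survive; size each in KB; hide unreadable-path errors; sort
# descending; keep the top 5; print "dir: sizeK".
find /work/projects -mindepth 1 -maxdepth 1 -type d ! -name '.*' -print0 |
  while IFS= read -r -d '' dir; do
    du -sk "$dir" 2>/dev/null          # emits "<KB>\t<dir>"
  done |
  sort -rn |
  head -5 |
  awk -F'\t' '{print $2 ": " $1 "K"}'
```

The while read -d '' loop, rather than xargs or a plain for loop, keeps each name intact, and silencing du's stderr lets the pipeline continue past directories it cannot read.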
Why This Pattern Works
This pattern combines space-safe enumeration (find -print0), per-directory sizing (du -sk), robust error handling (2>/dev/null), and straightforward formatting (awk). The approach mirrors a broader lesson: when data growth tests scalability, simple, composable tools often win. For context on why data compression and storage strategies matter at scale, see the foundational discussions of data compression and storage efficiency [1][2][7].
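To see the space-safety concretely, compare a naive loop with the NUL-delimited one (the /tmp/demo path and the directory name are hypothetical):

```bash
mkdir -p "/tmp/demo/big data"   # hypothetical name containing a space

# Naive: word splitting breaks "big data" into "big" and "data"
for d in $(find /tmp/demo -mindepth 1 -maxdepth 1 -type d); do
  du -sk "$d"                   # du: cannot access '/tmp/demo/big'
done

# Space-safe: -print0 and read -d '' keep the name intact
find /tmp/demo -mindepth 1 -maxdepth 1 -type d -print0 |
  while IFS= read -r -d '' d; do du -sk "$d"; done
```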
From Discovery to Practice
The real-world takeaway goes beyond a single command. It's a reminder that performance and cost often hinge on early, targeted instrumentation: identifying where the growth happens first, then applying layered strategies to reduce cost while preserving accessibility. In many teams, this translates to combining quick on-disk insights with smarter data retention and compression policies [1].

Real-World Case Study: Uber
Uber faced explosive growth in Spark log data. Their big data platform could generate up to 200TB of logs on a busy day, totaling about 5PB of uncompressed logs per month. Retaining this data for debugging and analytics was becoming prohibitively expensive, so they pursued a scalable, searchable compression solution [1].

Key Takeaway: Tailored, two-phase compression can provide massive storage savings while preserving searchability. Pushing compression closer to data generation (on the appender), plus a more aggressive offline phase-2 compression, unlocks practical long-term retention at petabyte scale.
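As a rough illustration of the two-phase idea only (not Uber's actual CLP pipeline, which is log-structure-aware), generic tools can stage compression the same way: cheap and fast near the appender, aggressive and offline later. The service name here is hypothetical:

```bash
# Phase 1, near the appender: fast, low-CPU compression at write time
some_service 2>&1 | zstd -1 -c > service.log.zst

# Phase 2, offline: recompress old logs aggressively for cold storage
zstd -dc service.log.zst | zstd -19 -c > service.log.archive.zst
```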
Directory Size Pipeline
```mermaid
graph TD
    A(Start) --> B{Identify top-level dirs}
    B --> C["Enumerate with find -mindepth 1 -maxdepth 1 -type d -print0"]
    C --> D["Compute sizes with du -sk per dir"]
    D --> E["Suppress unreadable errors (2>/dev/null)"]
    E --> F["Sort descending by size"]
    F --> G["Output top 5 as 'dir: sizeK'"]
    G --> H(Done)
```

Key Takeaways
- Use find -print0 to safely enumerate directories with spaces
- Pipe into while read -d '' to handle arbitrary names
- Use 2>/dev/null to suppress unreadable-path errors; sort and head pick the top 5
- awk formats each output line exactly as 'dir: sizeK'
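With hypothetical directory names and sizes, the pipeline's output looks like this, sorted descending and including a name with a space:

```
ml-checkpoints: 1848392K
build cache: 902144K
render-farm: 511020K
legacy-dumps: 218776K
tmp-extracts: 90412K
```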
Did you know? Two-phase compression, a strategy Uber explored, demonstrates how moving work closer to data generation can unlock orders of magnitude in storage savings.
References
- [1] Reducing Logging Cost by Two Orders of Magnitude using CLP (article)
- [2] Data compression (documentation)
- [3] du (GNU core utilities manual) (documentation)
- [4] shutil.disk_usage (documentation)
- [5] os.walk (documentation)
- [6] File system (documentation)
- [7] Lossless data compression (documentation)
- [8] TLDR (documentation)
Wrapping Up
The one-liner is more than a line of code; it's a pattern for disciplined data hygiene. Start by knowing where the growth hides, then apply incremental improvements that scale with the data, not against it. In practice, teams can adopt this mindset to curb storage costs while preserving fast, searchable access to critical logs.