The Problem on the Desk
When storage usage spikes, quick insight becomes the currency of action. The challenge is to find the five largest top-level directories under /work/projects while ignoring hidden directories, tolerating spaces in names, and surviving unreadable paths. The goal is a single, reliable one-liner that prints one "dir: sizeK" line per directory, using only standard UNIX tools.
The One-Liner Unfolds
You can achieve this with a robust pipeline that accounts for spaces and errors, then formats the result cleanly. The following one-liner enumerates top-level dirs, computes their disk usage, sorts the results, and prints them in the required format. It tolerates spaces in directory names and suppresses unreadable entries.
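Here is a sketch of such a pipeline, assuming GNU find/du and a bash shell (read -d '' is a bashism; flags may differ on BSD/macOS):

```bash
# List non-hidden top-level directories, NUL-delimited so names with
# spaces survive; size each in KB; hide unreadable-path errors; sort
# descending; keep the top 5; print "dir: sizeK".
find /work/projects -mindepth 1 -maxdepth 1 -type d ! -name '.*' -print0 |
  while IFS= read -r -d '' dir; do
    du -sk "$dir" 2>/dev/null          # emits "<KB>\t<dir>"
  done |
  sort -rn |
  head -5 |
  awk -F'\t' '{print $2 ": " $1 "K"}'
```

The while read -d '' loop, rather than xargs or a plain for loop, keeps each name intact, and silencing du's stderr lets the pipeline continue past directories it cannot read.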
Why This Pattern Works
This pattern combines space-safe enumeration (find -print0), per-directory sizing (du -sk), robust error handling (2>/dev/null), and straightforward formatting (awk). The approach mirrors a broader lesson: when data growth tests scalability, simple, composable tools often win. For context on why data compression and storage strategies matter at scale, see the foundational discussions of data compression and storage efficiency [1][2][7].
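To see the space-safety concretely, compare a naive loop with the NUL-delimited one (the /tmp/demo path and the directory name are hypothetical):

```bash
mkdir -p "/tmp/demo/big data"   # hypothetical name containing a space

# Naive: word splitting breaks "big data" into "big" and "data"
for d in $(find /tmp/demo -mindepth 1 -maxdepth 1 -type d); do
  du -sk "$d"                   # du: cannot access '/tmp/demo/big'
done

# Space-safe: -print0 and read -d '' keep the name intact
find /tmp/demo -mindepth 1 -maxdepth 1 -type d -print0 |
  while IFS= read -r -d '' d; do du -sk "$d"; done
```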
From Discovery to Practice
The real-world takeaway goes beyond a single command. It's a reminder that performance and cost often hinge on early, targeted instrumentation: identifying where the growth happens first, then applying layered strategies to reduce cost while preserving accessibility. In many teams, this translates to combining quick on-disk insights with smarter data retention and compression policies [1].

Real-World Case Study: Uber
Uber faced explosive growth in Spark log data. Their big data platform could generate up to 200TB of logs on a busy day, totaling about 5PB of uncompressed logs per month. Retaining this data for debugging and analytics was becoming prohibitively expensive, so they pursued a scalable, searchable compression solution [1].

Key Takeaway: Tailored, two-phase compression can provide massive storage savings while preserving searchability. Pushing compression closer to data generation (on the appender), plus a more aggressive offline phase-2 compression, unlocks practical long-term retention at petabyte scale.
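As a rough illustration of the two-phase idea only (not Uber's actual CLP pipeline, which is log-structure-aware), generic tools can stage compression the same way: cheap and fast near the appender, aggressive and offline later. The service name here is hypothetical:

```bash
# Phase 1, near the appender: fast, low-CPU compression at write time
some_service 2>&1 | zstd -1 -c > service.log.zst

# Phase 2, offline: recompress old logs aggressively for cold storage
zstd -dc service.log.zst | zstd -19 -c > service.log.archive.zst
```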
Directory Size Pipeline
```mermaid
graph TD
    A(Start) --> B{Identify top-level dirs}
    B --> C["Enumerate with find -mindepth 1 -maxdepth 1 -type d -print0"]
    C --> D["Compute sizes with du -sk per dir"]
    D --> E["Suppress unreadable errors (2>/dev/null)"]
    E --> F["Sort descending by size"]
    F --> G["Output top 5 as 'dir: sizeK'"]
    G --> H(Done)
```

Key Takeaways
- Use find -print0 to safely enumerate directories with spaces
- Pipe into while read -d '' to handle arbitrary names
- Use 2>/dev/null to suppress unreadable-path errors; sort and head pick the top 5
- awk formats each output line exactly as 'dir: sizeK'
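With hypothetical directory names and sizes, the pipeline's output looks like this, sorted descending and including a name with a space:

```
ml-checkpoints: 1848392K
build cache: 902144K
render-farm: 511020K
legacy-dumps: 218776K
tmp-extracts: 90412K
```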
Did you know? Two-phase compression, a strategy Uber explored, demonstrates how moving work closer to data generation can unlock orders of magnitude in storage savings.
References
- [1] Reducing Logging Cost by Two Orders of Magnitude using CLP (article)
- [2] Data compression (documentation)
- [3] du (GNU core utilities manual) (documentation)
- [4] shutil.disk_usage (documentation)
- [5] os.walk (documentation)
- [6] File system (documentation)
- [7] Lossless data compression (documentation)
- [8] TLDR (documentation)
Wrapping Up
The one-liner is more than a line of code; it's a pattern for disciplined data hygiene. Start by knowing where the growth hides, then apply incremental improvements that scale with the data, not against it. In practice, teams can adopt this mindset to curb storage costs while preserving fast, searchable access to critical logs.