I was looking for a way to estimate the disk space in use by a file, i.e. a Python version of the du command. Naively, I thought that shutil.disk_usage would give me that, since its docstring says: Return disk usage statistics about the given path. But it turns out that it is instead a Python equivalent of the df command: it shows information about the file system on which the path resides. So, I have two questions:
Is there a Python equivalent of the du command? I am happy for now with calling du as an external program, but a Python variant might be useful for repeated use.
Can the docstring of shutil.disk_usage be improved to make clear that it only shows results about the whole file system, not about the individual path?
In a pinch, you could try rounding st_size up to the next multiple of the filesystem block size. Of course, that won’t be correct for sparse files and the like.
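A minimal sketch of that idea, assuming the filesystem block size reported by os.statvfs is the right rounding unit (the function name rough_usage is made up for illustration, and this requires a Unix-like system where os.statvfs exists):

```python
import os

def rough_usage(path):
    """Estimate on-disk size by rounding st_size up to the
    filesystem block size. Wrong for sparse files and files
    with holes, as noted above."""
    size = os.stat(path).st_size
    bsize = os.statvfs(path).f_bsize  # filesystem block size
    # Round up to the next multiple of bsize.
    return ((size + bsize - 1) // bsize) * bsize
```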
Taking a cursory look at how du is actually implemented, it essentially just computes st_blocks * 512 when st_blocks is available. You can read through the full details below and try to match the logic for your system.
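For illustration, that logic can be sketched roughly like this (du_style_usage is a hypothetical name; the fixed 512-byte unit and the st_size fallback are simplifying assumptions, not a faithful port of the coreutils logic):

```python
import os

def du_style_usage(path):
    """Approximate what du reports for a single file:
    st_blocks * 512 where st_blocks is available."""
    st = os.stat(path)
    blocks = getattr(st, "st_blocks", None)
    if blocks is not None:
        return blocks * 512
    # st_blocks is not available on all platforms (e.g. Windows);
    # fall back to the apparent size.
    return st.st_size
```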
Thanks for the suggestions. I may dig into it when I have more time. For now, I will stick with running du in a subprocess, using the following function:

```python
import os
import subprocess

def disk_usage(paths):
    """Return {path: bytes used} by calling GNU du
    (-B1: report in bytes, -s: summarize per argument,
    -D: dereference command-line symlinks)."""
    cmd = ['du', '-B1', '-s', '-D'] + [os.path.expanduser(p) for p in paths]
    P = subprocess.run(cmd, capture_output=True, encoding='utf-8')
    out = {}
    for line in P.stdout.split('\n'):
        s = line.split('\t')
        if len(s) == 2:
            out[s[1]] = int(s[0])
    return out
```
But don’t (physical on-disk) disk block sizes vary over time as drives get bigger? I remember when an HDD block size was 32K; now it might be 512K or more.
How would one find the block size for an individual drive so this routine would keep working for 30 years?
Are block sizes managed differently for HDDs and SSDs?
To be clear, st_blocks isn’t measured in units of filesystem blocksizes or in units of st_blksize (all three are uncorrelated).
The units of st_blocks aren’t POSIX-standardized, but it’s often blocks of 512-bytes. Often enough that the Python docs simply document it as “Number of 512-byte blocks allocated for file.”
The actual filesystem blocksize is found in statvfs.f_bsize, which has the Python analog os.statvfs("...").f_bsize.
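To see how unrelated the three values can be, one could compare them side by side (inspect_sizes is a hypothetical helper for illustration; it requires a Unix-like system where os.statvfs and st_blksize exist):

```python
import os

def inspect_sizes(path):
    """Report the three distinct values discussed above for one path."""
    st = os.stat(path)
    vfs = os.statvfs(path)
    return {
        "st_blksize": st.st_blksize,  # preferred I/O block size for this file
        "f_bsize": vfs.f_bsize,       # the filesystem's block size
        # allocated blocks, often (but not portably) in 512-byte units
        "st_blocks": getattr(st, "st_blocks", None),
    }
```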
st_blksize - A file system-specific preferred I/O block size for this object. In some file system types, this may vary from file to file.
st_blocks - Number of blocks allocated for this object.
[…]
The unit for the st_blocks member of the stat structure is not defined within IEEE Std 1003.1-2001. In some implementations it is 512 bytes. It may differ on a file system basis. There is no correlation between values of the st_blocks and st_blksize, and the f_bsize (from <sys/statvfs.h>) structure members.
The value that st_blocks reports has no direct relation to the underlying filesystem block size. Per the coreutils source code linked in my post above, du assumes that st_blocks is in units of 512 bytes unless you redefine it to some other value with a macro at compile time.