Skip to content

Commit 02a3754

Browse files
committed
[Windows] Avoid using FileIndex for unique IDs
The FileIndex values returned from GetFileInformationByHandle are considered stable and uniquely identifying a file, as long as the handle is open. When handles are closed, there are no guarantees for their stability or uniqueness. On some file systems (such as NTFS), the indices are documented to be stable even across handles. But with some file systems, in particular network mounts, file indices can be reused very soon after handles are closed. When such file indices are used for LLVM's UniqueID, files are considered duplicates as soon as the filesystem driver happens to have used the same file index for the handle used to inspect the file. This caused widespread, non-obvious (seemingly random) breakage. This can happen e.g. if running on a directory that is shared via Remote Desktop or VirtualBox. To avoid the issue, use a hash of the canonicalized path for the file as unique identifier, instead of using FileIndex. This fixes #61401 and #22079. Performance wise, this adds (usually) one extra call to GetFinalPathNameByHandleW for each call to getStatus(). A test cases such as running clang-scan-deps becomes around 1% slower by this, which is considered tolerable. Change the equivalent() function to use getUniqueID instead of checking individual file_status fields. The equivalent(Twine,Twine,bool& result) function calls status() on each path successively, without keeping the file handles open, which also is prone to such false positives. This also gets rid of checks of other superfluous fields in the equivalent(file_status, file_status) function - the unique ID of a file should be enough (that is what is done for Unix anyway). This comes with one known caveat: For hardlinks, each name for the file now gets a different UniqueID, and equivalent() considers them different. While that's not ideal, occasional false negatives for equivalent() is usually that fatal (the cases where we strictly do need to deduplicate files with different path names are quite rare) compared to the issues caused by false positives for equivalent() (where we'd deduplicate and omit totally distinct files). The FileIndex is documented to be stable on NTFS though, so ideally we could maybe have used it in the majority of cases. That would require a heuristic for whether we can rely on FileIndex or not. We considered using the existing function is_local_internal for that; however that caused an unacceptable performance regression (clang-scan-deps became 38% slower in one test, even more than that in another test). Differential Revision: https://reviews.llvm.org/D155579
1 parent 0bb49af commit 02a3754

File tree

4 files changed

+51
-21
lines changed

4 files changed

+51
-21
lines changed

llvm/docs/ReleaseNotes.rst

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -117,6 +117,14 @@ Changes to the WebAssembly Backend
117117
Changes to the Windows Target
118118
-----------------------------
119119

120+
* The LLVM filesystem class ``UniqueID`` and function ``equivalent()``
121+
no longer determine that distinct different path names for the same
122+
hard linked file actually are equal. This is an intentional tradeoff in a
123+
bug fix, where the bug used to cause distinct files to be considered
124+
equivalent on some file systems. This change fixed the issues
125+
https://github.com/llvm/llvm-project/issues/61401 and
126+
https://github.com/llvm/llvm-project/issues/22079.
127+
120128
Changes to the X86 Backend
121129
--------------------------
122130

llvm/include/llvm/Support/FileSystem.h

Lines changed: 3 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -233,8 +233,7 @@ class file_status : public basic_file_status {
233233
#elif defined (_WIN32)
234234
uint32_t NumLinks = 0;
235235
uint32_t VolumeSerialNumber = 0;
236-
uint32_t FileIndexHigh = 0;
237-
uint32_t FileIndexLow = 0;
236+
uint64_t PathHash = 0;
238237
#endif
239238

240239
public:
@@ -255,13 +254,12 @@ class file_status : public basic_file_status {
255254
uint32_t LastAccessTimeHigh, uint32_t LastAccessTimeLow,
256255
uint32_t LastWriteTimeHigh, uint32_t LastWriteTimeLow,
257256
uint32_t VolumeSerialNumber, uint32_t FileSizeHigh,
258-
uint32_t FileSizeLow, uint32_t FileIndexHigh,
259-
uint32_t FileIndexLow)
257+
uint32_t FileSizeLow, uint64_t PathHash)
260258
: basic_file_status(Type, Perms, LastAccessTimeHigh, LastAccessTimeLow,
261259
LastWriteTimeHigh, LastWriteTimeLow, FileSizeHigh,
262260
FileSizeLow),
263261
NumLinks(LinkCount), VolumeSerialNumber(VolumeSerialNumber),
264-
FileIndexHigh(FileIndexHigh), FileIndexLow(FileIndexLow) {}
262+
PathHash(PathHash) {}
265263
#endif
266264

267265
UniqueID getUniqueID() const;

llvm/lib/Support/Windows/Path.inc

Lines changed: 32 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -158,12 +158,7 @@ std::string getMainExecutable(const char *argv0, void *MainExecAddr) {
158158
}
159159

160160
UniqueID file_status::getUniqueID() const {
161-
// The file is uniquely identified by the volume serial number along
162-
// with the 64-bit file identifier.
163-
uint64_t FileID = (static_cast<uint64_t>(FileIndexHigh) << 32ULL) |
164-
static_cast<uint64_t>(FileIndexLow);
165-
166-
return UniqueID(VolumeSerialNumber, FileID);
161+
return UniqueID(VolumeSerialNumber, PathHash);
167162
}
168163

169164
ErrorOr<space_info> disk_space(const Twine &Path) {
@@ -362,16 +357,17 @@ std::error_code is_local(const Twine &path, bool &result) {
362357
}
363358

364359
static std::error_code realPathFromHandle(HANDLE H,
365-
SmallVectorImpl<wchar_t> &Buffer) {
360+
SmallVectorImpl<wchar_t> &Buffer,
361+
DWORD flags = VOLUME_NAME_DOS) {
366362
Buffer.resize_for_overwrite(Buffer.capacity());
367363
DWORD CountChars = ::GetFinalPathNameByHandleW(
368-
H, Buffer.begin(), Buffer.capacity(), FILE_NAME_NORMALIZED);
364+
H, Buffer.begin(), Buffer.capacity(), FILE_NAME_NORMALIZED | flags);
369365
if (CountChars && CountChars >= Buffer.capacity()) {
370366
// The buffer wasn't big enough, try again. In this case the return value
371367
// *does* indicate the size of the null terminator.
372368
Buffer.resize_for_overwrite(CountChars);
373369
CountChars = ::GetFinalPathNameByHandleW(H, Buffer.begin(), Buffer.size(),
374-
FILE_NAME_NORMALIZED);
370+
FILE_NAME_NORMALIZED | flags);
375371
}
376372
Buffer.truncate(CountChars);
377373
if (CountChars == 0)
@@ -647,12 +643,7 @@ bool can_execute(const Twine &Path) {
647643

648644
bool equivalent(file_status A, file_status B) {
649645
assert(status_known(A) && status_known(B));
650-
return A.FileIndexHigh == B.FileIndexHigh &&
651-
A.FileIndexLow == B.FileIndexLow && A.FileSizeHigh == B.FileSizeHigh &&
652-
A.FileSizeLow == B.FileSizeLow &&
653-
A.LastWriteTimeHigh == B.LastWriteTimeHigh &&
654-
A.LastWriteTimeLow == B.LastWriteTimeLow &&
655-
A.VolumeSerialNumber == B.VolumeSerialNumber;
646+
return A.getUniqueID() == B.getUniqueID();
656647
}
657648

658649
std::error_code equivalent(const Twine &A, const Twine &B, bool &result) {
@@ -698,6 +689,7 @@ static perms perms_from_attrs(DWORD Attrs) {
698689
}
699690

700691
static std::error_code getStatus(HANDLE FileHandle, file_status &Result) {
692+
SmallVector<wchar_t, MAX_PATH> ntPath;
701693
if (FileHandle == INVALID_HANDLE_VALUE)
702694
goto handle_status_error;
703695

@@ -725,13 +717,37 @@ static std::error_code getStatus(HANDLE FileHandle, file_status &Result) {
725717
if (!::GetFileInformationByHandle(FileHandle, &Info))
726718
goto handle_status_error;
727719

720+
// File indices aren't necessarily stable after closing the file handle;
721+
// instead hash a canonicalized path.
722+
//
723+
// For getting a canonical path to the file, call GetFinalPathNameByHandleW
724+
// with VOLUME_NAME_NT. We don't really care exactly what the path looks
725+
// like here, as long as it is canonical (e.g. doesn't differentiate between
726+
// whether a file was referred to with upper/lower case names originally).
727+
// The default format with VOLUME_NAME_DOS doesn't work with all file system
728+
// drivers, such as ImDisk. (See
729+
// https://github.com/rust-lang/rust/pull/86447.)
730+
uint64_t PathHash;
731+
if (std::error_code EC =
732+
realPathFromHandle(FileHandle, ntPath, VOLUME_NAME_NT)) {
733+
// If realPathFromHandle failed, fall back on the fields
734+
// nFileIndex{High,Low} instead. They're not necessarily stable on all file
735+
// systems as they're only documented as being unique/stable as long as the
736+
// file handle is open - but they're a decent fallback if we couldn't get
737+
// the canonical path.
738+
PathHash = (static_cast<uint64_t>(Info.nFileIndexHigh) << 32ULL) |
739+
static_cast<uint64_t>(Info.nFileIndexLow);
740+
} else {
741+
PathHash = hash_combine_range(ntPath.begin(), ntPath.end());
742+
}
743+
728744
Result = file_status(
729745
file_type_from_attrs(Info.dwFileAttributes),
730746
perms_from_attrs(Info.dwFileAttributes), Info.nNumberOfLinks,
731747
Info.ftLastAccessTime.dwHighDateTime, Info.ftLastAccessTime.dwLowDateTime,
732748
Info.ftLastWriteTime.dwHighDateTime, Info.ftLastWriteTime.dwLowDateTime,
733749
Info.dwVolumeSerialNumber, Info.nFileSizeHigh, Info.nFileSizeLow,
734-
Info.nFileIndexHigh, Info.nFileIndexLow);
750+
PathHash);
735751
return std::error_code();
736752

737753
handle_status_error:

llvm/unittests/Support/Path.cpp

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -708,12 +708,16 @@ TEST_F(FileSystemTest, Unique) {
708708

709709
ASSERT_NO_ERROR(fs::remove(Twine(TempPath2)));
710710

711+
#ifndef _WIN32
711712
// Two paths representing the same file on disk should still provide the
712713
// same unique id. We can test this by making a hard link.
714+
// FIXME: Our implementation of getUniqueID on Windows doesn't consider hard
715+
// links to be the same file.
713716
ASSERT_NO_ERROR(fs::create_link(Twine(TempPath), Twine(TempPath2)));
714717
fs::UniqueID D2;
715718
ASSERT_NO_ERROR(fs::getUniqueID(Twine(TempPath2), D2));
716719
ASSERT_EQ(D2, F1);
720+
#endif
717721

718722
::close(FileDescriptor);
719723

@@ -909,12 +913,16 @@ TEST_F(FileSystemTest, TempFiles) {
909913

910914
// Create a hard link to Temp1.
911915
ASSERT_NO_ERROR(fs::create_link(Twine(TempPath), Twine(TempPath2)));
916+
#ifndef _WIN32
917+
// FIXME: Our implementation of equivalent() on Windows doesn't consider hard
918+
// links to be the same file.
912919
bool equal;
913920
ASSERT_NO_ERROR(fs::equivalent(Twine(TempPath), Twine(TempPath2), equal));
914921
EXPECT_TRUE(equal);
915922
ASSERT_NO_ERROR(fs::status(Twine(TempPath), A));
916923
ASSERT_NO_ERROR(fs::status(Twine(TempPath2), B));
917924
EXPECT_TRUE(fs::equivalent(A, B));
925+
#endif
918926

919927
// Remove Temp1.
920928
::close(FileDescriptor);

0 commit comments

Comments
 (0)