Closed
Description
Protobuf has a size limit of 2GB per message.
A single index for Chromium is about 6GB, triggering an error in scip-clang.
[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/message_lite.cc:402] scip.Index exceeded maximum protobuf size of 2GB: 6138839817
The indexer needs to shard the data ahead-of-time and emit that instead.
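One way to shard ahead of time is to group documents greedily so that each shard stays under a byte budget. The sketch below assumes the indexer already knows each document's serialized size. `plan_shards` and `MAX_SHARD_BYTES` are hypothetical names for illustration, not part of scip-clang, and the sketch ignores the per-shard overhead of the repeated metadata.

```python
from typing import List

# Hypothetical budget, chosen to stay well below the 2GB protobuf limit.
MAX_SHARD_BYTES = 1 << 30


def plan_shards(doc_sizes: List[int], max_bytes: int = MAX_SHARD_BYTES) -> List[List[int]]:
    """Greedily group document indices so each group's total size is <= max_bytes.

    A single document larger than max_bytes ends up in a shard of its own.
    """
    shards: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for i, size in enumerate(doc_sizes):
        # Close the current shard if adding this document would exceed the budget.
        if current and current_size + size > max_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        shards.append(current)
    return shards
```

Documents keep their original order, so concatenating the shards in sequence yields the same document list as the unsharded index.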
Proposed sharded index format:
- Directory containing one or more *.shard.scip files, each of which contains a scip.Index. All shards must have the same metadata.
- The default name for the outer directory will be index.scip.
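As an illustration of this layout, the sketch below writes already-serialized shard payloads into an output directory. The zero-padded naming scheme and the `write_shards` helper are assumptions made for this example; the proposal only requires that files end in .shard.scip.

```python
import os
from typing import Iterable, List


def write_shards(out_dir: str, shard_payloads: Iterable[bytes]) -> List[str]:
    """Write each payload to out_dir as NNNNN.shard.scip and return the file names."""
    os.makedirs(out_dir, exist_ok=True)
    names: List[str] = []
    for i, payload in enumerate(shard_payloads):
        # Zero-padding keeps lexicographic order equal to numeric order
        # (hypothetical naming scheme, not mandated by the proposal).
        name = f"{i:05d}.shard.scip"
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(payload)
        names.append(name)
    return names
```

For example, `write_shards("index.scip", payloads)` would produce index.scip/00000.shard.scip, index.scip/00001.shard.scip, and so on.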
src-cli can be pointed to index.scip with -file (maybe we should rename this flag?). It will be responsible for compressing/tarring and uploading the index.
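The tarring step could look roughly like the sketch below. It is only an illustration in Python: the real implementation would live in the Go lib/codeintel/upload package, and the gzip compression, the flat archive layout, and the `tar_index_dir` name are all assumptions.

```python
import os
import tarfile


def tar_index_dir(index_dir: str, archive_path: str) -> None:
    """Pack every *.shard.scip file in index_dir into a gzip-compressed tar archive."""
    shard_names = sorted(n for n in os.listdir(index_dir) if n.endswith(".shard.scip"))
    if not shard_names:
        raise ValueError(f"no *.shard.scip files found in {index_dir}")
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in shard_names:
            # Store shards at the archive root under their bare file names
            # (assumed layout), skipping any unrelated files in the directory.
            tar.add(os.path.join(index_dir, name), arcname=name)
```

Adding the files in sorted order means the archive's member order matches the lexicographic processing order of the shards.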
The behavior should be as-if:
- The metadata field of a scip.Index was populated by the metadata field in some shard.
- Any other fields of scip.Index (currently documents and external_symbols) in *.shard.scip files in the archive are processed in lexicographic ordering based on shard file names.
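The as-if semantics above can be sketched as a merge over decoded shards. Plain dicts stand in for decoded scip.Index messages here so that the example needs no generated protobuf bindings. `merge_shards` is a hypothetical helper, and a real backend would likely stream shards rather than materialize a merged index.

```python
from typing import Any, Dict


def merge_shards(shards: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Merge shards (file name -> decoded index as a dict) into one logical index."""
    if not shards:
        raise ValueError("at least one shard is required")
    # Python compares str by code point, which gives lexicographic file-name order.
    names = sorted(shards)
    # All shards must carry the same metadata, so any shard can supply it.
    metadata = shards[names[0]]["metadata"]
    merged: Dict[str, Any] = {"metadata": metadata, "documents": [], "external_symbols": []}
    for name in names:
        shard = shards[name]
        if shard["metadata"] != metadata:
            raise ValueError(f"metadata mismatch in shard {name}")
        merged["documents"].extend(shard.get("documents", []))
        merged["external_symbols"].extend(shard.get("external_symbols", []))
    return merged
```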
We should add documentation about this format in the README or in the scip.proto file.
This feature requires changes in:
- The backend, to accept the new sharded format. https://github.com/sourcegraph/sourcegraph/pull/51132
- The lib/codeintel/upload package in the Sourcegraph monorepo, to create the tar archive. https://github.com/sourcegraph/sourcegraph/pull/51134
- src-cli, to use the newer lib/codeintel/upload package and pass the right content type header.
- scip-clang, to emit large indexes in the sharded format.
- This repo, for a docs update.
Maybe we should also mention this feature addition in various CHANGELOGs.