I am trying to compute eigenvector centrality (and a few similar metrics) on my graph. The node label and relationship type I want to include cover billions of nodes and hundreds of billions of relationships.
I tried the following projection.
CALL gds.graph.project.cypher(
'myGraph',
'MATCH (n:Person) RETURN id(n) AS id',
'MATCH (n:Person)-[r]-(m:Person) RETURN id(n) AS source, id(m) AS target'
)
and I get the following error:
Failed to invoke procedure gds.graph.project.cypher: Caused by: java.lang.OutOfMemoryError: Java heap space
With my current memory configuration (32g of JVM heap), the projection ends up consuming all the available memory.
Do I correctly understand that projections are needed to run algorithms such as eigenvector centrality? Can't such algorithms run directly on the database, e.g., by streaming data or querying the needed information through an API instead of loading everything into memory?
Given that I cannot run on a larger machine or a cluster, I am thinking of running the projection on a sampled subset of the nodes (e.g., randomly selecting 10% of nodes). However, this is not ideal as it is not representative enough. Do you have suggestions for running GDS algorithms on such large graphs?
I am running the process on a single machine with 64GB of RAM, and the database is stored on an NVMe drive. Database version: 5.26.1.
GDS requires graph projections for its algorithms; there is no option to run directly on the database. GDS also needs to project all the data you want to run on into a single in-memory projection; there is no option to spill to disk.
There are a few things you could try, but it's unlikely that you would fit a graph with billions of nodes and relationships into 32g of JVM heap.
You're using the legacy Cypher projection, which is the least optimized variant and is no longer supported. You could update to the current Cypher projection, but in your case I'd recommend going with a native projection.
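A minimal sketch of what that native projection could look like, assuming the relationship type is called KNOWS (substitute your actual type) and projecting it as undirected for eigenvector centrality:

CALL gds.graph.project(
  'myGraph',
  'Person',
  {KNOWS: {orientation: 'UNDIRECTED'}}
)
YIELD graphName, nodeCount, relationshipCount;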
You can use memory estimation to see how much memory GDS would need for the projection without actually running it. That gives you an idea of how much more memory you would need, or how much data you would have to drop, to fit everything into the available memory. Note that this is only an estimate and can be off, and it covers GDS only; you still need additional memory for the database itself.
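For example, the projection itself can be estimated up front; the KNOWS type below is again a placeholder for your actual relationship type:

CALL gds.graph.project.estimate(
  'Person',
  {KNOWS: {orientation: 'UNDIRECTED'}}
)
YIELD requiredMemory, bytesMin, bytesMax, nodeCount, relationshipCount;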
Native projections can work with a smaller page cache. If you don't need to run anything on the database other than GDS, you can try reducing the page cache to 1g or even less (depending on your measurements).
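For example, in neo4j.conf (Neo4j 5.x setting name; tune the value based on your own measurements):

server.memory.pagecache.size=1g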
The JVM uses a technique called compressed object pointers ("compressed oops") to manage heaps smaller than 32g with internal 32-bit pointers instead of 64-bit ones. The downside is that heap sizes between 32g and roughly 48g are something of a dead zone, where the additional memory is largely spent offsetting the increased pointer size. Configure 31g (or something else below 32g), or go above 48g, to make the best use of the memory that you have.
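In neo4j.conf that would be something like:

server.memory.heap.initial_size=31g
server.memory.heap.max_size=31g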
In all likelihood, you still won't fit everything. Sampling is probably the best you can do there. GDS has a graph sampling algorithm, but it also requires a projection first.
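A sketch of that workflow, assuming the random walk with restarts sampler (gds.graph.sample.rwr) available in recent GDS versions, a 10% sampling ratio, and that the 'myGraph' projection already exists:

// Sample roughly 10% of the nodes from the existing projection
CALL gds.graph.sample.rwr('mySample', 'myGraph', {samplingRatio: 0.1})
YIELD graphName, fromGraphName, nodeCount, relationshipCount;

// Run eigenvector centrality on the sampled graph
CALL gds.eigenvector.stream('mySample')
YIELD nodeId, score
RETURN nodeId, score
ORDER BY score DESC
LIMIT 10;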
Thanks for confirming that GDS needs a projection; that surprises me as it defeats the purpose of using a database if you need to map its data to an in-memory representation to run an algorithm.
The reason GDS uses an in-memory projection is that the database itself is written with transactional workloads in mind, not analytical ones. We used to have a "run directly on the database" version of a GDS graph a long time ago (5+ years), and it was very slow, around two orders of magnitude slower than going through the separate projection step.
Having enough memory to project the graph is currently the only way to work with graphs of that size in GDS.
That makes sense; you'd expect better performance from an algorithm that runs entirely in memory than from one that relies on disk-based IO.
I think providing both in-memory and on-disk options would have made more sense, in particular because: (a) NVMe is becoming faster and cheaper, so disk-based IO is not as expensive as it was on a 5400rpm HDD, and (b) if I were able to fit my data in memory, why would I bother using a database in the first place? :) My data is ~4TB when stored on disk; I don't think expecting to load all of it into memory in order to run algorithms such as PageRank is reasonable or realistic. I can filter and down-sample until the data fits in memory, but (a) I could have done that with my data in plain text files, and (b) such significant down-sampling (0.1% of my data) will not be representative enough.
I imagine the two options would work like this:
if your data fits in memory: create a projection in memory and run the fast in-memory algorithms.
if your data does not fit in memory: run an algorithm that uses minimal memory and works by issuing frequent queries to the database and relying on disk-based caches. It would run slower, perhaps taking days to complete, but it would unblock large-scale graphs.
Yes, there is certainly room for that; in the end, though, it's a prioritization issue. In practice, GDS is a product primarily sold to enterprises, and so far we haven't had customers who weren't able to rent a big enough machine in the cloud (some even had 12TB of RAM) to cover the memory requirements. As a result, providing a good out-of-core implementation of GDS has never been prioritized highly enough compared to the effort it would take to implement. It's not that we've never thought about it; we just never had enough people ask for it to justify working on it.