I am working on a project involving graph generation using the EDGE model, where my dataset consists of multiple graphs that share a single feature matrix. To facilitate training, I have implemented a custom DataLoader that inherits from the PyG DataLoader, with a specialised collate function to avoid redundant copies of the shared feature matrix during batching.
Specifically, my custom implementation ensures that the feature matrix is not duplicated across the different graphs in the batch, which would otherwise result in an unnecessary increase in memory usage and computational overhead.
Here’s a brief overview of what I've done:
Custom Collator Class: I created a custom collator class that handles the feature matrix differently during batching.
Modified DataLoader: I replaced the default PyG DataLoader with a custom implementation that utilises the new Collator class.
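For reference, here is a minimal sketch of what such a collate function might look like (the name `SharedFeatureCollater` is hypothetical, and a plain dict stands in for PyG's `DataBatch` so the snippet stays self-contained):

```python
import torch

class SharedFeatureCollater:
    """Hypothetical collater: keeps a single reference to the shared
    feature matrix and concatenates the edge indices of all graphs in
    the batch. Node ids are NOT offset, because every graph indexes
    into the same shared node set."""

    def __call__(self, data_list):
        # All graphs share the same x, so one reference suffices.
        x = data_list[0].x
        edge_index = torch.cat([d.edge_index for d in data_list], dim=1)
        # Per-edge graph-assignment vector (mirrors PyG's `batch` vector,
        # but over edges rather than nodes, since the nodes are shared).
        batch = torch.cat([
            torch.full((d.edge_index.size(1),), i, dtype=torch.long)
            for i, d in enumerate(data_list)
        ])
        return {"x": x, "edge_index": edge_index, "batch": batch}
```

A custom loader could then pass an instance of this class as `collate_fn` to `torch.utils.data.DataLoader`.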
Objective:
The goal is to use memory efficiently and avoid redundancy when working with a shared feature matrix across multiple graphs.
Questions:
Alternative Approaches: Are there any existing methods or best practices within PyG that could achieve the same outcome—efficiently batching multiple graphs with a shared feature matrix—without the need for a custom DataLoader?
Performance Considerations: If a custom DataLoader is indeed the best approach, are there any recommended optimisations or potential pitfalls to be aware of when implementing such a solution?
Future Integrations: As I plan to integrate this setup with a VAE-based model for prompt-based graph generation, is there anything specific I should consider at this stage to ensure smooth integration later on?
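On the first question, one alternative worth considering (a sketch, not an official PyG recipe) is to keep `x` out of the `Data` objects entirely, let the standard loader batch only the topology, and gather rows from the single shared matrix inside the model's forward pass. The helper name below is made up:

```python
import torch

def gather_shared_features(x_shared, num_graphs):
    """Hypothetical helper: if the default PyG collate offsets node ids
    by num_nodes per graph, batched node features can be recovered by
    indexing the shared matrix, so copies are materialised only
    transiently inside forward() and never stored in the dataset."""
    num_nodes = x_shared.size(0)
    # Batched node id -> shared node id (ids repeat every num_nodes).
    node_map = torch.arange(num_graphs * num_nodes) % num_nodes
    return x_shared[node_map]  # shape: [num_graphs * num_nodes, feat_dim]
```

This keeps the dataset itself copy-free at the cost of one gather per forward pass.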
I'm new to graph generation (and GNNs in general). Any suggestions on the approach would be greatly appreciated!
Context:
PyTorch Geometric version: 2.5.3
The dataset contains graphs with a shared feature matrix, which needs to be managed efficiently during batching.
Expected outcome
```python
import torch
from torch_geometric.data import Data

# Shared feature matrix
x_s = torch.randn(5, 16)  # 5 nodes with 16 features

# Edge indices of each individual graph
edge_index = torch.tensor([
    [2, 3, 1, 0],
    [1, 2, 3, 4],
])
edge_index_one = torch.tensor([
    [0, 4, 0],
    [1, 1, 3],
])
edge_index_two = torch.tensor([
    [0, 1, 2, 4, 0],
    [1, 1, 3, 0, 3],
])
# ...

# Each element of the list is a Data object, built from the same feature
# matrix and the corresponding edge index
shared_fm_data = Data(x=x_s, edge_index=edge_index)
shared_fm_data_one = Data(x=x_s, edge_index=edge_index_one)
shared_fm_data_two = Data(x=x_s, edge_index=edge_index_two)

shared_fm_data_list = [shared_fm_data, shared_fm_data_one, shared_fm_data_two]
shared_fm_loader = CustomDataLoader(shared_fm_data_list, batch_size=2)
```

Output when iterating over `shared_fm_loader`:

```
DataBatch(x=[5, 16], edge_index=[2, 7], batch=[7])
tensor([[2, 3, 1, 0, 0, 4, 0],
        [1, 2, 3, 4, 1, 1, 3]])
DataBatch(x=[5, 16], edge_index=[2, 5], batch=[5])
tensor([[0, 1, 2, 4, 0],
        [1, 1, 3, 0, 3]])
```
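To illustrate how a model might consume such a batch, here is a sketch of a single aggregation step over the shared node set (the layer name is made up, and note that because nodes are shared, edges from different graphs in the batch would mix messages on the same nodes unless the per-edge `batch` vector is used to mask them apart):

```python
import torch
import torch.nn as nn

class SharedXAggregation(nn.Module):
    """Hypothetical layer: one linear transform of the shared feature
    matrix, then sum-aggregation of messages along the concatenated
    edge_index of the whole batch."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, edge_index):
        h = self.lin(x)                 # transform shared features once
        src, dst = edge_index           # messages flow src -> dst
        out = torch.zeros_like(h)
        out.index_add_(0, dst, h[src])  # sum incoming messages per node
        return out
```

Because `x` is transformed exactly once regardless of how many graphs are in the batch, the memory saving from the custom collation carries through to the forward pass.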
Thank you in advance for your help!