I am working on a project involving graph generation using the EDGE model, where my dataset consists of multiple graphs that share a single feature matrix. To facilitate training, I have implemented a custom DataLoader that inherits from the PyG DataLoader, with a specialised collate function to avoid redundant copies of the shared feature matrix during batching.
Specifically, my custom implementation ensures that the feature matrix is not duplicated across the different graphs in the batch, which would otherwise result in an unnecessary increase in memory usage and computational overhead.
Here’s a brief overview of what I've done:
Custom Collator Class: I created a custom collator class that handles the feature matrix differently during batching.
Modified DataLoader: I replaced the default PyG DataLoader with a custom implementation that utilises the new Collator class.
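For reference, here is a minimal sketch of what such a collate function might look like (the name `SharedFeatureCollater` is hypothetical, and a plain dict stands in for PyG's `DataBatch` so the snippet stays self-contained):

```python
import torch

class SharedFeatureCollater:
    """Hypothetical collater: keeps a single reference to the shared
    feature matrix and concatenates the edge indices of all graphs in
    the batch. Node ids are NOT offset, because every graph indexes
    into the same shared node set."""

    def __call__(self, data_list):
        # All graphs share the same x, so one reference suffices.
        x = data_list[0].x
        edge_index = torch.cat([d.edge_index for d in data_list], dim=1)
        # Per-edge graph-assignment vector (mirrors PyG's `batch` vector,
        # but over edges rather than nodes, since the nodes are shared).
        batch = torch.cat([
            torch.full((d.edge_index.size(1),), i, dtype=torch.long)
            for i, d in enumerate(data_list)
        ])
        return {"x": x, "edge_index": edge_index, "batch": batch}
```

A custom loader could then pass an instance of this class as `collate_fn` to `torch.utils.data.DataLoader`.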
Objective:
The goal is to use memory efficiently and avoid redundancy when working with a shared feature matrix across multiple graphs.
Questions:
Alternative Approaches: Are there any existing methods or best practices within PyG that could achieve the same outcome—efficiently batching multiple graphs with a shared feature matrix—without the need for a custom DataLoader?
Performance Considerations: If a custom DataLoader is indeed the best approach, are there any recommended optimisations or potential pitfalls to be aware of when implementing such a solution?
Future Integrations: As I plan to integrate this setup with a VAE-based model for prompt-based graph generation, is there anything specific I should consider at this stage to ensure smooth integration later on?
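On the first question, one alternative worth considering (a sketch, not an official PyG recipe) is to keep `x` out of the `Data` objects entirely, let the standard loader batch only the topology, and gather rows from the single shared matrix inside the model's forward pass. The helper name below is made up:

```python
import torch

def gather_shared_features(x_shared, num_graphs):
    """Hypothetical helper: if the default PyG collate offsets node ids
    by num_nodes per graph, batched node features can be recovered by
    indexing the shared matrix, so copies are materialised only
    transiently inside forward() and never stored in the dataset."""
    num_nodes = x_shared.size(0)
    # Batched node id -> shared node id (ids repeat every num_nodes).
    node_map = torch.arange(num_graphs * num_nodes) % num_nodes
    return x_shared[node_map]  # shape: [num_graphs * num_nodes, feat_dim]
```

This keeps the dataset itself copy-free at the cost of one gather per forward pass.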
I'm new to graph generation (and GNNs in general). Any suggestions on the approach would be greatly appreciated!
Context:
PyTorch Geometric version: 2.5.3
The dataset contains graphs with a shared feature matrix, which needs to be managed efficiently during batching.
Expected outcome
```python
import torch
from torch_geometric.data import Data

# Shared feature matrix
x_s = torch.randn(5, 16)  # 5 nodes with 16 features

# Edge indices of each individual graph
edge_index = torch.tensor([
    [2, 3, 1, 0],
    [1, 2, 3, 4],
])
edge_index_one = torch.tensor([
    [0, 4, 0],
    [1, 1, 3],
])
edge_index_two = torch.tensor([
    [0, 1, 2, 4, 0],
    [1, 1, 3, 0, 3],
])
# ...

# Each element of the list is a Data object, built from the same feature
# matrix and the corresponding edge index
shared_fm_data = Data(x=x_s, edge_index=edge_index)
shared_fm_data_one = Data(x=x_s, edge_index=edge_index_one)
shared_fm_data_two = Data(x=x_s, edge_index=edge_index_two)

shared_fm_data_list = [shared_fm_data, shared_fm_data_one, shared_fm_data_two]
shared_fm_loader = CustomDataLoader(shared_fm_data_list, batch_size=2)
```

Output when iterating over `shared_fm_loader`:

```
DataBatch(x=[5, 16], edge_index=[2, 7], batch=[7])
tensor([[2, 3, 1, 0, 0, 4, 0],
        [1, 2, 3, 4, 1, 1, 3]])
DataBatch(x=[5, 16], edge_index=[2, 5], batch=[5])
tensor([[0, 1, 2, 4, 0],
        [1, 1, 3, 0, 3]])
```
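To illustrate how a model might consume such a batch, here is a sketch of a single aggregation step over the shared node set (the layer name is made up, and note that because nodes are shared, edges from different graphs in the batch would mix messages on the same nodes unless the per-edge `batch` vector is used to mask them apart):

```python
import torch
import torch.nn as nn

class SharedXAggregation(nn.Module):
    """Hypothetical layer: one linear transform of the shared feature
    matrix, then sum-aggregation of messages along the concatenated
    edge_index of the whole batch."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, edge_index):
        h = self.lin(x)                 # transform shared features once
        src, dst = edge_index           # messages flow src -> dst
        out = torch.zeros_like(h)
        out.index_add_(0, dst, h[src])  # sum incoming messages per node
        return out
```

Because `x` is transformed exactly once regardless of how many graphs are in the batch, the memory saving from the custom collation carries through to the forward pass.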
Thank you in advance for your help!