Hamed Shirzad
@hamedshirzad.bsky.social
PhD student in Computer Science at UBC | Exploring Machine Learning on Graphs

https://www.hamedshirzad.com/
Enjoyed giving our tutorial on Geometric & Topological Deep Learning at IEEE MLSP 2025 alongside @semihcanturk.bsky.social. Loving the Istanbul vibes and the amazing food here! ✅
August 31, 2025 at 9:02 PM
How much do nodes attend to graph edges, versus expander edges or self-loops?

On the Photo dataset (homophilic), attention mainly comes from graph edges. On the Actor dataset (heterophilic), self-loops and expander edges play a major role.
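For concreteness, a minimal sketch of the measurement (PyTorch; tensor names are my own, not the paper's code): sum the attention weight each edge type receives in a layer.

```python
import torch

def attention_mass_by_type(att, edge_type, num_types=3):
    """att: [num_edges] attention weights for one layer/head;
    edge_type: [num_edges] int64 labels (0=graph edge, 1=expander edge, 2=self-loop)."""
    mass = torch.zeros(num_types).scatter_add_(0, edge_type, att)
    return mass / mass.sum()   # fraction of attention mass per edge type
```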
December 12, 2024 at 12:32 AM
Q. Is selecting the top few attention scores effective?

A. The top-k scores rarely account for most of a node's total attention, unless the graph has a very small average degree. Results are consistent for both dim=4 and dim=64.
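To make "cover the attention sum" concrete, a rough sketch (illustrative names, not the exact evaluation code): for each destination node, the fraction of its total incoming attention captured by its top-k scores.

```python
import torch

def topk_coverage(att, dst, num_nodes, k=5):
    """att: [num_edges] attention weights; dst: [num_edges] destination node ids."""
    total = torch.zeros(num_nodes).scatter_add_(0, dst, att)
    cover = torch.zeros(num_nodes)
    for v in range(num_nodes):
        scores = att[dst == v]
        cover[v] = scores.topk(min(k, scores.numel())).values.sum()
    return cover / total.clamp_min(1e-12)   # per-node fraction covered by the top-k scores
```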
December 12, 2024 at 12:31 AM
Q. How similar are attention scores across layers?

A. In all experiments, the first layer's attention scores differed significantly from those of the other layers, while the remaining layers were very consistent with one another.
December 12, 2024 at 12:31 AM
Q. How do attention scores change across layers?

A. The first layer consistently shows much higher entropy (more uniform attention across nodes), while deeper layers have sharper attention scores.
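The entropy here is the Shannon entropy of each node's attention distribution, averaged over nodes; a hedged sketch with assumed tensor names:

```python
import torch

def mean_attention_entropy(att, dst, num_nodes):
    """att: [num_edges] attention weights, already normalized per destination node;
    dst: [num_edges] destination node ids."""
    ent = torch.zeros(num_nodes).scatter_add_(0, dst, -att * att.clamp_min(1e-12).log())
    return ent.mean()   # higher = more uniform attention
```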
December 12, 2024 at 12:30 AM
We trained 100 single-head Transformers (attention masked to graph edges, with and without expander edges + self-loops) on Photo & Actor, with hidden dims from 4 to 64.

Q. Are attention scores consistent across widths?

A. The distributions of where a node attends are pretty consistent.
December 12, 2024 at 12:30 AM
🚨 Come chat with us at NeurIPS next week! 🚨
🗓️ Thursday, Dec 12
⏰ 11:00 AM–2:00 PM PST
📍 East Exhibit Hall A-C, Poster #3010
📄 Paper: arxiv.org/abs/2411.16278
💻 Code: github.com/hamed1375/Sp...
See you there! 🙌✨
[13/13]
December 5, 2024 at 8:20 PM
Downsampling the edges and using regular-degree computations can make the model even faster and more memory-efficient than a GCN!
December 5, 2024 at 8:19 PM
But we can scale to graphs Exphormer couldn’t even dream of:
December 5, 2024 at 8:19 PM
How much accuracy do we lose compared to an Exphormer with many more edges (and way more memory usage)? Not much.
December 5, 2024 at 8:18 PM
Now, with sparse (meaningful) edges, k-hop sampling is feasible again even across several layers. Memory and runtime can be traded off by choosing how many “core nodes” we expand from.
[7/13]
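Roughly, the batching step works like this (a sketch under my own naming, not the repo's exact API): pick a set of core nodes, then expand one hop per layer over the already-sparsified edge lists.

```python
import torch

def khop_batch(edge_index_per_layer, core_nodes, num_nodes):
    """edge_index_per_layer: list of [2, E_l] tensors, ordered from last layer to first."""
    needed = torch.zeros(num_nodes, dtype=torch.bool)
    needed[core_nodes] = True                  # start from the sampled core nodes
    kept = []
    for edge_index in edge_index_per_layer:
        src, dst = edge_index
        mask = needed[dst]                     # edges feeding nodes still needed at this layer
        kept.append(edge_index[:, mask])
        needed[src[mask]] = True               # their sources are needed one layer earlier
    return kept[::-1]                          # first-layer edges first
```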
December 5, 2024 at 8:18 PM
By sampling a regular degree, graph computations are much more efficient (simple batched matmul instead of needing a scatter operation). Naive implementations of sampling can also be really slow, but reservoir sampling makes resampling edges per epoch no big deal.
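To illustrate why a fixed sampled degree helps (a toy example with made-up shapes, not the actual implementation): neighbor features fit a dense [N, d, h] tensor, so aggregation becomes a plain batched matmul.

```python
import torch

N, d, h = 1000, 8, 64                         # nodes, sampled degree, hidden dim
nbr_idx = torch.randint(0, N, (N, d))         # d sampled neighbors per node
x = torch.randn(N, h)                         # node features
att = torch.softmax(torch.randn(N, d), -1)    # attention over each node's d neighbors

nbr_feats = x[nbr_idx]                                    # [N, d, h] dense gather
out = torch.bmm(att.unsqueeze(1), nbr_feats).squeeze(1)   # [N, h] weighted sum, no scatter
```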
December 5, 2024 at 8:17 PM
Now, we extract the attention scores from the initial network, and use them to sample a sparse attention graph for a bigger model. Attention scores vary on each layer, but no problem: we sample neighbors per layer. Memory usage plummets!
[5/13]
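A hedged sketch of the per-layer sampling step (tensor names are assumptions): draw a fixed number of neighbors per node, with probability proportional to the small model's attention for that layer.

```python
import torch

def sample_neighbors(att_scores, k):
    """att_scores: [num_nodes, max_degree] attention from the small proxy model,
    zero where there is no candidate edge. Returns [num_nodes, k] sampled neighbor slots."""
    weights = att_scores + 1e-9                # tiny smoothing so every row can be sampled
    return torch.multinomial(weights, k, replacement=True)
```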
December 5, 2024 at 8:16 PM
But not all the edges matter – if we know which won’t be used, we can just drop them and get a sparser graph/smaller k-hop neighborhoods. It turns out a small network (same arch, tiny hidden dim, minor tweaks) can be a really good proxy for which edges will matter!
[4/13]
December 5, 2024 at 8:16 PM
For very large graphs, though, even very simple GNNs need batching. One way is k-hop neighborhood selection, but expander graphs are specifically designed so that k-hop neighborhoods grow quickly. Other batching approaches can drop important edges and kill the advantages of GTs.
[3/13]
December 5, 2024 at 8:15 PM
Our previous work, Exphormer, uses expander graphs to avoid the quadratic complexity of full graph Transformers (GTs).
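For background, one standard way to build such an overlay is a random d-regular graph, which is an expander with high probability; this is purely illustrative, not necessarily the exact construction in the codebase.

```python
import networkx as nx
import torch

def expander_edges(num_nodes, degree=4, seed=0):
    """Random d-regular graph used as an (almost surely) expander overlay."""
    g = nx.random_regular_graph(degree, num_nodes, seed=seed)
    return torch.tensor(list(g.edges)).t()     # [2, num_edges] edge index
```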
December 5, 2024 at 8:13 PM