Flash Attention-Inspired Queuing for Ultra-Low Latency Communication Networks

We present a queuing subsystem that adapts FlashAttention ideas to message middleware: (i) FlashQueue, a priority queue with an optional async event loop; and (ii) MemoryMappedFlashQueue, which adds a “hot” SRAM-like buffer in front of a cold, HBM-like backing priority queue. We report latency, cache-hit ratio, and throughput under synthetic workloads, showing predictable wins from hot-buffer admission control and async event loops.
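
As a rough illustration of the two structures named in the abstract, here is a minimal Python sketch. The class names follow the abstract, but the method names, hot-buffer capacity, spill policy, and the asyncio drain loop are assumptions made for illustration, not the paper's actual API.

```python
# Minimal sketch of the two queue designs described above. Assumptions: the method
# names, hot-buffer capacity, spill policy, and the asyncio drain loop are
# illustrative, not the paper's actual interface.
import asyncio
import heapq


class FlashQueue:
    """Priority queue; lower priority values are served first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps FIFO order among equal priorities

    def put(self, priority, item):
        heapq.heappush(self._heap, (priority, self._counter, item))
        self._counter += 1

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


class MemoryMappedFlashQueue(FlashQueue):
    """Adds a small 'hot' buffer (SRAM-like) in front of the cold heap (HBM-like)."""

    def __init__(self, hot_capacity=64):
        super().__init__()
        self._hot = {}  # admission-controlled fast path
        self._hot_capacity = hot_capacity

    def put(self, priority, item):
        if len(self._hot) < self._hot_capacity:
            self._hot[self._counter] = (priority, item)  # admit to hot buffer
            self._counter += 1
        else:
            super().put(priority, item)  # spill to the cold backing heap

    def get(self):
        if self._hot:  # serve hot entries first (lowest priority value wins)
            key = min(self._hot, key=lambda k: self._hot[k][0])
            return self._hot.pop(key)[1]
        return super().get()

    def __len__(self):
        return len(self._hot) + super().__len__()


async def drain(queue):
    """Toy async event loop: drain the queue while yielding to other tasks."""
    while len(queue):
        print(queue.get())
        await asyncio.sleep(0)


if __name__ == "__main__":
    q = MemoryMappedFlashQueue(hot_capacity=2)
    for prio, msg in [(2, "b"), (0, "a"), (1, "c")]:
        q.put(prio, msg)
    asyncio.run(drain(q))
```

Note that serving the hot buffer before the cold heap trades strict global priority order for hot-path latency; that trade-off is the kind of admission-control effect the abstract reports measurements for.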

Exploring Frequency-Inspired Optimization in Transformer for …

Transformer-based methods have exhibited remarkable potential in single image super-resolution (SISR) by effectively extracting long-range …

arxiv.org

Optimizing Low-Latency Applications with Swift Packet Queuing – arXiv

Inspired by this, we present SwiftQueue, a new L4S queue-selection system driven by a custom Transformer for per-packet latency prediction.

arxiv.org

[PDF] Transformer-Based Wireless Traffic Prediction and Network … – arXiv

Abstract—This paper introduces an innovative method for predicting wireless network traffic in concise temporal intervals.

arxiv.org

Reducing Vision Transformer Latency on Edge Devices via GPU …

This paper investigates how to efficiently deploy transformer-based neural networks on edge devices. Recent methods reduce the latency of …

arxiv.org

Communication-Efficient Multi-Device Inference Acceleration … – arXiv

We propose Astra, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism …

arxiv.org

Decision Transformers for Wireless Communications: A New … – arXiv

In this paper, we adopt an alternative AI technology, namely, Decision Transformer (DT), and propose a DT-based adaptive decision architecture for wireless …

arxiv.org

JPPO++: Joint Power and Denoising-inspired Prompt Optimization …

We propose Joint Prompt and Power Optimization (JPPO), a framework that jointly optimizes prompt compression and wireless transmission power for mobile LLM …

arxiv.org

Quantized Spike-driven Transformer – arXiv

Optimized potential initialization for low-latency spiking neural networks. Proceedings of the AAAI Conference on Artificial Intelligence …

arxiv.org

[PDF] Meta-Learning Inspired Transformer Selection for Green Semantic …

This evolution promises significant benefits, including reduced latency, lower bandwidth usage, and higher throughput compared to traditional …

arxiv.org

Vision Transformers on the Edge: A Comprehensive Survey … – arXiv

We systematically categorize and analyze the latest advancements in pruning, quantization, knowledge distillation, and hardware-aware optimizations. Furthermore …

arxiv.org

Is Flash Attention Stable? – arXiv

Flash Attention is a widely-adopted technique used to speed up the attention mechanism, often considered a system bottleneck in transformer …

arxiv.org

Introduction to Flash Attention: A Breakthrough in Efficient … – Medium

Flash Attention marks a significant advancement in attention mechanisms, addressing efficiency concerns and enabling faster and more memory-efficient training …

medium.com

Flash-Attention-Enhanced Multi-Agent Deep Deterministic Policy …

To improve performance in a MEC scenario, this paper proposes a Flash-Attention-enhanced MADDPG algorithm (FA-MADDPG) for decision making, and its time …

mdpi.com

An end-to-end attention-based approach for learning on graphs

GraphGPS natively supports Flash attention, while Graphormer requires specific modifications to the attention matrix that are not currently …

nature.com

[PDF] Is Flash Attention Stable? – arXiv

We find that Flash Attention sees roughly an order of magnitude more numeric deviation as compared to Baseline Attention at BF16 when measured …

arxiv.org

Jagged Flash Attention Optimization | Shaped Blog

By combining jagged tensors with flash attention, this innovation achieves up to 9× speedup and 22× memory reduction compared to dense attention …

shaped.ai

64. Breaking the Attention Barrier: A Deep Dive into Scaling LLM …

Flash Attention is an algorithm designed to address the memory and computational bottlenecks associated with attention mechanisms in large …

machinelearningatscale.substack.com

FlashAttention: Fast and Memory-Efficient Exact Attention with IO …

It would be interesting to see a roofline plot to demonstrate the compute-bound and memory-access trade-off with and without flash-attention.

openreview.net

Rethinking Dynamic Networks and Heterogeneous Computing with …

A classic example is Flash Attention[8], which combines originally independent operations such as matmul, dropout, softmax, and mask into a …

dl.acm.org

Flash Attention with CUDA. Introduction | by Damien J | Medium

Flash Attention, as the name suggests, brings a fast and memory-efficient solution to attention mechanisms. It addresses some of the …

medium.com
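
Several of the snippets above describe the core Flash Attention idea: computing attention blockwise with an online softmax so the full N×N score matrix is never materialized, with the surrounding operations fused into one pass. Below is a minimal NumPy sketch of that online-softmax accumulation; the shapes, block size, and the absence of masking/dropout are simplifications, and real implementations fuse these steps into a single GPU kernel.

```python
# Illustrative only: blockwise attention with an online softmax, the numerical core
# of Flash Attention. The full (n, n) score matrix is never formed; each K/V tile is
# processed once and running max/sum statistics rescale earlier partial results.
import numpy as np


def blockwise_attention(Q, K, V, block=128):
    n, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row

    for start in range(0, n, block):
        k_tile = K[start:start + block]
        v_tile = V[start:start + block]
        scores = Q @ k_tile.T / np.sqrt(d)            # (n, block) tile of scores

        new_max = np.maximum(row_max, scores.max(axis=1))
        rescale = np.exp(row_max - new_max)           # correct earlier accumulators
        weights = np.exp(scores - new_max[:, None])   # unnormalized tile weights

        out = out * rescale[:, None] + weights @ v_tile
        row_sum = row_sum * rescale + weights.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
    scores = Q @ K.T / np.sqrt(64)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    reference = (weights / weights.sum(axis=1, keepdims=True)) @ V
    print(np.allclose(blockwise_attention(Q, K, V), reference, atol=1e-6))  # True
```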

Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony

This paper introduces Asynchronous Expert Parallelism (AEP), a new paradigm that decouples layer execution from barrier-style synchronization.

arxiv.org

HierMoE: Accelerating MoE Training with Hierarchical Token … – arXiv

The mixture-of-experts (MoE) architecture with sparse activation has gained significant research interest in large language models (LLMs) [1, 2, …

arxiv.org

X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts …

Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained …

arxiv.org

[PDF] X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts …

X-MoE accomplishes this via a combination of techniques, such as padding-free MoE training with cross-platform kernels for improved memory and …

arxiv.org

[PDF] Lancet: Accelerating Mixture-of-Experts Training via Whole Graph …

The Mixture-of-Expert (MoE) technique plays a crucial role in expanding the size of DNN model parameters.

arxiv.org

FSMoE: A Flexible and Scalable Training System for Sparse Mixture …

As the experts are distributed across multiple devices, the dispatch operation uses a collective communication technique called AlltoAll …

arxiv.org
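
The FSMoE snippet above mentions the AlltoAll dispatch that routes each token to the device hosting its chosen expert. Below is a toy, single-process illustration of that dispatch step; the random top-1 gate, contiguous expert placement, and list-based "devices" are stand-ins for what a real system does with learned gating and NCCL/MPI collectives.

```python
# Toy illustration of MoE token dispatch. Assumptions: random top-1 gating, experts
# placed in contiguous blocks per device, and Python lists standing in for the
# AlltoAll collective that a real training system would issue across GPUs.
import numpy as np

num_devices, tokens_per_device, num_experts, dim = 2, 4, 4, 8
experts_per_device = num_experts // num_devices
rng = np.random.default_rng(0)

# Each device holds a local batch of tokens and a top-1 expert choice per token.
tokens = [rng.standard_normal((tokens_per_device, dim)) for _ in range(num_devices)]
choices = [rng.integers(0, num_experts, tokens_per_device) for _ in range(num_devices)]

# Dispatch ("AlltoAll"): every token travels to the device that owns its expert.
inbox = [[] for _ in range(num_devices)]
for src in range(num_devices):
    for token, expert in zip(tokens[src], choices[src]):
        dst = expert // experts_per_device            # which device hosts this expert
        inbox[dst].append((src, int(expert), token))  # keep src so outputs can be sent back

for dst, received in enumerate(inbox):
    print(f"device {dst}: {len(received)} tokens routed to its local experts")
# After expert computation, a second AlltoAll (the "combine") returns outputs to src.
```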

FlashDMoE: Fast Distributed MoE in a Single Kernel – arXiv

This work introduces FlashDMoE, the first system to fuse the entire Mixture-of-Experts (MoE) operator into a single, persistent GPU kernel. We …

arxiv.org

Consistent and Efficient Tensor Programming with Eager-Mode SPMD

… dispatch overhead per operator is intolerable, particularly in models containing Mixture of Experts (MoE). Because thousands of lightweight …

arxiv.org

Middleware for LLMs: Tools Are Instrumental for Language Agents …

In particular, Mixtral represents an advanced mixture-of-experts model that has demonstrated superior performance and even surpasses GPT-3.5-turbo on Chatbot …

arxiv.org

1 Introduction – arXiv

The Mixture-of-Expert (MoE) technique plays a crucial role in expanding the size of DNN model parameters. However, it faces the challenge of …

arxiv.org

Communication Efficient Parallel MoE Inference with Speculative …

Speculative MoE has two speculative parallelization schemes, speculative token shuffling and speculative expert grouping, which predict outstanding tokens’ …

arxiv.org

Speculative Decoding and Beyond: An In-Depth Review of … – arXiv

This survey presents a comprehensive taxonomy of generation-refinement frameworks, analyzing methods across autoregressive sequence tasks.

arxiv.org

Communication-Efficient Collaborative LLM Inference via Distributed …

Abstract:Speculative decoding is an emerging technique that accelerates large language model (LLM) inference by allowing a smaller draft …

arxiv.org

[1801.01203] Spectre Attacks: Exploiting Speculative Execution – arXiv

Spectre attacks exploit speculative execution, inducing a victim to perform operations that leak confidential information via side channels.

arxiv.org

[PDF] Speculative Decoding and Beyond: An In-Depth Survey of Techniques

Speculative decoding (SD) uses a two-phase process: a draft model predicts multiple tokens in parallel, followed by verification using the …

arxiv.org
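
The survey snippet above summarizes the two-phase draft/verify loop. A toy Python version of that loop is sketched below; the draft_model and target_model functions are invented stand-ins (a real setup would call a small draft LLM and the full target LLM), and acceptance uses the simple greedy-agreement rule rather than the full rejection-sampling scheme.

```python
# Toy sketch of speculative decoding's draft-then-verify loop. The "models" below are
# invented stand-ins, and acceptance uses the greedy rule: keep draft tokens until the
# first one the target model itself would not have produced, then emit one target token.
def draft_model(prefix, k):
    # hypothetical fast drafter: usually right, deliberately wrong on multiples of 5
    out = []
    for i in range(k):
        nxt = (prefix[-1] + i + 1) % 50
        out.append(0 if nxt % 5 == 0 else nxt)
    return out


def target_model(prefix):
    # hypothetical target model: deterministic next token for this prefix
    return (prefix[-1] + 1) % 50


def speculative_decode(prompt, new_tokens=10, k=4):
    seq = list(prompt)
    while len(seq) < len(prompt) + new_tokens:
        draft = draft_model(seq, k)
        accepted = []
        for token in draft:                      # verify drafts left to right
            if target_model(seq + accepted) == token:
                accepted.append(token)           # target agrees: keep the draft token
            else:
                break                            # first disagreement ends acceptance
        accepted.append(target_model(seq + accepted))  # always emit one verified token
        seq.extend(accepted)
    return seq


print(speculative_decode([7]))
```

When the drafter agrees, several tokens are committed per target-model step, which is where the speedup comes from; a disagreement costs only the rejected suffix.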

A Survey of Speculative Execution in Large Language Models – arXiv

We present the very first survey paper that reviews and unifies the literature on speculative execution in LLMs (e.g., blockwise parallel decoding, speculative …

arxiv.org

Speeding up Speculative Decoding via Sequential Approximate …

Speculative Decoding (SD) is a recently proposed technique for faster inference using Large Language Models (LLMs).

arxiv.org

[PDF] On the Correctness of Speculative Consensus – arXiv

Consensus protocols allow changes supported by the majority. The Proof-of-Execution (PoE) protocol uses speculative execution to minimize …

arxiv.org

A Speculative LLM Decoding Framework for Efficient Edge Serving

This position paper introduces a new framework that leverages speculative decoding, previously viewed primarily as a decoding acceleration …

arxiv.org

Collaborative Speculative Inference for Efficient LLM Inference Serving

Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, …

arxiv.org

Ring Attention with Blockwise Transformers for Near-Infinite Context

Our proposed approach Ring Attention allows training up to device count times longer sequence than baselines and enables the training of sequences that exceed …

arxiv.org
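
The Ring Attention entries in this last group describe splitting the sequence across devices and rotating key/value blocks around a ring, so each device only ever holds one block at a time. Here is a single-process NumPy simulation of that schedule; the list-of-arrays "devices" and the unstabilized softmax accumulation are simplifications (real implementations overlap the exchange with compute and use the running-max rescaling shown in the blockwise-attention sketch earlier).

```python
# Single-process simulation of the Ring Attention schedule: the sequence is split
# across "devices" (plain Python lists here), each device keeps its query block, and
# K/V blocks are rotated around the ring so no device ever holds the full sequence.
# For brevity this skips the running-max rescaling and the compute/communication
# overlap that real implementations rely on.
import numpy as np

devices, seq_per_dev, d = 4, 128, 64
rng = np.random.default_rng(0)
Q = [rng.standard_normal((seq_per_dev, d)) for _ in range(devices)]
K = [rng.standard_normal((seq_per_dev, d)) for _ in range(devices)]
V = [rng.standard_normal((seq_per_dev, d)) for _ in range(devices)]

numer = [np.zeros((seq_per_dev, d)) for _ in range(devices)]  # unnormalized outputs
denom = [np.zeros(seq_per_dev) for _ in range(devices)]       # softmax denominators
k_blk, v_blk = list(K), list(V)                               # resident K/V block per device

for _ in range(devices):   # after `devices` steps, every K/V block has visited every device
    for dev in range(devices):
        scores = Q[dev] @ k_blk[dev].T / np.sqrt(d)
        w = np.exp(scores)
        numer[dev] += w @ v_blk[dev]
        denom[dev] += w.sum(axis=1)
    k_blk = k_blk[-1:] + k_blk[:-1]   # "send" each K/V block to the next device in the ring
    v_blk = v_blk[-1:] + v_blk[:-1]

out = np.concatenate([numer[i] / denom[i][:, None] for i in range(devices)])

# Sanity check against ordinary attention over the full concatenated sequence.
Qf, Kf, Vf = (np.concatenate(x) for x in (Q, K, V))
scores = Qf @ Kf.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ Vf
print(np.allclose(out, reference, atol=1e-5))  # True
```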

[PDF] Ring Attention with Blockwise Transformers for Near-Infinite Context.

Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a …

arxiv.org

[PDF] Striped Attention: Faster Ring Attention for Causal Transformers – arXiv

We propose Striped Attention, a variant of Ring Attention which permutes the input sequence in a way which almost entirely eliminates the …

arxiv.org

Distributed Memory-efficient Attention for Long-context LLMs Training

In this paper, we introduce DistFlashAttn, a distributed memory-efficient attention mechanism optimized for long-context LLMs training.

arxiv.org

Communication Efficient Distributed Self-Attention Mechanism – arXiv

The original Ring Attention’s [20] block distribution caused load imbalances when applying causal attention. …

arxiv.org

TokenRing: An Efficient Parallelism Framework for Infinite-Context …

TokenRing addresses a critical challenge in distributed systems—such as in Ring Attention—where communication and computation cannot be …

arxiv.org

LV-XAttn: Distributed Cross-Attention for Long Visual Inputs … – arXiv

Figure 2 shows that cross-attention operations distributed with Ring Attention (Liu et al., 2024a) can account for up to 87% of the …

arxiv.org

Star Attention: Efficient LLM Inference over Long Sequences – arXiv

Among these, only Ring Attention is a distributed algorithm designed to scale inference across multiple GPUs. Since Star Attention also targets distributed …

arxiv.org

LASP-2: Rethinking Sequence Parallelism for Linear Attention and …

In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models.

arxiv.org

Increasing Transformer Context Length with Sparse Graph … – arXiv

Ring attention achieves sequence parallelism …

arxiv.org