Profiling PyTorch: nn.Linear vs. GeGLU MLP

PyTorch’s nn.Linear layer stands out because it typically fuses matrix multiplication and bias addition into a single GPU kernel. The bias addition is folded into the GEMM kernel’s epilogue, so the output matrix is written once instead of being written, re-read, and written again by a separate bias kernel. That saves both memory traffic and a kernel launch.

Contrast that with a three-layer MLP using GeGLU activations. Here, separate GPU kernels launch for each linear projection and each pointwise activation. Profiling uncovers repeated cuBLAS occupancy queries, one per linear step, which reveals the overhead of dispatching many smaller kernels instead of one fused operation. torch.compile doesn’t speed up a single nn.Linear call, since the underlying kernel is already optimized, but it does trim CPU overhead by removing unnecessary stride-only transpose operations on tensors. These profiling results show why nn.Linear’s kernel fusion is a useful performance baseline in PyTorch.

How nn.Linear Optimizes GPU Workload

nn.Linear’s speed comes from fusing its work into one GPU kernel. Instead of launching separate kernels for matrix multiplication and bias addition, nn.Linear folds the bias addition into the GEMM kernel’s epilogue stage, the step that runs just before results are written back to memory. This cuts both memory traffic and kernel launch overhead.
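As a minimal sketch of the fusion described above, the snippet below compares nn.Linear against two formulations of the same math: a matmul followed by a separate bias add, and a single torch.addmm call that computes the bias add and the matrix product together. The layer sizes and variable names are illustrative assumptions, and the example runs on CPU only to check numerical equivalence; it does not show which GPU kernel is selected.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the profiled workload).
layer = nn.Linear(64, 32)
x = torch.randn(8, 64)

with torch.no_grad():
    # Reference: the layer as users normally call it.
    out = layer(x)

    # Unfused formulation: one op for the matmul, a second op for the bias add.
    unfused = x @ layer.weight.t() + layer.bias

    # Fused formulation: addmm computes bias + x @ W^T in a single call.
    fused = torch.addmm(layer.bias, x, layer.weight.t())

print(torch.allclose(out, fused, atol=1e-5), torch.allclose(out, unfused, atol=1e-5))
```

All three results agree numerically; the difference between them is how many separate operations (and, on a GPU, how many kernel launches) are needed to produce the output.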

Profiling shows that this fused kernel uses GPU resources efficiently and avoids the multiple dispatches of multi-kernel setups. Applying torch.compile to a single nn.Linear call doesn’t speed up execution, because the kernel is already near-optimal. It does, however, reduce CPU overhead by removing unnecessary stride-only transpose operations on tensors, which helps in larger workloads where host-side overhead adds up.

Contrast this with a three-layer MLP using GeGLU activations. Each linear projection and each activation runs as its own kernel, which multiplies dispatch overhead and fragments GPU occupancy. The profiler reveals a cuBLAS occupancy query for every linear projection, a sign of the repeated scheduling work.

Understanding how nn.Linear fuses these steps clarifies why it outperforms more complex multi-kernel MLPs. This fusion approach minimizes memory movement and kernel launches, key bottlenecks in GPU workload management. It’s a practical lesson in how kernel fusion can unlock smoother, faster deep learning computations on GPUs.
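The memory-movement argument can be made concrete with a rough back-of-the-envelope estimate. The helper below is a hypothetical illustration, not a measurement: it counts only the output-side bytes for an M x N result and assumes the unfused path writes the GEMM output, then reads it back along with the bias, then writes it again. Reads of the input and weight matrices are the same in both cases and are left out.

```python
def bias_add_traffic_bytes(m, n, bytes_per_elem=2):
    """Estimate output-side memory traffic for fused vs. unfused bias add.

    Hypothetical model: ignores caches and the (identical) input/weight reads.
    Returns (fused_bytes, unfused_bytes).
    """
    out_bytes = m * n * bytes_per_elem
    bias_bytes = n * bytes_per_elem

    # Fused: read the bias in the epilogue, write the output once.
    fused = bias_bytes + out_bytes

    # Unfused: GEMM writes the output; a second kernel reads the output
    # and the bias, then writes the output again.
    unfused = out_bytes + (out_bytes + bias_bytes + out_bytes)
    return fused, unfused


# Example: a 4096 x 4096 half-precision output (sizes are assumptions).
fused, unfused = bias_add_traffic_bytes(4096, 4096)
print(f"fused: {fused / 1e6:.1f} MB, unfused: {unfused / 1e6:.1f} MB")
```

Under this simplified model the unfused path moves roughly three times as many output-side bytes, which is the kind of saving the epilogue fusion targets.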

Breaking Down the MLP with GeGLU Activation

Multi-layer perceptrons (MLPs) with GeGLU activation break down differently on the GPU than a single nn.Linear layer. Instead of one fused operation, these MLPs run several distinct kernels. Each linear projection, essentially a matrix multiplication, launches separately. It is followed by pointwise work for the GeGLU activation, which splits its input into a gate half and a value half and multiplies the GELU of the gate by the value. This fragmentation means more GPU kernel launches and added overhead.

The GeGLU activation multiplies two linear transformations of the same input element-wise, one of them passed through GELU first. That doubles the number of matrix multiplies in the input projection compared to a standard MLP layer. Unlike nn.Linear, which fuses matrix multiplication and bias addition into a single efficient kernel, GeGLU’s multi-step process is not fused in eager execution. Each step generates intermediate tensors, increasing memory traffic and latency.
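A minimal sketch of the GeGLU computation described above, written with two separate projections so that each step is visible. The class name, attribute names, and sizes are assumptions chosen for illustration; real implementations vary in layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeGLU(nn.Module):
    """Illustrative GeGLU block: GELU(gate(x)) * value(x)."""

    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_in, d_hidden)   # first matrix multiply
        self.value = nn.Linear(d_in, d_hidden)  # second matrix multiply

    def forward(self, x):
        gate = self.gate(x)        # intermediate tensor 1
        value = self.value(x)      # intermediate tensor 2
        activated = F.gelu(gate)   # pointwise op, intermediate tensor 3
        return activated * value   # pointwise gating multiply


torch.manual_seed(0)
block = GeGLU(d_in=64, d_hidden=128)
x = torch.randn(8, 64)
y = block(x)
print(y.shape)
```

In eager mode each commented step is dispatched as its own operation, which is where the extra intermediate tensors come from.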

Profilers show that MLPs with GeGLU produce multiple cuBLAS occupancy queries, one per linear projection, which points to repeated kernel dispatches. This contrasts sharply with nn.Linear’s single, highly optimized kernel call. This kernel-level breakdown explains why GeGLU-based MLPs spend more on dispatch overhead and memory traffic than a single nn.Linear, despite their theoretical expressiveness.
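One way to observe the per-projection dispatches is with torch.profiler. The sketch below counts how many times the aten::linear operator is recorded for a single nn.Linear versus a hypothetical three-projection GeGLU MLP (gate, value, and output projections; this layout is an assumption, since the exact architecture is not specified here). It profiles on CPU as a proxy for operator dispatch counts. On a GPU, ProfilerActivity.CUDA can be added to the activities list to see the actual kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity


class GeGLUMLP(nn.Module):
    """Hypothetical MLP with three linear projections and a GeGLU gate."""

    def __init__(self, d_model=64, d_hidden=128):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden)
        self.value = nn.Linear(d_model, d_hidden)
        self.out = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.out(F.gelu(self.gate(x)) * self.value(x))


def count_op(module, x, op_name="aten::linear"):
    """Run one forward pass under the profiler and count calls to op_name."""
    with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
        module(x)
    return sum(e.count for e in prof.key_averages() if e.key == op_name)


x = torch.randn(8, 64)
single_calls = count_op(nn.Linear(64, 64), x)
mlp_calls = count_op(GeGLUMLP(), x)
print(f"nn.Linear: {single_calls} linear call(s); GeGLU MLP: {mlp_calls}")
```

Printing prof.key_averages().table() inside the helper would also list the pointwise operators (GELU and the multiply) that sit between the projections.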

What These Profiling Insights Mean for Performance

These profiling insights clarify why nn.Linear outpaces more complex MLPs using GeGLU activations. By fusing matrix multiplication and bias addition into one kernel, nn.Linear slashes memory traffic and cuts down GPU dispatch overhead. For developers, this means simpler layers can be highly efficient if implemented with kernel fusion in mind.

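On the flip side, multi-kernel MLPs with GeGLU suffer from fragmented GPU workloads. Each linear projection and activation spawns separate kernels, increasing latency and underutilizing GPU resources. This fragmentation complicates optimization efforts and limits the gains from just-in-time compilation tools like torch.compile. The compiler can typically fuse the pointwise steps (the GELU and the gating multiply) into one kernel, but it generally still leaves each matrix multiplication as a separate kernel rather than merging the whole block into one.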

For practitioners, the takeaway is clear: when building or tuning models, understanding kernel fusion’s role is crucial. Complex architectures may demand custom fusion strategies or kernel redesign to approach nn.Linear’s efficiency. Without this, performance bottlenecks persist despite advances in compiler tech.
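As one example of what a custom fusion strategy might look like (an illustration, not a technique taken from the profiling results above), the gate and value projections of a GeGLU block can be merged into a single wider nn.Linear by concatenating their weights. One larger GEMM then replaces two smaller ones, and the result is split afterwards. The helper names below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def merge_projections(gate, value):
    """Build one nn.Linear whose output is [gate(x), value(x)] concatenated."""
    merged = nn.Linear(gate.in_features, gate.out_features + value.out_features)
    with torch.no_grad():
        merged.weight.copy_(torch.cat([gate.weight, value.weight], dim=0))
        merged.bias.copy_(torch.cat([gate.bias, value.bias], dim=0))
    return merged


def geglu_separate(x, gate, value):
    # Two matrix multiplies, then the pointwise gate.
    return F.gelu(gate(x)) * value(x)


def geglu_merged(x, merged):
    # One wider matrix multiply, split into halves, then the pointwise gate.
    g, v = merged(x).chunk(2, dim=-1)
    return F.gelu(g) * v


torch.manual_seed(0)
gate, value = nn.Linear(64, 128), nn.Linear(64, 128)
merged = merge_projections(gate, value)
x = torch.randn(8, 64)

with torch.no_grad():
    y_separate = geglu_separate(x, gate, value)
    y_merged = geglu_merged(x, merged)

print(torch.allclose(y_separate, y_merged, atol=1e-5))
```

This removes one GEMM dispatch but leaves the pointwise work separate. Whether it pays off on a given GPU and shape is something to confirm by profiling.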

In production settings, these findings affect throughput and cost. Efficient kernel fusion translates directly to faster inference and lower energy consumption. Teams aiming for scalable deployment should prioritize profiling and fusion-aware coding patterns. Otherwise, they risk paying a premium for architectural choices that don’t map well onto GPU hardware.

The gap between fused and multi-kernel execution highlights a persistent challenge in deep learning frameworks: balancing model expressiveness with hardware efficiency. This profiling deep dive offers a practical lens for developers to rethink model design and optimization beyond surface-level code changes.
