Luminal Blog

Hardware Arbitrage with Luminal and Positron AI

Luminal — Mon, 18 May 2026 20:08:58 GMT

Background

Today nearly all inference demand is met with GPUs designed for a wide-variety of workloads. The flexibility of GPUs has been both a great strength, allowing many different models to run on the same chips, as well as a weakness by forcing workloads with very different characteristics to run on the same hardware. Some workloads are inherently memory bound and require large memory bandwidth, like LLM decoding, while others are compute bound and require large amounts of flops, like LLM prefill or diffusion.

Matching hardware to workloads allows us to better balance hardware utilization and achieve better inference performance and economics. However, when moving off of pure GPU inference, a significant difficulty emerges: models need to be rewritten for different accelerators, often times with an entirely new programming model. This process usually requires engineers with deep expertise in a specific architecture, who are often hard to find. Large engineering teams at frontier labs often spend many months porting a single model to a new platform.

This is the main reason we set out to take a compiler-first approach from day 1. By owning the compiler, we can control the entire process, from model definition through optimization and lowering down to exact hardware instructions. This gives us unmatched flexibility in which accelerators we can target, and by leveraging large-scale search we can lean on our compiler to find optimized implementations on a wide variety of hardware architectures.

Why Positron?

When evaluating hardware partners for our first non-GPU compiler backend, Positron fit cleanly into our vision of heterogeneous inference, complementing GPUs and providing a new landscape to explore with our search-based compiler.

They have a clear focus on what matters most to them: memory bandwidth utilization. While other chips offer large bandwidth, it’s often quite hard to access and most inference runs at small fractions of total bandwidth utilization. With a simplified programming model and hardware that made memory accesses explicitly controllable, writing code to access the device’s full bandwidth is much more straightforward than on GPUs. This made them a clear choice for bandwidth-heavy workloads, such as decoding.

Given a Positron Atlas system as a far different programming model compared to a GPU, it provides an enticing target for testing the robustness of the Luminal compiler.

The Software Bottleneck in AI Hardware

To date, only one company has successfully achieved mass-market software adoption for a comprehensive AI accelerator software stack: Nvidia’s CUDA. Despite colossal market incentives, creating a comprehensive mapping from user workloads to complex hardware without unnecessary overheads is hard. The startup graveyard is full of hardware companies underestimating the difficulty of building a software stack both bespoke to their hardware and integrated with the rest of the ML software ecosystem.

In building Luminal we knew if we got the abstractions right, if our search was powerful enough, we could compile user-provided workloads straight from Pytorch to achieve speed-of-light performance across wide varieties of hardware. In doing so, Luminal is the quickest way for a new chip to start running models and achieve far higher performance than any “compatibility layer”. In the long run if we embrace the bitter lesson hard enough it should even outperform any hand-crafted or agent-crafted implementation through the power of large-scale kernel search.

Our luminal_tron backend is our first public demonstration of the flexibility of this technology.

The Positron Atlas Architecture

Atlas is a set of systolic array accelerators arranged in a star-topology, with a host CPU in the middle. Achieving high-performance inference requires constant overlapping of matrix-multiplies, weight loading, and host-side element-wise operations.

A Positron Atlas node

Each Archer device contains a systolic array, SRAM buffers, and 32GB of HBM containing model weights. Compute-heavy operations such as dense matrix multiplies or attention operations are offloaded to the Archer devices, while elementwise operations happen on the host CPU.

Achieving high performance means heavily overlapping operations happening on the devices with those happening on host threads. Doing this requires Luminal to generate a heavily multi-threaded program that uses many CPU cores in parallel to overlap host-side compute and device-side work dispatching. Searching over the space of possible overlapping schedules is key to Luminal automatically finding schedules that maximize hardware utilization.

Inference is fundamentally heterogeneous

When a new prompt arrives at an inference endpoint, the first part of the inference workload is called prefill, requiring the transformer model to ingest the full prompt to generate a KV cache and the first output token. Because the full prompt is supplied right from the start, each token can be processed in parallel. This requires the weights of the model to only be loaded once, no matter how large the prompt is. So memory bandwidth is roughly constant, even on longer prompts (with the exception of attention), while the compute required for prefill scales roughly linearly. This means prefill is usually compute-bound, limited by the amount of compute available. This fits well on standard GPUs, which use powerful tensor cores to deliver petaflops of compute.

On the other hand, once prefill is complete and the first token and KV cache have been generated, the process of generating every subsequent token is called decode and requires loading the entire model’s weights into compute units to get each token out. Since each token relies on all tokens before it, this process is purely sequential, meaning we can’t share weights across tokens in the sequence dimension like we do in prefill. This results in decode being entirely limited by the speed at which we can load the model’s weights into the compute units, also known as bandwidth-bound. It is this regime we target with Positron Atlas.

A disaggregated prefill-decode workload

Between the phases, we transfer the KV cache between the GPU and the Atlas system before decoding can begin.

We chose to compile the prefill phase to a CUDA graph and run on an H100, and compile the same model’s decode phase to Atlas. Even though these hardware platforms work entirely differently under the hood, Luminal compiles the exact same model definition with a single-line code change:

import torch, luminal_cuda, luminal_tron
# Compile for GPU
prefil_model = torch.compile(model, backend=luminal_cuda)
# Compile for Positron
decode_model = torch.compile(model, backend=luminal_tron)

A more technical post discussing results and metrics is coming soon, with precise performance benchmarks achieved on both prefill and decode.

Conclusion

The Luminal compiler allows us to take off-the-shelf models and compile them automatically to a variety hardware backends. We’ve partnered with Positron AI to bring Atlas up as a compiler backend, bringing the full range of Luminal-supported models to Atlas. We also combine Atlas and Blackwell to demonstrate disaggregated inference across heterogeneous hardware.

By targeting hardware that fits workloads tightly, Luminal greatly improves performance and cost of inference. As inference becomes an increasing bottleneck in AI deployment, purpose-built hardware is quickly becoming the norm. In this post we’ve described how Positron’s Atlas, with the Luminal compiler, fits into this world. Luminal, the unified software stack capable of accessing the full performance of each accelerator, is key to unlocking this future.

Producing The Perfect Token

Luminal — Mon, 06 Apr 2026 11:49:17 GMT

Inference is rapidly becoming the primary bottleneck of business, driving many inference clouds to race to provide faster and cheaper tokens.

This race has resulted in the quality of tokens differing massively depending on inference cloud, model, and serving setup. Benchmarks performed by the same model on different clouds can range as much as 20% due to quality issues. This is the gap between useful and useless tokens, so today we’ll go over the factors that affect quality, the economic basis of reliability, and how we engineer our compiler and cloud to deliver only the highest quality artisanal tokens on the market.

When thinking about inference, focus is placed on which model is being served, at what speed, and at what price. After all, if Provider A serves the same model as Provider B, at the same speed and 30% cheaper, why not use Provider A?

Neural networks are supposed to be deterministic calculations, so anyone who can run those calculations cheapest should get all the business. However recent benchmarks ran on Kimi K2 across various providers tell a different story. Significant divergences appear despite the model and benchmarks being held constant. Worse yet, as Thinking Machines documented in their excellent piece on determinism in LLMs, getting the exact same outputs out of an LLM is not possible without a significant performance penalty.

So why is this, and what can we do about it?

First we’ll start off by going over why this matters to inference providers and customers, as well as how it affects real inference workloads today. We’ll then need to dive into the fundamentals of how computers store numbers and the different choices we have when picking representations. Then we’ll go over how these tradeoffs affect real inference and which operations are most sensitive to errors. Finally we'll wrap up by going over how compilers reason about this in general, and how Luminal reasons about this specifically, and how we avoid optimizing a valuable token stream into jibberish.

The money is in the bits

A modern inference cloud isn’t too dissimilar to a steel mill, in that it has inputs and outputs, and aims to produce outputs from inputs at a lower cost than the customer is willing to pay for them. They generally see large advantages in economies of scale, amortizing fixed costs across very large volume.

However like a steel mill, these businesses are constantly under competitive pressure to lower their COGS, giving them either more margin or more pricing power against competitors viewed as selling an identical product. The hyper-competitiveness of the inference game has led to the cost of intelligence decreasing over 390x in the past 3 years alone. In this market clouds are constantly looking for an edge, a way to produce their output (in this case tokens) ever cheaper.

So what is the primary bottleneck on token production for these businesses? In two words: memory bandwidth. Every time an LLM generates a token, all (or a fraction) of it’s weights need to be loaded from memory into a compute unit, which represents the vast majority of the time and energy of the operation, relative to actual computation. If there was a way to decrease the required memory bandwidth for an LLM to produce a token, huge gains could be realized in terms of both speed and cost.

This has set off a race over the past decade to figure out how to shrink models more and more by using less bits per parameter. However, customers have begun realizing the downsides of this trend, seeing large mismatches between reported performance and experienced performance on many inference providers, lead to growing customer skepticism. As we’ll see, this issue is a lot more complex than it seems on the surface.

How do computers represent numbers?

When we think of numbers, we generally think of whole numbers, like 1, 2 or 42, or real decimal numbers like 4.3 or 3.14. But computers are binary machines, representing everything in finite amounts of 1’s and 0’s. So if we wanted to represent a decimal number, like the kinds neural networks operate with, in a computer, what are our options?

IEEE Floating Point Standard

The IEEE 754 standard defines how floating-point numbers are represented and computed in modern hardware. Each number is encoded as three parts: a sign bit, an exponent (which determines dynamic range), and a mantissa (which determines precision).

These bits are interpreted as:

value = (−1)^sign × mantissa × 2^exponent

This can be generally thought of as sign sets the direction, exponent sets the scale, and mantissa sets the detail within that scale.

The FP32 format

The most popular of these datatypes are FP64, FP32 and FP16, using 64, 32 and 16 bits respectively.

Modern narrow datatypes

More recently there’s been a push to invent even more narrow-precision datatypes: BF16 from Google Brain, and more recently FP8 (E3M4, E4M3) and various 4-bit variants (MXFP4 and NVFP4).

The majority of the performance gains shown in more recent generation GPUs stem directly from using lower-precision datatypes, as seen in Nvidia’s gen-to-gen performance chart:

It is advantageous to use datatypes that require less bits because they take pressure off memory bandwidth, fit into caches better, and require fewer transistors to implement mathematical operations in hardware. However as we’ll see there are correctness tradeoffs associated with lower-precision datatypes.

We’ll talk more about TF32 later, as it’s an unusual case.

The precision spectrum

We now see a spectrum of datatypes:

FP32 is the gold standard for precision and correctness (FP64 isn’t widely used in AI). Generally low-performance, high-correctness.
FP16 / BF16 being a generally safe / mature format to use depending on if range or precision are targeted.
FP8 for speed-of-light performance on Hopper-generation (2022 onwards) accelerators with some (manageable) accuracy tradeoffs and no block-scaling complexity.
MXFP4 / NVFP4 for state-of-the-art performance on Blackwell-generation (2025 onwards) accelerators utilizing very low-bit weights for maximum bandwidth efficiency and scaling factors for preserving accuracy.
INT8 is less commonly used in datacenter accelerators but common on edge devices owning to the simplicity of integer arithmetic hardware.

Sources of error

The tradeoff of a low-bit datatype is less representational power since fewer bits means fewer states. Fewer exponent bits shrink dynamic range resulting in more overflows and underflows, while fewer mantissa bits increases rounding errors.

Two additional behaviors also incur a mismatch between represented and real numbers: subnormals and flush-to-zero behavior. In IEEE 754, numbers very close to zero are represented using subnormals. Instead of the usual “1.xxx × 2^e” form, they drop the implicit leading 1 and use “0.xxx × 2^emin”. This allows gradual underflow where values don’t jump straight from the smallest normal number to zero, instead tapering off smoothly. Flush-to-zero is a performance optimization where the system treats all subnormal values as exactly zero, which eliminates the hardware required to correctly handle subnormal values.

These tradeoffs are tricky to track since they depend greatly on not only the datatype and hardware in question, but the exact operation as well, with some operations being much more sensitive to low-bit mismatches than others.

Accumulations are the most common source of errors, putting pressure on numeric precision for long accumulation sequences. While multiplications are generally fine to do in fairly low precision due to errors being finite and bounded, as the length of the accumulation grows, errors build up.

This is a big problem for matrix multiplies, which famously do long accumulation chains as part of their dot-product operation:

In the diagram above, we are taking a dot product of the elements in the shaded areas of the A and B matrices to get the single element shaded area of the C matrix. When the K dimension is large we need to do a long accumulation chain to get to the final result. LLMs have been increasing the K dimension for years, some now as large as 14848 in the case of Falcon 180B. For this reason most accelerators implement accumulators in a higher precision than the multiply units, often times as high as FP32.

Softmax is another common operation for errors to occur due to the exponentiation involved on every element, which increases the magnitude of each element and leads to common overflows, especially on low-exponent datatypes. Techniques like stable softmax involve subtracting the maximum element from all elements first before the standard softmax is applied for this reason.

Normalization, such as LayerNorms found inside most LLMs put high pressure on the precision (mantissa bits) when computing variance. On datatypes with low mantissa bits, rounding errors commonly occur. For this reason variance is often computed in FP32.

Outliers are a very common phenomena where a few elements in the activations dominate the scaling of an operation in a transformer and ruin the effective resolution in INT8 precision. Clipping activations can help eliminate outliers, however clipping fundamentally destroys information, so it also contributes to quality loss.

A note on determinism

AI models are generally made up entirely of linear algebra operations, and since linear algebra is generally thought of as deterministic, it stands to reason we can always get deterministic, reproducible outputs out of our AI models. Unfortunately as Thinking Machines has documented excellently here, that isn’t usually the case. Their post is very detailed and I would highly recommend reading it, but for our purposes I’ll summarize a key cause of nondeterminism as this inequality when dealing with finite-precision floating points:

In modern floating-point hardware, such as GPUs, there are no guarantees made about accumulation ordering, meaning if the above holds true then we cannot be bit-wise certain about our outputs.

Hidden Precision

One major change to Nvidia GPUs over the past few generations (since Ampere in 2020) was the introduction of TensorFloat32 (TF32) precision. Despite the name, it actually uses 19 bits, specitically arranged as 8 exponent bits and 10 mantissa bits. You’ll notice this is essentially a mix of FP16’s mantissa (10 bits) and BF16’s exponent (8 bits), which gives it the same precision as FP16 and same range as BF16.

Even more confusingly, users never actually “touch” this datatype, meaning it isn’t meant to be directly handled in user code at all. You’ll never see a buffer of TF32 values or need to compute 19 * n_elements to determine a buffer size. Instead, it entirely exists within the TensorCore’s systolic array (matrix multiply unit) and enables much higher performance than native FP32 mode, albeit at the cost of less numerical precision. It is enabled or disabled in cublas with the arguments CUBLAS_TF32_TENSOR_OP_MATH or CUBLAS_DEFAULT_MATH respectively.

Quantization methods

Using fewer bits is only half the story, how you map values into those bits matters just as much. Quantization takes a high-precision value and maps it into a smaller set of discrete levels, typically via a scale:

q = round(x / scale)
x̂ = q * scale

Choosing that scale is where most of the tradeoffs live.

Per-tensor vs per-channel

Per-tensor uses one scale for an entire tensor. This is simple, but inaccurate if values vary widely.

Per-channel / per-block assigns a scale per row/column. This adapts much better to real distributions and is widely used despite slightly higher overhead.

Certain newer datatypes like nvfp4 mix these techniques, using a higher precision per-tensor scale and a lower precision per-block scale.

Static vs dynamic

Static quantization precomputes scales (common for weights).

Dynamic quantization computes them at runtime (common for activations).

Block-wise quantization

At very low bitwidths (FP8, 4-bit), scales are often shared across small blocks (e.g. 32 values). This improves accuracy but adds complexity and requires kernels to load both values and scales.

“Automatic” mixed precision

Since we’ve seen how certain operations are more sensitive to low-precision datatypes than others, couldn’t we just mark the operations that are sensitive and switch to high-precision datatypes for just those?

Yes! That’s exactly how PyTorch’s Automatic Mixed Precision works. It relies on a table that marks operations which require higher precision and does upcasts before and downcasts after. This helps to alleviate the primary issues of precision loss, though it’s a fairly brittle approach. As opsets are large, this is a large manual effort to make sure all operations are correctly marked, and when new operations are added, they can just as easily slip through this manual marking process and execute in a precision lower than is required for stable results.

The compiler’s role in protecting accuracy

This manual op-marking approach starts to break down as we move towards a world of ML compilers, where the program written by the model developer is ingested into a compiler and undergoes aggressive optimizing transformations to reduce memory pressure or fit hardware units better. Operations that were previously insensitive to precision errors could be transformed into sensitive operations, or vice-versa. Compounding this, most compilers now operate at a lower level than whole tensor operations, meaning loops and blockwise or elementwise operations are tracked explicitly. For example, this means accumulations are often done in 2-step fashions, first accumulating inside a block and then accumulating between blocks, sometimes in different precisions. As we’ve seen before, long accumulation chains are hotspots for numerical error buildup.

In this world, its the compiler’s job to maintain a contract with the user: numerical error must not increase due to optimizations, or it must increase in predictable, user-controllable ways.

If this contract is not maintained, outputs end up being lower quality due to numerical errors / instabilities. Debugging this is a very involved process, often requiring a deep dive into the compiler’s internals and chosen transformations. Without good tooling, this is quite a challenging issue to debug!

Modelling precision in Luminal

An e-graph datastructure, like the one used in Luminal

Keeping this contract is vital to correct results, and given Luminal is a compiler focused on formal large scale search, we needed to be able to verifiably hold numeric guarantees even as the compiler traverses a semantically rich search space at various levels of operators.

A straightforward approach is to measure absolute tolerance (atol) error and relative tolerance (rtol) error end-to-end through an entire model. This is a standard technique used by many libraries, and works fairly well given sufficiently noisy inputs. However this approach has some drawbacks, most notably to do with runtime. We may have millions of compute graphs that would lead to unacceptable numerical losses, but this approach would require we ran each one fully to rule them out, a process that could take far too long at compile time.

An approach Luminal commonly uses is to do operator-level or subgraph-level precision tracking, essentially measuring operator or subgraph numerical errors and reasoning about how they compound in a whole model. One way to think about this is that if subgraph A has been measured (through atol and rtol) to produce unacceptable numerical loss, we can safely rule out all compute graphs that contain A, knowing that the remainder of the graph cannot have "less” overall error (this doesn’t hold true in a handful of edge cases, however this post is long enough!).

Static analysis represents another approach to quantifying errors in linear algebra expressions. There are several forms, but generally these take the form of:

Start with bounds on input variables [a, b]
Derive correlations through the expression and across operators
Estimate overall error bounds
Use a rewriting system to minimize error given some constraints

A common drawback to interval-based tracking is the explosion of error bounds. In order to guarantee outputs will fall within an interval, solvers will generally estimate worst-case on each operation, which builds up over the course of a full expression, leading to the end error interval overestimating the true error bounds.

Solvers like Daisy use bit-level representations to analyze bit-level transformations and symbolically model errors rigorously. The upside of this level of rigor typically is tighter end error bounds, without as much overestimation as general interval-based tracking. However due to bit-level tracking, these solvers can become quite expensive on large expressions (which LLMs certainly qualify as).

Generally static / analytical solvers are required to over-assume a worst-case error, and mixing in empirical error measurements with representative inputs often helps keep the search grounded to real-world data.

Wrapping up

Numerics determine whether your tokens can be trusted or not. Despite labs sinking billions into training better and better models, relatively little attention is paid to making sure the fidelity of those models are preserved after the benchmarking runs are over and they go into service.

As we’ve seen, modern computers can only represent floating point numbers with a finite number of bits, and so numerical error is unavoidable. Quantifying and controlling that error is vital. As Luminal is a compiler, it is our job to ensure no optimizations or rewriting destroys numerical accuracy, lest our outputs be not only fast but also incorrect.

General-purpose rewriting solutions, like those used in Luminal, allow us to traverse this space smoothly and reason about not only performance but also numerics in a joint space.

I’m excited about the possibilities of controllable, low-precision, low-error accelerated inference. Luminal exists in a unique space where we can co-design with our hardware partners and model partners to continue driving the cost of intelligence down, so when the “country of geniuses in a datacenter” arrives, we can all afford to use it.

If this excites you, we’re hiring.

Compiling Models to Megakernels

Luminal — Fri, 09 Jan 2026 23:14:52 GMT

Luminal is an inference compiler, and as such we’re interested in driving inference right up to the physical limits of the hardware. Inference has two fundamental limitations: compute (flops) and bandwidth (TB/s). Increasing these two requires buying much more expensive hardware, so we want to make sure we’re using all the compute and bandwidth we have available to us! This basically boils down to: anytime the GPU is not loading data, we’re wasting bandwidth, and anytime the GPU is not computing, we’re wasting compute.

Bottlenecks

Let’s look at a typical timeline of executing a transformer layer:

A simplified view of a typical transformer forward pass

We see two problems immediately:

Every time we finish a kernel and start another one, the GPU sits idle while the CPU launches the next kernel.
Some streaming-multiprocessor cores (SMs) in the GPU finish their work early and sit idle while other SMs finish the remaining work.

Kernel launch overhead is well-known and can be partially mitigated with techniques like CUDA Graphs on Nvidia GPUs. This isn’t perfect, though, as Hazy Research demonstrated in their original megakernel post. With a dummy kernel that does no work, and ordinarily takes 2.1 micros, when CUDA graphs are enabled, it still takes 1.3 micros!

The next issue is also a well-known phenomenon called Wave Quantization, which occurs when a kernel’s work cannot be evenly distributed across all SMs, leaving some SMs to finish early and stall while others lag behind to finish the kernel. Depending on the total runtime of the kernels and the shape of the work, these gaps can become very significant!

Due to the nature of the tensor computations we’re interested in, we don’t actually have to wait for a full synchronization to begin the next op. Take a tiled matmul for example:

Data access patterns of a tiled matmul

This operation does not need to wait for all of tensor A or all of tensor B to begin computing, since it only consumes a stripe of tiles from both A and B. So long as that stripe is ready, we can start computing a tile of C! This full synchronization is entirely enforced by the standard kernel execution model, not required by the mathematics.

There’s actually a hidden third bottleneck preventing us from fully utilizing our hardware’s bandwidth and compute effectively: each kernel does no compute until it loads enough weights to start working. Generally this means even if the kernel can do perfect load-compute overlapping during it’s main loop execution, it cannot get around the idle time waiting for the initial weights to load. We’d need a finer-grained timeline showing loading and compute to see that effect:

A simplified view of a single SM during execution

Now we can see there’s a large amount of time spent loading the initial weights before we can even begin to compute. The whole time our expensive tensor cores are sitting idle! Even if our kernels were programmed by experts and perfectly utilized bandwidth during their execution, this is outside their control. Techniques like Programmatic Dependent Launch help mitigate this by letting the next kernel start setting up (loading weights) while the current kernel is running, however this is done on the device level, not the per-SM level, so we’re still left with significant bubbles.

One kernel per model

What if instead we could fuse every operation in a forward pass into a single kernel? This would give us a few advantages:

We’d eliminate kernel launch latency right off the bat, since we only launch one kernel for the entire forward pass.
We’d also be able to immediately start running work from the next operation on SMs that have early-finished work on the current operations, eliminating our wave quantization effects.
We’d be able to start loading weights for the SM’s next operation during the epilogue of the current operation, thereby eliminating the above gap between compute spans.

This technique was pioneered by Hazy Research last year, where they fused Llama 1B into a single megakernel. However, a significant limitation of their approach was requiring these megakernels to be built by hand, manually defining each instruction and scheduling it to SMs. Since we’re compiling models from source, we want this process to be automatic and robust to arbitrary architectures.

Let’s walk through how megakernels work, and then we’ll dive in to how Luminal automatically generates them for arbitrarily complex models.

An interpreter on a GPU

Megakernels stem from the concept of an interpreter. Most programmers will be familiar with how interpreted languages like Python work, where an interpreter reads, decodes, and executes instructions one-by-one. We can view a GPU as a large multi-core processor, where each core is capable of executing a very limited instruction set. We can either provide the cores their instructions directly in shared memory on a per-core basis, or in global memory in a global instruction stream. In other words, we need to decide if we want to statically schedule instructions to independent streams per-SM or on a single stream SMs all share.

A quick word about each path:

Static scheduling benefits from being able to prefetch and load many instructions at a time, directly into shared memory. The overhead for fetching a new instruction is very low since it’s already fetched at execution time and resides in fast memory. A downside of this approach is it requires the programmer or compiler to statically partition instructions across SMs ahead of time, which is challenging especially since instructions can be variable-latency. Furthermore, jitter is often present in SMs, causing some to run slower than others for unpredictable hardware reasons.
Dynamic (global) scheduling incurs more significant overhead by requiring a roundtrip to global memory and an atomic lock to fetch each instruction. These can be hidden though, during the execution of the previous instruction, so long as the previous instruction takes enough time to hide the fetch latency. Global scheduling also does not require the programmer or compiler to partition instructions to SMs ahead of time, instead allowing SMs to opportunistically pop instructions off the queue once they are ready. This naturally corrects for jitter, because faster SMs will pick up the slack while slower ones lag.

We felt the tradeoffs introduced with dynamic scheduling were worth it. Our megakernels provide a single global instruction queue shared by all SMs, which both simplifies the compiler’s work as well as allows for variable-latency instructions.

Since instructions communicate through global memory, we still want to do the same fusion patterns as in traditional kernels. This means our instructions end up being fairly coarse grained, handling computations like Matmul + ResidualAdd or RMSNorm + Matmul + RoPE to minimize global memory roundtrips.

Here’s a view of how our SMs work through instructions:

A profile of a megakernel executing across many SMs

Notice how there’s overlap between when the current instruction ends and the next instruction begins running. We also even see SMs running multiple instances of instructions in the same timespan single instructions run on other SMs, showing that instruction latency is quite variable!

There’s one big problem left we haven’t discussed: synchronization. As we discussed before, normal kernels have a major downside in that future work cannot be ran until all SMs finish on the current kernel. However, the corollary to that is we are guaranteed all data is ready by the start of the next kernel. Once we start running future ops before past ops are entirely done, this guarantee goes away, requiring us to be very fine-grained in how we synchronize and assert the input data to the next op is in fact ready. The mechanism we use for doing this is standard barrier counters. However, unlike Hazy’s barriers, we use an increment-then-decrement barrier approach, where ops first increment their assigned barrier at launch, run, and then decrement their barrier once they are completed. We can then view each barrier as a sort of “inflight producer” counter. This mechanism means we don’t need the consumer to know how many producers to wait for on a given piece of data, it simply needs to wait for the number of inflight producers to equal zero.

Generating Megakernels

Luminal is a graph-based compiler, and as such it represents models as compute graphs. The challenge we undertake is transforming a compute graph into an instruction queue, with fine-grained data dependencies wired up correctly. Our approach takes 2 passes:

Rewriting existing ops into block ops, partitioned over SMs, with strided input and output data dependencies
Deriving barrier strides given all present input-output op pairings.

The first step is relatively straightforward. We have an op, say Matmul, that can be rewritten into a TileMatmul to handle a tile of data at a time. During the process of rewriting, we use shape-layout algebra (similar to CuTE) inside the e-graph engine (egglog) to derive correct strides for each input and the output tiles. Our approach is flexible on the shape of data we input and output from ops. For instance, some ops benefit from tiles (like matmul) whereas others don’t and operating on contiguous rows at a time is more efficient.

Once we have partitioned ops, we derive the barriers each op should consume from (check equals 0 before running) and produce to (increment and decrement). Let’s make this concrete by going back to our tile matmul example:

In this case, lets say M = 128, N = 128, K = 128, and our tiles are of size 32x32. We’re launching a 2D grid of (128 / 32) x (128 / 32) = 4 x 4 = 16 tile matmul instances to cover C. Our job is to work out the expression that would map the launch index (0-15) to a barrier index for source A. This is done by looking at the producer of A’s launch dimensions. If they are the same size along M we can prove independence along that dimension, since we only consume one tile’s worth of data along M. Therefore along M we initialize 128 / 32 = 4 barriers, and use a stride of 1 to specify that as we launch down that dimension, we want to step our barriers by 1. Along K we are always consuming the whole dimension, so our stride there should be 0. Therefore our final A barrier stride would be m * 1 + n * 0 or flattened along a single launch axis, it would be (x / 4) * 1 + (x % 4) * 0 = x / 4 , which maps our launch index (0-15) to our barrier (0-3) we want to consume from.

The idea behind analyzing each launch dimension is to preserve as much independence as possible. In the worst case, we need every producer SM and every consumer SM to share a single barrier, which would bring us back to the full-sync of traditional kernels. In the best case we have full independence where each next op depends on only one previous op, and can launch immediately when an SM completes.

This all ties together in a struct that looks like this:

struct BlockOp {
  src_a_data: Expression,
  src_b_data: Expression,
  src_a_barrier: Expression,
  src_b_barrier: Expression,
  dest_data: Expression,
  dest_barrier: Expression,
}

Where each expression defines a stride mapping the logical launch index to a physical index. Now each op knows where to get it’s source data, which barriers to look at before running, where to write it’s dest data, and which barrier to increment / decrement.

The next step is to generate the op implementations for all of these ops, from each block-op’s definitions. A standard implementation takes this form:

__device__ void mk_op(
    OpPayload payload, // op-specific payload struct containing metadata
    const float* const source_ptrs[3], // source data pointers resolved by the interpreter
    float* out_ptr, // dest data pointer resolved by the interpreter
    const int current, // the current logical launch index of this op
    int t // the current thread index in this threadblock
) {
    // body
}

This gives us all the information we need to execute a block op. The interpreter resolves the data pointers and barriers, correctly waits on barriers, and passes in data pointers to our implementation function. Ops also can create payload structs and place them in the instruction queue to be passed to the implementation. These structs typically have metadata in them, such as runtime dimensions or pointers to special data stores like external KV caches. By not constraining the metadata ops can access, we can get very creative with op design and access execution patterns not possible in more constrained implementations.

Symbolic Work Queues

One big challenge up front was how to handle rebuilding work queues (instruction queues) in between executions. The process of reallocating and re-scheduling every operation on a queue before each and every execution can be large and become a major bottleneck. Certain queues can be cached for multiple runs, but in general we don’t want to worry about the costly process of re-allocating and rebuilding queues every time something as simple as a sequence length changes.

We don’t want this! Pipelining would help here, but it'd be even better if we didn’t need to rebuild at all.

Luminal’s solution to this is to represent instructions in the work queue, rather than instruction instances, we call this a symbolic work queue. For instance, if we have a MxKxN matmul that is partitioned into (M / 32)x(N / 32) tiled matmul ops, we don’t actually want to have (M / 32)x(N / 32) ops present in the queue. Instead we’ll put one tiled matmul entry in the queue and mark it’s launch dimensions as (M / 32)x(N / 32). Then we’ll initialize a running counter of how many remaining instruction instances we need to launch for the given instruction on the queue before moving the program counter. These will be atomically decremented as each SM pops another instruction instance off the queue.

What this gets us is an ability to symbolically represent how many instances of an instruction we want to fire off. For another example, let’s say we have a tensor of shape Sx128, and a row normalization op that normalizes a row at a time. We want to fire off S ops, which we represent exactly as such. Then at runtime we simply evaluate S with the concrete dynamic dimension values which contain the real sequence length for that execution, and we get the correct number of operations to dispatch. By representing our data pointers and barriers as strides, we can also do the exact same process of expression evaluation to resolve real data pointers and barriers at runtime. We can now change S (and any other dynamic dimension) with zero modification to the underlying work queue (or any other host-side work) at runtime!

All that’s left is to assemble the work queue once at compile time by topologically visiting each partitioned op, scheduling it’s instruction / payload struct, and then at runtime calling a single kernel dispatch and waiting on the results!

Conclusion

We’ve come a long way, so lets recap:

Traditional kernels cause bubbles through kernel launch overhead, wave quantization, and inter-instruction memory bubbles
By fusing an entire model into a single megakernel, we can overcome all three of these challenges
We can generate megakernels through a multi-stage process of rewriting an op to be partitioned over SMs, deriving data and barrier strides, and generating an interpreter by inlining each op’s implementation functions. Then we visit each op in the graph again to build the work queue, and bring the queue and interpreter together to execute!

It’s still early days for megakernels. A lot of abstractions have yet to be built, but we’re excited to realize a cleaner, more performant programming model for GPUs and custom accelerators focused on minimizing unnecessary synchronizations and keeping the hardware resources busy.

We’re releasing our work on megakernels in the Luminal compiler repo, come check it out and contribute. We’re leveraging the bitter lesson to build a truly next generation inference compiler, learning from decades of industry progress in ML, compiler engineering, and HPC. The future demands orders of magnitude more efficient compute. If this kind of state-of-the-art inference engineering excites you, we’re hiring! Shoot me a DM.

A big thanks to Hazy Research for their pioneering work in megakernels.

Announcing our $5.3M Seed Round

Luminal — Tue, 18 Nov 2025 03:39:00 GMT

We’re excited to announce that Luminal has raised a $5.3M seed round to bring speed-of-light inference to everyone. Our round was led by Felicis Ventures, with incredible angels like Paul Graham, Guillermo Rauch, and many more.

The Software Problem

As increasingly powerful models begin to accelerate various parts of the global economy, demand for compute continues to skyrocket. Every week a new article breaks about some multi-billion dollar datacenter buildout or compute partnership. To meet these demands, the semiconductor industry has shifted to an accelerated pace of development, releasing chips capable of higher and higher FLOPs / $ and FLOPs / watt.

Meanwhile, the software that runs on those chips continues to lag far behind, leading to huge swatchs of these chips running dark and unutilized. The best chips in the world are only as good as their software, as seen on Nvidia’s Hopper generation only reaching software maturity a full 2 years after release. The problem is only getting worse: as chip complexity increases, speed-of-light (peak) performance is increasingly out of reach for developers.

A Compiled Cloud

Luminal is building a future where reaching full hardware utilization (and positive unit economics) is as simple as running luminal.deploy(). AI companies should get back to worrying about their customers and product, not niche CUDA instructions and complex inference infrastructure.

We’re building a tightly integrated high-performance compiler and inference cloud to overcome this “software bottleneck”. We believe large-scale kernel search holds the key to enabling speed-of-light performance on a wide variety of accelerators, from GPUs to ASICs. And we believe the best way to deliver this capability is in a tightly integrated, high-performance inference cloud.

An Open Source Future

From the start, Luminal has been an open source project, with incredible community backing and adoption. For us to truely fulfill our mission of speed-of-light inference for all, building the core of our compiler in the open lets us build with the community and lets developers build and run on their own hardware.

Given the sheer complexity involved in solving accelerated computing, no single company can do it alone. If you’re an AI engineer excited about deleting 90% of the complexity in AI, come build with us!

Looking Forward

We’re working with companies running custom models to drive down latency and increase throughput in our deployments. If you want your models running faster and cheaper, sign up here and we’ll reach out.