<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Luminal Blog]]></title><description><![CDATA[AI infrastructure at the speed of light.]]></description><link>https://blog.luminal.com</link><image><url>https://substackcdn.com/image/fetch/$s_!yln-!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184da0fd-68d1-4f07-9165-c5cf0faa01ce_116x116.png</url><title>Luminal Blog</title><link>https://blog.luminal.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 10 Apr 2026 21:48:13 GMT</lastBuildDate><atom:link href="https://blog.luminal.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Luminal]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[luminalai@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[luminalai@substack.com]]></itunes:email><itunes:name><![CDATA[Luminal]]></itunes:name></itunes:owner><itunes:author><![CDATA[Luminal]]></itunes:author><googleplay:owner><![CDATA[luminalai@substack.com]]></googleplay:owner><googleplay:email><![CDATA[luminalai@substack.com]]></googleplay:email><googleplay:author><![CDATA[Luminal]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Producing The Perfect Token]]></title><description><![CDATA[The unspoken inference quality gap and how numerics determine if the inference you're paying for is worth it.]]></description><link>https://blog.luminal.com/p/producing-the-perfect-token</link><guid isPermaLink="false">https://blog.luminal.com/p/producing-the-perfect-token</guid><dc:creator><![CDATA[Luminal]]></dc:creator><pubDate>Mon, 06 Apr 2026 11:49:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qKy5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qKy5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qKy5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png 424w, https://substackcdn.com/image/fetch/$s_!qKy5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png 848w, https://substackcdn.com/image/fetch/$s_!qKy5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png 1272w, https://substackcdn.com/image/fetch/$s_!qKy5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qKy5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png" width="1456" height="699" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:699,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96711,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.luminal.com/i/191886382?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qKy5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png 424w, https://substackcdn.com/image/fetch/$s_!qKy5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png 848w, https://substackcdn.com/image/fetch/$s_!qKy5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png 1272w, https://substackcdn.com/image/fetch/$s_!qKy5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba221404-e629-4d9d-9a3a-0e3789f5a5d0_1624x780.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Inference is rapidly becoming the primary bottleneck of business, driving many inference clouds to race to provide faster and cheaper tokens.</p><p>This race has resulted in the quality of tokens differing massively depending on inference cloud, model, and serving setup. Benchmarks performed by the <strong>same model</strong> on different clouds can range as much as <strong>20%</strong> due to quality issues. This is the gap between useful and useless tokens, so today we&#8217;ll go over the factors that affect quality, the economic basis of reliability, and how we engineer our compiler and cloud to deliver only the highest quality artisanal tokens on the market.</p><p>When thinking about inference, focus is placed on which model is being served, at what speed, and at what price. After all, if Provider A serves the same model as Provider B, at the same speed and 30% cheaper, why not use Provider A?</p><p>Neural networks are supposed to be deterministic calculations, so anyone who can run those calculations cheapest should get all the business. However <a href="https://eval.16x.engineer/blog/kimi-k2-provider-evaluation-results">recent benchmarks</a> ran on Kimi K2 across various providers tell a different story. Significant divergences appear despite the model and benchmarks being held constant. Worse yet, as <a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/">Thinking Machines documented in their excellent piece on determinism in LLMs</a>, getting the exact same outputs out of an LLM is not possible without a significant performance penalty.</p><p>So why is this, and what can we do about it?</p><p>First we&#8217;ll start off by going over why this matters to inference providers and customers, as well as how it affects real inference workloads today. We&#8217;ll then need to dive into the fundamentals of how computers store numbers and the different choices we have when picking representations. Then we&#8217;ll go over how these tradeoffs affect real inference and which operations are most sensitive to errors. Finally we'll wrap up by going over how compilers reason about this in general, and how Luminal reasons about this specifically, and how we avoid optimizing a valuable token stream into jibberish.</p><p></p><h2>The money is in the bits</h2><p>A modern inference cloud isn&#8217;t too dissimilar to a steel mill, in that it has inputs and outputs, and aims to produce outputs from inputs at a lower cost than the customer is willing to pay for them. They generally see large advantages in economies of scale, amortizing fixed costs across very large volume.</p><p>However like a steel mill, these businesses are constantly under competitive pressure to lower their COGS, giving them either more margin or more pricing power against competitors viewed as selling an identical product. The hyper-competitiveness of the inference game has led to the cost of intelligence decreasing over <a href="https://www.brownstoneresearch.com/bleeding-edge/the-cost-of-intelligence/#:~:text=That%20cost%20has%20declined%20from%20$4%2C500%20per,now%20scoring%2090.5%25%20on%20the%20ARC-AGI-1%20test.">390x in the past 3 years alone</a>. In this market clouds are constantly looking for an edge, a way to produce their output (in this case tokens) ever cheaper.</p><p>So what is the primary bottleneck on token production for these businesses? In two words: <strong>memory bandwidth</strong>. Every time an LLM generates a token, all (or a fraction) of it&#8217;s weights need to be loaded from memory into a compute unit, which represents the vast majority of the time and energy of the operation, relative to actual computation. If there was a way to decrease the required memory bandwidth for an LLM to produce a token, huge gains could be realized in terms of both speed and cost.</p><p>This has set off a race over the past decade to figure out how to shrink models more and more by using less bits per parameter. However, customers have begun realizing the downsides of this trend, seeing large mismatches between reported performance and experienced performance on many inference providers, lead to growing customer skepticism. As we&#8217;ll see, this issue is a lot more complex than it seems on the surface.</p><p></p><h2>How do computers represent numbers?</h2><p>When we think of numbers, we generally think of whole numbers, like 1, 2 or 42, or real decimal numbers like 4.3 or 3.14. But computers are binary machines, representing everything in finite amounts of 1&#8217;s and 0&#8217;s. So if we wanted to represent a decimal number, like the kinds neural networks operate with, in a computer, what are our options?</p><h3>IEEE Floating Point Standard</h3><p>The IEEE 754 standard defines how floating-point numbers are represented and computed in modern hardware. Each number is encoded as three parts: a sign bit, an exponent (which determines dynamic range), and a mantissa (which determines precision).</p><p>These bits are interpreted as:</p><p><strong>value = (&#8722;1)^sign &#215; mantissa &#215; 2^exponent</strong></p><p>This can be generally thought of as <em>sign sets the direction, exponent sets the scale, and mantissa sets the detail within that scale.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3ksc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3ksc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png 424w, https://substackcdn.com/image/fetch/$s_!3ksc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png 848w, https://substackcdn.com/image/fetch/$s_!3ksc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png 1272w, https://substackcdn.com/image/fetch/$s_!3ksc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3ksc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png" width="544" height="214.2066276803119" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:404,&quot;width&quot;:1026,&quot;resizeWidth&quot;:544,&quot;bytes&quot;:208403,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.luminal.com/i/191886382?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3ksc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png 424w, https://substackcdn.com/image/fetch/$s_!3ksc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png 848w, https://substackcdn.com/image/fetch/$s_!3ksc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png 1272w, https://substackcdn.com/image/fetch/$s_!3ksc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f07d84-deb1-4c9d-9c66-2ebf0a49123c_1026x404.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The FP32 format</figcaption></figure></div><p>The most popular of these datatypes are <strong>FP64,</strong> <strong>FP32</strong> and <strong>FP16</strong>, using 64, 32 and 16 bits respectively.</p><h3>Modern narrow datatypes</h3><p>More recently there&#8217;s been a push to invent even more narrow-precision datatypes: BF16 from Google Brain, and more recently FP8 (E3M4, E4M3) and various 4-bit variants (MXFP4 and NVFP4).</p><p>The majority of the performance gains shown in more recent generation GPUs stem directly from using lower-precision datatypes, as seen in Nvidia&#8217;s gen-to-gen performance chart:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hE6n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hE6n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png 424w, https://substackcdn.com/image/fetch/$s_!hE6n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png 848w, https://substackcdn.com/image/fetch/$s_!hE6n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png 1272w, https://substackcdn.com/image/fetch/$s_!hE6n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hE6n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png" width="1456" height="825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:506702,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.luminal.com/i/191886382?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hE6n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png 424w, https://substackcdn.com/image/fetch/$s_!hE6n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png 848w, https://substackcdn.com/image/fetch/$s_!hE6n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png 1272w, https://substackcdn.com/image/fetch/$s_!hE6n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e12d998-dec3-4d08-9dee-db04e0da341e_1464x830.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It is advantageous to use datatypes that require less bits because they take pressure off memory bandwidth, fit into caches better, and require fewer transistors to implement mathematical operations in hardware. However as we&#8217;ll see there are correctness tradeoffs associated with lower-precision datatypes. </p><p>We&#8217;ll talk more about TF32 later, as it&#8217;s an unusual case.</p><h3>The precision spectrum</h3><p>We now see a spectrum of datatypes:</p><ul><li><p><strong>FP32 </strong>is the gold standard for precision and correctness (FP64 isn&#8217;t widely used in AI). Generally low-performance, high-correctness.</p></li><li><p><strong>FP16 / BF16</strong> being a generally safe / mature format to use depending on if range or precision are targeted.</p></li><li><p><strong>FP8</strong> for speed-of-light performance on Hopper-generation (2022 onwards) accelerators with some (manageable) accuracy tradeoffs and no block-scaling complexity.</p></li><li><p><strong>MXFP4 / NVFP4</strong> for state-of-the-art performance on Blackwell-generation (2025 onwards) accelerators utilizing very low-bit weights for maximum bandwidth efficiency and scaling factors for preserving accuracy. </p></li><li><p><strong>INT8</strong> is less commonly used in datacenter accelerators but common on edge devices owning to the simplicity of integer arithmetic hardware.</p></li></ul><p></p><h2>Sources of error</h2><p>The tradeoff of a low-bit datatype is less representational power since fewer bits means fewer states. Fewer exponent bits shrink dynamic range resulting in more overflows and underflows, while fewer mantissa bits increases rounding errors.</p><p>Two additional behaviors also incur a mismatch between represented and real numbers: <em>subnormals</em> and <em>flush-to-zero</em> behavior. In IEEE 754, numbers very close to zero are represented using subnormals. Instead of the usual &#8220;1.xxx &#215; 2^e&#8221; form, they drop the implicit leading 1 and use &#8220;0.xxx &#215; 2^emin&#8221;. This allows gradual underflow where values don&#8217;t jump straight from the smallest normal number to zero, instead tapering off smoothly. Flush-to-zero is a performance optimization where the system treats <strong>all subnormal values as exactly zero</strong>, which eliminates the hardware required to correctly handle subnormal values.</p><p>These tradeoffs are tricky to track since they depend greatly on not only the datatype and hardware in question, but the exact operation as well, with some operations being much more sensitive to low-bit mismatches than others.</p><p><strong>Accumulations</strong> are the most common source of errors, putting pressure on numeric precision for long accumulation sequences. While multiplications are generally fine to do in fairly low precision due to errors being finite and bounded, as the length of the accumulation grows, errors build up.</p><p>This is a big problem for matrix multiplies, which famously do long accumulation chains as part of their dot-product operation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ysuw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55c8794-3c5f-45e4-8535-51018cd241e4_1178x484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ysuw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55c8794-3c5f-45e4-8535-51018cd241e4_1178x484.png 424w, https://substackcdn.com/image/fetch/$s_!ysuw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55c8794-3c5f-45e4-8535-51018cd241e4_1178x484.png 848w, https://substackcdn.com/image/fetch/$s_!ysuw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55c8794-3c5f-45e4-8535-51018cd241e4_1178x484.png 1272w, https://substackcdn.com/image/fetch/$s_!ysuw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55c8794-3c5f-45e4-8535-51018cd241e4_1178x484.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ysuw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55c8794-3c5f-45e4-8535-51018cd241e4_1178x484.png" width="1178" height="484" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c55c8794-3c5f-45e4-8535-51018cd241e4_1178x484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:1178,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;6 Step Optimization of GeMMs in CUDA&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="6 Step Optimization of GeMMs in CUDA" title="6 Step Optimization of GeMMs in CUDA" srcset="https://substackcdn.com/image/fetch/$s_!ysuw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55c8794-3c5f-45e4-8535-51018cd241e4_1178x484.png 424w, https://substackcdn.com/image/fetch/$s_!ysuw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55c8794-3c5f-45e4-8535-51018cd241e4_1178x484.png 848w, https://substackcdn.com/image/fetch/$s_!ysuw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55c8794-3c5f-45e4-8535-51018cd241e4_1178x484.png 1272w, https://substackcdn.com/image/fetch/$s_!ysuw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55c8794-3c5f-45e4-8535-51018cd241e4_1178x484.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the diagram above, we are taking a dot product of the elements in the shaded areas of the A and B matrices to get the single element shaded area of the C matrix. When the K dimension is large we need to do a long accumulation chain to get to the final result. LLMs have been increasing the K dimension for years, some now as large as 14848 in the case of Falcon 180B. For this reason most accelerators implement accumulators in a higher precision than the multiply units, often times as high as FP32.</p><p><strong>Softmax</strong> is another common operation for errors to occur due to the exponentiation involved on every element, which increases the magnitude of each element and leads to common overflows, especially on low-exponent datatypes. Techniques like <em>stable softmax</em> involve subtracting the maximum element from all elements first before the standard softmax is applied for this reason.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;softmax(x_i) = exp(x_i - max(x)) / sum(exp(x - max(x))))&quot;,&quot;id&quot;:&quot;LNXZGCIQZP&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p><strong>Normalization</strong>, such as LayerNorms found inside most LLMs put high pressure on the precision (mantissa bits) when computing variance. On datatypes with low mantissa bits, rounding errors commonly occur. For this reason variance is often computed in FP32.</p><p><strong>Outliers</strong> are a very common phenomena where a few elements in the activations dominate the scaling of an operation in a transformer and ruin the effective resolution in INT8 precision. Clipping activations can help eliminate outliers, however clipping fundamentally destroys information, so it also contributes to quality loss.</p><p></p><h3>A note on determinism</h3><p>AI models are generally made up entirely of linear algebra operations, and since linear algebra is generally thought of as deterministic, it stands to reason we can always get deterministic, reproducible outputs out of our AI models. Unfortunately as Thinking Machines has documented excellently <a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/">here</a>, that isn&#8217;t usually the case. Their post is very detailed and I would highly recommend reading it, but for our purposes I&#8217;ll summarize a key cause of nondeterminism as this inequality when dealing with finite-precision floating points:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(a+b)+c \\neq a+(b+c)&quot;,&quot;id&quot;:&quot;WMVLHVFVWS&quot;}" data-component-name="LatexBlockToDOM"></div><p>In modern floating-point hardware, such as GPUs, there are no guarantees made about accumulation ordering, meaning if the above holds true then we cannot be bit-wise certain about our outputs.</p><p></p><h3>Hidden Precision</h3><p>One major change to Nvidia GPUs over the past few generations (since Ampere in 2020) was the introduction of TensorFloat32 (TF32) precision. Despite the name, it actually uses 19 bits, specitically arranged as 8 exponent bits and 10 mantissa bits. You&#8217;ll notice this is essentially a mix of FP16&#8217;s mantissa (10 bits) and BF16&#8217;s exponent (8 bits), which gives it the same precision as FP16 and same range as BF16.</p><p>Even more confusingly, users never actually &#8220;touch&#8221; this datatype, meaning it isn&#8217;t meant to be directly handled in user code at all. You&#8217;ll never see a buffer of TF32 values or need to compute 19 * n_elements to determine a buffer size. Instead, it entirely exists within the TensorCore&#8217;s systolic array (matrix multiply unit) and enables much higher performance than native FP32 mode, albeit at the cost of less numerical precision. It is enabled or disabled in cublas with the arguments <code>CUBLAS_TF32_TENSOR_OP_MATH</code> or <code>CUBLAS_DEFAULT_MATH</code> respectively.</p><p></p><h2><strong>Quantization methods</strong></h2><p>Using fewer bits is only half the story, <em>how </em>you map values into those bits matters just as much. Quantization takes a high-precision value and maps it into a smaller set of discrete levels, typically via a scale:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;4d49e840-9b2f-48cf-88ea-8e0cc2403dcd&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">q = round(x / scale)
x&#770; = q * scale</code></pre></div><p>Choosing that scale is where most of the tradeoffs live.</p><h3><strong>Per-tensor vs per-channel</strong></h3><p><strong>Per-tensor</strong> uses one scale for an entire tensor. This is simple, but inaccurate if values vary widely.</p><p><strong>Per-channel / per-block</strong> assigns a scale per row/column. This adapts much better to real distributions and is widely used despite slightly higher overhead.</p><p>Certain newer datatypes like nvfp4 mix these techniques, using a higher precision per-tensor scale and a lower precision per-block scale.</p><h3><strong>Static vs dynamic</strong></h3><p><strong>Static quantization</strong> precomputes scales (common for weights).</p><p><strong>Dynamic quantization</strong> computes them at runtime (common for activations).</p><h3><strong>Block-wise quantization</strong></h3><p>At very low bitwidths (FP8, 4-bit), scales are often shared across small blocks (e.g. 32 values). This improves accuracy but adds complexity and requires kernels to load both values and scales.</p><h2>&#8220;Automatic&#8221; mixed precision</h2><p>Since we&#8217;ve seen how certain operations are more sensitive to low-precision datatypes than others, couldn&#8217;t we just mark the operations that are sensitive and switch to high-precision datatypes for just those?</p><p>Yes! That&#8217;s exactly how PyTorch&#8217;s Automatic Mixed Precision works. It relies on a table that marks operations which require higher precision and does upcasts before and downcasts after. This helps to alleviate the primary issues of precision loss, though it&#8217;s a fairly brittle approach. As opsets are large, this is a large manual effort to make sure all operations are correctly marked, and when new operations are added, they can just as easily slip through this manual marking process and execute in a precision lower than is required for stable results.</p><h3></h3><h2>The compiler&#8217;s role in protecting accuracy</h2><p>This manual op-marking approach starts to break down as we move towards a world of ML compilers, where the program written by the model developer is ingested into a compiler and undergoes aggressive optimizing transformations to reduce memory pressure or fit hardware units better. Operations that were previously insensitive to precision errors could be transformed into sensitive operations, or vice-versa. Compounding this, most compilers now operate at a lower level than whole tensor operations, meaning loops and blockwise or elementwise operations are tracked explicitly. For example, this means accumulations are often done in 2-step fashions, first accumulating inside a block and then accumulating between blocks, sometimes in different precisions. As we&#8217;ve seen before, long accumulation chains are hotspots for numerical error buildup.</p><p>In this world, its the compiler&#8217;s job to maintain a contract with the user: <strong>numerical error must not increase due to optimizations, or it must increase in predictable, user-controllable ways</strong>.</p><p>If this contract is not maintained, outputs end up being lower quality due to numerical errors / instabilities. Debugging this is a very involved process, often requiring a deep dive into the compiler&#8217;s internals and chosen transformations. Without good tooling, this is quite a challenging issue to debug!</p><h2>Modelling precision in Luminal</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nwx0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nwx0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png 424w, https://substackcdn.com/image/fetch/$s_!nwx0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png 848w, https://substackcdn.com/image/fetch/$s_!nwx0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png 1272w, https://substackcdn.com/image/fetch/$s_!nwx0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nwx0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png" width="308" height="279.4423076923077" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1321,&quot;width&quot;:1456,&quot;resizeWidth&quot;:308,&quot;bytes&quot;:5608280,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.luminal.com/i/191886382?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nwx0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png 424w, https://substackcdn.com/image/fetch/$s_!nwx0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png 848w, https://substackcdn.com/image/fetch/$s_!nwx0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png 1272w, https://substackcdn.com/image/fetch/$s_!nwx0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf2e4ec4-6c6f-4ce5-9cb5-99f745aab943_2152x1952.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An e-graph datastructure, like the one used in Luminal</figcaption></figure></div><p>Keeping this contract is vital to correct results, and given Luminal is a compiler focused on formal large scale search, we needed to be able to verifiably hold numeric guarantees even as the compiler traverses a semantically rich search space at various levels of operators.</p><p>A straightforward approach is to measure absolute tolerance (atol) error and relative tolerance (rtol) error end-to-end through an entire model. This is a standard technique used by many libraries, and works fairly well given sufficiently noisy inputs. However this approach has some drawbacks, most notably to do with runtime. We may have millions of compute graphs that would lead to unacceptable numerical losses, but this approach would require we ran each one fully to rule them out, a process that could take far too long at compile time.</p><p>An approach Luminal commonly uses is to do <em>operator-level</em> or <em>subgraph-level</em> precision tracking, essentially measuring operator or subgraph numerical errors and reasoning about how they compound in a whole model. One way to think about this is that if subgraph A has been measured (through atol and rtol) to produce unacceptable numerical loss, we can safely rule out all compute graphs that contain A, knowing that the remainder of the graph cannot have "<em>less</em>&#8221; overall error (this doesn&#8217;t hold true in a handful of edge cases, however this post is long enough!).</p><p>Static analysis represents another approach to quantifying errors in linear algebra expressions. There are several forms, but generally these take the form of:</p><ul><li><p>Start with bounds on input variables <code>[a, b]</code></p></li><li><p>Derive correlations through the expression and across operators</p></li><li><p>Estimate overall error bounds</p></li><li><p>Use a rewriting system to minimize error given some constraints</p></li></ul><p>A common drawback to interval-based tracking is the explosion of error bounds. In order to guarantee outputs will fall within an interval, solvers will generally estimate worst-case on each operation, which builds up over the course of a full expression, leading to the end error interval overestimating the true error bounds.</p><p>Solvers like <a href="https://malyzajko.github.io/papers/tacas18_daisy_toolpaper.pdf">Daisy</a> use bit-level representations to analyze bit-level transformations and symbolically model errors rigorously. The upside of this level of rigor typically is tighter end error bounds, without as much overestimation as general interval-based tracking. However due to bit-level tracking, these solvers can become quite expensive on large expressions (which LLMs certainly qualify as).</p><p>Generally static / analytical solvers are required to over-assume a worst-case error, and mixing in empirical error measurements with representative inputs often helps keep the search grounded to real-world data.</p><p></p><h2>Wrapping up</h2><p>Numerics determine whether your tokens can be trusted or not. Despite labs sinking billions into training better and better models, relatively little attention is paid to making sure the fidelity of those models are preserved after the benchmarking runs are over and they go into service.</p><p>As we&#8217;ve seen, modern computers can only represent floating point numbers with a finite number of bits, and so numerical error is unavoidable. Quantifying and controlling that error is vital. As Luminal is a compiler, it is our job to ensure no optimizations or rewriting destroys numerical accuracy, lest our outputs be not only fast but also incorrect.</p><p>General-purpose rewriting solutions, like those used in Luminal, allow us to traverse this space smoothly and reason about not only performance but also numerics in a joint space.</p><p>I&#8217;m excited about the possibilities of controllable, low-precision, low-error accelerated inference. Luminal exists in a unique space where we can co-design with our hardware partners and model partners to continue driving the cost of intelligence down, so when the &#8220;country of geniuses in a datacenter&#8221; arrives, we can all afford to use it.</p><p><strong>If this excites you, we&#8217;re hiring.</strong></p>]]></content:encoded></item><item><title><![CDATA[Compiling Models to Megakernels]]></title><description><![CDATA[Fine-grained synchronization, deep pipelines, and zero kernel launch overheads, automatically.]]></description><link>https://blog.luminal.com/p/compiling-models-to-megakernels</link><guid isPermaLink="false">https://blog.luminal.com/p/compiling-models-to-megakernels</guid><dc:creator><![CDATA[Luminal]]></dc:creator><pubDate>Fri, 09 Jan 2026 23:14:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c3fe3c3f-74ad-4862-a2d2-edb8b3705ecb_3840x2160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lrfi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lrfi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp 424w, https://substackcdn.com/image/fetch/$s_!lrfi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp 848w, https://substackcdn.com/image/fetch/$s_!lrfi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp 1272w, https://substackcdn.com/image/fetch/$s_!lrfi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lrfi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp" width="1272" height="665" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:665,&quot;width&quot;:1272,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26030,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.luminal.com/i/183811143?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lrfi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp 424w, https://substackcdn.com/image/fetch/$s_!lrfi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp 848w, https://substackcdn.com/image/fetch/$s_!lrfi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp 1272w, https://substackcdn.com/image/fetch/$s_!lrfi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff469c82a-fcfa-431c-b5b1-c3f6f37673e6_1272x665.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Luminal is an inference compiler, and as such we&#8217;re interested in driving inference right up to the physical limits of the hardware. Inference has two fundamental limitations: compute (flops) and bandwidth (TB/s). Increasing these two requires buying much more expensive hardware, so we want to make sure we&#8217;re using all the compute and bandwidth we have available to us! This basically boils down to: anytime the GPU is not loading data, we&#8217;re wasting bandwidth, and anytime the GPU is not computing, we&#8217;re wasting compute.</p><p></p><h3>Bottlenecks</h3><p>Let&#8217;s look at a typical timeline of executing a transformer layer:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YR9X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YR9X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png 424w, https://substackcdn.com/image/fetch/$s_!YR9X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png 848w, https://substackcdn.com/image/fetch/$s_!YR9X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png 1272w, https://substackcdn.com/image/fetch/$s_!YR9X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YR9X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png" width="1456" height="217" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:217,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89839,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://luminalai.substack.com/i/183811143?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YR9X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png 424w, https://substackcdn.com/image/fetch/$s_!YR9X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png 848w, https://substackcdn.com/image/fetch/$s_!YR9X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png 1272w, https://substackcdn.com/image/fetch/$s_!YR9X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd34e62b-a711-42d9-8d60-6dfdd4de3aef_2028x302.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">A simplified view of a typical transformer forward pass</figcaption></figure></div><p>We see two problems immediately:</p><ol><li><p>Every time we finish a kernel and start another one, the GPU sits idle while the CPU launches the next kernel.</p></li><li><p>Some streaming-multiprocessor cores (SMs) in the GPU finish their work early and sit idle while other SMs finish the remaining work.</p></li></ol><p>Kernel launch overhead is well-known and can be partially mitigated with techniques like <a href="https://developer.nvidia.com/blog/cuda-graphs/">CUDA Graphs</a> on Nvidia GPUs. This isn&#8217;t perfect, though, as <a href="https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles">Hazy Research demonstrated in their original megakernel post</a>. With a dummy kernel that does no work, and ordinarily takes 2.1 micros, when CUDA graphs are enabled, it still takes 1.3 micros!</p><p>The next issue is also a well-known phenomenon called Wave Quantization, which occurs when a kernel&#8217;s work cannot be evenly distributed across all SMs, leaving some SMs to finish early and stall while others lag behind to finish the kernel. Depending on the total runtime of the kernels and the shape of the work, these gaps can become <strong>very</strong> significant!</p><p>Due to the nature of the tensor computations we&#8217;re interested in, we don&#8217;t actually have to wait for a full synchronization to begin the next op. Take a tiled matmul for example:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yrRt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yrRt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 424w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 848w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 1272w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yrRt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png" width="1178" height="484" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:1178,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;An Engineer's Guide to GEMM | Pete Warden's blog&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An Engineer's Guide to GEMM | Pete Warden's blog" title="An Engineer's Guide to GEMM | Pete Warden's blog" srcset="https://substackcdn.com/image/fetch/$s_!yrRt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 424w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 848w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 1272w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Data access patterns of a tiled matmul</figcaption></figure></div><p>This operation does not need to wait for all of tensor A or all of tensor B to begin computing, since it only consumes a stripe of tiles from both A and B. So long as that stripe is ready, we can start computing a tile of C! This full synchronization is entirely enforced by the standard kernel execution model, not required by the mathematics.</p><p>There&#8217;s actually a hidden third bottleneck preventing us from fully utilizing our hardware&#8217;s bandwidth and compute effectively: each kernel does no compute until it loads enough weights to start working. Generally this means even if the kernel can do perfect load-compute overlapping during it&#8217;s main loop execution, it cannot get around the idle time waiting for the initial weights to load. We&#8217;d need a finer-grained timeline showing loading and compute to see that effect:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B5Co!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B5Co!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png 424w, https://substackcdn.com/image/fetch/$s_!B5Co!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png 848w, https://substackcdn.com/image/fetch/$s_!B5Co!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png 1272w, https://substackcdn.com/image/fetch/$s_!B5Co!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B5Co!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png" width="1456" height="153" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:153,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64830,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://luminalai.substack.com/i/183811143?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B5Co!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png 424w, https://substackcdn.com/image/fetch/$s_!B5Co!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png 848w, https://substackcdn.com/image/fetch/$s_!B5Co!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png 1272w, https://substackcdn.com/image/fetch/$s_!B5Co!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a7b4eb2-fb51-4089-a527-962ef36de4b3_2032x214.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A simplified view of a single SM during execution</figcaption></figure></div><p>Now we can see there&#8217;s a large amount of time spent loading the initial weights before we can even begin to compute. The whole time our expensive tensor cores are sitting idle! Even if our kernels were programmed by experts and perfectly utilized bandwidth during their execution, this is outside their control. Techniques like <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/#programmatic-dependent-launch-and-synchronization">Programmatic Dependent Launch</a> help mitigate this by letting the next kernel start setting up (loading weights) while the current kernel is running, however this is done on the device level, not the per-SM level, so we&#8217;re still left with significant bubbles.</p><p></p><h3>One kernel per model</h3><p>What if instead we could fuse every operation in a forward pass into a single kernel? This would give us a few advantages:</p><ol><li><p>We&#8217;d eliminate kernel launch latency right off the bat, since we only launch one kernel for the entire forward pass.</p></li><li><p>We&#8217;d also be able to immediately start running work from the next operation on SMs that have early-finished work on the current operations, eliminating our wave quantization effects.</p></li><li><p>We&#8217;d be able to start loading weights for the SM&#8217;s next operation <strong>during the epilogue of the current operation</strong>, thereby eliminating the above gap between compute spans.</p></li></ol><p>This technique was pioneered by <a href="https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles">Hazy Research last year</a>, where they fused Llama 1B into a single megakernel. However, a significant limitation of their approach was requiring these megakernels to be built by hand, manually defining each instruction and scheduling it to SMs. Since we&#8217;re compiling models from source, we want this process to be automatic and robust to arbitrary architectures.</p><p>Let&#8217;s walk through how megakernels work, and then we&#8217;ll dive in to how Luminal automatically generates them for arbitrarily complex models.</p><p></p><h3>An interpreter on a GPU</h3><p>Megakernels stem from the concept of an interpreter. Most programmers will be familiar with how interpreted languages like Python work, where an interpreter reads, decodes, and executes instructions one-by-one. We can view a GPU as a large multi-core processor, where each core is capable of executing a very limited instruction set. We can either provide the cores their instructions directly in shared memory on a per-core basis, or in global memory in a global instruction stream. In other words, we need to decide if we want to statically schedule instructions to independent streams per-SM or on a single stream SMs all share.</p><p>A quick word about each path:</p><ul><li><p><strong>Static scheduling</strong> benefits from being able to prefetch and load many instructions at a time, directly into shared memory. The overhead for fetching a new instruction is very low since it&#8217;s already fetched at execution time and resides in fast memory. A downside of this approach is it requires the programmer or compiler to statically partition instructions across SMs ahead of time, which is challenging especially since instructions can be variable-latency. Furthermore, jitter is often present in SMs, causing some to run slower than others for unpredictable hardware reasons.</p></li><li><p><strong>Dynamic (global) scheduling</strong> incurs more significant overhead by requiring a roundtrip to global memory and an atomic lock to fetch each instruction. These can be hidden though, during the execution of the previous instruction, so long as the previous instruction takes enough time to hide the fetch latency. Global scheduling also does not require the programmer or compiler to partition instructions to SMs ahead of time, instead allowing SMs to opportunistically pop instructions off the queue once they are ready. This naturally corrects for jitter, because faster SMs will pick up the slack while slower ones lag.</p></li></ul><p>We felt the tradeoffs introduced with dynamic scheduling were worth it. Our megakernels provide a single global instruction queue shared by all SMs, which both simplifies the compiler&#8217;s work as well as allows for variable-latency instructions.</p><p>Since instructions communicate through global memory, we still want to do the same fusion patterns as in traditional kernels. This means our instructions end up being fairly coarse grained, handling computations like Matmul + ResidualAdd or RMSNorm + Matmul + RoPE to minimize global memory roundtrips.</p><p></p><p>Here&#8217;s a view of how our SMs work through instructions:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w6rx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w6rx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg 424w, https://substackcdn.com/image/fetch/$s_!w6rx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg 848w, https://substackcdn.com/image/fetch/$s_!w6rx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!w6rx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w6rx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg" width="1456" height="751" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:751,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:200000,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.luminal.com/i/183811143?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w6rx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg 424w, https://substackcdn.com/image/fetch/$s_!w6rx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg 848w, https://substackcdn.com/image/fetch/$s_!w6rx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!w6rx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91112858-cba1-4a8f-8b39-d0ffa6218953_1776x916.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><strong>A profile of a megakernel executing across many SMs</strong></figcaption></figure></div><p>Notice how there&#8217;s overlap between when the current instruction ends and the next instruction begins running. We also even see SMs running multiple instances of instructions in the same timespan single instructions run on other SMs, showing that instruction latency is quite variable!</p><p>There&#8217;s one big problem left we haven&#8217;t discussed: synchronization. As we discussed before, normal kernels have a major downside in that future work cannot be ran until <em>all</em> SMs finish on the current kernel. However, the corollary to that is we are guaranteed all data is ready by the start of the next kernel. Once we start running future ops before past ops are entirely done, this guarantee goes away, requiring us to be very fine-grained in how we synchronize and assert the input data to the next op is in fact ready. The mechanism we use for doing this is standard barrier counters. However, unlike Hazy&#8217;s barriers, we use an increment-then-decrement barrier approach, where ops first increment their assigned barrier at launch, run, and then decrement their barrier once they are completed. We can then view each barrier as a sort of &#8220;inflight producer&#8221; counter. This mechanism means we don&#8217;t need the consumer to know how many producers to wait for on a given piece of data, it simply needs to wait for the number of inflight producers to equal zero.</p><p></p><h3>Generating Megakernels</h3><p>Luminal is a graph-based compiler, and as such it represents models as compute graphs. The challenge we undertake is transforming a compute graph into an instruction queue, with fine-grained data dependencies wired up correctly. Our approach takes 2 passes:</p><ul><li><p>Rewriting existing ops into block ops, partitioned over SMs, with strided input and output data dependencies</p></li><li><p>Deriving barrier strides given all present input-output op pairings.</p></li></ul><p>The first step is relatively straightforward. We have an op, say Matmul, that can be rewritten into a TileMatmul to handle a tile of data at a time. During the process of rewriting, we use shape-layout algebra (similar to CuTE) inside the e-graph engine (egglog) to derive correct strides for each input and the output tiles. Our approach is flexible on the shape of data we input and output from ops. For instance, some ops benefit from tiles (like matmul) whereas others don&#8217;t and operating on contiguous rows at a time is more efficient.</p><p>Once we have partitioned ops, we derive the barriers each op should consume from (check equals 0 before running) and produce to (increment and decrement). Let&#8217;s make this concrete by going back to our tile matmul example:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yrRt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yrRt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 424w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 848w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 1272w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yrRt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png" width="1178" height="484" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:1178,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;An Engineer's Guide to GEMM | Pete Warden's blog&quot;,&quot;title&quot;:&quot;An Engineer's Guide to GEMM | Pete Warden's blog&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An Engineer's Guide to GEMM | Pete Warden's blog" title="An Engineer's Guide to GEMM | Pete Warden's blog" srcset="https://substackcdn.com/image/fetch/$s_!yrRt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 424w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 848w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 1272w, https://substackcdn.com/image/fetch/$s_!yrRt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa4d6943-3e6e-47c0-915a-45e9731fcd69_1178x484.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In this case, lets say M = 128, N = 128, K = 128, and our tiles are of size 32x32. We&#8217;re launching a 2D grid of (128 / 32) x (128 / 32)  = 4 x 4 = 16 tile matmul instances to cover C. Our job is to work out the expression that would map the launch index (0-15) to a barrier index for source A. This is done by looking at the producer of A&#8217;s launch dimensions. If they are the same size along M we can prove independence along that dimension, since we only consume one tile&#8217;s worth of data along M. Therefore along M we initialize 128 / 32 = 4 barriers, and use a stride of 1 to specify that as we launch down that dimension, we want to step our barriers by 1. Along K we are always consuming the whole dimension, so our stride there should be 0. Therefore our final A barrier stride would be <code>m * 1 + n * 0</code> or flattened along a single launch axis, it would be <code>(x / 4) * 1 + (x % 4) * 0 = x / 4</code> , which maps our launch index (0-15) to our barrier (0-3) we want to consume from.</p><p>The idea behind analyzing each launch dimension is to preserve as much independence as possible. In the worst case, we need every producer SM and every consumer SM to share a single barrier, which would bring us back to the full-sync of traditional kernels. In the best case we have full independence where each next op depends on only one previous op, and can launch immediately when an SM completes.</p><p>This all ties together in a struct that looks like this:</p><pre><code>struct BlockOp {
  src_a_data: Expression,
  src_b_data: Expression,
  src_a_barrier: Expression,
  src_b_barrier: Expression,
  dest_data: Expression,
  dest_barrier: Expression,
}</code></pre><p>Where each expression defines a stride mapping the logical launch index to a physical index. Now each op knows where to get it&#8217;s source data, which barriers to look at before running, where to write it&#8217;s dest data, and which barrier to increment / decrement.</p><p>The next step is to generate the op implementations for all of these ops, from each block-op&#8217;s definitions. A standard implementation takes this form:</p><pre><code>__device__ void mk_op(
    OpPayload payload, // op-specific payload struct containing metadata
    const float* const source_ptrs[3], // source data pointers resolved by the interpreter
    float* out_ptr, // dest data pointer resolved by the interpreter
    const int current, // the current logical launch index of this op
    int t // the current thread index in this threadblock
) {
    // body
}</code></pre><p>This gives us all the information we need to execute a block op. The interpreter resolves the data pointers and barriers, correctly waits on barriers, and passes in data pointers to our implementation function. Ops also can create payload structs and place them in the instruction queue to be passed to the implementation. These structs typically have metadata in them, such as runtime dimensions or pointers to special data stores like external KV caches. By not constraining the metadata ops can access, we can get very creative with op design and access execution patterns not possible in more constrained implementations.</p><p></p><h3>Symbolic Work Queues</h3><p>One big challenge up front was how to handle rebuilding work queues (instruction queues) in between executions. The process of reallocating and re-scheduling every operation on a queue before each and every execution can be large and become a major bottleneck. Certain queues can be cached for multiple runs, but in general we don&#8217;t want to worry about the costly process of re-allocating and rebuilding queues every time something as simple as a sequence length changes.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CnTW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CnTW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png 424w, https://substackcdn.com/image/fetch/$s_!CnTW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png 848w, https://substackcdn.com/image/fetch/$s_!CnTW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png 1272w, https://substackcdn.com/image/fetch/$s_!CnTW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CnTW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png" width="1456" height="118" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:118,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40629,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.luminal.com/i/183811143?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CnTW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png 424w, https://substackcdn.com/image/fetch/$s_!CnTW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png 848w, https://substackcdn.com/image/fetch/$s_!CnTW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png 1272w, https://substackcdn.com/image/fetch/$s_!CnTW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70ce0a7-2965-4e30-9022-580e4bcff1d7_2040x166.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">We don&#8217;t want this! Pipelining would help here, but it'd be even better if we didn&#8217;t need to rebuild at all.</figcaption></figure></div><p>Luminal&#8217;s solution to this is to represent <strong>instructions</strong> in the work queue, rather than <strong>instruction instances</strong>, we call this a <em>symbolic work queue.</em> For instance, if we have a MxKxN matmul that is partitioned into (M / 32)x(N / 32) tiled matmul ops, we don&#8217;t actually want to have (M / 32)x(N / 32) ops present in the queue. Instead we&#8217;ll put one tiled matmul entry in the queue and mark it&#8217;s launch dimensions as (M / 32)x(N / 32). Then we&#8217;ll initialize a running counter of how many remaining instruction instances we need to launch for the given instruction on the queue before moving the program counter. These will be atomically decremented as each SM pops another instruction instance off the queue.</p><p>What this gets us is an ability to symbolically represent how many instances of an instruction we want to fire off. For another example, let&#8217;s say we have a tensor of shape Sx128, and a row normalization op that normalizes a row at a time. We want to fire off S ops, which we represent exactly as such. Then at runtime we simply evaluate S with the concrete dynamic dimension values which contain the real sequence length  for that execution, and we get the correct number of operations to dispatch. By representing our data pointers and barriers as strides, we can also do the exact same process of expression evaluation to resolve real data pointers and barriers at runtime. We can now change S (and any other dynamic dimension) with zero modification to the underlying work queue (or any other host-side work) at runtime!</p><p>All that&#8217;s left is to assemble the work queue once at compile time by topologically visiting each partitioned op, scheduling it&#8217;s instruction / payload struct, and then at runtime calling a single kernel dispatch and waiting on the results!</p><p></p><h3>Conclusion</h3><p>We&#8217;ve come a long way, so lets recap:</p><ul><li><p>Traditional kernels cause bubbles through kernel launch overhead, wave quantization, and inter-instruction memory bubbles</p></li><li><p>By fusing an entire model into a single megakernel, we can overcome all three of these challenges</p></li><li><p>We can generate megakernels through a multi-stage process of rewriting an op to be partitioned over SMs, deriving data and barrier strides, and generating an interpreter by inlining each op&#8217;s implementation functions. Then we visit each op in the graph again to build the work queue, and bring the queue and interpreter together to execute!</p></li></ul><p>It&#8217;s still early days for megakernels. A lot of abstractions have yet to be built, but we&#8217;re excited to realize a cleaner, more performant programming model for GPUs and custom accelerators focused on minimizing unnecessary synchronizations and keeping the hardware resources busy.</p><p>We&#8217;re releasing our work on megakernels in the <a href="https://github.com/luminal-ai/luminal">Luminal compiler repo</a>, come check it out and contribute. We&#8217;re leveraging the bitter lesson to build a truly next generation inference compiler, learning from decades of industry progress in ML, compiler engineering, and HPC. The future demands orders of magnitude more efficient compute. If this kind of state-of-the-art inference engineering excites you, we&#8217;re hiring! <a href="https://x.com/joefioti">Shoot me a DM.</a></p><p></p><p>A big thanks to <a href="https://hazyresearch.stanford.edu/">Hazy Research</a> for their pioneering work in megakernels.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.luminal.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Luminal Blog! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Announcing our $5.3M Seed Round]]></title><description><![CDATA[Luminal has raised a $5.3M seed round to bring speed-of-light inference to everyone.]]></description><link>https://blog.luminal.com/p/announcing-our-53m-seed-round</link><guid isPermaLink="false">https://blog.luminal.com/p/announcing-our-53m-seed-round</guid><dc:creator><![CDATA[Luminal]]></dc:creator><pubDate>Tue, 18 Nov 2025 03:39:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/db07ee24-01b0-4d1d-9008-a27332bbcf66_1244x694.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;re excited to announce that Luminal has raised a $5.3M seed round to bring speed-of-light inference to everyone. Our round was led by <a href="https://www.felicis.com/">Felicis Ventures</a>, with incredible angels like Paul Graham, Guillermo Rauch, and many more.</p><h2>The Software Problem</h2><p>As increasingly powerful models begin to accelerate various parts of the global economy, demand for compute continues to skyrocket. Every week a new article breaks about some multi-billion dollar datacenter buildout or compute partnership. To meet these demands, the semiconductor industry has shifted to an accelerated pace of development, releasing chips capable of higher and higher FLOPs / $ and FLOPs / watt.</p><p>Meanwhile, the software that runs on those chips continues to lag far behind, leading to huge swatchs of these chips running dark and unutilized. The best chips in the world are only as good as their software, as seen on Nvidia&#8217;s Hopper generation only reaching software maturity a full 2 years after release. The problem is only getting worse: as chip complexity increases, speed-of-light (peak) performance is increasingly out of reach for developers.</p><h2>A Compiled Cloud</h2><p>Luminal is building a future where reaching full hardware utilization (and positive unit economics) is as simple as running <code>luminal.deploy()</code>. AI companies should get back to worrying about their customers and product, not niche CUDA instructions and complex inference infrastructure.</p><p>We&#8217;re building a tightly integrated high-performance compiler and inference cloud to overcome this &#8220;software bottleneck&#8221;. We believe large-scale kernel search holds the key to enabling speed-of-light performance on a wide variety of accelerators, from GPUs to ASICs. And we believe the best way to deliver this capability is in a tightly integrated, high-performance inference cloud.</p><h2>An Open Source Future</h2><p>From the start, <a href="https://github.com/luminal-ai/luminal">Luminal has been an open source project</a>, with incredible community backing and adoption. For us to truely fulfill our mission of speed-of-light inference for all, building the core of our compiler in the open lets us build with the community and lets developers build and run on their own hardware.</p><p>Given the sheer complexity involved in solving accelerated computing, no single company can do it alone. If you&#8217;re an AI engineer excited about deleting 90% of the complexity in AI, <a href="https://github.com/luminal-ai/luminal">come build with us</a>!</p><h2>Looking Forward</h2><p>We&#8217;re working with companies running custom models to drive down latency and increase throughput in our deployments. If you want your models running faster and cheaper, sign up <a href="https://forms.gle/sfwqY4hWgQpUzGet5">here</a> and we&#8217;ll reach out.</p>]]></content:encoded></item></channel></rss>