CUDA Proves Nvidia Is a Software Company


Forgive me for starting with a cliché, a piece of business jargon that has recently slipped into the tech lexicon, but I’m afraid I must talk about “moats.” Popularized decades ago by Warren Buffett to refer to a company’s competitive advantage, the word found its way into Silicon Valley pitch decks when a memo purportedly leaked from Google, titled “We Have No Moat, and Neither Does OpenAI,” fretted that open-source AI would pillage Big Tech’s castle.

A few years on, the castle walls remain safe. Apart from a brief bout of panic when DeepSeek first appeared, open-source AI models have not vastly outperformed proprietary models. Still, none of the frontier labs—OpenAI, Anthropic, Google—has a moat to speak of.

The company that does have a moat is Nvidia. CEO Jensen Huang has called it his most precious “treasure.” It is not, as you might presume for a chip company, a piece of hardware. It’s something called CUDA. What sounds like a chemical compound banned by the FDA may be the one real moat in AI.

CUDA technically stands for Compute Unified Device Architecture, but much like laser or scuba, no one bothers to expand the acronym; we just say “KOO-duh.” So what is this all-important treasure good for? If forced to give a one-word answer: parallelization.

Here’s a simple example. Let’s say we task a machine with filling out a 9×9 multiplication table. Using a computer with a single core, all 81 operations are executed dutifully, one by one. But a GPU with nine cores can assign tasks so that each core takes a different column—one from 1×1 to 1×9, another from 2×1 to 2×9, and so on—for a ninefold speed gain. Modern GPUs can be even cleverer. For example, if programmed to recognize commutativity—7×9 = 9×7—they can avoid duplicate work, reducing 81 operations to 45, roughly halving the workload. When a single training run costs a hundred million dollars, every optimization counts.
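For readers who want to see the column-per-core idea in code, here is a minimal sketch of it as a CUDA kernel. The kernel name, launch configuration, and memory layout are illustrative choices, not anything from the article:

```cuda
#include <cstdio>

// Each GPU thread fills one column of an n-by-n multiplication table.
__global__ void timesTable(int *table, int n) {
    int col = threadIdx.x + 1;                      // columns 1..n, one per thread
    if (col <= n) {
        for (int row = 1; row <= n; ++row) {
            table[(row - 1) * n + (col - 1)] = row * col;
        }
    }
}

int main() {
    const int n = 9;
    int *table;
    cudaMallocManaged(&table, n * n * sizeof(int)); // unified memory, visible to CPU and GPU
    timesTable<<<1, n>>>(table, n);                 // launch nine threads, one per column
    cudaDeviceSynchronize();                        // wait for the GPU to finish
    printf("7 x 9 = %d\n", table[6 * n + 8]);       // row 7, column 9 -> 63
    cudaFree(table);
    return 0;
}
```

On a single core the inner loop would run 81 times in sequence; here the nine columns proceed in parallel, which is the whole trick.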

Nvidia’s GPUs were originally built to render graphics for video games. In the early 2000s, a Stanford PhD student named Ian Buck, who first got into GPUs as a gamer, realized their architecture could be repurposed for general high-performance computing. He created a programming language called Brook, was hired by Nvidia, and, with John Nickolls, led the development of CUDA. If AI ushers in the age of a permanent white-collar underclass and autonomous weapons, just know that it would all be because someone somewhere playing Doom thought a demon’s scrotum should jiggle at 60 frames per second.

CUDA is not a programming language in itself but a “platform.” I use that weasel word because, not unlike how The New York Times is a newspaper that’s also a gaming company, CUDA has, over the years, become a nested bundle of software libraries for AI. Each function shaves nanoseconds off individual mathematical operations—added up, they make GPUs, in industry parlance, go brrr.

A modern graphics card is not just a circuit board crammed with chips and memory and fans. It’s an elaborate confection of cache hierarchies and specialized units called “tensor cores” and “streaming multiprocessors.” In that sense, what chip companies sell is like a professional kitchen, and more cores are akin to more grilling stations. But even a kitchen with 30 grilling stations won’t run any faster without a capable head chef deftly assigning tasks—as CUDA does for GPU cores.

To extend the metaphor, hand-tuned CUDA libraries optimized for one matrix operation are the equivalent of kitchen tools designed for a single job and nothing more—a cherry pitter, a shrimp deveiner—which are indulgences for home cooks but not if you have 10,000 shrimp guts to yank out. Which brings us back to DeepSeek. Its engineers went beneath this already deep layer of abstraction to write directly in PTX, a kind of assembly language for Nvidia GPUs. Let’s say the task is peeling garlic. An unoptimized GPU would go: “Peel the skin with your fingernails.” CUDA can instruct: “Smash the clove with the flat of a knife.” PTX lets you dictate each sub-instruction: “Lift the blade 2.35 inches above the cutting board, make it parallel to the clove’s equator, and strike downward with your palm at a force of 36.2 newtons.”
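To make that abstraction gap concrete: a single multiply written in CUDA C++ lowers, when compiled, to a handful of PTX instructions that spell out every load, multiply, and store. The fragment below is a hand-simplified sketch; register names are illustrative, and real compiler output carries far more bookkeeping:

```cuda
// CUDA C++ source: one line, stating only the intent
c[i] = a[i] * b[i];

// Roughly corresponding PTX, with each sub-step made explicit:
//   ld.global.f32  %f1, [%rd1];    // fetch a[i] from global memory
//   ld.global.f32  %f2, [%rd2];    // fetch b[i]
//   mul.f32        %f3, %f1, %f2;  // multiply in registers
//   st.global.f32  [%rd3], %f3;    // write c[i] back to global memory
```

Working at the PTX level means choosing, by hand, how each of those sub-steps is scheduled—the 36.2-newtons version of garlic peeling.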
