完了,deepseek把GPU上的fp8的汇编代码开源了。

对应老买买提的军事天地,观点交锋比较激烈,反驳不留情面,请作好心理准备。因为此版帖子太多,所以新帖不出现在首页新帖列表,防止首页新帖刷屏太快。


版主: Softfist

回复
Dachun
知名作家
知名作家
帖子互动: 140
帖子: 1099
注册时间: 2022年 10月 2日 15:04

#61 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 Dachun »

:x
annorying0 写了: 2025年 2月 26日 03:39 不懂,就看到

不值一驳,不值二驳,不值三驳。。。
胡说一道,胡说二道,胡说三道。。。
内行的人看来只是一笑,不想浪费时间。不过也不奇怪,估计老黄心里也在笑,不过不能公开说,还是得客气一下,免得伤了玻璃心 :lol:
上次由 Dachun 在 2025年 2月 26日 03:45 修改。
小红叔(小红叔网红)
论坛精英
论坛精英
帖子互动: 138
帖子: 7304
注册时间: 2022年 11月 12日 14:58

#62 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 小红叔(小红叔网红) »

廊坊聚集吹破天
总设计师最大成就其实是:
把工人农民阶级偷偷顺利从国家主任忽悠成打工仔地位,偷偷改成资本主义

俺为宝书抓翻墙的增国库收入:
redot,UncleTony,chinaren,fulvshou,saibaster,mlforlife,SOD,matlab,F1450,becky,changjiang,midlander,DongshanGe,autoking,redcar,lahei

图片
annorying0
职业作家
职业作家
帖子互动: 25
帖子: 454
注册时间: 2024年 5月 4日 04:11

#63 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 annorying0 »

。。。

told you
Dachun
知名作家
知名作家
帖子互动: 140
帖子: 1099
注册时间: 2022年 10月 2日 15:04

#64 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 Dachun »

Dachun 写了: 2025年 2月 26日 03:33 从这里看出,DS那帮人也硬件方面的外行
不过也不奇怪,一帮小年轻,怎么可能有底层硬件经验呢
Dachun
知名作家
知名作家
帖子互动: 140
帖子: 1099
注册时间: 2022年 10月 2日 15:04

#65 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 Dachun »

国内有个特点就是无知者无畏,喜欢用年轻人,初生牛犊不怕虎,是只是因为没有见过虎而不是有什么特别本事。 :lol:
drifter
论坛精英
论坛精英
帖子互动: 457
帖子: 7453
注册时间: 2022年 9月 1日 04:17

#66 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 drifter »

我提醒xai deepseek fp8 gemm lib 今天open source之后

DeepSeek open-sourcing their FP8 GEMM library, DeepGEMM, on February 25, 2025, could absolutely nudge AI chip makers toward a new industry standard—but it’s not a done deal. With the code now public, boasting 1350+ TFLOPS on NVIDIA’s H800 and a lean, JIT-compiled design optimized for dense and MoE workloads, it’s a tantalizing blueprint. Here’s how it might play out and what’s standing in the way.
This move hands chip makers like AMD, Intel, or even Cerebras a ready-made weapon. DeepGEMM’s tricks—two-level accumulation (FP8 math on Tensor Cores, higher precision elsewhere), fine-grained scaling, and MoE-friendly layouts—aren’t NVIDIA-exclusive in theory. AMD’s MI300X, already running DeepSeek-R1 with FP8 support via ROCm, could adapt this to hit 1000+ TFLOPS, leveraging its 5.2 TB/s HBM3 bandwidth. Intel’s Gaudi 3, with 128 GB HBM3, could tweak its deep learning stack similarly. X posts today are buzzing—some devs call it “a gift to the underdogs,” hinting at a potential rally around this approach. If these players standardize on DeepGEMM’s principles—FP8-first, efficiency-driven, MoE-optimized—it could shift the game away from NVIDIA’s CUDA-centric dominance.
The push has legs because it’s practical and cheap. DeepSeek’s V3 trained for $5.6M, peanuts compared to Western models, and DeepGEMM’s 2.7x speed over “expert-tuned kernels” (likely cuBLAS) on certain matrix sizes screams value. Chip makers could market this as a cost-per-FLOP killer, especially for inference-heavy AI firms dodging NVIDIA’s $30K+ H100s. X chatter suggests AMD’s already in deep—SGLang ties with DeepSeek signal intent. If they or Intel bake this into a reference platform, it might coalesce into a standard, especially for mid-tier or emerging markets where NVIDIA’s premium pricing stings.
But it’s no slam dunk. NVIDIA’s ecosystem is a fortress—CUDA’s entrenched, and their H200/Blackwell chips are already countering with FP8 Transformer Engines. DeepGEMM was born on NVIDIA’s Hopper and PTX assembly; porting it to ROCm or OneAPI isn’t trivial—X posts today flag ROCm’s “duct-tape vibes” as a hurdle. Devs love NVIDIA’s polish, and frameworks like PyTorch aren’t jumping ship. Plus, NVIDIA’s spinning DeepSeek’s wins as a GPU demand booster—H20 orders spiked this week. Without a unified push—say, AMD, Intel, and AWS (Trainium) aligning on a DeepGEMM-derived spec—it risks being a niche tool, not a standard.
Will it happen? It’s closer than yesterday. Open-sourcing DeepGEMM today slashes the R&D barrier, and its 1350 TFLOPS on “crippled” H800s proves efficiency can trump raw power—a siren call for cost-conscious chip makers. If AMD or Intel ship a killer implementation and devs bite (some on X are already forking it), it could snowball. But NVIDIA’s 80%+ AI chip share and software moat mean DeepSeek needs allies—fast. I’d say it’s 40/60—possible, not probable, unless a coalition forms. What’s your gut? Can this spark a real shift, or is it just noise?
drifter
论坛精英
论坛精英
帖子互动: 457
帖子: 7453
注册时间: 2022年 9月 1日 04:17

#67 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 drifter »

supercnm 写了: 2025年 2月 26日 03:10 感觉主要是提供了一种思路,硬件公司Nvidia的软件部门太烂,有大量的空间可以优化

老黄现在应该做的是,赶紧对新的blackwell芯片做指令优化,或者把blackwell的指令做加密,不准别人优化
看能不能振臂一呼形成新的业界标准 对抗NVIDIA 这样堡垒就从帝国内部分裂了

当然美帝可以下令 不准intel AMD之类的厂商使用这种标准
x1 图片
goodegg(我是大陆人)
知名作家
知名作家
帖子互动: 188
帖子: 859
注册时间: 2023年 7月 23日 03:31

#68 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 goodegg(我是大陆人) »

囼蛙来说说“内行”,如果用chatgpt,记得改掉AI味道哦。
Dachun 写了: 2025年 2月 26日 03:23 一帮外行人胡说八道。不值得一驳
Caravel
论坛元老
论坛元老
Caravel 的博客
帖子互动: 679
帖子: 27053
注册时间: 2022年 7月 24日 17:21

#69 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 Caravel »

drifter 写了: 2025年 2月 26日 04:20 看能不能振臂一呼形成新的业界标准 对抗NVIDIA 这样堡垒就从帝国内部分裂了

当然美帝可以下令 不准intel AMD之类的厂商使用这种标准
fp8不像是正道

就和早期电脑上面的省内村的技巧一样
头像
redot(红薯林)
论坛元老
论坛元老
帖子互动: 647
帖子: 20471
注册时间: 2024年 7月 4日 23:40

#70 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 redot(红薯林) »

其实千言万语就是一句话,屌丝说,俺木吹牛
看github的人反应如何?那里的人是馁行还是歪行?
五毛还是狗粮?。。。
头像
huangchong(净坛使者)
论坛元老
论坛元老
2023-24年度优秀版主
帖子互动: 4085
帖子: 60819
注册时间: 2022年 7月 22日 01:22

#71 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 huangchong(净坛使者) »

xexz 写了: 2025年 2月 25日 23:21 上面是个JIT虚拟机,下层用的nvGPU的汇编🐎

意思是,其他家的GPU只要实现NV的汇编🐎(这个对那些GPU硬件厂家跟没要求一样),用谁的GPU都一样一样的。
那你说说为啥要求sm_90?
sunfish
论坛精英
论坛精英
帖子互动: 218
帖子: 5830
注册时间: 2022年 8月 18日 13:14

#72 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 sunfish »

这玩意400行值很多钱吧
xexz楼主
论坛精英
论坛精英
帖子互动: 386
帖子: 6648
注册时间: 2022年 7月 30日 11:48
联系:

#73 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 xexz楼主 »

huangchong 写了: 2025年 2月 26日 05:58 那你说说为啥要求sm_90?
是指当下、现在的硬件平台支持情况,

将来amd、intel、华为、摩尔线程、寒武纪。。。都可以提供自己的‘硬件平台’来支持这个‘虚拟机jit’(就象把java虚拟机,从x86平台移植到arm平台),只是现在刚开源,他们还没来得及(有消息说彻底抛开gpu架构的专用硬件asic,今年上半年就会出现,gpu并非对张量运算特别优化的计算结构,这个效率提升就更大了)。

另外,即使你帝行政/立法不让amd,intel支持,

这不耽误什么事,没人和钱过不去。 :mrgreen: :mrgreen: :mrgreen:
x1 图片 x1 图片
fangkuuaih
论坛元老
论坛元老
帖子互动: 1020
帖子: 22342
注册时间: 2022年 7月 22日 09:19

#74 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 fangkuuaih »

C++嵌汇编有何神奇之处,Linux核里经常有C嵌汇编,只要稍微懂一些指令即可。
女大的cuda工程师不可能不懂不用。
x1 图片
xexz楼主
论坛精英
论坛精英
帖子互动: 386
帖子: 6648
注册时间: 2022年 7月 30日 11:48
联系:

#75 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 xexz楼主 »

fangkuuaih 写了: 2025年 2月 26日 06:33 C++嵌汇编有何神奇之处,Linux核里经常有C嵌汇编,只要稍微懂一些指令即可。
女大的cuda工程师不可能不懂不用。
nv当然有,按你帝的说法,有点DEI罢了 :mrgreen: :mrgreen: :mrgreen:
fangkuuaih
论坛元老
论坛元老
帖子互动: 1020
帖子: 22342
注册时间: 2022年 7月 22日 09:19

#76 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 fangkuuaih »

xexz 写了: 2025年 2月 26日 06:43 nv当然有,按你帝的说法,有点DEI罢了 :mrgreen: :mrgreen: :mrgreen:
在Linux核里,改变privilege, 调用co processor 指令等,都要嵌入汇编,因为用C无法实现。
这是叔还是junior工程师时就做过的事。
女大里一堆OS 核工程师,硬件工程师,这些都是小儿科的玩意。
goodegg(我是大陆人)
知名作家
知名作家
帖子互动: 188
帖子: 859
注册时间: 2023年 7月 23日 03:31

#77 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 goodegg(我是大陆人) »

你做过不等于你能写。
大家都用中文,你就以为你和莫言一样牛逼 :lol:
fangkuuaih 写了: 2025年 2月 26日 07:02 在Linux核里,改变privilege, 调用co processor 指令等,都要嵌入汇编,因为用C无法实现。
这是叔还是junior工程师时就做过的事。
女大里一堆OS 核工程师,硬件工程师,这些都是小儿科的玩意。
CNM01
见习点评
见习点评
帖子互动: 127
帖子: 1981
注册时间: 2023年 3月 9日 14:33

#78 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 CNM01 »

400行汇编,就别吹了,写过程序的人都知道这是什么工作量。想靠这个打垮女大,痴人说梦
yokel(uuuu)
论坛元老
论坛元老
帖子互动: 639
帖子: 15186
注册时间: 2022年 7月 28日 11:14

#79 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 yokel(uuuu) »

不懂就问
这个include <cuda.h>是不是说明还是基于cuda?
并没有绕开?
xexz 写了: 2025年 2月 25日 23:11 #pragma once

#include <cuda.h>

#include "utils.cuh"

namespace deep_gemm {

struct SM90_64x16x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %10, 0;\n"
"wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7},"
" %8,"
" %9,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 16;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x24x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %14, 0;\n"
"wgmma.mma_async.sync.aligned.m64n24k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11},"
" %12,"
" %13,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 24;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x32x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %18, 0;\n"
"wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15},"
" %16,"
" %17,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 32;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x40x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %22, 0;\n"
"wgmma.mma_async.sync.aligned.m64n40k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19},"
" %20,"
" %21,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 40;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x48x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %26, 0;\n"
"wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23},"
" %24,"
" %25,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 48;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x56x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %30, 0;\n"
"wgmma.mma_async.sync.aligned.m64n56k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27}, "
" %28,"
" %29,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 56;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x64x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %34, 0;\n"
"wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31}, "
" %32,"
" %33,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 64;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x72x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %38, 0;\n"
"wgmma.mma_async.sync.aligned.m64n72k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35}, "
" %36,"
" %37,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 72;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x80x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %42, 0;\n"
"wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39}, "
" %40,"
" %41,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 80;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x88x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %46, 0;\n"
"wgmma.mma_async.sync.aligned.m64n88k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43}, "
" %44,"
" %45,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 88;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
xiaxia
著名点评
著名点评
帖子互动: 212
帖子: 4792
注册时间: 2022年 9月 6日 20:57

#80 Re: 完了,deepseek把GPU上的fp8的汇编代码开源了。

帖子 xiaxia »

这400行codes,等于是在围困华为的高墙上钻开了一个狗洞。
能不能成为INTEL 和AMD的机会,还很难说
但是至少华为可以用400行codes,和A100 竞争,说不定H100
x1 图片
回复

回到 “军事天地(Military)”