Page: 4 / 9

#61 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 03:43
Dachun
:x
annorying0 wrote: Feb 26, 2025 03:39 I don't get it, all I see is

Not worth a first rebuttal, nor a second, nor a third...
Nonsense round one, nonsense round two, nonsense round three...
To people in the know it's just good for a laugh, not worth the time. Not that it's surprising; I'd bet Jensen is laughing inside too, but he can't say so publicly and still has to be polite, so as not to bruise any fragile egos :lol:

#62 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 03:45
小红叔
The Langfang crowd bragging sky-high

#63 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 03:46
annorying0
...

told you

#64 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 03:53
Dachun
Dachun wrote: Feb 26, 2025 03:33 You can tell from this that the DS crew are also amateurs on the hardware side
Not that it's surprising; a bunch of youngsters, how could they possibly have low-level hardware experience?

#65 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 03:56
Dachun
A hallmark back home is that the ignorant are fearless and companies love to staff young people; the newborn calf fears no tiger only because it has never seen one, not because it has any special ability. :lol:

#66 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 04:12
drifter
I prompted xAI after DeepSeek's FP8 GEMM lib was open-sourced today, and got this:

DeepSeek open-sourcing their FP8 GEMM library, DeepGEMM, on February 25, 2025, could absolutely nudge AI chip makers toward a new industry standard—but it’s not a done deal. With the code now public, boasting 1350+ TFLOPS on NVIDIA’s H800 and a lean, JIT-compiled design optimized for dense and MoE workloads, it’s a tantalizing blueprint. Here’s how it might play out and what’s standing in the way.
This move hands chip makers like AMD, Intel, or even Cerebras a ready-made weapon. DeepGEMM’s tricks—two-level accumulation (FP8 math on Tensor Cores, higher precision elsewhere), fine-grained scaling, and MoE-friendly layouts—aren’t NVIDIA-exclusive in theory. AMD’s MI300X, already running DeepSeek-R1 with FP8 support via ROCm, could adapt this to hit 1000+ TFLOPS, leveraging its 5.2 TB/s HBM3 bandwidth. Intel’s Gaudi 3, with 128 GB HBM3, could tweak its deep learning stack similarly. X posts today are buzzing—some devs call it “a gift to the underdogs,” hinting at a potential rally around this approach. If these players standardize on DeepGEMM’s principles—FP8-first, efficiency-driven, MoE-optimized—it could shift the game away from NVIDIA’s CUDA-centric dominance.
The push has legs because it’s practical and cheap. DeepSeek’s V3 trained for $5.6M, peanuts compared to Western models, and DeepGEMM’s 2.7x speed over “expert-tuned kernels” (likely cuBLAS) on certain matrix sizes screams value. Chip makers could market this as a cost-per-FLOP killer, especially for inference-heavy AI firms dodging NVIDIA’s $30K+ H100s. X chatter suggests AMD’s already in deep—SGLang ties with DeepSeek signal intent. If they or Intel bake this into a reference platform, it might coalesce into a standard, especially for mid-tier or emerging markets where NVIDIA’s premium pricing stings.
But it’s no slam dunk. NVIDIA’s ecosystem is a fortress—CUDA’s entrenched, and their H200/Blackwell chips are already countering with FP8 Transformer Engines. DeepGEMM was born on NVIDIA’s Hopper and PTX assembly; porting it to ROCm or OneAPI isn’t trivial—X posts today flag ROCm’s “duct-tape vibes” as a hurdle. Devs love NVIDIA’s polish, and frameworks like PyTorch aren’t jumping ship. Plus, NVIDIA’s spinning DeepSeek’s wins as a GPU demand booster—H20 orders spiked this week. Without a unified push—say, AMD, Intel, and AWS (Trainium) aligning on a DeepGEMM-derived spec—it risks being a niche tool, not a standard.
Will it happen? It’s closer than yesterday. Open-sourcing DeepGEMM today slashes the R&D barrier, and its 1350 TFLOPS on “crippled” H800s proves efficiency can trump raw power—a siren call for cost-conscious chip makers. If AMD or Intel ship a killer implementation and devs bite (some on X are already forking it), it could snowball. But NVIDIA’s 80%+ AI chip share and software moat mean DeepSeek needs allies—fast. I’d say it’s 40/60—possible, not probable, unless a coalition forms. What’s your gut? Can this spark a real shift, or is it just noise?
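
The "fine-grained scaling" and "two-level accumulation" mentioned above can be illustrated with a toy, CPU-side sketch; this is not DeepGEMM code, FP8 e4m3 is only crudely emulated, and all names are made up. Each 128-element block of the operands gets its own scale so its values fit the FP8 range, the per-block partial sum stands in for the Tensor Core accumulator, and the partials are promoted into an FP32 total using the block scales.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

constexpr int K_BLOCK = 128;          // granularity of the per-block scales
constexpr float FP8_E4M3_MAX = 448.f; // largest finite e4m3 value

// Crude stand-in for FP8 e4m3: clamp to the representable range and keep
// only a few mantissa bits. Real kernels use hardware conversion instead.
float fake_fp8(float x) {
    if (x == 0.f) return 0.f;
    int e; float m = std::frexp(x, &e);     // x = m * 2^e, 0.5 <= |m| < 1
    m = std::round(m * 16.f) / 16.f;        // truncate the mantissa
    float y = std::ldexp(m, e);
    return std::fmax(-FP8_E4M3_MAX, std::fmin(FP8_E4M3_MAX, y));
}

// dot(a, b) with per-128-element block scaling and FP32 promotion.
float scaled_dot(const std::vector<float>& a, const std::vector<float>& b) {
    float total = 0.f;                      // high-precision accumulator
    for (size_t k0 = 0; k0 < a.size(); k0 += K_BLOCK) {
        size_t k1 = std::min(a.size(), k0 + K_BLOCK);
        // Per-block scales so the quantized values fit the FP8 range.
        float amax_a = 1e-6f, amax_b = 1e-6f;
        for (size_t k = k0; k < k1; ++k) {
            amax_a = std::fmax(amax_a, std::fabs(a[k]));
            amax_b = std::fmax(amax_b, std::fabs(b[k]));
        }
        float sa = amax_a / FP8_E4M3_MAX, sb = amax_b / FP8_E4M3_MAX;
        float partial = 0.f;                // stands in for the Tensor Core accumulator
        for (size_t k = k0; k < k1; ++k)
            partial += fake_fp8(a[k] / sa) * fake_fp8(b[k] / sb);
        total += partial * sa * sb;         // promote with the block scales
    }
    return total;
}

int main() {
    std::vector<float> a(512), b(512);
    for (int k = 0; k < 512; ++k) { a[k] = 0.01f * k; b[k] = std::sin(0.1f * k); }
    std::printf("scaled dot = %f\n", scaled_dot(a, b));
}

Compared with one global scale per tensor, the per-block scales keep small-magnitude blocks from being squashed to zero, which is the point of the fine-grained approach.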

#67 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 04:20
drifter
supercnm wrote: Feb 26, 2025 03:10 I feel the main contribution is a line of thinking: the software arm of the hardware company Nvidia is so sloppy that there is plenty of room left to optimize

What Jensen should do now is rush instruction-level optimization for the new Blackwell chips, or encrypt Blackwell's instruction set so nobody else can optimize it
Let's see whether they can rally others into a new industry standard to take on NVIDIA; then the fortress cracks from inside the empire

Of course the US could simply order vendors like Intel and AMD not to adopt such a standard

#68 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 04:56
goodegg
Let's hear the Taiwan frog talk about being "in the know"; if you use ChatGPT, remember to edit out the AI flavor.
Dachun wrote: Feb 26, 2025 03:23 A bunch of laymen spouting nonsense. Not worth a rebuttal

#69 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 05:08
Caravel
drifter wrote: Feb 26, 2025 04:20 Let's see whether they can rally others into a new industry standard to take on NVIDIA; then the fortress cracks from inside the empire

Of course the US could simply order vendors like Intel and AMD not to adopt such a standard
FP8 doesn't feel like the orthodox path

It's like the memory-saving tricks on early computers

#70 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 05:44
redot
Really, all those words boil down to one line: the underdog says, "I wasn't bragging."
How are people on GitHub reacting? Are the folks over there insiders or outsiders?
Wumao or the "dog food" crowd? ...

#71 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 05:58
huangchong
xexz wrote: Feb 25, 2025 23:21 The top layer is a JIT virtual machine; the layer below uses NV GPU assembly code

Meaning that as long as other vendors' GPUs implement NV's assembly code (which for those GPU hardware vendors is practically no requirement at all), it makes no difference whose GPU you use.
Then tell me: why does it require sm_90?

#72 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 06:07
sunfish
These 400 lines must be worth a lot of money, no?

#73 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 06:26
xexz
huangchong wrote: Feb 26, 2025 05:58 Then tell me: why does it require sm_90?
That just refers to which hardware platforms are supported right now, today.

In the future AMD, Intel, Huawei, Moore Threads, Cambricon... can all offer their own "hardware platform" to back this "JIT virtual machine" (much like porting the Java virtual machine from x86 to ARM); it has only just been open-sourced, so they simply haven't had time yet. (Word is that dedicated ASICs dropping the GPU architecture entirely will appear in the first half of this year; since a GPU is not a compute structure specially optimized for tensor math, the efficiency gain there would be even larger.)

Besides, even if your empire's executive or legislature forbids AMD and Intel from supporting it,

that won't hold anything up; nobody argues with money. :mrgreen: :mrgreen: :mrgreen:
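
To make the JVM analogy concrete: what makes the back end swappable in principle is that the kernels are generated and compiled at run time rather than shipped as prebuilt binaries. Below is a generic, hedged sketch of that pattern on NVIDIA, using NVRTC plus the CUDA driver API; it is not DeepSeek's actual JIT module, and the kernel and all names here are made up. Build with -lnvrtc -lcuda.

#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <string>
#include <vector>

// CUDA C++ source held as a string and compiled while the program runs.
static const char* kSource = R"(
extern "C" __global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
})";

int main() {
    // 1. Compile the source to PTX at run time, targeting Hopper (sm_90).
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kSource, "scale.cu", 0, nullptr, nullptr);
    const char* opts[] = {"--gpu-architecture=compute_90"};
    if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
        size_t logSize; nvrtcGetProgramLogSize(prog, &logSize);
        std::string log(logSize, '\0'); nvrtcGetProgramLog(prog, &log[0]);
        std::fprintf(stderr, "%s\n", log.c_str());
        return 1;
    }
    size_t ptxSize; nvrtcGetPTXSize(prog, &ptxSize);
    std::string ptx(ptxSize, '\0'); nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);

    // 2. Load the freshly produced PTX and launch it through the driver API.
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;  cuModuleLoadData(&mod, ptx.c_str());
    CUfunction fn; cuModuleGetFunction(&fn, mod, "scale");

    const int n = 1024;
    std::vector<float> h(n, 1.0f);
    CUdeviceptr d; cuMemAlloc(&d, n * sizeof(float));
    cuMemcpyHtoD(d, h.data(), n * sizeof(float));

    float s = 2.0f; int nn = n;
    void* args[] = {&d, &s, &nn};
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
    cuMemcpyDtoH(h.data(), d, n * sizeof(float));
    std::printf("h[0] = %f\n", h[0]);  // expect 2.0

    cuMemFree(d); cuModuleUnload(mod); cuCtxDestroy(ctx);
}

Porting this style of library to another vendor would mean swapping the compile and load/launch steps for that vendor's toolchain while keeping the code-generation front end, which is where the real work lies.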

#74 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 06:33
fangkuuaih
What's so magical about embedding assembly in C++? The Linux kernel embeds assembly in C all the time; you only need to know a handful of instructions.
There is no way Nvidia's CUDA engineers don't know this or don't use it.

#75 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 06:43
xexz
fangkuuaih wrote: Feb 26, 2025 06:33 What's so magical about embedding assembly in C++? The Linux kernel embeds assembly in C all the time; you only need to know a handful of instructions.
There is no way Nvidia's CUDA engineers don't know this or don't use it.
Of course NV has people who can; in your empire's terms, it's just a bit of DEI :mrgreen: :mrgreen: :mrgreen:

#76 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 07:02
fangkuuaih
xexz wrote: Feb 26, 2025 06:43 Of course NV has people who can; in your empire's terms, it's just a bit of DEI :mrgreen: :mrgreen: :mrgreen:
In the Linux kernel, switching privilege levels, issuing coprocessor instructions and so on all require embedded assembly, because they simply cannot be expressed in C.
This uncle was doing that back when he was still a junior engineer.
Nvidia has piles of OS kernel engineers and hardware engineers; this stuff is child's play to them.
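
For anyone who has never seen the pattern being described, here is a minimal user-space illustration (assuming x86-64 and a GCC-compatible compiler; it is not actual kernel code): the RDTSC instruction has no C or C++ equivalent, so it is emitted with inline assembly, the same way kernels wrap privilege-switching or coprocessor instructions.

#include <cstdint>
#include <cstdio>

// Read the x86-64 time-stamp counter. No portable C/C++ construct emits
// RDTSC, so inline assembly binds its EDX:EAX outputs to two variables.
static inline uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

int main() {
    uint64_t t0 = rdtsc();
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i) x += i * 1e-9;
    uint64_t t1 = rdtsc();
    std::printf("~%llu cycles\n", static_cast<unsigned long long>(t1 - t0));
}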

#77 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 07:07
goodegg
Having done it doesn't mean you could have written this.
Everyone writes in Chinese, and you think that makes you as good as Mo Yan :lol:
fangkuuaih wrote: Feb 26, 2025 07:02 In the Linux kernel, switching privilege levels, issuing coprocessor instructions and so on all require embedded assembly, because they simply cannot be expressed in C.
This uncle was doing that back when he was still a junior engineer.
Nvidia has piles of OS kernel engineers and hardware engineers; this stuff is child's play to them.

#78 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 07:53
CNM01
400 lines of assembly, stop hyping it; anyone who has written software knows what kind of workload that is. Hoping to take down Nvidia with this is a pipe dream.

#79 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 07:54
yokel
Asking since I don't know:
doesn't this #include <cuda.h> mean it is still built on CUDA?
So it hasn't bypassed CUDA at all?
xexz wrote: Feb 25, 2025 23:11
#pragma once

#include <cuda.h>

#include "utils.cuh"

namespace deep_gemm {
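// Each SM90_64xNx32_F32E4M3E4M3_SS struct below wraps one Hopper warpgroup
// MMA ("wgmma.mma_async") PTX instruction: the A and B tiles are FP8 (e4m3)
// operands referenced through shared-memory descriptors desc_a / desc_b, the
// d00..d(N/2-1) registers are FP32 accumulators, and scale_d selects between
// D = A*B and D = A*B + D. Only the tile width N differs between structs.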

struct SM90_64x16x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %10, 0;\n"
"wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7},"
" %8,"
" %9,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 16;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x24x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %14, 0;\n"
"wgmma.mma_async.sync.aligned.m64n24k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11},"
" %12,"
" %13,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 24;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x32x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %18, 0;\n"
"wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15},"
" %16,"
" %17,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 32;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x40x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %22, 0;\n"
"wgmma.mma_async.sync.aligned.m64n40k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19},"
" %20,"
" %21,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 40;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x48x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %26, 0;\n"
"wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23},"
" %24,"
" %25,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 48;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x56x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %30, 0;\n"
"wgmma.mma_async.sync.aligned.m64n56k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27}, "
" %28,"
" %29,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 56;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x64x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %34, 0;\n"
"wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31}, "
" %32,"
" %33,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 64;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x72x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %38, 0;\n"
"wgmma.mma_async.sync.aligned.m64n72k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35}, "
" %36,"
" %37,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 72;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x80x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %42, 0;\n"
"wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39}, "
" %40,"
" %41,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 80;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

struct SM90_64x88x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %46, 0;\n"
"wgmma.mma_async.sync.aligned.m64n88k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43}, "
" %44,"
" %45,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}

__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43],
scale_d);
}

static constexpr int M = 64;
static constexpr int N = 88;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};

#80 Re: It's over, DeepSeek has open-sourced the FP8 assembly code for GPUs.

Posted on: Feb 26, 2025 08:09
xiaxia
These 400 lines of code amount to boring a doggy hole through the high wall built to besiege Huawei.
Whether it becomes Intel's and AMD's opportunity is still hard to say,
but at the very least Huawei can use these 400 lines to compete with the A100, and maybe even the H100.