SGLang Notes (mainly notes on Awesome-ML-SYS-Tutorial)

These notes mainly come from: https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/README-cn.md
plus a few things I branched into while reading.

Radix Attention

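Fairly easy to understand; reference: https://github.com/CalvinXKY/InfraTech/blob/main/llm_infer/sglang_radix_attention.ipynb
Attention with a KV cache added.
Uses a radix tree to compress shared prefixes, somewhat like a trie.
Uses reference counting for lifetime management.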
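
The prefix sharing and reference counting described above can be sketched as follows. This is a minimal illustration, not SGLang's implementation: it uses a plain per-token trie instead of a compressed radix tree, and the names `Node`, `PrefixCache`, `insert_and_lock`, `unlock`, and `evict` are all hypothetical. It also evicts every unreferenced node at once, whereas a real cache would typically evict only under memory pressure.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict[int, "Node"] = field(default_factory=dict)
    parent: "Node | None" = None
    ref_count: int = 0  # number of running requests pinning this node's KV


class PrefixCache:
    """Toy per-token trie with reference counting (hypothetical API)."""

    def __init__(self) -> None:
        self.root = Node()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert_and_lock(self, tokens: list[int]) -> Node:
        """Insert a sequence and pin every node on its path."""
        node = self.root
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                child = Node(parent=node)
                node.children[t] = child
            child.ref_count += 1
            node = child
        return node

    def unlock(self, leaf: Node) -> None:
        """Release the pins taken by insert_and_lock."""
        node = leaf
        while node is not self.root:
            node.ref_count -= 1
            node = node.parent

    def _size(self, node: Node) -> int:
        return 1 + sum(self._size(c) for c in node.children.values())

    def evict(self) -> int:
        """Drop all subtrees that no request references; return nodes freed."""
        freed, stack = 0, [self.root]
        while stack:
            node = stack.pop()
            for tok, child in list(node.children.items()):
                if child.ref_count == 0:
                    freed += self._size(child)
                    del node.children[tok]
                else:
                    stack.append(child)
        return freed


cache = PrefixCache()
req_a = cache.insert_and_lock([1, 2, 3, 4])
req_b = cache.insert_and_lock([1, 2, 3, 9])  # shares the prefix [1, 2, 3]
shared = cache.match_prefix([1, 2, 3, 7])
cache.unlock(req_a)
freed_after_a = cache.evict()  # only the node for token 4 is unreferenced
cache.unlock(req_b)
freed_after_b = cache.evict()  # now the whole remaining path can go
```

Because locking always increments the whole path from the root, a node's count never exceeds its parent's, so an unreferenced node implies an unreferenced subtree.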

SGLang

https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/sglang/code-walk-through/readme-CN.md
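- Launch http_server/grpc_server, which contains TokenizerManager/DetokenizerManager/Scheduler. The core job of the Tokenizer and Detokenizer is message routing; all model-side logic lives in the Scheduler. The scheduler's threading model is an async event loop: requests are first pushed onto a queue, then each iteration accumulates a batch of requests and calls the model worker to process it.
- Supports PD (prefill/decode) disaggregation: prefill and decode can be launched as separate servers.
  - How do P and D interact? The prefill node calls _register_to_bootstrap to register its node info with the bootstrap server, then starts its event loop. Once a request is processed and the tokens are available, it sends the output to the decode node (the sending strategy depends on the disagg_kv_sender implementation, which differs per backend). For example, with mooncake, MooncakeKVManager has three roles and relies on many KV cache status checks to drive the KV cache handoff (its thread/process model deserves close attention):
    1. prefill: executes the request and calls disagg_kv_sender.send_kvcache();
    2. decode: connects to the node registered by prefill, then polls data from prefill. It has two queues: one that reserves slots (the preallocator) and one that polls whether the KV transfer has finished (poll());
    3. the transfer worker inside prefill: mainly moves the data (this involves the shard mapping between P and D nodes).
- Supports overlap (the sampling of the previous batch can overlap with the forward of the next batch, reducing GPU idle time). The code is a bit hard to follow; focus on this snippet: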

@torch.no_grad()
def event_loop_overlap_disagg_prefill(self: Scheduler) -> None:
    self.result_queue = deque()

    while True:
        # Receive requests
        recv_reqs = self.recv_requests()
        self.process_input_requests(recv_reqs)
        self.waiting_queue.extend(
            self.disagg_prefill_bootstrap_queue.pop_bootstrapped()
        )

        # Get the next batch to run
        batch = self.get_next_disagg_prefill_batch_to_run()
        self.cur_batch = batch

        # Launch the current batch
        if batch:
            batch_result = self.run_batch(batch)
            self.result_queue.append((batch.copy(), batch_result))
        else:
            batch_result = None

        # Process the last batch
        if self.last_batch:
            tmp_batch, tmp_result = self.result_queue.popleft()
            self.process_batch_result_disagg_prefill(tmp_batch, tmp_result)
        elif batch is None:
            # When the server is idle, do self-check and re-init some states
            self.self_check_during_idle()

        self.process_disagg_prefill_inflight_queue()

        # Run sample of the current batch
        # It depends on the result of the last batch (e.g., grammar), so we run it after the last batch is processed.
        self.launch_batch_sample_if_needed(batch_result)

        # Update last_batch
        self.last_batch = batch
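
The ordering in the loop above (launch the current batch first, then process the previous batch's result) can be seen in a stripped-down simulation. This is only a sketch under simplifying assumptions: strings stand in for batches and results, the "launch" is instantaneous rather than asynchronous GPU work, and `overlap_event_loop` is a hypothetical name, not an SGLang function.

```python
from collections import deque


def overlap_event_loop(batches):
    """Replay the launch-then-process-previous ordering and return an event log."""
    log = []
    result_queue = deque()
    pending = deque(batches)
    last_batch = None

    while pending or last_batch is not None:
        # Get the next batch to run (None when nothing is waiting).
        batch = pending.popleft() if pending else None

        # Launch the current batch; in the real loop this enqueues GPU work.
        if batch is not None:
            log.append(f"launch {batch}")
            result_queue.append((batch, f"result({batch})"))

        # Process the previous batch while the current one is "running".
        if last_batch is not None:
            prev_batch, _prev_result = result_queue.popleft()
            log.append(f"process {prev_batch}")

        last_batch = batch

    return log


events = overlap_event_loop(["b0", "b1", "b2"])
```

In the log, `launch b1` appears before `process b0`, which is the point of the overlap: the CPU-side handling of one batch is hidden behind the forward pass of the next.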

https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/sglang/code-walk-through/multimodal_request_lifecycle.md
- For how multimodal inputs interact with RoPE, look at get_rope_index in MRotaryEmbedding.

Chunked Prefill

https://zhuanlan.zhihu.com/p/718715866
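- The cost per token of decode is higher than that of prefill. This is easy to see: prefill can process many tokens in one batched pass, while decode proceeds one token at a time and depends heavily on KV cache IO.
- So the prefill prompt is split into chunks, and decode work is inserted between the chunk stages (a decode step contributes only 1 token, so it goes into the space a prefill chunk does not fill). This lowers prefill efficiency (the prefill is split into multiple stages and incurs extra KV IO), but it gives the decode phase more resources and time.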
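
A rough sketch of the idea above, assuming a fixed per-step token budget shared by prefill and decode. The function name `schedule_steps` and the budget model are illustrative assumptions, not taken from SGLang or the linked article: each step reserves one token per running decode request and fills the rest of the budget with the next prefill chunk.

```python
def schedule_steps(prompt_len, num_decoding, token_budget):
    """Split one prompt into chunks that share each step with decode tokens.

    Returns a list of (prefill_chunk_tokens, decode_tokens) per step.
    """
    if token_budget <= num_decoding:
        raise ValueError("token_budget must leave room for a prefill chunk")

    steps = []
    remaining = prompt_len
    while remaining > 0:
        decode_tokens = num_decoding  # one token per running decode request
        chunk = min(remaining, token_budget - decode_tokens)
        steps.append((chunk, decode_tokens))
        remaining -= chunk
    return steps


steps = schedule_steps(prompt_len=1000, num_decoding=8, token_budget=512)
```

With these numbers the 1000-token prompt takes two steps instead of one, but the 8 decode requests make progress in both steps instead of waiting for the whole prefill.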

ML System Fundamentals

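torch-memory-saver
- CUDA graph: you can manually capture a sequence of torch operations into a CUDA graph, which cuts the per-kernel launch traffic between CPU and GPU (otherwise complex topologies such as FSDP can hit a CPU bottleneck). The idea is similar to how Spark/Ray in big data arrange a fixed set of operators into a DAG. It also resembles torch.compile, but torch.compile is more developer-facing and offers general-purpose speedups rather than being CUDA-specific.
- torch-memory-saver is SGLang's CUDA memory manager (it keeps GPU memory under tighter control so that CUDA graphs stay usable). It calls low-level CUDA APIs to map and manage "logical (virtual) memory" onto "physical memory", and exposes methods such as pause/resume/malloc. What I don't fully understand yet is why physical memory has to be released so often: wouldn't it be better to allocate one large block up front, reuse it, and do GC on top of it?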
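
The logical-versus-physical split can be pictured with a toy model in plain Python. This is not torch-memory-saver's API: `ToyMemorySaver` and its methods are hypothetical, `bytearray` stands in for physical GPU memory, and the toy does not preserve contents across a pause. It only shows the property that matters for CUDA graphs: the address handed out by `malloc` stays the same while the backing memory is released and later re-attached.

```python
class ToyMemorySaver:
    """Toy model: stable virtual addresses, detachable physical backing."""

    def __init__(self):
        self._next_vaddr = 0x1000
        self._sizes = {}     # vaddr -> allocation size in bytes
        self._physical = {}  # vaddr -> bytearray, absent while paused

    def malloc(self, size):
        """Reserve a virtual range and back it with physical memory."""
        vaddr = self._next_vaddr
        self._next_vaddr += size
        self._sizes[vaddr] = size
        self._physical[vaddr] = bytearray(size)
        return vaddr

    def pause(self):
        """Release physical memory but keep every virtual address reserved."""
        self._physical.clear()

    def resume(self):
        """Re-attach physical memory behind the same virtual addresses."""
        for vaddr, size in self._sizes.items():
            self._physical.setdefault(vaddr, bytearray(size))

    def physical_bytes(self):
        return sum(len(buf) for buf in self._physical.values())


saver = ToyMemorySaver()
kv_cache_ptr = saver.malloc(4096)
saver.pause()
bytes_while_paused = saver.physical_bytes()
saver.resume()
bytes_after_resume = saver.physical_bytes()
```

Anything that recorded `kv_cache_ptr` before the pause (for example a captured graph) can keep using the same address after the resume.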
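NCCL (NVIDIA Collective Communications Library)
- AllReduce: data from all GPUs is reduced (e.g. summed), and the result ends up on every GPU
- Broadcast: one source GPU sends its data to all other GPUs
- Reduce: data from all GPUs is reduced onto a single destination GPU
- AllGather: collects the data from all GPUs and delivers the concatenation to every GPU; the output size is dim * world_size
- ReduceScatter: reduces across GPUs, then scatters the result so each GPU holds one shard of it
- Send/Recv: point-to-point communication
- AllToAll: every GPU sends a distinct chunk to every other GPU and receives one chunk from each
- send/recv: the blocking point-to-point calls
- isend/irecv/batch_isend_irecv: asynchronous; you must call wait on the returned handle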
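
The semantics of the collectives listed above can be checked with a small simulation in which each "GPU" is just a Python list. This is a sketch of what each operation computes, not of how NCCL implements it; the function names mirror the collective names but are plain hypothetical helpers that take a list of per-rank buffers and return the per-rank results.

```python
def _elementwise_sum(data):
    return [sum(vals) for vals in zip(*data)]


def all_reduce(data):
    total = _elementwise_sum(data)
    return [list(total) for _ in data]


def broadcast(data, src):
    return [list(data[src]) for _ in data]


def reduce(data, dst):
    total = _elementwise_sum(data)
    return [total if rank == dst else list(buf) for rank, buf in enumerate(data)]


def all_gather(data):
    gathered = [x for buf in data for x in buf]
    return [list(gathered) for _ in data]  # each output is world_size times larger


def reduce_scatter(data):
    n = len(data)
    total = _elementwise_sum(data)
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_to_all(data):
    n = len(data)
    chunk = len(data[0]) // n
    return [
        [x for src in range(n) for x in data[src][dst * chunk:(dst + 1) * chunk]]
        for dst in range(n)
    ]


ranks = [[1, 2], [3, 4]]  # two ranks, two elements each
```

Note that `reduce_scatter` followed by `all_gather` produces the same result as `all_reduce`, which is the decomposition the ring algorithm relies on.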
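Communication algorithms: ring vs tree
- ring: allreduce takes 2*(n-1) communication steps (n-1 steps of reduce-scatter to do the summation + n-1 steps of all-gather to distribute the result)
- tree: allreduce takes about 2*log2(n) communication steps (reduce up the tree, then broadcast back down)
- Starting with NCCL 2.4, cross-node communication with a large number of nodes uses the Double Binary Tree algorithm. Compared with the traditional tree algorithm, it builds two complementary binary trees to balance the communication load.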
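
The 2*(n-1) step count for ring allreduce can be verified with a small simulation. This is a sketch under simplifying assumptions: each rank's buffer is split into exactly n one-element chunks, all sends within a step happen simultaneously, and `ring_all_reduce` and `tree_steps` are hypothetical helper names. The tree figure assumes a balanced binary tree with one reduce pass up and one broadcast pass down.

```python
import math


def ring_all_reduce(data):
    """Simulate ring allreduce (sum); return (per-rank buffers, step count)."""
    n = len(data)
    bufs = [list(buf) for buf in data]  # each rank holds n one-element chunks
    steps = 0

    # Phase 1, reduce-scatter: after n-1 steps rank r owns the full sum
    # of chunk (r + 1) % n.
    for s in range(n - 1):
        sends = [((r - s) % n, bufs[r][(r - s) % n]) for r in range(n)]
        for r, (idx, val) in enumerate(sends):
            bufs[(r + 1) % n][idx] += val
        steps += 1

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for s in range(n - 1):
        sends = [((r + 1 - s) % n, bufs[r][(r + 1 - s) % n]) for r in range(n)]
        for r, (idx, val) in enumerate(sends):
            bufs[(r + 1) % n][idx] = val
        steps += 1

    return bufs, steps


def tree_steps(n):
    """Step count for a reduce-then-broadcast over a balanced binary tree."""
    return 2 * math.ceil(math.log2(n))


result, ring_step_count = ring_all_reduce(
    [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
)
```

For 8 ranks the ring needs 14 steps while the tree needs 6, which is why the step count matters more as the number of nodes grows.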
rl-memory-management(https://hebiao064.github.io/rl-memory-management)
- Haven't read it yet; will come back to it after learning some RL.