#speculative-decoding article collection

Dev.to

Optimizing PDF analysis with a Gemini-Qwen hybrid pipeline

How I Built an AI Exam App in 8 Months to outsource studying

AI/ML · intermediate · 12 min read · June 28, 2026

GeekNews

DSpark: Accelerating LLM inference with speculative decoding [pdf]

DSpark: 60-85% faster speculative decoding with a semi-autoregressive architecture

AI/ML · advanced · 24 min read · June 28, 2026

Dev.to

2-4x lossless inference speedup with grafted-head-based DSpark

DeepSeek's DSpark Brings Speculative Decoding Back Into the Spotlight: Here's What Developers Need to Know

AI/ML · advanced · 11 min read · June 28, 2026

Dev.to

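Projecting the future evolution of AI from the similarities and differences between SGLang and vLLM (从SGLang、vLLM的异同推演未来AI演化)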

Up to 70% higher agent throughput with RadixAttention, plus a PD (prefill-decode) disaggregation architecture

AI/ML · advanced · 16 min read · June 26, 2026

Dev.to

Self-hosting LLMs: the trade-off between dropping a model tier under VRAM limits and gaining control

The Open-Model Cost Chart Everyone's Sharing Is API Prices. Here's What Self-Hosting Actually Gets You (Measured)

AI/ML · intermediate · 13 min read · June 23, 2026

Dev.to

DeepSeek V4-Pro's 75% price cut and Gemini 3.5 Flash's market entry

AI API Price War: DeepSeek V4-Pro Cuts 75% & Gemini 3.5 Flash Lands

AI/ML · intermediate · 12 min read · June 22, 2026

Dev.to

Resolving the CPU orchestration bottleneck on 96GB of VRAM, and an analysis of API economics

I spent two weeks optimizing 96GB of VRAM for local LLMs. Paid APIs still won.

AI/ML · advanced · 4 min read · June 20, 2026

GeekNews

Local Qwen is not a worse Opus; it is a different tool

Securing data sovereignty and recovering revenue by adopting local Qwen on an RTX 6000

AI/ML · advanced · 32 min read · June 19, 2026

Hacker News

1M context via IndexShare and the top performance ranking among open models

GLM-5.2: The Most Powerful Open Model yet and the Brutal Reality of Running It

AI/ML · advanced · 15 min read · June 19, 2026

Dev.to

NVFP4 quantization cuts Qwen3.6-35B VRAM 3.06x, from 71GB to 23GB

Qwen3.6-35B NVFP4 runs on one H100; A100 owners are out

AI/ML · advanced · 25 min read · June 18, 2026

Dev.to

1.9x throughput and verified reliability after resolving numerical mismatches in speculative decoding

Speculative decoding shifted our output distribution and evals missed it

AI/ML · advanced · 12 min read · June 18, 2026

Dev.to

Moving past Ollama's abstraction wall to a high-performance open inference setup built on llama.cpp

1,175 Redditors Just Told You to Stop Using Ollama: Here's Why Local AI Tooling Got Serious

AI/ML · intermediate · 17 min read · June 18, 2026

Dev.to

2.9x reduction in compute FLOPs and 1M context via IndexShare

Open-Weights Long-Horizon Coding LLMs: India's AI Future 2026

AI/ML · advanced · 52 min read · June 17, 2026

Hugging Face Blog

1M context and a 2.9x reduction in per-token FLOPs with IndexShare

GLM-5.2: Built for Long-Horizon Tasks

AI/ML · advanced · 36 min read · June 17, 2026

GeekNews

MiMo-V2.5-Pro-UltraSpeed: a 1T model that generates 1,000 tokens per second

Model-system codesign that reaches 1,000 TPS for a 1T model on commodity GPUs

AI/ML · advanced · 20 min read · June 9, 2026

Dev.to

2.25x faster Qwen3.6-27B inference with MTP and an optimization stack

Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)

AI/ML · advanced · 13 min read · June 9, 2026

Hacker News

1T model surpasses 1,000 tokens/s on commodity GPUs

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

AI/ML · advanced · 27 min read · June 8, 2026

InfoQ

Up to 2.2x faster Gemma 4 inference with MTP-based speculative decoding

Google LiteRT-LM Speeds Up Local Inference Up to 2.2x With Gemma 4 Multi-Token Prediction

AI/ML · advanced · 7 min read · June 5, 2026

Dev.to

Speculative decoding cuts p50 TTFT from 380ms to 140ms

Speculative decoding: when and why it actually speeds up inference

AI/ML · advanced · 26 min read · June 5, 2026

Dev.to

llama.cpp reaches vLLM-class 70 t/s with tensor parallelism

llama.cpp b9455 Finally Caught vLLM: 70t/s on 2x3090 Qwen 27B UQ8

AI/ML · advanced · 8 min read · June 3, 2026