Scrapy 🕷️, but in Go: Building High-Performance Scrapers without the Boilerplate
Dev.to
Backend

Go์˜ Concurrency ๊ธฐ๋ฐ˜ Scrapy ์Šคํƒ€์ผ ํ”„๋ ˆ์ž„์›Œํฌ GoScrapy ์„ค๊ณ„

Goscrapy · April 15, 2026 · 7 min · intermediate

Context

GoScrapy is an attempt to port the structural advantages of Python's Scrapy to Go in order to build a high-performance scraping environment. Existing Go libraries provide only basic functionality, so handling requests at scale and implementing retry logic means writing a growing amount of boilerplate code. That is the problem the framework sets out to solve.

Technical Solution

  • Go 1.22+ ๋ฒ„์ „์˜ ๋‚ด์žฅ Concurrency๋ฅผ ํ™œ์šฉํ•œ ๊ณ ์„ฑ๋Šฅ ๋น„๋™๊ธฐ ์š”์ฒญ ์ฒ˜๋ฆฌ ๊ตฌ์กฐ ์„ค๊ณ„
  • Spider-Engine-Pipeline์œผ๋กœ ์ด์–ด์ง€๋Š” ๊ด€์‹ฌ์‚ฌ ๋ถ„๋ฆฌ ์•„ํ‚คํ…์ฒ˜๋ฅผ ํ†ตํ•œ ๋ฐ์ดํ„ฐ ์ถ”์ถœ ๋กœ์ง์˜ ๋ชจ๋“ˆํ™”
  • Middleware ์ธํ„ฐํŽ˜์ด์Šค ๋„์ž…์„ ํ†ตํ•œ Retry with Backoff ๋ฐ DupeFilter์˜ ํ”Œ๋Ÿฌ๊ทธ์ธ ๋ฐฉ์‹ ๊ตฌํ˜„
  • Request ๊ฐ์ฒด ์ฒด์ด๋‹ ๋ฐฉ์‹์˜ DSL์„ ๋„์ž…ํ•˜์—ฌ URL, Meta, Header ์„ค์ •์„ ์ง๊ด€์ ์œผ๋กœ ์ฒ˜๋ฆฌ
  • IResponseReader ์ธํ„ฐํŽ˜์ด์Šค ๊ธฐ๋ฐ˜์˜ CSS Selector ์ถ”์ƒํ™”๋กœ ๋‹ค์–‘ํ•œ HTML ํŒŒ์‹ฑ ์ „๋žต ์ง€์›
  • Telemetry Hub ๊ธฐ๋ฐ˜์˜ Observer ํŒจํ„ด์„ ์ ์šฉํ•˜์—ฌ TUI ๋Œ€์‹œ๋ณด๋“œ์— ์‹ค์‹œ๊ฐ„ ์ง€ํ‘œ ์ „์†ก

- When designing large-scale scraping, consider a callback structure that separates request creation from the functions that handle responses.
- Apply a DupeFilter to prevent duplicate requests and an Exponential Backoff middleware to cope with unstable networks.
- Split out the data pipeline to remove the coupling between extraction logic and storage logic (CSV, DB, and so on).
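
The points above can be sketched roughly as follows: a retry helper with exponential backoff, a parse callback that knows nothing about storage, and a Pipeline interface with a CSV implementation. All identifiers (BackoffDelay, Retry, Item, Pipeline, PipelineFunc, CSVPipeline, Callback, Handle, ErrEmptyBody) and the chosen delays are assumptions made for this example, not GoScrapy's API.

```go
package main

import (
	"encoding/csv"
	"errors"
	"io"
	"time"
)

const baseDelay = 100 * time.Millisecond // illustrative default

// ErrEmptyBody is returned when a fetch succeeds but yields nothing to parse.
var ErrEmptyBody = errors.New("empty response body")

// BackoffDelay returns base doubled once per attempt, capped at limit.
func BackoffDelay(base, limit time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		d = limit
	}
	return d
}

// Retry calls fn up to attempts times, sleeping with backoff between failures.
func Retry(attempts int, base, limit time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(BackoffDelay(base, limit, i))
		}
	}
	return err
}

// Item is one extracted record; Pipeline is any sink that can store it.
type Item map[string]string

type Pipeline interface {
	Process(Item) error
}

// PipelineFunc adapts a plain function to the Pipeline interface.
type PipelineFunc func(Item) error

func (f PipelineFunc) Process(it Item) error { return f(it) }

// CSVPipeline writes the configured fields of each item as one CSV row.
type CSVPipeline struct {
	w      *csv.Writer
	fields []string
}

func NewCSVPipeline(out io.Writer, fields ...string) *CSVPipeline {
	return &CSVPipeline{w: csv.NewWriter(out), fields: fields}
}

func (p *CSVPipeline) Process(it Item) error {
	row := make([]string, len(p.fields))
	for i, f := range p.fields {
		row[i] = it[f]
	}
	if err := p.w.Write(row); err != nil {
		return err
	}
	p.w.Flush()
	return p.w.Error()
}

// Callback parses a response body and emits items; storage is not its concern.
type Callback func(body string, emit func(Item) error) error

// Handle fetches with retry, then passes the body to the callback,
// whose emitted items flow into the pipeline.
func Handle(fetch func() (string, error), parse Callback, p Pipeline) error {
	var body string
	err := Retry(3, baseDelay, 8*baseDelay, func() (e error) {
		body, e = fetch()
		return e
	})
	if err != nil {
		return err
	}
	if body == "" {
		return ErrEmptyBody
	}
	return parse(body, p.Process)
}
```

Because Handle depends only on the Pipeline interface, swapping NewCSVPipeline(os.Stdout, "title") for a database-backed implementation would not require any change to the parse callback.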