MagnetDB

A Longitudinal Torrent-Discovery Dataset

SecretLab
University of Oklahoma

Abstract

MagnetDB is the longest-running continuous longitudinal torrent dataset, capturing torrents discovered through the BitTorrent Distributed Hash Table (DHT). The 2024 release spans more than five years (Dec 2018 – Sep 2024) and indexes over 28.6 million torrents and 950 million files, covering 82.9 PB of content. It also enriches 1.56 million video files (≈751k movies, 811k TV episodes) with IMDb identifiers.

By exposing granular supply-side metadata, including release-group "encoders," distribution "sites," file-level tags, and cross-referenced IMDb attributes, MagnetDB enables new empirical work on piracy dynamics, cultural diffusion of digital media, and P2P ecosystem evolution. The dataset is released under a CC BY 4.0 licence and follows FAIR principles; magnet links are provided to accredited researchers on request.

Dataset Scale

  • 28.6M Torrents
  • 950M Files
  • 82.9 PB Content Volume
  • 1.56M IMDb-Matched Videos
  • 300 Weeks of Coverage

Key Features

Feature | Why it matters
Longitudinal coverage (2018–2024) | Supports time-series analyses of supply trends and policy interventions.
Scale: 28.6M torrents / 950M files / 82.9 PB | Orders of magnitude larger than prior open torrent datasets; suitable for large-scale ML and statistical studies.
Supply-side focus | Captures who uploads what, not merely who downloads, enabling research on release-group behaviour, gift-economy incentives, and subcultural norms.
Rich, structured metadata | Extracts >40 attributes (quality, language, codec, resolution, release group, etc.) from Scene-style file names.
IMDb matching (1.56M files above the 2σ BM25 threshold) | Adds title, year, genre, ratings, cast, runtime, and other IMDb attributes.
FAIR distribution | Persistent identifiers, open formats (SQLite + CSV), annual updates, and documented provenance.

Methodology

DHT Crawler → File Parser → Metadata → IMDb Matcher → Data Release

1. DHT Crawling

Deployed the open-source magnetico crawler on self-hosted infrastructure. Setting indexer-max-neighbors=10000 kept concurrent queries open to 10k peers, so new torrents were discovered daily, shortly after publication. A burn-in phase captured historic torrents, followed by steady-state ingestion.

2. Torrent & File Parsing

Identified video files via extensions; applied Scene naming conventions (scenerules.org) to parse titles and >40 metadata tags (e.g., WEB-DL, HEVC, language, group). Mean 33.2 files per torrent; encoded 86.6M video files, discarding 8.6M with hash or encoding errors.
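The parsing step can be illustrated with a minimal Python sketch. It is not the project's parser: the extension set, the tag patterns, and the function name parse_scene_name are hypothetical and cover only a handful of the >40 tags mentioned above. The trailing "-GROUP" heuristic is also an assumption and will misfire on names that end in a hyphenated tag.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical, deliberately small lists; the real rule set is far larger.
VIDEO_EXTENSIONS = {".mkv", ".mp4", ".avi", ".mov", ".wmv"}
TAG_PATTERNS = {
    "resolution": re.compile(r"\b(2160p|1080p|720p|480p)\b", re.I),
    "source": re.compile(r"\b(WEB-DL|WEBRip|BluRay|HDTV|DVDRip)\b", re.I),
    "codec": re.compile(r"\b(HEVC|x265|x264|XviD)\b", re.I),
}
YEAR = re.compile(r"\b(19\d{2}|20\d{2})\b")
EPISODE = re.compile(r"\bS(\d{1,2})E(\d{1,2})\b", re.I)
GROUP = re.compile(r"-([A-Za-z0-9]+)$")  # naive: assumes a trailing "-GROUP"


def parse_scene_name(path: str) -> Optional[dict]:
    """Return parsed tags for a video file path, or None for non-video files."""
    p = PurePosixPath(path)
    if p.suffix.lower() not in VIDEO_EXTENSIONS:
        return None
    # Scene names use dots or underscores as word separators.
    text = p.stem.replace(".", " ").replace("_", " ")
    found = {name: rx.search(text) for name, rx in TAG_PATTERNS.items()}
    found["year"] = YEAR.search(text)
    found["episode"] = EPISODE.search(text)
    # The title is taken to be everything before the first recognised tag.
    starts = [m.start() for m in found.values() if m]
    title = text[: min(starts)].strip() if starts else text.strip()
    group = GROUP.search(text)
    ep = found["episode"]
    return {
        "title": title,
        "year": int(found["year"].group(1)) if found["year"] else None,
        "season": int(ep.group(1)) if ep else None,
        "episode": int(ep.group(2)) if ep else None,
        "resolution": found["resolution"].group(1) if found["resolution"] else None,
        "source": found["source"].group(1) if found["source"] else None,
        "codec": found["codec"].group(1) if found["codec"] else None,
        "group": group.group(1) if group else None,
    }


if __name__ == "__main__":
    print(parse_scene_name("Some.Show.S02E05.720p.WEB-DL.HEVC-GRP.mkv"))
```

Taking the title as the text before the first recognised tag is a simplification; titles that themselves begin with a year would need extra handling.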

3. IMDb Title Matching

Loaded IMDb Non-Commercial Dataset into Elasticsearch with custom edge-ngram analyser (4–15 grams). Queried each candidate title, ranked by BM25; adopted conservative 2σ cut-off (score ≥ 138) to minimise false positives, retaining 1.81% of video files.
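Two ideas in this step can be sketched in plain Python, without Elasticsearch. The first function mimics what an edge n-gram analyser emits for one token (prefixes of length 4 to 15). The second assumes the 2σ cut-off means "mean plus two population standard deviations of the BM25 scores"; that reading, the function names, and the toy scores are assumptions, and the toy data will not reproduce the reported threshold of 138.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def edge_ngrams(token: str, min_gram: int = 4, max_gram: int = 15) -> List[str]:
    """Prefixes of `token` with lengths min_gram..max_gram, anchored at the start."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def two_sigma_cutoff(scores: Sequence[float]) -> float:
    """Assumed definition: mean + 2 * population standard deviation."""
    return mean(scores) + 2 * pstdev(scores)


def keep_confident(
    matches: Sequence[Tuple[str, float]], cutoff: float
) -> List[Tuple[str, float]]:
    """Keep only (title, score) pairs whose score reaches the cut-off."""
    return [(title, score) for title, score in matches if score >= cutoff]


if __name__ == "__main__":
    print(edge_ngrams("matrix"))
    # Toy scores: nine weak matches and one strong one.
    toy = [(f"candidate-{i}", 10.0) for i in range(9)] + [("strong match", 100.0)]
    cutoff = two_sigma_cutoff([score for _, score in toy])
    print(cutoff, keep_confident(toy, cutoff))
```

A cut-off derived from the score distribution trades recall for precision: most candidates are dropped, which matches the small retained share described above.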

4. Data Packaging & Release

Stored raw torrents, parsed metadata, and matched subsets in SQLite; provided slim CSVs for quick start. Public release omits magnet URIs; researchers can request full version under controlled access to prevent facilitation of infringement.
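As a rough idea of how the SQLite release might be queried, the sketch below builds a tiny in-memory database and summarises it by discovery year. The table and column names (torrents, files, discovered_on, size) are hypothetical placeholders, as is the assumption that discovery times are stored as Unix epoch seconds; the released schema should be checked before adapting the query.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE torrents (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            discovered_on INTEGER NOT NULL  -- assumed: Unix epoch seconds
        );
        CREATE TABLE files (
            id INTEGER PRIMARY KEY,
            torrent_id INTEGER NOT NULL REFERENCES torrents(id),
            path TEXT NOT NULL,
            size INTEGER NOT NULL           -- assumed: bytes
        );
        """
    )
    conn.executemany(
        "INSERT INTO torrents (id, name, discovered_on) VALUES (?, ?, ?)",
        [
            (1, "Example.Torrent.A", 1546300800),  # 2019-01-01 UTC
            (2, "Example.Torrent.B", 1577836800),  # 2020-01-01 UTC
        ],
    )
    conn.executemany(
        "INSERT INTO files (torrent_id, path, size) VALUES (?, ?, ?)",
        [(1, "a/video.mkv", 700), (1, "a/info.nfo", 1), (2, "b/video.mp4", 300)],
    )
    return conn


def yearly_summary(conn: sqlite3.Connection) -> list:
    """Torrents, files, and total bytes per discovery year."""
    return conn.execute(
        """
        SELECT strftime('%Y', t.discovered_on, 'unixepoch') AS year,
               COUNT(DISTINCT t.id) AS torrents,
               COUNT(f.id)          AS files,
               SUM(f.size)          AS total_bytes
        FROM torrents AS t
        JOIN files AS f ON f.torrent_id = t.id
        GROUP BY year
        ORDER BY year
        """
    ).fetchall()


if __name__ == "__main__":
    for row in yearly_summary(build_demo_db()):
        print(row)
```

Against the real file, sqlite3.connect would point at the downloaded database path instead of ":memory:".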

Results & Impact

Selective Coverage Insight

Despite its size, MagnetDB covers < 5% of IMDb titles for most release years, revealing that only a fraction of global media is ever torrented; coverage spikes for culturally salient eras (e.g., 1940s classics).
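One way to compute a per-year coverage figure of this kind is sketched below. It assumes coverage is the share of distinct IMDb titles from a release year that have at least one matched file; that definition and the function name are assumptions. The tconst identifiers and release years correspond to fields in the IMDb Non-Commercial Dataset, and the sample values are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set, Tuple


def coverage_by_year(
    imdb_titles: Iterable[Tuple[str, Optional[int]]],
    matched_ids: Set[str],
) -> Dict[int, float]:
    """Fraction of IMDb titles per release year that appear in `matched_ids`.

    imdb_titles: (tconst, release_year) pairs; titles without a year are skipped.
    matched_ids: tconst values that have at least one matched file.
    """
    total: Counter = Counter()
    hits: Counter = Counter()
    for tconst, year in imdb_titles:
        if year is None:
            continue
        total[year] += 1
        if tconst in matched_ids:
            hits[year] += 1
    return {year: hits[year] / total[year] for year in sorted(total)}


if __name__ == "__main__":
    titles = [("tt1", 1999), ("tt2", 1999), ("tt3", 2000), ("tt4", None)]
    print(coverage_by_year(titles, {"tt1"}))
```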

Encoder & Site Ecology

A small number of elite release groups (e.g., RARBG, YTS, ION10) dominate supply, confirming the persistence of Scene hierarchies and gift-economy prestige incentives.

Streaming-Service Targeting

Amazon MGM, Netflix, BBC, Disney+, Hulu, and HBO Max titles appear most frequently, highlighting how exclusivity and regional licensing drive piracy supply.

Cross-disciplinary Utility

Already cited by the Kiwi Torrent Research corpus and positioned for:

  • Cultural analytics: linguistic diffusion, fan-sub activity, temporal popularity arcs
  • Policy & enforcement: evaluating takedown efficacy, modelling release-window effects
  • Security: malware risk in software torrents; comparative studies of emerging P2P platforms (IPFS, Filecoin)

Get Started with MagnetDB

Citation

@article{magnetdb2024,
  title={MagnetDB: A Longitudinal Torrent-Discovery Dataset with IMDb-Matched Movies \& TV Shows},
  author={Scott Seidenberger and Noah Pursell and Anindya Maiti},
  journal={Proceedings of the International AAAI Conference on Web and Social Media},
  volume={19},
  number={1},
  pages={2575--2586},
  year={2025},
  doi={10.1609/icwsm.v19i1.35959}
}