Partial Scrapes of 8chan (8ch.net)

August 6, 2019

Torrent: 8chan Archive by dark.fail - 2019-08-06 (3.22 GB) Magnet

NOTICE: Much of this content is NSFW and may be illegal to view or possess in your jurisdiction. This is for researchers only. I have not yet viewed any of the content and know nothing about it beyond the counts of HTML and JPG files it contains.

I scraped 8chan using three different methods between 6 August and 8 August 2019 for research and journalistic purposes, out of concern that it might disappear from the internet at any moment. I believe its content is worth analyzing.

This scrape was accomplished using wget and httrack pointed at an nginx reverse proxy, which forwarded requests to 8ch.net's origin servers. The true IP addresses of these servers, hidden behind Cloudflare/BitMitigate, had been discovered and widely shared within the cyber research community. The only modification I made to the content was replacing "https://8ch.net" (and its subdomains) with "http://8ch.net" to simplify the reverse proxying.
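
The https-to-http replacement described above can be sketched in Python as follows. This is only an illustration, not the code used to produce the archive: the function name downgrade_scheme and the regular expression are assumptions, and the actual rewrite may have been done differently (for example, inside the proxy itself).

```python
import re

# Hypothetical pattern: matches "https://8ch.net" and any subdomain such as
# "https://media.8ch.net". The lookahead stops it from matching look-alike
# hosts such as "8ch.network" or "8ch.net.example.com".
_HTTPS_8CH = re.compile(
    r"https://((?:[a-z0-9-]+\.)*8ch\.net)(?![a-z0-9-]|\.[a-z0-9])",
    re.IGNORECASE,
)


def downgrade_scheme(html: str) -> str:
    """Return html with https://8ch.net (and subdomain) URLs rewritten to http://."""
    return _HTTPS_8CH.sub(r"http://\1", html)


if __name__ == "__main__":
    sample = '<a href="https://8ch.net/index.html"><img src="https://media.8ch.net/a.jpg"></a>'
    print(downgrade_scheme(sample))
```

Links to other hosts are left untouched, so only references to 8ch.net itself change.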

Consider this archive to be NSFW. I do not endorse viewing it, especially if you are not a researcher.

Due to the dynamic nature of the internet, there is no such thing as a "complete scrape". All scrapes of any website are partial snapshots in time on a per-request basis. To capture the most complete view in a limited amount of time, I used three approaches concurrently.

What's inside

- 8ch-httrack-media: The most browsable and NSFW scrape, but the least complete. 111 HTML files, 5,140 images, and other media. This scrape included the media.8ch.net image server.
- 8ch-httrack: 596 HTML files scraped using httrack with a focus on text content.
- 8ch-wget: 39,246 HTML files scraped using the tool wget with a focus on text content.
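
File counts like the ones listed above can be produced without opening any of the files. The following Python sketch tallies files by extension. It assumes the three scrapes are extracted as directories named 8ch-httrack-media, 8ch-httrack and 8ch-wget in the current working directory; the function name count_by_extension is hypothetical, and the totals it reports may differ from the figures above depending on how the files were originally counted.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count the files under root, keyed by lowercased extension (e.g. '.html')."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


if __name__ == "__main__":
    # Assumed directory names, taken from the list above.
    for name in ("8ch-httrack-media", "8ch-httrack", "8ch-wget"):
        if not Path(name).is_dir():
            continue
        counts = count_by_extension(name)
        html = counts[".html"] + counts[".htm"]
        jpg = counts[".jpg"] + counts[".jpeg"]
        print(f"{name}: {html} HTML files, {jpg} JPG files")
```

The script only reads file names and never file contents, which matches the goal of summarizing the archive without viewing it.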

This is provided to help researchers and journalists grok what 8chan is/was while it is offline. I do not endorse any content or imply any license.

I am a cybersecurity researcher and journalist who tracks the uptime of interesting sites on Tor at https://dark.fail

Say hello if you find this useful. I love to collaborate.

Twitter: @DarkDotFail - Email: hello aaaat dark.fail

Shouts to Gwern for his excellent DNM archives and research in general: https://www.gwern.net/DNM-archives#how-to-crawl-markets