Partial Scrapes of 8chan (8ch.net)
August 6, 2019
NOTICE: Much of this content is NSFW and may be illegal to view or possess in your jurisdiction. This is for researchers only. I have not yet viewed any content in here and am not aware of anything inside other than counts of how many HTML and JPG files exist.
I scraped 8chan using three different methods between 6 August and 8 August 2019 for research and journalistic purposes in fear that it might disappear from the internet at any minute. I believe its content is worth analyzing.
This scrape was accomplished using wget and httrack pointed at an nginx reverse proxy, which proxied to 8ch.net's origin servers. These true server IPs, behind Cloudflare/BitMitigate, were discovered and widely shared within the cyber research community. The only modification I made to the content was replacing "https://8ch.net" and its subdomains with "http://8ch.net" to simplify the reverse proxying.
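The scheme rewrite described above can be illustrated with a minimal Python sketch. This is not the code used for the scrape. The function name `downgrade_scheme` and the regular expression are assumptions for illustration, and the sketch keeps each subdomain intact while switching only the scheme.

```python
import re

# Hypothetical pattern: matches https://8ch.net and any subdomain
# (for example https://media.8ch.net), but not look-alike hosts
# such as https://8ch.network or https://not8ch.net.
_HTTPS_8CH = re.compile(
    r"https://((?:[a-z0-9-]+\.)*8ch\.net)(?![\w-])",
    re.IGNORECASE,
)


def downgrade_scheme(html: str) -> str:
    """Rewrite https:// links to 8ch.net hosts as http://, leaving other URLs alone."""
    return _HTTPS_8CH.sub(r"http://\1", html)


if __name__ == "__main__":
    sample = '<a href="https://media.8ch.net/file.jpg">img</a>'
    print(downgrade_scheme(sample))
```

Only the scheme changes, so paths, query strings, and links to unrelated sites pass through untouched.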
Consider this archive to be NSFW. I do not endorse viewing it, especially if you are not a researcher.
Due to the dynamic nature of the internet, there is no such thing as a "complete scrape". All scrapes of any website are partial snapshots in time on a per-request basis. To capture the most complete view in a limited amount of time, I used three approaches concurrently.
- 8ch-httrack-media: The most browsable and NSFW scrape, but the least complete. 111 HTML files, 5,140 images, other media. This scrape included the media.8ch.net image server.
- 8ch-httrack: 596 HTML files scraped using httrack with a focus on text content.
- 8ch-wget: 39,246 HTML files scraped using the tool wget with a focus on text content.
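File counts like the ones above can be gathered without opening any of the files. The following is a minimal Python sketch, assuming each scrape has been extracted into a local directory named after its list entry. The directory layout and the helper name `count_by_extension` are assumptions, not part of the archive.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root: str) -> Counter:
    """Count regular files under root, grouped by lowercased file extension."""
    counts: Counter = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are counted under the empty string.
            counts[path.suffix.lower()] += 1
    return counts


if __name__ == "__main__":
    # Assumed directory names, taken from the list entries above.
    for name in ("8ch-httrack-media", "8ch-httrack", "8ch-wget"):
        counts = count_by_extension(name)
        html = counts[".html"] + counts[".htm"]
        jpg = counts[".jpg"] + counts[".jpeg"]
        print(f"{name}: {html} HTML files, {jpg} JPG files")
```

The walk reads only file names and metadata, so the totals can be checked without viewing any content.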
This is provided to help researchers and journalists grok what 8chan is/was while it is offline. I do not endorse any content or imply any license.
I am a cybersecurity researcher and journalist who tracks the uptime of interesting sites on Tor at https://dark.fail
Say hello if you find this useful. I love to collaborate.
Twitter: @DarkDotFail - Email: hello aaaat dark.fail
Shouts to Gwern for his excellent DNM archives and research in general: https://www.gwern.net/DNM-archives#how-to-crawl-markets