The Aurora wiki is currently online (though it's occasionally taken offline), but I found a page that would not display due to some error.
So I tried the Internet Archive Wayback Machine (archive.org) to see their copy of the page. Unfortunately, all they gave me was an error message:
Page cannot be crawled or displayed due to robots.txt.
See aurorawiki.pentarch.org robots.txt page. Learn more about robots.txt.
I've been seeing this problem more and more on archive.org. It's usually caused by domain-parking outfits that buy up domain names purely to display ads and generate ad revenue. They use extremely restrictive robots.txt exclusions that block any and all bot crawling, which also causes archive.org to withhold any archived/cached content it may have (even content captured before the new owner changed the robots.txt).
Currently, your robots.txt is very simple:
User-agent: *
Disallow: /
If I understand correctly, doesn't this block any and all forms of bots and web caching?
My question:
I understand the need for a robots.txt that blocks unnecessary bots which eat up precious bandwidth. But isn't a compromise possible: allowing archive.org and search engines like Google to cache your pages without letting in just any bot? Perhaps exceptions could be added specifically for Google and archive.org?
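For example, something like this might do it (a rough sketch; I believe the Wayback Machine's crawler identifies itself as ia_archiver and Google's as Googlebot, but those user-agent tokens are worth double-checking):

User-agent: ia_archiver
Disallow:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

An empty Disallow line means "allow everything" for that user-agent, and well-behaved crawlers obey the most specific User-agent record that matches them, so the catch-all block at the bottom wouldn't apply to the two named bots.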
Most websites that I've checked out on Archive.org display just fine. To me, that says they've struck a balance: blocking unnecessary bots/crawling while still allowing Archive.org and search engines to do their thing.