The Aurora wiki is currently online (though it's occasionally taken offline), but I found a page that would not display due to some error.
So I tried the Internet Archive Wayback Machine (archive.org) to see their copy of the page. Unfortunately, all they gave me was an error message:
Page cannot be crawled or displayed due to robots.txt.
See aurorawiki.pentarch.org robots.txt page. Learn more about robots.txt.
I've been seeing this problem more and more on archive.org. It's usually caused by domain-parking outfits that buy up domain names purely to display ads and generate ad revenue. They use extremely restrictive robots.txt exclusions that block any and all bot crawling, which also causes archive.org to withhold any archived/cached content it may have (even content captured before the new owner changed the robots.txt).
Currently, your robots.txt is very simple:
User-agent: *
Disallow: /
If I understand correctly, doesn't this block any and all forms of bots and web caching?
My question:
I understand the need for a robots.txt that blocks unnecessary bots which eat up precious bandwidth. But isn't a compromise possible: allowing archive.org and search engines like Google to cache your pages without letting in just any bot? Perhaps exceptions could be added specifically for Google and archive.org?
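For example, something like this might do it (a rough sketch; I believe the Wayback Machine's crawler identifies itself as ia_archiver and Google's as Googlebot, but those user-agent tokens are worth double-checking):

User-agent: ia_archiver
Disallow:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

An empty Disallow line means "allow everything" for that user-agent, and well-behaved crawlers obey the most specific User-agent record that matches them, so the catch-all block at the bottom wouldn't apply to the two named bots.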
Most websites that I've checked out on Archive.org display just fine. To me, that says they've struck a balance: blocking unnecessary bots/crawling while still allowing Archive.org and search engines to do their thing.