Aurora 4x

Off Topic => Forum Issues => Topic started by: Thundercraft on December 10, 2015, 10:10:00 AM

Title: Robot.txt Query & Archive.org Exclusion
Post by: Thundercraft on December 10, 2015, 10:10:00 AM
Currently, the Aurora wiki is online. (Sometimes, the wiki is taken offline.) But, I found a page that would not display due to some error.

Anyway, I tried to use the Internet Archive Wayback Machine (archive.org) to see their copy of the page. Unfortunately, all they gave me was an error message:

Quote
Page cannot be crawled or displayed due to robots.txt.

See aurorawiki.pentarch.org robots.txt (http://aurorawiki.pentarch.org/robots.txt) page. Learn more (http://en.wikipedia.org/wiki/Robots_exclusion_standard) about robots.txt.

I've seen this problem more and more with archive.org. It's usually due to domain parking outfits that buy up domain names purely to display ads and generate ad revenue. Their extremely restrictive robots.txt exclusions block all bot crawling and also lead Archive.org to stop displaying any archived/cached content it may have (even content from before the new owner changed robots.txt).

Currently, your robots.txt is very simple:
Quote
User-agent: *
Disallow: /

If I understand correctly, doesn't this block any and all forms of bots and web caching?
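
As a rough way to check that reading, here is a minimal Python sketch that feeds those two lines to the standard library's urllib.robotparser and asks whether a few crawlers may fetch a wiki page. The helper name is_allowed and the crawler names are illustrative assumptions; in particular, "ia_archiver" is the user-agent I believe Archive.org has honored, which is worth verifying.

```python
from urllib.robotparser import RobotFileParser

# The two rules currently served by the wiki (copied from the quote above).
RULES = """\
User-agent: *
Disallow: /
"""

# A wiki page used purely as a sample URL.
WIKI_PAGE = "http://aurorawiki.pentarch.org/index.php?title=Beam_Overview"


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if a robots.txt-compliant `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical crawler names, chosen only for illustration.
for agent in ("Googlebot", "ia_archiver", "SomeOtherBot"):
    print(agent, is_allowed(RULES, agent, WIKI_PAGE))
```

Every agent comes back as disallowed, so yes, it shuts out all compliant crawlers. Keep in mind that robots.txt is only honored by well-behaved bots, so it does not actually stop ones that ignore it.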

My question:
I understand the need to have a robots.txt that blocks unnecessary bots which suck up precious bandwidth. But isn't a compromise possible to allow Archive.org and search engines like Google to cache your pages without allowing just any bots? Perhaps exceptions could be added specifically for Google and archive.org?
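
One possible shape for such a compromise, sketched under assumptions rather than offered as the right configuration: give Googlebot and ia_archiver their own groups with an empty Disallow, and keep the blanket block for everyone else. The user-agent tokens are my assumption of what Google and Archive.org honor, and COMPROMISE_RULES and is_allowed are hypothetical names. The Python below just checks the proposed rules with urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Proposed robots.txt (an assumption, not the site's actual file):
# an empty "Disallow:" means nothing is off limits for that agent.
COMPROMISE_RULES = """\
User-agent: Googlebot
Disallow:

User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""

# A wiki page used purely as a sample URL.
WIKI_PAGE = "http://aurorawiki.pentarch.org/index.php?title=Beam_Overview"


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if a robots.txt-compliant `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# The two named crawlers should pass; any other compliant bot should not.
for agent in ("Googlebot", "ia_archiver", "SomeOtherBot"):
    print(agent, is_allowed(COMPROMISE_RULES, agent, WIKI_PAGE))
```

With these rules the two named crawlers are allowed and any other compliant bot is still turned away.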

Most websites that I've checked out on Archive.org display just fine. To me, that says they struck a balance between blocking unnecessary bots/crawling while still allowing Archive.org and search engines to do their thing.
Title: Re: Robot.txt Query & Archive.org Exclusion
Post by: Erik L on December 10, 2015, 10:19:46 AM
Which page didn't display?

You are right about the robots.txt. It was also to combat spam before I made the logins really restrictive (needing a valid account here on the forums).

I can look into freeing it up some, though that probably won't happen until the weekend. :)
Title: Re: Robot.txt Query & Archive.org Exclusion
Post by: Thundercraft on December 10, 2015, 10:26:43 AM
Quote
Which page didn't display?

It was Beam Weapons and CIWS (http://aurorawiki.pentarch.org/index.php?title=Beam_Overview). However, when I tried again it displayed fine and I can't recreate the error. I think the page may have merely timed out due to a hiccup and my slow connection.

Quote
I can look into freeing it up some, though that probably won't happen until the weekend. :)

It's appreciated. Though, hopefully, we won't have to read archived wiki pages any time soon.  ;)
Title: Re: Robot.txt Query & Archive.org Exclusion
Post by: Erik L on December 10, 2015, 10:39:04 AM
My host has issues with the Aurora wiki... They may have throttled it.

I would like to get the data from the wiki and put it in the KB here.
Title: Re: Robot.txt Query & Archive.org Exclusion
Post by: 83athom on December 10, 2015, 11:07:02 AM
I got onto the Beam Overview page. You want me to C&P it and other important pages to Word file(s) so one of you can have a field day editing them?
Title: Re: Robot.txt Query & Archive.org Exclusion
Post by: Erik L on December 10, 2015, 11:21:59 AM
Quote
I got onto the Beam Overview page. You want me to C&P it and other important pages to Word file(s) so one of you can have a field day editing them?

You could just copy & paste it into a new KB article :D
Title: Re: Robot.txt Query & Archive.org Exclusion
Post by: 83athom on December 10, 2015, 11:33:00 AM
They're getting 403s when I try to post, and I'm busy atm so I can't go into it and fix it. Although it does look good in a Word document (I'll attach below) (cannot attach .docx apparently)
Title: Re: Robot.txt Query & Archive.org Exclusion
Post by: Erik L on December 10, 2015, 01:19:53 PM
Quote
They're getting 403s when I try to post, and I'm busy atm so I can't go into it and fix it. Although it does look good in a Word document (I'll attach below) (cannot attach .docx apparently)

Allowed file types: gif, jpg, pdf, png, csv, txt, zip

Zip it. :)
Title: Re: Robot.txt Query & Archive.org Exclusion
Post by: Mor on January 08, 2016, 10:38:17 AM
Quote
Currently, the Aurora wiki is online. (Sometimes, the wiki is taken offline.) But, I found a page that would not display due to some error.

I think that Erik or the host changed the setting. Previously, the setting excluded the forum; now it excludes the wiki. Apparently we can't have both :(

Quote
I got onto the Beam Overview page. You want me to C&P it and other important pages to Word file(s) so one of you can have a field day editing them?

Honestly, the KB is a waste of time. People always get excited about new things, but I have yet to see a single good implementation. A KB is like a tutorial post and suffers from many of the same limitations and issues, especially with continued development.

Quote
My host has issues with the Aurora wiki... They may have throttled it.

I would like to get the data from the wiki and put it in the KB here.

You might want to back up the wiki DB to avoid losing info, especially if you are considering changing hosts.