Author Topic: Robot.txt Query & Archive.org Exclusion  (Read 4974 times)


Offline Thundercraft (OP)

  • Warrant Officer, Class 1
  • *****
  • Posts: 86
  • Thanked: 7 times
  • Ensign Navigator
Robot.txt Query & Archive.org Exclusion
« on: December 10, 2015, 10:10:00 AM »
Currently, the Aurora wiki is online. (Sometimes, the wiki is taken offline.) But, I found a page that would not display due to some error.

Anyway, I tried to use the Internet Archive Wayback Machine (archive.org) to see their copy of the page. Unfortunately, all they gave me is an error message:

Quote
Page cannot be crawled or displayed due to robots.txt.

See aurorawiki.pentarch.org robots.txt page. Learn more about robots.txt.

I've seen this problem more and more with archive.org. It's usually due to domain parking outfits that buy up domain names purely to display ads and generate ad revenue. They have extremely restrictive robots.txt exclusions that prevent any and all forms of bot crawling and, because Archive.org honors robots.txt retroactively, also stop it from displaying any archived/cached content it may have (even content from before robots.txt was changed by the new owner).

Currently, your robots.txt is very simple:
Quote
User-agent: *
Disallow: /

If I understand correctly, doesn't this block any and all forms of bots and web caching?
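
To check my reading of those two lines, here is a minimal sketch using Python's standard urllib.robotparser. The bot name "SomeOtherBot" is made up for illustration, "ia_archiver" is the user-agent I believe the Wayback Machine has historically honored, and the http:// scheme on the wiki URL is my assumption.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt quoted above.
CURRENT_RULES = """\
User-agent: *
Disallow: /
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# "SomeOtherBot" is a hypothetical crawler name used only for this check.
for agent in ("Googlebot", "ia_archiver", "SomeOtherBot"):
    allowed = is_allowed(CURRENT_RULES, agent, "http://aurorawiki.pentarch.org/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Every agent comes back as blocked, which matches what I expected: any crawler that respects robots.txt is told to stay away from the whole site.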

My question:
I understand the need to have a robots.txt that blocks unnecessary bots which suck up precious bandwidth. But isn't a compromise possible to allow Archive.org and search engines like Google to cache your pages without allowing just any bots? Perhaps exceptions could be added specifically for Google and archive.org?

Most websites that I've checked out on Archive.org display just fine. To me, that says they struck a balance between blocking unnecessary bots/crawling while still allowing Archive.org and search engines to do their thing.
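
Something along these lines is what I have in mind. This is only a sketch, not a tested configuration for your server: the rules in COMPROMISE_RULES are my suggestion, "ia_archiver" is the user-agent I believe Archive.org's crawler has historically answered to (worth verifying against their current documentation), and "SomeOtherBot" is a made-up name. The Python around it just checks the rules with the standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Suggested rules (an assumption, not the site's actual file):
# an empty "Disallow:" means "nothing is disallowed" for that agent,
# while every other compliant bot still falls under "Disallow: /".
COMPROMISE_RULES = """\
User-agent: Googlebot
Disallow:

User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# The http:// scheme is assumed; "SomeOtherBot" is hypothetical.
WIKI_URL = "http://aurorawiki.pentarch.org/"
for agent in ("Googlebot", "ia_archiver", "SomeOtherBot"):
    allowed = is_allowed(COMPROMISE_RULES, agent, WIKI_URL)
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With these rules the two named crawlers are allowed and everything else stays blocked. Keep in mind that robots.txt is only honored by well-behaved bots, so it would not replace the login restrictions for fighting spam.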
"Not only is the universe stranger than we imagine, it is stranger than we can imagine." - Sir Arthur Stanley Eddington
 

Offline Erik L

  • Administrator
  • Admiral of the Fleet
  • *****
  • Posts: 5654
  • Thanked: 366 times
  • Forum Admin
  • Discord Username: icehawke
  • 2020 Supporter 2020 Supporter : Donate for 2020
    2022 Supporter 2022 Supporter : Donate for 2022
    Gold Supporter Gold Supporter : Support the forums with a Gold subscription
    2021 Supporter 2021 Supporter : Donate for 2021
Re: Robot.txt Query & Archive.org Exclusion
« Reply #1 on: December 10, 2015, 10:19:46 AM »
Which page didn't display?

You are right about the robots.txt. It was also to combat spam before I made the logins really restrictive (needing a valid account here on the forums).

I can look into freeing it up some, though that probably won't happen until the weekend. :)

Offline Thundercraft (OP)

  • Warrant Officer, Class 1
  • *****
  • Posts: 86
  • Thanked: 7 times
  • Ensign Navigator
Re: Robot.txt Query & Archive.org Exclusion
« Reply #2 on: December 10, 2015, 10:26:43 AM »
Quote from: Erik L
Which page didn't display?

It was Beam Weapons and CIWS. However, when I tried again it displayed fine and I can't recreate the error. I think the page may have merely timed out due to a hiccup and my slow connection.

Quote from: Erik L
I can look into freeing it up some, though that probably won't happen until the weekend. :)

It's appreciated. Though, hopefully, we won't have to read archived wiki pages any time soon.  ;)
 

Offline Erik L

  • Administrator
  • Admiral of the Fleet
  • *****
  • Posts: 5654
  • Thanked: 366 times
  • Forum Admin
  • Discord Username: icehawke
Re: Robot.txt Query & Archive.org Exclusion
« Reply #3 on: December 10, 2015, 10:39:04 AM »
My host has issues with the Aurora wiki... They may have throttled it.

I would like to get the data from the wiki and put it in the KB here.

Offline 83athom

  • Big Ship Commander
  • Vice Admiral
  • **********
  • Posts: 1261
  • Thanked: 86 times
Re: Robot.txt Query & Archive.org Exclusion
« Reply #4 on: December 10, 2015, 11:07:02 AM »
I got onto the Beam Overview page. You want me to C&P it and other important pages to a Word file(s) so one of you can have a field day editing them?
« Last Edit: December 10, 2015, 11:10:06 AM by 83athom »
Give a man a fire and he's warm for a day, but set fire to him and he's warm for the rest of his life.
 

Offline Erik L

  • Administrator
  • Admiral of the Fleet
  • *****
  • Posts: 5654
  • Thanked: 366 times
  • Forum Admin
  • Discord Username: icehawke
Re: Robot.txt Query & Archive.org Exclusion
« Reply #5 on: December 10, 2015, 11:21:59 AM »
Quote from: 83athom
I got onto the Beam Overview page. You want me to C&P it and other important pages to a Word file(s) so one of you can have a field day editing them?

You could just copy & paste it into a new KB article :D

Offline 83athom

  • Big Ship Commander
  • Vice Admiral
  • **********
  • Posts: 1261
  • Thanked: 86 times
Re: Robot.txt Query & Archive.org Exclusion
« Reply #6 on: December 10, 2015, 11:33:00 AM »
The pages are getting 403s when I try to post them, and I'm busy atm so I can't go into it and fix it. It does look good in a Word document, though (I'll attach it below). (Apparently I cannot attach .docx.)
 

Offline Erik L

  • Administrator
  • Admiral of the Fleet
  • *****
  • Posts: 5654
  • Thanked: 366 times
  • Forum Admin
  • Discord Username: icehawke
Re: Robot.txt Query & Archive.org Exclusion
« Reply #7 on: December 10, 2015, 01:19:53 PM »
Quote from: 83athom
The pages are getting 403s when I try to post them, and I'm busy atm so I can't go into it and fix it. It does look good in a Word document, though (I'll attach it below). (Apparently I cannot attach .docx.)

Allowed file types: gif, jpg, pdf, png, csv, txt, zip

Zip it. :)

Offline Mor

  • Commander
  • *********
  • Posts: 305
  • Thanked: 11 times
Re: Robot.txt Query & Archive.org Exclusion
« Reply #8 on: January 08, 2016, 10:38:17 AM »
Quote from: Thundercraft
Currently, the Aurora wiki is online. (Sometimes, the wiki is taken offline.) But, I found a page that would not display due to some error.

I think that Erik or the host changed the setting. Previously, the setting excluded the forum; now it excludes the wiki. Apparently we can't have both  :(

Quote from: 83athom
I got onto the Beam Overview page. You want me to C&P it and other important pages to a Word file(s) so one of you can have a field day editing them?

Honestly, a KB is a waste of time. People always get excited about new things, but I have yet to see a single good implementation. A KB is like a tutorial post and suffers from many of the same limitations and issues, especially with continued development.

Quote from: Erik L
My host has issues with the Aurora wiki... They may have throttled it.

I would like to get the data from the wiki and put it in the KB here.

You might want to back up the wiki DB to avoid losing info, especially if you are considering changing hosts.