Alright guys, I had some spare time, so I decided to write a quick script that probably works. You need the latest version of Python 3, the latest version of BeautifulSoup, and the latest version of Requests.
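If you don't have the two libraries yet, both install through pip (these are the standard package names, nothing exotic):
pip install requests beautifulsoup4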
The variables you need to set are at the bottom. One is called url, the other file_path. url needs to be the full URL, for example:
http://aurora2.pentarch.org/index.php?topic=11579.0
file_path is where you want to put the text file and what you want to call it. If you scrape text from two pages without renaming the txt file or changing the file_path, the new text just gets appended to the original file, so you won't need to scrape the first page again. The file_path I currently have works for my Mac, although I need to enter my actual username of course. Other operating systems may be different.
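For example (both paths are placeholders, swap in your own username and filename; the Windows one is a guess on my part, I haven't tested it):
url = 'http://aurora2.pentarch.org/index.php?topic=11579.0'
file_path = '/Users/yourname/Desktop/aurora_posts.txt' #macOS-style path
#file_path = r'C:\Users\yourname\Desktop\aurora_posts.txt' #Windows-style path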
To everybody who actually knows how to write Python code, I am so sorry that you had to look at my code. You could probably do all this in 20 lines in half the time if you knew what you were doing lol

Code here
#For the Aurora 4X forum
import requests
from bs4 import BeautifulSoup as bs
#The two libraries.

def find_text(webpage):
    """Takes the webpage as input, output is a list containing all the text as strings"""
    step1 = webpage.find('body') #Finding the forum posts.
    step2 = step1.find('div', attrs={'id': 'forumposts'}) #This forum uses alternating background formats.
    step3 = step2.find_all('div', attrs={'class': 'windowbg'}) #windowbg2 is slightly darker.
    step3_b = step2.find_all('div', attrs={'class': 'windowbg2'}) #You need to grab them separately.
    step4 = []
    step4_b = []
    for i in range(len(step3)): #Finding the text from the larger forum post and putting it in another list.
        var = step3[i].find('div', attrs={'class': 'inner'})
        var_t = var.text
        step4.append(var_t)
    for i in range(len(step3_b)): #Doing the same with the dark background posts.
        var = step3_b[i].find('div', attrs={'class': 'inner'})
        var_t = var.text
        step4_b.append(var_t)
    if len(step4) > len(step4_b): #I am too lazy to work out a better way to make sure the posts are merged in order.
        step4_b.append('delete this later') #So if there is not an equal number of windowbg and windowbg2 posts,
                                            #a fake one will be added; it will be removed later.
    length = len(step4) + len(step4_b)
    final_step = []
    for i in range(length // 2): #Merging all the text to a final list in order with separating lines.
        final_step.append(step4[i])
        final_step.append('\n')
        final_step.append(step4_b[i])
        final_step.append('\n')
    if final_step[-1] == 'delete this later': #This is easier for me to work out, but probably terrible.
        final_step.pop(-1)
    elif final_step[-2] == 'delete this later':
        final_step.pop(-2)
    elif final_step[-3] == 'delete this later':
        final_step.pop(-3)
    output = formatting_func(final_step) #Finalise the formatting so the text is not all on one line.
    #for i in range(len(final_step)): #Uncomment this if you want to print the text with your IDE.
    #    print(final_step[i], '\n')
    return output

def formatting_func(final_step):
    """There was an issue with all the text being on one line, which was confusing in Google Translate."""
    output = []
    max_length = 151
    for line in final_step:
        old_index = 0
        if len(line) > 1: #Empty lines are length 1.
            loops = len(line) // max_length #Set the number of lines per post.
        else:
            loops = 0
        if loops == 0:
            output.append(line)
        else:
            for j in range(loops + 1):
                if len(line) > (max_length * (j + 1)):
                    index = max_length * (j + 1) #Fallback so index is always set, even with no space nearby.
                    for i in range(20): #If you try to make a new line, you might cut a word in half, which messes up
                        if line[(max_length * (j + 1)) - i] == ' ': #Google Translate. This goes backwards and finds a
                            index = (max_length * (j + 1)) - i #suitable space to make the new line at.
                            break
                    output.append(line[old_index:index])
                    old_index = index
                else:
                    output.append(line[old_index:])
    return output #Returns the final formatted list.

def download_text(text_file, file_path, document_list):
    """Writes the text onto your computer as a new file."""
    for line in text_file:
        document_list.append(f'{line}\n') #Extra newline.
    with open(file_path, 'a+') as handler: #Append mode, so scraping another page adds to the same file.
        for lines in document_list:
            handler.write(lines)
url = '' #Needs to be the full url, example: http://aurora2.pentarch.org/index.php?topic=11579.0
file_path = '/users/username/desktop/filename.txt'
#file_path is the place you want to save your text file and what you want to call it.
#Make sure it is formatted correctly!!
r = requests.get(url)
webpage = bs(r.content, 'html.parser') #Parser named explicitly so bs4 does not guess and warn.
document_list = []
text_file = find_text(webpage) #Grabs the text.
download_text(text_file, file_path, document_list) #Writes the file.
End code
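If you want a whole multi-page thread in one go, something like the loop below (pasted at the bottom of the script in place of the last few lines) should do it. This is just a sketch, not tested: it assumes the forum paginates SMF-style, where the number after the dot in the URL is a post offset that goes up by 15 per page (.0, .15, .30 and so on), so posts_per_page and num_pages are guesses you would need to adjust for your thread.

#Sketch: scrape several pages of the same thread in a row (offsets and page count are assumptions).
base_url = 'http://aurora2.pentarch.org/index.php?topic=11579'
posts_per_page = 15 #Assumed SMF default, check your thread's page URLs to confirm.
num_pages = 3 #However many pages the thread actually has.
for page in range(num_pages):
    page_url = f'{base_url}.{page * posts_per_page}'
    r = requests.get(page_url)
    webpage = bs(r.content, 'html.parser')
    download_text(find_text(webpage), file_path, []) #Each page gets appended to the same file.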