2022-12-19

How to extract RSS links from website with Python

I am trying to extract all RSS feed links from some websites. Ofc if RSS exists. These are some website links that have RSS, and below is list of RSS links from those websites.

website_links = ["https://www.diepresse.com/", 
"https://www.sueddeutsche.de/", 
"https://www.berliner-zeitung.de/", 
"https://www.aargauerzeitung.ch/", 
"https://www.luzernerzeitung.ch/", 
"https://www.nzz.ch/",
"https://www.spiegel.de/", 
"https://www.blick.ch/",
"https://www.berliner-zeitung.de/", 
"https://www.ostsee-zeitung.de/", 
"https://www.kleinezeitung.at/", 
"https://www.blick.ch/", 
"https://www.ksta.de/", 
"https://www.tagblatt.ch/", 
"https://www.srf.ch/", 
"https://www.derstandard.at/"]


website_rss_links = ["https://www.diepresse.com/rss/Kunst", 
"https://rss.sueddeutsche.de/rss/Kultur", 
"https://www.berliner-zeitung.de/feed.id_kultur-kunst.xml", 
"https://www.aargauerzeitung.ch/leben-kultur.rss", 
"https://www.luzernerzeitung.ch/kultur.rss", 
"https://www.nzz.ch/technologie.rss", 
"https://www.spiegel.de/kultur/literatur/index.rss", 
"https://www.luzernerzeitung.ch/wirtschaft.rss", 
"https://www.blick.ch/wirtschaft/rss.xml", 
"https://www.berliner-zeitung.de/feed.id_abgeordnetenhauswahl.xml", 
"https://www.ostsee-zeitung.de/arc/outboundfeeds/rss/category/wissen/", 
"https://www.kleinezeitung.at/rss/politik", 
"https://www.blick.ch/wirtschaft/rss.xml", 
"https://feed.ksta.de/feed/rss/politik/index.rss", 
"https://www.tagblatt.ch/wirtschaft.rss", 
"https://www.srf.ch/news/bnf/rss/1926", 
"https://www.derstandard.at/rss/wirtschaft"]

My approach is to extract all links, and then check if some of them has RSS in them, but that is just a first step:

for url in all_links:
    
    response = requests.get(url)
    print(response)
    soup = BeautifulSoup(response.content, 'html.parser')
    list_of_links = soup.select("a[href]")
    list_of_links = [link["href"] for link in list_of_links]
    print("Number of links", len(list_of_links))
 

    for l in list_of_links:
        if "rss" in l:
            print(url)
            print(l)
    print()
    

I have heard that I can look for RSS links like this, but I do not know how to incorporate this in my code.

type=application/rss+xml

My goal is to get working RSS urls at the end. Maybe it is an issue because I am sending request on the first page, and maybe I should crawl different pages in order to extract all RSS Links, but I hope that there is a faster/better way for RSS extraction.



No comments:

Post a Comment