How to extract RSS links from website with Python

By Ritesh Sahu - December 19, 2022

I am trying to extract all RSS feed links from a set of websites, wherever a feed exists. Below are some websites that have RSS feeds, followed by a list of RSS links from those websites.

website_links = ["https://www.diepresse.com/", 
"https://www.sueddeutsche.de/", 
"https://www.berliner-zeitung.de/", 
"https://www.aargauerzeitung.ch/", 
"https://www.luzernerzeitung.ch/", 
"https://www.nzz.ch/",
"https://www.spiegel.de/", 
"https://www.blick.ch/",
"https://www.berliner-zeitung.de/", 
"https://www.ostsee-zeitung.de/", 
"https://www.kleinezeitung.at/", 
"https://www.blick.ch/", 
"https://www.ksta.de/", 
"https://www.tagblatt.ch/", 
"https://www.srf.ch/", 
"https://www.derstandard.at/"]


website_rss_links = ["https://www.diepresse.com/rss/Kunst", 
"https://rss.sueddeutsche.de/rss/Kultur", 
"https://www.berliner-zeitung.de/feed.id_kultur-kunst.xml", 
"https://www.aargauerzeitung.ch/leben-kultur.rss", 
"https://www.luzernerzeitung.ch/kultur.rss", 
"https://www.nzz.ch/technologie.rss", 
"https://www.spiegel.de/kultur/literatur/index.rss", 
"https://www.luzernerzeitung.ch/wirtschaft.rss", 
"https://www.blick.ch/wirtschaft/rss.xml", 
"https://www.berliner-zeitung.de/feed.id_abgeordnetenhauswahl.xml", 
"https://www.ostsee-zeitung.de/arc/outboundfeeds/rss/category/wissen/", 
"https://www.kleinezeitung.at/rss/politik", 
"https://www.blick.ch/wirtschaft/rss.xml", 
"https://feed.ksta.de/feed/rss/politik/index.rss", 
"https://www.tagblatt.ch/wirtschaft.rss", 
"https://www.srf.ch/news/bnf/rss/1926", 
"https://www.derstandard.at/rss/wirtschaft"]

My approach is to extract all links from each page and then check whether any of them contain "rss", but that is just a first step:

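import requests
from bs4 import BeautifulSoup

for url in website_links:
    # Fetch the front page; the timeout keeps one slow site from stalling the loop
    response = requests.get(url, timeout=10)
    print(response)
    soup = BeautifulSoup(response.content, "html.parser")
    # Collect the href of every anchor tag on the page
    list_of_links = [link["href"] for link in soup.select("a[href]")]
    print("Number of links", len(list_of_links))

    # Print every link whose URL mentions "rss"
    for href in list_of_links:
        if "rss" in href:
            print(url)
            print(href)
    print()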

I have heard that I can find RSS links by looking for this type attribute, but I do not know how to incorporate it into my code:

type=application/rss+xml
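
One way this attribute could be used is sketched below. Many sites advertise their feeds in the page head with link tags of the form rel="alternate" type="application/rss+xml". The sketch collects the href of each such tag from an HTML string. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs; with BeautifulSoup, soup.find_all("link", attrs={"type": "application/rss+xml"}) should find the same tags. The names FEED_TYPES, FeedLinkParser and find_feed_links are made up for illustration, and accepting Atom feeds as well is an assumption that goes beyond the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds; including Atom is an assumption, not from the question
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects the href of every <link rel="alternate"> tag that advertises a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").lower().strip()
        href = attr.get("href", "").strip()
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs such as "/kultur.rss" against the page URL
            feed_url = urljoin(self.base_url, href)
            if feed_url not in self.feeds:
                self.feeds.append(feed_url)


def find_feed_links(html, base_url):
    """Return the feed URLs advertised in the given HTML text, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Inside the loop above this could be called as find_feed_links(response.text, url). It only finds feeds that the page itself declares, so section feeds that are linked nowhere in the head would still be missed.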

My goal is to end up with working RSS URLs. One problem may be that I only request the front page of each site, so maybe I should crawl further pages to find all RSS links, but I hope there is a faster or better way to extract them.
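
To decide whether a candidate URL is a working feed, one possible check is to parse the response body as XML and look at the root element. This is a minimal sketch under the assumption that a feed's root is rss (RSS 2.0), feed (Atom) or RDF (RSS 1.0); the function name looks_like_feed is hypothetical, and the check does not validate the feed beyond its root tag.

```python
import xml.etree.ElementTree as ET


def looks_like_feed(body):
    """Return True if body parses as XML whose root is <rss>, <feed> or <RDF>."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML (for example an HTML error page)
        return False
    # Strip a namespace prefix such as "{http://www.w3.org/2005/Atom}feed"
    local_name = root.tag.rsplit("}", 1)[-1].lower()
    return local_name in {"rss", "feed", "rdf"}
```

With requests this could be applied as looks_like_feed(requests.get(feed_url, timeout=10).content), keeping only the URLs for which it returns True.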
