2023-02-25

Scraping data from https://ift.tt/Fq2rypk in Python

I'm trying to follow the steps from this article to scrape data from the Transfermarkt website, but I'm not getting the desired output. It seems some of the classes have changed since the article was written, so I've had to change

Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"}) to

Players = pageSoup.find_all("td", {"class": "hauptlink"})

from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.100 Safari/537.36'}

page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Players = pageSoup.find_all("td", {"class": "hauptlink"})
Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})

PlayersList = []
ValuesList = []

for i in range(0,25):
    PlayersList.append(Players[i].text)
    ValuesList.append(Values[i].text)
    
df = pd.DataFrame({"Players":PlayersList,"Values":ValuesList})

df.head(10)

The problem is that this also matches other cells with the same class and adds them to the Players variable, e.g. Players[0].text returns '\nLuís Figo ' and Players[1].text returns '\nReal Madrid', because team names have the same class as player names. How can I select only the first hauptlink cell, or otherwise differentiate which one I want when they share the same class?
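
One possible way to frame this is to search per table row instead of across the whole page, and take only the first hauptlink cell in each row. The sketch below is only an illustration: SAMPLE_HTML is a simplified, made-up stand-in for the real page (the "items" table class, the flat row layout, and the placeholder names and fees are all assumptions), and extract_rows is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup; the real page is assumed to be more deeply nested.
SAMPLE_HTML = """
<table class="items">
  <tbody>
    <tr>
      <td class="hauptlink"><a href="#">Luís Figo</a></td>
      <td class="hauptlink"><a href="#">Real Madrid</a></td>
      <td class="rechts hauptlink"><a href="#">FEE 1</a></td>
    </tr>
    <tr>
      <td class="hauptlink"><a href="#">Player B</a></td>
      <td class="hauptlink"><a href="#">Club B</a></td>
      <td class="rechts hauptlink"><a href="#">FEE 2</a></td>
    </tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return one {"Player", "Value"} dict per top-level table row."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # The child combinator (>) keeps rows of any nested tables out of the loop.
    for row in soup.select("table.items > tbody > tr"):
        # find() returns only the first match in document order,
        # which in this sample is the player cell, not the club cell.
        player_cell = row.find("td", class_="hauptlink")
        # Requiring both classes singles out the fee cell.
        fee_cell = row.select_one("td.rechts.hauptlink")
        if player_cell is None or fee_cell is None:
            continue  # skip header or spacer rows
        records.append({
            "Player": player_cell.get_text(strip=True),
            "Value": fee_cell.get_text(strip=True),
        })
    return records


records = extract_rows(SAMPLE_HTML)
```

The resulting list can be passed straight to pd.DataFrame(records). Using get_text(strip=True) also removes the leading '\n' and trailing space seen in Players[0].text. On the live page the row selector would probably need adjusting after inspecting the actual markup.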


