首页
学习
活动
专区
工具
TVP
发布
社区首页 >问答首页 >从网站(metacritc) Python抓取数据时如何匹配数组长度

从网站(metacritc) Python抓取数据时如何匹配数组长度
EN

Stack Overflow用户
提问于 2018-06-19 08:52:30
回答 1查看 229关注 0票数 0

当我尝试从网页(metacritc)获取数据时,我遇到了一个问题,其中一些数据不在那里

代码语言:javascript
复制
print(len(names))
print(len(metascores))
print(len(userscores))
print(len(release_datesNew))
print(len(publishers))
print(len(ratings))

109
105
103
109
100
33

正如你从上面看到的,当我用下面的代码抓取数据时,我得到了不同的数组长度:

代码语言:javascript
复制
#Define year
year_number = 2018

# Define the URL
i = range(0, 1)

names = []
metascores = []
userscores = []
userscoresNew = []
release_dates = []
release_datesNew = []
publishers = []
ratings = []
genres = []
genresNew = []




for element in i:

    url = "http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?view=detailed&sort=desc&year_selected=" + format(year_number)
    print(url)
    year_number -= 1
    # not sure about this but it works (I was getting blocked by something and this the way I found around it)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

    web_byte = urlopen(req).read()

    webpage = web_byte.decode('utf-8')

    #this grabs the all the text from the page
    html_soup = BeautifulSoup(webpage, 'html5lib')

    #this is for selecting all the games in from 1 to 100 (the list of them)
    game_names = html_soup.find_all("div", class_="main_stats")
    game_metas = html_soup.find_all("a", class_="basic_stat product_score")  
    game_users = html_soup.find_all("li", class_='stat product_avguserscore')
    game_releases = html_soup.find_all("ul", class_='more_stats')
    game_publishers = html_soup.find_all("li", class_='stat publisher')
    game_ratings = html_soup.find_all("li", class_='stat maturity_rating')
    game_genres = html_soup.find_all("li", class_='stat genre')



    #Extract data from each game
    for games in game_names:
        name = games.find()
        names.append(name.text.strip())
    else:
        names.append("NA")


    for games2 in game_metas:
        metascore = games2.find()
        metascores.append(metascore.text.strip())
    else:
        metascore.append("NA")

    for games3 in game_releases:
        release_date = games3.find()
        release_dates.append(release_date.text.strip())
    else:
        release_dates.append("NA")

    for games4 in game_users:
        userscore  = games4.find('span', class_="data textscore textscore_favorable") or games4.find('span', class_="data textscore textscore_mixed")
        if userscore:
            userscores.append(userscore.text)
        else:
            userscores.append("NA")

    for games5 in game_publishers:
        publisher = games5.find("span", class_ = "data")
        if publisher:
            publishers.append(publisher.text)
        else:
                publishers.append("NA")

    for games6 in game_ratings:
        rating = games6.find("span", class_ = "data")
        if rating:
            ratings.append(rating.text)
        else:
            userscores.append("NA")

    for games7 in game_genres:
        genre = games7.find("span", class_ = "data")
        if genre:
            genres.append(genre.text)
        else:
            genres.append("NA")


for x in release_dates:
    temp = str(x)
    temp2 = temp.replace("Release Date:\n                        ", "")
    release_datesNew.append(temp2)

for z in genres:
    temp3 = str(z)
    temp4 = temp3.strip()
    temp5 = temp4.replace("                                                            ", "")
    genresNew.append(temp5)

# df = pd.DataFrame({'Games:': names,
#                     'Metascore:': metascores,
#                     'Userscore:': userscores,
#                     'Release date:': release_datesNew,
#                     'Publisher:': publishers,
#                     'Rating:': ratings,
#                     'Genre:': genresNew}) 

# df.to_csv("metacritic scrape.csv")

df = pd.DataFrame({'Games:': names})

我如何编写它,以便当游戏名称被抓取时,如果没有其他数组的值(或者它不在那里),它会为该值放置一个占位符(例如NA)

似乎找不到答案

更新:

这是我拥有的代码,它给了我一个错误

代码语言:javascript
复制
from bs4 import BeautifulSoup as soup
import requests, contextlib, re
@contextlib.contextmanager
def get_page(page_num = 1, year = 2018):
    d = soup(requests.get(f'http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?view=detailed&sort=desc&year_selected={year}&page={page_num-1}'), 'html.parser').find('ol', {'class':'list_products'}).find_all('li', {'class':'product'})
    headers = [['h3', 'product_title', False], ['span', 'metascore_w', False], ['span', 'data', True]]
    new_results = [[(lambda x:[re.sub('\n|\s{2,}', '', i.text) for i in x] if isinstance(x, list) else getattr(x, 'text', 'N/A'))(getattr(i, ['find', 'find_all'][c])(a, {'class':b})) for a, b, c in headers] for i in d]
    yield new_results

for i in range(1):
    with get_page(page_num = i) as results:
        print(results)

这是错误:

代码语言:javascript
复制
TypeError                                 Traceback (most recent call last)
<ipython-input-1-bd8574826c21> in <module>()
      9 
     10 for i in range(1):
---> 11     with get_page(page_num = i) as results:
     12         print(results)

~/anaconda3/lib/python3.6/contextlib.py in __enter__(self)
     79     def __enter__(self):
     80         try:
---> 81             return next(self.gen)
     82         except StopIteration:
     83             raise RuntimeError("generator didn't yield") from None

<ipython-input-1-bd8574826c21> in get_page(page_num, year)
      3 @contextlib.contextmanager
      4 def get_page(page_num = 1, year = 2018):
----> 5     d = soup(requests.get(f'http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?view=detailed&sort=desc&year_selected={year}&page={page_num-1}'), 'html.parser').find('ol', {'class':'list_products'}).find_all('li', {'class':'product'})
      6     headers = [['h3', 'product_title', False], ['span', 'metascore_w', False], ['span', 'data', True]]
      7     new_results = [[(lambda x:[re.sub('\n|\s{2,}', '', i.text) for i in x] if isinstance(x, list) else getattr(x, 'text', 'N/A'))(getattr(i, ['find', 'find_all'][c])(a, {'class':b})) for a, b, c in headers] for i in d]

~/anaconda3/lib/python3.6/site-packages/bs4/__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, **kwargs)
    190         if hasattr(markup, 'read'):        # It's a file-type object.
    191             markup = markup.read()
--> 192         elif len(markup) <= 256 and (
    193                 (isinstance(markup, bytes) and not b'<' in markup)
    194                 or (isinstance(markup, str) and not '<' in markup)

TypeError: object of type 'Response' has no len()
EN

回答 1

Stack Overflow用户

发布于 2018-06-19 09:22:16

您可以创建一个包含所有要抓取的标记和类名的标头:

代码语言:javascript
复制
from bs4 import BeautfulSoup as soup
import requests, contextlib, re
@contextlib.contextmanager
def get_page(page_num = 1, year = 2018):
   d = soup(requests.get(f'http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?view=detailed&sort=desc&year_selected={year}&page={page_num-1}'), 'html.parser').find('ol', {'class':'list_products'}).find_all('li', {'class':'product'})
   headers = [['h3', 'product_title', False], ['span', 'metascore_w', False], ['span', 'data', True]]
   new_results = [[(lambda x:[re.sub('\n|\s{2,}', '', i.text) for i in x] if isinstance(x, list) else getattr(x, 'text', 'N/A'))(getattr(i, ['find', 'find_all'][c])(a, {'class':b})) for a, b, c in headers] for i in d]
   yield new_results

for i in range(1):
  with get_page(page_num = i) as results:
    print(results)

输出(第一页):

代码语言:javascript
复制
/*
* 提示:该行代码过长,系统自动注释不进行高亮。一键复制会移除系统注释 
* [['Into the Breach', '89', ['Feb 27, 2018', 'Subset Games', '7.6']], ['Pillars of Eternity II: Deadfire', '88', ['May 8, 2018', 'M', 'Obsidian Entertainment, Versus Evil', 'Role-Playing,Western-Style', '7.8']], ['Celeste', '88', ['Jan 25, 2018', 'Matt Makes Games Inc.', 'Action,Platformer,2D', '7.0']], ['Subnautica', '87', ['Jan 23, 2018', 'E10+', 'Unknown Worlds Entertainment', 'Adventure,General', '8.2']], ['Iconoclasts', '87', ['Jan 23, 2018', 'Bifrost Entertainment', 'Action Adventure,Open-World', '7.3']], ['Final Fantasy XII: The Zodiac Age', '85', ['Feb 1, 2018', 'T', 'Square Enix', 'Role-Playing,Japanese-Style', '5.9']], ['Final Fantasy XV: Windows Edition', '85', ['Mar 6, 2018', 'T', 'Square Enix', 'Role-Playing,Action RPG', '7.2']], ['Dragon Ball FighterZ', '85', ['Jan 26, 2018', 'T', 'Bandai Namco Games', 'Action,2D,Fighting', '7.8']], ['Batman: The Enemy Within - Episode 5: Same Stitch', '85', ['Mar 26, 2018', 'Telltale Games', 'Adventure,Point-and-Click', '8.1']], ['Full Metal Furies', '85', ['Jan 17, 2018', 'Cellar Door Games', "Action,2D,Beat-'Em-Up", '7.0']], ['Frostpunk', '84', ['Apr 24, 2018', 'M', '11 bit studios, Merge Games', 'Action Adventure,Survival', '8.5']], ['Sprint Vector', '84', ['Feb 8, 2018', 'Survios', 'Racing,Other,Arcade', '6.6']], ['Total War: WARHAMMER II - Rise of the Tomb Kings', '84', ['Jan 23, 2018', 'Sega', '7.2']], ['The Elder Scrolls Online: Summerset', '84', ['May 21, 2018', 'M', 'Bethesda Softworks', '7.4']], ["Yoku's Island Express", '84', ['May 29, 2018', 'E10+', 'Villa Gorilla', 'Action,Pinball', '7.3']], ['The Forest', '83', ['Apr 30, 2018', 'SKS Games, Endnight Studios', 'Action Adventure,Horror,Horror,Survival', '7.2']], ['Warhammer: Vermintide 2', '82', ['Mar 8, 2018', 'Fatshark AB, Fatshark', 'Action,First-Person,Shooter,Arcade', '7.5']], ['The Elder Scrolls V: Skyrim VR', '81', ['Apr 2, 2018', 'M', 'Bethesda Softworks', 'Role-Playing,Western-Style', '6.7']], ['CHUCHEL', '81', ['Mar 7, 2018', 'Amanita Design', 'Adventure,General', '7.5']], ['Ni no Kuni II: Revenant Kingdom', '81', ['Mar 23, 2018', 'T', 'Level 5, Bandai Namco Games', '7.9']], ['The Red Strings Club', '81', ['Jan 22, 2018', 'Devolver Digital', '7.9']], ['Guns, Gore & Cannoli 2', '80', ['Mar 2, 2018', 'Crazy Monkey Studios', 'Action,Third-Person,Shooter,Arcade', '7.2']], ['A Way Out', '80', ['Mar 23, 2018', 'M', 'Electronic Arts', 'Action,Action Adventure,General', '7.7']], ['Space Invaders Extreme', '79', ['Feb 12, 2018', 'Degica', 'tbd']], ['FAR: Lone Sails', '79', ['May 17, 2018', 'Mixtvision', '7.9']], ['Minit', '79', ['Apr 3, 2018', 'Devolver Digital', 'Adventure,General', '7.1']], ["Sid Meier's Civilization VI: Rise and Fall", '79', ['Feb 8, 2018', '2K Games', '6.1']], ['Far Cry 5', '79', ['Mar 27, 2018', 'M', 'Ubisoft', 'Action,First-Person,Shooter,Arcade', '6.0']], ['For the King', '78', ['Apr 19, 2018', 'IronOak Games', 'Role-Playing,Roguelike', 'tbd']], ['BattleTech', '78', ['Apr 24, 2018', 'Harebrained Schemes LLC', 'Strategy,Turn-Based,Tactics', '7.1']], ['Sairento VR', '78', ['Jan 19, 2018', 'Mixed Realms Pte Ltd', 'Action,Shooter,Light Gun', '8.0']], ['Pit People', '78', ['Mar 2, 2018', 'The Behemoth', 'Action,Strategy,Turn-Based,General,Tactics', '7.7']], ['Brass Tactics', '78', ['Feb 22, 2018', 'Hidden Path Entertainment']], ['Moonlighter', '78', ['May 29, 2018', 'E10+', '11 bit studios', 'Role-Playing,Action RPG', '7.5']], ['Legendary Gary', '78', ['Feb 20, 2018', 'Evan Rogers', 'tbd']], ['Ancestors Legacy', '77', ['May 22, 2018', '1C Company', 'Strategy,Real-Time,General', '8.1']], ['Remothered: Tormented Fathers', '77', ['Jan 30, 2018', 'M', 'Stormind Games', '7.8']], ['Dead In Vinland', '77', ['Apr 12, 2018', 'Playdius', '7.7']], ['A Case of Distrust', '77', ['Feb 8, 2018', 'Serenity Forge', 'tbd']], ['Batman: The Enemy Within - Episode 4: What Ails You', '77', ['Jan 22, 2018', 'Telltale Games', 'Adventure,Point-and-Click', '7.9']], ['Cultist Simulator', '76', ['May 31, 2018', 'Weather Factory', 'Strategy,Turn-Based,Card Battle', '6.6']], ['Kerbal Space Program: Making History', '76', ['Mar 13, 2018', 'Squad', 'tbd']], ['A Total War Saga: Thrones of Britannia', '76', ['May 2, 2018', 'Sega Europe', 'Strategy,Real-Time,General', '5.4']], ['Forgotton Anne', '76', ['May 15, 2018', 'Square Enix', 'Adventure,General', '8.0']], ['Aegis Defenders', '76', ['Feb 8, 2018', 'GUTS Department', 'Role-Playing,Action RPG']], ['Q.U.B.E. 2', '76', ['Mar 13, 2018', 'E', 'Toxic Games', 'Action,First-Person,Shooter,Arcade', '7.7']], ['Candleman: The Complete Journey', '76', ['Jan 31, 2018', 'Spotlightor Interactive', 'Action,Platformer,3D', '8.0']], ['Surviving Mars', '76', ['Mar 15, 2018', 'E10+', 'Paradox Interactive', 'Strategy,Management,Government', '6.3']], ['Kingdom Come: Deliverance', '76', ['Feb 13, 2018', 'M', 'Warhorse Studios', 'Role-Playing,Action Adventure,General,Action RPG,Historic', '8.0']], ['Solo', '76', ['Apr 26, 2018', 'Team Gotham', '6.2']], ['Tower of Time', '76', ['Apr 12, 2018', 'Event Horizon Software', '8.3']], ['Ghost of a Tale', '75', ['Mar 13, 2018', 'Ghost of a Tale, SeithCG', '8.3']], ['Monster Prom', '75', ['Apr 27, 2018', 'Those Awesome Guys', 'Adventure,General', '8.1']], ['Where the Water Tastes Like Wine', '75', ['Feb 28, 2018', 'Dim Bulb Games', 'Miscellaneous,General', '5.1']], ['Railway Empire', '74', ['Jan 26, 2018', 'E', 'Kalypso', 'Simulation,Train,Vehicle', '6.5']], ['Nantucket', '74', ['Jan 18, 2018', 'Fish Eagle', '7.5']], ['Vampyr', '74', ['Jun 4, 2018', 'M', 'Focus Home Interactive', 'Role-Playing,General,Action RPG', '7.2']], ['The Swords of Ditto', '74', ['Apr 24, 2018', 'E10+', 'Devolver Digital', 'Role-Playing,Action RPG', '6.7']], ['Jurassic World Evolution', '74', ['Jun 12, 2018', 'T', 'Frontier Developments', 'Strategy,Management,Business / Tycoon', '8.1']], ['Omensight', '74', ['May 15, 2018', 'Spearhead Games', 'First-Person,Adventure,3D', '7.7']], ['Tesla vs Lovecraft', '73', ['Jan 26, 2018', '10tons', 'tbd']], ['Apocalipsis', '73', ['Mar 1, 2018', 'PlayWay', 'tbd']], ['The Council - Episode 1: The Mad Ones', '73', ['Mar 13, 2018', 'M', 'Focus Home Interactive', '7.2']], ['Downward Spiral: Horus Station', '73', ['May 31, 2018', '3rd Eye Studios', 'First-Person,Adventure,3D', 'tbd']], ['The Fall Part 2: Unbound', '73', ['Feb 13, 2018', 'T', 'Over The Moon', 'Action Adventure,General', 'tbd']], ['Rad Rodgers', '73', ['Feb 21, 2018', 'M', '3D Realms', 'Action,Platformer,2D', '6.9']], ['Fe', '73', ['Feb 16, 2018', 'E', 'Electronic Arts', 'Action,General', '5.4']], ['Stellaris: Apocalypse', '73', ['Feb 22, 2018', 'Paradox Interactive', '6.9']], ['All Walls Must Fall', '72', ['Feb 23, 2018', 'inbetweengames', 'tbd']], ['The Station', '72', ['Feb 20, 2018', 'The Station', 'First-Person,Adventure,3D', '6.6']], ['Warhammer 40,000: Inquisitor - Martyr', '72', ['Jun 5, 2018', 'Games Workshop', 'Strategy,Real-Time,General', '8.0']], ['City of Brass', '71', ['May 4, 2018', 'Uppercut Games Pty Ltd', 'Action Adventure,General', '4.0']], ['Attack of the Earthlings', '71', ['Feb 8, 2018', 'Junkfish Limited', '6.8']], ['Light Fall', '71', ['Apr 26, 2018', 'Bishop Games', 'Action,Platformer,2D', 'tbd']], ['Crossing Souls', '70', ['Feb 13, 2018', 'T', 'Devolver Digital', 'Action Adventure,General', '6.1']], ['H1Z1', '70', ['Feb 28, 2018', 'T', 'Daybreak Games', 'Action,Action Adventure,Third-Person,Shooter,Arcade,Survival', '4.5']], ['Dandara', '70', ['Feb 6, 2018', 'Raw Fury', 'Action,General,Platformer,2D', '6.2']], ['The Council - Episode 2: Hide and Seek', '70', ['May 15, 2018', 'Focus Home Interactive', 'Adventure,General', '8.3']], ['39 Days to Mars', '70', ['Apr 25, 2018', "It's Anecdotal", 'Action,Puzzle', '4.5']], ['State of Decay 2', '69', ['May 22, 2018', 'M', 'Microsoft Game Studios', 'Action Adventure,Open-World', '4.9']], ['Pure Farming 2018', '69', ['Mar 13, 2018', 'E', 'Techland', 'Simulation,Virtual,Career', '7.7']], ['Age of Empires: Definitive Edition', '69', ['Feb 20, 2018', 'T', 'Microsoft Game Studios', '4.7']], ['Inked', '69', ['Apr 26, 2018', 'Starbreeze publishing AB', 'Adventure,General', '7.9']], ['We Were Here Too', '69', ['Feb 2, 2018', 'Total Mayhem Games', 'First-Person,Adventure,3D', 'tbd']], ['Antigraviator', '69', ['Jun 6, 2018', 'Iceberg Interactive', 'Racing,Futuristic,Arcade', '3.8']], ['Lost Sphear', '69', ['Jan 23, 2018', 'E10+', 'Square Enix', 'Role-Playing,Japanese-Style', '6.9']], ['Rust', '69', ['Feb 8, 2018', 'Facepunch Studios', 'Action Adventure,Historic,Survival', '6.0']], ['Smoke and Sacrifice', '69', ['May 31, 2018', 'Curve Digital Games, Curve Digital', 'Action Adventure,General', 'tbd']], ['Empires Apart', '68', ['Mar 29, 2018', 'Slitherine', 'Strategy,Real-Time,General', 'tbd']], ['Trailblazers', '68', ['May 8, 2018', 'Rising Star Games', 'Racing,Futuristic,Arcade', 'tbd']], ['WARTILE', '68', ['Feb 8, 2018', 'Playwood Project', 'Strategy,Turn-Based,Tactics', 'tbd']], ['TT Isle of Man', '68', ['Mar 27, 2018', 'E', 'Bigben Interactive', 'Racing,Arcade,Automobile', '4.6']], ['The Thin Silence', '68', ['Apr 27, 2018', 'Nkidu Games Inc.', 'Adventure,General', '7.3']], ['Battlezone: Combat Commander', '68', ['Mar 1, 2018', 'Rebellion', 'Strategy,Real-Time,General', 'tbd']], ['Genital Jousting', '68', ['Jan 18, 2018', 'Devolver Digital', 'Miscellaneous,Party / Minigame', '7.5']], ['LEGRAND LEGACY: Tale of the Fatebounds', '67', ['Jan 24, 2018', 'SEMISOFT', '6.8']], ['Zwei: The Arges Adventure', '67', ['Jan 24, 2018', 'XSEED Games', 'tbd']], ['Conan Exiles', '67', ['May 8, 2018', 'Funcom', 'Action Adventure,Survival', '6.4']], ['Sea of Thieves', '67', ['Mar 20, 2018', 'T', 'Microsoft Game Studios', '4.1']], ['Fallen Legion+', '66', ['Jan 5, 2018', 'YummyYummyTummy', 'tbd']]]
*/
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/50919238

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档