python - Beautiful Soup is Missing Tables from Wikipedia
I am trying to scrape the La Liga league table from Wikipedia, but I can't seem to use find_all to find the table I'm trying to scrape. Moreover, the exact same code I wrote scrapes the EPL data from Wikipedia just fine...
The full HTML is here: view-source:https://en.wikipedia.org/wiki/2015%e2%80%9316_la_liga
The part in question is here:
<h2><span class="mw-headline" id="league_table">league table</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=2015%e2%80%9316_la_liga&action=edit&section=6" title="edit section: league table">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<h3><span class="mw-headline" id="standings">standings</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=2015%e2%80%9316_la_liga&action=edit&section=7" title="edit section: standings">edit</a><span class="mw-editsection-bracket">]</span></span></h3>
<table class="wikitable" style="text-align:center;">
<tr>
<th scope="col" width="28"><abbr title="position">pos</abbr>
</th>
<th scope="col" width="190">team
<div class="plainlinks hlist navbar mini" style="float:right">
<ul>
<li class="nv-view"><a href="/wiki/template:2015%e2%80%9316_la_liga_table" title="template:2015–16 la liga table"><span title="view template">v</span></a>
</li>
<li class="nv-talk"><a href="/wiki/template_talk:2015%e2%80%9316_la_liga_table" title="template talk:2015–16 la liga table"><span title="discuss template">t</span></a>
</li>
<li class="nv-edit"><a class="external text" href="//en.wikipedia.org/w/index.php?title=template:2015%e2%80%9316_la_liga_table&action=edit"><span title="edit template">e</span></a>
</li>
</ul>
</div>
</th>

This is how I request the page and do some cleansing of the code before I try to find any of the tables:
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/2015-16_la_liga").text, "html.parser")
    # Drop footnote markers (<sup> tags) before looking for tables
    for superscript in soup.find_all("sup"):
        superscript.decompose()
    print(len(soup.find_all("table", attrs={"class": "wikitable"})))

However, I am getting a length of 2 when, looking at the page HTML, I should be getting at least 14 tables with those attributes...
I have no idea where to go from here; any help is appreciated.
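One way to narrow a mismatch like this down is to count the matching tables in the raw HTML without going through BeautifulSoup at all. The following is only a minimal sketch using Python 3's standard-library html.parser; the names WikitableCounter and count_wikitables and the sample string are made up for illustration and are not part of the original code.

```python
from html.parser import HTMLParser


class WikitableCounter(HTMLParser):
    """Count <table> start tags whose class attribute includes 'wikitable'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        # attrs is a list of (name, value) pairs; a value can be None
        classes = (dict(attrs).get("class") or "").split()
        if "wikitable" in classes:
            self.count += 1


def count_wikitables(html):
    """Hypothetical helper: return the number of wikitable tables in an HTML string."""
    counter = WikitableCounter()
    counter.feed(html)
    counter.close()
    return counter.count


# Made-up sample markup, only to show the helper in use
sample = (
    '<table class="wikitable" style="text-align:center;"><tr><th>pos</th></tr></table>'
    '<table class="wikitable sortable"><tr><td>1</td></tr></table>'
    '<table class="infobox"><tr><td>x</td></tr></table>'
)
print(count_wikitables(sample))
```

Passing requests.get(url).text to count_wikitables gives a second opinion on the downloaded page. If this count is also low, the response itself probably differs from what the browser shows (the response's status_code and url attributes are worth a look in that case); if it is higher than what find_all reports, the BeautifulSoup setup is the more likely culprit.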
--edit--
Everything works fine.

PyQuery version:
    from pyquery import PyQuery

    pq = PyQuery(url="https://en.wikipedia.org/wiki/2015-16_la_liga")
    # Select every element with the "wikitable" class
    all_tables = pq(".wikitable")
    print(len(all_tables))

BeautifulSoup version:
    # bs4 package metadata, showing the installed version
    __author__ = "Leonard Richardson (leonardr@segfault.org)"
    __version__ = "4.3.2"
    __copyright__ = "Copyright (c) 2004-2013 Leonard Richardson"
    __license__ = "MIT"

    from bs4 import BeautifulSoup
    import requests

    soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/2015-16_la_liga").text, "html.parser")
    for superscript in soup.find_all("sup"):
        superscript.decompose()
    print(len(soup.find_all("table", attrs={"class": "wikitable"})))

Both versions return 13.
Maybe you should install version 4.3.2 of BeautifulSoup, or use PyQuery?
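Before reinstalling anything, it may help to check which BeautifulSoup version is actually being imported. A rough sketch, assuming Python 3; installed_bs4_version is a hypothetical helper name, and it relies only on the __version__ attribute that the bs4 package exposes:

```python
def installed_bs4_version():
    """Return the installed bs4 version string, or None if bs4 cannot be imported."""
    try:
        import bs4
    except ImportError:
        return None
    return bs4.__version__


version = installed_bs4_version()
if version is None:
    print("bs4 is not installed in this environment")
else:
    print("bs4 version:", version)
```

If the reported version differs from 4.3.2, that would be one concrete difference between the two setups worth ruling out.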

