I am trying to scrape a table from a JavaScript website using Pandas. For this, I used Selenium to first reach my desired page. I am able to print the table in text format (as shown in the commented-out lines in the script), but I want to have the table in Pandas, too. I am attaching my script below and I hope someone can help me figure this out.
import time
from selenium import webdriver
import pandas as pd

chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)

url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'
page = driver.get(url)
time.sleep(2)
driver.find_element_by_xpath('//*[@id="bursa_boards"]/option[2]').click()
driver.find_element_by_xpath('//*[@id="bursa_sectors"]/option[11]').click()
time.sleep(2)
driver.find_element_by_xpath('//*[@id="bm_equity_price_search"]').click()
time.sleep(5)
target = driver.find_elements_by_id('bm_equities_prices_table')

# for data in target:
#     print(data.text)

for data in target:
    dfs = pd.read_html(target, match='+')
    for df in dfs:
        print(df)
Running the above script, I get the following error:
Traceback (most recent call last):
  File "E:\Coding\Python\BS_Bursa Properties\Selenium_Pandas_Bursa Properties.py", line 29, in <module>
    dfs = pd.read_html(target, match='+')
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 906, in read_html
    keep_default_na=keep_default_na)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 728, in _parse
    compiled_match = re.compile(match)  # you can pass a compiled regex here
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 233, in compile
    return _compile(pattern, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 416, in _parse_sub
    not nested and not items))
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 616, in _parse
    source.tell() - here + len(this))
sre_constants.error: nothing to repeat at position 0
I've also tried using pd.read_html on the URL directly, but it returned a "No Table Found" error. The URL is: http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS08&board=MAIN-MKT&sector=PROPERTIES&page=1.
2 Answers
You can get the table using the following code:
import time
from selenium import webdriver
import pandas as pd
chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'
page = driver.get(url)
time.sleep(2)
df = pd.read_html(driver.page_source)[0]
print(df.head())
This is the output:
No Code Name Rem Last Done LACP Chg % Chg Vol ('00) Buy Vol ('00) Buy Sell Sell Vol ('00) High Low
0 1 5284CB LCTITAN-CB s 0.025 0.020 0.005 +25.00 406550 19878 0.020 0.025 106630 0.025 0.015
1 2 1201 SUMATEC [S] s 0.050 0.050 - - 389354 43815 0.050 0.055 187301 0.055 0.050
2 3 5284 LCTITAN [S] s 4.470 4.700 -0.230 -4.89 367335 430 4.470 4.480 34 4.780 4.140
3 4 0176 KRONO [S] - 0.875 0.805 0.070 +8.70 300473 3770 0.870 0.875 797 0.900 0.775
4 5 5284CE LCTITAN-CE s 0.130 0.135 -0.005 -3.70 292379 7214 0.125 0.130 50 0.155 0.100
To get the data from all pages, you can crawl the remaining pages and append each page's DataFrame, as sketched below.
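For example, here is a minimal sketch of that idea, reusing the driver from the snippet above. The page=N query parameter and the page count are assumptions based on the URL shown in the question, not something confirmed by the site:

import time
import pandas as pd

frames = []
for page_no in range(1, 6):  # assumed number of pages; adjust to the real count
    # assumed URL scheme, based on the 'page=1' parameter in the question
    driver.get('http://www.bursamalaysia.com/market/securities/equities/prices/'
               '#/?filter=BS08&board=MAIN-MKT&sector=PROPERTIES&page=%d' % page_no)
    time.sleep(2)  # crude wait for the JavaScript table to render
    frames.append(pd.read_html(driver.page_source)[0])

# Combine all pages into one DataFrame (pd.concat does in one call
# what chaining df.append would do page by page)
all_pages = pd.concat(frames, ignore_index=True)
print(all_pages.shape)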
Answer:

df = pd.read_html(target[0].get_attribute('outerHTML'))

Reason for target[0]: driver.find_elements_by_id('bm_equities_prices_table') returns a list of Selenium WebElements; in your case there is only one element, hence the [0].

Reason for get_attribute('outerHTML'): we want the HTML of the element. There are two such attributes: 'innerHTML' and 'outerHTML'. We chose 'outerHTML' because we need to include the current element (where the table headers are, I suppose) instead of only the inner contents of the element.

Reason for the final [0]: pd.read_html() returns a list of DataFrames, the first of which is the result we want, hence the [0].
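Putting those pieces together, here is a minimal sketch of the fix applied to the original script. Note that the match='+' argument is dropped: read_html compiles match as a regular expression, and a bare '+' is a quantifier with nothing to quantify, which is exactly what raised the "nothing to repeat at position 0" error.

# Grab the table element rendered by JavaScript (as in the original script)
target = driver.find_elements_by_id('bm_equities_prices_table')

# target is a list of WebElements; take the first (and only) match,
# and pass its outer HTML so the table's own markup is included
dfs = pd.read_html(target[0].get_attribute('outerHTML'))

# read_html returns a list of DataFrames; the first one is the table we want
df = dfs[0]
print(df.head())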