javascript - Trying to scrape table using Pandas from Selenium's result - Stack Overflow

I am trying to scrape a table from a JavaScript website using Pandas. For this, I used Selenium to first reach my desired page. I am able to print the table in text format (as shown in the commented-out part of the script), but I want to be able to have the table in Pandas, too. I am attaching my script below and I hope someone could help me figure this out.

import time
from selenium import webdriver
import pandas as pd

chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'

page = driver.get(url)
time.sleep(2)


driver.find_element_by_xpath('//*[@id="bursa_boards"]/option[2]').click()


driver.find_element_by_xpath('//*[@id="bursa_sectors"]/option[11]').click()
time.sleep(2)

driver.find_element_by_xpath('//*[@id="bm_equity_price_search"]').click()
time.sleep(5)

target = driver.find_elements_by_id('bm_equities_prices_table')
##for data in target:
##    print (data.text)

for data in target:
    dfs = pd.read_html(target,match = '+')
for df in dfs:
    print (df)  

Running the above script, I get the error below:

Traceback (most recent call last):
  File "E:\Coding\Python\BS_Bursa Properties\Selenium_Pandas_Bursa Properties.py", line 29, in <module>
    dfs = pd.read_html(target,match = '+')
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 906, in read_html
    keep_default_na=keep_default_na)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 728, in _parse
    compiled_match = re.compile(match)  # you can pass a compiled regex here
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 233, in compile
    return _compile(pattern, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 416, in _parse_sub
    not nested and not items))
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 616, in _parse
    source.tell() - here + len(this))
sre_constants.error: nothing to repeat at position 0

I've also tried using pd.read_html on the URL directly, but it returned a "No Table Found" error. The URL is: http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS08&board=MAIN-MKT&sector=PROPERTIES&page=1.

asked Jul 29, 2017 at 21:47 by Eric Choi (edited Jul 29, 2017 at 22:11)

2 Answers


You can get the table using the following code:

import time
from selenium import webdriver
import pandas as pd

chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'

page = driver.get(url)
time.sleep(2)

df = pd.read_html(driver.page_source)[0]
print(df.head())

This is the output

No  Code    Name    Rem Last Done   LACP    Chg % Chg   Vol ('00)   Buy Vol ('00)   Buy Sell    Sell Vol ('00)  High    Low
0   1   5284CB  LCTITAN-CB  s   0.025   0.020   0.005   +25.00  406550  19878   0.020   0.025   106630  0.025   0.015
1   2   1201    SUMATEC [S] s   0.050   0.050   -   -   389354  43815   0.050   0.055   187301  0.055   0.050
2   3   5284    LCTITAN [S] s   4.470   4.700   -0.230  -4.89   367335  430 4.470   4.480   34  4.780   4.140
3   4   0176    KRONO [S]   -   0.875   0.805   0.070   +8.70   300473  3770    0.870   0.875   797 0.900   0.775
4   5   5284CE  LCTITAN-CE  s   0.130   0.135   -0.005  -3.70   292379  7214    0.125   0.130   50  0.155   0.100

To get the data from all pages, you can crawl the remaining pages and combine the results with df.append (or pd.concat), as sketched below.
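
A minimal sketch of that approach, assuming the page query parameter from the question's URL works for every page and that the page count (here 5) is known in advance; pd.concat combines the per-page frames:

import time
import pandas as pd
from selenium import webdriver

chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)

frames = []
for page in range(1, 6):  # assumed page count; adjust to the real number of pages
    # the 'page' query parameter is taken from the URL in the question
    driver.get('http://www.bursamalaysia.com/market/securities/equities/prices/#/'
               '?filter=BS08&board=MAIN-MKT&sector=PROPERTIES&page={}'.format(page))
    time.sleep(2)  # crude wait for the JavaScript-rendered table to appear
    frames.append(pd.read_html(driver.page_source)[0])

df_all = pd.concat(frames, ignore_index=True)  # one DataFrame covering all pages
print(df_all.shape)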

Answer:

df = pd.read_html(target[0].get_attribute('outerHTML'))[0]

Result: (screenshot of the resulting DataFrame)

Reason for target[0]:

driver.find_elements_by_id('bm_equities_prices_table') returns a list of Selenium WebElements; in your case there is only one matching element, hence the [0].

Reason for get_attribute('outerHTML'):

we want to get the HTML of the element itself. get_attribute can return either 'innerHTML' or 'outerHTML'. We chose 'outerHTML' because we need to include the current element, where the table headers are, I suppose, instead of only the inner contents of the element.
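
A quick illustration of the difference, using made-up markup:

# For an element like: <table id="t"><tr><td>1</td></tr></table>
# element.get_attribute('innerHTML')  ->  '<tr><td>1</td></tr>'  (contents only)
# element.get_attribute('outerHTML')  ->  '<table id="t"><tr><td>1</td></tr></table>'  (element included)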

Reason for the final [0]:

pd.read_html() returns a list of DataFrames; the first one is the table we want, hence the trailing [0].
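
As a side note on the original traceback: pd.read_html compiles its match argument as a regular expression, and a bare '+' is not a valid pattern (a quantifier with nothing to repeat, which is exactly the "nothing to repeat at position 0" error). If you do want to keep only tables containing a literal plus sign, escape it; a sketch reusing the target list from the question's script:

import re
import pandas as pd

html = target[0].get_attribute('outerHTML')
dfs = pd.read_html(html, match=re.escape('+'))  # match a literal '+' rather than an invalid regex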
