rvest - Using R to gather data from dynamic webpage - Stack Overflow

I am trying to automate data extraction from a website on Austrian employment figures using R: https://www.dnet.at/amis/Datenbank/DB_Be.aspx

For example, I would like to specify:

  • In the left selection box: Erwerbstätige -> Unselbständig Beschäftigte
  • In the second box, I do not check any of the options.
  • In the third column (output format), I choose Ausgabeformat = Zeitreihe (i.e. time series) and specify a start date.

By clicking on Ausführen (i.e. Execute), the data is generated and I can export it as JSON or xlsx.

The data is also shown in a panel with id = main_UpdatePanel1

I am not sure how to generate this data using R, as I have to physically select all the options.

I found out the options on the left column are specified in this part of the html:

<div id="tree">
                <ul id="treeData" style="display: none;" class="ui-fancytree-source fancytree-helper-hidden">                                          
                    <li id="id_1" class="folder">Erwerbstätige
                        <ul>
                            <li id="id1.1" data-content="hvs_Bestand_UB">Unselbständig Beschäftigte</li>
                            <li id="id1.2" data-content="hvs_Bestand_FD">Freie Dienstverträge</li>
                            <li title="Geringfügig Beschäftigte (siehe Hinweis in Information)" id="id1.3" data-content="hvs_Bestand_GB">Geringfügig Beschäftigte</li>                        
                            <li title="Geringfügig Freie Dienstverträge (siehe Hinweis in Information)" id="id1.4" data-content="hvs_Bestand_GD">Geringfügig Freie Dienstverträge</li>
                            <li id="id1.5" data-content="sbe_Bestand_SB">Selbständig Beschäftigte</li>                              
                        </ul>
                     </li>                        
                     <li id="id_2" class="folder">Arbeitskräftepotential
                        <ul><li id="id2.1" data-content="akpalq_Bestand_PO">Bestand</li></ul>
                     </li> 
                     <li id="id_3" class="folder">(Register-)Arbeitslosenquoten
                        <ul><li id="id3.1" data-content="akpalq_Bestand_QU">Bestand</li></ul>
                     </li>
                     <li id="id_4" class="folder">Quoten
                        <ul>
                            <li id="id4.1" data-content="quoten_Bestand_beQU">Beschäftigungsquote</li> 
                            <li id="id4.2" data-content="quoten_Bestand_erQU">Erwerbsquote</li>      
                        </ul>
                     </li>
                </ul>

Using this, I can see that the first option Unselbständig Beschäftigte corresponds to id = id1.1

Similarly, the output format is controlled here:

<select id="lstAusgabe" class="form-control form-control-sm">
                        <option value="TA">Tabelle</option>
                        <option value="ZR">Zeitreihe</option>
                    </select>

So I would need value = "ZR".

But I am absolutely clueless about what to do with this information.

asked Nov 15, 2024 at 19:10 by Cettt
  • In general, the preferred way to do this would be via an API, if the organization provides one. To access it you would use the httr package. If no API exists, your second option would be to scrape the website, which is more analogous to what you are doing manually. To do this you would use the RSelenium package. Both packages have extensive documentation, so hopefully this can help you get started. – Adam Commented Nov 15, 2024 at 20:18
  • @Adam. Thank you. I will look into that – Cettt Commented Nov 15, 2024 at 20:23

1 Answer

You could probably automate this with rvest::read_html_live() and the resulting LiveHTML object, which lets you interact with a live page through chromote.
But let's try this with {selenider} instead, for "richer interaction" (to quote ?rvest::LiveHTML).

As a first step, you should probably just poke around that page in your browser's dev tools to see how it's all glued together: what triggers additional requests and what gets requested, which JS libraries are used for controls and widgets, whether any of the existing JavaScript can be used instead of generating clicks and keypresses, and whether the frontend sets any constraints (the max year span seems to be 5, so perhaps try to respect that). Set breakpoints, check the documentation of the libraries in use, dig into call stacks in the network tab, and search the JS code for elements that stick out.
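If the frontend really does cap a request at 5 years, a longer period has to be split into several requests. A minimal sketch of that splitting in base R, assuming the cap means at most 5 calendar years per request (counted inclusively); year_chunks() and its arguments are hypothetical names for illustration, not something the site or selenider provides:

```r
# Split the period [from, to] into consecutive ranges of at most `max_span` years.
# Assumption: the site's limit counts calendar years inclusively (e.g. 2010-2014 is 5 years).
year_chunks <- function(from, to, max_span = 5) {
  stopifnot(from <= to, max_span >= 1)
  von <- seq(from, to, by = max_span)   # first year of each chunk
  bis <- pmin(von + max_span - 1, to)   # last year of each chunk, capped at `to`
  data.frame(von = von, bis = bis)
}

chunks <- year_chunks(2010, 2024)
chunks
```

For 2010 to 2024 this gives three ranges (2010-2014, 2015-2019, 2020-2024); each row could then be used as one von/bis pair for a separate request.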

Apparently the frontend JavaScript (mostly here & here) is pretty well structured, not minified, and super-verbose with lots of comments. Since we can evaluate JavaScript with selenider / chromote, many objects and functions are already exposed for us to use, so there's really no need to invent everything from scratch. The left pane, for example, is a Fancytree widget, and we can use its API to select items, which in turn triggers the required events.

library(selenider)
library(rvest)
library(dplyr)
library(tidyr)

selenider_session(
  "chromote",
  timeout = 10
)
#> A selenider session object
#> • Open for 2ms
#> • Session: "chromote"
#> • Browser: "Chrome"
#> • Port: NA
#> • Timeout: 10s

open_url("https://www.dnet.at/amis/Datenbank/DB_Be.aspx")

# we can view current (chromote) session in a browser
get_session()$driver$view()
#> [1] 0

# get fancytree keys (id values):
ss("#tree li[data-content]")
#> { selenider_elements (9) }
#> [1] <li id="id1.1" data-content="hvs_Bestand_UB">Unselbständig Beschäftigte</li>
#> [2] <li id="id1.2" data-content="hvs_Bestand_FD">Freie Dienstverträge</li>
#> [3] <li title="Geringfügig Beschäftigte (siehe Hinweis in Information)" id="id1.3 ...
#> [4] <li title="Geringfügig Freie Dienstverträge (siehe Hinweis in Information)" i ...
#> [5] <li id="id1.5" data-content="sbe_Bestand_SB">Selbständig Beschäftigte</li>
#> [6] <li id="id2.1" data-content="akpalq_Bestand_PO">Bestand</li>
#> [7] <li id="id3.1" data-content="akpalq_Bestand_QU">Bestand</li>
#> [8] <li id="id4.1" data-content="quoten_Bestand_beQU">Beschäftigungsquote</li>
#> [9] <li id="id4.2" data-content="quoten_Bestand_erQU">Erwerbsquote</li>

# activate `id1.3`, "Geringfügig Beschäftigte (siehe Hinweis in Information)"
execute_js_expr("$.ui.fancytree.getTree('#tree').activateKey(arguments[0]);", "id1.3")

# switch Ausgabeformat to Zeitreihe
execute_js_expr("lstAusgabe.value = arguments[0]; lstAusgabeOnChange();", "ZR")

# set years
execute_js_expr(
  "lstJahrBis.value = arguments[1]; 
  lstJahrBisOnChange();
  lstJahrVon.value = arguments[0]; 
  lstJahrVonOnChange();",
  2023, 2024)

# we need some kind of a marker to know when request is completed,
# for this let's remove some content and later wait for it to reappear
execute_js_expr("document.querySelector('#divContentTemplate').innerHTML = ''")

s("[name = 'ctl00$main$btnAspBtn']") |> 
  elem_click()

# successful request recreates #divContentTemplate content,
# wait for it for max 30s 
s("#divContentTemplate > div") |> 
  elem_expect(is_present, timeout = 30)

# request title:
s("#headerAktAuswahl") |> 
  elem_text()
#> [1] "Erwerbstätige: Geringfügig Beschäftigte - Zeitreihe: Monate 2024-2024"

# parse table:
s("table#main_gAktuell") |> 
  # switch to rvest
  read_html() |> 
  html_table() |>
  # read_html() creates a new html doc, so html_table() returns a list 
  # with a single tibble
  first() |> 
  pivot_longer(everything())
#> # A tibble: 12 × 2
#>    name    value
#>    <chr>   <dbl>
#>  1 2024_01  339.
#>  2 2024_02  340.
#>  3 2024_03  341.
#>  4 2024_04  336.
#>  5 2024_05  339.
#>  6 2024_06  341.
#>  7 2024_07  327.
#>  8 2024_08  319.
#>  9 2024_09  321.
#> 10 2024_10  327.
#> 11 2024_11    0 
#> 12 2024_12    0

Created on 2024-11-17 with reprex v2.1.1
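
The name column in the result is still text, and the last two months come back as 0, presumably because they had not been published yet when this was run. A possible clean-up step, sketched on a hand-typed copy of the last four rows (values rounded as printed) so that it runs without a browser session; treating 0 as missing is an assumption here, not something the site documents:

```r
library(dplyr)

# hand-copied from rows 9-12 of the printed result, for illustration only
ts_long <- tibble::tibble(
  name  = c("2024_09", "2024_10", "2024_11", "2024_12"),
  value = c(321, 327, 0, 0)
)

ts_clean <- ts_long |>
  mutate(
    # "2024_09" -> Date of the first day of that month
    date  = as.Date(paste0(name, "_01"), format = "%Y_%m_%d"),
    # assumption: 0 means "no data yet", so turn it into NA
    value = na_if(value, 0)
  ) |>
  select(date, value)

ts_clean
```

If a genuine 0 is possible for the selected series, the na_if() step should be dropped or restricted to months after the latest published one.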
