web scraping - WebScraping in nodes and elements in R - Stack Overflow

I am trying to scrape the name and location of each aerodrome listed at https://www.casa.gov.au/search-centre/aerodromes, but I get an empty data frame. I've tried both a CSS selector and XPath, and neither picks anything up. Any help is appreciated!

library(rvest)

Aeroframe<-data.frame()
url <- "https://www.casa.gov.au/search-centre/aerodromes"
webpage <- read_html(url)
title<-webpage%>%html_nodes(".field_item")%>%html_text()
location<-webpage%>%html_nodes(".field_label")%>%html_text()
AeroFrame<-data.frame(title,location)
asked Mar 18 at 7:41 by evani, edited Mar 18 at 8:07 by margusl
  • Welcome to SO! You might be dealing with a typo here: your selectors seem to be missing a second underscore, e.g. .field_item should read .field__item – margusl, commented Mar 18 at 7:51
  • Using .field__item and .field__label solves your issue, as per margusl's great comment. It does not, however, give two equal-sized vectors. So I would do Aeroframe <- webpage %>% html_nodes(".card-fields") %>% html_text() %>% sub("Aerodrome operator:", "", .) %>% gsub("Location:", ", ", .); res <- do.call(rbind.data.frame, strsplit(Aeroframe, ", ")); res <- data.frame(Aerodrome_operator = trimws(res[,1]), Location = trimws(res[,2])) and crawl over https://www.casa.gov.au/search-centre/aerodromes?page=x, where x is the page index (x = 1 gives page 2) – Tim G, commented Mar 18 at 10:26

1 Answer

It would help to know what exactly you received from read_html(), but you may be facing a couple of issues here.

By inspecting the elements (and the page source), we can see that the actual class names are spelled a bit differently, with a double underscore:

<div class="field field--label-inline">
  <div class="field__label">Aerodrome operator:</div>
  <div class="field__item"> Abra Mining Pty Limited </div>
</div>
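
To check the selector spelling without touching the live site, the fragment quoted above can be parsed locally. This is only a minimal sketch: it assumes rvest 1.0 or later (for minimal_html() and html_elements()), and the names snippet, n_single and item are made up for the illustration.

```r
library(rvest)

# The HTML fragment quoted above, parsed locally (no network access needed)
snippet <- minimal_html('
  <div class="field field--label-inline">
    <div class="field__label">Aerodrome operator:</div>
    <div class="field__item"> Abra Mining Pty Limited </div>
  </div>')

# Single underscore, as in the question: matches nothing
n_single <- length(html_elements(snippet, ".field_item"))

# Double underscore, as in the page source: matches the item
item <- snippet |> html_elements(".field__item") |> html_text(trim = TRUE)

n_single
item
```

With the single-underscore selector the node set is empty, which on its own would be enough to produce an empty data frame.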

Though there's a good chance that you never actually received any relevant content from read_html(). At least with my setup and from my location I first need to fiddle with request headers a bit to get anything back, something like:

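library(httr2)

url <- "https://www.casa.gov.au/search-centre/aerodromes"

# send browser-like headers; without them I did not get anything back
request(url) |>
  req_user_agent("Mozilla/5.0") |>
  req_headers(Connection = "Keep-Alive") |>
  req_perform() |>
  resp_body_html()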

And then I'm treated to a small JavaScript challenge that is there to block some automated tools (like rvest).

If you have Chrome or any other Chromium-based browser, like Edge, and {chromote} installed, you can try replacing read_html() with read_html_live(). And perhaps adjust your strategy a bit:

library(rvest)

url_ <- "https://www.casa.gov.au/search-centre/aerodromes"
webpage <- read_html_live(url_)

# collect containers
cards <- webpage |> html_elements(".card-fields")

# extract 1st & 2nd set of labels & fields from every container:
tibble::tibble(
  f1_label = cards |> html_element(xpath = "./div[1]/div[@class='field__label']") |> html_text(trim = TRUE),
  f1_item  = cards |> html_element(xpath = "./div[1]/div[@class='field__item']" ) |> html_text(trim = TRUE),
  f2_label = cards |> html_element(xpath = "./div[2]/div[@class='field__label']") |> html_text(trim = TRUE),
  f2_item  = cards |> html_element(xpath = "./div[2]/div[@class='field__item']" ) |> html_text(trim = TRUE)
)
#> # A tibble: 15 × 4
#>    f1_label            f1_item                                  f2_label f2_item
#>    <chr>               <chr>                                    <chr>    <chr>  
#>  1 Aerodrome operator: Abra Mining Pty Limited                  Locatio… WA     
#>  2 Aerodrome operator: Adelaide Airport Limited                 Locatio… SA     
#>  3 Aerodrome operator: City of Albany                           Locatio… WA     
#>  4 Aerodrome operator: Albury City Council                      Locatio… NSW    
#>  5 Aerodrome operator: Alice Springs Airport Pty Ltd            Locatio… NT     
#>  6 Aerodrome operator: Barcaldine Regional Council              Locatio… Qld    
#>  7 Aerodrome operator: Ararat Rural City Council                Locatio… Vic    
#>  8 Aerodrome operator: Archerfield Airport Corporation Pty Ltd  Locatio… Qld    
#>  9 Aerodrome operator: Argyle Diamonds Limited                  Locatio… WA     
#> 10 Aerodrome operator: Armidale Regional Council                Locatio… NSW    
#> 11 Aerodrome operator: Aurukun Shire Council                    Locatio… Qld    
#> 12 Aerodrome operator: Avalon Airport Australia Pty Ltd         Locatio… Vic    
#> 13 Aerodrome operator: Voyages Indigenous Tourism Australia Pt… Locatio… NT     
#> 14 Aerodrome operator: East Gippsland Shire Council             Locatio… Vic    
#> 15 Aerodrome operator: Wirrimanu Aboriginal Corporation         Locatio… WA

Created on 2025-03-18 with reprex v2.1.1
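
A comment on the question notes that selecting .field__item and .field__label directly does not give two equal-sized vectors. The sketch below illustrates, on made-up HTML rather than the real page, how that can happen and why the per-container approach avoids it: html_elements() flattens every match into one vector, while html_element() returns exactly one result per container and NA where nothing matches. The card contents ("Operator A", "Operator B") and the missing Location field are assumptions chosen for the demonstration.

```r
library(rvest)

# Two made-up cards; the second one deliberately lacks its Location field
page <- minimal_html('
  <div class="card-fields">
    <div class="field"><div class="field__label">Aerodrome operator:</div><div class="field__item">Operator A</div></div>
    <div class="field"><div class="field__label">Location:</div><div class="field__item">WA</div></div>
  </div>
  <div class="card-fields">
    <div class="field"><div class="field__label">Aerodrome operator:</div><div class="field__item">Operator B</div></div>
  </div>')

cards <- html_elements(page, ".card-fields")

# html_elements() collects all matches in one flat vector,
# so it is no longer clear which item belongs to which card
all_items <- cards |> html_elements(".field__item") |> html_text(trim = TRUE)

# html_element() yields one result per card, NA where the field is absent
operator <- cards |>
  html_element(xpath = "./div[1]/div[@class='field__item']") |>
  html_text(trim = TRUE)
location <- cards |>
  html_element(xpath = "./div[2]/div[@class='field__item']") |>
  html_text(trim = TRUE)

# both vectors have one entry per card, so data.frame() lines them up
res <- data.frame(operator, location)
res
```

Because operator and location always have one entry per card, they can be combined into a data frame even when a field is missing from some card.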
