I built a small web scraper that has run successfully in a Google Colab notebook over the last few months. It downloads a set of billing codes from the CMS website. Recently the driver started throwing timeout exceptions when retrieving some, but not all, URLs. The reprex below downloads a file from two URLs. It executes successfully when I run it locally, but when running in Google Colab it fails while retrieving the second URL.
The timeout happens in driver.get(url). Strangely, the code works as long as the driver has not previously visited another URL. For example, in the code below, not_working_url will successfully retrieve the webpage and download the file if it does not come after working_url.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

def download_documents() -> None:
    """Download billing code documents from CMS."""
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(options=chrome_options)

    working_url = "https://www.cms.gov/medicare-coverage-database/view/article.aspx?articleid=59626&ver=6"
    not_working_url = "https://www.cms.gov/medicare-coverage-database/view/lcd.aspx?lcdid=36377&ver=19"

    for row in [working_url, not_working_url]:
        print(f"Retrieving from {row}...")
        driver.get(row)  # Fails on second url
        print("Wait for webdriver...")
        wait = WebDriverWait(driver, 2)

        print("Attempting license accept...")
        # Accept license
        try:
            wait.until(EC.element_to_be_clickable((By.ID, "btnAcceptLicense"))).click()
        except TimeoutException:
            pass

        wait = WebDriverWait(driver, 4)
        print("Attempting pop up close...")
        # Click on Close button of the second pop-up
        try:
            wait.until(
                EC.element_to_be_clickable(
                    (
                        By.XPATH,
                        "//button[@data-page-action='Clicked the Tracking Sheet Close button.']",
                    )
                )
            ).click()
        except TimeoutException:
            pass

        print("Attempting download...")
        driver.find_element(By.ID, "btnDownload").click()

download_documents()
Expected behavior: The code above runs successfully in Google Colab, just like it does locally.
A potentially related issue: Selenium TimeoutException in Google Colab
2 Answers
I was able to run my script successfully by initializing (and closing) the driver on every iteration of the loop rather than just once before it started.
For example, the loop below retrieves the url without timing out on each iteration. I would still appreciate any commentary explaining why I would ever need to reinitialize the driver regardless of my programming environment, but hopefully this solution is helpful for others who run into this issue.
for row in [working_url, not_working_url]:
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(row)
    driver.close()
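For completeness, here is a minimal sketch of the same workaround applied to the question's full loop body, i.e. a fresh driver per URL. The waits, locators, and URLs are taken from the question; the use of quit() (rather than close()) and the try/finally structure are my own assumptions about how the pieces fit together:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

def download_documents() -> None:
    """Download billing code documents from CMS, using a fresh driver per URL."""
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    urls = [
        "https://www.cms.gov/medicare-coverage-database/view/article.aspx?articleid=59626&ver=6",
        "https://www.cms.gov/medicare-coverage-database/view/lcd.aspx?lcdid=36377&ver=19",
    ]

    for url in urls:
        # Start a new browser session for every URL (the workaround above).
        driver = webdriver.Chrome(options=chrome_options)
        try:
            driver.get(url)
            # Accept the license dialog if it appears.
            try:
                WebDriverWait(driver, 2).until(
                    EC.element_to_be_clickable((By.ID, "btnAcceptLicense"))
                ).click()
            except TimeoutException:
                pass
            # Close the tracking sheet pop-up if it appears.
            try:
                WebDriverWait(driver, 4).until(
                    EC.element_to_be_clickable(
                        (By.XPATH, "//button[@data-page-action='Clicked the Tracking Sheet Close button.']")
                    )
                ).click()
            except TimeoutException:
                pass
            driver.find_element(By.ID, "btnDownload").click()
        finally:
            # End the session even if a step fails, so the next URL starts clean.
            driver.quit()

download_documents()

Starting a new browser per URL is slower, but it sidesteps whatever per-session state was causing the second driver.get() call to time out in Colab.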
Try the arguments below:
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
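For reference, a minimal sketch of wiring these options into the driver. The set_page_load_timeout call and its 60-second value are my own addition as an extra safeguard against slow page loads, not part of the original suggestion:

from selenium import webdriver

driver = webdriver.Chrome(options=chrome_options)
# Optional: give slow pages more time before driver.get() raises TimeoutException.
driver.set_page_load_timeout(60)
driver.get("https://www.cms.gov/medicare-coverage-database/view/lcd.aspx?lcdid=36377&ver=19")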