get_dynamic_url

pyhelpers.ops.get_dynamic_url(page_url, pattern, base_url)[source]

Gets a dynamic URL matching a specific regex pattern.

This function is primarily used to retrieve direct download links for datasets (e.g. timestamped files) whose filenames may change. It parses the HTML of the page_url and looks for the first <a> tag with an href attribute that matches the provided pattern.

Parameters:
  • page_url (str) – The URL of the webpage to scrape for links.

  • pattern (str) – A regular expression pattern to search for within the href attributes.

  • base_url (str) – The base URL used to resolve relative links found on the page.

Returns:

The fully qualified URL if a match is found; None otherwise.

Return type:

str | None

Examples:

>>> from pyhelpers.ops import get_dynamic_url
>>> page_url = "https://northerngasopendataportal.co.uk/datasets/"
>>> pattern = ".*\.pdf"
>>> base_url = "https://northerngasopendataportal.co.uk"
>>> download_url = get_dynamic_url(page_url, pattern, base_url)
>>> print(download_url)
https://www.northerngasnetworks.co.uk/wp-content/uploads/2018/05/GDPR-Privacy-Statement...