get_page_catalogue

pyrcs.parser.get_page_catalogue(url, head_tag_name='nav', head_tag_txt='Jump to: ', feature_tag_name='h3', verbose=False, raise_error=False)

Gets the catalogue of features from the main page of a data cluster.

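This function extracts the features listed on a web page by parsing two kinds of tags: the head tag that holds the feature list at the top of the page (by default, a 'nav' tag containing the text 'Jump to: ') and the tag used for the heading of each feature (by default, 'h3'). It is typically used on the main pages of railway-related data clusters.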
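The tag-based extraction described above can be sketched with the standard library alone. This is an illustrative sketch, not the pyrcs implementation: the class CatalogueParser, the function sketch_catalogue and the sample HTML are all hypothetical, and pairing links with headings by position is an assumption made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CatalogueParser(HTMLParser):
    """Collect links inside the head tag and the text of each feature heading."""

    def __init__(self, base_url, head_tag_name="nav", feature_tag_name="h3"):
        super().__init__()
        self.base_url = base_url
        self.head_tag_name = head_tag_name
        self.feature_tag_name = feature_tag_name
        self.head_text = []  # all text seen inside the head tag
        self.links = []      # [feature name, absolute URL] pairs
        self.headings = []   # text of each feature heading
        self._in_head = False
        self._in_link = False
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == self.head_tag_name:
            self._in_head = True
        elif tag == "a" and self._in_head:
            href = dict(attrs).get("href", "")
            # Resolve relative links against the page URL.
            self.links.append(["", urljoin(self.base_url, href)])
            self._in_link = True
        elif tag == self.feature_tag_name:
            self.headings.append("")
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag == self.head_tag_name:
            self._in_head = False
        elif tag == "a":
            self._in_link = False
        elif tag == self.feature_tag_name:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_head:
            self.head_text.append(data)
        if self._in_link:
            self.links[-1][0] += data
        if self._in_heading:
            self.headings[-1] += data


def sketch_catalogue(html, base_url, head_tag_txt="Jump to: "):
    """Return one dict per feature, or an empty list if the head text is absent."""
    parser = CatalogueParser(base_url)
    parser.feed(html)
    head = "".join(parser.head_text).strip()
    if not head.startswith(head_tag_txt.strip()):
        return []
    # Assumption: the n-th link in the head tag matches the n-th heading.
    return [
        {"Feature": name.strip(), "URL": url, "Heading": heading.strip()}
        for (name, url), heading in zip(parser.links, parser.headings)
    ]


# Hypothetical page fragment; the real page layout may differ.
SAMPLE_HTML = """
<nav>Jump to: <a href="#beamish">Beamish Tramway</a> | <a href="#seaton">Seaton Tramway</a></nav>
<h3 id="beamish">Beamish Tramway</h3>
<h3 id="seaton">Seaton Tramway</h3>
"""
BASE_URL = "http://www.railwaycodes.org.uk/electrification/mast_prefix2.shtm"

rows = sketch_catalogue(SAMPLE_HTML, BASE_URL)
for row in rows:
    print(row)
```

The sketch returns plain dictionaries keyed by 'Feature', 'URL' and 'Heading' to mirror the column names of the dataframe that get_page_catalogue returns; how the library itself matches features to headings is not stated here.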

Parameters:
  • url (str) – The URL of the main page of a data cluster.

  • head_tag_name (str) – The tag name of the feature list at the top of the page; defaults to 'nav'.

  • head_tag_txt (str) – Text contained in the head tag; defaults to 'Jump to: '.

  • feature_tag_name (str) – The tag name of the headings of each feature; defaults to 'h3'.

  • verbose (bool | int) – Whether to print relevant information to the console; defaults to False.

  • raise_error (bool) – Whether to raise an exception if one occurs; if raise_error=False (default), the error is suppressed.
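The interplay of verbose and raise_error can be illustrated with a small, self-contained sketch. The helper load_or_none and the failing loader are hypothetical and are not part of pyrcs; in particular, returning None when an error is suppressed is an assumption of this sketch, since the value returned in that case is not documented above.

```python
def load_or_none(loader, verbose=False, raise_error=False):
    """Call loader(); on failure, optionally report and optionally re-raise."""
    try:
        return loader()
    except Exception as exc:
        if verbose:
            # Print a short message describing the failure.
            print(f"Failed. {exc}")
        if raise_error:
            raise
        # Assumption for this sketch: a suppressed error yields None.
        return None


def failing_loader():
    # Stand-in for a request that cannot be completed.
    raise ConnectionError("unable to reach the page")


suppressed = load_or_none(failing_loader, verbose=True)

try:
    load_or_none(failing_loader, raise_error=True)
    raised = False
except ConnectionError:
    raised = True
```

With the defaults the failure is swallowed and the caller must check the result; with raise_error=True the original exception propagates, which is usually easier to debug.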

Returns:

A dataframe containing the page's feature catalogue, with the columns 'Feature', 'URL' and 'Heading'.

Return type:

pandas.DataFrame

Examples:

>>> from pyrcs.parser import get_page_catalogue
>>> from pyhelpers.settings import pd_preferences
>>> pd_preferences(max_columns=1)
>>> elec_url = 'http://www.railwaycodes.org.uk/electrification/mast_prefix2.shtm'
>>> elec_catalogue = get_page_catalogue(elec_url)
>>> elec_catalogue
                                              Feature  ...
0                                     Beamish Tramway  ...
1                                  Birkenhead Tramway  ...
2                         Black Country Living Museum  ...
3                                   Blackpool Tramway  ...
4   Brighton and Rottingdean Seashore Electric Rai...  ...
..                                                ...  ...
17                                     Seaton Tramway  ...
18                                Sheffield Supertram  ...
19                          Snaefell Mountain Railway  ...
20  Summerlee, Museum of Scottish Industrial Life ...  ...
21                                  Tyne & Wear Metro  ...
[22 rows x 3 columns]
>>> elec_catalogue.columns.to_list()
['Feature', 'URL', 'Heading']