📖 Beginning Python Programming /
Module: Web Scraping

Finding the URL

We need to know the URL in order to download files or scrape a web page. Usually this means finding the parts of the URL that vary from page to page.
For example, in the following URLs we can spot patterns such as the search query, the map coordinates, the date, or the route name.

  • https://docs.python.org/3/search.html?q=namedtuple&check_keywords=yes&area=default
  • https://duckduckgo.com/?q=python+doc
  • https://www.google.com/maps/search/Libraries/@22.1612464,113.5303786,13z
  • http://macaodaily.com/html/2020-05/04/node_2.htm
  • http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html
  • https://bis.dsat.gov.mo:37812/macauweb/routeLine.html?routeName=3&direction=0&language=zh-tw&ver=3.5.12
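To make the variable parts of a URL visible, we can split it with the standard library. The following is a minimal sketch that uses two of the URLs listed above. The variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://docs.python.org/3/search.html?q=namedtuple&check_keywords=yes&area=default"

# Split the URL into scheme, host, path and query string.
parts = urlsplit(url)

# Turn the query string into a dict that maps each key to a list of values.
params = parse_qs(parts.query)

print(parts.netloc)   # host name
print(parts.path)     # fixed part of the pattern
print(params)         # variable part of the pattern

# In a query string, "+" stands for a space, and parse_qs decodes it.
ddg = urlsplit("https://duckduckgo.com/?q=python+doc")
ddg_params = parse_qs(ddg.query)
print(ddg_params["q"][0])
```

Here the `q` parameter is the part we would change to search for something else.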

Let’s take a closer look at the DSAT.gov.mo bus route page. If we switch between bus routes, we can observe that the page URL doesn’t change. There may be two reasons:

  1. The page changes are generated via JavaScript rendering.
  2. The page is inside an iframe so that page changes do not change the top-level URL.

If it is the first reason, we will need a more advanced browser driver technique. If it is the second reason, we can get the URL by opening the link in a new tab, or simply copying the link location via right-click.
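For the iframe case, the inner URL can also be read from the page source instead of right-clicking. The sketch below uses only the standard library. The HTML snippet, the example.com address, and the `IframeFinder` class are made up for illustration and are not taken from the DSAT site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Hypothetical top-level page and its HTML source.
top_url = "https://example.com/routes/index.html"
sample_html = """
<html><body>
  <iframe src="frame/route.html?routeName=3"></iframe>
</body></html>
"""

finder = IframeFinder()
finder.feed(sample_html)

# The src is often relative, so resolve it against the top-level URL.
iframe_urls = [urljoin(top_url, src) for src in finder.sources]
print(iframe_urls)
```

If the list comes back empty on a real page, the content is more likely generated by JavaScript (the first reason).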

Now we can observe the URL for each route has the following pattern.

https://bis.dsat.gov.mo:37812/macauweb/routeLine.html?routeName=3&direction=0&language=zh-tw&ver=3.5.12
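Once the pattern is known, we can generate the URL for any route. This is a sketch only: the helper name `route_url` is made up, and it assumes that changing `routeName` and `direction` is enough while `language` and `ver` stay as in the URL above.

```python
from urllib.parse import urlencode

BASE = "https://bis.dsat.gov.mo:37812/macauweb/routeLine.html"


def route_url(route_name, direction=0, language="zh-tw", ver="3.5.12"):
    """Build the route page URL from its query parameters."""
    query = urlencode({
        "routeName": route_name,
        "direction": direction,
        "language": language,
        "ver": ver,
    })
    return f"{BASE}?{query}"


# Reproduces the URL observed above for route 3.
print(route_url("3"))

# Assumption: the other direction of the same route uses direction=1.
print(route_url("3", direction=1))
```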


Take DICJ.gov.mo as another example. The page URL is:

http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html

If we inspect the network requests in the browser's developer tools, we can find the behind-the-scenes XML URL that the page loads its data from:

http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/report_cn.xml?id=10
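The XML URL sits in the same folder as the page, so it can be derived from the page URL. The sketch below shows this with `urljoin`. The helper `report_url` is hypothetical, and it assumes that other years follow the same folder layout, which should be checked in the browser first.

```python
from urllib.parse import urljoin

page_url = "http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html"

# Resolve the XML file name relative to the page URL.
xml_url = urljoin(page_url, "report_cn.xml?id=10")
print(xml_url)


def report_url(year, report_id=10):
    """Build the XML URL for a year (assumes the same layout every year)."""
    page = f"http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/{year}/index.html"
    return urljoin(page, f"report_cn.xml?id={report_id}")


print(report_url(2020))
```

The resulting URL can then be passed to a download function such as `urllib.request.urlopen`.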