NSOs with 300 to 500 GB of data have been loaded every day into our Price Data
Lake. The Price Data Lake is a Hadoop environment set up under the Big Data
Analytics initiative. Two sources of external data are captured from the internet
through a crawling process: online retailers, which are e-commerce sources, and
online government agencies, which publish prices, indices, and rates. The crawler
engine works on rules defined for each website that determine which objects are
to be captured. The rules are developed as programming scripts that parse the
HTML code of the website pages. A scheduler component in the crawler engine
defines when the engine runs; the schedule is set per website and follows an
algorithm designed to avoid blocking actions by the sources. The crawler also has
a monitoring component to ensure that data are captured properly according to
the defined rules and schedule. Data captured by the crawler engine are sent to
and stored in the Price Lake. Data from internal sources stored in the existing
DOSM CPI system are transferred to the Price Lake through FTP.
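As a rough illustration of how per-website rules and a scheduler could fit together, the Python sketch below defines a hypothetical rule table and a simple scheduling loop with a randomised delay. The names CRAWL_RULES, crawl_site and run_scheduler, the selectors, the interval and the delay values are illustrative assumptions, not the actual DOSM implementation.

import random
import time

# Hypothetical rule table: one entry per target website, holding the start
# URL, the CSS selectors that locate the objects to capture, and how often
# the site should be visited.
CRAWL_RULES = {
    "example-retailer.com": {
        "start_url": "https://example-retailer.com/groceries",
        "item_selector": "div.product",
        "price_selector": "span.price",
        "interval_hours": 24,
    },
}

def crawl_site(site, rule):
    # Placeholder: a real crawler would fetch rule["start_url"] and apply
    # the selectors; here we only show where that step would happen.
    print("crawling", site, "starting from", rule["start_url"])

def run_scheduler(rules):
    # Visit each configured website in turn, with a randomised pause between
    # visits so that the sources are less likely to block the crawler.
    for site, rule in rules.items():
        crawl_site(site, rule)
        time.sleep(random.uniform(5, 30))

run_scheduler(CRAWL_RULES)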
Web Crawling and Scraping Approach
External Data Sources are sources of data located outside the DOSM environment.
They are mostly websites and portals, and connectivity to them is through the
internet. Online retailers are portals or websites that provide lists of prices for
daily goods. DOSM crawls and scrapes price items ranging from consumer
electronics and electrical products, household goods, toys, sports equipment,
mobile phones, and cameras to groceries, house rentals, furniture, fashion
products, and more. Each website provides various content categories, and most
of the content consists of the title, picture, price, seller, description, and so on.
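For illustration only, a scraped listing could be represented with a simple record such as the sketch below; the field names and sample values are hypothetical and do not reflect the actual DOSM schema or any real scraped data.

from dataclasses import dataclass

@dataclass
class PriceRecord:
    # One scraped listing; the fields mirror the content categories named
    # above (title, picture, price, seller, description).
    title: str
    price: float
    seller: str
    picture_url: str
    description: str
    category: str        # e.g. "groceries", "furniture", "fashion"
    source_site: str     # the online retailer the record came from

# A purely illustrative record, not real scraped data.
record = PriceRecord(
    title="Sample product name",
    price=9.90,
    seller="Sample seller",
    picture_url="https://example-retailer.com/img/sample.jpg",
    description="Sample product description",
    category="groceries",
    source_site="example-retailer.com",
)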
Crawling is the process of capturing the HTML pages of a website and storing
them in a local repository. HTML pages contain the code that constructs the web
page, and a scraping process is required to extract the objects of interest from
that code. Because most of the web pages are dynamic, native programming in
Python was used to develop the web crawling and scraping programs. The
crawlers work on programs developed to parse the HTML files and to identify the
code that represents the objects of interest. The programs take the form of rules,
and the rules depend heavily on the layout of the target website. Once the layout
of a targeted website changes, the rules need to be changed for the crawler
engine to keep capturing the data. Hence, continuous monitoring is required. The
crawler works as a robot that simulates a human visiting the website and
collecting the content. Several libraries are needed to program the crawling script
in Python, among them "Requests", "BeautifulSoup", "re", "os" and "time".
"Requests" is a Python library used to collect the HTML of a website, while
"BeautifulSoup" is used to parse the HTML and extract the objects of interest.
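A minimal scraping sketch along these lines, assuming a placeholder URL and CSS selectors that would in practice be defined per website, could look as follows; only the requests, bs4, re and time libraries are shown, and the selectors and price-cleaning rule are illustrative assumptions rather than the DOSM rules themselves.

import re
import time
import requests
from bs4 import BeautifulSoup

URL = "https://example-retailer.com/groceries"        # placeholder target page

response = requests.get(URL, timeout=30)              # collect the HTML page
soup = BeautifulSoup(response.text, "html.parser")    # parse the HTML

items = []
for product in soup.select("div.product"):            # selector depends on the site layout
    title = product.select_one("h2.title")
    price = product.select_one("span.price")
    if title is None or price is None:
        continue                                       # skip items that do not match the rule
    # Strip currency symbols and separators before converting the price.
    amount = float(re.sub(r"[^\d.]", "", price.get_text()))
    items.append({"title": title.get_text(strip=True), "price": amount})

time.sleep(5)                                          # pause before the next request to avoid blocking
print(items)

Because the selectors encode the layout of the target page, any change to that layout breaks the rule, which is why the crawler output has to be monitored and the rules updated continuously.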