Between 300 and 500 GB of data are loaded into our Price Data Lake every day. The Price Data Lake is a Hadoop environment established under the Big Data Analytics initiative. Two kinds of external data are captured from the internet through a crawling process: online retailers, which are e-commerce sources, and online government agencies, which publish prices, indices, and rates. The crawler engine works on rules defined for each website to determine the objects to be captured. The rules are developed as programming scripts that parse the HTML code of the website pages. A scheduler component in the crawler engine defines when the engine runs; the schedule is set per website and follows an algorithm designed to avoid blocking actions by the source sites. The crawler also has a monitoring component to ensure that the data are captured properly according to the defined rules and schedule. Data captured by the crawler engine are sent to and stored in the Price Data Lake. Data from internal sources held in the existing DOSM CPI system are transferred to the Price Data Lake through an FTP approach.
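
The paper does not detail the engine's internals, but the per-website rules, scheduler, and monitoring described above might be organised roughly along the following lines. This is a minimal sketch: the `SiteRule` structure, the function names, and the scheduling loop are illustrative assumptions, not DOSM's actual implementation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch of per-website rules plus a simple scheduler;
# names and structure are assumptions, not the actual DOSM engine.

@dataclass
class SiteRule:
    name: str                     # target website
    start_url: str                # entry page for the crawl
    parse: Callable[[str], list]  # site-specific script that turns HTML into price records
    interval_hours: int           # per-site cadence, chosen to avoid blocking

def run_scheduler(rules: List[SiteRule],
                  fetch: Callable[[str], str],
                  store: Callable[[str, list], None]) -> None:
    """Visit each site on its own cadence, apply its rule to the fetched
    HTML, and push the resulting records towards the Price Data Lake
    (represented here by the `store` callback)."""
    last_run: Dict[str, float] = {rule.name: 0.0 for rule in rules}
    while True:
        now = time.time()
        for rule in rules:
            if now - last_run[rule.name] >= rule.interval_hours * 3600:
                html = fetch(rule.start_url)    # crawl step
                records = rule.parse(html)      # scrape step, rule-based
                store(rule.name, records)       # load into the lake
                last_run[rule.name] = now
        time.sleep(60)  # coarse tick; a real scheduler would also randomise timing
```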

                  Web Crawling and Scraping Approach
   External data sources are sources of data located outside the DOSM environment. They are mostly websites and portals, and connectivity to them is through the internet. Online retailers are portals or websites that provide price lists for daily goods. DOSM crawls and scrapes a wide range of price items, including consumer electronics and electrical products, household goods, toys, sports equipment, mobile phones, cameras, groceries, house rentals, furniture, and fashion products. Each website provides various content categories, and most of the content consists of fields such as the title, picture, price, seller, and description.
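
As a rough illustration of the content fields listed above (title, picture, price, seller, description), a scraped item could be represented with a record such as the following; the field names, types, and sample values are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative record layout for one scraped price observation;
# field names and types are assumptions, not the DOSM schema.

@dataclass
class ScrapedItem:
    source: str          # online retailer the item was scraped from
    category: str        # e.g. "household goods", "groceries"
    title: str
    price: float
    currency: str
    seller: str
    description: str
    image_url: str
    scraped_at: datetime

item = ScrapedItem(
    source="example-retailer.my",        # hypothetical retailer
    category="consumer electronics",
    title="55-inch LED TV",
    price=1899.00,
    currency="MYR",
    seller="ExampleStore",
    description="55-inch UHD smart TV",
    image_url="https://example.com/tv.jpg",
    scraped_at=datetime.now(),
)
```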
   Crawling is the process of capturing the HTML pages of a website and storing them in a local repository. HTML pages contain the various codes that construct the web page, so a scraping process is required to extract the objects of interest from the HTML code. Because most of the web pages are dynamic, native programming in Python was used to develop the web crawling and scraping programs. The crawler works from a program developed to parse the HTML files and to identify the codes that represent the objects of interest. The program produces output in the form of rules, and these rules depend strongly on the layout of the target website. Once the layout of a targeted website changes, the rules need to be changed for the crawler engine to keep capturing the data. Hence, continuous monitoring is required. The crawler works as a robot that simulates a human visiting the website and collecting the content. Several libraries are needed to program the crawling scripts in Python, among them "requests", "BeautifulSoup", "re", "os", and "time". "requests" is a Python library used to collect the HTML of a website, while "BeautifulSoup" is used to parse it.
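
A minimal sketch of how these libraries fit together is shown below. The URL, CSS selectors, and field names are hypothetical, since the real scraping rules depend entirely on each target site's layout.

```python
import re
import time

import requests
from bs4 import BeautifulSoup

# Illustrative crawl-and-scrape step using the libraries named above.
# The URL and selectors are placeholders, not rules for any real retailer.

def scrape_listing(url: str) -> list:
    response = requests.get(url, timeout=30)            # "requests" fetches the HTML page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")  # "BeautifulSoup" parses the HTML

    records = []
    for card in soup.select("div.product-card"):        # hypothetical selector (layout-dependent)
        title = card.select_one("h2.title")
        price = card.select_one("span.price")
        if title is None or price is None:
            continue  # missing fields often signal a layout change -> flag for monitoring
        # "re" strips currency symbols and commas so the price can be stored as a number
        amount = re.sub(r"[^\d.]", "", price.get_text())
        records.append({
            "title": title.get_text(strip=True),
            "price": float(amount) if amount else None,
        })
    time.sleep(5)  # polite delay between requests to reduce the risk of being blocked
    return records
```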
