“BeautifulSoup” was a library used to extract the data we need from the HTML. “re” was used to extract substrings from strings, “os” was used to create directories automatically, and the “time” library was used to make the program sleep when needed. First of all, we created one function in the Python program to get all the category links on the website. Then we created one function to initialise the variables and constants.
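As a minimal sketch of these imports and of such a setup function (the base URL, folder name, and delay below are placeholder assumptions, not values from the paper):

```python
import os
import re
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical base URL; the paper does not name the target website here.
BASE_URL = "https://example.com"

def init_constants():
    """Initialise the variables and constants used by the crawler."""
    return {
        "base_url": BASE_URL,
        "output_dir": "output",  # created later with the "os" library
        "delay": 1.0,            # seconds to sleep between requests
    }
```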
We parsed the category links using the “BeautifulSoup” library mentioned above, imported as “bs4”. The parsing result is always a list, so a “for” loop was used to iterate over the data. The selectors in the HTML need to be identified before the HTML can be parsed. A selector refers to a “tag” in the HTML, or to a “class” or “id” in the CSS, and can be identified using CSS or XPath selectors.
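A sketch of this step, assuming a hypothetical CSS selector “a.category” (the real tag, class, or id must be identified by inspecting the target page); soup.select() returns a list, which is why the “for” loop is needed:

```python
import requests
from bs4 import BeautifulSoup

def get_category_links(base_url):
    """Request the front page and return every category link as a list."""
    soup = BeautifulSoup(requests.get(base_url).text, "html.parser")
    result = []
    # "a.category" is a hypothetical CSS selector; inspect the target
    # page (or use an XPath selector) to identify the real one.
    for tag in soup.select("a.category"):
        result.append(tag["href"])
    return result
```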
After parsing the category links in a loop and putting them in the variable “result”, we created one function that requests each category URL to fetch its HTML, which we store at index 1 of the URL list. We then parsed the HTML again to extract the item URLs, which take one parameter, and these were also put in the variable “result” as a list. The next function fetched the data and took two parameters, one for the index name and another for the data item link.
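A sketch of these two functions under the same assumptions, with “a.item” as a hypothetical selector for the item links:

```python
import requests
from bs4 import BeautifulSoup

def get_item_links(category_url):
    """Request one category link and return its item URLs as a list."""
    soup = BeautifulSoup(requests.get(category_url).text, "html.parser")
    # "a.item" is again a hypothetical selector for the item links.
    result = [tag["href"] for tag in soup.select("a.item")]
    return result

def get_data(index_name, item_link):
    """Fetch one item page; the index name decides where it is stored."""
    return index_name, requests.get(item_link).text
```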
The “os” library was used to create the folders automatically, and the data were stored according to their indexes. We then created a function that fetches data up to the last page of the website, so that all indexes in a category link are collected. The last page is looped over to gather all the index links before the final function requests each of them.
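The following sketch combines the folder creation and the page loop; the “?page=” query parameter and the fixed one-second sleep are assumptions about the target site, not details given in the paper:

```python
import os
import time

import requests

def crawl_category(category_url, index_name, last_page):
    """Save every page of a category into its own index folder."""
    folder = os.path.join("output", index_name)
    os.makedirs(folder, exist_ok=True)   # create the folder automatically
    for page in range(1, last_page + 1):
        # "?page=" is a hypothetical pagination parameter.
        html = requests.get(f"{category_url}?page={page}").text
        path = os.path.join(folder, f"page_{page}.html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        time.sleep(1)                    # pause between requests
```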
Different websites require different approaches and techniques when developing a crawling program in Python. In some cases the website uses AJAX and no HTML content is revealed, so it needs to be treated differently. Asynchronous JavaScript and XML, abbreviated AJAX, is a web-based programming technique for creating interactive web applications. The goal is to move most of the interaction to the web surfer's computer, exchanging data with the server behind the scenes, so that the web page does not have to be reloaded in its entirety every time the user makes a change.
To check for AJAX, we first inspect the website, then click the Network tab and select XHR. The analysis has to cover the AJAX base link, the AJAX id, and the end of the AJAX link. The category AJAX URLs of a website are collected by searching for the id in the AJAX URL, which keeps changing within the content element. In this case, the Selenium library must be used, because the element cannot be fetched with a plain request.
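A minimal Selenium sketch of this fallback, assuming the page only needs a short wait for its XHR calls to complete:

```python
import time

from selenium import webdriver

def get_ajax_page(url):
    """Render an AJAX-driven page in a real browser and return its HTML."""
    driver = webdriver.Chrome()    # needs a ChromeDriver installation
    try:
        driver.get(url)
        time.sleep(3)              # crude wait for the XHR calls to finish
        return driver.page_source  # the HTML after JavaScript has run
    finally:
        driver.quit()
```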
Figure 3 shows one example of a parsed HTML result stored in HBase after the robot successfully saved the page description, including the indexes.
How crawlers work is not much different from how spam agents work. Target websites sometimes implement security measures to detect robots accessing the site and block them in order to protect the target website's bandwidth. To avoid detection, the crawler must be smart enough to keep the target website from recognising it as a robot.
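One common, minimal way to look less like a robot is to send a browser-like User-Agent and to pause randomly between requests, as sketched below; the header string and the delay range are illustrative assumptions, not necessarily the measures referred to here:

```python
import random
import time

import requests

# An illustrative browser-like header; real crawlers may rotate several.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def polite_get(url):
    """Fetch a URL with a browser-like header and a randomised delay."""
    time.sleep(random.uniform(1, 3))  # avoid a machine-like request rate
    return requests.get(url, headers=HEADERS)
```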
DOSM has implemented several
