Soup” was a library used to extract the data we needed from the HTML. “re” was used to extract substrings from a string, and “os” was used to create directories automatically. The “time” library was used to make the program sleep whenever necessary. First of all, we created one function in the Python program to get all the category links on the website. Then we created one function to initialise the variables and the constants.
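As a rough sketch of this setup (the URL, folder name, selector, and the use of the “requests” library are illustrative assumptions, not the actual DOSM code), the imports and these two functions could look like the following:

    import os
    import re
    import time

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://www.example.com"   # placeholder for the target website
    OUTPUT_DIR = "data"                    # root folder for the scraped data

    def init():
        """Initialise the variables and constants used by the crawler."""
        os.makedirs(OUTPUT_DIR, exist_ok=True)

    def get_category_links():
        """Get all the category links on the website's main page."""
        soup = BeautifulSoup(requests.get(BASE_URL).text, "html.parser")
        # "a.category" is a hypothetical selector; "re" and "time" are
        # used in later steps of the crawler
        return [a["href"] for a in soup.select("a.category")]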
We parsed the category links using the “BeautifulSoup” library mentioned above, imported as “bs4”. The result from “bs4” always comes in the form of a list, thus a “for” loop was used to iterate over the data. The selector in the HTML needs to be identified before we can parse the HTML. The selector refers to a “tag” in HTML, or to a “class” or “id” in CSS. The selector in the HTML can be identified by using CSS or XPath selectors.
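For instance, identifying elements by HTML tag versus by CSS class, and looping over the list that “bs4” returns, could look like this:

    from bs4 import BeautifulSoup

    html = '<ul><li class="cat"><a href="/prices">Prices</a></li></ul>'
    soup = BeautifulSoup(html, "html.parser")

    by_tag = soup.find_all("a")          # selector as an HTML tag
    by_css = soup.select("li.cat > a")   # selector as a CSS class

    result = []
    for link in by_css:                  # bs4 returns a list, so "for" loops over it
        result.append(link["href"])      # result == ["/prices"]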
After we parsed the category links by looping over the data and putting them into the variable “result”, we created one function to request each category link URL to get the HTML data, and we stored it at URL index 1. We then parsed the HTML data again to get the item URL links, which take one parameter, and these were also put into the variable “result” as a list.
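A sketch of this request-and-parse step for one category link (the “a.item” selector is an assumption about the page layout):

    import requests
    from bs4 import BeautifulSoup

    def get_item_links(category_url):
        """Request one category link URL and parse the item links from its HTML."""
        html = requests.get(category_url).text
        soup = BeautifulSoup(html, "html.parser")
        return [a["href"] for a in soup.select("a.item")]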
The next function was to get the data; it takes two parameters, one for the index name and another one for the data item link. The “os” library was used to create the folder automatically, and the data was stored according to its index.
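A minimal sketch of such a function (the file name and storage format are assumptions):

    import os
    import requests

    def get_data(index_name, item_link):
        """Fetch one item and store it in a folder named after its index."""
        os.makedirs(index_name, exist_ok=True)   # create the folder automatically
        html = requests.get(item_link).text
        with open(os.path.join(index_name, "page.html"), "w", encoding="utf-8") as f:
            f.write(html)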
Then we created a function to get data up to the last page of the website, so that all the indexes in the category link are collected. The last page is looped over to get all the index links, before we created the last function to request all the index links.
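Assuming the site exposes the page number as a URL parameter (the “?page=” parameter is an assumption; real sites vary), the loop over the pages might look like this, reusing get_item_links from the earlier sketch:

    import time

    def get_all_index_links(category_url, last_page):
        """Loop up to the last page of a category to collect every index link."""
        links = []
        for page in range(1, last_page + 1):
            page_url = f"{category_url}?page={page}"   # hypothetical pagination
            links.extend(get_item_links(page_url))
            time.sleep(1)   # pause between requests, as the "time" library allows
        return links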
There are different approaches and techniques for developing the crawling program in Python, depending on the website. In some cases, the website uses AJAX and there is no HTML content revealed. This needs to be treated differently. Asynchronous JavaScript and XML, abbreviated AJAX, is a web-based programming technique for creating interactive web applications. The goal is to move most of the interaction onto the web surfer's computer and to exchange data with the server behind the scenes, so that the web page does not have to be re-read in its entirety every time the user makes a change.
First, we need to check for AJAX by inspecting the website; then we can click on the Network tab and select XHR. The analysis needs to be made on the AJAX base link, the AJAX id, and the ending of the AJAX link. The AJAX URL categories of a website need to be collected by searching for the ID in the AJAX URL, which keeps changing in the content element. In this case, the Selenium library must be used, because the element cannot be fetched with a plain request.
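A minimal Selenium sketch for such a case (the URL and selector are placeholders; a real browser executes the AJAX calls that a plain request cannot):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://www.example.com/ajax-page")   # placeholder URL
    driver.implicitly_wait(10)   # give the XHR responses time to populate the page
    items = driver.find_elements(By.CSS_SELECTOR, "div.item a")   # hypothetical
    links = [item.get_attribute("href") for item in items]
    driver.quit()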
Figure 3 shows one example of a parsed HTML result that had been stored in HBase after the robot successfully saved the page description, including the indexes.
How crawlers work is not much different from how spam agents work. Sometimes the target websites implement security measures to detect robots accessing the website and block them, to reduce the drain on the target website's bandwidth. To avoid being detected as a robot, the crawler must be smart enough that the target website does not recognise it as one.
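One common way to make a crawler look less like a robot (offered here as a general illustration, not necessarily what DOSM implemented) is to send a browser-like User-Agent header and pause randomly between requests:

    import random
    import time

    import requests

    HEADERS = {
        # an illustrative browser-like User-Agent string
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }

    def polite_get(url):
        """Fetch a URL with a browser-like header and a randomised pause."""
        response = requests.get(url, headers=HEADERS)
        time.sleep(random.uniform(1, 3))   # random delay between requests
        return response.text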
DOSM has implemented several