
AutoCrawler – Crawling dynamic web applications and furthering deep web outreach

Web crawlers, also referred to as web spiders, have been studied extensively ever since the World Wide Web was launched. More recently, researchers have turned to crawlers that attempt to reach parts of the web requiring the completion of forms, which constitute part of the deep web.

According to recent research studies and statistics published by BrightPlanet in 2012, the Deep Web contains 400-500 times more data than can be found via the Surface Web. Automatic data harvesting from the Deep Web has consequently become one of the most active research topics in the field of web crawling.


During a recent research study on automatic crawling of dynamic web applications, two recurring situations were identified that existing tools do not handle properly. The first is that some websites use CSS pseudo-classes to mark clickable elements, which prevents current technologies from simulating user actions on them. The second is that pop-up windows can be triggered while elements are being clicked automatically, which terminates the procedures of currently used web crawlers. To solve the CSS pseudo-class problem, the researchers placed a proxy server in their system to analyze the JavaScript code of target websites both statically and dynamically. They also proposed a checking mechanism to deal with pop-up windows. Building on this proxy server, they deployed an automatic web crawler which they named “AutoCrawler”.
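To make the pop-up checking mechanism and the clickable-element problem more concrete, the following minimal sketch uses Selenium WebDriver to dismiss JavaScript dialogs so the clicking loop is not blocked, and to widen the set of candidate clickable elements beyond plain links. This is an illustration under assumptions, not the authors' proxy-based implementation: the computed-cursor heuristic only approximates what static analysis of CSS pseudo-classes and JavaScript would reveal, and the target URL is a placeholder.

```python
# Illustrative sketch (not the authors' proxy-based implementation):
# dismiss JavaScript dialogs so automated clicking is not blocked, and
# collect candidate clickable elements beyond plain <a href> links.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import (
    NoAlertPresentException,
    UnexpectedAlertPresentException,
)

def dismiss_popups(driver):
    """Close any alert/confirm/prompt dialog so the crawl loop can continue."""
    try:
        driver.switch_to.alert.dismiss()
    except NoAlertPresentException:
        pass

def candidate_clickables(driver):
    """Collect elements that may be clickable. Checking the computed
    'cursor' style only approximates what static analysis of CSS
    pseudo-classes and JavaScript would reveal."""
    elements = driver.find_elements(By.CSS_SELECTOR, "a, button, [onclick]")
    for el in driver.find_elements(By.CSS_SELECTOR, "div, span, li"):
        if el.value_of_css_property("cursor") == "pointer":
            elements.append(el)
    return elements

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder target URL
for el in candidate_clickables(driver):
    try:
        el.click()  # a real crawler would re-locate elements after each DOM change
    except UnexpectedAlertPresentException:
        dismiss_popups(driver)
driver.quit()
```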

Evaluation results show that AutoCrawler covers more dynamic interactions and fetches more valuable data from real-world web applications, offering a new means of crawling large amounts of data from the deep web.

AutoCrawler’s applications and the contributions of the research study:

Crawling modern AJAX-based web systems requires a different approach than the traditional way of extracting hypertext links from web pages and sending requests to the server. This study proposed an automated crawling technique for AJAX-based web applications, via AutoCrawler, which is based on dynamic analysis of the client-side web user interface in embedded browsers.
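As a concrete illustration of this dynamic client-side analysis, the sketch below compares DOM snapshots before and after a simulated click to decide whether a new client-side state was reached. It is a minimal sketch using Selenium WebDriver rather than the paper's embedded-browser setup; the target URL is a placeholder.

```python
# Illustrative sketch (not the paper's embedded-browser setup): decide
# whether a simulated click produced a new client-side state by comparing
# DOM snapshots before and after the click.
import hashlib

from selenium import webdriver
from selenium.webdriver.common.by import By

def dom_hash(driver):
    """Hash the current DOM so two snapshots can be compared cheaply."""
    html = driver.execute_script("return document.documentElement.outerHTML;")
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder AJAX application

before = dom_hash(driver)
for element in driver.find_elements(By.CSS_SELECTOR, "a, button"):
    element.click()
    after = dom_hash(driver)
    if after != before:
        print("click produced a new client-side state")
        break  # a real crawler re-locates elements after every DOM change
driver.quit()
```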

The main contributions of this study are:

– An analysis of the main problems involved in crawling AJAX-based applications, including pop-up windows and clickable elements.

– A systematic process and algorithm to drive an AJAX application and infer a state machine from the detected state changes and transitions; the challenges addressed include identifying clickable elements, detecting DOM changes, and constructing the state machine (see the sketch after this list).

– A concurrent multi-browser crawling algorithm, introduced via AutoCrawler, to improve runtime performance and increase the yield of material crawled from the deep web.

– The open-source tool CRAWLJAX, which implements the crawling algorithms used by AutoCrawler.

– Two case studies, covering seven AJAX applications, used to evaluate the effectiveness, performance, correctness, and scalability of the proposed crawling approach.
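As a rough illustration of the state-machine inference mentioned above, the sketch below models each distinct DOM snapshot as a state and each DOM-changing click as a labelled transition. The class and method names are hypothetical and do not reflect CRAWLJAX's actual API.

```python
# Illustrative sketch of state-machine inference. The class and method
# names are hypothetical and do not reflect CRAWLJAX's actual API.
import hashlib

class StateMachine:
    def __init__(self):
        self.states = set()    # DOM hashes seen so far
        self.transitions = []  # (from_state, clicked_element, to_state)

    def record(self, before_html, clicked, after_html):
        """Register a click: add both DOM snapshots as states and, if the
        DOM changed, add a transition labelled with the clicked element."""
        before = hashlib.sha256(before_html.encode("utf-8")).hexdigest()
        after = hashlib.sha256(after_html.encode("utf-8")).hexdigest()
        self.states.update([before, after])
        if before != after:
            self.transitions.append((before, clicked, after))
            return True   # a new transition was discovered
        return False      # the click did not change the client-side state

# Usage: feed the machine the DOM before and after every simulated click.
sm = StateMachine()
sm.record("<html>home</html>", "a#news", "<html>news</html>")
print(len(sm.states), "states,", len(sm.transitions), "transitions")
```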

Although the study focused on AJAX-based web applications, the authors believe that the same approach can be applied to any DOM-based web application and related websites. Since the tool is open source and will soon be freely available for download, it should help surface a wide range of interesting case studies in the near future.

Strengthening the tool further by extending its set of functionalities and improving its accuracy, performance, and state-explosion algorithms are directions for the authors’ future work. This suggests that the yield from crawling AJAX-based web applications will increase greatly in the near future and may well reshape the current definition of the deep web.

Controlled experiments will be conducted in the near future to systematically analyze and optimize the backtracking algorithm and its implementation. Many AJAX applications now use hash fragments in their URLs; investigating how such hash fragments can be exploited while crawling dynamic web applications is another interesting direction. Exploring the hidden web induced by client-side JavaScript using CRAWLJAX and AutoCrawler, and continuing with automated web analysis and testing, are further application domains the authors of the paper will be working on in the near future.
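To illustrate the hash-fragment direction: many AJAX applications encode client-side state after the # in the URL, so a crawler can revisit such states directly by navigating to each fragment. The URL and fragment values in this sketch are hypothetical, and the sketch again assumes Selenium WebDriver rather than the authors' tooling.

```python
# Illustrative sketch: many AJAX applications encode client-side state in
# the URL's hash fragment. A crawler can revisit such states directly by
# navigating to each fragment. The URL and fragments below are hypothetical.
from selenium import webdriver

driver = webdriver.Chrome()
for fragment in ["#!/home", "#!/products", "#!/products/42"]:
    driver.get("https://example.com/" + fragment)
    # the application's client-side router reads location.hash and renders
    # the corresponding view without a full page reload
    print(driver.execute_script("return window.location.hash;"))
driver.quit()
```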
