Browsing tag: Crawler

Google Open Sources Its ‘Web Crawler’ After 20 Years

Google’s Robots Exclusion Protocol (REP), better known as robots.txt, is a standard many websites use to tell automated crawlers which parts of a site may be crawled and which may not. However, it has never been officially adopted as an internet standard, which has led to differing interpretations. In a bid to make REP an official web standard, Google has open-sourced […]
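As a quick illustration of what REP specifies: a robots.txt file is just a plain-text list of per-user-agent Allow/Disallow rules, and Python's standard library ships a parser for it. A minimal sketch follows; the site URL and user-agent name are placeholders, not anything from the article:

```python
# A robots.txt file is a plain-text set of rules, e.g.:
#   User-agent: *
#   Disallow: /private/
# Python's standard library can fetch and evaluate one directly.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder site
rp.read()  # download and parse the file

# can_fetch() answers: may this user-agent crawl this URL?
print(rp.can_fetch("MyCrawler", "https://www.example.com/private/page"))
```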

ACHE – A Web Crawler For Domain-Specific Search

ACHE is a focused web crawler. It collects web pages that satisfy specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern. ACHE differs from generic crawlers in that it uses page classifiers to distinguish between relevant and irrelevant pages in a given domain. A page classifier […]
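The page-classifier idea generalizes beyond ACHE: a focused crawler only follows links found on pages the classifier judges relevant. Below is a minimal sketch of that loop; it is not ACHE's code, and is_relevant, fetch, and extract_links are hypothetical stand-ins for a trained classifier and user-supplied fetch/parse helpers:

```python
# Minimal sketch of a focused-crawl loop (illustrative, not ACHE's code).
from collections import deque

def is_relevant(page_text):
    # Stand-in for a trained page classifier; here, a naive keyword check.
    return "user-specified pattern" in page_text.lower()

def focused_crawl(seeds, fetch, extract_links, max_pages=100):
    frontier = deque(seeds)
    seen = set(seeds)
    relevant = []
    while frontier and len(relevant) < max_pages:
        url = frontier.popleft()
        page = fetch(url)  # user-supplied: download the page, or None on error
        if page is None or not is_relevant(page):
            continue  # links on irrelevant pages are not followed
        relevant.append(url)
        for link in extract_links(page, url):  # user-supplied link extractor
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return relevant
```

In ACHE itself the classifier is typically trained rather than hard-coded, which is what lets the crawl stay inside a target domain.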

Dirhunt v0.6.0 – Find Web Directories Without Bruteforce

DEVELOPMENT BRANCH: The current branch is a development version. Go to the stable release by clicking on the master branch. Dirhunt is a web crawler optimized for searching and analyzing directories. This tool can find interesting things if the server has the “index of” mode enabled. Dirhunt is also useful if the directory listing is […]
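Dirhunt's central observation, that servers with directory listing enabled serve telltale “Index of” pages, is easy to sketch. The following standard-library check is illustrative only; the URL and marker string are assumptions, and Dirhunt itself performs far richer analysis:

```python
# Detect an open directory listing ("Index of" page) - an illustrative check only.
import urllib.request

def has_directory_listing(url):
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read(65536).decode("utf-8", errors="replace")
    except OSError:
        return False
    # Apache/nginx autoindex pages typically carry this title text.
    return "Index of /" in html

print(has_directory_listing("http://example.com/static/"))  # placeholder URL
```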

EKFiddle v0.8.2 – A Framework Based On The Fiddler Web Debugger To Study Exploit Kits, Malvertising And Malicious Traffic In General

A framework based on the Fiddler web debugger to study Exploit Kits, malvertising and malicious traffic in general.

Installation:
1. Download and install the latest version of Fiddler: https://www.telerik.com/fiddler (special instructions for Linux and Mac here: https://www.telerik.com/blogs/fiddler-for-linux-beta-is-here and https://www.telerik.com/blogs/introducing-fiddler-for-os-x-beta-1).
2. Enable C# scripting (Windows only): launch Fiddler and go to Tools -> Options; in the Scripting tab, change the default (JScript.NET) […]

Paskto – Passive Web Scanner

Paskto will passively scan the web using the Common Crawl internet index, either by downloading the indexes on request or by parsing data from your local system. URLs are then processed through Nikto and known-URL lists to identify interesting content. Hash signatures are also used to identify known default content for some IoT devices or […]
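The Common Crawl index Paskto draws on is publicly queryable over HTTP through the CDX index API. A rough standard-library sketch follows; the collection name is just an example (a new one is published with every crawl), and this is not Paskto's own code:

```python
# Query the Common Crawl CDX index for URLs under a domain (illustrative).
import json
import urllib.parse
import urllib.request

def cc_index_urls(domain, collection="CC-MAIN-2018-05"):  # collection name is an example
    query = urllib.parse.urlencode({"url": domain + "/*", "output": "json"})
    api = "https://index.commoncrawl.org/{}-index?{}".format(collection, query)
    with urllib.request.urlopen(api, timeout=30) as resp:
        for line in resp:  # the API returns one JSON record per line
            yield json.loads(line)["url"]

for url in cc_index_urls("example.com"):  # placeholder domain
    print(url)
```

Because everything comes from an existing index, a tool built this way never touches the target itself, which is what makes the scan passive.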

How to Build a Basic Web Crawler in Python

Short Bytes: A web crawler is a program that browses the Internet (the World Wide Web) in a predetermined, configurable, and automated manner and performs a given action on the crawled content. Search engines like Google and Yahoo use spidering to provide up-to-date data. Webhose.io, a company which provides direct access to live data from hundreds of […]
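Stripped to its essentials, such a crawler is a fetch queue plus a link extractor. Here is a minimal, standard-library sketch of a breadth-first crawler; the seed URL is a placeholder, and a real crawler should also honor robots.txt and rate-limit its requests:

```python
# A very basic breadth-first web crawler using only the standard library.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkParser(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier, seen, fetched = deque([seed]), {seed}, 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable pages
        fetched += 1
        print(url)  # the "given action": here, just print each crawled URL
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)

crawl("https://example.com/")  # placeholder seed URL
```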