Wombat Technology
The wombat (our crawler), which crawls the internet looking for Flash media, was written from scratch with this project in mind. It is built from a combination of Perl and MySQL and currently runs on a single Linux machine on a home internet connection.
We plan to parallelize the crawler in the near future to speed up the whole crawling process. Crawling the internet has to be done 'nicely', so we've written code to crawl at an acceptable rate, so as not to anger site admins who have limited bandwidth or limited server resources.
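As an illustration of what crawling 'nicely' can look like, here is a minimal sketch of a per-host rate limit. It is written in Python for readability and is not the wombat's actual Perl code. The class name `PoliteScheduler`, the 5-second default delay and the injectable `clock` parameter are all assumptions made for this example.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Hypothetical helper: enforces a minimum gap between fetches per host."""

    def __init__(self, min_delay_seconds=5.0, clock=time.monotonic):
        # min_delay_seconds is an assumed value, not the wombat's real setting.
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.last_fetch = {}  # host -> time of the most recent fetch

    @staticmethod
    def _host(url):
        # Host names are case-insensitive, so normalise before comparing.
        return urlparse(url).netloc.lower()

    def seconds_until_allowed(self, url):
        """Return how long to wait before this URL's host may be hit again."""
        last = self.last_fetch.get(self._host(url))
        if last is None:
            return 0.0  # host never fetched, no wait needed
        elapsed = self.clock() - last
        return max(0.0, self.min_delay - elapsed)

    def record_fetch(self, url):
        """Remember that this URL's host was just fetched."""
        self.last_fetch[self._host(url)] = self.clock()
```

A crawl loop would call `seconds_until_allowed(url)` before each request, sleep for that long (or pick a URL on a different host instead), and call `record_fetch(url)` once the request has been made. Keeping the delay per host means a slow site is never hammered, while the crawler stays busy on other sites.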
The crawler gives priority to pages at a small 'depth' within a site, so when a site is submitted, the crawler picks up the URL and crawls the first page, then all of the child links on that page, fairly quickly. Once that site has been crawled to a depth of 2 or 3, its remaining URLs share the same priority as the rest of the URLs in the database.
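The depth-based priority described above can be sketched with a priority queue. This is a Python illustration under stated assumptions, not the wombat's Perl and MySQL implementation: the `Frontier` class, the `PRIORITY_DEPTH_LIMIT` cutoff of 3 and the in-memory heap are all hypothetical stand-ins for whatever the real database does.

```python
import heapq
import itertools

# Assumed cutoff: the text only says "a depth of 2 or 3".
PRIORITY_DEPTH_LIMIT = 3


class Frontier:
    """Hypothetical URL queue that favours shallow pages of a site."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: first added, first out
        self._seen = set()

    def add(self, url, depth):
        """Queue a URL found at the given link depth (0 = submitted page)."""
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        # Shallow pages rank by depth; everything at or past the cutoff
        # shares one rank, so it competes equally with all other URLs.
        rank = min(depth, PRIORITY_DEPTH_LIMIT)
        heapq.heappush(self._heap, (rank, next(self._counter), url, depth))

    def pop(self):
        """Return the (url, depth) pair that should be crawled next."""
        _rank, _order, url, depth = heapq.heappop(self._heap)
        return url, depth

    def __len__(self):
        return len(self._heap)
```

For example, if the queue already holds deep pages from older sites and a new site is submitted at depth 0, the new site's front page and its depth-1 child links are popped before any of the older deep pages. Past the cutoff, URLs come out in the order they were added, regardless of which site they belong to.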
Current
- Only grabs and indexes images, text, audio and ActionScript.
- Our crawler's User-Agent string:
FlashRip 1.0 Bot (http://mediawombat.com)
- Site responds to:
  - mediawombat.com
  - mediawombat.net
  - mediawombat.org
