[Yanel-dev] crawler

Michael Wechner michael.wechner at wyona.com
Fri Mar 2 14:34:20 CET 2007


Josias Thöny wrote:

> Michael Wechner wrote:
>
>> Josias Thöny wrote:
>>
>>> Hi,
>>>
>>> I've had a look at the crawler of lenya 1.2, and it seems that a few 
>>> features are missing:
>>>
>>> basic missing features:
>>> - download of images
>>> - download of css
>>> - download of scripts
>>> - link rewriting
>>> - limits for max level / max documents
>>>
>>> advanced missing features:
>>> - handling of frames / iframes
>>> - tidy html -> xhtml
>>> - extraction of body content
>>> - resolving of links in css (background images etc.)
>>>
>>> Or am I misunderstanding something...?
>>
>>
>>
>> no ;-)
>>
>>>
>>> IMHO some of these features are quite essential, because we want to 
>>> use the crawler in yanel to import the complete pages with images 
>>> and everything, not only text content.
>>>
>>> The question is now, does it make sense to implement the missing 
>>> features into that crawler, or should we look for an alternative?
>>
>>
>>
>> sure, if there is an alternative :-) Is there?
>
>
> The lenya crawler uses websphinx for the robot exclusion, which is 
> actually a complete crawler framework, and I think we could use it 
> instead of the lenya crawler. It supports the basic features that I 
> mentioned above.
> I wrote a class DumpingCrawler which is based on the websphinx 
> crawler. Basically it should be able to create a complete dump of a 
> website including images, css, etc. It also rewrites links in the html 
> code.
>
> The source code is at:
> https://svn.wyona.com/repos/public/crawler
>
> I also added the websphinx source code to our svn because I had to 
> patch a few things.


I think it's important that we also add the patches separately in order 
to know what has been patched.

> The license is apache-like, so it should be ok.
>
> The usage is shown in the following example:
>
> --------------------------------------------------
> String crawlStartURL = "http://wyona.org";
> String crawlScopeURL = "http://wyona.org";
> String dumpDir = "/tmp/dump";
>
> DumpingCrawler crawler = new DumpingCrawler(crawlStartURL, 
> crawlScopeURL, dumpDir);
>
> EventLog eventLog = new EventLog(System.out);
> crawler.addCrawlListener(eventLog);
> crawler.addLinkListener(eventLog);
>
> crawler.run();
> crawler.close();
> --------------------------------------------------
>
> Remarks:
> - the EventLog is optional (it creates some log output)


What is the EventLog good for?

> - the crawlScopeURL limits the scope of the retrieved pages, i.e. only 
> URLs starting with the scope URL are downloaded.
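[Editor's note: the scope rule described above amounts to a simple URL prefix
match. The following is a minimal, hypothetical sketch of that check; the
helper name inScope is an assumption for illustration and is not part of the
DumpingCrawler or websphinx API.]

```java
// Hypothetical sketch of the crawl-scope rule described above: a URL is
// in scope only if it starts with the configured crawl scope URL.
// inScope is an illustrative helper, not part of DumpingCrawler's API.
public class ScopeCheck {
    static boolean inScope(String url, String scopeURL) {
        return url.startsWith(scopeURL);
    }

    public static void main(String[] args) {
        // a page on the scoped host starts with the scope URL, so it is crawled
        System.out.println(inScope("http://wyona.org/about.html", "http://wyona.org"));
        // an external host falls outside the scope and is skipped
        System.out.println(inScope("http://example.com/page", "http://wyona.org"));
    }
}
```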
>
> For more information, see
> http://www.cs.cmu.edu/~rcm/websphinx/doc/websphinx/Crawler.html


Sounds very good :-) Have you already uploaded the library to our Maven 
repo?

Thanks

Michi

>
> Josias
>
>
>>
>> Thanks
>>
>> Michi
>>
>>>
>>> Josias
>>>
>>> _______________________________________________
>>> Yanel-development mailing list
>>> Yanel-development at wyona.com
>>> http://wyona.com/cgi-bin/mailman/listinfo/yanel-development
>>>
>>
>>
>
>
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner at wyona.com                        michi at apache.org
+41 44 272 91 61



