org.apache.cocoon.components.crawler
Class SimpleCocoonCrawlerImpl

java.lang.Object
  extended by org.apache.avalon.framework.logger.AbstractLogEnabled
      extended by org.apache.cocoon.components.crawler.SimpleCocoonCrawlerImpl
All Implemented Interfaces:
Poolable, Recyclable, Disposable, Component, Configurable, LogEnabled, CocoonCrawler

public class SimpleCocoonCrawlerImpl
extends AbstractLogEnabled
implements CocoonCrawler, Configurable, Disposable, Recyclable

A simple cocoon crawler.

Version:
CVS $Id: SimpleCocoonCrawlerImpl.html 1304258 2012-03-23 10:09:27Z ilgrosso $
Author:
Bernhard Huber

Nested Class Summary
static class SimpleCocoonCrawlerImpl.CocoonCrawlerIterator
          Helper class implementing an Iterator. This Iterator implementation calculates the links of a URL before returning it from the next() method.
 
Field Summary
static String ACCEPT_CONFIG
          Config element name specifying the HTTP Accept header value.
static String ACCEPT_DEFAULT
          Default value of accept configuration option.
protected  int depth
           
static String EXCLUDE_CONFIG
          Config element name specifying excluding regular expression pattern.
static String INCLUDE_CONFIG
          Config element name specifying including regular expression pattern.
static String LINK_CONTENT_TYPE_CONFIG
          Config element name specifying the expected link content-type.
 String LINK_CONTENT_TYPE_DEFAULT
          Default value of link-content-type configuration value.
static String LINK_VIEW_QUERY_CONFIG
          Config element name specifying the query string appended when requesting the links of a URL.
static String LINK_VIEW_QUERY_DEFAULT
          Default value of link-view-query configuration option.
protected  HashSet urlsNextDepth
           
protected  HashSet urlsToProcess
           
static String USER_AGENT_CONFIG
          Config element name specifying the HTTP User-Agent header value.
static String USER_AGENT_DEFAULT
          Default value of user-agent configuration option.
 
Fields inherited from interface org.apache.cocoon.components.crawler.CocoonCrawler
ROLE
 
Constructor Summary
SimpleCocoonCrawlerImpl()
          Constructor for the SimpleCocoonCrawlerImpl object
 
Method Summary
 void configure(Configuration configuration)
          Configure the crawler component.
 void crawl(URL url)
          Same as calling crawl(url, -1).
 void crawl(URL url, int maxDepth)
          Start crawling a URL.
 void dispose()
          Dispose at end of life cycle, releasing all resources.
 Iterator iterator()
          Returns an iterator over all links of the currently crawled URL.
 void recycle()
          Recycle this object, releasing resources.
 
Methods inherited from class org.apache.avalon.framework.logger.AbstractLogEnabled
enableLogging, getLogger, setupLogger, setupLogger, setupLogger
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LINK_CONTENT_TYPE_CONFIG

public static final String LINK_CONTENT_TYPE_CONFIG
Config element name specifying the expected link content-type.

Its value is link-content-type.

See Also:
Constant Field Values

LINK_CONTENT_TYPE_DEFAULT

public final String LINK_CONTENT_TYPE_DEFAULT
Default value of link-content-type configuration value.

Its value is application/x-cocoon-links.

See Also:
Constant Field Values

LINK_VIEW_QUERY_CONFIG

public static final String LINK_VIEW_QUERY_CONFIG
Config element name specifying the query string appended when requesting the links of a URL.

Its value is link-view-query.

See Also:
Constant Field Values

LINK_VIEW_QUERY_DEFAULT

public static final String LINK_VIEW_QUERY_DEFAULT
Default value of link-view-query configuration option.

Its value is ?cocoon-view=links.

See Also:
Constant Field Values

EXCLUDE_CONFIG

public static final String EXCLUDE_CONFIG
Config element name specifying excluding regular expression pattern.

Its value is exclude.

See Also:
Constant Field Values

INCLUDE_CONFIG

public static final String INCLUDE_CONFIG
Config element name specifying including regular expression pattern.

Its value is include.

See Also:
Constant Field Values

USER_AGENT_CONFIG

public static final String USER_AGENT_CONFIG
Config element name specifying the HTTP User-Agent header value.

Its value is user-agent.

See Also:
Constant Field Values

USER_AGENT_DEFAULT

public static final String USER_AGENT_DEFAULT
Default value of user-agent configuration option.

See Also:
Constants.COMPLETE_NAME

ACCEPT_CONFIG

public static final String ACCEPT_CONFIG
Config element name specifying the HTTP Accept header value.

Its value is accept.

See Also:
Constant Field Values

ACCEPT_DEFAULT

public static final String ACCEPT_DEFAULT
Default value of accept configuration option.

Its value is */*.

See Also:
Constant Field Values
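
The link-view-query, user-agent and accept options above all shape the request that asks a page for its links. The following is a minimal sketch of how such a request could be prepared with the plain JDK; it is not taken from the class itself. The class name LinkViewRequestSketch, the method prepare and the user-agent string are hypothetical, and the Accept value is assumed to be */*.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of Cocoon: shows how the documented
// defaults could be combined into a single link-view request.
class LinkViewRequestSketch {
    // Value taken from the LINK_VIEW_QUERY_DEFAULT description.
    static final String LINK_VIEW_QUERY = "?cocoon-view=links";
    // Assumed value of ACCEPT_DEFAULT.
    static final String ACCEPT = "*/*";

    /**
     * Prepares, but does not send, a request for the links view of pageUrl.
     * Plain concatenation assumes pageUrl has no query string of its own.
     */
    static URLConnection prepare(String pageUrl, String userAgent) {
        try {
            URLConnection connection =
                    URI.create(pageUrl + LINK_VIEW_QUERY).toURL().openConnection();
            connection.setRequestProperty("User-Agent", userAgent);
            connection.setRequestProperty("Accept", ACCEPT);
            return connection;
        } catch (IOException e) {
            // MalformedURLException is an IOException, so it ends up here too.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection c = prepare("http://foo/bar", "ExampleCrawler/1.0");
        System.out.println(c.getURL());
        System.out.println(c.getRequestProperty("Accept"));
    }
}
```

No network traffic happens in this sketch: openConnection() only creates the connection object, and the request would be sent once the response is read.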

depth

protected int depth

urlsToProcess

protected HashSet urlsToProcess

urlsNextDepth

protected HashSet urlsNextDepth

Constructor Detail

SimpleCocoonCrawlerImpl

public SimpleCocoonCrawlerImpl()
Constructor for the SimpleCocoonCrawlerImpl object

Method Detail

configure

public void configure(Configuration configuration)
               throws ConfigurationException
Configure the crawler component.

The configuration specifies which URIs to include in crawling and which to exclude. Both sets of patterns are given as regular expressions.

Moreover, you can configure the required content-type of the crawling request, and the query string appended to each crawling request.


 <include>.*\.html?</include> or <include>.*\.html?, .*\.xsp</include>
 <exclude>.*\.gif</exclude> or <exclude>.*\.gif, .*\.jpe?g</exclude>
 <link-content-type> application/x-cocoon-links </link-content-type>
 <link-view-query> ?cocoon-view=links </link-view-query>
 

Specified by:
configure in interface Configurable
Parameters:
configuration - XML configuration of this avalon component.
Throws:
ConfigurationException - thrown if the configuration is invalid.
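
The include and exclude elements above take one or more comma-separated regular expressions. A minimal sketch of how such patterns might be applied to candidate URLs follows. It uses java.util.regex, which is not necessarily the regex engine this class uses. The class name UrlPatternFilter, whole-string matching, letting exclude win over include, and accepting everything when no include pattern is given are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical filter, not part of Cocoon.
class UrlPatternFilter {
    private final List<Pattern> includes;
    private final List<Pattern> excludes;

    UrlPatternFilter(String includeCsv, String excludeCsv) {
        this.includes = compile(includeCsv);
        this.excludes = compile(excludeCsv);
    }

    // Splits on commas and trims whitespace. A pattern that itself
    // contains a comma, such as a{1,2}, would be split incorrectly.
    static List<Pattern> compile(String csv) {
        List<Pattern> patterns = new ArrayList<>();
        if (csv == null) {
            return patterns;
        }
        for (String token : csv.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    // Assumption: exclude patterns take precedence, and an empty
    // include list accepts every URL that is not excluded.
    boolean accepts(String url) {
        if (matchesAny(excludes, url)) {
            return false;
        }
        return includes.isEmpty() || matchesAny(includes, url);
    }

    private static boolean matchesAny(List<Pattern> patterns, String url) {
        for (Pattern p : patterns) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlPatternFilter filter =
                new UrlPatternFilter(".*\\.html?, .*\\.xsp", ".*\\.gif, .*\\.jpe?g");
        System.out.println(filter.accepts("http://foo/index.html"));
        System.out.println(filter.accepts("http://foo/logo.gif"));
    }
}
```

With the patterns from the configuration example, this sketch accepts URLs ending in .html, .htm or .xsp and rejects images as well as anything that matches no include pattern.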

dispose

public void dispose()
Dispose at end of life cycle, releasing all resources.

Specified by:
dispose in interface Disposable

recycle

public void recycle()
Recycle this object, releasing resources.

Specified by:
recycle in interface Recyclable

crawl

public void crawl(URL url)
Same as calling crawl(url, -1).

Specified by:
crawl in interface CocoonCrawler
Parameters:
url - Crawl this URL, getting all links from this URL.

crawl

public void crawl(URL url,
                  int maxDepth)
Start crawling a URL.

Use this method to start crawling. Retrieve this URL and all its children by using iterator(). The Iterator object will return URL objects.

You may use the crawl(), and iterator() methods the following way:


   SimpleCocoonCrawlerImpl scci = ....;
   scci.crawl( new URL("http://foo/bar") );
   Iterator i = scci.iterator();
   while (i.hasNext()) {
     URL url = (URL)i.next();
     ...
   }
 

The i.next() method returns a URL, and calculates the links of that URL before returning it.

Specified by:
crawl in interface CocoonCrawler
Parameters:
url - Crawl this URL, getting all links from this URL.
maxDepth - maximum depth to crawl to. -1 for no maximum.
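
The interplay of maxDepth, the lazy link calculation in next(), and the urlsToProcess and urlsNextDepth sets can be illustrated with a small stand-alone sketch. This is not the class's actual implementation: an in-memory map stands in for the link-view requests, URLs are plain strings, and the class name DepthLimitedCrawlSketch, the seen set and the helper crawlAll are hypothetical. It also assumes the start URL sits at depth 0, so maxDepth 1 yields the start URL plus its direct links.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical sketch, not part of Cocoon.
class DepthLimitedCrawlSketch implements Iterator<String> {
    private final Map<String, List<String>> linkGraph; // stand-in for HTTP link requests
    private final int maxDepth;                        // -1 means no maximum
    private int depth = 0;                             // assumption: start URL is depth 0
    private final Set<String> seen = new HashSet<>();  // URLs already queued or returned
    private Set<String> urlsToProcess = new HashSet<>();
    private Set<String> urlsNextDepth = new HashSet<>();

    DepthLimitedCrawlSketch(Map<String, List<String>> linkGraph, String start, int maxDepth) {
        this.linkGraph = linkGraph;
        this.maxDepth = maxDepth;
        seen.add(start);
        urlsToProcess.add(start);
    }

    @Override
    public boolean hasNext() {
        return !urlsToProcess.isEmpty() || !urlsNextDepth.isEmpty();
    }

    @Override
    public String next() {
        if (urlsToProcess.isEmpty()) {
            if (urlsNextDepth.isEmpty()) {
                throw new NoSuchElementException();
            }
            // Current level exhausted: descend one level.
            urlsToProcess = urlsNextDepth;
            urlsNextDepth = new HashSet<>();
            depth++;
        }
        Iterator<String> it = urlsToProcess.iterator();
        String url = it.next();
        it.remove();
        // Calculate the links of this URL before returning it.
        if (maxDepth == -1 || depth < maxDepth) {
            List<String> links = linkGraph.getOrDefault(url, Collections.<String>emptyList());
            for (String link : links) {
                if (seen.add(link)) {
                    urlsNextDepth.add(link);
                }
            }
        }
        return url;
    }

    // Convenience helper: drains the iterator into a list.
    static List<String> crawlAll(Map<String, List<String>> graph, String start, int maxDepth) {
        List<String> result = new ArrayList<>();
        Iterator<String> it = new DepthLimitedCrawlSketch(graph, start, maxDepth);
        while (it.hasNext()) {
            result.add(it.next());
        }
        return result;
    }
}
```

Because the sets are unordered, URLs within one depth level come back in no particular order; only the level-by-level progression is fixed in this sketch.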

iterator

public Iterator iterator()
Returns an iterator over all links of the currently crawled URL.

The Iterator object returns URL objects from its next() method.

Specified by:
iterator in interface CocoonCrawler
Returns:
An Iterator over all links from the crawled URL.

Copyright © 1999-2010 The Apache Software Foundation. All Rights Reserved.