java.lang.Object
  org.apache.avalon.framework.logger.AbstractLogEnabled
    org.apache.cocoon.components.crawler.SimpleCocoonCrawlerImpl
public class SimpleCocoonCrawlerImpl
A simple cocoon crawler.
Nested Class Summary | |
---|---|
static class |
SimpleCocoonCrawlerImpl.CocoonCrawlerIterator
Helper class implementing an Iterator. This Iterator implementation calculates the links of a URL before returning it from the next() method. |
Field Summary | |
---|---|
static String |
ACCEPT_CONFIG
Config element name specifying http header value for accept. |
static String |
ACCEPT_DEFAULT
Default value of accept configuration option. |
protected int |
depth
|
static String |
EXCLUDE_CONFIG
Config element name specifying excluding regular expression pattern. |
static String |
INCLUDE_CONFIG
Config element name specifying including regular expression pattern. |
static String |
LINK_CONTENT_TYPE_CONFIG
Config element name specifying the expected link content-type. |
String |
LINK_CONTENT_TYPE_DEFAULT
Default value of link-content-type configuration value. |
static String |
LINK_VIEW_QUERY_CONFIG
Config element name specifying the query string appended when requesting the links of a URL. |
static String |
LINK_VIEW_QUERY_DEFAULT
Default value of link-view-query configuration option. |
protected HashSet |
urlsNextDepth
|
protected HashSet |
urlsToProcess
|
static String |
USER_AGENT_CONFIG
Config element name specifying the HTTP header value for user-agent. |
static String |
USER_AGENT_DEFAULT
Default value of user-agent configuration option. |
Fields inherited from interface org.apache.cocoon.components.crawler.CocoonCrawler |
---|
ROLE |
Constructor Summary | |
---|---|
SimpleCocoonCrawlerImpl()
Constructor for the SimpleCocoonCrawlerImpl object |
Method Summary | |
---|---|
void |
configure(Configuration configuration)
Configure the crawler component. |
void |
crawl(URL url)
The same as calling crawl(url, -1). |
void |
crawl(URL url,
int maxDepth)
Start crawling a URL. |
void |
dispose()
Dispose at the end of the life cycle, releasing all resources. |
Iterator |
iterator()
Returns an iterator over all links of the currently crawled URL. |
void |
recycle()
Recycle this object, releasing resources. |
Methods inherited from class org.apache.avalon.framework.logger.AbstractLogEnabled |
---|
enableLogging, getLogger, setupLogger, setupLogger, setupLogger |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
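public static final String LINK_CONTENT_TYPE_CONFIG
Config element name specifying the expected link content-type. Its value is link-content-type.
public final String LINK_CONTENT_TYPE_DEFAULT
Default value of the link-content-type configuration value. Its value is application/x-cocoon-links.
public static final String LINK_VIEW_QUERY_CONFIG
Config element name specifying the query string appended when requesting the links of a URL. Its value is link-view-query.
public static final String LINK_VIEW_QUERY_DEFAULT
Default value of the link-view-query configuration option. Its value is ?cocoon-view=links.
public static final String EXCLUDE_CONFIG
Config element name specifying the excluding regular expression pattern. Its value is exclude.
public static final String INCLUDE_CONFIG
Config element name specifying the including regular expression pattern. Its value is include.
public static final String USER_AGENT_CONFIG
Config element name specifying the HTTP header value for user-agent. Its value is user-agent.
public static final String USER_AGENT_DEFAULT
Default value of the user-agent configuration option.
See also: Constants.COMPLETE_NAME
public static final String ACCEPT_CONFIG
Config element name specifying the HTTP header value for accept. Its value is accept.
public static final String ACCEPT_DEFAULT
Default value of the accept configuration option. Its value is */*.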
protected int depth
protected HashSet urlsToProcess
protected HashSet urlsNextDepth
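To make the configuration fields above more concrete, the following sketch shows how a crawler might apply the documented defaults when asking for the links of a page: the link-view-query string is appended to the URL, and the user-agent and accept values are sent as HTTP request headers. This is not the actual SimpleCocoonCrawlerImpl code. The class LinkRequestSketch, its method names, and the placeholder user-agent string are hypothetical (the real default is Constants.COMPLETE_NAME), the accept value is read as the wildcard media type, and the page URL is assumed to carry no query string of its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, not part of Cocoon: illustrates how the documented
// configuration values could be applied to a links request.
class LinkRequestSketch {
    // Documented default of the link-view-query option.
    static final String LINK_VIEW_QUERY = "?cocoon-view=links";
    // Assumption: the documented accept default denotes the wildcard type.
    static final String ACCEPT = "*/*";
    // Placeholder only: the real default is Constants.COMPLETE_NAME.
    static final String USER_AGENT = "example-crawler";

    // Parses a URL string, turning the checked exception into an unchecked one.
    static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Appends the link-view query to the page URL.
    // Assumption: the page URL has no query string of its own.
    static URL linksUrl(URL page) {
        return url(page.toExternalForm() + LINK_VIEW_QUERY);
    }

    // Prepares a connection object carrying the configured headers.
    // No network traffic happens until the caller connects or reads.
    static URLConnection prepare(URL page) {
        try {
            URLConnection conn = linksUrl(page).openConnection();
            conn.setRequestProperty("User-Agent", USER_AGENT);
            conn.setRequestProperty("Accept", ACCEPT);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection conn = prepare(url("http://localhost:8080/cocoon/index.html"));
        System.out.println(conn.getURL());
        System.out.println(conn.getRequestProperty("User-Agent"));
        System.out.println(conn.getRequestProperty("Accept"));
    }
}
```

A response to such a request would then be expected to have the configured link content-type (application/x-cocoon-links by default); how the real component checks this is not shown here.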
Constructor Detail |
---|
public SimpleCocoonCrawlerImpl()
Method Detail |
---|
public void configure(Configuration configuration) throws ConfigurationException
The configuration specifies which URIs to include in crawling and which to exclude from it. Patterns are given as regular expressions.
Moreover, you can configure the required content-type of crawling requests and the query string appended to each crawling request.
    <include>.*\.html?</include> or <include>.*\.html?, .*\.xsp</include>
    <exclude>.*\.gif</exclude> or <exclude>.*\.gif, .*\.jpe?g</exclude>
    <link-content-type> application/x-cocoon-links </link-content-type>
    <link-view-query> ?cocoon-view=links </link-view-query>
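The include and exclude options can be pictured as two lists of regular expressions tested against each candidate URL. The sketch below is one plausible reading, not the component's actual code. It assumes comma-separated patterns, that a URL must fully match at least one include pattern when any are configured, and that a matching exclude pattern always wins. The class UrlFilterSketch and its methods are hypothetical, and java.util.regex stands in for whatever regular expression library the component really uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical filter, not part of Cocoon.
class UrlFilterSketch {
    private final List<Pattern> includes;
    private final List<Pattern> excludes;

    // Each argument is the text content of an <include> or <exclude>
    // element; null means the element was not configured.
    UrlFilterSketch(String includeConfig, String excludeConfig) {
        this.includes = compile(includeConfig);
        this.excludes = compile(excludeConfig);
    }

    // Splits "a, b" into compiled patterns; null or blank yields none.
    private static List<Pattern> compile(String config) {
        List<Pattern> patterns = new ArrayList<>();
        if (config == null) {
            return patterns;
        }
        for (String part : config.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    // True if the whole URL string matches at least one pattern.
    private static boolean matchesAny(List<Pattern> patterns, String url) {
        for (Pattern p : patterns) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    // Assumption: exclude wins over include; no include patterns means
    // everything not excluded is accepted.
    boolean accepts(String url) {
        if (matchesAny(excludes, url)) {
            return false;
        }
        return includes.isEmpty() || matchesAny(includes, url);
    }

    public static void main(String[] args) {
        UrlFilterSketch filter =
                new UrlFilterSketch(".*\\.html?, .*\\.xsp", ".*\\.gif, .*\\.jpe?g");
        System.out.println(filter.accepts("http://foo/bar/index.html"));
        System.out.println(filter.accepts("http://foo/bar/logo.gif"));
    }
}
```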
Specified by:
configure in interface Configurable
Parameters:
configuration - XML configuration of this Avalon component.
Throws:
ConfigurationException - if the configuration is invalid.
public void dispose()
Specified by:
dispose in interface Disposable
public void recycle()
Specified by:
recycle in interface Recyclable
public void crawl(URL url)
Specified by:
crawl in interface CocoonCrawler
Parameters:
url - Crawl this URL, getting all links from this URL.
public void crawl(URL url, int maxDepth)
Use this method to start crawling. Get the URL and all its children by using iterator(). The Iterator object will return URL objects.
You may use the crawl() and iterator() methods in the following way:
    SimpleCocoonCrawlerImpl scci = ....;
    scci.crawl(new URL("http://foo/bar"));
    Iterator i = scci.iterator();
    while (i.hasNext()) {
        URL url = (URL) i.next();
        ...
    }
The i.next() method returns a URL, and calculates the links of that URL before returning it.
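The lazy behaviour of next() and the role of maxDepth can be illustrated with a small stand-in that needs no network. This sketch is not the Cocoon implementation: CrawlSketch and its link-lookup function are hypothetical, URLs are plain strings, and the two sets only loosely mirror the urlsToProcess and urlsNextDepth fields (LinkedHashSet is used instead of HashSet to keep the order predictable). It assumes that the start URL sits at depth 0, that links found at depth d belong to depth d + 1, and that -1 means no limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-in for the crawler's iterator, not part of Cocoon.
class CrawlSketch implements Iterator<String> {
    private final Function<String, List<String>> linksOf; // replaces the links request
    private final int maxDepth;                           // -1 means no maximum
    private int depth = 0;
    private Set<String> urlsToProcess = new LinkedHashSet<>();
    private Set<String> urlsNextDepth = new LinkedHashSet<>();
    private final Set<String> seen = new HashSet<>();     // avoids revisiting URLs

    CrawlSketch(String start, int maxDepth, Function<String, List<String>> linksOf) {
        this.linksOf = linksOf;
        this.maxDepth = maxDepth;
        urlsToProcess.add(start);
        seen.add(start);
    }

    @Override
    public boolean hasNext() {
        return !urlsToProcess.isEmpty() || !urlsNextDepth.isEmpty();
    }

    @Override
    public String next() {
        if (urlsToProcess.isEmpty()) {
            if (urlsNextDepth.isEmpty()) {
                throw new NoSuchElementException();
            }
            // Current level is exhausted: descend one level.
            urlsToProcess = urlsNextDepth;
            urlsNextDepth = new LinkedHashSet<>();
            depth++;
        }
        Iterator<String> it = urlsToProcess.iterator();
        String url = it.next();
        it.remove();
        // Links are calculated before the URL is returned, unless the
        // depth limit has been reached.
        if (maxDepth < 0 || depth < maxDepth) {
            for (String link : linksOf.apply(url)) {
                if (seen.add(link)) {
                    urlsNextDepth.add(link);
                }
            }
        }
        return url;
    }

    // Convenience: crawl an in-memory link map and collect the visit order.
    static List<String> collect(String start, int maxDepth, Map<String, List<String>> site) {
        CrawlSketch crawl = new CrawlSketch(start, maxDepth,
                url -> site.getOrDefault(url, Collections.emptyList()));
        List<String> visited = new ArrayList<>();
        while (crawl.hasNext()) {
            visited.add(crawl.next());
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new HashMap<>();
        site.put("a", Arrays.asList("b", "c"));
        site.put("b", Arrays.asList("d", "a"));
        site.put("c", Arrays.asList("d"));
        site.put("d", Arrays.asList("e"));
        System.out.println(collect("a", -1, site)); // unlimited depth
        System.out.println(collect("a", 1, site));  // start page and its direct links
    }
}
```

Because links are gathered inside next(), a caller that stops iterating early never triggers the requests for pages it did not reach.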
Specified by:
crawl in interface CocoonCrawler
Parameters:
url - Crawl this URL, getting all links from this URL.
maxDepth - maximum depth to crawl to; -1 for no maximum.
public Iterator iterator()
The Iterator object will return URL objects from its next() method.
Specified by:
iterator in interface CocoonCrawler