Life Long Programmer's Community Log: Nutch2Crawling


Nutch2Crawling

https://wiki.apache.org/nutch/Nutch2Crawling

InjectorJob

currentJob.setOutputFormatClass(GoraOutputFormat.class);
DataStore<String, WebPage> store = StorageUtils.createWebStore(
    currentJob.getConfiguration(), String.class, WebPage.class);
GoraOutputFormat.setOutput(currentJob, store, true);

GeneratorJob

getConf().set(URLPartitioner.PARTITION_MODE_KEY, URLPartitioner.PARTITION_MODE_HOST);

org.apache.nutch.crawl.URLPartitioner.getPartition(String, int)

if (mode.equals(PARTITION_MODE_HOST)) {
  hashCode = url.getHost().hashCode();
} else if (mode.equals(PARTITION_MODE_DOMAIN)) {
  hashCode = URLUtil.getDomainName(url).hashCode();
} else { // MODE IP
  InetAddress address = InetAddress.getByName(url.getHost());
  hashCode = address.getHostAddress().hashCode();
}

// make hosts wind up in different partitions on different runs
hashCode ^= seed;

return (hashCode & Integer.MAX_VALUE) % numReduceTasks;

org.apache.nutch.crawl.TestURLPartitioner
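The host-mode branch of the partitioning logic above can be tried outside Nutch. The following is a minimal sketch, not the Nutch class itself: `HostPartitionSketch` and `partitionByHost` are hypothetical names, and the host is extracted with `java.net.URI` purely for illustration.

```java
import java.net.URI;

// Hypothetical, self-contained sketch; this is NOT org.apache.nutch.crawl.URLPartitioner.
class HostPartitionSketch {

    // Map a URL to a partition in [0, numReduceTasks) using only its host name.
    static int partitionByHost(String urlString, int numReduceTasks, int seed) {
        String host = URI.create(urlString).getHost();
        int hashCode = host.hashCode();
        // XOR with a per-run seed so hosts can land in different partitions on different runs.
        hashCode ^= seed;
        // Clear the sign bit so the modulo result is never negative.
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int a = partitionByHost("http://example.com/page1", 4, 42);
        int b = partitionByHost("http://example.com/page2", 4, 42);
        // Same host and same seed, so both URLs share a partition.
        System.out.println(a == b); // prints "true"
    }
}
```

Because only the host contributes to the hash, every URL from one host goes to the same reducer within a run, which is presumably what lets per-host handling happen in one place.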

FetcherJob

Fetches all URLs that the generator has marked with a given batch ID (or, optionally, fetches all URLs).
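The batch-ID selection can be pictured as a simple filter. This is a rough sketch under stated assumptions, not Nutch code: `BatchFilterSketch`, `selectForFetch`, and the `ALL_BATCHES` sentinel are hypothetical, the store is modeled as a plain map from URL to generator mark, and "all URLs" is assumed here to mean every URL that carries some generator mark.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these names are illustrative and are not Nutch classes.
class BatchFilterSketch {

    // Assumed sentinel meaning "ignore the batch ID and take every marked URL".
    static final String ALL_BATCHES = "-all";

    // Return the URLs whose generator mark matches batchId,
    // or every marked URL when batchId is ALL_BATCHES.
    static List<String> selectForFetch(Map<String, String> generateMarks, String batchId) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> entry : generateMarks.entrySet()) {
            String mark = entry.getValue();
            if (mark == null) {
                continue; // never marked by the generator: skip
            }
            if (ALL_BATCHES.equals(batchId) || mark.equals(batchId)) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> marks = new LinkedHashMap<>();
        marks.put("http://a.example/", "batch-1");
        marks.put("http://b.example/", "batch-2");
        marks.put("http://c.example/", null); // not generated yet
        System.out.println(selectForFetch(marks, "batch-1"));
        System.out.println(selectForFetch(marks, ALL_BATCHES));
    }
}
```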

How does FetcherReducer work?

Parse

Parses all web pages from a given batch ID.

parseUtil.process(key, page);
