Nutch2Crawling
Reference: https://wiki.apache.org/nutch/Nutch2Crawling

InjectorJob

The injector takes a list of seed URLs and writes them into the web table as WebPage rows. In InjectorJob.run(), the job's output is wired directly to the Gora-backed store:
currentJob.setOutputFormatClass(GoraOutputFormat.class);
// create the DataStore<String, WebPage> backing the web table
DataStore<String, WebPage> store = StorageUtils.createWebStore(currentJob.getConfiguration(),
    String.class, WebPage.class);
GoraOutputFormat.setOutput(currentJob, store, true);
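Rows in the web table are keyed by the reversed URL (e.g. com.example:http/). As a minimal standalone sketch of what ends up in that store, the following writes a single WebPage row through the same DataStore API; it assumes a Nutch 2.2+ build where WebPage is an Avro builder class (older 2.x builds use new WebPage() instead), and it is only an illustration, not code from InjectorJob.

import org.apache.gora.store.DataStore;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.StorageUtils;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.TableUtil;

public class WebStoreSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    DataStore<String, WebPage> store =
        StorageUtils.createWebStore(conf, String.class, WebPage.class);

    // rows are keyed by the reversed URL, e.g. "com.example:http/"
    String key = TableUtil.reverseUrl("http://example.com/");
    WebPage page = WebPage.newBuilder().build();   // pre-2.2 builds: new WebPage()

    store.put(key, page);
    store.flush();
    store.close();
  }
}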
GeneratorJob

The generator selects URLs that are due for fetching, marks them with a batch ID, and partitions them across reducers. The partition mode (by host, by domain, or by IP) is configured through URLPartitioner:

getConf().set(URLPartitioner.PARTITION_MODE_KEY, URLPartitioner.PARTITION_MODE_HOST);

The partition itself is computed in org.apache.nutch.crawl.URLPartitioner.getPartition(String, int):
if (mode.equals(PARTITION_MODE_HOST)) {
  hashCode = url.getHost().hashCode();
} else if (mode.equals(PARTITION_MODE_DOMAIN)) {
  hashCode = URLUtil.getDomainName(url).hashCode();
} else { // MODE IP
  InetAddress address = InetAddress.getByName(url.getHost());
  hashCode = address.getHostAddress().hashCode();
}
// make hosts wind up in different partitions on different runs
hashCode ^= seed;
return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
The partitioning behavior is covered by the unit tests in org.apache.nutch.crawl.TestURLPartitioner.
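To see the host mode concretely, here is a small standalone illustration (not Nutch code) of the same hash-xor-modulo computation; URLs sharing a host always land in the same reducer partition for a given seed:

import java.net.URL;

public class PartitionDemo {
  static int partitionByHost(String url, int seed, int numReduceTasks) throws Exception {
    int hashCode = new URL(url).getHost().hashCode();
    hashCode ^= seed;                                  // vary the assignment between runs
    return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) throws Exception {
    int seed = 42, reducers = 4;
    // same host => same partition; a different host may land elsewhere
    System.out.println(partitionByHost("http://example.com/a", seed, reducers));
    System.out.println(partitionByHost("http://example.com/b", seed, reducers));
    System.out.println(partitionByHost("http://example.org/", seed, reducers));
  }
}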
FetcherJob
Fetches all URLs that the generator marked with a given batch ID (or, optionally, all URLs).

How does FetcherReducer work?
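In outline (a hedged summary, not lifted verbatim from the source): the reducer groups its input into per-host (or per-domain/IP, depending on fetcher.queue.mode) fetch queues and starts a pool of fetcher threads sized by fetcher.threads.fetch; each thread repeatedly pulls an item from a queue, fetches it, and waits out the crawl delay so a single host is not hit too quickly. A much-simplified sketch of that producer-consumer pattern (plain Java, no Nutch classes):

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FetcherSketch {
  // one FIFO queue per host, so politeness can be enforced per host
  static final Map<String, Queue<String>> queuesByHost = new ConcurrentHashMap<>();

  static void enqueue(String host, String url) {
    queuesByHost.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<>()).add(url);
  }

  public static void main(String[] args) throws Exception {
    enqueue("example.com", "http://example.com/a");
    enqueue("example.com", "http://example.com/b");
    enqueue("example.org", "http://example.org/");

    ExecutorService pool = Executors.newFixedThreadPool(2); // cf. fetcher.threads.fetch
    for (Queue<String> queue : queuesByHost.values()) {
      pool.submit(() -> {
        String url;
        while ((url = queue.poll()) != null) {
          System.out.println("fetching " + url);   // the real reducer issues the HTTP request here
          try {
            Thread.sleep(1000);                    // crawl delay between requests to the same host
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }
}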
ParserJob

Parses all web pages from a given batch ID. The core of ParserMapper.map() is a single call:

parseUtil.process(key, page);
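A condensed sketch of a ParserJob-style mapper is below; the names follow the Nutch 2.x sources (GoraMapper, ParseUtil), but details such as the batch-ID / fetch-mark check, signature handling, and status updates are omitted, so treat it as an illustration rather than the verbatim class.

import java.io.IOException;

import org.apache.gora.mapreduce.GoraMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.storage.WebPage;

public class ParserMapperSketch extends GoraMapper<String, WebPage, String, WebPage> {
  private ParseUtil parseUtil;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    parseUtil = new ParseUtil(conf);   // loads the configured parser plugins
  }

  @Override
  protected void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    // the real mapper first checks that the page belongs to the requested batch ID
    parseUtil.process(key, page);      // runs the parsers, mutating the WebPage in place
    context.write(key, page);          // the parsed page is written back to the web store
  }
}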