Nutch2Crawling - Nutch Wiki




Nutch2Crawling

https://wiki.apache.org/nutch/Nutch2Crawling

InjectorJob

currentJob.setOutputFormatClass(GoraOutputFormat.class);
DataStore<String, WebPage> store = StorageUtils.createWebStore(
    currentJob.getConfiguration(), String.class, WebPage.class);
GoraOutputFormat.setOutput(currentJob, store, true);



GeneratorJob

getConf().set(URLPartitioner.PARTITION_MODE_KEY, URLPartitioner.PARTITION_MODE_HOST);

org.apache.nutch.crawl.URLPartitioner.getPartition(String, int)

if (mode.equals(PARTITION_MODE_HOST)) {
  hashCode = url.getHost().hashCode();
} else if (mode.equals(PARTITION_MODE_DOMAIN)) {
  hashCode = URLUtil.getDomainName(url).hashCode();
} else { // MODE IP
  InetAddress address = InetAddress.getByName(url.getHost());
  hashCode = address.getHostAddress().hashCode();
}

// make hosts wind up in different partitions on different runs
hashCode ^= seed;
return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
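
The host-mode branch of the snippet above can be tried out in isolation. The following is a minimal sketch, not the Nutch class itself: the class name PartitionSketch and the method partitionByHost are made up for illustration, and it parses the URL with java.net.URI instead of the URL object used in Nutch. It shows that two URLs on the same host land in the same partition and that masking with Integer.MAX_VALUE keeps the result non-negative.

```java
import java.net.URI;

/** Illustrative sketch only; PartitionSketch is a hypothetical name, not a Nutch class. */
class PartitionSketch {

    /**
     * Host-mode partitioning as in the snippet above: hash the host name,
     * XOR it with a seed, clear the sign bit, and take the remainder.
     */
    static int partitionByHost(String urlString, int seed, int numReduceTasks) {
        // Assumption: the input is an absolute URL with a host component.
        String host = URI.create(urlString).getHost();
        int hashCode = host.hashCode();
        // A different seed moves hosts to different partitions on different runs.
        hashCode ^= seed;
        // Clearing the sign bit guarantees a non-negative index.
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        int a = partitionByHost("http://example.com/page1", 0, reducers);
        int b = partitionByHost("http://example.com/other/page2", 0, reducers);
        System.out.println("same host, same partition: " + (a == b));
        System.out.println("partition within range: " + (a >= 0 && a < reducers));
    }
}
```

Because every URL of a host hashes to the same value, all of that host's URLs go to one reducer, which is what lets per-host limits be applied in a single place.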



org.apache.nutch.crawl.TestURLPartitioner



FetcherJob

Fetches all URLs that the generator has marked with a given batch ID (or, optionally, fetches all URLs).
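
The selection rule described above can be sketched in plain Java. This is an assumption-laden illustration, not FetcherJob code: the class BatchSelectionSketch, the method select, the ALL sentinel value, and the use of a map from URL to generator mark are all hypothetical. It also assumes that pages without any generator mark are skipped even when fetching all.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; every name here is hypothetical, not a Nutch class. */
class BatchSelectionSketch {

    /** Hypothetical sentinel meaning "ignore the batch ID and take every marked URL". */
    static final String ALL = "-all";

    /**
     * Returns the URLs whose generator mark equals batchId,
     * or every marked URL when batchId is the ALL sentinel.
     * Assumption: a null mark means the generator never selected the page.
     */
    static List<String> select(Map<String, String> generatorMarks, String batchId) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> entry : generatorMarks.entrySet()) {
            String mark = entry.getValue();
            if (mark == null) {
                continue; // not generated, nothing to fetch
            }
            if (ALL.equals(batchId) || mark.equals(batchId)) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> marks = new LinkedHashMap<>();
        marks.put("http://a.example/", "batch-1");
        marks.put("http://b.example/", "batch-2");
        marks.put("http://c.example/", null);
        System.out.println(select(marks, "batch-1"));
        System.out.println(select(marks, ALL));
    }
}
```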

How does FetcherReducer work?



Parse

Parses all web pages from a given batch ID.

parseUtil.process(key, page);


