Wrestling with Nutch, Part 1 - 白红宇

Published: 2019-06-26

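Problem description:

When running generate with Nutch 1.0 on a crawldb containing 500 million URLs, the input was split into 64 MB blocks by default, which produced 777 map tasks. Late in the run, the job threw the following exception:

Could not find taskTracker/jobcache/job_200903231519_0017/attempt_200903231519_0017_r_000051_0/output/file.out in any of the configured local directories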

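Solution:

Reduce the number of tasks by switching to a split strategy based on the number of files in the crawldb, so that each file becomes exactly one split:

Java code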

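import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

// Nested inside the job class that uses it as its input format.
public static class InputFormat extends SequenceFileInputFormat<WritableComparable, Writable> {

  /** Don't split inputs, to keep things polite. */
  public InputSplit[] getSplits(JobConf job, int nSplits)
      throws IOException {
    // One FileStatus per input file of the job.
    FileStatus[] files = listStatus(job);
    FileSystem fs = FileSystem.get(job);
    InputSplit[] splits = new InputSplit[files.length];
    for (int i = 0; i < files.length; i++) {
      FileStatus cur = files[i];
      // One split covering the whole file; null hosts means no locality hints.
      splits[i] = new FileSplit(cur.getPath(), 0,
          cur.getLen(), (String[])null);
    }
    return splits;
  }
}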
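To make the difference between the two split strategies concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It is a simplified model, not Hadoop's actual split logic (which also takes a goal size and a slop factor into account). The class `SplitEstimate` and the file layout of 7 part files of 7104 MB each are hypothetical, chosen only so that the arithmetic lands on 777 block-based splits; they are not measurements from the crawldb above.

```java
import java.util.Arrays;

/** Rough model of how many map tasks each split strategy yields. */
final class SplitEstimate {

    /** Block-based strategy: every file is cut into chunks of at most splitSize bytes. */
    static long blockBasedSplits(long[] fileLengths, long splitSize) {
        long splits = 0;
        for (long len : fileLengths) {
            // Ceiling division; in this model an empty file contributes no split.
            splits += (len + splitSize - 1) / splitSize;
        }
        return splits;
    }

    /** Per-file strategy: one split per file, regardless of its length. */
    static long perFileSplits(long[] fileLengths) {
        return fileLengths.length;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical layout (assumption): 7 part files of 7104 MB each.
        long[] files = new long[7];
        Arrays.fill(files, 7104 * mb);
        System.out.println(blockBasedSplits(files, 64 * mb)); // prints 777
        System.out.println(perFileSplits(files));             // prints 7
    }
}
```

The trade-off is visible in the numbers: far fewer tasks, but each task now processes a whole file, so individual tasks run much longer.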

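This introduced a new problem: several tasks went ten minutes without responding, which caused the whole job to fail.

Solution:

Edit hadoop-site.xml and raise the task timeout to 3600000 ms (one hour):

XML code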

<property>
  <name>mapred.task.timeout</name>
  <value>3600000</value>
  <description>The number of milliseconds before a task will be
  terminated if it neither reads an input, writes an output, nor
  updates its status string.
  </description>
</property>
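The rule described for mapred.task.timeout can be illustrated with a toy model in plain Java. This is only a sketch of the idea, not Hadoop's implementation; the class `IdleWatchdog` and its methods are hypothetical names, and the 25-minute silent period is an assumed figure for illustration.

```java
/** Toy model of an idle timeout: a task is killed once it stays silent too long. */
final class IdleWatchdog {
    private final long timeoutMillis;
    private long lastProgressMillis;

    IdleWatchdog(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = startMillis;
    }

    /** Called whenever the task reads input, writes output, or updates its status. */
    void reportProgress(long nowMillis) {
        lastProgressMillis = nowMillis;
    }

    /** True once the task has been silent for longer than the timeout. */
    boolean isTimedOut(long nowMillis) {
        return nowMillis - lastProgressMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // Assumed scenario: a task stays silent for 25 minutes on a large split.
        IdleWatchdog tenMinutes = new IdleWatchdog(600_000L, 0L);
        IdleWatchdog oneHour = new IdleWatchdog(3_600_000L, 0L);
        System.out.println(tenMinutes.isTimedOut(25 * minute)); // prints true
        System.out.println(oneHour.isTimedOut(25 * minute));    // prints false
    }
}
```

Raising the timeout treats the symptom. Where the task code can be changed, another option in the old mapred API is to call `reporter.progress()` periodically during long stretches of work, so that the task counts as alive without a larger timeout.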

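Summary:

Large versus small, many versus few, long versus short: which one is right keeps changing with the situation. With large data volumes in particular, you have to adapt to the actual conditions. As the saying goes, the art of applying a method lies in the mind of the one who uses it.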

Reposted from: http://hjoul.baihongyu.com/
