
Hadoop Metadata Merge Exception and Its Solution

[Date: 2014-04-25] Source: 过往记忆  Author: 过往记忆

Over the past few days I have been watching the logs on the Standby NN, and I noticed that every time an fsimage merge completed, the step in which the Standby NN notifies the Active NN to download the merged fsimage failed with the following exception:

2014-04-23 14:42:54,964 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:268)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:247)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:162)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:174)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1100(StandbyCheckpointer.java:53)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:297)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:210)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:230)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:226)


Out of habit, I Googled the error, but after a long search I found nothing similar and had to work it out myself. Analyzing the logs revealed an even stranger problem: the time of the last checkpoint never changed (it was always the time of the very first checkpoint taken when the Standby NN started), as shown below:

2014-04-23 14:50:54,429 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering checkpoint because it has been 70164 seconds since the last checkpoint, which exceeds the configured interval 600

Could this be a Hadoop bug? I went to the source code with the error message above and, after careful analysis, found that all of the messages above are produced by the StandbyCheckpointer class:

private void doWork() {
  // Reset checkpoint time so that we don't always checkpoint
  // on startup.
  lastCheckpointTime = now();
  while (shouldRun) {
    try {
      Thread.sleep(1000 * checkpointConf.getCheckPeriod());
    } catch (InterruptedException ie) {
    }
    if (!shouldRun) {
      break;
    }
    try {
      // We may have lost our ticket since last checkpoint, log in again,
      // just in case
      if (UserGroupInformation.isSecurityEnabled()) {
        UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab();
      }

      long now = now();
      long uncheckpointed = countUncheckpointedTxns();
      long secsSinceLast = (now - lastCheckpointTime) / 1000;

      boolean needCheckpoint = false;
      if (uncheckpointed >= checkpointConf.getTxnCount()) {
        LOG.info("Triggering checkpoint because there have been " +
            uncheckpointed + " txns since the last checkpoint, which " +
            "exceeds the configured threshold " +
            checkpointConf.getTxnCount());
        needCheckpoint = true;
      } else if (secsSinceLast >= checkpointConf.getPeriod()) {
        LOG.info("Triggering checkpoint because it has been " +
            secsSinceLast + " seconds since the last checkpoint, which " +
            "exceeds the configured interval " + checkpointConf.getPeriod());
        needCheckpoint = true;
      }

      synchronized (cancelLock) {
        if (now < preventCheckpointsUntil) {
          LOG.info("But skipping this checkpoint since we are about" +
              " to failover!");
          canceledCount++;
          continue;
        }
        assert canceler == null;
        canceler = new Canceler();
      }

      if (needCheckpoint) {
        doCheckpoint();
        lastCheckpointTime = now;
      }
    } catch (SaveNamespaceCancelledException ce) {
      LOG.info("Checkpoint was cancelled: " + ce.getMessage());
      canceledCount++;
    } catch (InterruptedException ie) {
      // Probably requested shutdown.
      continue;
    } catch (Throwable t) {
      LOG.error("Exception in doCheckpoint", t);
    } finally {
      synchronized (cancelLock) {
        canceler = null;
      }
    }
  }
}
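The effect of that catch (Throwable t) branch can be reduced to a minimal runnable sketch (hypothetical names, not Hadoop code): once the checkpoint step throws, the timestamp assignment on the line after it is skipped, so the recorded "last checkpoint" time never advances and the elapsed-time trigger fires forever.

```java
// Minimal sketch (hypothetical names, NOT Hadoop code) of the stale-timestamp
// pattern: an exception in the checkpoint step skips the timestamp update.
public class StaleCheckpointDemo {
    static long lastCheckpointTime = 0;

    static void doCheckpoint(boolean fail) {
        if (fail) {
            throw new RuntimeException("Read timed out");
        }
    }

    static void attempt(long now, boolean fail) {
        try {
            doCheckpoint(fail);
            lastCheckpointTime = now; // never reached when doCheckpoint throws
        } catch (Throwable t) {
            // logged and swallowed; the checkpoint loop just continues
        }
    }

    public static void main(String[] args) {
        attempt(1000, true);
        attempt(2000, true);
        System.out.println(lastCheckpointTime); // prints 0: every attempt failed
    }
}
```

As long as every attempt fails, the gap between now and lastCheckpointTime only grows, which matches the ever-increasing "70164 seconds since the last checkpoint" in the log above.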

The exception above is thrown while doCheckpoint() is executing, which means the statement lastCheckpointTime = now; is never reached. So why does doCheckpoint() fail? Following the stack trace, the failure comes from this statement in the doGetUrl method of the TransferFsImage class:

if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
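This kind of "Read timed out" on getResponseCode() can be reproduced with a small self-contained demo (illustrative code, not Hadoop's): a server that completes the TCP handshake but never sends an HTTP response forces the client's read timeout to fire.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URL;

public class ReadTimeoutDemo {
    // Connects to a local server that accepts but never answers, and reports
    // whether getResponseCode() hit the configured read timeout.
    static boolean timesOut(int readTimeoutMs) throws IOException {
        try (ServerSocket silent = new ServerSocket(0)) {
            HttpURLConnection conn = (HttpURLConnection) new URL(
                    "http://127.0.0.1:" + silent.getLocalPort() + "/").openConnection();
            conn.setConnectTimeout(1000);
            conn.setReadTimeout(readTimeoutMs);
            try {
                conn.getResponseCode(); // blocks waiting for an HTTP status line
                return false;
            } catch (SocketTimeoutException e) {
                return true; // "Read timed out", as in the Standby NN log
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(timesOut(200));
    }
}
```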

The timeout happens because connection never receives a response code from the other side. My first thought was that my cluster's socket timeout was misconfigured, but after checking the relevant settings that turned out not to be the case. So I went back to the code and found the following settings just before the statement above:

if (timeout <= 0) {
  Configuration conf = new HdfsConfiguration();
  timeout = conf.getInt(DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_KEY,
      DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_DEFAULT);
}

if (timeout > 0) {
  connection.setConnectTimeout(timeout);
  connection.setReadTimeout(timeout);
}

if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
  throw new HttpGetFailedException(
      "Image transfer servlet at " + url +
      " failed with status code " + connection.getResponseCode() +
      "\nResponse message:\n" + connection.getResponseMessage(),
      connection);
}
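The code above shows that the read timeout on the image-transfer connection is taken from DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_KEY, i.e. the dfs.image.transfer.timeout property. A plausible fix is therefore to raise that value (in milliseconds) in hdfs-site.xml on both NameNodes; the right value depends on the fsimage size and network speed, so the number below is only an illustrative assumption:

```
<!-- hdfs-site.xml: illustrative value only; tune to fsimage size and network -->
<property>
  <name>dfs.image.transfer.timeout</name>
  <value>600000</value> <!-- socket timeout for image transfer, in milliseconds -->
</property>
```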