Monday, May 30, 2016

[Big Data] Purge Leftovers on Hadoop - High Block Count warnings & Some Practices

Hi everyone,

Today, I will talk about leftovers on a Hadoop system. We all know that Hadoop deletes its remainders after operations finish successfully! :) But you should always check block counts: a high block count is caused by small files, and it leads to poor performance.

It is quite likely that you will see High Block Count warnings on the Cloudera Manager main page.
You can check block counts per DataNode in CM -> HDFS service -> Active NameNode Web UI -> Live Nodes,

Or this link http://#ACTIVE_NAME_NODE_IP#:50070/dfshealth.html#tab-datanode
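The same counters are also exposed over the NameNode's JMX endpoint, which is handy for scripting. A small sketch, assuming the Hadoop 2.x default web port 50070; `ACTIVE_NAME_NODE_IP` is a placeholder you must replace:

```shell
# Read NameNode counters (BlocksTotal, live DataNodes) over JMX.
# Assumption: default NameNode web port 50070 (Hadoop 2.x).
NN=ACTIVE_NAME_NODE_IP   # placeholder: your active NameNode host
JMX_URL="http://${NN}:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"
echo "Query: ${JMX_URL}"
# Fetch and pick the interesting counters (needs network access to the NameNode):
# curl -s "${JMX_URL}" | grep -E '"(BlocksTotal|NumLiveDataNodes)"'
```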

In this post, I will point out some practices for getting rid of small files (most of them, anyway :) ).

First step: check file counts on HDFS directories.

In my experience, the /tmp directory is always the painful one, provided you keep the /data directory well arranged.

$hadoop fs -count /tmp
        1477133       3142850      5463560687372 /tmp

I see this on a production system! It tells me lots of small files live there.
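The columns of `hadoop fs -count` are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME. A quick sanity check (pure shell, run on the sample output above) shows why this is a small-files problem:

```shell
# Columns of `hadoop fs -count`: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
sample="        1477133       3142850      5463560687372 /tmp"
files=$(echo "$sample" | awk '{print $2}')
bytes=$(echo "$sample" | awk '{print $3}')
# Integer average file size in KB. An HDFS block is typically 128 MB,
# so averages this small mean a huge block count for very little data.
avg_kb=$(( bytes / files / 1024 ))
echo "files: ${files}, average size: ${avg_kb} KB"
```

Roughly 1.7 MB per file on average, far below a 128 MB block: each tiny file still costs a block entry in the NameNode's memory.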

So when I dug deeper, I saw that two directories were the biggest:

1- /tmp/logs/hive/logs .... Hive MR job logs
2- /tmp/hive/hive .... Hive query/staging logs

For the 2nd directory, we know that Hive automatically purges query logs, aka staging logs, after an operation (SELECT, INSERT, ...) finishes successfully. But if the operation is cancelled or killed, we have to delete the files under there manually.

That is fine by me :) I will do it with the following command (you may want to define a cron job for this):

$ hadoop fs -rm -r -skipTrash /tmp/hive/hive
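A hedged sketch of how that cron job might look; the script path, schedule, and log file are my own examples, and the `command -v` guard just makes the script a harmless no-op on machines without the Hadoop CLI:

```shell
#!/bin/sh
# Sketch of a nightly cleanup job (example only; adapt paths and schedule).
# Example crontab entry (02:00 nightly, as a user with rights on /tmp/hive):
#   0 2 * * * /usr/local/bin/clean_hive_tmp.sh >> /var/log/clean_hive_tmp.log 2>&1
TARGET=/tmp/hive/hive
if command -v hadoop >/dev/null 2>&1; then
  # -skipTrash: delete immediately instead of filling /user/<user>/.Trash
  hadoop fs -rm -r -skipTrash "$TARGET"
else
  echo "hadoop CLI not found; nothing to clean"
fi
```

One caveat: this also removes staging directories of queries that are still running, so schedule it for a quiet window.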

>>> The Important Point Comes Here :)

For the 1st directory, it is full of application_#ID... logs. I wondered: why are they not deleted?

As they are related to YARN, you should check whether the Job History Server is doing its job.

I checked the log retention duration on CM, and it says logs are not kept longer than 7 days. Then why does the /tmp/logs/hive/logs directory keep growing?
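For reference, on a plain (non-CM) setup that retention maps to the following yarn-site.xml property; the 7-day value shown here is an assumption matching the retention described above (7 days = 604800 seconds):

```xml
<!-- yarn-site.xml: keep aggregated logs for 7 days (604800 s), then delete -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>
```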

I checked the logs of the Job History Server and saw this error:

PriviledgedActionException as:mapred/YARN_HOST@YOUR_REALM(auth:KERBEROS) java.lang.IllegalArgumentException: Server has invalid Kerberos principal: yarn/YARN_HOST@YOUR_REALM

To correct this error, do not use the YARN Resource Manager hostname for YARN_HOST;

just use _HOST in the JobHistory Server Advanced Configuration Snippet (Safety Valve) for yarn-site.xml.

In the end, it should look like this:
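(The original post showed a screenshot here.) Based on the text above, the safety-valve entry presumably looks something like the following; the exact property name is my assumption, and YOUR_REALM stays a placeholder:

```xml
<!-- JobHistory Server safety valve for yarn-site.xml:
     _HOST lets Hadoop substitute the local hostname at runtime -->
<property>
  <name>yarn.resourcemanager.principal</name>
  <value>yarn/_HOST@YOUR_REALM</value>
</property>
```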


After restarting the Job History Server, it deletes the aggregated logs successfully.

And the Hive operation logs directory shrank dramatically, to about 1/100 of its former size! :)

$hadoop fs -count /tmp/logs/hive/logs
        3207         8746          828867258 /tmp/logs/hive/logs

Lastly, check the .Trash directory under /user/hive.

When a user drops a table without the PURGE clause, Hive deletes the table, but the data moves to the .Trash directory instead of being permanently deleted.
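A quick way to see how much data sits in users' trash and to empty your own; a sketch assuming the default per-user trash location, again guarded so it is a no-op without the Hadoop CLI:

```shell
# See how much data each user's HDFS trash holds, then empty your own.
TRASH_GLOB='/user/*/.Trash'   # assumption: default trash location per user
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -du -s -h "$TRASH_GLOB"   # per-user trash usage (glob expanded by HDFS)
  hadoop fs -expunge                  # checkpoint and delete the current user's old trash
else
  echo "hadoop CLI not found; run this on a cluster gateway node"
fi
```

Note that `-expunge` only empties the trash of the user running it, so other users' .Trash directories need their own cleanup.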

OK, that is the end of the story. I hope it works for you.

Thanks for reading.

Enjoy & share.

