Monday, February 1, 2016

[HIVE] How to move a table, the hard way and the easy way?

Hi,

After a short break, I am ready for new posts about authorization issues. In this post, I will try to move a table between Hive databases in two different ways. Choose whichever one works for you :)

Once your authorization plan is set, you can create new databases and will need to move your old working tables into their new locations. You will also be asked to move or rename tables during development, or you will need a backup copy of a table in production.

In this post I am working on an Oracle Big Data Appliance 4.2 with CDH 5.4.0, which includes Hive 1.1.0.

There is a command to rename a table:

ALTER TABLE table_name RENAME TO new_table_name;

First, create and populate a test table in the default database.

hive> create table rn1 (id int);
hive> insert into rn1 values (100);
hive> insert into rn1 values (200);
hive> select * from rn1;

OK, it returns the two rows we inserted. Let's rename it.

hive> alter table rn1 rename to rn2;

hive> select * from rn2;

The query runs fine, but now let me move the table out of the default database.

hive> alter table rn2 rename to NEWDB.rn3;

An error appears in Hue >>

Your query has the following error(s):
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Unable to access old location hdfs://CLUSTER/user/hive/warehouse/rn2 for table default.rn2

Check hive.log >>

2016-02-01 15:22:13,985 ERROR org.apache.hive.service.cli.operation.Operation: Error running hive query: 
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Unable to access old location hdfs://CLUSTER/user/hive/warehouse/rn2 for table default.rn2
        at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:315)
        at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:147)
        at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:70)
        at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:209)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. Unable to access old location hdfs://CLUSTER/user/hive/warehouse/rn2 for table default.rn2
        at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:466)
        at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3354)
        at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:332)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1638)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1397)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1181)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1047)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1042)
        at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:145)
        ... 11 more
Caused by: InvalidOperationException(message:Unable to access old location hdfs://CLUSTER/user/hive/warehouse/rn2 for table default.rn2)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_table_with_cascade_result$alter_table_with_cascade_resultStandardScheme.read(ThriftHiveMetastore.java:37952)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_table_with_cascade_result$alter_table_with_cascade_resultStandardScheme.read(ThriftHiveMetastore.java:37938)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_table_with_cascade_result.read(ThriftHiveMetastore.java:37880)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_alter_table_with_cascade(ThriftHiveMetastore.java:1290)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.alter_table_with_cascade(ThriftHiveMetastore.java:1274)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.alter_table(HiveMetaStoreClient.java:329)
        at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.alter_table(SessionHiveMetaStoreClient.java:251)
        at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:91)
        at com.sun.proxy.$Proxy6.alter_table(Unknown Source)
        at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:462)
        ... 21 more

These errors do not make sense, because there is no way this should be a permissions error.
Let me query the table in the default database.

hive> select * from default.rn2; 

The table is no longer found in the default database.


hive> select * from NEWDB.rn3; 

No rows. Where is my data? :/ Let me check what the Hive metastore says about the table.


hive> desc formatted NEWDB.rn3;
...
Location:           
hdfs://CLUSTER/user/hive/warehouse/NEWDB.db/rn3,
...

[HOST]~$hadoop fs -ls /user/hive/warehouse/NEWDB.db
ls: `/user/hive/warehouse/NEWDB.db': No such file or directory

What? There is no such path, because NEWDB actually lives at a different location.

Where is my data? I think it's gone!

This error is caused by a bug in Hive 1.1.0. You can check the details at this link:
http://issues.apache.org/jira/browse/HIVE-10719

As you can see, we got a metastore error. So DON'T USE this method on this version; the bug is fixed in CDH 5.4.7.
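
If you have already hit this bug, your data may not actually be lost: the failed rename seems to have broken only the metadata location, so it is worth checking whether the data files are still sitting under the table's old path. A possible recovery sketch, assuming the files survived at the old warehouse location (I have not verified this against the bug, so test it on a throwaway table first):

[HOST]~$hadoop fs -ls /user/hive/warehouse/rn2

hive> alter table NEWDB.rn3 set location 'hdfs://CLUSTER/user/hive/warehouse/rn2';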

Now, what can we do? DON'T WORRY, there is another way, the easy way:

Hive Metadata Export / Import.

As you will see, it is very handy. The syntax is:

EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
  TO 'export_target_path'
  
IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
  FROM 'source_path' [LOCATION 'import_target_path']
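
Note that both commands also accept a PARTITION clause, so a single partition can be exported and imported on its own. A hypothetical example (the sales table and its ds partition column are made up for illustration):

hive> export table sales partition (ds='2016-01-29') to '/tmp/sales_20160129';
hive> import table sales2 partition (ds='2016-01-29') from '/tmp/sales_20160129';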


It is very easy to use. Now let's create a test case in a new database.

hive> use NEWDB;
hive> create table newdbexp (id int);
hive> insert into newdbexp values (790);
hive> insert into newdbexp values (1);
hive> insert into newdbexp values (60);
hive> select * from newdbexp;

OK, we have it. Let's check the file locations.

[HOST]~$hadoop fs -ls  /PATH_TO_NEWDB/NEWDB/
Found 7 items
...
drwxrwx---+  - hive   supergroup          0 2016-01-29 16:20 /PATH_TO_NEWDB/NEWDB/newdbexp
...

[HOST]~$hadoop fs -ls  /PATH_TO_NEWDB/NEWDB/newdbexp
Found 3 items
-rwxrwx---+  3 hive supergroup          4 2016-01-29 16:22 /PATH_TO_NEWDB/NEWDB/newdbexp/000000_0
-rwxrwx---+  3 hive supergroup          2 2016-01-29 16:22 /PATH_TO_NEWDB/NEWDB/newdbexp/000000_0_copy_1
-rwxrwx---+  3 hive supergroup          3 2016-01-29 16:22 /PATH_TO_NEWDB/NEWDB/newdbexp/000000_0_copy_2

You see, the data files are there, one per insert statement.

Now export the table with the following command.

hive> export table newdbexp to '/tmp/newdbexp';

After this, the export command writes the table metadata and the data into separate files.

[HOST]~$hadoop fs -ls /tmp/newdbexp
Found 5 items
drwxrwxrwx   - hive supergroup          0 2016-01-29 16:22 /tmp/newdbexp/.hive-staging_hive_2016-01-29_16-22-55_385_6263938611577166589-35
drwxrwxrwx   - hive supergroup          0 2016-01-29 16:23 /tmp/newdbexp/.hive-staging_hive_2016-01-29_16-23-41_894_2004760451112761053-35
drwxrwxrwx   - hive supergroup          0 2016-01-29 16:24 /tmp/newdbexp/.hive-staging_hive_2016-01-29_16-24-51_981_2863671248049702467-35
-rwxrwxrwx   3 hive supergroup       1165 2016-01-29 16:22 /tmp/newdbexp/_metadata
drwxrwxrwx   - hive supergroup          0 2016-01-29 16:22 /tmp/newdbexp/data

Let's look at what is in the _metadata file (the .hive-staging directories above appear to be temporary scratch directories left by the statements). As you can see below, it is essentially the DDL of the table in serialized form.

[HOST]~$hadoop fs -cat /tmp/newdbexp/_metadata
{"version":"0.1","table":"{\"1\":{\"str\":\"newdbexp\"},\"2\":{\"str\":\"NEWDB\"},\"3\":{\"str\":\"hive\"},\"4\":{\"i32\":1454077197},\"5\":{\"i32\":0},\"6\":{\"i32\":0},\"7\":{\"rec\":{\"1\":{\"lst\":[\"rec\",1,{\"1\":{\"str\":\"id\"},\"2\":{\"str\":\"int\"}}]},\"2\":{\"str\":\"hdfs://CLUSTER/PATH_TO_NEWDB/NEWDB/newdbexp\"},\"3\":{\"str\":\"org.apache.hadoop.mapred.TextInputFormat\"},\"4\":{\"str\":\"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\"},\"5\":{\"tf\":0},\"6\":{\"i32\":-1},\"7\":{\"rec\":{\"2\":{\"str\":\"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe\"},\"3\":{\"map\":[\"str\",\"str\",1,{\"serialization.format\":\"1\"}]}}},\"8\":{\"lst\":[\"str\",0]},\"9\":{\"lst\":[\"rec\",0]},\"10\":{\"map\":[\"str\",\"str\",0,{}]},\"11\":{\"rec\":{\"1\":{\"lst\":[\"str\",0]},\"2\":{\"lst\":[\"lst\",0]},\"3\":{\"map\":[\"lst\",\"str\",0,{}]}}},\"12\":{\"tf\":0}}},\"8\":{\"lst\":[\"rec\",0]},\"9\":{\"map\":[\"str\",\"str\",6,{\"totalSize\":\"9\",\"numRows\":\"3\",\"rawDataSize\":\"6\",\"COLUMN_STATS_ACCURATE\":\"true\",\"numFiles\":\"3\",\"transient_lastDdlTime\":\"1454077242\"}]},\"12\":{\"str\":\"MANAGED_TABLE\"}}","partitions":[]}[HOST]~$

Now the import step.

hive> import table newdbexp2 from '/tmp/newdbexp';

And here it works:

[HOST]~$hadoop fs -ls  /PATH_TO_NEWDB/NEWDB/
Found 7 items
...
drwxrwx---+  - hive   supergroup          0 2016-01-29 16:20 /PATH_TO_NEWDB/NEWDB/newdbexp
drwxrwx---+  - hive   supergroup          0 2016-01-29 16:22 /PATH_TO_NEWDB/NEWDB/newdbexp2


You see, it works, and the table is placed in the correct location together with its data. Now let's try to move it between databases.

hive> import table OTHERDB.newdbexp3 from '/tmp/newdbexp';

INFO  : Copying file: hdfs://CLUSTER/tmp/newdbexp/data/000000_0_copy_1
INFO  : Copying file: hdfs://CLUSTER/tmp/newdbexp/data/000000_0_copy_2
INFO  : Loading data to table NEWDB.OTHERDB.newdbexp3 from hdfs://CLUSTER/tmp/newdbexp/.hive-staging_hive_2016-01-29_16-23-41_894_2004760451112761053-35/-ext-10000
ERROR : Failed with exception Invalid table name NEWDB.OTHERDB.newdbexp3
org.apache.hadoop.hive.ql.parse.SemanticException: Invalid table name NEWDB.OTHERDB.newdbexp3
 at org.apache.hadoop.hive.ql.exec.Utilities.getDbTableName(Utilities.java:2321)
 at org.apache.hadoop.hive.ql.exec.Utilities.getDbTableName(Utilities.java:2307)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1031)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1019)
 at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:259)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
 at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1638)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1397)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1181)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1047)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1042)
 at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:145)
 at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:70)
 at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:197)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
 at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:209)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

There is an error, because Hive prefixes the table name with the currently used database, producing the invalid name NEWDB.OTHERDB.newdbexp3.

So I have a workaround for it.

!!! Before the import, switch to the target database with USE:

hive> use OTHERDB;
hive> import table newdbexp3 from '/tmp/newdbexp';

And it works, with the same structure.


[HOST]~$hadoop fs -ls  /PATH_TO_OTHERDB/OTHERDB/newdbexp3
Found 3 items
-rwxrwx---+  3 hive supergroup          4 2016-01-29 16:24 /PATH_TO_OTHERDB/OTHERDB/newdbexp3/000000_0
-rwxrwx---+  3 hive supergroup          2 2016-01-29 16:24 /PATH_TO_OTHERDB/OTHERDB/newdbexp3/000000_0_copy_1
-rwxrwx---+  3 hive supergroup          3 2016-01-29 16:24 /PATH_TO_OTHERDB/OTHERDB/newdbexp3/000000_0_copy_2
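
Before trusting the copy (or dropping the source), it may be worth checking that both tables return the same rows:

hive> select count(*) from NEWDB.newdbexp;
hive> select count(*) from OTHERDB.newdbexp3;

Both should return 3 in this example.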


In seconds, Hive moved the table between databases. I think this is the fastest and safest way to move tables in HDFS. Try it; you will probably need it quite often.
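
One last note: EXPORT/IMPORT copies the table rather than moving it, so if a true move is what you want, drop the source table and remove the export directory once you have verified the imported copy; for example:

hive> drop table NEWDB.newdbexp;

[HOST]~$hadoop fs -rm -r /tmp/newdbexp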

OK, that's all.

Thanks for reading.

Enjoy & share.

Source:

http://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_rn_new_in_540.html#concept_ds4_jt2_lr_unique_1

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport 

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RenameTable

http://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_rn_fixed_in_547.html

https://issues.apache.org/jira/browse/HIVE-10719

