Monday, November 30, 2015

[HIVE] Querying hive data dictionary

Hi,

In of my scripts i need column datatypes but i couldn't get easily from HiveCli (or Beeline). I used a trick to get column names, but datatypes or some other properties need parsing of hive output and it seems quite challening. In this post, i will talk about Hive Server2  metastore and show how to get table's specific properties with queries..



As we all know Hive uses a relational database for metastore. We can get this information from HIVE configuration from Cloudera Manager easily.

For this purpose, login to CM and navigate to hive service.
In the configuration page, search for "metastore database" and you will see following:



You can also get this settings from hive-site.xml.

After that, login to mysql database (in this example metastore runs in a mysql database)

[HOST]~$mysql -uroot -pPASS

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
...
| hive               |
...
| hue                |
| mysql              |
| oozie              |
....
| sentry             |
...
+--------------------+
16 rows in set (0.00 sec)

After that , we use "hive"  database as it is in the settings ("Hive Metastore Database Name")


mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------------+
| Tables_in_hive            |
+---------------------------+
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| COMPACTION_QUEUE          |
| COMPLETED_TXN_COMPONENTS  |
| DATABASE_PARAMS           |
| DBS                       |
| DB_PRIVS                  |
| DELEGATION_TOKENS         |
| FUNCS                     |
| FUNC_RU                   |
| GLOBAL_PRIVS              |
| HIVE_LOCKS                |
| IDXS                      |
| INDEX_PARAMS              |
| MASTER_KEYS               |
| NEXT_COMPACTION_QUEUE_ID  |
| NEXT_LOCK_ID              |
| NEXT_TXN_ID               |
| NOTIFICATION_LOG          |
| NOTIFICATION_SEQUENCE     |
| NUCLEUS_TABLES            |
| PARTITIONS                |
| PARTITION_EVENTS          |
| PARTITION_KEYS            |
| PARTITION_KEY_VALS        |
| PARTITION_PARAMS          |
| PART_COL_PRIVS            |
| PART_COL_STATS            |
| PART_PRIVS                |
| ROLES                     |
| ROLE_MAP                  |
| SDS                       |
| SD_PARAMS                 |
| SEQUENCE_TABLE            |
| SERDES                    |
| SERDE_PARAMS              |
| SKEWED_COL_NAMES          |
| SKEWED_COL_VALUE_LOC_MAP  |
| SKEWED_STRING_LIST        |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES             |
| SORT_COLS                 |
| TABLE_PARAMS              |
| TAB_COL_STATS             |
| TBLS                      |
| TBL_COL_PRIVS             |
| TBL_PRIVS                 |
| TXNS                      |
| TXN_COMPONENTS            |
| TYPES                     |
| TYPE_FIELDS               |
| VERSION                   |
+---------------------------+
53 rows in set (0.00 sec)

mysql>

As you see , these are tables for hive metastore. Lets query some tables and find out :)

Database information :

You can query DBS table for database properties:

mysql> desc DBS;
+-----------------+---------------+------+-----+---------+-------+
| Field           | Type          | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+-------+
| DB_ID           | bigint(20)    | NO   | PRI | NULL    |       |
| DESC            | varchar(4000) | YES  |     | NULL    |       |
| DB_LOCATION_URI | varchar(4000) | NO   |     | NULL    |       |
| NAME            | varchar(128)  | YES  | UNI | NULL    |       |
| OWNER_NAME      | varchar(128)  | YES  |     | NULL    |       |
| OWNER_TYPE      | varchar(10)   | YES  |     | NULL    |       |
+-----------------+---------------+------+-----+---------+-------+
6 rows in set (0.00 sec)


mysql> select * from DBS;
+-------+-----------------------+-----------------------------------------------------------------+-----------------------+------------+------------+
| DB_ID | DESC                  | DB_LOCATION_URI                                                 | NAME                  | OWNER_NAME | OWNER_TYPE |
+-------+-----------------------+-----------------------------------------------------------------+-----------------------+------------+------------+
|     1 | Default Hive database | hdfs://CLUSTER/user/hive/warehouse                              | default               | public     | ROLE       |
| 15946 | NULL                  | hdfs://CLUSTER/user/hive/warehouse/test.db                      | test                  | hive       | USER       |
| 23764 | NULL                  | hdfs://CLUSTER/user/hive/warehouse/stage.db                     | stage                 | hive       | USER       |
| 40219 | NULL                  | hdfs://CLUSTER/user/hive/warehouse/etl.db                       | etl                   | hive       | USER       |
....
| 56887 | NULL                  | hdfs://CLUSTER/tmp/tmp1                                         | tmp1                  | hive       | USER       |
| 56888 | NULL                  | hdfs://CLUSTER/tmp/tmp2                                         | tmp2                  | hive       | USER       |
+-------+-----------------------+-----------------------------------------------------------------+-----------------------+------------+------------+
17 rows in set (0.00 sec)

Table Information:

You can query TBLS table for properties of table:

mysql> desc TBLS;
+--------------------+--------------+------+-----+---------+-------+
| Field              | Type         | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| TBL_ID             | bigint(20)   | NO   | PRI | NULL    |       |
| CREATE_TIME        | int(11)      | NO   |     | NULL    |       |
| DB_ID              | bigint(20)   | YES  | MUL | NULL    |       |
| LAST_ACCESS_TIME   | int(11)      | NO   |     | NULL    |       |
| OWNER              | varchar(767) | YES  |     | NULL    |       |
| RETENTION          | int(11)      | NO   |     | NULL    |       |
| SD_ID              | bigint(20)   | YES  | MUL | NULL    |       |
| TBL_NAME           | varchar(128) | YES  | MUL | NULL    |       |
| TBL_TYPE           | varchar(128) | YES  |     | NULL    |       |
| VIEW_EXPANDED_TEXT | mediumtext   | YES  |     | NULL    |       |
| VIEW_ORIGINAL_TEXT | mediumtext   | YES  |     | NULL    |       |
| LINK_TARGET_ID     | bigint(20)   | YES  | MUL | NULL    |       |
+--------------------+--------------+------+-----+---------+-------+
12 rows in set (0.00 sec)


mysql> select * from TBLS where tbl_name like 'test%';
+--------+-------------+-------+------------------+---------------------------+-----------+--------+----------------+----------------+--------------------+--------------------+----------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER                     | RETENTION | SD_ID  | TBL_NAME       | TBL_TYPE       | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID |
+--------+-------------+-------+------------------+---------------------------+-----------+--------+----------------+----------------+--------------------+--------------------+----------------+
|  39203 |  1441731750 |     1 |                0 | hive                      |         0 | 115435 | test           | MANAGED_TABLE  | NULL               | NULL               |           NULL |
|  59440 |  1447348984 | 56887 |                0 | hive                      |         0 | 179663 | test           | EXTERNAL_TABLE | NULL               | NULL               |           NULL |
|  64685 |  1448893214 |     1 |                0 | oracle                    |         0 | 197241 | test_ext       | MANAGED_TABLE  | NULL               | NULL               |           NULL |
|  64687 |  1448893437 |     1 |                0 | oracle                    |         0 | 197243 | test_ext_ext   | EXTERNAL_TABLE | NULL               | NULL               |           NULL |
|  24140 |  1437482500 | 23764 |                0 | hive                      |         0 |  72042 | test_parquet   | MANAGED_TABLE  | NULL               | NULL               |           NULL |
|  24139 |  1437482499 | 23764 |                0 | hive                      |         0 |  72041 | test_part_text | MANAGED_TABLE  | NULL               | NULL               |           NULL |
....
|  64663 |  1448887491 |     1 |                0 | hive                      |         0 | 197179 | testext_ext    | EXTERNAL_TABLE | NULL               | NULL               |           NULL |
+--------+-------------+-------+------------------+---------------------------+-----------+--------+----------------+----------------+--------------------+--------------------+----------------+
12 rows in set (0.00 sec)

As you can see you can get owner of the table , type of the table and etc...

Column Information:

You can query column informations from COLUMNS_V2 table:

mysql> desc COLUMNS_V2
    -> ;
+-------------+---------------+------+-----+---------+-------+
| Field       | Type          | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+-------+
| CD_ID       | bigint(20)    | NO   | PRI | NULL    |       |
| COMMENT     | varchar(256)  | YES  |     | NULL    |       |
| COLUMN_NAME | varchar(128)  | NO   | PRI | NULL    |       |
| TYPE_NAME   | varchar(4000) | YES  |     | NULL    |       |
| INTEGER_IDX | int(11)       | NO   |     | NULL    |       |
+-------------+---------------+------+-----+---------+-------+
5 rows in set (0.00 sec)

mysql> select * from COLUMNS_V2 limit 20;
+-------+---------+--------------------+--------------+-------------+
| CD_ID | COMMENT | COLUMN_NAME        | TYPE_NAME    | INTEGER_IDX |
+-------+---------+--------------------+--------------+-------------+
|   999 | NULL    | credit_balance     | int          |           2 |
|   999 | NULL    | credit_card_limits | int          |           1 |
....
|  1000 | NULL    | mortgage_sum       | float        |           1 |
|  7018 | NULL    | col_0              | string       |           0 |
|  7018 | NULL    | col_1              | smallint     |           1 |
|  7018 | NULL    | col_2              | bigint       |           2 |
|  7019 | NULL    | col_0              | string       |           0 |
|  7019 | NULL    | col_1              | smallint     |           1 |
|  7019 | NULL    | col_2              | bigint       |           2 |
| 15973 | NULL    | account_num        | varchar(200) |           1 |
| 15973 | NULL    | active_ind         | char(1)      |           4 |
| 15973 | NULL    | campaign_code      | char(7)      |           2 |
....
+-------+---------+--------------------+--------------+-------------+

Here you can see, it is different from Oracle  Data Dictionary views because it does not show table informations together, so how to link all together.??

Here there is another table named SDS which holds connections.

mysql> desc SDS;
+---------------------------+---------------+------+-----+---------+-------+
| Field                     | Type          | Null | Key | Default | Extra |
+---------------------------+---------------+------+-----+---------+-------+
| SD_ID                     | bigint(20)    | NO   | PRI | NULL    |       |
| CD_ID                     | bigint(20)    | YES  | MUL | NULL    |       |
| INPUT_FORMAT              | varchar(4000) | YES  |     | NULL    |       |
| IS_COMPRESSED             | bit(1)        | NO   |     | NULL    |       |
| IS_STOREDASSUBDIRECTORIES | bit(1)        | NO   |     | NULL    |       |
| LOCATION                  | varchar(4000) | YES  |     | NULL    |       |
| NUM_BUCKETS               | int(11)       | NO   |     | NULL    |       |
| OUTPUT_FORMAT             | varchar(4000) | YES  |     | NULL    |       |
| SERDE_ID                  | bigint(20)    | YES  | MUL | NULL    |       |
+---------------------------+---------------+------+-----+---------+-------+

Lets query it.

mysql> select * from SDS limit 10;
+-------+-------+------------------------------------------+---------------+---------------------------+------------------------------------------------------------------+-------------+------------------------------------------------------------+----------+
| SD_ID | CD_ID | INPUT_FORMAT                             | IS_COMPRESSED | IS_STOREDASSUBDIRECTORIES | LOCATION                                                         | NUM_BUCKETS | OUTPUT_FORMAT                                              | SERDE_ID |
+-------+-------+------------------------------------------+---------------+---------------------------+------------------------------------------------------------------+-------------+------------------------------------------------------------+----------+
| 20987 |  7018 | org.apache.hadoop.mapred.TextInputFormat |               |                           | hdfs://CLUSTER/user/hive/warehouse/credittesttable               |          -1 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |    20987 |
| 20988 |  7019 | org.apache.hadoop.mapred.TextInputFormat |               |                           | hdfs://CLUSTER/user/hive/warehouse/credittesttable_t             |          -1 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |    20988 |
| 47657 | 15973 | org.apache.hadoop.mapred.TextInputFormat |               |                           | hdfs://CLUSTER/user/hive/warehouse/test.db/t_campaign_header     |          -1 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |    47657 |
| 49411 | 16567 | org.apache.hadoop.mapred.TextInputFormat |               |                           | hdfs://CLUSTER/user/oracle/flumeData/sample.log                  |          -1 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |    49411 |
...
+-------+-------+------------------------------------------+---------------+---------------------------+------------------------------------------------------------------+-------------+------------------------------------------------------------+----------+
10 rows in set (0.00 sec)

Here you see it has SD_ID and CD_ID columns.
SD_ID information points to tables and CD_ID corresponds to column ids.

So far, we can query columns of a table in a spesific database with following query:
(change db_name and table_name)

mysql> select  c.column_name,c.TYPE_NAME 
from DBS d 
join TBLS t on d.db_id=t.db_id 
join SDS s on t.sd_id=s.sd_id 
join COLUMNS_V2 c on s.cd_id=c.cd_id 
where d.name='default' <----
and t.tbl_name='test_ext'; <----

+-------------+-----------+
| column_name | TYPE_NAME |
+-------------+-----------+
| ad          | string    |
| id          | int       |
| soyad       | string    |
+-------------+-----------+
3 rows in set (0.00 sec)

You can use other tables for more informations. For example if you want to get info about partitions use PARTITIONS table.

mysql> desc PARTITIONS;
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| PART_ID          | bigint(20)   | NO   | PRI | NULL    |       |
| CREATE_TIME      | int(11)      | NO   |     | NULL    |       |
| LAST_ACCESS_TIME | int(11)      | NO   |     | NULL    |       |
| PART_NAME        | varchar(767) | YES  | MUL | NULL    |       |
| SD_ID            | bigint(20)   | YES  | MUL | NULL    |       |
| TBL_ID           | bigint(20)   | YES  | MUL | NULL    |       |
| LINK_TARGET_ID   | bigint(20)   | YES  | MUL | NULL    |       |
+------------------+--------------+------+-----+---------+-------+
7 rows in set (0.00 sec)


Ok, that's all. I think this post gonna save more time for you :) 

Thanks for blogs i saw on Internet. Here are the links:)

http://bigdatadw.blogspot.com.tr/2012/05/hive-digging-deeper-into-metastore.html
http://www.programmershare.com/2571801/

Sharing is good :)

Thanks for reading.
Enjoy & share.










1 comment :