Tuesday, December 8, 2015

[Hive] Using Nested Queries


As you know, Hive's SQL-like syntax does not give us all the capabilities of native SQL or the extended SQL features of commercial databases. In this post, I will talk about nested queries in Hive and show how I rewrote a SQL query that runs on Exadata.

First of all, you should be aware that Hadoop was created for big files, very big files :) Hive is a tool that lets us explore those big files using the SQL-like syntax we are all familiar with.

So we have some limitations. As an Oracle DBA, I wrote my SQL queries very easily thanks to the powerful Oracle optimizer, which performs many transformations on the fly. Hive, however, translates our SQL commands into file operations, and it has some limitations of its own.

For example, I have a correlated SQL query running on Exadata:

 WHERE     session_id IN
              (SELECT SESSION_ID
                 FROM SCHEMA1.T4 t4
                WHERE session_id IN
                         (SELECT t3.SESSION_ID
                            FROM SCHEMA1.T2 t2,
                                 SCHEMA1.T3 t3
                           WHERE     t2.VALUE IN (.........)
                                 AND t3.VALUE = '..............'
                                 AND t2.SESSION_ID = t3.SESSION_ID))
 AND T1.VALUE BETWEEN '2015-May-01 00:00:00' AND '2015-Dec-30 00:00:00'

When I run this query in Hive or Beeline, it gives me the following error:

My Hive version -> Hive 1.1.0-cdh5.4.0

FAILED: SemanticException Line 1:190 Unsupported SubQuery Expression 'session_id' in definition of SubQuery sq_1 [
] used as sq_1 at Line 1:120: Nested SubQuery expressions are not supported

When you Google it, you will see the Hive limitations.
You can also check the following issue: https://issues.apache.org/jira/browse/HIVE-784

From the Hive Language Manual, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries :

-> There are a few limitations:
  • These subqueries are only supported on the right-hand side of an expression.
  • IN/NOT IN subqueries may only select a single column.
  • EXISTS/NOT EXISTS must have one or more correlated predicates.
  • References to the parent query are only supported in the WHERE clause of the subquery.
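To make the first limitation concrete, here is a minimal sketch (with hypothetical table names `t1`, `t2`, `t3` and column `id`, not the real schema) of a two-level nested IN that this Hive version rejects, and the flattened form that it accepts:

```sql
-- Rejected on Hive 1.1.0: the inner IN is a nested subquery.
-- SELECT * FROM t1
--  WHERE id IN (SELECT id FROM t2
--                WHERE id IN (SELECT id FROM t3));

-- Flattened rewrite: pull the outer IN up into a join,
-- so only one level of subquery remains.
SELECT t1.*
  FROM t1
  JOIN t2
    ON t1.id = t2.id
 WHERE t2.id IN (SELECT id FROM t3);
```

This is exactly the pattern applied to the real query below: the outermost IN becomes a join condition, leaving a single subquery level.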
So, I rewrote the query with some modifications:

  FROM SCHEMA1.T1 as t1, SCHEMA1.T4 as t4
 WHERE     t1.session_id = t4.session_id
       AND t4.session_id IN
              (SELECT t3.SESSION_ID
                 FROM SCHEMA1.T2 as t2, SCHEMA1.T3 as t3
                WHERE     t2.VALUE IN (.........)
                      AND t3.VALUE = '........................'
                      AND t2.SESSION_ID = t3.SESSION_ID)
      AND t1.VALUE BETWEEN '2015-May-01 00:00:00' AND '2015-Dec-30 00:00:00';

After that, I ran the query from the Beeline command line:

Query ID = oracle_20151208110909_f130fe8a-569a-48ea-a3fb-a25a43ade149
Total jobs = 4
Stage-1 is selected by condition resolver.
Stage-4 is selected by condition resolver.
Launching Job 1 out of 4
Number of reduce tasks not specified. Estimated from input data size: 1099
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
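As the log output above suggests, you can tune the reducer count for a heavy query like this in the same Beeline session before running it. The property names come straight from the log; the values below are only illustrative and should be tuned for your cluster:

```sql
-- Illustrative values only; adjust for your data size and cluster.
SET hive.exec.reducers.bytes.per.reducer=268435456;  -- ~256 MB of input per reducer
SET hive.exec.reducers.max=1500;                     -- upper bound on reducers
-- Or pin an exact reducer count instead:
-- SET mapreduce.job.reduces=1099;
```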
Starting Job = job_1448973637348_0841, Tracking URL ....
2015-12-08 11:09:22,921 Stage-1 map = 0%,  reduce = 0%
2015-12-08 11:10:23,461 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 2926.71 sec
2015-12-08 11:11:23,767 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 8607.74 sec
2015-12-08 11:12:24,066 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 15968.42 sec
MapReduce Total cumulative CPU time: 1 days 9 hours 32 minutes 45 seconds 750 msec
Ended Job = job_1448973637348_0843
Stage-14 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Execution log at: /tmp/oracle/oracle_20151208110909_f130fe8a-569a-48ea-a3fb-a25a43ade149.log
2015-12-08 01:03:24     Starting to launch local task to process map join;      maximum memory = 2058354688
2015-12-08 01:03:27     Processing rows:        200000  Hashtable size: 199999  Memory usage:   431841248       percentage:     0.21
2015-12-08 01:03:27     Processing rows:        300000  Hashtable size: 299999  Memory usage:   538041288       percentage:     0.261
2015-12-08 01:03:27     Dump the side-table for tag: 1 with group count: 300695 into file: file:/tmp/oracle/3caabaec-afdd-4eb7-be9d-963abadf8726/hive_2015-12-08_11-09-09_730_3462363179963451824-1/-local-10005/HashTable-Stage-6/MapJoin-mapfile01--.hashtable
2015-12-08 01:03:28     Uploaded 1 File to: file:/tmp/oracle/3caabaec-afdd-4eb7-be9d-963abadf8726/hive_2015-12-08_11-09-09_730_3462363179963451824-1/-local-10005/HashTable-Stage-6/MapJoin-mapfile01--.hashtable (7554340 bytes)
2015-12-08 01:03:28     End of local task; Time Taken: 4.188 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 4
2015-12-08 13:03:52,249 Stage-6 map = 94%,  reduce = 0%, Cumulative CPU 45.93 sec
2015-12-08 13:03:53,265 Stage-6 map = 100%,  reduce = 0%, Cumulative CPU 49.32 sec
MapReduce Total cumulative CPU time: 49 seconds 320 msec
Ended Job = job_1448973637348_0844
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2098  Reduce: 1099   Cumulative CPU: 1452028.98 sec   HDFS Read: 3916425613388 HDFS Write: 105504 SUCCESS
Stage-Stage-4: Map: 1982  Reduce: 1099   Cumulative CPU: 120765.75 sec   HDFS Read: 259305477973 HDFS Write: 7688479 SUCCESS
Stage-Stage-6: Map: 16   Cumulative CPU: 49.32 sec   HDFS Read: 482421 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 18 days 4 hours 54 minutes 4 seconds 50 msec
Time taken: 6884.603 seconds

In this example, the table stats are:

T1 -> numRows 902.942.362.399 , on disk   3.5 T
T2 -> numRows  28.861.733.076 , on disk 129.8 G
T3 -> numRows  29.576.675.998 , on disk 111.6 G
T4 -> numRows  29.577.122.203 , on disk  66.2 G

So, it was good practice for us.

You can alias nested queries with the "AS" keyword when running correlated queries:

SELECT count(*) FROM SCHEMA1.T1 as t1
  WHERE t1.session_id  IN (SELECT session_id FROM SCHEMA1.T2 as  t2 WHERE t2.value = t1.value);

In the above query you may hit bug https://issues.apache.org/jira/browse/HIVE-9734,
so you need to qualify the columns explicitly (t1.session_id) in the WHERE clause.
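As an alternative to the correlated IN, Hive also supports LEFT SEMI JOIN, which expresses the same "exists a matching row" semantics without a subquery at all. A sketch using the same tables as the query above (I have not benchmarked this form against the IN version):

```sql
-- Semi-join form: counts T1 rows that have a matching
-- (session_id, value) pair in T2. Note that in a LEFT SEMI JOIN
-- the right-hand table may only be referenced in the ON clause.
SELECT count(*)
  FROM SCHEMA1.T1 t1
  LEFT SEMI JOIN SCHEMA1.T2 t2
    ON (t1.session_id = t2.session_id AND t1.value = t2.value);
```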

Ok, that's all.
Thanks for reading.
Enjoy & share.

