Wednesday, November 18, 2015

[BigData] HDFS file formats comparison

Hi,

In this post, I will compare HDFS file formats on an Oracle BDA cluster. I will move data from the default TEXT format to the ORC and Parquet file formats and run aggregation queries in order to measure the storage savings and the query performance.


Let's examine the current table and see its structure and file usage.

From the beeline prompt, you can get the table properties with this:

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> analyze table ORGDB.ORGTABLE compute statistics;
0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> desc formatted ORGDB.ORGTABLE;
+-------------------------------+---------------------------------------
...
| # col_name                | data_type                                               |
| session_id                | double                                                  |
| key                       | string                                                  |
| value                     | string                                                  |
...
| # Storage Information         | NULL                                                           |
| SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe             |
| InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat                       |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat     |
...


You can use the hadoop fs commands to get the file usage.

[HOST]~$hadoop fs -count /user/hive/warehouse/ORGDB.db/ORGTABLE
           1         1107      3837624053923 /user/hive/warehouse/ORGDB.db/ORGTABLE
[HOST]~$hadoop fs -du -h /user/hive/warehouse/ORGDB.db/
3.5 T  10.5 T  /user/hive/warehouse/ORGDB.db/ORGTABLE

As you can see, there are 1107 files under the table directory. Let's look at the file extensions.

[HOST]~$hadoop fs -ls -h /user/hive/warehouse/ORGDB.db/ORGTABLE
….
-rwxrwxr-x   3 oracle hive      2.4 G 2015-11-04 18:35 /user/hive/warehouse/ORGDB.db/ORGTABLE/part-m-00299.gz
-rwxrwxr-x   3 oracle hive      2.9 G 2015-11-04 19:42 /user/hive/warehouse/ORGDB.db/ORGTABLE/part-m-00299_copy_1.gz
-rwxrwxr-x   3 oracle hive      2.9 G 2015-11-04 21:22 
...

You can see that the table files are gzip-compressed, so you can use this work as a simulation for your production systems.
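
If you want to quickly peek inside one of these gzip parts without copying them to the local filesystem, something like the following should work (the part file name is taken from the listing above; hadoop fs -text decompresses known codecs such as gzip before printing):

[HOST]~$hadoop fs -text /user/hive/warehouse/ORGDB.db/ORGTABLE/part-m-00299.gz | head -5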

Let me give some information about ORC files, then we will create and populate a table with the ORC file format.

ORC files were created to improve the storage efficiency of data while speeding up Hive query performance. ORC is now a top-level Apache project and it gives very good compression results. ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. Big companies like Yahoo use ORC for their production data. [1] https://orc.apache.org/docs/
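
Because ORC files are self-describing, you can inspect their metadata (column types, stripe layout, compression codec, statistics) straight from the command line. A minimal sketch, assuming you pick one of the part files from the ORC table directory once it is populated (the file name below is just an example, list the directory to get a real one):

[HOST]~$hive --orcfiledump /user/hive/warehouse/NEWDB.db/ORCTABLE/000000_0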

To create a table stored as ORC, I will use the following.

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> create table ORCTABLE comment "orc_format_table" STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB") as select * from  ORGTABLE where 1=2;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
...
INFO  : Table NEWDB.ORCTABLE stats: [numFiles=1, numRows=0, totalSize=49, rawDataSize=0]
No rows affected (18.92 seconds)
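
Note that "orc.compress"="ZLIB" is only one of the ORC table properties; SNAPPY and NONE are also valid codecs, and settings such as the stripe size can be given the same way. A sketch of an alternative definition (the table name and values here are illustrative, not part of my test):

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> create table ORCTABLE_SNAPPY STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY", "orc.stripe.size"="268435456") as select * from ORGDB.ORGTABLE where 1=2;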



To load the data from the original table, I will use the following:

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> insert into ORCTABLE select * from ORGDB.ORGTABLE;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1107
INFO  : Submitting tokens for job: job_1446726705482_0409
….
INFO  : 2015-11-16 15:34:30,920 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 4529286.13 sec
INFO  : 2015-11-16 15:35:20,469 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4529340.54 sec
...
INFO  : Table NEWDB.ORCTABLE stats: [numFiles=1107, numRows=902942362399, totalSize=1852045769505, rawDataSize=172856650104150]
No rows affected (6002.509 seconds)

Let me check the table properties.

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> desc formatted ORCTABLE;
+-------------------------------+---------------------------------------
...
| # Storage Information         | NULL                                                      |
| SerDe Library:                | org.apache.hadoop.hive.ql.io.orc.OrcSerde                 |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat           |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat          |
...

As you can see, the table uses ZLIB as its compression codec. Let's check the filesystem usage.

[HOST]~$hadoop fs -count /user/hive/warehouse/NEWDB.db/ORCTABLE
           1         1107      1852045769505 /user/hive/warehouse/NEWDB.db/ORCTABLE
[HOST]~$hadoop fs -du -h /user/hive/warehouse/NEWDB.db/
1.7 T  5.1 T  /user/hive/warehouse/NEWDB.db/ORCTABLE

As you can see, the file usage goes from 3.5 T down to 1.7 T :)

Now, before running the queries, I will create the Parquet table. Let me say a few words about Parquet files.
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. Parquet allows compression schemes to be specified on a per-column level. [2] https://parquet.apache.org/
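
In Hive, for example, the compression codec for a Parquet table can be chosen at the session level before the table is written. A sketch (depending on the Hive version the same thing can also be done with the "parquet.compression" table property; the table name below is only illustrative):

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> set parquet.compression=SNAPPY;
0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> create table PARQUETTABLE_SNAPPY STORED AS PARQUET as select * from ORGDB.ORGTABLE where 1=2;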


You can create the table with the following:

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> CREATE TABLE PARQUETTABLE STORED AS PARQUET as select * from ORGDB.ORGTABLE where 1=2;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
...
INFO  : Table NEWDB.PARQUETTABLE stats: [numFiles=1, numRows=0, totalSize=101, rawDataSize=0]
No rows affected (14.073 seconds)

Loading the data into the Parquet table:

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> insert into PARQUETTABLE select * from ORGDB.ORGTABLE;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1107
INFO  : 2015-11-16 18:07:16,002 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 3867610.48 sec
INFO  : 2015-11-16 18:08:08,616 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3867671.23 sec
...
INFO  : Table NEWDB.PARQUETTABLE stats: [numFiles=1108, numRows=902942362399, totalSize=1817915522707, rawDataSize=2708827087197]
No rows affected (5117.417 seconds)

OK, let's check the table properties:

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> desc formatted PARQUETTABLE;
+-------------------------------+---------------------------------------
...
| # Storage Information         | NULL                                                            |
| SerDe Library:                | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe     |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat   |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat  |
...

Notice the differences in the # Storage Information section.

Let's check the filesystem usage for the Parquet table.

[HOST]~$hadoop fs -du -h /user/hive/warehouse/NEWDB.db/
1.7 T  5.1 T  /user/hive/warehouse/NEWDB.db/ORCTABLE
1.7 T  5.0 T  /user/hive/warehouse/NEWDB.db/PARQUETTABLE


[HOST]~$hadoop fs -count /user/hive/warehouse/NEWDB.db/PARQUETTABLE
           1         1108      1817915522707 /user/hive/warehouse/NEWDB.db/PARQUETTABLE

So far, I have shown the properties and the filesystem usage of the tables created with the ORC and Parquet file formats. As you can see, they give very similar results, with roughly 50% space savings compared to the original TEXT table.

Now I want to show how the insert operations used the system resources.

The insert durations in my example were as follows.

ORC table insert duration     : 1h 39m 54s
Parquet table insert duration : 1h 25m 02s

Now log in to Cloudera Manager and see how the resources were used: green -> ORC, red -> PARQUET.

You can see that the CPU usage is nearly the same in my example.


I can say that the Parquet load caused more I/O usage in my example.

[Cloudera Manager charts: CPU and disk I/O usage during the ORC and Parquet loads]

Here, I am taking a break :)

Having said that, these graphs look excellent only by comparison: I also copied the original table into a TEXT table without the compression option, which effectively decompressed the data, and here is the result of that operation :)

As you can see, it is terrible :/ It nearly brought the cluster down and took about 3 hours to complete. So don't forget to use the compression option.

Here :/
[Cloudera Manager charts: resource usage during the uncompressed TEXT copy]

[HOST]~$hadoop fs -count /user/hive/warehouse/NEWDB.db/TESTTABLE
           1         1107     31010091576597 /user/hive/warehouse/NEWDB.db/TESTTABLE

[HOST]~$hadoop fs -du -h /user/hive/warehouse/NEWDB.db/
28.2 T  84.6 T  /user/hive/warehouse/NEWDB.db/TESTTABLE
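
For the record, the "compression option" I mention above boils down to a few session settings before the INSERT/CTAS that writes the TEXT output. A sketch of what this typically looks like (property names differ slightly between the older mapred.* and newer mapreduce.* configurations, so treat this as an illustration rather than the exact commands I ran):

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> set hive.exec.compress.output=true;
0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> set mapreduce.output.fileoutputformat.compress=true;
0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;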

Let's go back to our work and check the query durations:

Original table (TEXT format)

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> select key,count(*) from  ORGDB.ORGTABLE group by key;
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1099
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1107
INFO  : 2015-11-16 16:07:28,657 Stage-1 map = 100%,  reduce = 69%, Cumulative CPU 1086549.83 sec
INFO  : 2015-11-16 16:07:29,673 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1087693.04 sec
INFO  : MapReduce Total cumulative CPU time: 12 days 14 hours 8 minutes 13 seconds 40 msec
INFO  : Ended Job = job_1446726705482_0413
60 rows selected (1475.258 seconds)

ORC file format:

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> select key,count(*) from  NEWDB.ORCTABLE group by key ;
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1099
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:6700
INFO  : 2015-11-16 16:14:21,085 Stage-1 map = 100%,  reduce = 95%, Cumulative CPU 204127.92 sec
INFO  : 2015-11-16 16:14:22,127 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 204235.15 sec
INFO  : 2015-11-16 16:14:23,141 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 204273.11 sec
INFO  : MapReduce Total cumulative CPU time: 2 days 8 hours 44 minutes 33 seconds 110 msec
INFO  : Ended Job = job_1446726705482_0415
….
60 rows selected (380.346 seconds)

And lastly, the Parquet file format:

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> select key,count(*) from  NEWDB.PARQUETTABLE group by key;    
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1099
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:6530
INFO  : 2015-11-16 18:23:38,107 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 664717.88 sec
INFO  : 2015-11-16 18:23:39,120 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 664744.83 sec
INFO  : MapReduce Total cumulative CPU time: 7 days 16 hours 39 minutes 4 seconds 830 msec
INFO  : Ended Job = job_1446726705482_0425
60 rows selected (757.26 seconds)

Briefly, I got the following results; as you can see, the ORC file format performed best:

TEXT     ->     60 rows selected (1475.258 seconds)    ~25min
ORC      ->     60 rows selected (380.346 seconds)     ~7min
PARQUET  ->     60 rows selected (757.26 seconds)      ~12min

What I see is that the best results come with the ORC file format: it gives the best query performance together with the best compression rate. However, the ORC file format had some problems in my environment with the Cloudera-based Oracle BDA distribution, so the Parquet format is still a good candidate, especially with better parameter settings.
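
By "better parameter settings" I mainly mean the usual Parquet tuning advice, such as aligning the Parquet row-group (block) size with the HDFS block size so that each row group can be read from a single block. A minimal sketch of such session settings (the values are illustrative; I did not re-run my tests with them):

0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> set parquet.block.size=268435456;
0: jdbc:hive2://HIVEHOST:HIVEPORT/HIVEDB> set dfs.blocksize=268435456;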

I hope this work helps.

Ok, that is all
Thanks for reading.
Enjoy & share.

Source:
https://orc.apache.org/docs/
https://parquet.apache.org/




