I'm inserting into the table db.A_full using this query:
INSERT OVERWRITE db.A_full
PARTITION (year, month, day)
SELECT DISTINCT
A.ID
, A.doc_type
, A.city
, A.gender
, A.age
, A.year
, A.month
, A.day
FROM db.A_parcial as A
db.A_parcial only contains a single date and it's a relatively small table that should be saved in a single file of 148Mb in db.A_full; but when I look in the clouse SHOW PARTITIONS db.A_full
the single partition was saved as 138 files. How can I do the insert avoiding small files creation?
I'm inserting into the table db.A_full using this query:
INSERT OVERWRITE db.A_full
PARTITION (year, month, day)
SELECT DISTINCT
A.ID
, A.doc_type
, A.city
, A.gender
, A.age
, A.year
, A.month
, A.day
FROM db.A_parcial as A
db.A_parcial only contains a single date and it's a relatively small table that should be saved in a single file of 148Mb in db.A_full; but when I look in the clouse SHOW PARTITIONS db.A_full
the single partition was saved as 138 files. How can I do the insert avoiding small files creation?
1 Answer
Reset to default 0There is a way to compact an insert by setting some gonfig options:
-- this next command tells impala to use 1 node
-- it has to be set JUST for the insert
set num_nodes = 1 ;
set PARQUET_FILE_SIZE = 256m ; -- this tells impala the filesize of result
insert into table my_table as select fields from other_table ;
set num_nodes = 0 ; -- this has to be set immediatelly after the insert
compute stats my_table ; -- this is just good manners
This is an easy way to set an insert to reduce the number of files on an insert. However, letting the set num_nodes = 1 will impact your performance, so it only needs to be set on an insert.
发布者:admin,转转请注明出处:http://www.yc00.com/questions/1745202072a4616384.html
评论列表(0条)