kish

06/03/2020, 10:54 PM
Hi, are row groups supported in Parquet file format ingestion?

Xiang Fu

06/04/2020, 5:40 PM
I will take a look into it.
I’ve created a Parquet file with row groups, and it works:
➜ parquet-tools meta /tmp/data.parquet | grep "row group" | head -n 3
row group 1:   RC:93898 TS:131168051 OFFSET:4
row group 2:   RC:93921 TS:131157859 OFFSET:131168055
row group 3:   RC:93899 TS:131162316 OFFSET:262325914
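A minimal sketch of how summary lines like the ones above could be checked programmatically, assuming RC is the row count, TS the total byte size, and OFFSET the starting position of each row group. The class and method names (RowGroupSummary, parse, totalRows) are hypothetical and not part of parquet-tools or Pinot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowGroupSummary {
  // Matches lines such as "row group 1:   RC:93898 TS:131168051 OFFSET:4".
  private static final Pattern LINE =
      Pattern.compile("row group (\\d+):\\s+RC:(\\d+)\\s+TS:(\\d+)\\s+OFFSET:(\\d+)");

  /** One parsed summary line (hypothetical holder type for this sketch). */
  public static final class RowGroup {
    public final int index;
    public final long rowCount;
    public final long totalSize;
    public final long offset;

    RowGroup(int index, long rowCount, long totalSize, long offset) {
      this.index = index;
      this.rowCount = rowCount;
      this.totalSize = totalSize;
      this.offset = offset;
    }
  }

  /** Parses every line that looks like a row group summary and skips the rest. */
  public static List<RowGroup> parse(List<String> lines) {
    List<RowGroup> groups = new ArrayList<>();
    for (String line : lines) {
      Matcher m = LINE.matcher(line.trim());
      if (m.matches()) {
        groups.add(new RowGroup(
            Integer.parseInt(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3)),
            Long.parseLong(m.group(4))));
      }
    }
    return groups;
  }

  /** Sums the RC values across all parsed row groups. */
  public static long totalRows(List<RowGroup> groups) {
    long total = 0;
    for (RowGroup g : groups) {
      total += g.rowCount;
    }
    return total;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "row group 1:   RC:93898 TS:131168051 OFFSET:4",
        "row group 2:   RC:93921 TS:131157859 OFFSET:131168055",
        "row group 3:   RC:93899 TS:131162316 OFFSET:262325914");
    List<RowGroup> groups = parse(lines);
    // prints "3 row groups, 281718 rows"
    System.out.println(groups.size() + " row groups, " + totalRows(groups) + " rows");
  }
}
```

In this output each OFFSET equals the previous OFFSET plus the previous TS (4 + 131168051 = 131168055), which fits a file whose columns are all UNCOMPRESSED.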
➜ parquet-tools meta /tmp/data.parquet
file:          file:/tmp/data.parquet
creator:       parquet-mr version 1.8.0 (build 0fda28af84b9746396014ad6a415b90592a98b3b)
extra:         parquet.avro.schema = {"type":"record","name":"record","fields":[{"name":"met_float","type":"float"},{"name":"met_double","type":"double"},{"name":"dim_mv_long","type":{"type":"array","items":"long"}},{"name":"dim_sv_double","type":"double"},{"name":"met_long","type":"long"},{"name":"dim_sv_string","type":"string"},{"name":"dim_mv_int","type":{"type":"array","items":"int"}},{"name":"dim_mv_string","type":{"type":"array","items":"string"}},{"name":"dim_sv_int","type":"int"},{"name":"dim_mv_double","type":{"type":"array","items":"double"}},{"name":"dim_mv_float","type":{"type":"array","items":"float"}},{"name":"dim_sv_float","type":"float"},{"name":"met_int","type":"int"},{"name":"dim_sv_long","type":"long"}]}

file schema:   record
--------------------------------------------------------------------------------
met_float:     REQUIRED FLOAT R:0 D:0
met_double:    REQUIRED DOUBLE R:0 D:0
dim_mv_long:   REQUIRED F:1
.array:        REPEATED INT64 R:1 D:1
dim_sv_double: REQUIRED DOUBLE R:0 D:0
met_long:      REQUIRED INT64 R:0 D:0
dim_sv_string: REQUIRED BINARY O:UTF8 R:0 D:0
dim_mv_int:    REQUIRED F:1
.array:        REPEATED INT32 R:1 D:1
dim_mv_string: REQUIRED F:1
.array:        REPEATED BINARY O:UTF8 R:1 D:1
dim_sv_int:    REQUIRED INT32 R:0 D:0
dim_mv_double: REQUIRED F:1
.array:        REPEATED DOUBLE R:1 D:1
dim_mv_float:  REQUIRED F:1
.array:        REPEATED FLOAT R:1 D:1
dim_sv_float:  REQUIRED FLOAT R:0 D:0
met_int:       REQUIRED INT32 R:0 D:0
dim_sv_long:   REQUIRED INT64 R:0 D:0

row group 1:   RC:93898 TS:131168051 OFFSET:4
--------------------------------------------------------------------------------
met_float:      FLOAT UNCOMPRESSED DO:0 FPO:4 SZ:204570/204570/1.00 VC:93898 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.2111664E-4, max: 0.9999642, num_nulls: 0]
met_double:     DOUBLE UNCOMPRESSED DO:0 FPO:204574 SZ:244586/244586/1.00 VC:93898 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.7748871805833843E-5, max: 0.9999900631857935, num_nulls: 0]
dim_mv_long:
.array:         INT64 UNCOMPRESSED DO:0 FPO:449160 SZ:19375790/19375790/1.00 VC:2386112 ENC:RLE,PLAIN ST:[min: -9223315962872288953, max: 9223235375325574819, num_nulls: 0]
dim_sv_double:  DOUBLE UNCOMPRESSED DO:0 FPO:19824950 SZ:244586/244586/1.00 VC:93898 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 8.108521868788188E-5, max: 0.9999897173166792, num_nulls: 0]
met_long:       INT64 UNCOMPRESSED DO:0 FPO:20069536 SZ:244586/244586/1.00 VC:93898 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: -9221656197292706803, max: 9223232869150310363, num_nulls: 0]
dim_sv_string:  BINARY UNCOMPRESSED DO:0 FPO:20314122 SZ:460339/460339/1.00 VC:93898 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[no stats for this column]
dim_mv_int:
.array:         INT32 UNCOMPRESSED DO:0 FPO:20774461 SZ:9740584/9740584/1.00 VC:2363940 ENC:RLE,PLAIN ST:[min: -2147462202, max: 2147465086, num_nulls: 0]
dim_mv_string:
.array:         BINARY UNCOMPRESSED DO:0 FPO:30515045 SZ:70269656/70269656/1.00 VC:2373041 ENC:RLE,PLAIN ST:[no stats for this column]
dim_sv_int:     INT32 UNCOMPRESSED DO:0 FPO:100784701 SZ:204578/204578/1.00 VC:93898 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: -2146586718, max: 2146995692, num_nulls: 0]
dim_mv_double:
.array:         DOUBLE UNCOMPRESSED DO:0 FPO:100989279 SZ:19618358/19618358/1.00 VC:2416477 ENC:RLE,PLAIN ST:[min: 3.2332416607383507E-6, max: 0.9999963662410412, num_nulls: 0]
dim_mv_float:
.array:         FLOAT UNCOMPRESSED DO:0 FPO:120607637 SZ:9906684/9906684/1.00 VC:2404955 ENC:RLE,PLAIN ST:[min: 4.529953E-6, max: 0.9999939, num_nulls: 0]
dim_sv_float:   FLOAT UNCOMPRESSED DO:0 FPO:130514321 SZ:204570/204570/1.00 VC:93898 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.579523E-5, max: 0.99991333, num_nulls: 0]
met_int:        INT32 UNCOMPRESSED DO:0 FPO:130718891 SZ:204578/204578/1.00 VC:93898 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: -2146612524, max: 2147058892, num_nulls: 0]
dim_sv_long:    INT64 UNCOMPRESSED DO:0 FPO:130923469 SZ:244586/244586/1.00 VC:93898 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: -9221351713267308050, max: 9223334243910011623, num_nulls: 0]
@kish can you run parquet-tools to see the file metadata?

kish

06/04/2020, 9:53 PM
Hi, thanks for looking
Could you rename your Parquet file to data_foo_bar_baz_quiz_131168051.parquet and check again?
I have a feeling that the underscores in the file name are causing the issue.

Xiang Fu

06/04/2020, 10:00 PM
➜ parquet-tools meta /tmp/data_foo_bar_baz_quiz_131168051.parquet
file:          file:/tmp/data_foo_bar_baz_quiz_131168051.parquet
creator:       parquet-mr version 1.8.0 (build 0fda28af84b9746396014ad6a415b90592a98b3b)
extra:         parquet.avro.schema = {"type":"record","name":"record","fields":[{"name":"met_float","type":"float"},{"name":"met_double","type":"double"},{"name":"dim_mv_long","type":{"type":"array","items":"long"}},{"name":"dim_sv_double","type":"double"},{"name":"met_long","type":"long"},{"name":"dim_sv_string","type":"string"},{"name":"dim_mv_int","type":{"type":"array","items":"int"}},{"name":"dim_mv_string","type":{"type":"array","items":"string"}},{"name":"dim_sv_int","type":"int"},{"name":"dim_mv_double","type":{"type":"array","items":"double"}},{"name":"dim_mv_float","type":{"type":"array","items":"float"}},{"name":"dim_sv_float","type":"float"},{"name":"met_int","type":"int"},{"name":"dim_sv_long","type":"long"}]}

file schema:   record
--------------------------------------------------------------------------------
met_float:     REQUIRED FLOAT R:0 D:0
met_double:    REQUIRED DOUBLE R:0 D:0
dim_mv_long:   REQUIRED F:1
.array:        REPEATED INT64 R:1 D:1
dim_sv_double: REQUIRED DOUBLE R:0 D:0
met_long:      REQUIRED INT64 R:0 D:0
dim_sv_string: REQUIRED BINARY O:UTF8 R:0 D:0
dim_mv_int:    REQUIRED F:1
.array:        REPEATED INT32 R:1 D:1
dim_mv_string: REQUIRED F:1
.array:        REPEATED BINARY O:UTF8 R:1 D:1
dim_sv_int:    REQUIRED INT32 R:0 D:0
dim_mv_double: REQUIRED F:1
.array:        REPEATED DOUBLE R:1 D:1
dim_mv_float:  REQUIRED F:1
.array:        REPEATED FLOAT R:1 D:1
dim_sv_float:  REQUIRED FLOAT R:0 D:0
met_int:       REQUIRED INT32 R:0 D:0
dim_sv_long:   REQUIRED INT64 R:0 D:0

row group 1:   RC:93898 TS:131168051 OFFSET:4
--------------------------------------------------------------------------------
Seems to make no difference.
I was using ParquetRecordReaderTest.java to create a test Parquet file.
I wrote 100x the records to create row groups.
A row group is around 512MB, so my data file is 1.3GB and has 3 row groups.
Could you check the schema for that file?
Or you can try to use the Parquet record reader to read the file in the test:
@Override
protected RecordReader createRecordReader()
    throws Exception {
  // Initialize the reader over the test data file and source fields; the last argument is left null.
  ParquetRecordReader recordReader = new ParquetRecordReader();
  recordReader.init(_dataFile, _sourceFields, null);
  return recordReader;
}
Just like this, then try to iterate over the reader to see if anything is returned.
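A rough sketch of the suggested check: drain a reader and count what comes back. It assumes the reader exposes hasNext()/next() style iteration. RowSource, countRows, and fromList are hypothetical stand-ins so the example runs without Pinot on the classpath; they are not Pinot's actual interface.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class ReaderSmokeCheck {
  /** Hypothetical stand-in for a record reader; not Pinot's actual interface. */
  public interface RowSource {
    boolean hasNext();

    Map<String, Object> next() throws Exception;
  }

  /** Drains the reader, prints the first few rows, and returns how many rows it produced. */
  public static long countRows(RowSource reader, int rowsToPrint) throws Exception {
    long count = 0;
    while (reader.hasNext()) {
      Map<String, Object> row = reader.next();
      if (count < rowsToPrint) {
        System.out.println(row);
      }
      count++;
    }
    return count;
  }

  /** Wraps an in-memory list so the sketch can run without a Parquet file. */
  public static RowSource fromList(List<Map<String, Object>> rows) {
    final Iterator<Map<String, Object>> it = rows.iterator();
    return new RowSource() {
      @Override
      public boolean hasNext() {
        return it.hasNext();
      }

      @Override
      public Map<String, Object> next() {
        return it.next();
      }
    };
  }

  public static void main(String[] args) throws Exception {
    // Two made-up rows using field names from the schema shown earlier.
    Map<String, Object> first = new HashMap<>();
    first.put("met_int", 1);
    Map<String, Object> second = new HashMap<>();
    second.put("met_int", 2);

    long rows = countRows(fromList(Arrays.asList(first, second)), 1);
    System.out.println("rows returned: " + rows);
  }
}
```

A count of zero on a real file would suggest the reader is returning nothing, while a count matching the RC totals from parquet-tools would suggest the row groups are being read.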

kish

06/04/2020, 10:07 PM
thanks

Xiang Fu

06/05/2020, 12:22 AM
Please let me know if you need any help with this. Also, if you could share a Parquet file with me, I could check it further on my side.

kish

06/05/2020, 1:23 AM
sure, thx