# general
a
Hello -- I am trying to load some 100M records into an offline table. At first attempt, it was a simple table with no additional indexes other than what was in the tutorial doc.... that went fine. Now I am trying to add a star-tree index on it and the loading has been going on for 30+ mins (last time it took 12 min)... This is where it has been for the last 20 mins... Is there any way to monitor the progress of this??
```
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Creating an executor service with 1 threads(Job parallelism: 0, available cores: 6.)
Submitting one Segment Generation Task for file:/opt/pinot/ai/weather/global_weather100M.csv
Using class: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader to read segment, ignoring configured file format: AVRO
RecordReaderSegmentCreationDataSource is used
Finished building StatsCollector!
Collected stats for 100000000 documents
Created dictionary for INT column: date with cardinality: 30, range: 0 to 29
Using fixed length dictionary for column: country, size: 110
Created dictionary for STRING column: country with cardinality: 10, max length in bytes: 11, range: Australia to USA
Created dictionary for INT column: pincode with cardinality: 10, range: 12324 to 3243678
Created dictionary for INT column: week with cardinality: 53, range: 0 to 52
Using fixed length dictionary for column: city, size: 80
Created dictionary for STRING column: city with cardinality: 10, max length in bytes: 8, range: AMD to SRI
Created dictionary for INT column: year with cardinality: 50, range: 1970 to 2019
Created dictionary for INT column: temperature with cardinality: 50, range: 0 to 49
Using fixed length dictionary for column: state, size: 20
Created dictionary for STRING column: state with cardinality: 10, max length in bytes: 2, range: AS to WB
Using fixed length dictionary for column: day, size: 63
Created dictionary for STRING column: day with cardinality: 7, max length in bytes: 9, range: Friday to Wednesday
Created dictionary for LONG column: ts with cardinality: 530768, range: 1620214278776 to 1620214809690
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
Converting segment: /tmp/pinot-00edd913-441c-4958-8555-9b380f12991b/output/weather_1_OFFLINE_1620214278776_1620214809690_0 to v3 format
v3 segment location for segment: weather_1_OFFLINE_1620214278776_1620214809690_0 is /tmp/pinot-00edd913-441c-4958-8555-9b380f12991b/output/weather_1_OFFLINE_1620214278776_1620214809690_0/v3
Deleting files in v1 segment directory: /tmp/pinot-00edd913-441c-4958-8555-9b380f12991b/output/weather_1_OFFLINE_1620214278776_1620214809690_0
Skip creating default columns for segment: weather_1_OFFLINE_1620214278776_1620214809690_0 without schema
Successfully loaded segment weather_1_OFFLINE_1620214278776_1620214809690_0 with readMode: mmap
Starting building 1 star-trees with configs: [StarTreeV2BuilderConfig[splitOrder=[country, state, city, pincode, day, date, week],skipStarNodeCreation=[],functionColumnPairs=[max__temperature, minMaxRange__temperature, avg__temperature, min__temperature],maxLeafRecords=1000]] using OFF_HEAP builder
Starting building star-tree with config: StarTreeV2BuilderConfig[splitOrder=[country, state, city, pincode, day, date, week],skipStarNodeCreation=[],functionColumnPairs=[max__temperature, minMaxRange__temperature, avg__temperature, min__temperature],maxLeafRecords=1000]
Generated 65977917 star-tree records from 100000000 segment records
```
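(For reference: that builder config corresponds to a `starTreeIndexConfigs` entry under `tableIndexConfig` in the table config. A sketch of what would produce it, using the standard Pinot table-config field names with the values taken from the log above:)

```json
"tableIndexConfig": {
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["country", "state", "city", "pincode", "day", "date", "week"],
      "skipStarNodeCreationForDimensions": [],
      "functionColumnPairs": [
        "max__temperature",
        "minMaxRange__temperature",
        "avg__temperature",
        "min__temperature"
      ],
      "maxLeafRecords": 1000
    }
  ]
}
```

With `maxLeafRecords` at 1000, the builder keeps splitting along the dimension order until each leaf covers at most 1000 records, which helps explain the ~66M extra star-tree records generated on top of the 100M segment records, and the long build.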
k
I don’t know about monitoring progress, but I have run into slow segment builds when the heap size being used wasn’t big enough.
m
This ^^. It is probably GC'ing. Also, can you share the query you are planning to run (and the data size), so we can suggest whether star-tree or some other indexing is better?
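(If heap is the suspect, one option is to override `JAVA_OPTS` when launching the standalone ingestion job; `pinot-admin.sh` picks it up. A minimal sketch; heap sizes, the GC-log path, and the job spec path are assumptions, not values from this thread:)

```bash
# Larger heap for the segment/star-tree build, plus GC logging (JDK 9+ syntax)
# to confirm whether GC pauses are actually the bottleneck.
export JAVA_OPTS="-Xms8G -Xmx16G -Xlog:gc*:file=/tmp/pinot-gc.log"
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/ingestion-job-spec.yaml
```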
a
the data is synthetically generated for the test
```
country,state,city,pincode,day,date,week,year,temperature,ts
Australia,GJ,AMD,560037,Wednesday,8,13,1972,27,1620214278776
Russia,UP,MUM,098678,Sunday,18,47,2010,35,1620214278786
USA,MP,CAL,380015,Sunday,5,34,1999,43,1620214278787
Australia,MP,GOA,12324,Wednesday,25,17,2009,18,1620214278787
India,GJ,SHILLONG,120934,Friday,10,39,1974,26,1620214278787
```
I am trying out multiple queries with group by on these dimensions.. Even without the star-tree index the query performance was quite good.. but I wanted to check if star-tree would make it any better.. Schema:
```
{
  "schemaName": "weather",
  "dimensionFieldSpecs": [
    {
      "name": "country",
      "dataType": "STRING"
    },
    {
      "name": "state",
      "dataType": "STRING"
    },
    {
      "name": "city",
      "dataType": "STRING"
    },
    {
      "name": "pincode",
      "dataType": "INT"
    },
    {
      "name": "day",
      "dataType": "STRING"
    },
    {
      "name": "date",
      "dataType": "INT"
    },
    {
      "name": "week",
      "dataType": "INT"
    },
    {
      "name": "year",
      "dataType": "INT"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "temperature",
      "dataType": "INT"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "10:MINUTES"
    }
  ]
}
```
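(An illustrative example of the kind of group-by the star-tree above would serve; the table name comes from the segment name in the log, the predicate and aggregations are assumed:)

```sql
SELECT country, state, MAX(temperature), AVG(temperature)
FROM weather_1
WHERE day = 'Wednesday'
GROUP BY country, state
LIMIT 100
```

The star-tree pre-aggregates exactly these function/column pairs along the split order, so group-bys and filters restricted to those dimensions can be answered from far fewer records.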
m
Star-tree really helps if you want to pre-aggregate certain dimension combinations over a large number of records (>> millions). For small test data it is not worth it.
a
ok.. I have 100M records (5.8 GB).. At what threshold should I look at using star-tree.. is there any guidance?
m
Not the overall record count, but the number of records a query selects.
a
oh.. so if I had a heavily skewed data set, then perhaps I could use star-tree?
m
Yeah
a
got it!! Thank you!