# troubleshooting
j
Hello! Is it possible to generate segment names that follow the input file names? Say I generate 10 files for 10 "ids": I'd want the segment names to contain these ids, so that a segment can be replaced later by generating another segment with the same name, e.g.
ID1.parquet -> prefix_ID1.segment
Any way to make this work using
segmentNameGeneratorSpec.type
? Maybe using a particular file structure like
data/ID/file.parquet
? Thanks !
m
The default naming scheme already generates names friendly to overwrite at a later point in time, right?
For example, <tableName>_<minTime>_<maxTime>_<id>
If you regenerate data for a date partitioned folder, you will get consistent names, as long as the number of files is unchanged.
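A rough sketch of why that works, in Python rather than actual Pinot code. The function names, the sample table name and the idea that the id is simply the file's position in the sorted listing are assumptions for illustration only:

```python
def default_segment_name(table_name, min_time, max_time, sequence_id):
    """Build a name of the form <tableName>_<minTime>_<maxTime>_<id>."""
    return "_".join(str(part) for part in (table_name, min_time, max_time, sequence_id))


def names_for_folder(table_name, min_time, max_time, input_files):
    """Name one segment per input file in a date-partitioned folder.

    Assumption: the id is the position of the file in the sorted listing,
    so the names depend on the file count, not on the file names.
    """
    return [
        default_segment_name(table_name, min_time, max_time, index)
        for index, _ in enumerate(sorted(input_files))
    ]


# Two runs over the same date folder, with different file names but the same count.
first_run = names_for_folder("myTable", "2021-01-01", "2021-01-01", ["a.parquet", "b.parquet"])
second_run = names_for_folder("myTable", "2021-01-01", "2021-01-01", ["c.parquet", "d.parquet"])

# Same names both times, so the second run overwrites the first run's segments.
print(first_run == second_run)
```

If the second run produced three files instead of two, it would add a segment ending in `_2` rather than replace anything, which is why the file count has to stay the same.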
j
What if I don't have a time column? 😄
Can I use an "id"-partitioned folder then?
m
I think for the refresh use case the convention is <tableName>_<id>
j
So using a structure like so
basedir/id1/file.parquet
basedir/id2/file.parquet
would generate segments with names
<table_name>_id1.segment
<table_name>_id2.segment
? 🙂
Meaning that regenerating those files would easily replace previous segments for the same "ids"
m
No, the id I am referring to is just a sequence number generated on the fly.
Just curious, why don't you have a time column? Is it a pure refresh use case?
j
Ah, I see. Is there any way to make it easy to replace segments based on a user-provided id?
Re: "Just curious, why don't you have a time column? Is it a pure refresh use case?"
It is a table I'm using with
IN_SUBQUERY
🙂
So it's more of a dimension table
Hence why no time column
I guess I can cheat and map ids -> time, but that sounds kind of hacky ^^
m
I'll have to check the code. But in the worst case, the name generator is a very simple interface, and your use case seems like one that others might need, so it might be good to implement if it is not already supported.
Care to take a look?
j
Looks pretty simple indeed. I can't seem to find a "suitable" strategy for my use case though.
m
Ah, the interface does not provide a way to specify input file name.
Might be worth discussing in a broader forum via a github issue.
➕ 1
I do see your use case to be a good one to support.
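For the GitHub issue, here is a minimal sketch of the naming rule being asked for. It is written in Python purely to illustrate the mapping and is not the Pinot interface; both function names are hypothetical, and it assumes the generator would be handed the input file path, which the current interface does not provide:

```python
from pathlib import PurePosixPath


def segment_name_from_file(input_path: str, prefix: str) -> str:
    """Use the file name without its extension as the id: ID1.parquet -> prefix_ID1."""
    return f"{prefix}_{PurePosixPath(input_path).stem}"


def segment_name_from_folder(input_path: str, prefix: str) -> str:
    """Use the parent folder as the id: basedir/id1/file.parquet -> prefix_id1."""
    return f"{prefix}_{PurePosixPath(input_path).parent.name}"


# Regenerating the same input file always yields the same name,
# so the new segment would replace the old one for that id.
print(segment_name_from_file("basedir/ID1.parquet", "prefix"))
print(segment_name_from_folder("basedir/id1/file.parquet", "myTable"))
```

Either variant keeps the name stable per id regardless of how many files are in the run, which is the property the sequence-number scheme lacks for a table without a time column.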
j
Interesting, I'll do that