Hey can anyone recommend other materials related to the Raw Apache Pinot #general

Neil Teng

06/18/2021, 9:36 PM

Hey, can anyone recommend other materials related to the "Raw value forward index"? I am having a really difficult time understanding the raw value forward index example.

Mayank

06/18/2021, 9:37 PM

What are you looking for? It just stores the raw data, compressed in chunks, as opposed to dictionary encoding it.

Neil Teng

06/18/2021, 9:49 PM

Where does the chunk size come from? And the "chunkoffset = docId % chunkSize" in the example is hard to understand. If the chunk is compressed, what is the difference between it and the on-disk compression in a column-oriented DB? If the purpose is to improve large sequential scans, do you mean a scan on this column without any WHERE clause? If there is a WHERE clause, I think we still need to check each value.

Neil Teng

06/18/2021, 9:49 PM

The example I am referring to: https://docs.pinot.apache.org/basics/indexing/forward-index#raw-value-forward-index

Mayank

06/18/2021, 9:54 PM

The math (modulo, etc.) is on the uncompressed chunk. Compression is for the on-disk index.
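A minimal sketch of the chunk arithmetic discussed above, assuming a fixed number of documents per chunk. The names `docs_per_chunk`, `build_chunks`, and `read_value` are hypothetical, and `zlib` plus a newline-joined payload are stand-ins for illustration only; Pinot's actual chunk layout and compression codecs are not shown here.

```python
import zlib


def build_chunks(values, docs_per_chunk):
    """Group raw values (in docId order) into chunks and compress each chunk on its own."""
    chunks = []
    for start in range(0, len(values), docs_per_chunk):
        chunk = values[start:start + docs_per_chunk]
        # Illustrative encoding: newline-joined strings (values must not contain "\n").
        payload = "\n".join(chunk).encode("utf-8")
        chunks.append(zlib.compress(payload))
    return chunks


def read_value(chunks, docs_per_chunk, doc_id):
    """Locate a value by docId: pick the chunk, decompress it, then index inside it."""
    chunk_id = doc_id // docs_per_chunk      # which chunk holds this docId
    chunk_offset = doc_id % docs_per_chunk   # position inside the uncompressed chunk
    payload = zlib.decompress(chunks[chunk_id])
    return payload.decode("utf-8").split("\n")[chunk_offset]


values = [f"user-{i:04d}" for i in range(10)]
chunks = build_chunks(values, docs_per_chunk=4)
print(read_value(chunks, 4, 6))
```

The division and modulo only make sense after the chunk has been decompressed, which is the distinction Mayank draws: the offsets address the uncompressed chunk, while compression applies to what is stored on disk.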

Mayank

06/18/2021, 9:57 PM

Say you wanted to read docId 1 to 1000. In the case of a dictionary, the dict encoding may scatter these 1000 values all over the disk (in the worst case requiring 1000 disk seeks). In the case of a raw index, there is no dictionary, and all 1000 values would be contiguous on disk (minimizing disk seeks).
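The scattering argument can be illustrated with a toy model. This is only a sketch: "position" here means an index into an in-memory list, not a real disk block, and the helper names (`build_dictionary`, `span`) and the synthetic high-cardinality values are assumptions made for the example.

```python
def build_dictionary(values):
    """Return (sorted unique values, forward index of dictIds in docId order)."""
    sorted_unique = sorted(set(values))
    value_to_dict_id = {v: i for i, v in enumerate(sorted_unique)}
    forward_index = [value_to_dict_id[v] for v in values]
    return sorted_unique, forward_index


def span(positions):
    """Width of the range that must be covered to touch every position."""
    return max(positions) - min(positions) + 1


n = 10_000
# Synthetic high-cardinality column: every value is unique and in shuffled order.
values = [f"v{(doc_id * 7919) % n:05d}" for doc_id in range(n)]
dictionary, forward_index = build_dictionary(values)

doc_ids = range(1, 101)
raw_positions = list(doc_ids)                          # raw index: value i sits at position i
dict_positions = [forward_index[d] for d in doc_ids]   # dictionary: value sits at its dictId

print(span(raw_positions), span(dict_positions))
```

In this model the raw layout covers exactly 100 adjacent positions, while the dictionary lookups land across a much wider range of the sorted dictionary. How much that costs in practice depends on caching and storage, which the sketch does not model.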

Mayank

06/18/2021, 9:58 PM

Typically, you want to use this for high cardinality string columns, where dictionary encoding does not provide much compression.

Neil Teng

06/18/2021, 10:17 PM

I think I missed a point here -- the indexed column is always sorted.

Mayank

06/18/2021, 10:18 PM

No, only the sorted column is sorted. And dictionaries are sorted.

Neil Teng

06/18/2021, 10:20 PM

OK. What are those pointers from colA to colB trying to say?

Mayank

06/18/2021, 10:21 PM

So consider this:

Your use case has queries mostly for a primary column (e.g. where customerId = xxx).

If you sort on customerId, then you will always pick contiguous docIds for a given query.

Now consider you have a high cardinality string column that you project in the query.

With dictionary, the fwd index will have dictionary ids, that may point to different disk blocks.

Without dictionary for this high cardinality column, the contiguous docIds will correspond to contiguous disk blocks.
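A small sketch of the scenario described above, assuming rows are sorted together by `customerId` before docIds are assigned. The sample rows, the `notes` column, and the `doc_id_range` helper are hypothetical; the binary search is just one simple way to find the matching range in a sorted column.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows: (customerId, high-cardinality string column).
rows = [
    ("c3", "note-a"), ("c1", "note-b"), ("c2", "note-c"),
    ("c1", "note-d"), ("c3", "note-e"), ("c1", "note-f"),
]

# Sort whole rows by customerId; a row's position after sorting is its docId.
rows.sort(key=lambda r: r[0])
customer_ids = [r[0] for r in rows]
notes = [r[1] for r in rows]  # raw (no-dictionary) column, stored in docId order


def doc_id_range(sorted_col, value):
    """Return the half-open docId range [start, end) matching value in a sorted column."""
    return bisect_left(sorted_col, value), bisect_right(sorted_col, value)


start, end = doc_id_range(customer_ids, "c1")
projected = notes[start:end]  # one contiguous slice of the raw column
print(start, end, projected)
```

Because the filter resolves to a single docId range, projecting the raw column is one contiguous slice rather than a series of lookups by dictionary id.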

Mayank

06/18/2021, 10:21 PM

Hopefully that makes sense?

Neil Teng

06/18/2021, 10:32 PM

OK, I think I understand it. I have a question about "sort on customerId": do we mean all the columns are sorted in the same order as customerId? How do we configure all the records to be sorted by one column on disk?

Mayank

06/18/2021, 10:33 PM

Yes, that is implicit. A docId represents a row in the table and has to match across columns, nothing special needs to be done for that

Neil Teng

06/18/2021, 10:38 PM

Is docId a theoretical auto-incremental UUID in Pinot or a primary key we actually specify? But I don't see that Pinot has a concept of a primary key.

Neil Teng

06/18/2021, 10:41 PM

Because "A docId represents a row in the table and has to match across columns", I think for a column-oriented DB, every column is sorted by this docId and compressed by default. That is the way data is laid out on disk.

Mayank

06/18/2021, 10:42 PM

docId is just a contiguous integer (0, 1, 2, 3...) in the scope of a Pinot segment

Mayank

06/18/2021, 10:42 PM

"I think for a column-oriented DB, every column is sorted by this docId and compressed by default. That is the way data is laid out on disk."

Mayank

06/18/2021, 10:43 PM

Hmm, then how do you identify a row across columns? If you sort each column independently you will lose which value in colA corresponds to which value in colB. I am not sure what other column-oriented DBs do, but Pinot does not do this.
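The point about row identity can be seen in a tiny example. This is an illustrative sketch with made-up columns `col_a` and `col_b`, not Pinot code: reordering both columns by the same docId permutation keeps rows intact, while sorting each column on its own does not.

```python
# Two columns of the same table; index i in each list is docId i.
col_a = ["c3", "c1", "c2"]
col_b = ["apple", "cherry", "banana"]
original_rows = set(zip(col_a, col_b))

# Sort rows by col_a: compute one permutation and apply it to every column.
order = sorted(range(len(col_a)), key=lambda doc_id: col_a[doc_id])
aligned_a = [col_a[d] for d in order]
aligned_b = [col_b[d] for d in order]

# Sorting each column independently pairs values that never shared a row.
independent_rows = set(zip(sorted(col_a), sorted(col_b)))

print(list(zip(aligned_a, aligned_b)))
print(independent_rows == original_rows)
```

With the shared permutation, `("c1", "cherry")` stays together; with independent sorting, `"c1"` ends up next to `"apple"`, a pairing that was never in the table.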

Neil Teng

06/18/2021, 10:48 PM

By sorted, I just mean laid out in docId order.

Neil Teng

06/18/2021, 10:48 PM

I think we mean the same thing.

Mayank

06/18/2021, 10:48 PM

Yes, seems so

Neil Teng

06/18/2021, 10:50 PM

Wait, "dictionaries are sorted": do you mean the docId is sorted according to the indexed column?

Mayank

06/18/2021, 10:51 PM

dictionary is separate from docId

Neil Teng

06/18/2021, 10:51 PM

I am sorry, it is not.

Neil Teng

06/18/2021, 10:52 PM

OK, back to the raw value forward index: if the data on disk is already in docId order, what is the point of it?

Mayank

06/18/2021, 10:53 PM

From docId you get dictionaryId

Mayank

06/18/2021, 10:53 PM

The actual data for that dictionary id can be anywhere on disk

Mayank

06/18/2021, 10:54 PM

https://docs.pinot.apache.org/basics/indexing/forward-index

Mayank

06/18/2021, 10:54 PM

Look at the forward index section to understand docId -> dictId -> rawData

Neil Teng

06/18/2021, 10:55 PM

thanks, I understand the Dictionary-encoded forward index.

Neil Teng

06/18/2021, 10:56 PM

I am just not sure why we need to specify the "raw value forward index", because the data on disk is already laid out that way.

Mayank

06/18/2021, 10:56 PM

It is not

Mayank

06/18/2021, 10:57 PM

The dictionary of a column is generated by "sorting values of the column". dictId = 0 is the first sorted value and so on.

Neil Teng

06/18/2021, 10:57 PM

yes.

Neil Teng

06/18/2021, 11:02 PM

Do you mean the data in the same column is not sitting next to each other on disk in some cases?
