# help
f
ah…! I was going to say my formula works only for linear scales 😄 You can see the code starting at line 240.
y
Thanks!
Yeah, the log scale makes this tricky (it also makes interpretation tricky, which is why we wanted to try normalizing by the dx)
Reading the code, I noticed it samples at grid centers (I had assumed corners), so my previous approach was also not right in other ways
f
"normalizing" a log scale is bonkers like dividing by the value? since dln x/dx = 1/x?
y
In this example I’m plotting a histogram heatmap/raster of counts on a log axis, which there are a few good reasons to try for this particular use case, but which means that the evenly-sized screen-space bins represent different lengths in data space. This means that, for example, if your data was uniformly distributed, it wouldn't show up as visually uniform: the longer bins would have higher counts. So the idea is to normalize each "cell" by the "width" of its log bin and plot the width-normalized counts, so that a uniform distribution would appear visually uniform (all else being equal, wider bins have more of a chance for data to fall into them, which the normalization controls for)
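To make it concrete, here's a toy version of the idea (just an illustration with made-up numbers, not the real chart):
```js
// Uniformly distributed data, binned into log-spaced bins: raw counts grow
// with each bin's data-space width, but dividing by the width gives a
// roughly flat result (up to sampling noise), which is how a uniform
// distribution should look.
const data = Array.from({length: 100000}, () => 1 + 999 * Math.random()); // uniform on [1, 1000)
const edges = Array.from({length: 31}, (_, i) => 10 ** (3 * i / 30));     // 30 log-spaced bins over [1, 1000]
for (let i = 0; i < 30; i++) {
  const [lo, hi] = [edges[i], edges[i + 1]];
  const count = data.filter((d) => d >= lo && d < hi).length;
  console.log(lo.toFixed(1), count, (count / (hi - lo)).toFixed(1)); // last column is ~constant
}
```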
the reason it's a heatmap and not a bar chart is that I'm plotting a bunch of these histograms over time, to show the time evolution of a distribution, with each time slice visualized the same way as the example above. That also got a bit weird since my time axis is ordinal (it's an array of histograms). Not sure if faceting would help, since it would probably still require multiple samples at the same latency for each histogram.
Another way to put it is that I'm using the raster mark as a way to do screen-space-dependent binning, but with a nonlinear axis the bins have varying lengths, so I want a way for the sampling function to be aware of the "area" that each sample represents, so that I can control for it. This interpretation might get very fiddly in the more abstract setting where you have other kinds of interpolation, but I've been using it with
imageRendering: pixelated
where each sample turns into a rectangular area on the screen
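Roughly the shape of what I mean, heavily simplified (the fill function below is just a placeholder, not my real sampling function):
```js
import * as Plot from "@observablehq/plot";

// The raster mark samples the fill function on a grid of cells and, with
// imageRendering: "pixelated", draws each sample as a solid rectangle.
const chart = Plot.plot({
  x: {type: "log"},
  color: {scheme: "viridis"},
  marks: [
    Plot.raster({
      x1: 1, x2: 1000, y1: 0, y2: 1,
      fill: (x, y) => Math.log10(x) * y, // placeholder sampling function
      imageRendering: "pixelated"
    })
  ]
});
```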
f
OK, so in this case it's "easy", since the width of a log bin (in data space) is proportional to its value.
y
...hah
that is a very astute observation
or is it... let me think about it a bit more
😄
reminds me of the joke about a mathematician who was lecturing in front of a class, remarking that some fact is "obvious", then ends up having to think about it for an hour before concluding that yes, it is obvious
ok, if I understood your point, then:
1. if the samples don't need to be interpretable, I can divide each sample value by log(x) before returning it. This will normalize each bin by a factor that grows at the same rate as the bin width.
2. if I want the samples to be interpretable, then I need to invert the specific log scale used by the chart, since otherwise there's still a remaining constant factor that scales all the bins after the log(x) normalization.
3. since the underlying function that I am sampling is a CDF but I want to visualize the PDF, I would still need to invert the specific x scale to get the precise bin edges in order to compute
pdf(x) = cdf(x + binwidth/2) - cdf(x - binwidth/2)
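Sketch of what I mean by point 3 (the scale and the cdf here are stand-ins; the real ones would come from the chart and the data):
```js
import * as d3 from "d3";

// Invert the chart's x scale at each screen-space cell's edges, then
// difference the CDF and divide by the cell's data-space width to get
// probability mass per data unit.
const x = d3.scaleLog().domain([1, 1000]).range([0, 600]);  // stand-in for the chart's x scale
const cdf = (v) => Math.min(Math.max((v - 1) / 999, 0), 1); // stand-in CDF (uniform on [1, 1000])
const n = 60; // screen-space cells along x
const step = (x.range()[1] - x.range()[0]) / n;

const pdf = d3.range(n).map((i) => {
  const lo = x.invert(i * step);       // left edge of the cell, in data space
  const hi = x.invert((i + 1) * step); // right edge of the cell, in data space
  return (cdf(hi) - cdf(lo)) / (hi - lo);
});
```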
on a slightly different but related note, a possible API idea would be a "postprocessing" hook that accepted a 2D raster grid and did whatever transformations it wanted after all of those counts were available. Then I could just sample the CDF at the precise sample values (though there is another subtlety to do with the 0.5 sample offset, which is that I would need to somehow ensure that I sample the CDF at its extremes, and not just in the middle of the grid cells), and compute the PDF in post
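Roughly what I'd want such a hook to be able to do, written as a plain function (the hook, its name, and its arguments are all made up here; nothing like this exists in Plot today):
```js
// Hypothetical postprocessing step: `grid` is the 2D array of sampled
// values (one row per time slice) and `xEdges` the data-space edges of the
// screen-space cells (length = row length + 1).
function postprocess(grid, xEdges) {
  return grid.map((row) =>
    row.map((v, i) => v / (xEdges[i + 1] - xEdges[i])) // width-normalize each cell
  );
}
```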
I think my density plot mark prototype had this kind of postprocessing hook, which was useful when you wanted to normalize the rows or columns of the heatmap by a property such as the total weight in that row or column. edit: though maybe there's a way to use
normalizeX
and friends instead
f
I think you should divide each sample by x, not log(x)
(I may be wrong, it's kinda difficult to think about this in the abstract)
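quick back-of-the-envelope check, assuming a d3 log scale like the one in the chart:
```js
import * as d3 from "d3";

// A fixed screen-space step always multiplies x by the same factor on a log
// scale, so the data-space width of the bin around x grows in proportion to
// x itself, not log(x).
const x = d3.scaleLog().domain([1, 10000]).range([0, 400]);
const dx = 10; // screen pixels per bin
for (const px of [50, 150, 250, 350]) {
  const center = x.invert(px);
  const width = x.invert(px + dx / 2) - x.invert(px - dx / 2);
  console.log(center.toFixed(1), (width / center).toFixed(3)); // the ratio stays constant
}
```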
y
ah, hm, I'll make an example notebook in a bit; with a uniform dataset it should be easy to tell if the normalization is off (Edit: still planning to look into this. I think you might be right, though: e.g. a square near the data value 100 on a log10 axis covers 100 data units in 1 unit of screen space, while only 1 data unit would be covered near the data value 1…)
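something like this is the check I have in mind for the notebook (the cdf is a stand-in uniform CDF and the scale mimics the chart's):
```js
import * as d3 from "d3";

// Compare the two candidate normalizations on uniform data: dividing each
// screen-space bin's probability mass by x comes out flat across bins,
// dividing by log(x) does not.
const cdf = (v) => Math.min(Math.max((v - 1) / 999, 0), 1); // uniform on [1, 1000]
const x = d3.scaleLog().domain([1, 1000]).range([0, 600]);
const step = 20; // pixels per bin
for (let px = 0; px < 600; px += step) {
  const lo = x.invert(px);
  const hi = x.invert(px + step);
  const mass = cdf(hi) - cdf(lo); // probability falling in this screen bin
  const mid = x.invert(px + step / 2);
  console.log((mass / mid).toExponential(2), (mass / Math.log(mid)).toExponential(2));
  // first column: ~constant across bins; second column: keeps growing
}
```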