statistics question can anyone point me in the direction eit cfml #cfml-general

websolete

07/27/2022, 3:47 PM

statistics question: can anyone point me in the direction, either a name or the formula itself, for determining the value of the following scenario: say you have four input values, they can be NULL or a number 1 - 10. i want to derive the 'significance' value of the total of those four numbers. for example, say you have the following values: NULL, NULL, 6, and 10. it's not an average of those four values, or the average of the two non-null values, but some result that takes the non-null values and applies the influence of the two NULL values. i'm sure there is a formula or approach for this, but i don't know the name of it or what to search for

bhartsfield

07/27/2022, 3:50 PM

are you SURE it's not just an average? ...once you determine the weight of a NULL that is.

bhartsfield

07/27/2022, 3:50 PM

I see your other comment now (that the number of nulls carries its own weight)

websolete

07/27/2022, 3:52 PM

well, NULL doesn't figure into aggregates, so in my example the average would be 8, but that doesn't reflect that two of the four values 'don't care about whatever is being evaluated'. like, if you had four people trying to decide whether some particular task is worth doing, two of them don't and the other two say it's a 6 or a 10. overall i'm trying to yield a value that accurately reflects everyone as a whole, even those that don't have a horse in the race

rstewart

07/27/2022, 3:52 PM

treat null as zero?

websolete

07/27/2022, 3:53 PM

would inappropriately skew the total importance value down

bhartsfield

07/27/2022, 3:53 PM

well, yeah... thats the answer if you dont give weight to the NULLs

websolete

07/27/2022, 3:53 PM

those two literally have no opinion

websolete

07/27/2022, 3:53 PM

you could also say NULL == 5, but that's also 'wrong'

websolete

07/27/2022, 3:53 PM

again, the scale is 1-10

websolete

07/27/2022, 3:54 PM

there has to be some formula for expressing this, but i was probably out doing bong hits when we covered it in class in high school

bhartsfield

07/27/2022, 3:54 PM

You could ride around the local college dorms trying all the different algorithms you see written on the windows until you find one that works

websolete

07/27/2022, 3:55 PM

or maybe hit up a harvard bar. i hear they have equations and shit written on the walls

Tim

07/27/2022, 3:55 PM

if you think of it as voting, where null is "abstain," then you essentially just throw them out. an abstention is a vote for the winning value, whatever it happens to be.

Tim

07/27/2022, 3:56 PM

So I think just averaging the non-nulls is the right thing to do. in your example, a null is essentially a vote for "8"

bhartsfield

07/27/2022, 3:56 PM

so what value would you expect to get out of NULL, 6, NULL, 10 ?

rstewart

07/27/2022, 3:57 PM

it's why knowing "n" the size of the group of responses is important to understanding any statistic: we sent 12 surveys, got responses to this question on 4 of those surveys, and the average/min/max/mean/mode/stdev of those 4 is ...
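rstewart's point about always reporting "n" next to a statistic can be sketched in Python. This is only an illustration, not something posted in the thread: the function name `summarize` and the use of `None` for a missing answer are assumptions.

```python
from statistics import mean

def summarize(responses):
    """Summarize scores where None marks a missing answer.

    Hypothetical helper: it reports how many were asked and how many
    answered, so the average is never read without its sample size.
    """
    answered = [r for r in responses if r is not None]
    return {
        "asked": len(responses),
        "answered": len(answered),
        "response_rate": len(answered) / len(responses) if responses else 0.0,
        "mean": mean(answered) if answered else None,
        "min": min(answered) if answered else None,
        "max": max(answered) if answered else None,
    }

print(summarize([None, None, 6, 10]))
```

Reading the mean together with `answered` and `response_rate` makes it visible that an 8 from two of four respondents is weaker evidence than an 8 from all four.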

websolete

07/27/2022, 3:57 PM

ok, so consider this then (in response to tim): four voters total. task1 votes: null, null, 6, 10 == 8 total. task2 votes: 2, 2, 6, 10 == 5 total (if averaging). i would think that task2 should be greater than task1, because more people care about it, even if they rank it low individually
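The comparison in this message can be reproduced with a short Python sketch. It assumes a NULL vote is represented as `None` and is dropped before averaging; the function name is hypothetical.

```python
from statistics import mean

def mean_ignoring_nulls(votes):
    """Average only the votes that were actually cast (None = no vote)."""
    cast = [v for v in votes if v is not None]
    return mean(cast) if cast else None

task1 = [None, None, 6, 10]   # two of four stakeholders voted
task2 = [2, 2, 6, 10]         # all four voted

# Plain averaging ranks task1 above task2, which is the outcome
# websolete is uneasy about.
print(mean_ignoring_nulls(task1) > mean_ignoring_nulls(task2))  # True
```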

websolete

07/27/2022, 3:58 PM

bhartsfield, that's what i'm trying to determine

websolete

07/27/2022, 3:58 PM

like applying some 'percentage of respondents' vector to the final value

bhartsfield

07/27/2022, 4:00 PM

multiply by non-null responses?

bhartsfield

07/27/2022, 4:01 PM

making it 16 and 20 instead of 8 and 5
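A minimal Python sketch of bhartsfield's suggestion, assuming `None` stands for NULL (the function name is made up for illustration):

```python
def mean_times_voters(votes):
    """Average of the cast votes, multiplied by how many votes were cast."""
    cast = [v for v in votes if v is not None]
    if not cast:
        return 0
    return sum(cast) / len(cast) * len(cast)

task1 = [None, None, 6, 10]
task2 = [2, 2, 6, 10]
print(mean_times_voters(task1))  # 16.0
print(mean_times_voters(task2))  # 20.0
```

Multiplying the average by the number of voters just gives back the sum of the cast votes, so this approach ranks propositions by total points. That rewards participation, but it also means many lukewarm votes can outrank a few strong ones.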

websolete

07/27/2022, 4:01 PM

i'll create a sample, maybe it will become obvious, will post in a sec

websolete

07/27/2022, 4:01 PM

maybe

bhartsfield

07/27/2022, 4:02 PM

in the end, that isn't much different than the original suggestion of giving NULLs a predefined weight (likely a negative one)

bhartsfield

07/27/2022, 4:03 PM

NULL, NULL, 6, 10 => -1, -1, 6, 10 => 3.5
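To see how much the chosen weight for a NULL moves the result, here is a small Python sketch. The function name and the list of candidate weights are assumptions for illustration; only the -1 case comes from the message above.

```python
def mean_with_null_weight(votes, null_weight):
    """Replace every None with null_weight, then average all slots.

    Assumes votes is non-empty.
    """
    filled = [null_weight if v is None else v for v in votes]
    return sum(filled) / len(filled)

votes = [None, None, 6, 10]
for weight in (-1, 0, 1, 5):
    # -1 reproduces the 3.5 above; 0 is "treat null as zero"
    print(weight, mean_with_null_weight(votes, weight))
```

The arithmetic is the same in every case; the only real decision is which weight can be justified for "no opinion".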

rstewart

07/27/2022, 4:03 PM

I think it also may depend on the implication of a null: does it mean that question doesn't apply to the respondent? or does it mean they couldn't be arsed to tick a box?

websolete

07/27/2022, 4:06 PM

that's what's not clear to me rstewart. here's a basic thing in try cf: https://trycf.com/gist/90e7a40b35496692f3c4ec14d1d4432f/lucee5?theme=monokai

websolete

07/27/2022, 4:06 PM

i'm just assuming that the second nullAvg should have the non-voter count somehow applied in there

rstewart

07/27/2022, 4:07 PM

I think you sort of need some understanding of what a null means to know how to best treat them. (this is part of the reason it can be a challenge to design surveys in a manner that actually gives you data you can really work with?)

websolete

07/27/2022, 4:08 PM

something along these lines, but i don't know if i'm on sound footing here: https://trycf.com/gist/deb3c0079e0751080a3c925a33e82f28/lucee5?theme=monokai

websolete

07/27/2022, 4:10 PM

i feel like the first one should be higher, which it is by a bit, but i also feel like it should be higher than it actually is relative to the second one

Adam Cameron

07/27/2022, 4:14 PM

Full disclosure: I am not a statistician. To me it's down to you to decide on a weighting for no-votes, and that could legitimately be zero or [not counted]. I think if you start saying "I'm going to weight it as a 3/10", then you need to be able to explain why it's 3 and not 4 etc. So sounds a bit hokey to me. Or you could weight them as the median or mode with some legitimacy. TBH, I think just not counting them is going to be more correct more of the time and doesn't smell of artificially skewing the results. Or maybe the median is the correct answer... but yer back to having to decide whether the nulls are treated as [not counted] or perhaps [average]. I bet there's some manner of statistics.stackexchange.com out there somewhere to ask ppl who actually have a clue.

rstewart

07/27/2022, 4:14 PM

if this is a "how important is this thing?" sort of set of questions, and "1" means "not at all, stop bothering me", then it seems safe to treat null ("no answer") as being the same as a "1". but if "1" means something else, you probably do want to account for it differently in trying to compare answers. (again, this is why survey design is important)

Adam Cameron

07/27/2022, 4:15 PM

If it was me, my v1.0 would be to use the average or median of the ones where ppl have scored, and only think about doing a v2.0 if the client said "yeah, this doesn't seem to be working because reasons"

Adam Cameron

07/27/2022, 4:15 PM

Wait until you have a problem to solve before trying to find a solution for it.

websolete

07/27/2022, 4:16 PM

yes, it is a 'how important is this thing vs that thing' when four stakeholders are involved

Adam Cameron

07/27/2022, 4:16 PM

"not important enough to even answer" sounds like a zero to me.

Adam Cameron

07/27/2022, 4:17 PM

Given the set NULL, NULL, 6, 10, why are you perceiving that either 4 or 8 are not already the correct answer?

websolete

07/27/2022, 4:17 PM

in this context it means 'this thing doesn't impact my department', whereas a 1 would mean 'this thing impacts my department only slightly'

websolete

07/27/2022, 4:19 PM

another way of looking at it is 'why didn't i pay attention in school'

slcronin

07/27/2022, 4:19 PM

Your second gist link is exactly the same as setting the nulls to zero and averaging all the numbers, mathematically. And from what you've said so far, that still makes the most sense to me as a clean way to handle it.
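The equivalence slcronin describes can be written out. The gist itself is not reproduced here, so this assumes it scales the non-null average by the share of voters who responded. Let S be the sum of the k non-null votes out of n voters:

```latex
\[
\underbrace{\frac{S}{k}}_{\text{non-null average}}
\times
\underbrace{\frac{k}{n}}_{\text{share who voted}}
= \frac{S}{n}
= \frac{S + 0 \cdot (n - k)}{n}
\]
```

For NULL, NULL, 6, 10 this gives (16 / 2) × (2 / 4) = 8 × 0.5 = 4, which is exactly 16 / 4, the average with both NULLs counted as zero.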

rstewart

07/27/2022, 4:20 PM

i know someone who works in survey design and dealing with the analytics that come from them to help companies make decisions. i'll ask...

websolete

07/27/2022, 4:20 PM

perhaps

websolete

07/27/2022, 4:20 PM

appreciate it

websolete

07/27/2022, 4:20 PM

1-800-HOT-STATS

Adam Cameron

07/27/2022, 4:21 PM

Did you find this when you googled? https://stats.stackexchange.com/q/183257

Adam Cameron

07/27/2022, 4:21 PM

Tellingly:

There is no single correct way to deal with missing values when calculating an average

websolete

07/27/2022, 4:22 PM

didn't see that, most of the results i was seeing talked about how AVG() functions ignore NULLs (in sql)

slcronin

07/27/2022, 4:23 PM

I got a bunch of excel info when I searched. 🙂

websolete

07/27/2022, 4:24 PM

what about this approach, i'm not sure what the implications are: https://trycf.com/gist/dad487ce0abd4a48f6cf62f2002940a1/lucee5?theme=monokai

websolete

07/27/2022, 4:24 PM

the first one is 'more important' since more people voted

websolete

07/27/2022, 4:25 PM

and i guess it IS the same as NULL = 0

slcronin

07/27/2022, 4:25 PM

Yup!

slcronin

07/27/2022, 4:25 PM

The way you laid it out there makes it pretty blatant. 🙂

websolete

07/27/2022, 4:25 PM

but it should be more complex and difficult, needlessly difficult if possible

Adam Cameron

07/27/2022, 4:26 PM

Use tag islands.

Adam Cameron

07/27/2022, 4:27 PM

so disappointed trycf still doesn't support them

Adam Cameron

07/27/2022, 4:28 PM

I think possibly creating a microservice to handle the null value is something to do

Adam Cameron

07/27/2022, 4:28 PM

This way you can make sure you use the most up-to-date implementation of null, and don't need to write it yourself.

websolete

07/27/2022, 4:29 PM

maybe a SOAP endpoint to validate and return the correct NULL

Adam Cameron

07/27/2022, 4:29 PM

Exactly.

websolete

07/27/2022, 4:30 PM

nullvalidator.com is available. i see a business opportunity here

Adam Cameron

07/27/2022, 4:35 PM

Then have a statistics server that has its own DSL for expressing algebraic expressions. So instead of a very unreliable

basicAvg = ( ( a + b + c + d ) / 4 ) * (4/4)

(this leaves the web-apps server to do maths! Terrible!!) you could write the expression using a more declarative DSL, and pass that string to the stats server API:

basicAvg = sserver.calc('arg("{a}").plus().arg("{b}").plus().arg("{c}").plus().arg("{d}").asGroup().divide().literal(4).asGroup().multiply().asGroup(literal(4).divide().literal(4)).withArgs([a,b,c,d])')

websolete

07/27/2022, 4:37 PM

just add fusebox2 and coldspring to it and i think we've got it

websolete

07/27/2022, 4:38 PM

ok, well i appreciate all the input, i think i'm over this topic. i think there is something there to be examined with respect to number of actual votes counted, but here's my final testing: https://trycf.com/gist/42738b72b96899e29c0d06e5e015d5d3/lucee5?theme=monokai

websolete

07/27/2022, 4:39 PM

the results comport with my expectations, which means nothing other than i feel like the first one should have the highest value

gavinbaumanis

07/28/2022, 5:48 AM

I sincerely know some really nerdy - statistical geniuses. (Oxford PhD) - I'll ask them - and get back to you - as soon as I get a response. Also on the Melbourne Scala Slack group is a sub-channel (maths-weeds) - where I have also asked. It's a really interesting question and by the 66 replies - looks like it holds an interest for quite a few!

gavinbaumanis

07/29/2022, 3:33 AM

Unsurprisingly, the advice was: "It depends". It comes down to the "intent" of the question. So in your example of "Null is the same as I do not care / have no preference", in this instance - NULLS should not be included in any way and therefore the average would be based on the valid numeric responses only. If the business case was to place the "I don't cares" in the middle... you might create a Median from the real numbers - and then add in xx Nulls as XX median values - then get an average. From a programming perspective, I would have a nullValue parameter that would allow you to use the same function everywhere, with the ability to provide special treatment for NULLs - depending on the use-case.
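One way the nullValue idea might look in Python is sketched below. The function name `score`, the snake_case parameter `null_value`, and the `"median"` keyword are assumptions made for this illustration, not something posted in the thread.

```python
from statistics import mean, median

def score(votes, null_value=None):
    """Average votes with a configurable treatment of None.

    null_value=None      -> ignore missing votes entirely
    null_value="median"  -> fill missing votes with the median of cast votes
    null_value=<number>  -> fill missing votes with that number
    """
    cast = [v for v in votes if v is not None]
    if not cast:
        return None  # nobody voted, so there is nothing to average
    if null_value is None:
        return mean(cast)
    fill = median(cast) if null_value == "median" else null_value
    return mean([fill if v is None else v for v in votes])

votes = [None, 2, 3, 10]
print(score(votes))            # ignore the NULL
print(score(votes, "median"))  # NULL counted as the median of 2, 3, 10
print(score(votes, 0))         # NULL counted as zero
```

Keeping the NULL treatment as a parameter means every proposition can be scored by the same function, and the choice of treatment stays an explicit, per-use-case decision.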

websolete

07/29/2022, 12:30 PM

appreciate the feedback. ultimately, the scenario isn't a 'vote for or against', it's 'vote for, if you want, and express how strongly you support the idea', and then to try and include the number of votes and non-votes in some meaningful way that can demonstrate how strongly the group feels about proposition A relative to proposition B, C, and D.

gavinbaumanis

07/30/2022, 4:20 AM

if "I" was the consumer of the data for that scenario, I would do both methods...
• Completely ignore NULLs - result-set "A"
• Assign NULLs a value from the available choices that is "the most like" - I don't care. (which only "I" as the consumer can know)
And then use whichever result-set makes the most sense for my use-case. Realising that if I am comparing multiple questions of the data - then the "set" of questions MUST be asked of the SAME result-set.

Jonas Eriksson

08/18/2022, 7:19 PM

re "How to count votes..." @websolete if you still want ideas on how to add this up, maybe give this lady a call, she was quite experimental with her mathematics 🤭. Edit: I meant to also write "cool thread with interesting discussions"!

websolete

08/18/2022, 7:21 PM

i fear i may beat her to death with a 12 pack of diet dr. pepper before i got my answer from her

Jonas Eriksson

08/19/2022, 9:07 AM

Oh I would join you!!
