statistics question: can anyone point me in the d...
# cfml-general
w
statistics question: can anyone point me in the direction, either a name or the formula itself, for determining the value of the following scenario: say you have four input values, they can be NULL or a number 1 - 10. i want to derive the 'significance' value of the total of those four numbers. for example, say you have the following values: NULL, NULL, 6, and 10. it's not an average of those four values, or the average of the two non-null values, but some result that takes the non-null values and applies the influence of the two NULL values. i'm sure there is a formula or approach for this, but i don't know the name of it or what to search for
b
are you SURE it's not just an average? ...once you determine the weight of a NULL that is.
I see your other comment now (that the number of nulls carries its own weight)
w
well, NULL doesn't figure into aggregates, so in my example the average would be 8, but that doesn't reflect that two of the four values 'don't care about whatever is being evaluated'. like, if you had four people trying to decide whether some particular task is worth doing, two of them don't and the other two say it's a 6 or a 10. overall i'm trying to yield a value that accurately reflects everyone as a whole, even those that don't have a horse in the race
r
treat null as zero?
w
would inappropriately skew the total importance value down
b
well, yeah... thats the answer if you dont give weight to the NULLs
w
those two literally have no opinion
you could also say NULL == 5, but that's also 'wrong'
again, the scale is 1-10
there has to be some formula for expressing this, but i was probably out doing bong hits when we covered it in class in high school
b
You could ride around the local college dorms trying all the different algorithms you see written on the windows until you find one that works
w
or maybe hit up a harvard bar. i hear they have equations and shit written on the walls
t
if you think of it as voting, where null is "abstain," then you essentially just throw them out. an abstention is a vote for the winning value, whatever it happens to be.
So I think just averaging hte non-nulls is the right thing to do. in your example, a null is essentially a vote for "8"
b
so what value would you expect to get out of NULL, 6, NULL, 10 ?
r
it's why knowing "n" the size of the group of responses is important to understanding any statistic: we sent 12 surveys, got responses to this question on 4 of those surveys, and the average/min/max/mean/mode/stdev of those 4 is ...
w
ok, so consider this then (in response to tim): four voters total. task1 votes: null, null, 6, 10 == 8 total. task2 votes: 2, 2, 6, 10 == 5 total (if averaging). i would think that task2 should be greater than one, because more people care about it, even if they rank it low individually
bhartsfield, that's what i'm trying to determine
like applying some 'percentage of respondents' vector to the final value
b
multiply by non-null responses?
making it 16 and 20 instead of 8 and 5
w
i'll create a sample, maybe it will become obvious, will post in a sec
maybe
b
in the end, that isn't much different than the original suggestion of giving NULLs a predefined weight (likely a negative one)
NULL, NULL, 6, 10 => -1, -1, 6, 10 => 3.5
r
I think it also may depend on the implication of a null: does it mean that question doesn't apply to the respondent? or does it mean they couldn't be arsed to tic a box?
w
that's what's not clear to me rstewart. here's a basic thing in try cf: https://trycf.com/gist/90e7a40b35496692f3c4ec14d1d4432f/lucee5?theme=monokai
i'm just assuming that the second nullAvg should have the non-voter count somehow applied in there
r
I think you sort of need some understanding of what a null means to know how to best treat them. (this is part of the reason it can be a challenge to design surveys in a manner that actually give you data you can really work with?)
w
something along these lines, but i don't know if i'm on sound footing here: https://trycf.com/gist/deb3c0079e0751080a3c925a33e82f28/lucee5?theme=monokai
i feel like the first one should be higher, which it is by a bit, but i also feel like it should be higher than it actually is relative to the second one
a
Full disclosure: I am not a statistician. To me it's down to you to decide on a weighting for no-votes, and that could legitimately be zero or [not counted]. I think if you start saying "I'm going to weight it as a 3/10, then you need to be able to explain why it's 3 and not 4 etc. So sounds a bit hokey to me. Or you could weight them as the median or mode with some legitimacy. TBH, I think just not counting them is going to be more correct more of the time and doesn't smell of artifically skewing the results Or maybe the median is the correct answer... but yer back to having to decide whether the nulls are treated as
0
or
[not counted]
or perhaps
[average]
. I bet there's some manner of statistics.stackexchange.com out there somewhere to ask ppl who actually have a clue.
r
if this is a "how important is this thing?" sort of set of questions, and "1" means "not at all, stop bothering me", then it seems safe to treat null (no answer" as being the same as a "1". but if "1" means something else, you do have probably want to account for it differently in trying to compare answers. (again, this is why survey design is important)
a
If it was me, my v1.0 would be to use the average or median of the ones where ppl have scored, and only think about doing a v2.0 if the client said "yeah, this doesn't seem to be working because reasons"
Wait until you have a problem to solve before trying to find a solution for it.
w
yes, it is a 'how important is this thing vs that thing' when four stakeholders are involved
a
"not important enough to even answer" sounds like a zero to me.
Given the set NULL, NULL, 6, 10, why are you perceiving that either 4 or 8 are not already the correct answer?
w
in this context it means 'this thing doesn't impact my department', whereas a 1 would mean 'this thing impacts my department only slightly'
another way of looking at it is 'why didn't i pay attention in school'
s
Your second gist link is exactly the same as setting the nulls to zero and averaging all the numbers, mathematically. And from what you've said so far, that still makes the most sense to me as a clean way to handle it.
r
i know someone who works in survey design and dealing with the analytics that come from them to help companies make decisions. i'll ask...
w
perhaps
appreciate it
1-800-HOT-STATS
a
Did you find this when you googled? https://stats.stackexchange.com/q/183257
Tellingly:
There is no single correct way to deal with missing values when calculating an average
w
didn't see that, most of the results i was seeing talked about how AVG() functions ignore NULLs (in sql)
s
I got a bunch of excel info when I searched. 🙂
w
what about this approach, i'm not sure what the implications are: https://trycf.com/gist/dad487ce0abd4a48f6cf62f2002940a1/lucee5?theme=monokai
the first one is 'more important' since more people voted
and i guess it IS the same as NULL = 0
s
Yup!
The way you laid it out there makes it pretty blatant. 🙂
w
but it should be more complex and difficult, needlessly difficult if possible
a
Use tag islands.
so disappointed trycf still doesn't support them
I think possibly creating a microservice to handle the null value is something to do
This way you can make sure you use the most up-to-date implementation of null, and don't need to write it yourself.
w
maybe a SOAP endpoint to validate and return the correct NULL
a
Exactly.
w
nullvalidator.com is available. i see a business opportunity here
a
Then have a statistics server that has it's own DSL for expressing algebraic expressions. So instead of a very unreliable
basicAvg = ( ( a + b + c + d ) / 4 ) * (4/4)
(this leaves the web-apps server to do maths! Terrible!!) you could write the expression using a more declarative DSL, and pass that string to the stats server API:
Copy code
basicAvg = sserver.calc('arg("{a}").plus().arg("{b}").plus().arg("{c}").plus().arg("{d}").asGroup().divide().literal(4).asGroup().multiply().asGroup(literal(4).divide().literal(4)).withArgs([a,b,c,d])')
w
just add fusebox2 and coldspring to it and i think we've got it
ok, well i appreciate all the input, i think i'm over this topic. i think there is something there to be examined with respect to number of actual votes counted, but here's my final testing: https://trycf.com/gist/42738b72b96899e29c0d06e5e015d5d3/lucee5?theme=monokai
the results comport with my expectations, which means nothing other than i feel like the first one should have the highest value
g
I sincerely know some really nerdy - statistical geniuses. (Oxford PhD) - I'll ask them - and get back to you - as soon as I get a response. Also on the Melbourne Scala Slack group is a sub-channel (maths-weeds) - where I have also asked. It's a really interesting question and by the 66 replies - looks like it holds an interest for quite a few!
Unsurprisingly, The advice was; "It depends". It comes down to the "intent" of the question. So in your example of "Null is the same as I do not care / have no preference", in this instance - NULLS should not be included in any way and therefore the average would be based on the valid numeric responses only. If the business case was place the "I don't cares" in the middle... you might create a Median from the real numbers - and then add in xx Nulls as XX median values - then get an average. From a programming perspective, I would have a nullValue parameter that would allow you to use the same function everywhere, with the ability to provide special treatment for NULLs - depending on the use-case.
w
appreciate the feedback. ultimately, the scenario isn't a 'vote for or against', it's 'vote for, if you want, and express how strongly you support the idea', and then to try and include the number of votes and non-votes in some meaningful way that can demonstrate how strongly the group feels about proposition A relative to proposition B, C, and D.
g
if "I" was the consumer of the data for that scenario, I would do both, methods... • Completely ignore NULLs - result-set "A" • Assign NULLs a value from the available choices that is "the most like" - I don't care. (which only "I" as the consumer can know) And then use whichever result-set makes the most sense for my use-case. Realising that if I am comparing multiple questions of the data - then the "set" of questions MUST be asked of the SAME result-set.
j
re "How to count votes..." @websolete if you still want ideas on how to add this up, maybe give this lady a call, she was quite experimental with her mathematics 🤭. Edit: I meant to also write "cool thread with interesting discussions"!
w
i fear i may beat her to death with a 12 pack of diet dr. pepper before i got my answer from her
j
Oh I would join you!!