# cfml-general
s
Today's fun exercise: We have a vendor to whom we send off lots of user data. Turns out they can't accept any unicode characters. So somebody whose name is Tomáš breaks the universe. Anybody ever have to convert Tomáš to Tomas in CF before?
😲 1
b
I've seen some libraries that basically map a huge list of unicode chars to the closest ascii equiv, but I don't recall where they were at the moment
s
Ugh. Yeah I guess that's what we need. I'll check out Normalize too. Thanks guys
b
You can probably find a list somewhere online and just place them in a struct with the unicode char as the key and the ascii replacement as the value, then do a lookup and replacement char by char.
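A minimal sketch of that char-by-char lookup, in case it helps. `asciiMap` and `toAsciiByLookup` are made-up names, the four entries are only samples (a real table would be much longer), and the `?` fallback is an assumption. Keys here are code points rather than the characters themselves, since struct keys are case-insensitive by default and that could make Á and á collide.

```cfml
<cfscript>
// Hypothetical sample table: Unicode code point -> ASCII replacement.
asciiMap = {
    "225": "a",   // á (U+00E1)
    "353": "s",   // š (U+0161)
    "248": "o",   // ø (U+00F8)
    "223": "ss"   // ß (U+00DF)
};

// Hypothetical helper, not part of any library.
string function toAsciiByLookup( required string input, required struct map, string fallback = "?" ) {
    var out = "";
    for ( var i = 1; i <= len( arguments.input ); i++ ) {
        var ch   = mid( arguments.input, i, 1 );
        var code = asc( ch );
        if ( code <= 127 ) {
            // already 7-bit ASCII, keep as-is
            out &= ch;
        } else if ( structKeyExists( arguments.map, code ) ) {
            // known character, swap in the mapped replacement
            out &= arguments.map[ code ];
        } else {
            // not in the table (assumed behaviour: use the fallback)
            out &= arguments.fallback;
        }
    }
    return out;
}

// "Tomáš" built with chr() so the example doesn't depend on template encoding
writeOutput( toAsciiByLookup( "Tom" & chr( 225 ) & chr( 353 ), asciiMap ) );
</cfscript>
```

Anything missing from the table ends up as the fallback, so the table is the part that needs real work.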
s
love it when the billion dollar corporations flunk data architecture 101
b
lol, yeah it's strange you have to deal with this in 2022
Probably a mainframe
i've dealt with a couple places using AS/400's and their DB2 DB wouldn't do unicode at all
Let me guess, do they store everything in upper case? 🙂
s
not a mainframe. SOAP-based API built in 2014!
who in 2014 was like 'yeah XML is the best'
a
No excuse even for a SOAP API built in 2014. Supporting unicode was well-trod ground even then. It's just shit implementation.
s
I've rewritten my email acknowledgment of 'no unicode characters allowed' several times to try and make my reply something other than 'it's just shit implementation' but not having much luck yet
api docs are even better. it's a windows help file
🤢 1
a
hahaha
😞
b
We still have one of those floating around here somewhere too (predates me)
You could always do something like...
string.reReplace( "[^\x00-\x7F]+", "__ONLY SUPPORTED IN THE FUTURE__", "all" )
s
people gonna be mad when they find out they just signed up their kid, ONLY SUPPORTED IN THE FUTURE, to play soccer
plus what does 'the future' even mean when you're dealing with SOAP in 2022
b
Little Bobby Tables is playing goalie this year, I hear
👍 2
b
lol
"Tomáš".reReplace( "[^\x00-\x7F]+", "[NOT 'MERICAN]", "all" )
s
the best part is where they claimed this limitation was due to ...PCI-DSS compliance
🙄 1
"deploy acronym buzzword defense shield"
I kind of want to write back and say 'hey guys, I'm literally on a COLDFUSION SLACK CHANNEL and we're talking shit about your ancient, dumb ways, make of that what you will'
😆 2
thanks to James Moberg and https://dev.to/gamesover/convert-unicode-strings-to-ascii-with-coldfusion-junidecode-lhf - figures the solution would be a Java port of a damn PERL module
😀 1
m
we have had ok luck using java.text.Normalizer
s
that's the first step in James' solution above
m
public string function stripAccents(string input="") {
	var pattern = createObject("java","java.util.regex.Pattern").compile("\p{InCombiningDiacriticalMarks}+");
	var Normalizer = createObject("java","java.text.Normalizer");
	var decomposed = Normalizer.normalize(trim(input), createObject("java","java.text.Normalizer$Form").NFD);
	return trim(pattern.matcher(decomposed).replaceAll(""));
}
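One caveat with the Normalizer approach: it only helps with characters that decompose into a base letter plus combining marks. Letters like ø or ß have no canonical decomposition, so they pass through NFD untouched. A rough sketch of a wrapper that cleans up whatever is left, assuming the target is strict 7-bit ASCII; `toAscii7` and the `?` fallback are made-up for illustration, and the final step reuses the `[^\x00-\x7F]` pattern from earlier in the thread.

```cfml
<cfscript>
// Hypothetical wrapper: strip accents first, then replace anything still outside 7-bit ASCII.
string function toAscii7( string input = "", string fallback = "?" ) {
    var pattern    = createObject( "java", "java.util.regex.Pattern" ).compile( "\p{InCombiningDiacriticalMarks}+" );
    var Normalizer = createObject( "java", "java.text.Normalizer" );
    var NFD        = createObject( "java", "java.text.Normalizer$Form" ).NFD;

    // Step 1: decompose (á becomes a + combining acute) and drop the combining marks
    var decomposed = Normalizer.normalize( trim( arguments.input ), NFD );
    var stripped   = pattern.matcher( decomposed ).replaceAll( "" );

    // Step 2: replace each character that is still non-ASCII with the fallback
    return reReplace( stripped, "[^\x00-\x7F]", arguments.fallback, "all" );
}

// "Tomáš" followed by ø, built with chr() to avoid template-encoding surprises
writeOutput( toAscii7( "Tom" & chr( 225 ) & chr( 353 ) & chr( 248 ) ) );
</cfscript>
```

The accented letters lose their marks in step 1, while the ø only gets caught by step 2. If a readable replacement matters more than a placeholder, a lookup table or a transliteration library would have to cover those cases.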
j
The first question would be "which charsets do they accept?". The example Tomáš would fit in win-1250 I think (basically the East European default Latin charset).
s
ASCII.
j
Which ascii? 6 bit? 7 bit? 8 bit?
s
Looks like 7
they don't exactly want to talk about it at length. Can't imagine why
j
My guess: because they have several different systems that each use their own 8-bit charset, and then they decided that using just the ascii half of it would work because that was common between all the systems.
But in that case I don't have any suggestions that haven't been mentioned yet.