# cfml-general
n
Hi All. Kind of a strange issue here. We're using the "encodeForHTML" function in ACF 2016 and finding that with upper case international characters (like "A" with an accent), it seems to change them to lower case (while keeping the accent). Everything else seems fine but we can't figure out why it is changing the special characters to lower case. Anyone else run across this?
t
https://trycf.com/gist/94fd8c298baf64423e956eb88be95c61/acf2021?theme=monokai Seems like it's working right to me... Are you doing something to the code afterward to make it lower case?
n
@Tim that's what we're trying to figure out. it looks correct in the db, but the characters on the email screen are "a" (with accent), not "á".
so, we don't see the &aacute code at all on screen (although we do in the db)
so, it seems like we are doing something wrong in processing
t
right, you wouldn't, because when you output the `&aacute;` as HTML, it becomes `á`. But somewhere, it seems that what should be `Á` is becoming `á` and you need to try and track that down. But I don't think that place is the `encodeForHTML` function itself.
d
As a side note, you generally shouldn't html encode stuff before inserting it into the database. You should store it in its original format, and then html encode it when you output it to the user's screen.
👍 1
a
á doesn't need to be encoded in the first place, so the encoding part of this is a red herring.
n
@David Buck thanks for the tip
a
You mention this is in an email. Are you setting the encoding of the email correctly? E.g. telling the mail server its content is UTF-8
n
@Adam Cameron I was wondering that. Basically we just decodeforhtml the whole output from the db to the html email message when sending. So, this isn't just for intl characters - that just happens to be where we see a problem (uppercase intl characters).
a
Are you saying that á is actually being turned into an HTML entity somewhere in this process?
n
@Adam Cameron Yes we are using utf8 in the email declaration / meta tags. I initially thought that might be the issue, but maybe not
a
Meta tags just tell the user agent what to use. You will need to also actually use that encoding in the mail message too. Otherwise the mail server might be sending it using some other encoding scheme
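To make the point about the message itself needing the right encoding concrete, here is a minimal sketch in plain Java (JDK only). The class and method names are made up for illustration, and it does not model any particular mail server. It only shows what happens when `Á` is written as UTF-8 bytes and then read back with a different charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library mentioned in this thread.
final class CharsetMismatchDemo {

    // Encode text with one charset and decode the resulting bytes with another,
    // mimicking a sender and a receiver that disagree about the encoding.
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        byte[] bytes = text.getBytes(writeAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        String original = "\u00C1"; // capital A with acute accent

        // Same charset on both sides: the character survives.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // Mismatched charsets: the two UTF-8 bytes are read as two separate characters.
        String garbled = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());
    }
}
```

A mismatch like this usually produces garbled characters rather than a clean lower-case `á`, which is one reason to look elsewhere for the cause. In CFML, the corresponding setting is the `charset` attribute on `cfmail`.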
n
I think that's what I am saying. When we edit and submit to the db, we see &Aacute;. But when it gets received as an email, or even displayed as a preview (with decodeforHTML function), it shows as just <p>a</p>. When we remove the decodeforHTML function, it looks right - no problems.
a
Be mindful that there are two different encoding concepts here. Data encoding (which charset the data should be interpreted as), and security encoding (like `encodeForHtml` etc)
n
ok. my encoding knowledge is weak
although i'm not the main coder in this case
a
Look at the underlying email message (like in GMail "show original"), rather than how the mail client renders it.
n
yes, been doing that. it can be pretty messy - the email clients seem to add a lot of crazy characters, but we can get a sense of what it is displaying
at the top of the email, we include this line: <meta http-equiv='Content-Type' content='text/html; charset=UTF-8'>
but, as we take the content from the db and display it in the email body, we apply decodeforhtml - that seems to be where the problem happens. when we remove that function, it seems to work fine.
so, we may be doing something else wrong and it's just emerging in this situation.
@David Buck I guess I am wondering if there is guidance about when to use encodeforhtml and decodeforhtml. we've used these over the years, but perhaps not as systematically as needed. when I look at the cf docs, they don't usually get into that kind of guidance. I'm googling as well.
a
If you actually have the sequence of characters `&Aacute;` and it's rendering in the email client as `a` (not `á` or `Á`) then this is not an encoding issue. It's just the mail client not doing a good job of what it's been told.
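For reference, HTML named entities are case-sensitive: `&Aacute;` is `Á` and `&aacute;` is `á`. The sketch below, in plain Java, shows how a decoder that lower-cases entity names before looking them up would produce the symptom described. Everything in it is hypothetical (the `decode` method, the two-entry table), and nothing in this thread confirms that this is what happens inside `decodeForHTML`.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; a real decoder knows hundreds of entities.
final class EntityCaseDemo {

    // Tiny illustrative entity table.
    static final Map<String, String> ENTITIES = Map.of(
        "Aacute", "\u00C1", // upper-case A with acute
        "aacute", "\u00E1"  // lower-case a with acute
    );

    static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z]+);");

    // Replace known named entities. Passing lowerCaseNames = true simulates
    // a lookup that ignores the case of the entity name.
    static String decode(String html, boolean lowerCaseNames) {
        Matcher m = NAMED_ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = lowerCaseNames ? m.group(1).toLowerCase(Locale.ROOT) : m.group(1);
            // Unknown entities are left exactly as they were.
            String replacement = ENTITIES.getOrDefault(name, m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("<p>&Aacute;</p>", false)); // case-sensitive lookup
        System.out.println(decode("<p>&Aacute;</p>", true));  // case-insensitive lookup
    }
}
```

Running the same input through both modes is a quick way to tell whether the step under suspicion preserves the case of entity names.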
d
I have never had a need to use `decodeForHTML()`. You might not either.
a
Sidebar: why are you messing around with things like `&Aacute;` when you could just use `Á`?
d
The guidance is simple: when sending data to your database, secure it with queryparam. When sending data to a user, secure it with one of the "encodeFor" functions.
There was a pretty good thread about it on this channel a few days ago. Might still be there if you scroll up.
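A minimal sketch of "store as-is, encode on output", in plain Java. `escapeHtml` here is a hypothetical stand-in that only escapes the five markup-significant characters, which is narrower than what `encodeForHTML` does. Real code should use a vetted encoder, and the insert itself would go through a parameterised query (`cfqueryparam` in CFML).

```java
// Hypothetical demo class; not the ColdFusion or ESAPI implementation.
final class EncodeOnOutputDemo {

    // Escape only the characters that are significant in HTML markup.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The value is kept in the database exactly as the user typed it.
        String stored = "\u00C1 <script>alert(1)</script>";

        // It is encoded only at the moment it is written into an HTML page.
        System.out.println("<p>" + escapeHtml(stored) + "</p>");
    }
}
```

Because the stored value stays in its original form, it can still be searched, or encoded differently for a non-HTML destination.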
n
@Adam Cameron great question. our client is using the web editor to create email messages for his contacts (sent from our platform). The ckeditor 4 web editor that we use is inserting special characters as &Aacute;
So, perhaps that might be a change that we could make (in how the editor is rendering these in the first place)
@David Buck Thanks! I missed that.
@David Buck @Adam Cameron Our take on the encode, was that we have used encode to prevent xss from getting into the database and if it did, render it inert. Sounds like that is different from your thinking (above) about just using cfqueryparam for insert.
a
Yeah you don't know what encoding to use when storing it. You only know that when coming to use it. So the general guidance these days is store it as-is, then use the appropriate encoding strategy for when it's being used. But encoding it on entry into the DB is better than not encoding it at all ever 😉
d
Your database isn't vulnerable to XSS, so you may as well store the data in its original form instead of corrupting it with html encoding. That will make things easier if you ever need to search it, or view it in anything besides a web browser.
a
@nickg are you seeing this incorrect rendering of `&Aacute;` in all email clients, or just a specific one?
I wondered if it was a new HTML5 one, and some email clients might not support the newer entities, but that one has been around since HTML2.
n
@Adam Cameron all clients tested so far including gmail, outlook.com, and smartermail, so i think it's on our end
d
So, if I understand correctly, ckeditor is html encoding the messages that you're storing in the database (sorry, I assumed you were doing that via `encodeForHTML`). If so, that may suggest a good use case for `decodeForHTML`. I would probably decode those messages before inserting them into the database. Then use `encodeForHTML` when outputting them to a web page or emailing them.
m
When you collect data using a wysiwyg editor, such as ckeditor or tinymce, it will regularly encode some things, which may or may not be desirable. The intent of those editors is to provide you with raw html that can be dropped into a page as is: you wouldn't use encodeForXxx() when dropping it into the page, and you would store it in the db as is. When placing it back into the textarea or into a div to load into the wysiwyg, you would use encodeForHTML(). Depending on your target output, the escaping the wysiwyg applies can make it pretty painful, since it is creating content to output as HTML. This content should be run through either antisamy, java htmlsanitizer or jsoup with a ruleset that matches the allowed elements. Unfortunately, those also sometimes apply encoding to characters (like " becomes &#34; or &quot; in java htmlsanitizer and antisamy respectively). If you have cleaned up your markup through one of the sanitizers, then you could probably canonicalize() the result to get the characters to go back to normal. If your content contains encoded scripting that is counter to the output environment though, it could go from safe to risky. I honestly have never used, or seen, decodeForHtml() used.
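The difference between dropping editor output into a page as is and encoding it for a textarea can be sketched in plain Java as follows. The helper names are hypothetical and the escaper is a simplified stand-in for `encodeForHTML()`. The sketch also assumes the stored markup has already been through a sanitizer.

```java
// Hypothetical demo class illustrating where editor-produced HTML gets encoded.
final class WysiwygEncodingDemo {

    // Simplified escaper; '&' is replaced first so new entities are not re-escaped.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Loading stored markup back into the editor: encode it so the browser
    // treats it as text inside the textarea.
    static String textareaFor(String storedHtml) {
        return "<textarea name=\"body\">" + escapeHtml(storedHtml) + "</textarea>";
    }

    // Displaying already-sanitized markup: it is emitted as is, with no extra encoding.
    static String pageFragmentFor(String sanitizedHtml) {
        return "<div class=\"preview\">" + sanitizedHtml + "</div>";
    }

    public static void main(String[] args) {
        String stored = "<p>&Aacute;</p>";
        System.out.println(textareaFor(stored));
        System.out.println(pageFragmentFor(stored));
    }
}
```

Encoding the stored markup a second time when displaying it would show the tags and `&Aacute;` literally on the page, which is why the two paths are kept separate.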
a
I did not even know `decodeForHtml` existed until now.
d
Me neither. When I first saw nickg mention it, I thought he was confused about the name of encodeForHTML().
It seems what we have here is a legitimate need to store html in the database, and output it later, as html. That's not something I've ever done, or know how to do properly, so best ignore everything I've said up to this point.
a
I think @Matt Jones’s advice is pretty solid, with the "These should be running through either antisamy, java htmlsanitizer or jsoup [...]"
👍 1
n
@David Buck Haha. Well I was treating your words as gospel until you walked it all back. We have a CMS / Email tool here, so user defined html is being sent to the db, but in this case, that part seems fine - the &Aacute; etc. look fine in the db. It's when we bring it out for display that we see problems. But, we're going to do some experiments based on feedback here.
@Matt Jones @Adam Cameron when it comes to antisamy, is the CF function GetSafeHTML a reasonable way to do it? Link here: https://helpx.adobe.com/coldfusion/cfml-reference/coldfusion-functions/functions-e-g/getsafehtml.html another implementation that I have seen is Pete Freitag's, here: https://www.petefreitag.com/item/760.cfm
m
@nickg yes, getSafeHTML() would be a reasonable way. I use lucee, which doesn't have that available. While Pete's article is old (it predated getSafeHTML() being available), it is essentially what we do today, just with newer jars than the article. The other two may seem a little more daunting because they aren't built into CF, but they are all reasonably easy to use once you set them up. They all have pros and cons.
AntiSamy is an owasp project and covers the widest range of cleaning use cases, including getting into css and such, but is the slowest (in most cases, not slow enough to be noticeable), and in my opinion it is the hardest to build rules for (mostly because the policy file has so many possibilities, but once you get into it, it isn't hard to build). This is what drives GetSafeHTML(). It can also give you well formed xhtml out, and you can use that to extract data pretty well, if you care about or need that.
Java HTML Sanitizer is another owasp project. You construct a policy by allowing elements, attributes and such, then sanitize the untrusted string against the policy. It is pretty easy to look at a policy and figure out what it is doing, but if the policy is very complicated, you may find use cases that this one doesn't cover. Its only use is really for sanitizing markup against a policy.
JSoup has a few more use cases, all dealing with html in general. Beyond the cleaning, you can use it for extraction and manipulation, and it is also the best of the 3 for stripping everything down to just text (the others tend to end up with words running together because of tags). There are a few built in policies, or you can construct your own in a similar manner to html sanitizer.
https://owasp.org/www-project-antisamy/ https://owasp.org/www-project-java-html-sanitizer/ https://jsoup.org/cookbook/cleaning-html/safelist-sanitizer
For getting used to rules with AntiSamy, I would start with their prebuilt tinymce config, strip stuff you don't want, and if it is missing anything you aren't sure how to add, go to the anything goes config to find how to add it. For java html sanitizer, you start with an empty policy (which allows no html markup) and add what you want to identify as safe. Adding rules looks like this:
.allowElements(["a","br","p","i","b","em","strong"])
.allowAttributes(["title"]).globally()
.allowStandardUrlProtocols()
.allowAttributes(["href"]).onElements(["a"]).requireRelNofollowOnLinks()
For getting started with rules on jsoup, you start with one of their basic rulesets then add to it. none() is essentially the same as the java html sanitizer with nothing allowed. https://jsoup.org/apidocs/org/jsoup/safety/Safelist.html Adding rules looks like this:
.addTags(["br","p","i","b","em","strong"])
.addAttributes(":all", ["title"])
.addAttributes("a", ["href"])
.addEnforcedAttribute("a", "rel", "nofollow")
n
@Matt Jones Huge thanks for the info. We have built something that does some of these things but should probably upgrade to one of these. One thing that we never quite got around to perfecting is when you need to whitelist a tag (eg. iframe) periodically. Guessing these approaches have ways to deal with those kinds of situations.
m
Yes, all of them should be able to. Modifying rules based on conditions is easier with html sanitizer and jsoup, since you build the rules in code rather than in a stored xml file. I do include iframe in some of my html sanitizer policies, where I would include iframe in the allowElements(), and something like this for whatever attributes you need to allow on it:
.allowAttributes(["allowfullscreen","allowtransparency","frameborder","height","scrolling","src","width"]).onElements(["iframe"])