FilegetMimeType() seems like it has history with o...
# lucee
b
FilegetMimeType() seems like it has history with office docs - https://luceeserver.atlassian.net/browse/LDEV-1549 - the issue I'm experiencing is similar, but not... A recent security scan pointed out that we are accepting files with incorrect extensions, like a .pdf with a .png extension. I found FilegetMimeType() and 'fixed' the issue, but 'proper' testing has found that for office documents labelled as something else, FilegetMimeType() returns the mime type of the file extension rather than of the file itself. So a .docx renamed to .pdf comes back as "application/pdf", and renamed to .jpg as "image/jpeg". I've tried .docx, .doc, .pptx, .ppt, .xls and .xlsx and all exhibit the same issue. Does anybody have any ideas? Or is it simply a bug? Thanks!
z
did you try the strict option?
b
Yes... Once I found it in the docs, but the result was the same.
Also double checked I was passing the full path and not just the name of the file.
z
underneath, it's using Apache Tika (https://tika.apache.org/) 1.28.3
b
I also tried opening the file and using the file object but again the result was the same.
z
lucee only bundles the tika core, not all the parser modules due to the extra size (1.5mb) https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/1.28.3
b
What does that mean? It doesn't support Office files? And for files that it doesn't support it returns the mime type of the file extension?
Or is there a way for me to load those parser modules myself?
z
you could try dropping that jar in the lib folder, it's an osgi bundle
b
Umm... Which lib folder? 🤔
[lucee-home]\tomcat\webapps\ROOT\WEB-INF\lucee\lib seems to have done the trick according to the Bundle report... Now to test the parsing..
Sadly... No change in the output from my tests...
@zackster - So? Any further suggestions on how/if this can be made to work? I have no choice other than to find a way as 'security' has a rather high profile now... If I have to dig into the Java myself then so be it...
z
I've asked @micha if he has any ideas
hang on, does the bundle show up as active in the admin?
b
Yes - it shows up as active in the admin. I'll try the alternative install...
z
probably need to roll your own solution here, it's not just the 1.5mb, there are a shit load of Compile Dependencies (additional jars) for the parsers, then you need to use a slightly different approach to call tika
b
Then it would be really useful if that were stated in the documentation... 😉
z
feel like getting your hands dirty?
b
Happy to help if I can... Sure...
And my tests don't show any different results after manually loading the tika jar
In the list of bundles it shows up as 'loaded' rather than 'active' - is that significant?
z
as far as i know, that just means felix knows it's there, but it's still missing all the dependencies
b
ok... the other thing I can do is find a way to just get the 'magic number' for the file using Java... I was researching that when I found the FilegetMimeType() method - but the search was proving difficult - maybe you know where I could find such a method instead?
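For reference, a minimal Java sketch of the 'magic number' idea, assuming it's enough to look at the first few bytes of the file. The class name `MagicSniffer` and the labels it returns are hypothetical (not part of Lucee or Tika), and the signature list is deliberately short:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: classifies a file by its leading bytes only.
class MagicSniffer {

    // Well-known leading signatures for a handful of formats.
    private static final int[] PDF  = {0x25, 0x50, 0x44, 0x46};                         // "%PDF"
    private static final int[] PNG  = {0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
    private static final int[] JPEG = {0xFF, 0xD8, 0xFF};
    private static final int[] ZIP  = {0x50, 0x4B, 0x03, 0x04};                         // "PK" 03 04
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // True if the first len bytes of head begin with the given signature.
    static boolean startsWith(byte[] head, int len, int[] sig) {
        if (len < sig.length) return false;
        for (int i = 0; i < sig.length; i++) {
            if ((head[i] & 0xFF) != sig[i]) return false;
        }
        return true;
    }

    // Returns a coarse label for the container format, or "unknown".
    static String sniff(byte[] head, int len) {
        if (startsWith(head, len, PDF))  return "pdf";
        if (startsWith(head, len, PNG))  return "png";
        if (startsWith(head, len, JPEG)) return "jpeg";
        if (startsWith(head, len, ZIP))  return "zip";   // .docx/.xlsx/.pptx, or any other zip
        if (startsWith(head, len, OLE2)) return "ole2";  // .doc/.xls/.ppt, or other compound files
        return "unknown";
    }

    // Reads at most 8 bytes from the file and classifies them (needs Java 9+ for readNBytes).
    static String sniff(Path file) throws IOException {
        byte[] head = new byte[8];
        try (InputStream in = Files.newInputStream(file)) {
            int len = in.readNBytes(head, 0, head.length);
            return sniff(head, len);
        }
    }
}
```

The file extension is never consulted, so a PDF renamed to .png still comes back as "pdf". The catch is that "zip" and "ole2" only identify the container: they narrow an Office document down but don't say which kind it is.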
z
it actually works a lot of the time, I'm just debugging some extra checking I added in at work for some FileGetMimeType checks, it's picking up quite a few files
b
If I can work out a way to combine the results of Lucee and lucee-tika I might be able to come up with a complete solution - but it would be ugly and, I expect, unreliable. Lucee seems to work 100% with files that have the correct extension - but when (say) a .docx file is given a .pdf extension, Lucee gets it wrong, consistently, for every wrong extension. lucee-tika gets it mostly right, however. What lucee-tika gets wrong is older Office documents (.doc, .xls, .ppt etc.), which all get labelled with MimeType: 'application/x-tika-msoffice' - kind of ok, but it probably wouldn't pass a stringent security test. Also, lucee-tika identifies .pptx files as MimeType: 'application/octet-stream', which would fail a MimeType-to-file-extension lookup.
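One possible way to narrow down the .pptx / 'application/octet-stream' case without Tika's parser modules, sketched in Java: OOXML files are ZIP packages with a `[Content_Types].xml` entry at the root, so the entry names can hint at the type. This assumes the conventional `word/`, `xl/` and `ppt/` folder names that Office writes; `OoxmlSniffer` and its labels are hypothetical names for illustration only:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: guesses the OOXML flavour from ZIP entry names.
class OoxmlSniffer {

    // Maps a conventional top-level folder to a document kind, or null if no match.
    static String kindFromEntryName(String name) {
        if (name.startsWith("word/")) return "docx";
        if (name.startsWith("xl/"))   return "xlsx";
        if (name.startsWith("ppt/"))  return "pptx";
        return null;
    }

    // Throws an IOException (ZipException) if the file is not a readable ZIP archive.
    static String sniff(Path file) throws IOException {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            // Every OOXML package carries this entry; without it, treat as a plain zip.
            if (zip.getEntry("[Content_Types].xml") == null) return "zip";
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String kind = kindFromEntryName(entries.nextElement().getName());
                if (kind != null) return kind;
            }
            return "ooxml-unknown";
        }
    }
}
```

This only helps for the zip-based formats; the older .doc/.xls/.ppt files are OLE2 compound documents and would need a different check.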
z
did you try with lucee-tika with the latest bundle? https://mvnrepository.com/artifact/org.apache.tika/tika-app/2.6.0
b
no - 2.4.1 from cfsimplicity
Switched to 2.6.0 and retested but the results are the same.