# cfml-beginners
How do I search extracted pdf text? I extracted a PDF like so:
<cfpdf action="extracttext" source="Mornings-with-Tozer.pdf" name="tozer" />
<cfcontent type="text/html" />
<cfoutput>#tozer#</cfoutput>
How do I search for, let's say, March 2 and then display that content?
What's in that variable?
It's the entire output of the pdf in text format
So it's a string. You are asking how to search in a string?
Or am I missing something?
Basically, but how do I find that particular substring, and then output the text after it? I'm trying to find a date in a PDF devotional I have, then display the text from that date.
Or searching a pdf magazine and printing articles from that particular date
Possibly something like the listLast function?
"then display the text from that date."
from that date... until... what? The end of the document?
What list?
Have you looked at the string functions in the docs? ie: have you... tried to find out how to do a find in a string in CFML? (hint: it's
) From there... you've found where your date is. How do you determine "the text after it"? That is very vague, and you can't base your next step on that. Unless you mean "until the end of the document", in which case that's what you should tell us.
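For anyone following along, a minimal sketch of the "find in a string" step, assuming the built-in find() function is the right tool and using a made-up sample string in place of the real PDF text:

```cfml
<cfscript>
// Stand-in for the extracted PDF text (hypothetical sample);
// in the real page this would be the variable populated by cfpdf.
tozer = "January 1 First reading. March 2 Second reading. March 3 Third reading.";

// find() returns the 1-based position of the first match, or 0 if there is none.
startPos = find("March 2", tozer);

if (startPos > 0) {
    writeOutput("Found at position " & startPos);
} else {
    writeOutput("Not found");
}
</cfscript>
```

find() is case-sensitive; findNoCase() takes the same arguments and ignores case. Note that a plain search for "March 2" would also match the start of "March 20".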
I would guess you want to find the text from that first date up until... what? The next... date? Until a paragraph break? 100 characters?
Or perhaps back up. Forget about the "how". What exactly is the real-world problem you are trying to solve? Use sufficient words so we - who don't know what you are trying to do, the contents of your PDF, etc - can understand.
"I need to pull all the text from the section of the PDF from a given date. The sections are delimited by page breaks, so just to the bottom of the page"
well the PDF isn't really delimited. At first I was asking generically, then as time progressed my needs changed. I guess I need stuff until the page break.
I may need to use a regEx. I did bad at those way back in my college days (80's!)
Sounds to me like you need to find a starting point (a date string?), and - starting from there - find a finishing point (the next page break), and then get the part of the string between those two points. If it's as literal as you say, then no need for regexes. Is it a specific date (ie: you have the exact string you need to find)? Or is it a string that follows a pattern like a date (eg: anything like yyyy-mm-dd)?
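The three steps above can be sketched roughly like this. It assumes the page break shows up in the extracted text as a form feed character, chr(12); that is a guess that needs checking against the real output, and the sample string is invented for illustration:

```cfml
<cfscript>
// Assumption: a form feed (chr(12)) marks a page break in the extracted text.
// Inspect the real extracted string to confirm what the marker actually is.
pageBreak = chr(12);

// Hypothetical sample standing in for the extracted PDF text.
tozer = "March 1 First reading." & pageBreak
      & "March 2 Second reading." & pageBreak
      & "March 3 Third reading.";

section = "";

// Step 1: find the starting point.
startPos = find("March 2", tozer);

if (startPos > 0) {
    // Step 2: find the finishing point, searching only from the start position onward.
    endPos = find(pageBreak, tozer, startPos);

    // No later page break: take everything up to the end of the string.
    if (endPos == 0) {
        endPos = len(tozer) + 1;
    }

    // Step 3: mid(string, start, count) extracts the text between the two points.
    section = mid(tozer, startPos, endPos - startPos);
}

writeOutput(encodeForHTML(section));
</cfscript>
```

The third argument to find() is what keeps the end marker search from matching a page break that comes before the date.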
More like MM DD
Right. So you need to find any pattern of characters that is two digits, a space, then another two digits? And that's your starting point?
I don't really want to tease every discrete detail out of you. I'm trying to get you to think more analytically about what you actually need to do.
You probably need to go and read through the docs and understand the string functions CFML has, and get a bit better handle on string manipulation first.
Actually, the dates are listed as March 13 for example. I wanted to brush up on string manipulation, and see where it got me? ;)
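If the headings really are of the form "March 13", one possible approach is a small regex after all. This is only a sketch under assumptions: the sample text is invented, each entry is assumed to run until the next "Month day" heading, and the variable names are hypothetical:

```cfml
<cfscript>
// Hypothetical sample standing in for the extracted PDF text.
tozer = "March 2 Second reading. March 20 Twentieth reading. March 21 Next reading.";

// \b is a word boundary, so "March 2" does not match the start of "March 20".
// \s+ allows any run of whitespace between the month and the day.
startPos = reFind("\bMarch\s+2\b", tozer);

// Pattern for any "Month D" or "Month DD" heading, used to find where the next entry begins.
months = "January|February|March|April|May|June|July|August|September|October|November|December";
anyDate = "\b(" & months & ")\s+\d{1,2}\b";

entry = "";
if (startPos > 0) {
    // Start one character past the current heading so it is not matched again.
    nextPos = reFind(anyDate, tozer, startPos + 1);

    // No later heading: take everything up to the end of the string.
    if (nextPos == 0) {
        nextPos = len(tozer) + 1;
    }

    entry = trim(mid(tozer, startPos, nextPos - startPos));
}

writeOutput(encodeForHTML(entry));
</cfscript>
```

reFind() is case-sensitive; reFindNoCase() is the case-insensitive equivalent. One caveat with this approach: if the body of an entry itself mentions a date such as "May 5", the entry would be cut short at that point, so a page break marker may still be the more reliable end point.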