Embed arbitrary files into a PDF via "unoconv"

BlueWaterSailor · Dec 7, 2022

spy said:
I had text2pdf somewhere... it was the first thing I reached for, but couldn't find it when I needed it! Lol, that's how I stumbled on unoconv. It's more bloated I'm sure. But again:

Code:

> text2pdf
Command 'text2pdf' not found, did you mean:
  command 'text2odf' from deb libopenoffice-oodoc-perl
  command 'texi2pdf' from deb texinfo
Try: sudo apt install <deb name>

It's not part of any distro that I know of (licensing conflict, as I recall.) But then again, it's trivial to build:

Code:

wget http://www.eprg.org/pdfcorner/text2pdf/text2pdf.c -O text2pdf.c
wget http://www.eprg.org/pdfcorner/text2pdf/Makefile -O Makefile

# You can change the value of BINDIR in the Makefile to point to your own ~/bin
# instead of /usr/local/bin - that way, you won't need to use 'sudo' for 'make install'

make
make install

I'm actually a bit curious now to see how the PDFs generated would differ (between text2pdf and unoconv).

I would guess that the 'text2pdf' version would be smaller than the 'unoconv' - I actually wrote a trivial text-to-PDF creator for a one-off case years ago, and I recall it was mostly a matter of slapping on a header and a footer plus a little positioning for the lines. LibreOffice is likely to have all sorts of templating in there.

stochastix · Dec 7, 2022

spy said:
I came up with this hack in order to transfer Python code to Interactive Brokers tech support. Most support reps usually just tell you to reboot your computer. But sometimes, if you've found a legitimate problem in their code, they are willing to work closely with you; esp. if it's communicated clearly. It may take forever, and they may say it won't get fixed, but at least they're aware of the problem ;-)

As Python coders know, white-space indentation denotes code blocks. Coming from a C background there's a good reason this annoys me. With C, if your indentation gets mangled during a copy and paste (as is often the case) you can just re-indent the code with a "pretty print" program. After all, braces clearly denote code blocks in C. With Python though, if the same thing happens... you've introduced bugs into the code. There's no easy way to pretty print Python that I know of. This can be a PITA even with small code files.

Furthermore... IB restricts the types of files you can attach to a tech support ticket. Presumably the reason for this is security, which I can understand. Nonetheless, it means that neither a .txt nor .py file can be uploaded. They do allow image and PDF files however.

I don't agree with this policy 100%, since there are security vulnerabilities in image and PDF viewers too. But we still have to live with the restrictions. So when I wanted to send them a Python file (which is basically plain text) and be sure white-space bugs wouldn't be introduced via copy/paste... I discovered "unoconv", the universal office converter:

"unoconv" is a command line utility that can convert any file format that LibreOffice can import, to any file format that LibreOffice is capable of exporting.

With this we can easily and quickly create a PDF document from a simple text file.

We still can't just create a PDF of the Python source directly, because copying and pasting from the PDF may mangle the leading white-space :-( We want to be assured that copy/pasting won't ruin meaningful content, i.e. the leading white-space. And embedding the text in an image isn't a viable solution either, since the recipient would need OCR software to get it back out.

Ultimately the recipient must be able to open the allowed file type (PDF), know what to do with it at a glance, and be able to copy/paste the contents as usable code. Just a little more creativity should get us there.

To achieve this, we gzip and base64 encode the Python. This becomes our "payload". Then, we put this payload in a bash script where we can ensure leading white-space is non-existent or benign. The PDF can be created from that instead.

For example, let's say we have the following demo.py file:

Code:

#!/usr/bin/env python
import sys
print("You are using Python {}.{}.".\
      format(sys.version_info.major, sys.version_info.minor))
if not sys.version_info.major == 3 and sys.version_info.minor >= 6:
    print("Python 3.6 or higher is required.")
    sys.exit(1)

Obviously the leading white-space matters here.

To create a PDF file that we can attach to our support ticket, we first generate a payload based off demo.py:

Code:

cat demo.py | gzip | base64 > payload.sh

The payload.sh file now contains a bunch of text. It may look like gibberish, but it's just our demo.py in a form that travels more easily through assorted tunnels.
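Before editing payload.sh, it may be worth confirming that the encoding round-trips byte for byte. Here's a minimal sketch, assuming gzip, gunzip, base64 (with the -d flag) and cmp are on the PATH. The function name roundtrip_ok is made up for illustration and isn't part of the recipe itself.

Code:

```shell
# Hypothetical helper: encode a file the same way as above, decode it again,
# and compare the result with the original byte for byte.
roundtrip_ok() {
    local src="$1"
    # cmp -s is silent; its exit status is 0 only when both inputs match.
    gzip -c "$src" | base64 | base64 -d | gunzip -c | cmp -s - "$src"
}
```

Something like "roundtrip_ok demo.py && echo OK" should then report success before you go on to build the PDF.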

Now, we can edit the payload.sh to add some simple human readable instructions. Put this at the very top of payload.sh:

Code:

#!/usr/bin/env bash
cat > /dev/null <<NOTES
Hi! Copy and paste everything you see in this PDF into a .sh file.
Then, you can run it and our Python demo code will be printed out.
NOTES
(base64 -d | zcat) < /bin/cat << PAYLOAD

and place another line containing PAYLOAD at the very bottom to mark the end of the heredoc.

Your payload.sh file should ultimately look like this:

Code:

#!/usr/bin/env bash
cat > /dev/null <<NOTES
Hi! Copy and paste everything you see in this PDF into a .sh file.
Then, you can run it and our Python demo code will be printed out.
NOTES
(base64 -d | zcat) < /bin/cat << PAYLOAD
H4sIANvGj2MAA3WOwQqCUBBF9+8rbrZRiCchtAjsG9oGQRg+cwJnbN5TkujfU3FXDbOae89w1qu0
85peiVPHPdoh1MLGUNOKBvjBG9MqcYijk3Qo1KHzxDcc5yJebztuZM8G81SiTRHikbO9U0/CF+JK
bFPcRTf4vhOLJomhCizhRz5xyHNkKLj8w+OQY7efDRbXxS6zO4xxTbfaKchD3aMjdaWNkrk+/XNP
CvE2MR/PFp6BCgEAAA==
PAYLOAD

Finally, convert the payload.sh into a PDF with unoconv:

Code:

unoconv -o message.pdf payload.sh

View the message.pdf you created to make sure the contents resemble the payload.sh file.

When your recipient opens message.pdf they'll be able to read it and copy/paste the contents into a shell script. Then they can execute that script to print the demo Python code (or redirect it to a file). With a little luck everything will unravel nicely. Obviously you can put other types of files in the payload too.
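Copy/paste from a PDF viewer can sneak stray spaces, tabs, or carriage returns into the payload lines. One possible hardening, offered as an assumption rather than something the recipe above requires, is to strip those characters before decoding, since none of them belong to the base64 alphabet. The function name decode_payload is hypothetical.

Code:

```shell
# Hypothetical decoder: read a possibly paste-mangled base64 payload on stdin,
# drop spaces, tabs and carriage returns, then decode and decompress it.
# Newlines are left alone; base64 -d is assumed to skip them (GNU coreutils does).
decode_payload() {
    tr -d ' \t\r' | base64 -d | gunzip -c
}
```

In payload.sh that would amount to putting the same tr stage in front of "base64 -d | zcat".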

You may be thinking there isn't a need to wrap the payload with our shell code. After all, a PDF can be created from the payload directly. But most recipients won't know what to do with base64 encoded text alone; it's confusing gibberish without additional context.

Another alternative is to generate a PDF file from a shell archive. However, our tiny custom bash script has the benefit of providing some human readable content directly in-line (via heredoc or you could use a comment instead, but heredocs are a bit nicer IMHO). In this way the user can open the PDF and have some introduction to what's going on. This is a trade-off worth considering.

Finally... I know it's a somewhat complex process in order to merely transfer a file. Unfortunately, sometimes, working around questionable security restrictions requires this kind of rigmarole. Please don't shoot the messenger!

GL/HF with this fresh approach to a rather old endeavor... sending "a message in a bottle".

Sounds like a bunch of dumb fucks running microshaft windoze, what a massive waste

spy · Dec 7, 2022

Hint to the Mac user:
When talking to a Windows user, be sure to always adopt a condescending
tone. If you have trouble with this, ask a Unix user.

spy · Dec 7, 2022

BlueWaterSailor said:
It's not part of any distro that I know of (licensing conflict, as I recall.) But then again, it's trivial to build:

This is a shame and I'm guessing it has something to do with University of Nottingham policy. I can understand the university wanting to get ROI for their human capital. But, preventing the distribution of a 500 line program that's over 25 years old just sounds crazy. It seems long overdue they change the license to something reasonable IMHO. Of course, IDK... text2pdf is probably embedded in some commercial print control software and the university gets a couple cents every time a few 100k pages roll off a press somewhere. Academic researchers need to get paid somehow, right?

As far as actually switching to use text2pdf in my workflow (as opposed to unoconv), I would in a heartbeat... if it was in the system package repositories. However, ever since I left the embedded software development world, I tend to prioritize convenience over efficiency (within reason). The machines my stuff runs on lately are generally overkill for most tasks. Not sure if that's a good reflection on the hardware or a bad one on me

BlueWaterSailor · Dec 7, 2022

spy said:
Hint to the Mac user:
When talking to a Windows user, be sure to always adopt a condescending
tone. If you have trouble with this, ask a Unix user.

There's a hierarchy. When I was teaching for Sun, the old Solaris users - easily recognizable by their bushy beards, thick glasses, suspenders, and the pasty complexions from never going outside - used to sneer at us Linux guys because it was a "temporary fad." And there was only one OS that was bulletproof enough to be sent into outer space, so haha, junior!

Sic transit gloria mundi...

apdxyk · Dec 7, 2022

Linux is a result of a failure; the kid couldn't make it work, so he asked the bushy-beardy types for help, and they made it work overnight. Only after IBM 'shock' with their 2 billion announcement and real salaried dudes from SUN, Oracle, IBM and many other evil entities it can now claim anything close to usability. Linus himself is a re-packager and a prick. I remember his interviews about '...unlike Mr Gates. I will...' and 'SUN must die!'. As an abortion survivor Linux came to be quite a vigorous albeit ugly child. Mutants can sometimes be innovative, but this one is not. I hope you will enjoy this.

BlueWaterSailor · Dec 7, 2022

apdxyk said:
Linux is a result of a failure

spy · Dec 8, 2022

spy said:
I mention sharutils as an alternative towards the bottom of the post.

However, I just tried the analog of the trick by generating a .shar of demo.py (instead of using the tiny custom shell script I present) and guess what... it didn't work!

Code:

$  ./test.shar
./test.shar: 14: ./test.shar: name: not found
./test.shar: 17: ./test.shar: 266: not found
./test.shar: 59: ./test.shar: Syntax error: "then" unexpected (expecting "done")

The reason: copying and pasting from the PDF back into a local script mangled the .shar code.

The moral of this story seems to be a common one; K.I.S.S.

The outstanding question I still have is... am I using a lousy PDF viewer (evince)? Lol, maybe there are others that have more robust copy/paste functionality?

BlueWaterSailor · Dec 9, 2022

spy said:
However, I just tried the analog of the trick by generating a .shar of demo.py (instead of using the tiny custom shell script I present) and guess what... it didn't work!

OK, now you've got me curious.

Tested it out, and - yup, the round trip broke it (looks like the fonts in the PDF result in a "smart quote" which then breaks the shell scripting. That's why I don't like backquotes and prefer the $() syntax instead.)

But here's a version of your solution that does it all in one shot and lets you use whatever PDF converter you like:

Code:

#!/usr/bin/env bash

[ -z "$1" ] && { printf "Usage: ${0##*/} <input_file>\n"; exit 1; }

cat<<!
#!/usr/bin/env bash

<<%

Hi! Copy and paste everything you see in this PDF into a .sh file.
Then, you can run it and our Python demo code will be printed out.

%

base64 -d <<% |gunzip -
!
gzip -c "$1"|base64
echo %

Assuming you save this as 'pdfgen', run it like this:

Code:

./pdfgen demo.py|text2pdf > demo.pdf

P.S. I just noticed: your Python code has a bug in it.

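The precedence of 'not' vs. 'and' means that e.g., Python3.5 won't throw an error when tested: 'not' binds tighter than 'and', so the test reads as (not major == 3) and (minor >= 6). You need to put parens around the two conditionals, i.e. not (major == 3 and minor >= 6), so that 'not' applies to the combined test.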

spy · Dec 9, 2022

BlueWaterSailor said:
Tested it out, and - yup, the round trip broke it (looks like the fonts in the PDF result in a "smart quote" which then breaks the shell scripting. That's why I don't like backquotes and prefer the $() syntax instead.)

I'll still catch myself using backquotes sometimes but I like $() better too. Of course, all this has me curious, lol. I was seeing extraneous new lines; not smart quotes. So I tried a few more viewers; the simpler ones don't support copy/paste but I digress. What I noticed is that evince, okular, and atril all put in extraneous new lines. Why are you seeing smart quotes? Well...

I'm just guessing now, because I haven't actually tried text2pdf, but it's probably safe to say that our PDF creation tools tickle different issues with the shar. Could text2pdf be adding the smart quotes? I couldn't help myself

Yes... I just built text2pdf and checked; it's changing backticks to smart quotes. I'm going to take a wild guess and say this is a Unicode/UTF-8/encoding/code-page shortcoming somewhere (in text2pdf?). Which is interesting.
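For anyone who wants to check a pasted script for this kind of damage before running it, here's a rough sketch. It assumes the original script was plain 7-bit ASCII, so any byte above 0x7F (a UTF-8 smart quote, for instance) counts as a sign of mangling. The function name has_non_ascii is made up for illustration.

Code:

```shell
# Hypothetical check: succeed if the file holds any byte outside 7-bit ASCII,
# e.g. a UTF-8 smart quote that a PDF round trip substituted for a backquote.
has_non_ascii() {
    local count
    # Delete every ASCII byte, then count whatever is left over.
    count=$(( $(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c) ))
    [ "$count" -gt 0 ]
}
```

Something like "has_non_ascii pasted.sh && echo suspicious" would flag the smart-quote case, though not the extraneous new lines.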

Also, for whatever reason, okular actually inserted fewer new lines when copy/pasting; the errors produced were benign, and the process came full circle (again, using shar). So the most robust workflow appears to be okular with unoconv (as this can handle use of a shar). I still wonder which viewer you're using, but that's probably moot. Anyway...

BlueWaterSailor said:
But here's a version of your solution that does it all in one shot and lets you use whatever PDF converter you like:

In fairness, there was nothing ever keeping anyone from using text2pdf instead of unoconv. I could have chosen a better name for the thread though. And, in retrospect, should've left the color commentary regarding the IB bug out also; that's generated a little too much noise on the thread for my liking. I'm still a little pissed that bug isn't getting traction with them (noisy digression again, sorry).

On the very bright side, the meta-programming features you've added are great finishing touches

The little script may be worthy of people's ~/bin or /usr/local/bin now. Rewording the "our Python demo code will be printed" message into something regarding a generic payload being dumped to stdout is probably in order. Or maybe parametrized?
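As a rough idea of what a parametrized version might look like, here's a sketch written as a shell function. The optional second argument and the default wording are assumptions of mine, not something settled in the thread. The generated wrapper deliberately avoids quote characters and backquotes, and the note should stay free of $ and backslashes because the NOTES heredoc is unquoted.

Code:

```shell
# Hypothetical variant of the generator script: $1 is the file to embed and the
# optional $2 replaces the default note shown to the recipient.
pdfgen() {
    local input="$1"
    local note="${2:-Copy and paste everything in this PDF into a .sh file, then run it.
The embedded payload will be written to stdout.}"

    # Wrapper header: shebang, then the note parked in a no-op heredoc.
    printf '%s\n' '#!/usr/bin/env bash' '' ': <<NOTES' "$note" 'NOTES' ''
    printf '%s\n' 'base64 -d <<PAYLOAD | gunzip -c'

    # Payload: gzip, then base64, closed by the heredoc delimiter.
    gzip -c "$input" | base64
    printf '%s\n' 'PAYLOAD'
}
```

With the function defined, usage would look much like before, e.g. pdfgen demo.py "Run this to recover demo.py" | text2pdf > demo.pdf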

Might the tool go viral? That's a rhetorical question, lol.

BlueWaterSailor said:
P.S. I just noticed: your Python code has a bug in it. The precedence of 'not' vs. 'and' means that e.g., Python3.5 won't throw an error when tested. You need to put parens around the two conditionals to ensure that both are evaluated.

Good catch! But... I was focusing on outlining the process more than anything so don't blame me too much for this. I pilfered the sample Python code verbatim from the web. Admittedly I didn't look closely enough to notice the problem, so I'm happy to hand all the glory to you. I just want to be absolved of some of the blame ;-) Again, nice eye!

All in all this has turned out to be a great exercise so far... can it go further? It can... I noticed something I don't like about unoconv! Presumably, since it can take a number of file types as input (not just .txt), it couldn't figure out how to interpret the data fed in via pipe.

This is very strange since it has a --stdin option! Process substitution didn't help either.

Well, it's basically the same thing anyway. So, I ended up having to make a temporary file. Maybe this has been fixed in newer versions (but I think the tool's been deprecated).
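The temporary-file workaround can be wrapped up so the rest of the pipeline still looks like a pipe. This is only a sketch: via_tempfile is a made-up name, and it assumes the converter takes the input file as its last argument, the way "unoconv -o message.pdf payload.sh" does.

Code:

```shell
# Hypothetical wrapper: spool stdin to a temporary file, then run the given
# converter command with that file appended as its last argument.
via_tempfile() {
    local dir status
    dir=$(mktemp -d) || return 1
    cat > "$dir/payload.sh"      # capture everything arriving on stdin
    "$@" "$dir/payload.sh"       # e.g. unoconv -o message.pdf <tempfile>
    status=$?
    rm -rf "$dir"                # clean up whether or not the converter worked
    return "$status"
}
```

Usage would then be along the lines of: ./pdfgen demo.py | via_tempfile unoconv -o message.pdf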

Finally, it seems teamwork's paid off a lot on this one, captain. So far we've found problems with basically... everything, lol! Except maybe this latest version of the script you've put forth.